Re: Need help get the hadoop cluster started in EC2

2014-03-28 Thread Yusaku Sako
Hi Max,

Not sure if you have already, but you might also want to look into
Apache Ambari [1] for provisioning, managing, and monitoring Hadoop
clusters.
Many have successfully deployed Hadoop clusters on EC2 using Ambari.

[1] http://ambari.apache.org/

Yusaku

On Fri, Mar 28, 2014 at 7:07 PM, Max Zhao  wrote:
> Hi Everybody,
>
> I am trying to get my first hadoop cluster started using the Amazon EC2. I
> tried quite a few times and searched the web for the solutions, yet I still
> cannot get it up. I hope somebody can help out here.
>
> Here is what I did based on the Apache Whirr Quick Guide
> (http://whirr.apache.org/docs/0.8.1/quick-start-guide.html):
>
> 1) I downloaded a Whirr tar ball and installed it.
> bin/whirr version shows the following:  Apache Whirr
> 0.8.2  &  jclouds 1.5.8
> 2) I created the ./whirr directory and edited the credential file with my
> Amazon "PROVIDER", "IDENTITY" and "CREDENTIAL"
>"IDENTITY=AAS", with no extra quotes or curly
> quotes around the actual key_id
> 3) I used the following command to create the key pair for Whirr and stored
> it in the .ssh folder
>ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr
> 4) I think I am ready to use one of the properties files provided with Whirr
> in the recipes folder. Here is the command I ran:
> bin/whirr launch-cluster --config
> recipes/hadoop-yarn-ec2.properties --private-key-file ~/.ssh/id_rsa_whi
> The command ran into an error and did not bring up Hadoop.  My question
> is: Do we need to change anything in the default properties provided in the
> recipes folder in the "whirr-0.8.2" folder, such as the
> "hadoop-yarn-ec2.properties" I used?
>
> Here are the error messages:
>
> ---
> [ec2-user@ip-172-31-20-120 whirr-0.8.2]$ bin/whirr launch-cluster --config
> recipes/hadoop-yarn-ec2.properties  --private-key-file ~/.ssh/id_rsa_whirr
> Running on provider aws-ec2 using identity AKIAJLFVRARQ3IZE3KGF
> Unable to start the cluster. Terminating all nodes.
> com.google.common.util.concurrent.UncheckedExecutionException:
> com.google.inject.CreationException: Guice creation errors:
> 1) org.jclouds.rest.RestContext cannot
> be used as a key; It is not fully specified.
> 1 error
> at
> com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2258)
> at com.google.common.cache.LocalCache.get(LocalCache.java:3990)
> at
> com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3994)
> at
> com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4878)
> at
> com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4884)
> at org.apache.whirr.service.ComputeCache.apply(ComputeCache.java:88)
> at org.apache.whirr.service.ComputeCache.apply(ComputeCache.java:80)
> at
> org.apache.whirr.actions.ScriptBasedClusterAction.execute(ScriptBasedClusterAction.java:110)
> at
> org.apache.whirr.ClusterController.bootstrapCluster(ClusterController.java:137)
> at
> org.apache.whirr.ClusterController.launchCluster(ClusterController.java:113)
> at
> org.apache.whirr.cli.command.LaunchClusterCommand.run(LaunchClusterCommand.java:69)
> at
> org.apache.whirr.cli.command.LaunchClusterCommand.run(LaunchClusterCommand.java:59)
> at org.apache.whirr.cli.Main.run(Main.java:69)
> at org.apache.whirr.cli.Main.main(Main.java:102)
> Caused by: com.google.inject.CreationException: Guice creation errors:
> 1) org.jclouds.rest.RestContext cannot
> be used as a key; It is not fully specified.
> 1 error
> at
> com.google.inject.internal.Errors.throwCreationExceptionIfErrorsExist(Errors.java:435)
> at
> com.google.inject.internal.InternalInjectorCreator.initializeStatically(InternalInjectorCreator.java:154)
> at
> com.google.inject.internal.InternalInjectorCreator.build(InternalInjectorCreator.java:106)
> at com.google.inject.Guice.createInjector(Guice.java:95)
> at org.jclouds.ContextBuilder.buildInjector(ContextBuilder.java:401)
> at org.jclouds.ContextBuilder.buildInjector(ContextBuilder.java:325)
> at org.jclouds.ContextBuilder.buildView(ContextBuilder.java:600)
> at org.jclouds.ContextBuilder.buildView(ContextBuilder.java:580)
> at
> org.apache.whirr.service.ComputeCache$1.load(ComputeCache.java:119)
> at
> org.apache.whirr.service.ComputeCache$1.load(ComputeCache.java:98)
> at
> com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3589)
> at
> com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2374)
> at
> com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2337)
> at
> com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2252)
> ... 13 more
Unable to load cluster state, assuming it has no running nodes.

Need help get the hadoop cluster started in EC2

2014-03-28 Thread Max Zhao
Hi Everybody,

I am trying to get my first Hadoop cluster started on Amazon EC2. I have
tried quite a few times and searched the web for solutions, yet I still
cannot get it up. I hope somebody can help out here.

Here is what I did based on the Apache Whirr Quick Guide (
http://whirr.apache.org/docs/0.8.1/quick-start-guide.html):

1) I downloaded a Whirr tarball and installed it.
bin/whirr version shows the following:  Apache Whirr
0.8.2  &  jclouds 1.5.8
2) I created the ./whirr directory and edited the credential file with my
Amazon "PROVIDER", "IDENTITY" and "CREDENTIAL" (see the sketch after this list)
   "IDENTITY=AAS", with no extra quotes or curly
quotes around the actual key_id
3) I used the following command to create the key pair for Whirr and stored
it in the .ssh folder
   ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr
4) I think I am ready to use one of the properties files provided with Whirr
in the recipes folder. Here is the command I ran:
bin/whirr launch-cluster --config
recipes/hadoop-yarn-ec2.properties --private-key-file ~/.ssh/id_rsa_whi
The command ran into an error and did not bring up Hadoop.  My question
is: Do we need to change anything in the default properties provided in the
recipes folder in the "whirr-0.8.2" folder, such as the
"hadoop-yarn-ec2.properties" I used?

Here are the error messages:

---
[ec2-user@ip-172-31-20-120 whirr-0.8.2]$ bin/whirr launch-cluster --config
recipes/hadoop-yarn-ec2.properties  --private-key-file ~/.ssh/id_rsa_whirr
Running on provider aws-ec2 using identity AKIAJLFVRARQ3IZE3KGF
Unable to start the cluster. Terminating all nodes.
com.google.common.util.concurrent.UncheckedExecutionException:
com.google.inject.CreationException: Guice creation errors:
1) org.jclouds.rest.RestContext cannot
be used as a key; It is not fully specified.
1 error
at
com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2258)
at com.google.common.cache.LocalCache.get(LocalCache.java:3990)
at
com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3994)
at
com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4878)
at
com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4884)
at org.apache.whirr.service.ComputeCache.apply(ComputeCache.java:88)
at org.apache.whirr.service.ComputeCache.apply(ComputeCache.java:80)
at
org.apache.whirr.actions.ScriptBasedClusterAction.execute(ScriptBasedClusterAction.java:110)
at
org.apache.whirr.ClusterController.bootstrapCluster(ClusterController.java:137)
at
org.apache.whirr.ClusterController.launchCluster(ClusterController.java:113)
at
org.apache.whirr.cli.command.LaunchClusterCommand.run(LaunchClusterCommand.java:69)
at
org.apache.whirr.cli.command.LaunchClusterCommand.run(LaunchClusterCommand.java:59)
at org.apache.whirr.cli.Main.run(Main.java:69)
at org.apache.whirr.cli.Main.main(Main.java:102)
Caused by: com.google.inject.CreationException: Guice creation errors:
1) org.jclouds.rest.RestContext cannot
be used as a key; It is not fully specified.
1 error
at
com.google.inject.internal.Errors.throwCreationExceptionIfErrorsExist(Errors.java:435)
at
com.google.inject.internal.InternalInjectorCreator.initializeStatically(InternalInjectorCreator.java:154)
at
com.google.inject.internal.InternalInjectorCreator.build(InternalInjectorCreator.java:106)
at com.google.inject.Guice.createInjector(Guice.java:95)
at org.jclouds.ContextBuilder.buildInjector(ContextBuilder.java:401)
at org.jclouds.ContextBuilder.buildInjector(ContextBuilder.java:325)
at org.jclouds.ContextBuilder.buildView(ContextBuilder.java:600)
at org.jclouds.ContextBuilder.buildView(ContextBuilder.java:580)
at
org.apache.whirr.service.ComputeCache$1.load(ComputeCache.java:119)
at
org.apache.whirr.service.ComputeCache$1.load(ComputeCache.java:98)
at
com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3589)
at
com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2374)
at
com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2337)
at
com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2252)
... 13 more
Unable to load cluster state, assuming it has no running nodes.
java.io.FileNotFoundException: /home/ec2-user/.whirr/hadoop-yarn/instances
(No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.(FileInputStream.java:140)
at com.google.common.io.Files$1.getInput(Files.java:109)
at com.google.common.io.Files$1.getInput(Files.java:106)
at com.google.common.io.CharStreams$2.getInput(CharStreams.java:93)
at com.google.common.io.CharStreams$2

Re: How to find generated mapreduce code for pig/hive query

2014-03-28 Thread Shahab Yunus
You can use the ILLUSTRATE and EXPLAIN commands to see the execution plan,
if that is what you mean by 'under the hood algorithm'.

http://pig.apache.org/docs/r0.11.1/test.html
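
As a small illustration in the Grunt shell (a hedged sketch; the relation
names and input path are made up):

    A = LOAD 'input.txt' AS (word:chararray);
    B = GROUP A BY word;
    C = FOREACH B GENERATE group, COUNT(A);
    EXPLAIN C;      -- prints the logical, physical and MapReduce plans for C
    ILLUSTRATE C;   -- steps a small sample of data through that plan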

Regards,
Shahab


On Fri, Mar 28, 2014 at 5:51 PM, Spark Storm  wrote:

> hello experts,
>
> am really new to hadoop - Is it possible, based on a pig or hive query, to
> find out the under-the-hood map reduce algorithm?
>
> thanks
>


How to find generated mapreduce code for pig/hive query

2014-03-28 Thread Spark Storm
hello experts,

I am really new to Hadoop - is it possible, based on a Pig or Hive query, to
find out the under-the-hood MapReduce algorithm?

thanks


Re: How check sum are generated for blocks in data node

2014-03-28 Thread Wellington Chevreuil
Hi Reena,

the pipeline is per block. If you have half of your file on data node A only,
that means the pipeline had only one node (node A, in this case, probably
because the replication factor is set to 1), and so data node A has the checksums
for its block. The same applies to data node B.

All nodes will have checksums for the blocks they own. Checksums are passed
together with the block as it goes through the pipeline, but since the last node
in the pipeline receives the original checksums along with the block from the
previous nodes, it is only necessary to do the validation on this last one:
if it passes there, it means the file was not corrupted on any of the
previous nodes either.

Cheers.
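
As a related client-side note (a hedged sketch, not from the thread above):
HDFS also exposes the checksums it stores to clients, e.g. via
FileSystem.getFileChecksum; the path below is a placeholder.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileChecksum;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowChecksum {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Aggregate checksum derived from the per-block CRCs
        // (may be null on filesystems other than HDFS)
        FileChecksum sum = fs.getFileChecksum(new Path("/user/reena/file.txt"));
        System.out.println(sum);
      }
    }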

On 28 Mar 2014, at 10:28, reena upadhyay  wrote:

> I was going through this link 
> http://stackoverflow.com/questions/9406477/data-integrity-in-hdfs-which-data-nodes-verifies-the-checksum
>  . Its written that in recent version of hadoop only the last data node 
> verifies the checksum as the write happens in a pipeline fashion. 
> Now I have a question:
> Assuming my cluster has two data nodes A and B cluster, I have a file, half 
> of the file content is written on first data node A and the other remaining 
> half is written on the second data node B to take advantage of parallelism.  
> My question is:  Will data node A will not store the check sum for the blocks 
> stored on it. 
> 
> As per the line "only the last data node verifies the checksum", it looks 
> like only the  last data node in my case it will be data node B, will 
> generate the checksum. But if only data node B generates checksum, then it 
> will generate the check sum only for the blocks stored on data node B. What 
> about the checksum for the data blocks on data node  machine A?



Re: Why is HDFS_BYTES_WRITTEN is much larger than HDFS_BYTES_READ in this case?

2014-03-28 Thread Kim Chew
None of that.

I checked the input file's SequenceFile header and it says
"org.apache.hadoop.io.compress.zlib.BuiltInZlibDeflater"

Kim


On Fri, Mar 28, 2014 at 10:34 AM, Hardik Pandya wrote:

> what is your compression format gzip, lzo or snappy
>
> for lzo final output
>
> FileOutputFormat.setCompressOutput(conf, true);
> FileOutputFormat.setOutputCompressorClass(conf, LzoCodec.class);
>
> In addition, to make LZO splittable, you need to make a LZO index file.
>
>
> On Thu, Mar 27, 2014 at 8:57 PM, Kim Chew  wrote:
>
>> Thanks folks.
>>
>> I was not aware my input data file had been compressed.
>> FileOutputFormat.setCompressOutput() is set to true when the file is
>> written. 8-(
>>
>> Kim
>>
>>
>> On Thu, Mar 27, 2014 at 5:46 PM, Mostafa Ead wrote:
>>
>>> The following might answer you partially:
>>>
>>> Input key is not read from HDFS, it is auto generated as the offset of
>>> the input value in the input file. I think that is (partially) why read
>>> hdfs bytes is smaller than written hdfs bytes.
>>>  On Mar 27, 2014 1:34 PM, "Kim Chew"  wrote:
>>>
 I am also wondering if, say, I have two identical timestamps so they are
 going to be written to the same file. Does MultipleOutputs handle
 appending?

 Thanks.

 Kim


 On Thu, Mar 27, 2014 at 12:30 PM, Thomas Bentsen  wrote:

> Have you checked the content of the files you write?
>
>
> /th
>
> On Thu, 2014-03-27 at 11:43 -0700, Kim Chew wrote:
> > I have a simple M/R job using Mapper only thus no reducer. The mapper
> > read a timestamp from the value, generate a path to the output file
> > and writes the key and value to the output file.
> >
> >
> > The input file is a sequence file, not compressed and stored in the
> > HDFS, it has a size of 162.68 MB.
> >
> >
> > Output also is written as a sequence file.
> >
> >
> >
> > However, after I ran my job, I have two output part files from the
> > mapper. One has a size of 835.12 MB and the other has a size of
> 224.77
> > MB. So why is the total outputs size is so much larger? Shouldn't it
> > be more or less equal to the input's size of 162.68MB since I just
> > write the key and value passed to mapper to the output?
> >
> >
> > Here is the mapper code snippet,
> >
> > public void map(BytesWritable key, BytesWritable value, Context
> > context) throws IOException, InterruptedException {
> >
> > long timestamp = bytesToInt(value.getBytes(),
> > TIMESTAMP_INDEX);;
> > String tsStr = sdf.format(new Date(timestamp * 1000L));
> >
> > mos.write(key, value, generateFileName(tsStr)); // mos is a
> > MultipleOutputs object.
> > }
> >
> > private String generateFileName(String key) {
> > return outputDir+"/"+key+"/raw-vectors";
> > }
> >
> >
> > And here are the job outputs,
> >
> > 14/03/27 11:00:56 INFO mapred.JobClient: Launched map tasks=2
> > 14/03/27 11:00:56 INFO mapred.JobClient: Data-local map tasks=2
> > 14/03/27 11:00:56 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
> > 14/03/27 11:00:56 INFO mapred.JobClient:   File Output Format
> > Counters
> > 14/03/27 11:00:56 INFO mapred.JobClient: Bytes Written=0
> > 14/03/27 11:00:56 INFO mapred.JobClient:   FileSystemCounters
> > 14/03/27 11:00:56 INFO mapred.JobClient:
> HDFS_BYTES_READ=171086386
> > 14/03/27 11:00:56 INFO mapred.JobClient: FILE_BYTES_WRITTEN=54272
> > 14/03/27 11:00:56 INFO mapred.JobClient:
> > HDFS_BYTES_WRITTEN=374798
> > 14/03/27 11:00:56 INFO mapred.JobClient:   File Input Format Counters
> > 14/03/27 11:00:56 INFO mapred.JobClient: Bytes Read=170782415
> > 14/03/27 11:00:56 INFO mapred.JobClient:   Map-Reduce Framework
> > 14/03/27 11:00:56 INFO mapred.JobClient: Map input records=547
> > 14/03/27 11:00:56 INFO mapred.JobClient: Physical memory (bytes)
> > snapshot=166428672
> > 14/03/27 11:00:56 INFO mapred.JobClient: Spilled Records=0
> > 14/03/27 11:00:56 INFO mapred.JobClient: Total committed heap
> > usage (bytes)=38351872
> > 14/03/27 11:00:56 INFO mapred.JobClient: CPU time spent
> (ms)=20080
> > 14/03/27 11:00:56 INFO mapred.JobClient: Virtual memory (bytes)
> > snapshot=1240104960
> > 14/03/27 11:00:56 INFO mapred.JobClient: SPLIT_RAW_BYTES=286
> > 14/03/27 11:00:56 INFO mapred.JobClient: Map output records=0
> >
> >
> > TIA,
> >
> >
> > Kim
> >
>
>
>

>>
>


Re: when it's safe to read map-reduce result?

2014-03-28 Thread Hardik Pandya
If the job completes without any failures, the exit code should be 0 and it
is safe to read the result.
public class MyApp extends Configured implements Tool {

   public int run(String[] args) throws Exception {
 // Configuration processed by ToolRunner
 Configuration conf = getConf();

 // Create a JobConf using the processed conf
 JobConf job = new JobConf(conf, MyApp.class);

 // Process custom command-line options
 Path in = new Path(args[1]);
 Path out = new Path(args[2]);

 // Specify various job-specific parameters
 job.setJobName("my-app");
 FileInputFormat.setInputPaths(job, in);    // JobConf has no setInputPath(); use the old-API FileInputFormat
 FileOutputFormat.setOutputPath(job, out);  // likewise use FileOutputFormat for the output path
 job.setMapperClass(MyMapper.class);
 job.setReducerClass(MyReducer.class);

 // Submit the job, then poll for progress until the job is complete
 JobClient.runJob(job);
 return 0;
   }

   public static void main(String[] args) throws Exception {
 // Let ToolRunner handle generic command-line options
 int res = ToolRunner.run(new Configuration(), new MyApp(), args);

 System.exit(res);
   }
 }
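
Alternatively, a minimal hedged sketch of checking the _SUCCESS marker before
reading (assuming 'out' is the job's output directory from the code above):

    FileSystem fs = out.getFileSystem(conf);
    if (fs.exists(new Path(out, "_SUCCESS"))) {
      // the job committed its output; safe to read the part files under 'out'
    }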



On Fri, Mar 28, 2014 at 4:41 AM, Li Li  wrote:

> thanks. is the following codes safe?
> int exitCode=ToolRunner.run()
> if(exitCode==0){
>//safe to read result
> }
>
> On Fri, Mar 28, 2014 at 4:36 PM, Dieter De Witte 
> wrote:
> > _SUCCESS implies that the job has successfully terminated, so this seems
> > like a reasonable criterion.
> >
> > Regards, Dieter
> >
> >
> > 2014-03-28 9:33 GMT+01:00 Li Li :
> >
> >> I have a program that do some map-reduce job and then read the result
> >> of the job.
> >> I learned that hdfs is not strong consistent. when it's safe to read the
> >> result?
> >> as long as output/_SUCCESS exist?
> >
> >
>


Re: Replication HDFS

2014-03-28 Thread Wellington Chevreuil
Hi Victor,

if by replication you mean copying from one cluster to another, you can use the
distcp command.

Cheers.
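
For example (a hedged sketch; the NameNode hosts and paths are placeholders):

    hadoop distcp hdfs://nn-source:8020/data hdfs://nn-target:8020/data

Note that distcp runs as a batch MapReduce job, so it gives periodic copies
rather than real-time synchronisation.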

On 28 Mar 2014, at 16:30, Serge Blazhievsky  wrote:

> You mean replication between two different hadoop cluster or you just need 
> data to be replicated between two different nodes? 
> 
> Sent from my iPhone
> 
> On Mar 28, 2014, at 8:10 AM, Victor Belizário  
> wrote:
> 
>> Hey,
>> 
>> I did look in HDFS for replication in filesystem master x slave.
>> 
>> Have any way to do master x master?
>> 
>> I just have 1 TB of files in a server and i want to replicate to another 
>> server, in real time sync.
>> 
>> Thanks !



Re: Hadoop documentation: control flow and FSM diagrams

2014-03-28 Thread Hardik Pandya
Very helpful indeed Emilio, thanks!


On Fri, Mar 28, 2014 at 12:58 PM, Emilio Coppa  wrote:

> Hi All,
>
> I have created a wiki on github:
>
> https://github.com/ercoppa/HadoopDiagrams/wiki
>
> This is an effort to provide an updated documentation of how the internals
> of Hadoop work.  The main idea is to help the user understand the "big
> picture" without removing too much internal details. You can find several
> diagrams (e.g. Finite State Machine and control flow). They are based on
> Hadoop 2.3.0.
>
> Notice that:
>
> - they are not specified in any formal language (e.g., UML) but they
> should easy to understand (Do you agree?)
> - they cover only some aspects of Hadoop but I am improving them day after
> day
> - they are not always correct but I am trying to fix errors,
> remove ambiguities, etc
>
> I hope this can be helpful to somebody out there. Any feedback from you
> may be valuable for me.
>
> Emilio.
>


Re: reducing HDFS FS connection timeouts

2014-03-28 Thread Hardik Pandya
how about adding

ipc.client.connect.max.retries.on.timeouts = 2 (the default is 45)

It indicates the number of retries a client will make on socket timeout to
establish a server connection. Does that help?
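
A minimal hedged sketch combining it with the properties already tried (the
values and NameNode URI are illustrative; the usual org.apache.hadoop.conf,
org.apache.hadoop.fs and java.net.URI imports are assumed):

    Configuration conf = new Configuration();
    conf.set("ipc.client.connect.max.retries", "2");
    conf.set("ipc.client.connect.max.retries.on.timeouts", "2");
    conf.set("ipc.client.connect.timeout", "7000");
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);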


On Thu, Mar 27, 2014 at 4:23 PM, John Lilley wrote:

>  It seems to take a very long time to timeout a connection to an invalid
> NN URI.  Our application is interactive so the defaults of taking many
> minutes don't work well.  I've tried setting:
>
> conf.set("ipc.client.connect.max.retries", "2");
>
> conf.set("ipc.client.connect.timeout", "7000");
>
> before calling FileSystem.get() but it doesn't seem to matter.
>
> What is the prescribed technique for lowering connection timeout to HDFS?
>
> Thanks
>
> john
>
>
>


Re: How to get locations of blocks programmatically?

2014-03-28 Thread Hardik Pandya
Have you looked into the FileSystem API? These methods exist in Hadoop v2.2.0:

http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/fs/FileSystem.html

but do not exist in
http://hadoop.apache.org/docs/r1.2.0/api/org/apache/hadoop/fs/FileSystem.html

 
org.apache.hadoop.fs.RemoteIterator<LocatedFileStatus> listFiles(Path f, boolean recursive)
  List the statuses and block locations of the files in the given path.

org.apache.hadoop.fs.RemoteIterator<LocatedFileStatus> listLocatedStatus(Path f)
  List the statuses of the files/directories in the given path if
  the path is a directory.


On Thu, Mar 27, 2014 at 10:03 PM, Libo Yu  wrote:

> Hi all,
>
> "hadoop path fsck -files -block -locations" can list locations for all
> blocks in the path.
> Is it possible to list all blocks and the block locations for a given path
> programmatically?
> Thanks,
>
> Libo
>


Re: Why is HDFS_BYTES_WRITTEN is much larger than HDFS_BYTES_READ in this case?

2014-03-28 Thread Hardik Pandya
What is your compression format: gzip, LZO or Snappy?

For LZO final output:

FileOutputFormat.setCompressOutput(conf, true);
FileOutputFormat.setOutputCompressorClass(conf, LzoCodec.class);

In addition, to make LZO splittable, you need to make a LZO index file.
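
For completeness, a hedged sketch of enabling compressed SequenceFile output
with the old mapred API (GzipCodec is used here only for illustration, since
the LZO codec ships separately from Apache Hadoop; the org.apache.hadoop.mapred
and org.apache.hadoop.io imports are assumed):

    JobConf conf = new JobConf();
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    FileOutputFormat.setCompressOutput(conf, true);
    FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
    // compress whole blocks of records rather than individual values
    SequenceFileOutputFormat.setOutputCompressionType(conf,
        SequenceFile.CompressionType.BLOCK);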


On Thu, Mar 27, 2014 at 8:57 PM, Kim Chew  wrote:

> Thanks folks.
>
> I was not aware my input data file had been compressed.
> FileOutputFormat.setCompressOutput() is set to true when the file is
> written. 8-(
>
> Kim
>
>
> On Thu, Mar 27, 2014 at 5:46 PM, Mostafa Ead wrote:
>
>> The following might answer you partially:
>>
>> Input key is not read from HDFS, it is auto generated as the offset of
>> the input value in the input file. I think that is (partially) why read
>> hdfs bytes is smaller than written hdfs bytes.
>>  On Mar 27, 2014 1:34 PM, "Kim Chew"  wrote:
>>
>>> I am also wondering if, say, I have two identical timestamps so they are
>>> going to be written to the same file. Does MultipleOutputs handle appending?
>>>
>>> Thanks.
>>>
>>> Kim
>>>
>>>
>>> On Thu, Mar 27, 2014 at 12:30 PM, Thomas Bentsen  wrote:
>>>
 Have you checked the content of the files you write?


 /th

 On Thu, 2014-03-27 at 11:43 -0700, Kim Chew wrote:
 > I have a simple M/R job using Mapper only thus no reducer. The mapper
 > read a timestamp from the value, generate a path to the output file
 > and writes the key and value to the output file.
 >
 >
 > The input file is a sequence file, not compressed and stored in the
 > HDFS, it has a size of 162.68 MB.
 >
 >
 > Output also is written as a sequence file.
 >
 >
 >
 > However, after I ran my job, I have two output part files from the
 > mapper. One has a size of 835.12 MB and the other has a size of 224.77
 > MB. So why is the total outputs size is so much larger? Shouldn't it
 > be more or less equal to the input's size of 162.68MB since I just
 > write the key and value passed to mapper to the output?
 >
 >
 > Here is the mapper code snippet,
 >
 > public void map(BytesWritable key, BytesWritable value, Context
 > context) throws IOException, InterruptedException {
 >
 > long timestamp = bytesToInt(value.getBytes(),
 > TIMESTAMP_INDEX);;
 > String tsStr = sdf.format(new Date(timestamp * 1000L));
 >
 > mos.write(key, value, generateFileName(tsStr)); // mos is a
 > MultipleOutputs object.
 > }
 >
 > private String generateFileName(String key) {
 > return outputDir+"/"+key+"/raw-vectors";
 > }
 >
 >
 > And here are the job outputs,
 >
 > 14/03/27 11:00:56 INFO mapred.JobClient: Launched map tasks=2
 > 14/03/27 11:00:56 INFO mapred.JobClient: Data-local map tasks=2
 > 14/03/27 11:00:56 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
 > 14/03/27 11:00:56 INFO mapred.JobClient:   File Output Format
 > Counters
 > 14/03/27 11:00:56 INFO mapred.JobClient: Bytes Written=0
 > 14/03/27 11:00:56 INFO mapred.JobClient:   FileSystemCounters
 > 14/03/27 11:00:56 INFO mapred.JobClient: HDFS_BYTES_READ=171086386
 > 14/03/27 11:00:56 INFO mapred.JobClient: FILE_BYTES_WRITTEN=54272
 > 14/03/27 11:00:56 INFO mapred.JobClient:
 > HDFS_BYTES_WRITTEN=374798
 > 14/03/27 11:00:56 INFO mapred.JobClient:   File Input Format Counters
 > 14/03/27 11:00:56 INFO mapred.JobClient: Bytes Read=170782415
 > 14/03/27 11:00:56 INFO mapred.JobClient:   Map-Reduce Framework
 > 14/03/27 11:00:56 INFO mapred.JobClient: Map input records=547
 > 14/03/27 11:00:56 INFO mapred.JobClient: Physical memory (bytes)
 > snapshot=166428672
 > 14/03/27 11:00:56 INFO mapred.JobClient: Spilled Records=0
 > 14/03/27 11:00:56 INFO mapred.JobClient: Total committed heap
 > usage (bytes)=38351872
 > 14/03/27 11:00:56 INFO mapred.JobClient: CPU time spent (ms)=20080
 > 14/03/27 11:00:56 INFO mapred.JobClient: Virtual memory (bytes)
 > snapshot=1240104960
 > 14/03/27 11:00:56 INFO mapred.JobClient: SPLIT_RAW_BYTES=286
 > 14/03/27 11:00:56 INFO mapred.JobClient: Map output records=0
 >
 >
 > TIA,
 >
 >
 > Kim
 >



>>>
>


Re: how to be assignee ?

2014-03-28 Thread Azuryy Yu
Hi Avinash,

You should be added as a sub-project's contributor; then you can be an
assignee. You can find how to become a contributor on the wiki.


On Fri, Mar 28, 2014 at 6:50 PM, Avinash Kujur  wrote:

> hi,
>
> how can i be assignee fro a particular issue?
> i can't see any option for being assignee on the page.
>
> Thanks.
>


Hadoop documentation: control flow and FSM diagrams

2014-03-28 Thread Emilio Coppa
Hi All,

I have created a wiki on github:

https://github.com/ercoppa/HadoopDiagrams/wiki

This is an effort to provide updated documentation of how the internals
of Hadoop work.  The main idea is to help the user understand the "big
picture" without omitting too many internal details. You can find several
diagrams (e.g. Finite State Machine and control flow). They are based on
Hadoop 2.3.0.

Notice that:

- they are not specified in any formal language (e.g., UML) but they should
be easy to understand (Do you agree?)
- they cover only some aspects of Hadoop but I am improving them day after
day
- they are not always correct but I am trying to fix errors,
remove ambiguities, etc.

I hope this can be helpful to somebody out there. Any feedback from you may
be valuable for me.

Emilio.


RE: R on hadoop

2014-03-28 Thread Martin, Nick
If you're spitballing options, you might also look at Pattern:
http://www.cascading.org/projects/pattern/

It has some nuances, so be sure to spend the time to vet your specific use case
(i.e. what you're actually doing in R and what you want to accomplish by
leveraging data in Hadoop).

From: Sri [mailto:hadoop...@gmail.com]
Sent: Thursday, March 27, 2014 2:51 AM
To: user@hadoop.apache.org
Cc: user@hadoop.apache.org
Subject: Re: R on hadoop

Try open-source h2o.ai - a CRAN-style package that allows fast & scalable R on
Hadoop in memory.
One can invoke single-threaded R from the h2o package, and the runtime on
clusters is Java (not R!), so you get better memory management.

http://docs.0xdata.com/deployment/hadoop.html

http://docs.0xdata.com/Ruser/Rpackage.html


Sri

On Mar 26, 2014, at 6:53, Saravanan Nagarajan wrote:
HI Jay,

Below is my understanding of the Hadoop+R environment.

1. R contains many data mining algorithms; to re-use them we have many tools like
RHIPE, RHadoop, etc.
2. These tools convert R algorithms to run on Hadoop MapReduce using
RMR, but I am not sure whether this will work for all algorithms in R.

Please let me know if you have any other points.

Thanks,
Saravanan
linkedin.com/in/saravanan303



On Wed, Mar 26, 2014 at 5:35 PM, Jay Vyas wrote:
Do you mean
(1) running mapreduce jobs from R ?

(2) Running R from a mapreduce job ?
Without much extra ceremony, for the latter, you could use either MapReduce 
streaming or pig to call a custom program, as long as R is installed on every 
node of the cluster itself


On Wed, Mar 26, 2014 at 6:39 AM, Saravanan Nagarajan wrote:
HI Siddharth,

You can try the "Big Data Analytics with R and Hadoop" book; it gives many
options and detailed steps to integrate Hadoop and R.

If you need this book then mail me to 
saravanan.nagarajan...@gmail.com.

Thanks,
Saravanan
linkedin.com/in/saravanan303






On Tue, Mar 25, 2014 at 2:04 AM, Jagat Singh wrote:
Hi,
Please see RHadoop and RMR

https://www.google.com.au/search?q=rhadoop+installation
Thanks,
Jagat Singh

On Tue, Mar 25, 2014 at 7:19 AM, Siddharth Tiwari wrote:
Hi team any docummentation around installing r on hadoop

Sent from my iPhone




--
Jay Vyas
http://jayunit100.blogspot.com



Re: Replication HDFS

2014-03-28 Thread Serge Blazhievsky
Do you mean replication between two different Hadoop clusters, or do you just
need data to be replicated between two different nodes?

Sent from my iPhone

> On Mar 28, 2014, at 8:10 AM, Victor Belizário  
> wrote:
> 
> Hey,
> 
> I did look in HDFS for replication in filesystem master x slave.
> 
> Have any way to do master x master?
> 
> I just have 1 TB of files in a server and i want to replicate to another 
> server, in real time sync.
> 
> Thanks !


Replication HDFS

2014-03-28 Thread Victor Belizário
Hey,
I looked into HDFS replication in a master x slave filesystem setup.
Is there any way to do master x master?
I just have 1 TB of files on a server and I want to replicate them to another
server with real-time sync.
Thanks!

Re: YarnException: Unauthorized request to start container. This token is expired.

2014-03-28 Thread Leibnitz
no doubt

Sent from my iPhone 6

> On Mar 23, 2014, at 17:37, Fengyun RAO  wrote:
> 
> What does this exception mean? I googled a lot, all the results tell me it's 
> because the time is not synchronized between datanode and namenode.
> However, I checked all the servers, that the ntpd service is on, and the time 
> differences are less than 1 second.
> What's more, the tasks are not always failing on certain datanodes. 
> It fails and then it restarts and succeeds. If it were the time problem, I 
> guess it would always fail.
> 
> My hadoop version is CDH5 beta. Below is the detailed log:
> 
> 14/03/23 14:57:06 INFO mapreduce.Job: Running job: job_1394434496930_0032
> 14/03/23 14:57:17 INFO mapreduce.Job: Job job_1394434496930_0032 running in 
> uber mode : false
> 14/03/23 14:57:17 INFO mapreduce.Job:  map 0% reduce 0%
> 14/03/23 15:08:01 INFO mapreduce.Job: Task Id : 
> attempt_1394434496930_0032_m_34_0, Status : FAILED
> Container launch failed for container_1394434496930_0032_01_41 : 
> org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to 
> start container.
> This token is expired. current time is 1395558481146 found 1395558443384
>at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
>at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>at 
> org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
>at 
> org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
>at 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:155)
>at 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:370)
>at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>at java.lang.Thread.run(Thread.java:724)
> 
> 14/03/23 15:08:02 INFO mapreduce.Job:  map 1% reduce 0%
> 14/03/23 15:09:36 INFO mapreduce.Job: Task Id : 
> attempt_1394434496930_0032_m_36_0, Status : FAILED
> Container launch failed for container_1394434496930_0032_01_38 : 
> org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to 
> start container.
> This token is expired. current time is 1395558575889 found 1395558443245
>at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
>at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>at 
> org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
>at 
> org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
>at 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:155)
>at 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:370)
>at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>at java.lang.Thread.run(Thread.java:724)
> 


how to be assignee ?

2014-03-28 Thread Avinash Kujur
hi,

how can I become the assignee for a particular issue?
I can't see any option for becoming the assignee on the page.

Thanks.


How check sum are generated for blocks in data node

2014-03-28 Thread reena upadhyay
I was going through this link
http://stackoverflow.com/questions/9406477/data-integrity-in-hdfs-which-data-nodes-verifies-the-checksum
. It's written that in recent versions of Hadoop only the last data node
verifies the checksum, as the write happens in a pipeline fashion.
Now I have a question:
Assuming my cluster has two data nodes A and B, I have a file where half of
the file content is written on the first data node A and the remaining half
is written on the second data node B to take advantage of parallelism.  My
question is:  will data node A not store the checksum for the blocks
stored on it?

As per the line "only the last data node verifies the checksum", it looks like
only the last data node, in my case data node B, will generate the
checksum. But if only data node B generates a checksum, then it will generate
the checksum only for the blocks stored on data node B. What about the checksum
for the data blocks on data node machine A?
  

Re: Does hadoop depends on ecc memory to generate checksum for data stored in HDFS

2014-03-28 Thread Harsh J
While the HDFS functionality of computing, storing and validating
checksums for block files does not specifically _require_ ECC, you do
_want_ ECC to avoid frequent checksum failures.

This is noted in Tom's book as well, in the chapter that discusses
setting up your own cluster:
"ECC memory is strongly recommended, as several Hadoop users have
reported seeing many checksum errors when using non-ECC memory on
Hadoop clusters."

On Fri, Mar 28, 2014 at 3:15 PM, reena upadhyay  wrote:
> To ensure data I/O integrity,  hadoop uses CRC 32 mechanism  to generate
> checksum for the data stored on hdfs . But suppose I have a data node
> machine that does not have ecc(error correcting code) type of memory, So
> will hadoop hdfs will be able to generate checksum for data blocks when
> read/write will happen in hdfs?
>
> Or In simple words, Does hadoop depends on ecc memory to generate checksum
> for data stored in HDFS?
>
>



-- 
Harsh J


Re: How to run data node block scanner on data node in a cluster from a remote machine?

2014-03-28 Thread Harsh J
Hello Reena,

No, there isn't a programmatic way to invoke the block scanner. Note,
though, that the property controlling its period is DN-local, so you can
change it on the DNs and do a rolling DN restart to make it take effect
without requiring HDFS downtime.
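
A hedged example of the hdfs-site.xml entry on each DataNode (the full
property name carries an .hours suffix; the value below is illustrative):

    <property>
      <name>dfs.datanode.scan.period.hours</name>
      <value>168</value>
    </property>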

On Fri, Mar 28, 2014 at 3:07 PM, reena upadhyay  wrote:
> How to run data node block scanner on data node in a cluster from a remote
> machine?
> By default data node executes block scanner in 504 hours. This is the
> default value of dfs.datanode.scan.period . If I want to run the data node
> block scanner then one way is to configure the property of
> dfs.datanode.scan.period in hdfs-site.xml, but is there any other way?
> Is it possible to run the data node block scanner on a data node either through
> a command or programmatically?



-- 
Harsh J


Does hadoop depends on ecc memory to generate checksum for data stored in HDFS

2014-03-28 Thread reena upadhyay
To ensure data I/O integrity, Hadoop uses a CRC32 mechanism to generate
checksums for the data stored on HDFS. But suppose I have a data node machine
that does not have ECC (error-correcting code) memory: will Hadoop HDFS still
be able to generate checksums for data blocks when reads/writes happen in HDFS?

Or, in simple words, does Hadoop depend on ECC memory to generate checksums
for data stored in HDFS?


  

How to run data node block scanner on data node in a cluster from a remote machine?

2014-03-28 Thread reena upadhyay
How to run data node block scanner on data node in a cluster from a remote 
machine?


By default the data node executes the block scanner every 504 hours; this is
the default value of dfs.datanode.scan.period. If I want to run the data
node block scanner, one way is to configure the dfs.datanode.scan.period
property in hdfs-site.xml, but is there any other way?
Is it possible to run the data node block scanner on a data node either
through a command or programmatically?


Re: when it's safe to read map-reduce result?

2014-03-28 Thread Li Li
thanks. is the following codes safe?
int exitCode=ToolRunner.run()
if(exitCode==0){
   //safe to read result
}

On Fri, Mar 28, 2014 at 4:36 PM, Dieter De Witte  wrote:
> _SUCCESS implies that the job has successfully terminated, so this seems like
> a reasonable criterion.
>
> Regards, Dieter
>
>
> 2014-03-28 9:33 GMT+01:00 Li Li :
>
>> I have a program that do some map-reduce job and then read the result
>> of the job.
>> I learned that hdfs is not strong consistent. when it's safe to read the
>> result?
>> as long as output/_SUCCESS exist?
>
>


Re: when it's safe to read map-reduce result?

2014-03-28 Thread Dieter De Witte
_SUCCESS implies that the job has successfully terminated, so this seems like
a reasonable criterion.

Regards, Dieter


2014-03-28 9:33 GMT+01:00 Li Li :

> I have a program that do some map-reduce job and then read the result
> of the job.
> I learned that hdfs is not strong consistent. when it's safe to read the
> result?
> as long as output/_SUCCESS exist?
>


when it's safe to read map-reduce result?

2014-03-28 Thread Li Li
I have a program that do some map-reduce job and then read the result
of the job.
I learned that hdfs is not strong consistent. when it's safe to read the result?
as long as output/_SUCCESS exist?


Re: Maps stuck on Pending

2014-03-28 Thread Dieter De Witte
There is a big chance that your map output is being copied to your
reducer; this could take quite some time if you have a lot of data and
could be resolved by:

1) having more reducers
2) adjust the slowstart parameter so that the copying can start while the
map tasks are still running
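
For 2), a hedged example of the Hadoop 2 property (the value is illustrative;
reducers start fetching once this fraction of map tasks has completed):

    <property>
      <name>mapreduce.job.reduce.slowstart.completedmaps</name>
      <value>0.80</value>
    </property>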

Regards, Dieter


2014-03-27 20:42 GMT+01:00 Clay McDonald :

> Thanks Serge, looks like I need to add memory to my datanodes.
>
> Clay McDonald
> Cell: 202.560.4101
> Direct: 202.747.5962
>
> -Original Message-
> From: Serge Blazhievsky [mailto:hadoop...@gmail.com]
> Sent: Thursday, March 27, 2014 2:16 PM
> To: user@hadoop.apache.org
> Cc: 
> Subject: Re: Maps stuck on Pending
>
> Next step would be to look in the logs under userlog directory for that job
>
> Sent from my iPhone
>
> > On Mar 27, 2014, at 11:08 AM, Clay McDonald <
> stuart.mcdon...@bateswhite.com> wrote:
> >
> > Hi all, I have a job running with 1750 maps and 1 reduce and the status
> has been the same for the last two hours. Any thoughts?
> >
> > Thanks, Clay
>


Re: HADOOP_MAPRED_HOME not found!

2014-03-28 Thread divye sheth
Hi Avinash,

You can execute the export command on any one machine in the cluster for
now. Once you have executed the export command, i.e. export
HADOOP_MAPRED_HOME=/path/to/your/hadoop/installation, you can then execute
the mapred job -list command from that very same machine.

Thanks
Divye Sheth
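
To make it permanent, a hedged sketch of what could go into ~/.bashrc (the
install path is a placeholder; point it at wherever the tarball was extracted):

    export HADOOP_MAPRED_HOME=/opt/hadoop-2.x.x
    export PATH=$PATH:$HADOOP_MAPRED_HOME/bin

    # then, from the same shell:
    mapred job -list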


On Fri, Mar 28, 2014 at 12:57 PM, Avinash Kujur  wrote:

> i am not getting where to set HADOOP_MAPRED_HOME and how to set.
>
> thanks
>
>
> On Fri, Mar 28, 2014 at 12:06 AM, divye sheth wrote:
>
>> You can execute this command on any machine where you have set the
>> HADOOP_MAPRED_HOME
>>
>> Thanks
>> Divye Sheth
>>
>>
>> On Fri, Mar 28, 2014 at 12:31 PM, Avinash Kujur wrote:
>>
>>> we can execute the above command anywhere or do i need to execute it in
>>> any particular directory?
>>>
>>> thanks
>>>
>>>
>>> On Thu, Mar 27, 2014 at 11:41 PM, divye sheth wrote:
>>>
 I believe you are using Hadoop 2. In order to get the mapred working
 you need to set the HADOOP_MAPRED_HOME path in either your /etc/profile or
 .bashrc file or you can use the command given below to temporarily set the
 variable.

 export HADOOP_MAPRED_HOME=$HADOOP_INSTALL

 $HADOOP_INSTALL is the location where the hadoop tar ball is extracted.

 This should work for you.

 Thanks
 Divye Sheth



 On Fri, Mar 28, 2014 at 11:53 AM, Rahul Singh <
 smart.rahul.i...@gmail.com> wrote:

> Try adding the hadoop bin path to system path.
>
>
> -Rahul Singh
>
>
> On Fri, Mar 28, 2014 at 11:32 AM, Azuryy Yu wrote:
>
>> it was defined at hadoop-config.sh
>>
>>
>>
>> On Fri, Mar 28, 2014 at 1:19 PM, divye sheth wrote:
>>
>>> Which version of hadoop are u using? AFAIK the hadoop mapred home is
>>> the directory where hadoop is installed or in other words untarred.
>>>
>>> Thanks
>>> Divye Sheth
>>> On Mar 28, 2014 10:43 AM, "Avinash Kujur"  wrote:
>>>
 hi,

 when i am trying to execute this command:
 hadoop job -history ~/1
 its giving error like:
 DEPRECATED: Use of this script to execute mapred command is
 deprecated.
 Instead use the mapred command for it.

 HADOOP_MAPRED_HOME not found!

 from where can i get HADOOP_MAPRED_HOME?

 thanks.

>>>
>>
>

>>>
>>
>


Re: mapred job -list error

2014-03-28 Thread Harsh J
Please also indicate your exact Hadoop version in use.

On Fri, Mar 28, 2014 at 9:04 AM, haihong lu  wrote:
> dear all:
>
> I had a problem today, when i executed the command "mapred job
> -list" on a slave, an error came out. show the message as below:
>
> 14/03/28 11:18:47 INFO Configuration.deprecation: session.id is deprecated.
> Instead, use dfs.metrics.session-id
> 14/03/28 11:18:47 INFO jvm.JvmMetrics: Initializing JVM Metrics with
> processName=JobTracker, sessionId=
> Exception in thread "main" java.lang.NullPointerException
> at org.apache.hadoop.mapreduce.tools.CLI.listJobs(CLI.java:504)
> at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:312)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
> at org.apache.hadoop.mapred.JobClient.main(JobClient.java:1237)
>
> when i executed the same command yesterday, it was ok.
> Thanks for any help



-- 
Harsh J


Re: How to get locations of blocks programmatically?

2014-03-28 Thread Harsh J
Yes, use 
http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/fs/FileSystem.html#getFileBlockLocations(org.apache.hadoop.fs.Path,
long, long)

On Fri, Mar 28, 2014 at 7:33 AM, Libo Yu  wrote:
> Hi all,
>
> "hadoop path fsck -files -block -locations" can list locations for all
> blocks in the path.
> Is it possible to list all blocks and the block locations for a given path
> programmatically?
> Thanks,
>
> Libo



-- 
Harsh J


Re: HADOOP_MAPRED_HOME not found!

2014-03-28 Thread Avinash Kujur
I am not sure where to set HADOOP_MAPRED_HOME or how to set it.

thanks


On Fri, Mar 28, 2014 at 12:06 AM, divye sheth  wrote:

> You can execute this command on any machine where you have set the
> HADOOP_MAPRED_HOME
>
> Thanks
> Divye Sheth
>
>
> On Fri, Mar 28, 2014 at 12:31 PM, Avinash Kujur  wrote:
>
>> we can execute the above command anywhere or do i need to execute it in
>> any particular directory?
>>
>> thanks
>>
>>
>> On Thu, Mar 27, 2014 at 11:41 PM, divye sheth wrote:
>>
>>> I believe you are using Hadoop 2. In order to get the mapred working you
>>> need to set the HADOOP_MAPRED_HOME path in either your /etc/profile or
>>> .bashrc file or you can use the command given below to temporarily set the
>>> variable.
>>>
>>> export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
>>>
>>> $HADOOP_INSTALL is the location where the hadoop tar ball is extracted.
>>>
>>> This should work for you.
>>>
>>> Thanks
>>> Divye Sheth
>>>
>>>
>>>
>>> On Fri, Mar 28, 2014 at 11:53 AM, Rahul Singh <
>>> smart.rahul.i...@gmail.com> wrote:
>>>
 Try adding the hadoop bin path to system path.


 -Rahul Singh


 On Fri, Mar 28, 2014 at 11:32 AM, Azuryy Yu  wrote:

> it was defined at hadoop-config.sh
>
>
>
> On Fri, Mar 28, 2014 at 1:19 PM, divye sheth wrote:
>
>> Which version of hadoop are u using? AFAIK the hadoop mapred home is
>> the directory where hadoop is installed or in other words untarred.
>>
>> Thanks
>> Divye Sheth
>> On Mar 28, 2014 10:43 AM, "Avinash Kujur"  wrote:
>>
>>> hi,
>>>
>>> when i am trying to execute this command:
>>> hadoop job -history ~/1
>>> its giving error like:
>>> DEPRECATED: Use of this script to execute mapred command is
>>> deprecated.
>>> Instead use the mapred command for it.
>>>
>>> HADOOP_MAPRED_HOME not found!
>>>
>>> from where can i get HADOOP_MAPRED_HOME?
>>>
>>> thanks.
>>>
>>
>

>>>
>>
>


Re: HADOOP_MAPRED_HOME not found!

2014-03-28 Thread divye sheth
You can execute this command on any machine where you have set the
HADOOP_MAPRED_HOME

Thanks
Divye Sheth


On Fri, Mar 28, 2014 at 12:31 PM, Avinash Kujur  wrote:

> we can execute the above command anywhere or do i need to execute it in
> any particular directory?
>
> thanks
>
>
> On Thu, Mar 27, 2014 at 11:41 PM, divye sheth wrote:
>
>> I believe you are using Hadoop 2. In order to get the mapred working you
>> need to set the HADOOP_MAPRED_HOME path in either your /etc/profile or
>> .bashrc file or you can use the command given below to temporarily set the
>> variable.
>>
>> export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
>>
>> $HADOOP_INSTALL is the location where the hadoop tar ball is extracted.
>>
>> This should work for you.
>>
>> Thanks
>> Divye Sheth
>>
>>
>>
>> On Fri, Mar 28, 2014 at 11:53 AM, Rahul Singh > > wrote:
>>
>>> Try adding the hadoop bin path to system path.
>>>
>>>
>>> -Rahul Singh
>>>
>>>
>>> On Fri, Mar 28, 2014 at 11:32 AM, Azuryy Yu  wrote:
>>>
 it was defined at hadoop-config.sh



 On Fri, Mar 28, 2014 at 1:19 PM, divye sheth wrote:

> Which version of hadoop are u using? AFAIK the hadoop mapred home is
> the directory where hadoop is installed or in other words untarred.
>
> Thanks
> Divye Sheth
> On Mar 28, 2014 10:43 AM, "Avinash Kujur"  wrote:
>
>> hi,
>>
>> when i am trying to execute this command:
>> hadoop job -history ~/1
>> its giving error like:
>> DEPRECATED: Use of this script to execute mapred command is
>> deprecated.
>> Instead use the mapred command for it.
>>
>> HADOOP_MAPRED_HOME not found!
>>
>> from where can i get HADOOP_MAPRED_HOME?
>>
>> thanks.
>>
>

>>>
>>
>


Re: HADOOP_MAPRED_HOME not found!

2014-03-28 Thread Avinash Kujur
we can execute the above command anywhere or do i need to execute it in any
particular directory?

thanks


On Thu, Mar 27, 2014 at 11:41 PM, divye sheth  wrote:

> I believe you are using Hadoop 2. In order to get the mapred working you
> need to set the HADOOP_MAPRED_HOME path in either your /etc/profile or
> .bashrc file or you can use the command given below to temporarily set the
> variable.
>
> export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
>
> $HADOOP_INSTALL is the location where the hadoop tar ball is extracted.
>
> This should work for you.
>
> Thanks
> Divye Sheth
>
>
>
> On Fri, Mar 28, 2014 at 11:53 AM, Rahul Singh 
> wrote:
>
>> Try adding the hadoop bin path to system path.
>>
>>
>> -Rahul Singh
>>
>>
>> On Fri, Mar 28, 2014 at 11:32 AM, Azuryy Yu  wrote:
>>
>>> it was defined at hadoop-config.sh
>>>
>>>
>>>
>>> On Fri, Mar 28, 2014 at 1:19 PM, divye sheth wrote:
>>>
 Which version of hadoop are u using? AFAIK the hadoop mapred home is
 the directory where hadoop is installed or in other words untarred.

 Thanks
 Divye Sheth
 On Mar 28, 2014 10:43 AM, "Avinash Kujur"  wrote:

> hi,
>
> when i am trying to execute this command:
> hadoop job -history ~/1
> its giving error like:
> DEPRECATED: Use of this script to execute mapred command is deprecated.
> Instead use the mapred command for it.
>
> HADOOP_MAPRED_HOME not found!
>
> from where can i get HADOOP_MAPRED_HOME?
>
> thanks.
>

>>>
>>
>