Re: HADOOP_MAPRED_HOME not found!
It is defined in hadoop-config.sh. On Fri, Mar 28, 2014 at 1:19 PM, divye sheth divs.sh...@gmail.com wrote: Which version of Hadoop are you using? AFAIK the Hadoop mapred home is the directory where Hadoop is installed, or in other words untarred. Thanks Divye Sheth On Mar 28, 2014 10:43 AM, Avinash Kujur avin...@gmail.com wrote: Hi, when I try to execute this command: hadoop job -history ~/1 it gives an error like: DEPRECATED: Use of this script to execute mapred command is deprecated. Instead use the mapred command for it. HADOOP_MAPRED_HOME not found! Where can I get HADOOP_MAPRED_HOME from? Thanks.
Re: HADOOP_MAPRED_HOME not found!
Try adding the hadoop bin path to system path. -Rahul Singh On Fri, Mar 28, 2014 at 11:32 AM, Azuryy Yu azury...@gmail.com wrote: it was defined at hadoop-config.sh
Re: HADOOP_MAPRED_HOME not found!
Can we execute the above command anywhere, or do I need to execute it in a particular directory? Thanks. On Thu, Mar 27, 2014 at 11:41 PM, divye sheth divs.sh...@gmail.com wrote: I believe you are using Hadoop 2. In order to get mapred working you need to set the HADOOP_MAPRED_HOME path in either your /etc/profile or .bashrc file, or you can use the command given below to temporarily set the variable: export HADOOP_MAPRED_HOME=$HADOOP_INSTALL where $HADOOP_INSTALL is the location where the hadoop tarball is extracted. This should work for you. Thanks Divye Sheth
Re: HADOOP_MAPRED_HOME not found!
You can execute this command on any machine where you have set the HADOOP_MAPRED_HOME Thanks Divye Sheth On Fri, Mar 28, 2014 at 12:31 PM, Avinash Kujur avin...@gmail.com wrote: we can execute the above command anywhere or do i need to execute it in any particular directory? thanks
Re: HADOOP_MAPRED_HOME not found!
i am not getting where to set HADOOP_MAPRED_HOME and how to set. thanks On Fri, Mar 28, 2014 at 12:06 AM, divye sheth divs.sh...@gmail.com wrote: You can execute this command on any machine where you have set the HADOOP_MAPRED_HOME Thanks Divye Sheth
Re: How to get locations of blocks programmatically?
Yes, use http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/fs/FileSystem.html#getFileBlockLocations(org.apache.hadoop.fs.Path, long, long) On Fri, Mar 28, 2014 at 7:33 AM, Libo Yu yu_l...@hotmail.com wrote: Hi all, hadoop path fsck -files -block -locations can list locations for all blocks in the path. Is it possible to list all blocks and the block locations for a given path programmatically? Thanks, Libo -- Harsh J
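For illustration, a minimal sketch of that call (the class name is arbitrary, the path argument is a placeholder, and the Configuration is assumed to pick up core-site.xml/hdfs-site.xml from the classpath; the FileStatus overload used here behaves the same as the Path overload linked above):

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path(args[0]);                     // e.g. /user/libo/data.seq (placeholder)
        FileStatus status = fs.getFileStatus(file);
        // One BlockLocation per block in the requested byte range, with the hosts that store it
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                + " length=" + block.getLength()
                + " hosts=" + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}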
Re: mapred job -list error
Please also indicate your exact Hadoop version in use. On Fri, Mar 28, 2014 at 9:04 AM, haihong lu ung3...@gmail.com wrote: dear all: I had a problem today, when i executed the command mapred job -list on a slave, an error came out. show the message as below: 14/03/28 11:18:47 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id 14/03/28 11:18:47 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= Exception in thread main java.lang.NullPointerException at org.apache.hadoop.mapreduce.tools.CLI.listJobs(CLI.java:504) at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:312) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84) at org.apache.hadoop.mapred.JobClient.main(JobClient.java:1237) when i executed the same command yesterday, it was ok. Thanks for any help -- Harsh J
Re: HADOOP_MAPRED_HOME not found!
Hi Avinash, The export command can be executed on any one machine in the cluster for now. Once you have executed the export command, i.e. export HADOOP_MAPRED_HOME=/path/to/your/hadoop/installation, you can then execute the mapred job -list command from that very same machine. Thanks Divye Sheth On Fri, Mar 28, 2014 at 12:57 PM, Avinash Kujur avin...@gmail.com wrote: i am not getting where to set HADOOP_MAPRED_HOME and how to set. thanks
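For reference, a typical way to make this persistent is to append a few lines to ~/.bashrc (or /etc/profile); the install path below is only a placeholder for wherever your Hadoop tarball was extracted:

export HADOOP_INSTALL=/usr/local/hadoop-2.2.0   # placeholder install directory
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export PATH=$PATH:$HADOOP_INSTALL/bin

After editing the file, run source ~/.bashrc (or log in again) so the current shell picks up the change.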
Re: Maps stuck on Pending
There is a good chance that your map output is being copied to your reducer; this can take quite some time if you have a lot of data, and could be resolved by: 1) having more reducers 2) adjusting the slowstart parameter so that the copying can start while the map tasks are still running. Regards, Dieter 2014-03-27 20:42 GMT+01:00 Clay McDonald stuart.mcdon...@bateswhite.com: Thanks Serge, looks like I need to add memory to my datanodes. Clay McDonald Cell: 202.560.4101 Direct: 202.747.5962 -Original Message- From: Serge Blazhievsky [mailto:hadoop...@gmail.com] Sent: Thursday, March 27, 2014 2:16 PM To: user@hadoop.apache.org Cc: user@hadoop.apache.org Subject: Re: Maps stuck on Pending Next step would be to look in the logs under the userlog directory for that job Sent from my iPhone On Mar 27, 2014, at 11:08 AM, Clay McDonald stuart.mcdon...@bateswhite.com wrote: Hi all, I have a job running with 1750 maps and 1 reduce and the status has been the same for the last two hours. Any thoughts? Thanks, Clay
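For reference, the slowstart knob referred to above is the mapred.reduce.slowstart.completed.maps property in Hadoop 1 (mapreduce.job.reduce.slowstart.completedmaps in Hadoop 2); the value below is only an example and goes in mapred-site.xml:

<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.50</value>  <!-- example: reducers (and their copy phase) start once 50% of the maps have finished -->
</property>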
when it's safe to read map-reduce result?
I have a program that does some map-reduce job and then reads the result of the job. I learned that HDFS is not strongly consistent. When is it safe to read the result? As long as output/_SUCCESS exists?
Re: when it's safe to read map-reduce result?
_SUCCESS implies that the job has successfully terminated, so this seems like a reasonable criterion. Regards, Dieter 2014-03-28 9:33 GMT+01:00 Li Li fancye...@gmail.com: I have a program that does some map-reduce job and then reads the result of the job. I learned that HDFS is not strongly consistent. When is it safe to read the result? As long as output/_SUCCESS exists?
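A minimal sketch of that check on the reading side (the class name and output directory are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OutputReady {
    // Returns true once the job's _SUCCESS marker is visible under the output directory.
    public static boolean isReady(String outputDir) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        return fs.exists(new Path(outputDir, "_SUCCESS"));
    }
}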
Re: when it's safe to read map-reduce result?
Thanks. Is the following code safe? int exitCode=ToolRunner.run() if(exitCode==0){ //safe to read result } On Fri, Mar 28, 2014 at 4:36 PM, Dieter De Witte drdwi...@gmail.com wrote: _SUCCESS implies that the job has successfully terminated, so this seems like a reasonable criterion. Regards, Dieter 2014-03-28 9:33 GMT+01:00 Li Li fancye...@gmail.com: I have a program that does some map-reduce job and then reads the result of the job. I learned that HDFS is not strongly consistent. When is it safe to read the result? As long as output/_SUCCESS exists?
How to run data node block scanner on data node in a cluster from a remote machine?
How to run the data node block scanner on a data node in a cluster from a remote machine? By default the data node executes the block scanner every 504 hours. This is the default value of dfs.datanode.scan.period. If I want to run the data node block scanner, one way is to configure the dfs.datanode.scan.period property in hdfs-site.xml, but is there any other way? Is it possible to run the data node block scanner on a data node either through a command or programmatically?
Does hadoop depends on ecc memory to generate checksum for data stored in HDFS
To ensure data I/O integrity, Hadoop uses a CRC-32 mechanism to generate checksums for the data stored on HDFS. But suppose I have a data node machine that does not have ECC (error-correcting code) memory. Will HDFS still be able to generate checksums for data blocks when reads/writes happen? Or, in simple words, does Hadoop depend on ECC memory to generate checksums for data stored in HDFS?
Re: How to run data node block scanner on data node in a cluster from a remote machine?
Hello Reena, No there isn't a programmatic way to invoke the block scanner. Note though that the property to control its period is DN-local, so you can change it on DNs and do a DN rolling restart to make it take effect without requiring a HDFS downtime. On Fri, Mar 28, 2014 at 3:07 PM, reena upadhyay reena2...@outlook.com wrote: How to run data node block scanner on data node in a cluster from a remote machine? By default data node executes block scanner in 504 hours. This is the default value of dfs.datanode.scan.period . If I want to run the data node block scanner then one way is to configure the property of dfs.datanode.scan.period in hdfs-site.xml but is there any other other way. Is it possible to run data node block scanner on data node either through command or pragmatically. -- Harsh J
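For completeness, the full property name is dfs.datanode.scan.period.hours in hdfs-site.xml (the 504-hour default mentioned above, i.e. three weeks). An example entry on each DataNode, with an illustrative value only:

<property>
  <name>dfs.datanode.scan.period.hours</name>
  <value>168</value>  <!-- example: scan each block roughly once a week instead of every three weeks -->
</property>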
Re: Does hadoop depends on ecc memory to generate checksum for data stored in HDFS
While the HDFS functionality of computing, storing and validating checksums for block files does not specifically _require_ ECC, you do _want_ ECC to avoid frequent checksum failures. This is noted in Tom's book as well, in the chapter that discusses setting up your own cluster: ECC memory is strongly recommended, as several Hadoop users have reported seeing many checksum errors when using non-ECC memory on Hadoop clusters. On Fri, Mar 28, 2014 at 3:15 PM, reena upadhyay reena2...@outlook.com wrote: To ensure data I/O integrity, hadoop uses CRC 32 mechanism to generate checksum for the data stored on hdfs . But suppose I have a data node machine that does not have ecc(error correcting code) type of memory, So will hadoop hdfs will be able to generate checksum for data blocks when read/write will happen in hdfs? Or In simple words, Does hadoop depends on ecc memory to generate checksum for data stored in HDFS? -- Harsh J
How check sum are generated for blocks in data node
I was going through this link: http://stackoverflow.com/questions/9406477/data-integrity-in-hdfs-which-data-nodes-verifies-the-checksum . It is written that in recent versions of Hadoop only the last data node verifies the checksum, as the write happens in a pipeline fashion. Now I have a question. Assume my cluster has two data nodes, A and B, and I have a file where half of the content is written on data node A and the other half on data node B to take advantage of parallelism. My question is: will data node A not store the checksums for the blocks stored on it? Going by the statement that only the last data node verifies the checksum, it looks like only the last data node, in my case data node B, will generate the checksum. But if only data node B generates checksums, then it will generate them only for the blocks stored on data node B. What about the checksums for the data blocks on data node machine A?
how to be assignee ?
hi, how can I be the assignee for a particular issue? I can't see any option for becoming the assignee on the page. Thanks.
Re: YarnException: Unauthorized request to start container. This token is expired.
no doubt Sent from my iPhone 6 On Mar 23, 2014, at 17:37, Fengyun RAO raofeng...@gmail.com wrote: What does this exception mean? I googled a lot, all the results tell me it's because the time is not synchronized between datanode and namenode. However, I checked all the servers, that the ntpd service is on, and the time differences are less than 1 second. What's more, the tasks are not always failing on certain datanodes. It fails and then it restarts and succeeds. If it were the time problem, I guess it would always fail. My hadoop version is CDH5 beta. Below is the detailed log: 14/03/23 14:57:06 INFO mapreduce.Job: Running job: job_1394434496930_0032 14/03/23 14:57:17 INFO mapreduce.Job: Job job_1394434496930_0032 running in uber mode : false 14/03/23 14:57:17 INFO mapreduce.Job: map 0% reduce 0% 14/03/23 15:08:01 INFO mapreduce.Job: Task Id : attempt_1394434496930_0032_m_34_0, Status : FAILED Container launch failed for container_1394434496930_0032_01_41 : org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container. This token is expired. current time is 1395558481146 found 1395558443384 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152) at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:155) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:370) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) 14/03/23 15:08:02 INFO mapreduce.Job: map 1% reduce 0% 14/03/23 15:09:36 INFO mapreduce.Job: Task Id : attempt_1394434496930_0032_m_36_0, Status : FAILED Container launch failed for container_1394434496930_0032_01_38 : org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container. This token is expired. 
current time is 1395558575889 found 1395558443245 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152) at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:155) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:370) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724)
Replication HDFS
Hey, I looked at HDFS replication, which is master x slave in the filesystem. Is there any way to do master x master? I have 1 TB of files on one server and I want to replicate them to another server, with real-time sync. Thanks!
Hadoop documentation: control flow and FSM diagrams
Hi All, I have created a wiki on github: https://github.com/ercoppa/HadoopDiagrams/wiki This is an effort to provide updated documentation of how the internals of Hadoop work. The main idea is to help the user understand the big picture without removing too many internal details. You can find several diagrams (e.g. finite state machine and control flow). They are based on Hadoop 2.3.0. Notice that: - they are not specified in any formal language (e.g., UML) but they should be easy to understand (do you agree?) - they cover only some aspects of Hadoop, but I am improving them day after day - they are not always correct, but I am trying to fix errors, remove ambiguities, etc. I hope this can be helpful to somebody out there. Any feedback from you may be valuable for me. Emilio.
RE: R on hadoop
If you're spitballing options you might also look at Pattern: http://www.cascading.org/projects/pattern/ It has some nuances, so be sure to spend the time to vet your specific use case (i.e. what you're actually doing in R and what you want to accomplish leveraging data in Hadoop). From: Sri [mailto:hadoop...@gmail.com] Sent: Thursday, March 27, 2014 2:51 AM To: user@hadoop.apache.org Cc: user@hadoop.apache.org Subject: Re: R on hadoop Try the open-source h2o.ai - a CRAN-style package that allows fast, scalable R on Hadoop in memory. One can invoke single-threaded R from the h2o package, and the runtime on the clusters is Java (not R!) - so you get better memory management. http://docs.0xdata.com/deployment/hadoop.html http://docs.0xdata.com/Ruser/Rpackage.html Sri On Mar 26, 2014, at 6:53, Saravanan Nagarajan saravanan.nagarajan...@gmail.com wrote: Hi Jay, Below is my understanding of the Hadoop+R environment. 1. R contains many data mining algorithms; to re-use these we have many tools like RHIPE, RHadoop, etc. 2. These tools will convert the R algorithm and run it in Hadoop MapReduce using RMR, but I am not sure whether it will work for all algorithms in R. Please let me know if you have any other points. Thanks, Saravanan linkedin.com/in/saravanan303 On Wed, Mar 26, 2014 at 5:35 PM, Jay Vyas jayunit...@gmail.com wrote: Do you mean (1) running mapreduce jobs from R? (2) Running R from a mapreduce job? Without much extra ceremony, for the latter, you could use either MapReduce streaming or pig to call a custom program, as long as R is installed on every node of the cluster itself. On Wed, Mar 26, 2014 at 6:39 AM, Saravanan Nagarajan saravanan.nagarajan...@gmail.com wrote: Hi Siddharth, You can try the Big Data Analytics with R and Hadoop book; it gives many options and detailed steps to integrate Hadoop and R. If you need this book then mail me at saravanan.nagarajan...@gmail.com. Thanks, Saravanan linkedin.com/in/saravanan303 On Tue, Mar 25, 2014 at 2:04 AM, Jagat Singh jagatsi...@gmail.com wrote: Hi, Please see RHadoop and RMR https://www.google.com.au/search?q=rhadoop+installation Thanks, Jagat Singh On Tue, Mar 25, 2014 at 7:19 AM, Siddharth Tiwari siddharth.tiw...@live.com wrote: Hi team, any documentation around installing R on hadoop? Sent from my iPhone -- Jay Vyas http://jayunit100.blogspot.com
Re: Replication HDFS
You mean replication between two different hadoop cluster or you just need data to be replicated between two different nodes? Sent from my iPhone On Mar 28, 2014, at 8:10 AM, Victor Belizário victor_beliza...@hotmail.com wrote: Hey, I did look in HDFS for replication in filesystem master x slave. Have any way to do master x master? I just have 1 TB of files in a server and i want to replicate to another server, in real time sync. Thanks !
Re: Why is HDFS_BYTES_WRITTEN is much larger than HDFS_BYTES_READ in this case?
What is your compression format: gzip, lzo or snappy? For lzo final output: FileOutputFormat.setCompressOutput(conf, true); FileOutputFormat.setOutputCompressorClass(conf, LzoCodec.class); In addition, to make LZO splittable, you need to create an LZO index file. On Thu, Mar 27, 2014 at 8:57 PM, Kim Chew kchew...@gmail.com wrote: Thanks folks. I was not aware my input data file had been compressed. FileOutputFormat.setCompressOutput() is set to true when the file is written. 8-( Kim On Thu, Mar 27, 2014 at 5:46 PM, Mostafa Ead mostafa.g@gmail.com wrote: The following might answer you partially: the input key is not read from HDFS, it is auto generated as the offset of the input value in the input file. I think that is (partially) why read hdfs bytes is smaller than written hdfs bytes. On Mar 27, 2014 1:34 PM, Kim Chew kchew...@gmail.com wrote: I am also wondering if, say, I have two identical timestamps, so they are going to be written to the same file. Does MultipleOutputs handle appending? Thanks. Kim On Thu, Mar 27, 2014 at 12:30 PM, Thomas Bentsen t...@bentzn.com wrote: Have you checked the content of the files you write? /th On Thu, 2014-03-27 at 11:43 -0700, Kim Chew wrote: I have a simple M/R job using Mapper only, thus no reducer. The mapper reads a timestamp from the value, generates a path to the output file and writes the key and value to the output file. The input file is a sequence file, not compressed and stored in HDFS; it has a size of 162.68 MB. The output is also written as a sequence file. However, after I ran my job, I have two output part files from the mapper. One has a size of 835.12 MB and the other has a size of 224.77 MB. So why is the total output size so much larger? Shouldn't it be more or less equal to the input's size of 162.68 MB since I just write the key and value passed to the mapper to the output? Here is the mapper code snippet:
public void map(BytesWritable key, BytesWritable value, Context context) throws IOException, InterruptedException {
    long timestamp = bytesToInt(value.getBytes(), TIMESTAMP_INDEX);
    String tsStr = sdf.format(new Date(timestamp * 1000L));
    mos.write(key, value, generateFileName(tsStr)); // mos is a MultipleOutputs object.
}

private String generateFileName(String key) {
    return outputDir + "/" + key + "/raw-vectors";
}

And here are the job outputs,
14/03/27 11:00:56 INFO mapred.JobClient: Launched map tasks=2
14/03/27 11:00:56 INFO mapred.JobClient: Data-local map tasks=2
14/03/27 11:00:56 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
14/03/27 11:00:56 INFO mapred.JobClient: File Output Format Counters
14/03/27 11:00:56 INFO mapred.JobClient: Bytes Written=0
14/03/27 11:00:56 INFO mapred.JobClient: FileSystemCounters
14/03/27 11:00:56 INFO mapred.JobClient: HDFS_BYTES_READ=171086386
14/03/27 11:00:56 INFO mapred.JobClient: FILE_BYTES_WRITTEN=54272
14/03/27 11:00:56 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=374798
14/03/27 11:00:56 INFO mapred.JobClient: File Input Format Counters
14/03/27 11:00:56 INFO mapred.JobClient: Bytes Read=170782415
14/03/27 11:00:56 INFO mapred.JobClient: Map-Reduce Framework
14/03/27 11:00:56 INFO mapred.JobClient: Map input records=547
14/03/27 11:00:56 INFO mapred.JobClient: Physical memory (bytes) snapshot=166428672
14/03/27 11:00:56 INFO mapred.JobClient: Spilled Records=0
14/03/27 11:00:56 INFO mapred.JobClient: Total committed heap usage (bytes)=38351872
14/03/27 11:00:56 INFO mapred.JobClient: CPU time spent (ms)=20080
14/03/27 11:00:56 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1240104960
14/03/27 11:00:56 INFO mapred.JobClient: SPLIT_RAW_BYTES=286
14/03/27 11:00:56 INFO mapred.JobClient: Map output records=0
TIA, Kim
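Following up on the compression suggestion at the top of this message: if the goal is simply to keep the output size comparable to the (compressed) input, the output sequence files can be compressed as well. A minimal sketch using the old mapred API and the built-in GzipCodec (the codec choice and the job class passed in are just examples, not the poster's actual setup):

import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class CompressedOutputConfig {
    public static JobConf configure(Class<?> jobClass) {
        JobConf conf = new JobConf(jobClass);
        // Compress the job's output files with gzip, using block compression for sequence files
        FileOutputFormat.setCompressOutput(conf, true);
        FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(conf, SequenceFile.CompressionType.BLOCK);
        return conf;
    }
}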
Re: How to get locations of blocks programmatically?
Have you looked into the FileSystem API? This is hadoop v2.2.0: http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/fs/FileSystem.html (it does not exist in http://hadoop.apache.org/docs/r1.2.0/api/org/apache/hadoop/fs/FileSystem.html)
RemoteIterator<LocatedFileStatus> listFiles(Path f, boolean recursive) - List the statuses and block locations of the files in the given path.
RemoteIterator<LocatedFileStatus> listLocatedStatus(Path f) - List the statuses of the files/directories in the given path if the path is a directory.
On Thu, Mar 27, 2014 at 10:03 PM, Libo Yu yu_l...@hotmail.com wrote: Hi all, hadoop path fsck -files -block -locations can list locations for all blocks in the path. Is it possible to list all blocks and the block locations for a given path programmatically? Thanks, Libo
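A short sketch of the listFiles variant mentioned above (the class name is arbitrary and the path argument is a placeholder):

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path(args[0]), true);  // true = recurse
        while (it.hasNext()) {
            LocatedFileStatus status = it.next();
            // Each LocatedFileStatus already carries the block locations for that file
            for (BlockLocation block : status.getBlockLocations()) {
                System.out.println(status.getPath() + " offset=" + block.getOffset()
                    + " hosts=" + Arrays.toString(block.getHosts()));
            }
        }
    }
}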
Re: reducing HDFS FS connection timeouts
How about adding ipc.client.connect.max.retries.on.timeouts = 2 (the default is 45)? It indicates the number of retries a client will make on socket timeout to establish a server connection. Does that help? On Thu, Mar 27, 2014 at 4:23 PM, John Lilley john.lil...@redpoint.net wrote: It seems to take a very long time to timeout a connection to an invalid NN URI. Our application is interactive so the defaults of taking many minutes don't work well. I've tried setting: conf.set(ipc.client.connect.max.retries, 2); conf.set(ipc.client.connect.timeout, 7000); before calling FileSystem.get() but it doesn't seem to matter. What is the prescribed technique for lowering connection timeout to HDFS? Thanks john
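A minimal sketch of setting those keys programmatically before FileSystem.get() (property names as in Hadoop 2; the class name and URI are placeholders):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class LowTimeoutFs {
    public static FileSystem open(String uri) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("ipc.client.connect.max.retries.on.timeouts", 2);  // retries after socket timeouts (default 45)
        conf.setInt("ipc.client.connect.timeout", 7000);               // connect timeout in milliseconds
        return FileSystem.get(URI.create(uri), conf);                  // e.g. "hdfs://namenode:8020" (placeholder)
    }
}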
Re: how to be assignee ?
Hi Avinash, You need to be added as a contributor on the sub-project; then you can be an assignee. You can find how to become a contributor on the wiki. On Fri, Mar 28, 2014 at 6:50 PM, Avinash Kujur avin...@gmail.com wrote: hi, how can I be the assignee for a particular issue? I can't see any option for becoming the assignee on the page. Thanks.
Re: Hadoop documentation: control flow and FSM diagrams
Very helpful indeed Emilio, thanks! On Fri, Mar 28, 2014 at 12:58 PM, Emilio Coppa erco...@gmail.com wrote: Hi All, I have created a wiki on github: https://github.com/ercoppa/HadoopDiagrams/wiki This is an effort to provide updated documentation of how the internals of Hadoop work. The main idea is to help the user understand the big picture without removing too many internal details. You can find several diagrams (e.g. finite state machine and control flow). They are based on Hadoop 2.3.0. Notice that: - they are not specified in any formal language (e.g., UML) but they should be easy to understand (do you agree?) - they cover only some aspects of Hadoop, but I am improving them day after day - they are not always correct, but I am trying to fix errors, remove ambiguities, etc. I hope this can be helpful to somebody out there. Any feedback from you may be valuable for me. Emilio.
Re: when it's safe to read map-reduce result?
If the job completes without any failures, the exit code will be 0 and it is safe to read the result.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyApp extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        // Configuration processed by ToolRunner
        Configuration conf = getConf();
        // Create a JobConf using the processed conf
        JobConf job = new JobConf(conf, MyApp.class);
        // Process custom command-line options
        Path in = new Path(args[1]);
        Path out = new Path(args[2]);
        // Specify various job-specific parameters
        job.setJobName("my-app");
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);
        job.setMapperClass(MyMapper.class);      // MyMapper/MyReducer are your own job classes
        job.setReducerClass(MyReducer.class);
        // Submit the job, then poll for progress until the job is complete
        // (runJob throws an exception if the job fails, so returning 0 means success)
        JobClient.runJob(job);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // Let ToolRunner handle generic command-line options
        int res = ToolRunner.run(new Configuration(), new MyApp(), args);
        System.exit(res);
    }
}

On Fri, Mar 28, 2014 at 4:41 AM, Li Li fancye...@gmail.com wrote: thanks. is the following codes safe? int exitCode=ToolRunner.run() if(exitCode==0){ //safe to read result }
Re: Replication HDFS
Hi Victor, if by replication you mean copying from one cluster to the other, you can use the distcp command. Cheers. On 28 Mar 2014, at 16:30, Serge Blazhievsky hadoop...@gmail.com wrote: You mean replication between two different hadoop clusters, or do you just need data to be replicated between two different nodes? Sent from my iPhone On Mar 28, 2014, at 8:10 AM, Victor Belizário victor_beliza...@hotmail.com wrote: Hey, I did look in HDFS for replication in filesystem master x slave. Have any way to do master x master? I just have 1 TB of files in a server and i want to replicate to another server, in real time sync. Thanks !
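For reference, a typical distcp invocation between two clusters looks like the following (the NameNode hostnames, ports and paths are placeholders); note that distcp runs as a batch MapReduce copy you schedule yourself, e.g. from cron, rather than a real-time sync:

hadoop distcp hdfs://nn1:8020/data hdfs://nn2:8020/data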
Re: How check sum are generated for blocks in data node
Hi Reena, the pipeline is per block. If you have half of your file on data node A only, that means the pipeline had only one node (node A in this case, probably because the replication factor is set to 1), and then data node A has the checksums for its blocks. The same applies to data node B. All nodes have checksums for the blocks they own. The checksums are passed together with the block as it goes through the pipeline, but since the last node in the pipeline receives the original checksums along with the block from the previous nodes, the validation only needs to happen on this last one: if it passes there, it means the block was not corrupted on any of the previous nodes either. Cheers. On 28 Mar 2014, at 10:28, reena upadhyay reena2...@outlook.com wrote: I was going through this link http://stackoverflow.com/questions/9406477/data-integrity-in-hdfs-which-data-nodes-verifies-the-checksum . Its written that in recent version of hadoop only the last data node verifies the checksum as the write happens in a pipeline fashion. Now I have a question: Assuming my cluster has two data nodes A and B cluster, I have a file, half of the file content is written on first data node A and the other remaining half is written on the second data node B to take advantage of parallelism. My question is: Will data node A will not store the check sum for the blocks stored on it. As per the line only the last data node verifies the checksum, it looks like only the last data node in my case it will be data node B, will generate the checksum. But if only data node B generates checksum, then it will generate the check sum only for the blocks stored on data node B. What about the checksum for the data blocks on data node machine A?
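As a side note, recent Hadoop 2 releases can also print a file's stored checksum from the shell (the path below is a placeholder); the value returned is rolled up from the per-block CRCs:

hadoop fs -checksum /path/to/file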
How to find generated mapreduce code for pig/hive query
hello experts, I am really new to hadoop - is it possible, given a pig or hive query, to find out the under-the-hood map reduce algorithm? thanks
Re: How to find generated mapreduce code for pig/hive query
You can use ILLUSTRATE and EXPLAIN commands to see the execution plan, if you mean that by 'under the hood algorithm' http://pig.apache.org/docs/r0.11.1/test.html Regards, Shahab On Fri, Mar 28, 2014 at 5:51 PM, Spark Storm using.had...@gmail.com wrote: hello experts, am really new to hadoop - Is it possible to find out based on pig or hive query to find out under the hood map reduce algorithm?? thanks
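For example (the relation and table names are hypothetical): in the Pig grunt shell, EXPLAIN A; prints the logical, physical and MapReduce plans for relation A, and ILLUSTRATE A; walks sample data through them; Hive has an analogous EXPLAIN SELECT count(*) FROM my_table; that prints the stages the query compiles to.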
Re: Need help get the hadoop cluster started in EC2
Hi Max, Not sure if you have already, but you might also want to look into Apache Ambari [1] for provisioning, managing, and monitoring Hadoop clusters. Many have successfully deployed Hadoop clusters on EC2 using Ambari. [1] http://ambari.apache.org/ Yusaku On Fri, Mar 28, 2014 at 7:07 PM, Max Zhao gz123forhad...@gmail.com wrote: Hi Everybody, I am trying to get my first hadoop cluster started using the Amazon EC2. I tried quite a few times and searched the web for the solutions, yet I still cannot get it up. I hope somebody can help out here. Here is what I did based on the Apache Whirr Quick Guide (http://whirr.apache.org/docs/0.8.1/quick-start-guide.html): 1) I downloaded a Whirr tar ball and installed it. bin/whirr version shows the following: Apache Whirr 0.8.2jclouds 1.5.8 2) I created the ./whirr directory and edit the credential file with my Amazon PROVIDER, IDENTITY and CREDENTIAL IDENTITY=AAS, with no extra quotes or curly quotes around the actual key_id 3) I used the following command to creat the key pair for whirr and stored it at the folder .ssh ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr 4) I think I am ready to use one of the properties file provided with whirr in the recipes folder. Here is the command I ran: bin/whirr launch-cluster --config recipes/hadoop-yarn-ec2.properties --private-key-file ~/.ssh/id_rsa_whi The command ran into the error and did not bring up the hadoop. My questin is: Do we need to change anything the default properties provided in the recipes folder in the whirr-0.8.2 folder, such as the hadoop-yarn-ec2.properties I used? Here are the error messages: --- [ec2-user@ip-172-31-20-120 whirr-0.8.2]$ bin/whirr launch-cluster --config recipes/hadoop-yarn-ec2.properties --private-key-file ~/.ssh/id_rsa_whirr Running on provider aws-ec2 using identity AKIAJLFVRARQ3IZE3KGF Unable to start the cluster. Terminating all nodes. com.google.common.util.concurrent.UncheckedExecutionException: com.google.inject.CreationException: Guice creation errors: 1) org.jclouds.rest.RestContextorg.jclouds.aws.ec2.AWSEC2Client, A cannot be used as a key; It is not fully specified. 1 error at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2258) at com.google.common.cache.LocalCache.get(LocalCache.java:3990) at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3994) at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4878) at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4884) at org.apache.whirr.service.ComputeCache.apply(ComputeCache.java:88) at org.apache.whirr.service.ComputeCache.apply(ComputeCache.java:80) at org.apache.whirr.actions.ScriptBasedClusterAction.execute(ScriptBasedClusterAction.java:110) at org.apache.whirr.ClusterController.bootstrapCluster(ClusterController.java:137) at org.apache.whirr.ClusterController.launchCluster(ClusterController.java:113) at org.apache.whirr.cli.command.LaunchClusterCommand.run(LaunchClusterCommand.java:69) at org.apache.whirr.cli.command.LaunchClusterCommand.run(LaunchClusterCommand.java:59) at org.apache.whirr.cli.Main.run(Main.java:69) at org.apache.whirr.cli.Main.main(Main.java:102) Caused by: com.google.inject.CreationException: Guice creation errors: 1) org.jclouds.rest.RestContextorg.jclouds.aws.ec2.AWSEC2Client, A cannot be used as a key; It is not fully specified. 
1 error at com.google.inject.internal.Errors.throwCreationExceptionIfErrorsExist(Errors.java:435) at com.google.inject.internal.InternalInjectorCreator.initializeStatically(InternalInjectorCreator.java:154) at com.google.inject.internal.InternalInjectorCreator.build(InternalInjectorCreator.java:106) at com.google.inject.Guice.createInjector(Guice.java:95) at org.jclouds.ContextBuilder.buildInjector(ContextBuilder.java:401) at org.jclouds.ContextBuilder.buildInjector(ContextBuilder.java:325) at org.jclouds.ContextBuilder.buildView(ContextBuilder.java:600) at org.jclouds.ContextBuilder.buildView(ContextBuilder.java:580) at org.apache.whirr.service.ComputeCache$1.load(ComputeCache.java:119) at org.apache.whirr.service.ComputeCache$1.load(ComputeCache.java:98) at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3589) at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2374) at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2337) at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2252) ... 13 more Unable to load cluster state, assuming it has no