Re: Deduplication Effort in Hadoop
Hi, I guess by "system" you meant HDFS. In that case HBase might help. HBase needs to have unique keys. They are just bytes, so I guess you can just concatenate multiple columns in your primary key ( if you have a primary key spanning >1 column) to have a key for HBase, so that duplicates dont exist. So, data can be stored in HBase rather than in files and everything else is still the same. I dont know about Hive though. Thanks, Krishnakumar. On Jul 14, 2011, at 9:18 AM, Michael Segel wrote: > You don't have dupes because the key has to be unique. > > > > Sent from my Palm Pre on AT&T > On Jul 14, 2011 11:00 AM, jonathan.hw...@accenture.com >wrote: > > Hi All, > > In databases you can be able to define primary keys to ensure no duplicate > data get loaded into the system. Let say I have a lot of 1 billion records > flowing into my system everyday and some of these are repeated data (Same > records). I can use 2-3 columns in the record to match and look for > duplicates. What is the best strategy of de-duplication? The duplicated > records should only appear within the last 2 weeks.I want a fast way to > get the data into the system without much delay. Anyway HBase or Hive can > help? > > > > Thanks! > > Jonathan > > > > > > This message is for the designated recipient only and may contain privileged, > proprietary, or otherwise private information. If you have received it in > error, please notify the sender immediately and delete the original. Any > other use of the email by you is prohibited. > > > >
Command Line Arguments for Client
Hi, Could anyone tell me how to set the command-line arguments (like -Xmx and -Xms) for the client (not for the map/reduce tasks) from the command that is usually used to launch the job? Thanks, Krishnakumar
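One common way to do this, assuming the stock bin/hadoop script from the 0.20 line: the script picks up client-side JVM flags from the HADOOP_CLIENT_OPTS environment variable (HADOOP_OPTS also works, but applies to every command the script launches), e.g. HADOOP_CLIENT_OPTS="-Xmx512m -Xms256m" bin/hadoop jar myjob.jar ... This affects only the submitting JVM; task heap sizes are governed separately by mapred.child.java.opts.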
Re: libjars options
Hi, Thanks a lot, Alex! Using GenericOptionsParser solved the issue. Previously I had used Tool and had assumed that it would take care of this. Regards, Krishna. On Jan 11, 2011, at 12:48 PM, Alex Kozlov wrote: > There is also a blog that I recently wrote, if it helps > http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job > > On Tue, Jan 11, 2011 at 12:33 PM, Alex Kozlov wrote: > >> Have you implemented GenericOptionsParser? Do you see your jar in the >> *mapred.cache.files* or *tmpjars* parameter in your job.xml file (you can view it >> via the JT Web UI)? >> >> -- >> Alex Kozlov >> Solutions Architect >> Cloudera, Inc >> twitter: alexvk2009 >> <http://www.cloudera.com/company/press-center/hadoop-world-nyc/> >> >> >> On Tue, Jan 11, 2011 at 11:49 AM, C.V.Krishnakumar Iyer < >> f2004...@gmail.com> wrote: >> >>> Hi, >>> >>> I have tried that as well, using the -files option. But it still gives the >>> exact same error. Is there anything else I could try? >>> >>> Thanks, >>> Krishna. >>> >>> On Jan 11, 2011, at 10:23 AM, Ted Yu wrote: >>> >>>> Refer to Alex Kozlov's answer on 12/11/10 >>>> >>>> On Tue, Jan 11, 2011 at 10:10 AM, C.V.Krishnakumar Iyer >>>> wrote: >>>>> Hi, >>>>> >>>>> Could anyone please guide me as to how to use the -libjars option in >>> Hadoop? >>>>> >>>>> I have added the necessary jar file (the hbase jar, to be precise) to >>> the >>>>> classpath of the node where I am starting the job. >>>>> >>>>> The following is the format that I am invoking: >>>>> bin/hadoop jar <jar file> <main class> -libjars <jar files (separated by >>>>> commas)> >>>>> >>>>> bin/hadoop jar /Users/hdp/cvk/myjob.jar mr2.mr2a.MR2ADriver -libjars >>>>> /Users/hdp/hadoop/lib/hbase-0.20.6.jar inputmr2a outputmr2a >>>>> >>>>> Despite this, I find that I get the java.lang.ClassNotFoundException >>>>> error! :( >>>>> java.lang.RuntimeException: java.lang.RuntimeException: >>>>> java.lang.ClassNotFoundException: >>>>> org.apache.hadoop.hbase.io.ImmutableBytesWritable >>>>> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:841) >>>>> at org.apache.hadoop.mapred.JobConf.getMapOutputValueClass(JobConf.java:551) >>>>> at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:793) >>>>> at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:524) >>>>> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613) >>>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) >>>>> at org.apache.hadoop.mapred.Child.main(Child.java:170) >>>>> Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: >>>>> org.apache.hadoop.hbase.io.ImmutableBytesWritable >>>>> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:809) >>>>> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:833) >>>>> >>>>> The strange thing is that there is another MR job I have that runs >>>>> perfectly with the -libjars option! Could anybody tell me what I am >>> doing >>>>> wrong? One more thing, not sure if it is relevant: I am using the new >>>>> Hadoop MapReduce API. >>>>> >>>>> Thanks in advance! >>>>> >>>>> Regards, >>>>> Krishnakumar.
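For reference, a minimal sketch of the driver shape that makes this work, assuming the new-API Job class; everything except the option-parsing plumbing is elided, and the class name is simply borrowed from the thread above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MR2ADriver extends Configured implements Tool {
        public int run(String[] args) throws Exception {
            // getConf() returns the Configuration that GenericOptionsParser
            // has already populated from -libjars/-files/-D arguments.
            Job job = new Job(getConf(), "mr2a");
            job.setJarByClass(MR2ADriver.class);
            // ... mapper/reducer/input/output setup elided ...
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            // ToolRunner runs GenericOptionsParser before calling run();
            // implementing Tool alone, without this entry point, is not enough.
            System.exit(ToolRunner.run(new Configuration(), new MR2ADriver(), args));
        }
    }

With that entry point in place, the invocation from the thread (bin/hadoop jar myjob.jar mr2.mr2a.MR2ADriver -libjars /Users/hdp/hadoop/lib/hbase-0.20.6.jar inputmr2a outputmr2a) should ship the HBase jar to the tasks via the distributed cache.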
Re: libjars options
Hi, Thanks a lot! I shall try this once and let you know! Regards, Krishna. On Jan 11, 2011, at 12:48 PM, Alex Kozlov wrote: > There is also a blog that I recently wrote, if it helps > http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job > > On Tue, Jan 11, 2011 at 12:33 PM, Alex Kozlov wrote: > >> Have you implemented GenericOptionsParser? Do you see your jar in the >> *mapred.cache.files* or *tmpjars* parameter in your job.xml file (you can view it >> via the JT Web UI)? >> >> -- >> Alex Kozlov >> Solutions Architect >> Cloudera, Inc >> twitter: alexvk2009 >> <http://www.cloudera.com/company/press-center/hadoop-world-nyc/> >> >> >> On Tue, Jan 11, 2011 at 11:49 AM, C.V.Krishnakumar Iyer < >> f2004...@gmail.com> wrote: >> >>> Hi, >>> >>> I have tried that as well, using the -files option. But it still gives the >>> exact same error. Is there anything else I could try? >>> >>> Thanks, >>> Krishna. >>> >>> On Jan 11, 2011, at 10:23 AM, Ted Yu wrote: >>> >>>> Refer to Alex Kozlov's answer on 12/11/10 >>>> >>>> On Tue, Jan 11, 2011 at 10:10 AM, C.V.Krishnakumar Iyer >>>> wrote: >>>>> Hi, >>>>> >>>>> Could anyone please guide me as to how to use the -libjars option in >>> Hadoop? >>>>> >>>>> I have added the necessary jar file (the hbase jar, to be precise) to >>> the >>>>> classpath of the node where I am starting the job. >>>>> >>>>> The following is the format that I am invoking: >>>>> bin/hadoop jar <jar file> <main class> -libjars <jar files (separated by >>>>> commas)> >>>>> >>>>> bin/hadoop jar /Users/hdp/cvk/myjob.jar mr2.mr2a.MR2ADriver -libjars >>>>> /Users/hdp/hadoop/lib/hbase-0.20.6.jar inputmr2a outputmr2a >>>>> >>>>> Despite this, I find that I get the java.lang.ClassNotFoundException >>>>> error! :( >>>>> java.lang.RuntimeException: java.lang.RuntimeException: >>>>> java.lang.ClassNotFoundException: >>>>> org.apache.hadoop.hbase.io.ImmutableBytesWritable >>>>> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:841) >>>>> at org.apache.hadoop.mapred.JobConf.getMapOutputValueClass(JobConf.java:551) >>>>> at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:793) >>>>> at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:524) >>>>> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613) >>>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) >>>>> at org.apache.hadoop.mapred.Child.main(Child.java:170) >>>>> Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: >>>>> org.apache.hadoop.hbase.io.ImmutableBytesWritable >>>>> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:809) >>>>> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:833) >>>>> >>>>> The strange thing is that there is another MR job I have that runs >>>>> perfectly with the -libjars option! Could anybody tell me what I am >>> doing >>>>> wrong? One more thing, not sure if it is relevant: I am using the new >>>>> Hadoop MapReduce API. >>>>> >>>>> Thanks in advance! >>>>> >>>>> Regards, >>>>> Krishnakumar.
Re: libjars options
Hi, I have tried that as well, using the -files option. But it still gives the exact same error. Is there anything else I could try? Thanks, Krishna. On Jan 11, 2011, at 10:23 AM, Ted Yu wrote: > Refer to Alex Kozlov's answer on 12/11/10 > > On Tue, Jan 11, 2011 at 10:10 AM, C.V.Krishnakumar Iyer > wrote: > >> Hi, >> >> Could anyone please guide me as to how to use the -libjars option in Hadoop? >> >> I have added the necessary jar file (the hbase jar, to be precise) to the >> classpath of the node where I am starting the job. >> >> The following is the format that I am invoking: >> bin/hadoop jar <jar file> <main class> -libjars <jar files (separated by >> commas)> >> >> bin/hadoop jar /Users/hdp/cvk/myjob.jar mr2.mr2a.MR2ADriver -libjars >> /Users/hdp/hadoop/lib/hbase-0.20.6.jar inputmr2a outputmr2a >> >> Despite this, I find that I get the java.lang.ClassNotFoundException >> error! :( >> java.lang.RuntimeException: java.lang.RuntimeException: >> java.lang.ClassNotFoundException: >> org.apache.hadoop.hbase.io.ImmutableBytesWritable >> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:841) >> at org.apache.hadoop.mapred.JobConf.getMapOutputValueClass(JobConf.java:551) >> at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:793) >> at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:524) >> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613) >> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) >> at org.apache.hadoop.mapred.Child.main(Child.java:170) >> Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: >> org.apache.hadoop.hbase.io.ImmutableBytesWritable >> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:809) >> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:833) >> >> The strange thing is that there is another MR job I have that runs >> perfectly with the -libjars option! Could anybody tell me what I am doing >> wrong? One more thing, not sure if it is relevant: I am using the new >> Hadoop MapReduce API. >> >> Thanks in advance! >> >> Regards, >> Krishnakumar.
libjars options
Hi, Could anyone please guide me as to how to use the -libjars option in Hadoop? I have added the necessary jar file (the hbase jar, to be precise) to the classpath of the node where I am starting the job. The following is the format that I am invoking: bin/hadoop jar <jar file> <main class> -libjars <jar files (separated by commas)> bin/hadoop jar /Users/hdp/cvk/myjob.jar mr2.mr2a.MR2ADriver -libjars /Users/hdp/hadoop/lib/hbase-0.20.6.jar inputmr2a outputmr2a Despite this, I find that I get the java.lang.ClassNotFoundException error! :( java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.io.ImmutableBytesWritable at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:841) at org.apache.hadoop.mapred.JobConf.getMapOutputValueClass(JobConf.java:551) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:793) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:524) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at org.apache.hadoop.mapred.Child.main(Child.java:170) Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.io.ImmutableBytesWritable at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:809) at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:833) The strange thing is that there is another MR job I have that runs perfectly with the -libjars option! Could anybody tell me what I am doing wrong? One more thing, not sure if it is relevant: I am using the new Hadoop MapReduce API. Thanks in advance! Regards, Krishnakumar.
Re: IOException in TaskRunner (Error Code :134)
Hi, Thanks a lot! I just removed the elements in the cache and the incomplete blocks, and it worked. Regards, Krishnakumar. On Sep 21, 2010, at 12:54 PM, Allen Wittenauer wrote: > > It is a "not enough information" error. > > Check the task, jobtracker, tasktracker, datanode, and namenode logs. > > On Sep 21, 2010, at 12:30 PM, C.V.Krishnakumar wrote: > >> >> Hi, >> Just wanted to know if anyone has any idea about this one? This happens >> every time I run a job. >> Is this issue hardware related? >> >> Thanks in advance, >> Krishnakumar. >> >> Begin forwarded message: >> >>> From: "C.V.Krishnakumar" >>> Date: September 17, 2010 1:32:49 PM PDT >>> To: common-user@hadoop.apache.org >>> Subject: Tasks Failing : IOException in TaskRunner (Error Code :134) >>> Reply-To: common-user@hadoop.apache.org >>> >>> Hi all, >>> >>> I am facing a problem with the TaskRunner. I have a small Hadoop cluster >>> in fully distributed mode. However, when I submit a job, it never >>> seems to proceed beyond the "map 0% reduce 0%" stage. Soon after, I get this >>> error: >>> >>> java.io.IOException: Task process exit with nonzero status of 134. at >>> org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418) >>> >>> When I googled this issue, I found this link >>> http://markmail.org/message/lvefqbcboaqzfazt but did not completely >>> understand the issue. The cluster, in this same configuration, was working >>> fine a few days ago. So I am confused as to what could have changed to >>> cause this error. >>> >>> Have any of you faced similar problems? I would be really grateful if >>> you could let me know whether I am missing something very obvious. >>> >>> Thanks a lot! >>> >>> Regards, >>> Krishna.
IOException in TaskRunner (Error Code :134)
Hi, Just wanted to know if anyone has any idea about this one? It happens every time I run a job. Is this issue hardware related? Thanks in advance, Krishnakumar. Begin forwarded message: > From: "C.V.Krishnakumar" > Date: September 17, 2010 1:32:49 PM PDT > To: common-user@hadoop.apache.org > Subject: Tasks Failing : IOException in TaskRunner (Error Code :134) > Reply-To: common-user@hadoop.apache.org > > Hi all, > > I am facing a problem with the TaskRunner. I have a small Hadoop cluster in > fully distributed mode. However, when I submit a job, it never seems > to proceed beyond the "map 0% reduce 0%" stage. Soon after, I get this error: > > java.io.IOException: Task process exit with nonzero status of 134. at > org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418) > > When I googled this issue, I found this link > http://markmail.org/message/lvefqbcboaqzfazt but did not completely > understand the issue. The cluster, in this same configuration, was working > fine a few days ago. So I am confused as to what could have changed to > cause this error. > > Have any of you faced similar problems? I would be really grateful if you > could let me know whether I am missing something very obvious. > > Thanks a lot! > > Regards, > Krishna. >
Tasks Failing : IOException in TaskRunner (Error Code :134)
Hi all, I am facing a problem with the TaskRunner. I have a small Hadoop cluster in fully distributed mode. However, when I submit a job, it never seems to proceed beyond the "map 0% reduce 0%" stage. Soon after, I get this error: java.io.IOException: Task process exit with nonzero status of 134. at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418) When I googled this issue, I found this link http://markmail.org/message/lvefqbcboaqzfazt but did not completely understand the issue. The cluster, in this same configuration, was working fine a few days ago. So I am confused as to what could have changed to cause this error. Have any of you faced similar problems? I would be really grateful if you could let me know whether I am missing something very obvious. Thanks a lot! Regards, Krishna.
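A note on the exit status itself: 134 = 128 + 6, i.e. the child JVM was killed by SIGABRT rather than failing with a Java-level exception — typically a JVM crash or a native abort. Besides the logs mentioned above, the failing task's stdout/stderr and any hs_err_pid*.log files left in the task's working directory usually hold the abort reason.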
Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
Hi Deepak, Maybe I did not make my mail clear. I had tried the instructions in the blog you mentioned, and they are working for me. Did you change the /etc/hosts file at any point? Regards, Krishna On Jul 27, 2010, at 2:30 PM, C.V.Krishnakumar wrote: > Hi Deepak, > > You could refer to this too: > http://markmail.org/message/mjq6gzjhst2inuab#query:MAX_FAILED_UNIQUE_FETCHES+page:1+mid:ubrwgmddmfvoadh2+state:results > > I tried those instructions and they worked for me. > Regards, > Krishna > On Jul 27, 2010, at 12:31 PM, Deepak Diwakar wrote: > >> Hey friends, >> >> I got stuck setting up an HDFS cluster and am getting this error while running >> the simple wordcount example (I did this 2 years back and had no problem). >> >> Currently testing on hadoop-0.20.1 with 2 nodes; instructions followed from >> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29 >> >> I checked the firewall settings and /etc/hosts; there is no issue there. >> Also, master and slave are accessible both ways. >> >> Also, the input size is very low (~3 MB), so there shouldn't be any issue >> with ulimit (it is, by the way, 4096). >> >> Would be really thankful if anyone can guide me to resolve this. >> >> Thanks & regards, >> - Deepak Diwakar, >> >> On 28 June 2010 18:39, bmdevelopment wrote: >> >>> Hi, Sorry for the cross-post. But just trying to see if anyone else >>> has had this issue before. >>> Thanks >>> >>> -- Forwarded message -- >>> From: bmdevelopment >>> Date: Fri, Jun 25, 2010 at 10:56 AM >>> Subject: Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; >>> bailing-out. >>> To: mapreduce-u...@hadoop.apache.org >>> >>> Hello, >>> Thanks so much for the reply. >>> See inline. >>> >>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala >>> wrote: >>>> Hi, >>>> >>>>> I've been getting the following error when trying to run a very simple >>>>> MapReduce job. >>>>> Map finishes without problem, but the error occurs as soon as it enters the >>>>> Reduce phase. >>>>> >>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id : >>>>> attempt_201006241812_0001_r_00_0, Status : FAILED >>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out. >>>>> >>>>> I am running a 5 node cluster and I believe I have all my settings >>> correct: >>>>> >>>>> * ulimit -n 32768 >>>>> * DNS/RDNS configured properly >>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM >>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW >>>>> >>>>> The program is very simple - it just counts a unique string in a log file. >>>>> See here: http://pastebin.com/5uRG3SFL >>>>> >>>>> When I run it, the job fails and I get the following output. >>>>> http://pastebin.com/AhW6StEb >>>>> >>>>> However, it runs fine when I do *not* use substring() on the value (see the >>>>> map function in the code above). >>>>> >>>>> This runs fine and completes successfully: >>>>> String str = val.toString(); >>>>> >>>>> This causes an error and fails: >>>>> String str = val.toString().substring(0,10); >>>>> >>>>> Please let me know if you need any further information. >>>>> It would be greatly appreciated if anyone could shed some light on this >>> problem. >>>> >>>> It catches my attention that changing the code to use a substring is >>>> causing a difference. Assuming it is consistent and not a red herring, >>> >>> Yes, this has been consistent over the last week. I was running 0.20.1 >>> first and then >>> upgraded to 0.20.2, but the results have been exactly the same.
>>> >>>> can you look at the counters for the two jobs using the JobTracker web >>>> UI - things like map records, bytes etc. and see if there is a >>>> noticeable difference? >>> >>> Ok, so here is the first job using write.set(value.toString()); having >>> *no* errors: >>> http://pastebin.com/xvy0iGwL >>> >>> And here is the second job using >>> write.set(value.toString().substring(0, 10)); that fails: >>> http://pastebin.com/uGw6yNqv
Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
Hi Deepak, You could refer to this too: http://markmail.org/message/mjq6gzjhst2inuab#query:MAX_FAILED_UNIQUE_FETCHES+page:1+mid:ubrwgmddmfvoadh2+state:results I tried those instructions and they worked for me. Regards, Krishna On Jul 27, 2010, at 12:31 PM, Deepak Diwakar wrote: > Hey friends, > > I got stuck setting up an HDFS cluster and am getting this error while running > the simple wordcount example (I did this 2 years back and had no problem). > > Currently testing on hadoop-0.20.1 with 2 nodes; instructions followed from > http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29 > > I checked the firewall settings and /etc/hosts; there is no issue there. > Also, master and slave are accessible both ways. > > Also, the input size is very low (~3 MB), so there shouldn't be any issue > with ulimit (it is, by the way, 4096). > > Would be really thankful if anyone can guide me to resolve this. > > Thanks & regards, > - Deepak Diwakar, > > On 28 June 2010 18:39, bmdevelopment wrote: > >> Hi, Sorry for the cross-post. But just trying to see if anyone else >> has had this issue before. >> Thanks >> >> -- Forwarded message -- >> From: bmdevelopment >> Date: Fri, Jun 25, 2010 at 10:56 AM >> Subject: Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; >> bailing-out. >> To: mapreduce-u...@hadoop.apache.org >> >> Hello, >> Thanks so much for the reply. >> See inline. >> >> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala >> wrote: >>> Hi, >>> I've been getting the following error when trying to run a very simple MapReduce job. Map finishes without problem, but the error occurs as soon as it enters the Reduce phase. 10/06/24 18:41:00 INFO mapred.JobClient: Task Id : attempt_201006241812_0001_r_00_0, Status : FAILED Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out. I am running a 5 node cluster and I believe I have all my settings >> correct: * ulimit -n 32768 * DNS/RDNS configured properly * hdfs-site.xml : http://pastebin.com/xuZ17bPM * mapred-site.xml : http://pastebin.com/JraVQZcW The program is very simple - it just counts a unique string in a log file. See here: http://pastebin.com/5uRG3SFL When I run it, the job fails and I get the following output. http://pastebin.com/AhW6StEb However, it runs fine when I do *not* use substring() on the value (see the map function in the code above). This runs fine and completes successfully: String str = val.toString(); This causes an error and fails: String str = val.toString().substring(0,10); Please let me know if you need any further information. It would be greatly appreciated if anyone could shed some light on this >> problem. >>> >>> It catches my attention that changing the code to use a substring is >>> causing a difference. Assuming it is consistent and not a red herring, >> >> Yes, this has been consistent over the last week. I was running 0.20.1 >> first and then >> upgraded to 0.20.2, but the results have been exactly the same. >> >>> can you look at the counters for the two jobs using the JobTracker web >>> UI - things like map records, bytes etc. and see if there is a >>> noticeable difference?
>> >> Ok, so here is the first job using write.set(value.toString()); having >> *no* errors: >> http://pastebin.com/xvy0iGwL >> >> And here is the second job using >> write.set(value.toString().substring(0, 10)); that fails: >> http://pastebin.com/uGw6yNqv >> >> And here is yet another where I used a longer, and therefore unique, >> string, >> by write.set(value.toString().substring(0, 20)); This makes every line >> unique, similar to the first job. >> Still fails. >> http://pastebin.com/GdQ1rp8i >> >>> Also, are the two programs being run against >>> the exact same input data? >> >> Yes, exactly the same input: a single csv file with 23K lines. >> Using a shorter string leads to more identical keys and therefore more >> combining/reducing, but going >> by the above it seems to fail whether the substring/key is entirely >> unique (23000 combine output records) or >> mostly the same (9 combine output records). >> >>> >>> Also, since the cluster size is small, you could also look at the >>> tasktracker logs on the machines where the maps have run to see if >>> there are any failures when the reduce attempts start failing. >> >> Here is the TT log from the last failed job. I do not see anything >> besides the shuffle failure, but there >> may be something I am overlooking or simply do not understand. >> http://pastebin.com/DKFTyGXg >> >> Thanks again! >> >>> >>> Thanks >>> Hemanth >>
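One recurring cause of MAX_FAILED_UNIQUE_FETCHES on Ubuntu-style installs like the multi-node guide linked above, offered as a hunch rather than a diagnosis of this particular thread: the distribution's default /etc/hosts maps the machine's own hostname to 127.0.1.1, so reducers resolve a map host to loopback and try to fetch map output from the wrong machine. Removing that line and listing every node's real LAN IP against its hostname, identically on all nodes, is usually the first thing to verify when /etc/hosts is suspected.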
Re: using 'fs -put' from datanode: all data written to that node's hdfs and not distributed
Oh. Thanks for the reply. Regards, Krishna On Jul 13, 2010, at 9:51 AM, Allen Wittenauer wrote: > > When you write on a machine running a datanode process, the data is *always* > written locally first. This is to provide an optimization to the MapReduce > framework. The lesson here is that you should *never* use a datanode > machine to load your data. Always do it outside the grid. > > Additionally, you can use fsck <filename> -files -locations -blocks to see > where those blocks have been written. > > On Jul 13, 2010, at 9:45 AM, Nathan Grice wrote: > >> To test the block distribution, run the same put command from the NameNode >> and then again from the DataNode. >> Check the HDFS filesystem after both commands. In my case, a 2GB file was >> distributed mostly evenly across the datanodes when put was run on the >> NameNode, but stored only on the DataNode where I ran the put command >> when run from there. >> >> On Tue, Jul 13, 2010 at 9:32 AM, C.V.Krishnakumar >> wrote: >> >>> Hi, >>> I am a newbie. I am curious to know how you discovered that all the blocks >>> are written to the datanode's hdfs? I thought the replication by the namenode >>> was transparent. Am I missing something? >>> Thanks, >>> Krishna >>> On Jul 12, 2010, at 4:21 PM, Nathan Grice wrote: >>> >>>> We are trying to load data into hdfs from one of the slaves, and when the >>> put >>>> command is run from a slave (datanode), all of the blocks are written to >>> the >>>> datanode's hdfs and not distributed to all of the nodes in the cluster. >>> It >>>> does not seem to matter what destination format we use (/filename vs >>>> hdfs://master:9000/filename); it always behaves the same. >>>> Conversely, running the same command from the namenode distributes the >>> files >>>> across the datanodes. >>>> >>>> Is there something I am missing? >>>> >>>> -Nathan
Re: using 'fs -put' from datanode: all data written to that node's hdfs and not distributed
Hi, I am a newbie. I am curious to know how you discovered that all the blocks are written to the datanode's hdfs? I thought the replication by the namenode was transparent. Am I missing something? Thanks, Krishna On Jul 12, 2010, at 4:21 PM, Nathan Grice wrote: > We are trying to load data into hdfs from one of the slaves, and when the put > command is run from a slave (datanode), all of the blocks are written to the > datanode's hdfs and not distributed to all of the nodes in the cluster. It > does not seem to matter what destination format we use (/filename vs > hdfs://master:9000/filename); it always behaves the same. > Conversely, running the same command from the namenode distributes the files > across the datanodes. > > Is there something I am missing? > > -Nathan
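To see the distribution directly, the fsck form quoted earlier in the thread can be run against the freshly loaded file (the path here is hypothetical): bin/hadoop fsck /filename -files -blocks -locations. Each block is then listed together with the datanodes holding its replicas, which makes the write-local-first behaviour easy to confirm.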