Re: Deduplication Effort in Hadoop

2011-07-14 Thread C.V.Krishnakumar Iyer
Hi,

I guess by "system" you meant HDFS.

In that case, HBase might help. HBase row keys must be unique. They are just 
bytes, so I guess you can just concatenate the columns that make up your primary 
key (if you have a primary key spanning more than one column) to form the HBase 
row key, so that duplicates don't exist.

So, data can be stored in HBase rather than in files and everything else is 
still the same.
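
For example, here is a minimal sketch with the HBase Java client (roughly the 0.90-era API; the table name, column family, qualifier, and key parts below are made up for illustration):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DedupWriter {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "events");
    // Build the row key by concatenating the "primary key" columns.
    // Re-inserting a record with the same key columns writes to the same row,
    // so duplicates simply overwrite each other instead of piling up.
    byte[] rowKey = Bytes.add(Bytes.toBytes("custId-123"),
                              Bytes.toBytes("|"),
                              Bytes.toBytes("2011-07-14T09:18"));
    Put put = new Put(rowKey);
    put.add(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("rest of record"));
    table.put(put);
    table.close();
  }
}

Since the duplicates are only expected within the last two weeks, a TTL on the column family (HColumnDescriptor.setTimeToLive) could also let old rows age out automatically.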

I don't know about Hive, though.

Thanks,
Krishnakumar.



On Jul 14, 2011, at 9:18 AM, Michael Segel wrote:

> You don't have dupes because the key has to be unique. 
> 
> 
> 
> Sent from my Palm Pre on AT&T
> On Jul 14, 2011 11:00 AM, jonathan.hw...@accenture.com wrote: 
> 
> Hi All,
> 
> In databases you can define primary keys to ensure no duplicate data gets 
> loaded into the system. Let's say I have on the order of 1 billion records 
> flowing into my system every day, and some of these are repeated data (same 
> records). I can use 2-3 columns in the record to match and look for 
> duplicates. What is the best strategy for de-duplication? The duplicated 
> records should only appear within the last 2 weeks. I want a fast way to 
> get the data into the system without much delay. Is there any way HBase or 
> Hive can help?
> 
> 
> 
> Thanks!
> 
> Jonathan



Command Line Arguments for Client

2011-02-22 Thread C.V.Krishnakumar Iyer
Hi,

Could anyone tell me how to set command-line arguments (like -Xmx and -Xms) 
for the client JVM (not for the map/reduce tasks) from the command that is 
usually used to launch the job? 

Thanks,
Krishnakumar




Re: libjars options

2011-01-11 Thread C.V.Krishnakumar Iyer
Hi,

Thanks a lot, Alex! Using GenericOptionsParser solved the issue. Previously I 
had used Tool and had assumed that it would take care of this.
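
For anyone who hits the same thing: here is a minimal sketch of what the driver looks like once it is wired through ToolRunner, so that GenericOptionsParser actually consumes -libjars (the class, job, and path names below are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    // getConf() already reflects -libjars, -files, -D, etc., because
    // ToolRunner passed the arguments through GenericOptionsParser.
    Job job = new Job(getConf(), "my job");
    job.setJarByClass(MyDriver.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner strips the generic options before handing the rest to run().
    System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
  }
}

It would then be launched with the generic options between the class name and the job arguments, along the lines of:
bin/hadoop jar myjob.jar MyDriver -libjars /path/to/hbase.jar inputdir outputdir

Implementing Tool alone is not enough; main() has to go through ToolRunner (or call GenericOptionsParser itself) for the generic options to be applied.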

Regards,
Krishna.
On Jan 11, 2011, at 12:48 PM, Alex Kozlov wrote:

> There is also a blog that I recently wrote, if it helps
> http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job
> 
> On Tue, Jan 11, 2011 at 12:33 PM, Alex Kozlov  wrote:
> 
>> Have you implemented GenericOptionsParser?  Do you see your jar in the *
>> mapred.cache.files* or *tmpjars* parameter in your job.xml file (can view
>> via a JT Web UI)?
>> 
>> --
>> Alex Kozlov
>> Solutions Architect
>> Cloudera, Inc
>> twitter: alexvk2009
>> <http://www.cloudera.com/company/press-center/hadoop-world-nyc/>
>> 
>> 
>> On Tue, Jan 11, 2011 at 11:49 AM, C.V.Krishnakumar Iyer <
>> f2004...@gmail.com> wrote:
>> 
>>> Hi,
>>> 
>>> I have tried that as well, using -files, but it still gives the
>>> exact same error. Is there anything else I could try?
>>> 
>>> Thanks,
>>> Krishna.
>>> 
>>> On Jan 11, 2011, at 10:23 AM, Ted Yu wrote:
>>> 
>>>> Refer to Alex Kozlov's answer on 12/11/10
>>>> 
>>>> On Tue, Jan 11, 2011 at 10:10 AM, C.V.Krishnakumar Iyer
>>>> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> Could anyone please guide me as to how to use the -libjars option in
>>> Hadoop?
>>>>> 
>>>>> I have added the necessary jar file (the hbase jar - to be precise)  to
>>> the
>>>>> classpath of the node where I am starting the job.
>>>>> 
>>>>> The following is the format that I am invoking:
>>>>> bin/hadoop jar <jar file> <main class> -libjars <jar files (separated by
>>>>> commas)> <other args>
>>>>> 
>>>>> bin/hadoop jar /Users/hdp/cvk/myjob.jar  mr2.mr2a.MR2ADriver -libjars
>>>>> /Users/hdp/hadoop/lib/hbase-0.20.6.jar inputmr2a  outputmr2a
>>>>> 
>>>>> Despite this,  I find that I get the java.lang.ClassNotFoundException
>>>>> error! :(
>>>>> java.lang.RuntimeException: java.lang.RuntimeException:
>>>>> java.lang.ClassNotFoundException:
>>>>> org.apache.hadoop.hbase.io.ImmutableBytesWritable
>>>>>  at
>>>>> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:841)
>>>>>  at
>>>>> 
>>> org.apache.hadoop.mapred.JobConf.getMapOutputValueClass(JobConf.java:551)
>>>>>  at
>>>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:793)
>>>>>  at
>>>>> org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:524)
>>>>>  at
>>> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
>>>>>  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>>>>  at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>>>> Caused by: java.lang.RuntimeException:
>>> java.lang.ClassNotFoundException:
>>>>> org.apache.hadoop.hbase.io.ImmutableBytesWritable
>>>>>  at
>>>>> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:809)
>>>>>  at
>>>>> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:833)
>>>>> 
>>>>> The strange thing is that there is another MR job I have  that runs
>>>>> perfectly with the libjars option! Could anybody tell me what I am
>>> doing
>>>>> wrong? One more thing - not sure if it is relevant : I am using the new
>>>>> Hadoop MapReduce API.
>>>>> 
>>>>> Thanks in advance!
>>>>> 
>>>>> Regards,
>>>>> Krishnakumar.
>>> 
>>> 
>> 



Re: libjars options

2011-01-11 Thread C.V.Krishnakumar Iyer
Hi,

Thanks a lot! I shall try this once and let you know! 

Regards,
Krishna.
On Jan 11, 2011, at 12:48 PM, Alex Kozlov wrote:

> There is also a blog that I recently wrote, if it helps
> http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job
> 
> On Tue, Jan 11, 2011 at 12:33 PM, Alex Kozlov  wrote:
> 
>> Have you implemented GenericOptionsParser?  Do you see your jar in the *
>> mapred.cache.files* or *tmpjars* parameter in your job.xml file (can view
>> via a JT Web UI)?
>> 
>> --
>> Alex Kozlov
>> Solutions Architect
>> Cloudera, Inc
>> twitter: alexvk2009
>> <http://www.cloudera.com/company/press-center/hadoop-world-nyc/>
>> 
>> 
>> On Tue, Jan 11, 2011 at 11:49 AM, C.V.Krishnakumar Iyer <
>> f2004...@gmail.com> wrote:
>> 
>>> Hi,
>>> 
>>> I have tried that as well, using -files, but it still gives the
>>> exact same error. Is there anything else I could try?
>>> 
>>> Thanks,
>>> Krishna.
>>> 
>>> On Jan 11, 2011, at 10:23 AM, Ted Yu wrote:
>>> 
>>>> Refer to Alex Kozlov's answer on 12/11/10
>>>> 
>>>> On Tue, Jan 11, 2011 at 10:10 AM, C.V.Krishnakumar Iyer
>>>> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> Could anyone please guide me as to how to use the -libjars option in
>>> Hadoop?
>>>>> 
>>>>> I have added the necessary jar file (the hbase jar - to be precise)  to
>>> the
>>>>> classpath of the node where I am starting the job.
>>>>> 
>>>>> The following is the format that I am invoking:
>>>>> bin/hadoop jar <jar file> <main class> -libjars <jar files (separated by
>>>>> commas)> <other args>
>>>>> 
>>>>> bin/hadoop jar /Users/hdp/cvk/myjob.jar  mr2.mr2a.MR2ADriver -libjars
>>>>> /Users/hdp/hadoop/lib/hbase-0.20.6.jar inputmr2a  outputmr2a
>>>>> 
>>>>> Despite this,  I find that I get the java.lang.ClassNotFoundException
>>>>> error! :(
>>>>> java.lang.RuntimeException: java.lang.RuntimeException:
>>>>> java.lang.ClassNotFoundException:
>>>>> org.apache.hadoop.hbase.io.ImmutableBytesWritable
>>>>>  at
>>>>> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:841)
>>>>>  at
>>>>> 
>>> org.apache.hadoop.mapred.JobConf.getMapOutputValueClass(JobConf.java:551)
>>>>>  at
>>>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:793)
>>>>>  at
>>>>> org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:524)
>>>>>  at
>>> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
>>>>>  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>>>>  at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>>>> Caused by: java.lang.RuntimeException:
>>> java.lang.ClassNotFoundException:
>>>>> org.apache.hadoop.hbase.io.ImmutableBytesWritable
>>>>>  at
>>>>> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:809)
>>>>>  at
>>>>> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:833)
>>>>> 
>>>>> The strange thing is that there is another MR job I have  that runs
>>>>> perfectly with the libjars option! Could anybody tell me what I am
>>> doing
>>>>> wrong? One more thing - not sure if it is relevant : I am using the new
>>>>> Hadoop MapReduce API.
>>>>> 
>>>>> Thanks in advance!
>>>>> 
>>>>> Regards,
>>>>> Krishnakumar.
>>> 
>>> 
>> 



Re: libjars options

2011-01-11 Thread C.V.Krishnakumar Iyer
Hi,

I have tried that as well, using -files, but it still gives the exact 
same error. Is there anything else I could try? 

Thanks,
Krishna.

On Jan 11, 2011, at 10:23 AM, Ted Yu wrote:

> Refer to Alex Kozlov's answer on 12/11/10
> 
> On Tue, Jan 11, 2011 at 10:10 AM, C.V.Krishnakumar Iyer
> wrote:
> 
>> Hi,
>> 
>> Could anyone please guide me as to how to use the -libjars option in Hadoop?
>> 
>> I have added the necessary jar file (the hbase jar - to be precise)  to the
>> classpath of the node where I am starting the job.
>> 
>> The following is the format that I am invoking:
>> bin/hadoop jar <jar file> <main class> -libjars <jar files (separated by commas)> <other args>
>> 
>> bin/hadoop jar /Users/hdp/cvk/myjob.jar  mr2.mr2a.MR2ADriver -libjars
>> /Users/hdp/hadoop/lib/hbase-0.20.6.jar inputmr2a  outputmr2a
>> 
>> Despite this,  I find that I get the java.lang.ClassNotFoundException
>> error! :(
>> java.lang.RuntimeException: java.lang.RuntimeException:
>> java.lang.ClassNotFoundException:
>> org.apache.hadoop.hbase.io.ImmutableBytesWritable
>>   at
>> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:841)
>>   at
>> org.apache.hadoop.mapred.JobConf.getMapOutputValueClass(JobConf.java:551)
>>   at
>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:793)
>>   at
>> org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:524)
>>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>   at org.apache.hadoop.mapred.Child.main(Child.java:170)
>> Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException:
>> org.apache.hadoop.hbase.io.ImmutableBytesWritable
>>   at
>> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:809)
>>   at
>> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:833)
>> 
>> The strange thing is that there is another MR job I have  that runs
>> perfectly with the libjars option! Could anybody tell me what I am doing
>> wrong? One more thing - not sure if it is relevant : I am using the new
>> Hadoop MapReduce API.
>> 
>> Thanks in advance!
>> 
>> Regards,
>> Krishnakumar.



libjars options

2011-01-11 Thread C.V.Krishnakumar Iyer
Hi,

Could anyone please guide me as to how to use the -libjars option in Hadoop? 

I have added the necessary jar file (the hbase jar - to be precise)  to the 
classpath of the node where I am starting the job. 

The following is the format that I am invoking: 
bin/hadoop jar <jar file> <main class> -libjars <jar files (separated by commas)> <other args>

bin/hadoop jar /Users/hdp/cvk/myjob.jar  mr2.mr2a.MR2ADriver -libjars 
/Users/hdp/hadoop/lib/hbase-0.20.6.jar inputmr2a  outputmr2a

Despite this,  I find that I get the java.lang.ClassNotFoundException error! :(
java.lang.RuntimeException: java.lang.RuntimeException: 
java.lang.ClassNotFoundException: 
org.apache.hadoop.hbase.io.ImmutableBytesWritable
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:841)
at 
org.apache.hadoop.mapred.JobConf.getMapOutputValueClass(JobConf.java:551)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:793)
at 
org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:524)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: 
org.apache.hadoop.hbase.io.ImmutableBytesWritable
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:809)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:833)

The strange thing is that there is another MR job I have that runs perfectly 
with the -libjars option! Could anybody tell me what I am doing wrong? One more 
thing - not sure if it is relevant: I am using the new Hadoop MapReduce API.

Thanks in advance!

Regards,
Krishnakumar.

-libjars option

2011-01-10 Thread C.V.Krishnakumar Iyer
Hi,

Could anyone please guide me as to how to use the -libjars option in Hadoop? 

I have added the necessary jar file (the hbase jar - to be precise)  to the 
classpath of the node where I am starting the job. 

The following is the format that I am invoking: 
bin/hadoop jar <jar file> <main class> -libjars <jar files (separated by commas)> <other args>

bin/hadoop jar /Users/hdp/cvk/myjob.jar  mr2.mr2a.MR2ADriver -libjars 
/Users/hdp/hadoop/lib/hbase-0.20.6.jar inputmr2a  outputmr2a

Despite this,  I find that I get the java.lang.ClassNotFoundException error! :(
java.lang.RuntimeException: java.lang.RuntimeException: 
java.lang.ClassNotFoundException: 
org.apache.hadoop.hbase.io.ImmutableBytesWritable
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:841)
at 
org.apache.hadoop.mapred.JobConf.getMapOutputValueClass(JobConf.java:551)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:793)
at 
org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:524)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: 
org.apache.hadoop.hbase.io.ImmutableBytesWritable
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:809)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:833)

The strange thing is that there is another MR job I have that runs perfectly 
with the -libjars option! Could anybody tell me what I am doing wrong? One more 
thing - not sure if it is relevant: I am using the new Hadoop MapReduce API.

Thanks in advance!

Regards,
Krishnakumar.







Re: IOException in TaskRunner (Error Code :134)

2010-09-21 Thread C.V.Krishnakumar
Hi,
Thanks a lot! I just removed the elements in the cache and the incomplete 
blocks, and it worked.
Regards,
Krishnakumar.

On Sep 21, 2010, at 12:54 PM, Allen Wittenauer wrote:

> 
> It is a "not enough information" error.
> 
> Check the tasks, jobtracker, tasktracker, datanode, and namenode logs.
> 
> On Sep 21, 2010, at 12:30 PM, C.V.Krishnakumar wrote:
> 
>> 
>> Hi,
>> Just wanted to know if anyone has any idea about this one. It happens 
>> every time I run a job. 
>> Is this issue hardware-related? 
>> 
>> Thanks in advance,
>> Krishnakumar.
>> 
>> Begin forwarded message:
>> 
>>> From: "C.V.Krishnakumar" 
>>> Date: September 17, 2010 1:32:49 PM PDT
>>> To: common-user@hadoop.apache.org
>>> Subject: Tasks Failing : IOException in TaskRunner (Error Code :134)
>>> Reply-To: common-user@hadoop.apache.org
>>> 
>>> Hi all,
>>> 
>>> I am facing a problem with the TaskRunner. I have a small Hadoop cluster 
>>> in fully distributed mode. However, when I submit a job, it never 
>>> seems to proceed beyond the "map 0% reduce 0%" stage. Soon after, I get this 
>>> error:
>>> 
>>> java.io.IOException: Task process exit with nonzero status of 134. at 
>>> org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
>>> 
>>> When I googled this issue, I found this link  
>>> http://markmail.org/message/lvefqbcboaqzfazt  but did not completely 
>>> understand the issue. The cluster, in this same configuration, was working 
>>> fine a few days back. So, I am confused as to what could have changed to 
>>> cause this error. 
>>> 
>>> Have any of you faced similar problems too? I would be really grateful if 
>>> you could let me know if I am missing something very obvious.
>>> 
>>> Thanks a lot!
>>> 
>>> Regards,
>>> Krishna.
>>> 
>> 
> 



IOException in TaskRunner (Error Code :134)

2010-09-21 Thread C.V.Krishnakumar

Hi,
Just wanted to know if anyone has any idea about this one. It happens every 
time I run a job. 
Is this issue hardware-related? 

Thanks in advance,
Krishnakumar.

Begin forwarded message:

> From: "C.V.Krishnakumar" 
> Date: September 17, 2010 1:32:49 PM PDT
> To: common-user@hadoop.apache.org
> Subject: Tasks Failing : IOException in TaskRunner (Error Code :134)
> Reply-To: common-user@hadoop.apache.org
> 
> Hi all,
> 
> I am facing a problem with the TaskRunner. I have a small Hadoop cluster in 
> fully distributed mode. However, when I submit a job, it never seems 
> to proceed beyond the "map 0% reduce 0%" stage. Soon after, I get this error:
> 
> java.io.IOException: Task process exit with nonzero status of 134. at 
> org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
> 
> When I googled this issue, I found this link  
> http://markmail.org/message/lvefqbcboaqzfazt  but did not completely 
> understand the issue. The cluster, in this same configuration, was working 
> fine a few days back. So, I am confused as to what could have changed to 
> cause this error. 
> 
> Have any of you faced similar problems too? I would be really grateful if you 
> could let me know if I am missing something very obvious.
> 
> Thanks a lot!
> 
> Regards,
> Krishna.
> 



Tasks Failing : IOException in TaskRunner (Error Code :134)

2010-09-17 Thread C.V.Krishnakumar
Hi all,

I am facing a problem with the TaskRunner. I have a small Hadoop cluster in 
fully distributed mode. However, when I submit a job, it never seems to 
proceed beyond the "map 0% reduce 0%" stage. Soon after, I get this error:

java.io.IOException: Task process exit with nonzero status of 134. at 
org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)

When I googled this issue, I found this link  
http://markmail.org/message/lvefqbcboaqzfazt  but did not completely understand 
the issue. The cluster, in this same configuration, was working fine a few days 
back. So, I am confused as to what could have changed to cause this error. 

Have any of you faced similar problems too? I would be really grateful if you 
could let me know if I am missing something very obvious.

Thanks a lot!

Regards,
Krishna.



Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

2010-07-27 Thread C.V.Krishnakumar
Hi Deepak,

Maybe I did not make my mail clear. I had tried the instructions in the blog 
you mentioned. They are working for me. 
Did you change the /etc/hosts file at any point? 

Regards,
Krishna

On Jul 27, 2010, at 2:30 PM, C.V.Krishnakumar wrote:

> Hi Deepak,
> 
> You could refer to this too: 
> http://markmail.org/message/mjq6gzjhst2inuab#query:MAX_FAILED_UNIQUE_FETCHES+page:1+mid:ubrwgmddmfvoadh2+state:results
>  
> I tried those instructions and it is working for me. 
> Regards,
> Krishna
> On Jul 27, 2010, at 12:31 PM, Deepak Diwakar wrote:
> 
>> Hey friends,
>> 
>> I got stuck setting up an HDFS cluster and am getting this error while running
>> the simple wordcount example (I did this 2 years back without any problem).
>> 
>> Currently testing on hadoop-0.20.1 with 2 nodes; instructions followed from
>> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29
>> 
>> I checked the firewall settings and /etc/hosts; there is no issue there.
>> Also, master and slave are accessible both ways.
>> 
>> Also, the input size is very low (~3 MB), so there shouldn't be any issue
>> with ulimit (it is 4096, by the way).
>> 
>> Would be really thankful if anyone can guide me to resolve this.
>> 
>> Thanks & regards,
>> - Deepak Diwakar,
>> 
>> 
>> 
>> 
>> On 28 June 2010 18:39, bmdevelopment  wrote:
>> 
>>> Hi, Sorry for the cross-post. But just trying to see if anyone else
>>> has had this issue before.
>>> Thanks
>>> 
>>> 
>>> -- Forwarded message --
>>> From: bmdevelopment 
>>> Date: Fri, Jun 25, 2010 at 10:56 AM
>>> Subject: Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES;
>>> bailing-out.
>>> To: mapreduce-u...@hadoop.apache.org
>>> 
>>> 
>>> Hello,
>>> Thanks so much for the reply.
>>> See inline.
>>> 
>>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala 
>>> wrote:
>>>> Hi,
>>>> 
>>>>> I've been getting the following error when trying to run a very simple
>>>>> MapReduce job.
>>>>> Map finishes without problem, but error occurs as soon as it enters
>>>>> Reduce phase.
>>>>> 
>>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
>>>>> attempt_201006241812_0001_r_00_0, Status : FAILED
>>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>>>> 
>>>>> I am running a 5 node cluster and I believe I have all my settings
>>> correct:
>>>>> 
>>>>> * ulimit -n 32768
>>>>> * DNS/RDNS configured properly
>>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
>>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
>>>>> 
>>>>> The program is very simple - just counts a unique string in a log file.
>>>>> See here: http://pastebin.com/5uRG3SFL
>>>>> 
>>>>> When I run, the job fails and I get the following output.
>>>>> http://pastebin.com/AhW6StEb
>>>>> 
>>>>> However, runs fine when I do *not* use substring() on the value (see
>>>>> map function in code above).
>>>>> 
>>>>> This runs fine and completes successfully:
>>>>>  String str = val.toString();
>>>>> 
>>>>> This causes error and fails:
>>>>>  String str = val.toString().substring(0,10);
>>>>> 
>>>>> Please let me know if you need any further information.
>>>>> It would be greatly appreciated if anyone could shed some light on this
>>> problem.
>>>> 
>>>> It catches attention that changing the code to use a substring is
>>>> causing a difference. Assuming it is consistent and not a red herring,
>>> 
>>> Yes, this has been consistent over the last week. I was running 0.20.1
>>> first and then
>>> upgrade to 0.20.2 but results have been exactly the same.
>>> 
>>>> can you look at the counters for the two jobs using the JobTracker web
>>>> UI - things like map records, bytes etc and see if there is a
>>>> noticeable difference ?
>>> 
>>> Ok, so here is the first job using write.set(value.toString()); having
>>> *no* errors:
>>> http://pastebin.com/xvy0iGwL
>>> 
>>> And here is the second job using
>>

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

2010-07-27 Thread C.V.Krishnakumar
Hi Deepak,

You could refer to this too: 
http://markmail.org/message/mjq6gzjhst2inuab#query:MAX_FAILED_UNIQUE_FETCHES+page:1+mid:ubrwgmddmfvoadh2+state:results
 
 I tried those instructions and it is working for me. 
Regards,
Krishna
On Jul 27, 2010, at 12:31 PM, Deepak Diwakar wrote:

> Hey friends,
> 
> I got stuck setting up an HDFS cluster and am getting this error while running
> the simple wordcount example (I did this 2 years back without any problem).
> 
> Currently testing on hadoop-0.20.1 with 2 nodes; instructions followed from
> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29
> 
> I checked the firewall settings and /etc/hosts; there is no issue there.
> Also, master and slave are accessible both ways.
> 
> Also, the input size is very low (~3 MB), so there shouldn't be any issue
> with ulimit (it is 4096, by the way).
> 
> Would be really thankful if anyone can guide me to resolve this.
> 
> Thanks & regards,
> - Deepak Diwakar,
> 
> 
> 
> 
> On 28 June 2010 18:39, bmdevelopment  wrote:
> 
>> Hi, Sorry for the cross-post. But just trying to see if anyone else
>> has had this issue before.
>> Thanks
>> 
>> 
>> -- Forwarded message --
>> From: bmdevelopment 
>> Date: Fri, Jun 25, 2010 at 10:56 AM
>> Subject: Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES;
>> bailing-out.
>> To: mapreduce-u...@hadoop.apache.org
>> 
>> 
>> Hello,
>> Thanks so much for the reply.
>> See inline.
>> 
>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala 
>> wrote:
>>> Hi,
>>> 
 I've been getting the following error when trying to run a very simple
 MapReduce job.
 Map finishes without problem, but error occurs as soon as it enters
 Reduce phase.
 
 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
 attempt_201006241812_0001_r_00_0, Status : FAILED
 Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
 
 I am running a 5 node cluster and I believe I have all my settings
>> correct:
 
 * ulimit -n 32768
 * DNS/RDNS configured properly
 * hdfs-site.xml : http://pastebin.com/xuZ17bPM
 * mapred-site.xml : http://pastebin.com/JraVQZcW
 
 The program is very simple - just counts a unique string in a log file.
 See here: http://pastebin.com/5uRG3SFL
 
 When I run, the job fails and I get the following output.
 http://pastebin.com/AhW6StEb
 
 However, runs fine when I do *not* use substring() on the value (see
 map function in code above).
 
 This runs fine and completes successfully:
   String str = val.toString();
 
 This causes error and fails:
   String str = val.toString().substring(0,10);
 
 Please let me know if you need any further information.
 It would be greatly appreciated if anyone could shed some light on this
>> problem.
>>> 
>>> It catches attention that changing the code to use a substring is
>>> causing a difference. Assuming it is consistent and not a red herring,
>> 
>> Yes, this has been consistent over the last week. I was running 0.20.1
>> first and then
>> upgrade to 0.20.2 but results have been exactly the same.
>> 
>>> can you look at the counters for the two jobs using the JobTracker web
>>> UI - things like map records, bytes etc and see if there is a
>>> noticeable difference ?
>> 
>> Ok, so here is the first job using write.set(value.toString()); having
>> *no* errors:
>> http://pastebin.com/xvy0iGwL
>> 
>> And here is the second job using
>> write.set(value.toString().substring(0, 10)); that fails:
>> http://pastebin.com/uGw6yNqv
>> 
>> And here is even another where I used a longer, and therefore unique
>> string,
>> by write.set(value.toString().substring(0, 20)); This makes every line
>> unique, similar to first job.
>> Still fails.
>> http://pastebin.com/GdQ1rp8i
>> 
>>> Also, are the two programs being run against
>>> the exact same input data ?
>> 
>> Yes, exactly the same input: a single csv file with 23K lines.
>> Using a shorter string leads to more like keys and therefore more
>> combining/reducing, but going
>> by the above it seems to fail whether the substring/key is entirely
>> unique (23000 combine output records) or
>> mostly the same (9 combine output records).
>> 
>>> 
>>> Also, since the cluster size is small, you could also look at the
>>> tasktracker logs on the machines where the maps have run to see if
>>> there are any failures when the reduce attempts start failing.
>> 
>> Here is the TT log from the last failed job. I do not see anything
>> besides the shuffle failure, but there
>> may be something I am overlooking or simply do not understand.
>> http://pastebin.com/DKFTyGXg
>> 
>> Thanks again!
>> 
>>> 
>>> Thanks
>>> Hemanth
>>> 
>> 



Re: using 'fs -put' from datanode: all data written to that node's hdfs and not distributed

2010-07-13 Thread C.V.Krishnakumar
Oh. Thanks for the reply.
Regards,
Krishna
On Jul 13, 2010, at 9:51 AM, Allen Wittenauer wrote:

> 
> When you write on a machine running a datanode process, the data is *always* 
> written locally first.  This is to provide an optimization to the MapReduce 
> framework.   The lesson here is that you should *never* use a datanode 
> machine to load your data.  Always do it outside the grid.
> 
> Additionally, you can use fsck (filename) -files -locations -blocks to see 
> where those blocks have been written.  
> 
> On Jul 13, 2010, at 9:45 AM, Nathan Grice wrote:
> 
>> To test the block distribution, run the same put command from the NameNode
>> and then again from the DataNode.
>> Check the HDFS filesystem after both commands. In my case, a 2GB file was
>> distributed mostly evenly across the datanodes when the put was run on the
>> NameNode, and ended up only on the DataNode when I ran the put command there.
>> 
>> On Tue, Jul 13, 2010 at 9:32 AM, C.V.Krishnakumar 
>> wrote:
>> 
>>> Hi,
>>> I am a newbie. I am curious to know how you discovered that all the blocks
>>> are written to the datanode's HDFS. I thought replication by the namenode was
>>> transparent. Am I missing something?
>>> Thanks,
>>> Krishna
>>> On Jul 12, 2010, at 4:21 PM, Nathan Grice wrote:
>>> 
>>>> We are trying to load data into hdfs from one of the slaves and when the
>>> put
>>>> command is run from a slave(datanode) all of the blocks are written to
>>> the
>>>> datanode's hdfs, and not distributed to all of the nodes in the cluster.
>>> It
>>>> does not seem to matter what destination format we use ( /filename vs
>>>> hdfs://master:9000/filename) it always behaves the same.
>>>> Conversely, running the same command from the namenode distributes the
>>> files
>>>> across the datanodes.
>>>> 
>>>> Is there something I am missing?
>>>> 
>>>> -Nathan
>>> 
>>> 
> 
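
Along the lines of the fsck tip above, here is a rough sketch of checking block placement through the HDFS Java API (the class name is made up, and the path argument is whatever file was just uploaded):

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WhereAreMyBlocks {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path(args[0]);                 // e.g. the file loaded with 'fs -put'
    FileStatus status = fs.getFileStatus(file);
    // One BlockLocation per block; getHosts() lists the datanodes holding its replicas.
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("block at offset " + block.getOffset()
          + " -> " + Arrays.toString(block.getHosts()));
    }
  }
}

Running it on a file uploaded from a datanode versus one uploaded from outside the grid should show the difference described above.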



Re: using 'fs -put' from datanode: all data written to that node's hdfs and not distributed

2010-07-13 Thread C.V.Krishnakumar
Hi,
I am a newbie. I am curious to know how you discovered that all the blocks are 
written to the datanode's HDFS. I thought replication by the namenode was 
transparent. Am I missing something?
Thanks,
Krishna
On Jul 12, 2010, at 4:21 PM, Nathan Grice wrote:

> We are trying to load data into HDFS from one of the slaves, and when the put
> command is run from a slave (datanode), all of the blocks are written to that
> datanode's HDFS and not distributed to all of the nodes in the cluster. It
> does not seem to matter what destination format we use (/filename vs
> hdfs://master:9000/filename); it always behaves the same.
> Conversely, running the same command from the namenode distributes the files
> across the datanodes.
> 
> Is there something I am missing?
> 
> -Nathan