Distributing Keys across Reducers

2012-07-20 Thread Dave Shine
I have a job that is emitting over 3 billion rows from the map to the reduce. The job is configured with 43 reduce tasks. A perfectly even distribution would amount to about 70 million rows per reduce task. However, I actually got around 60 million for most of the tasks, while one task got over 100 …

Re: Distributing Keys across Reducers

2012-07-20 Thread syed kather
Dave Shine, can you share how much data is taken by each map task? If the map input is uneven, it might be a hotspotting problem. Have a look at http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/ . I had also faced the same pr …

Re: Distributing Keys across Reducers

2012-07-20 Thread John Armstrong
On 07/20/2012 09:20 AM, Dave Shine wrote: I believe this is referred to as a “key skew problem”, which I know is heavily dependent on the actual data being processed. Can anyone point me to any blog posts, white papers, etc. that might give me some options on how to deal with this issue? I don …

Re: Distributing Keys across Reducers

2012-07-20 Thread Christoph Schmitz
Hi Dave, I haven't actually done this in practice, so take this with a grain of salt ;-) One way to circumvent your problem might be to add entropy to the keys, i.e., if your keys are "a", "b" etc. and you got too many "a"s and too many "b"s, you could inflate your keys randomly to be (a, 1) …
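Christoph's salting idea can be sketched without any Hadoop dependencies. This is only an illustration: the class name, the salt fan-out of 4, and the 43-reducer count (taken from Dave's job) are assumptions, not code from the thread.

```java
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

public class KeySalting {
    static final int SALT_BUCKETS = 4; // hypothetical fan-out per hot key

    // Map side: append a random salt so one hot key spreads over several reducers.
    public static String saltKey(String key, Random rnd) {
        return key + "#" + rnd.nextInt(SALT_BUCKETS);
    }

    // Partitioner logic: the hash of the *salted* key picks the reduce task.
    public static int partition(String saltedKey, int numReducers) {
        return (saltedKey.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Reduce side (or a second, cheap job): strip the salt to recover the key.
    public static String unsalt(String saltedKey) {
        return saltedKey.substring(0, saltedKey.lastIndexOf('#'));
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        Set<Integer> partitions = new TreeSet<>();
        for (int i = 0; i < 1000; i++) {
            partitions.add(partition(saltKey("a", rnd), 43));
        }
        // The single hot key "a" now lands on SALT_BUCKETS different reducers.
        System.out.println("hot key 'a' spread over " + partitions.size() + " reducers");
    }
}
```

The trade-off Christoph hints at: each reducer now produces a partial result per salted key, so a second pass (or a cheap merge) is needed to combine the (a, 1)…(a, N) partials back into one result for "a".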

Re: Distributing Keys across Reducers

2012-07-20 Thread David Rosenstrauch
On 07/20/2012 09:20 AM, Dave Shine wrote: I have a job that is emitting over 3 billion rows from the map to the reduce. The job is configured with 43 reduce tasks. A perfectly even distribution would amount to about 70 million rows per reduce task. However I actually got around 60 million f …

RE: Distributing Keys across Reducers

2012-07-20 Thread Dave Shine
Thanks Syed. I'm not using HBase, so I don't think this is related to my problem. - Dave Shine

RE: Distributing Keys across Reducers

2012-07-20 Thread Dave Shine
Thanks John. The key is my own WritableComparable object, and I have created custom Comparator, Partitioner, and KeyValueGroupingComparator. However, they are all pretty generic. The Key class has two properties, a boolean and a string. I'm grouping on just the string, but comparing on bo …
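The sort-versus-group split Dave describes can be sketched in plain Java, without Hadoop types. The class and field names below are made up for illustration; the point is that the grouping comparator (and typically the partitioner) only see the string, so every record with one hot string lands on a single reducer regardless of the boolean.

```java
import java.util.Comparator;

public class SkewedKeyDemo {
    // Plain-Java stand-in for the composite WritableComparable Dave describes.
    public static final class Key {
        public final boolean flag;
        public final String text;
        public Key(boolean flag, String text) { this.flag = flag; this.text = text; }
    }

    // Sort comparator: primary on the string, secondary on the boolean.
    public static final Comparator<Key> SORT = (x, y) -> {
        int c = x.text.compareTo(y.text);
        return (c != 0) ? c : Boolean.compare(x.flag, y.flag);
    };

    // Grouping comparator: the string alone, so both flag values of one string
    // arrive in a single reduce() call (with flag=false values sorted first).
    public static final Comparator<Key> GROUP = (x, y) -> x.text.compareTo(y.text);

    public static void main(String[] args) {
        Key a = new Key(false, "x"), b = new Key(true, "x");
        System.out.println(GROUP.compare(a, b)); // same reduce group
        System.out.println(SORT.compare(a, b));  // but a fixed order within it
    }
}
```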

Re: Distributing Keys across Reducers

2012-07-20 Thread Harsh J
Does applying a combiner make any difference? Or are these numbers with the combiner included? On Fri, Jul 20, 2012 at 8:46 PM, Dave Shine wrote: > Thanks John. > > The key is my own WritableComparable object, and I have created custom > Comparator, Partitioner, and KeyValueGroupingComparator. …
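Why Harsh asks: a combiner collapses each map task's (key, 1) pairs into per-task partial sums before the shuffle, so a hot key reaches its reducer as a handful of partials instead of millions of raw rows. A plain-Java sketch of that local aggregation (names are illustrative, not Hadoop API):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CombinerEffect {
    // The local aggregation a combiner performs on one map task's output:
    // repeated (key, 1) emissions collapse to a single (key, partialSum).
    public static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> partial = new HashMap<>();
        for (String k : mapOutputKeys) partial.merge(k, 1, Integer::sum);
        return partial;
    }

    public static void main(String[] args) {
        List<String> emitted = Arrays.asList("a", "a", "a", "b", "a", "b");
        Map<String, Integer> combined = combine(emitted);
        // Six shuffled rows shrink to two; the skew in row *count* disappears,
        // though the hot key still goes to a single reducer.
        System.out.println(emitted.size() + " rows -> " + combined.size() + " rows");
    }
}
```

Note a combiner only helps if the reduce operation is associative and commutative (sums, counts, max); it cannot fix skew in jobs whose reducers must see every raw value.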

Comparing input hdfs file to a distributed cache files

2012-07-20 Thread Shanu Sushmita
Hi, I am trying to solve a problem where I need to compute the frequencies of words from file1 as they occur in file2. For example, the text in file1 is: hadoop user hello world and the text in file2 is: hadoop user hello world hadoop hadoop hadoop user world world world hadoop user hello so the output sh …
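A minimal sketch of the usual approach: the small file's words are held in memory (as a mapper's setup() would do after reading them from the distributed cache) while the big file is streamed and counted. This is plain Java with made-up method names to show the logic, not Shanu's actual job.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CacheSideCount {
    // Simulates setup(): the small file1 is loaded into memory once per task.
    public static Set<String> loadVocabulary(String file1Contents) {
        return new HashSet<>(Arrays.asList(file1Contents.trim().split("\\s+")));
    }

    // Simulates map() over a split of the big file2: tokenize and count
    // only the words present in the cached vocabulary.
    public static Map<String, Integer> countOccurrences(Set<String> vocab, String file2Contents) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : file2Contents.trim().split("\\s+")) {
            if (vocab.contains(w)) counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Set<String> vocab = loadVocabulary("hadoop user hello world");
        Map<String, Integer> c = countOccurrences(vocab,
            "hadoop user hello world hadoop hadoop hadoop user world world world hadoop user hello");
        System.out.println(c);
    }
}
```

In a real job the map side emits (word, 1) and the reducers sum, so the 17 GB file is the MapReduce input and only the 3 MB file goes through the distributed cache; the reverse arrangement does not scale.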

Re: Comparing input hdfs file to a distributed cache files

2012-07-20 Thread Sriram Ramachandrasekaran
Hello, if I understand right, you are trying to run your MapReduce on files that you shared via the Distributed Cache. The Distributed Cache is not generally meant for that. It is available in case your MR job needs other reference files, archives, native libs, etc. that need to be shared across your clust …

Re: Comparing input hdfs file to a distributed cache files

2012-07-20 Thread Harsh J
SS, is your job not progressing at all (i.e., does it stay at zero progress for hours), or does it fail after zero progress? I'd try adding logging at various important points of the map/reduce functions to see what's taking it so long, or whether it's getting stuck on something. Logs are observable for …

RE: Distributing Keys across Reducers

2012-07-20 Thread Dave Shine
These numbers are with everything I laid out below. The job was running acceptably until a couple of days ago when a change increased the output of the Map phase by about 30%. I don't think there is anything special about those additional keys that would force them all into the same reducer.

Fail to start mapreduce tasks across nodes

2012-07-20 Thread Steve Sonnenberg
I have a 2-node Fedora system, and in cluster mode I have the following issue that I can't resolve. Hadoop 1.0.3. I'm running with the file:/// filesystem and invoking the simple 'grep' example: hadoop jar hadoop-examples-1.0.3.jar grep inputdir outputdir simple-pattern The initiator displays Error …

Re: Comparing input hdfs file to a distributed cache files

2012-07-20 Thread Shanu Sushmita
Thanks Sriram, many thanks for your prompt reply. I appreciate your inputs. I am trying to compare two files: file1 (3 MB) and file2 (17 GB). I need to read each line from file1, look for its occurrences in file2, and get the frequencies. I am new to Hadoop, so I am not sure if my approach is …

Re: Comparing input hdfs file to a distributed cache files

2012-07-20 Thread Shanu Sushmita
Thanks Harsh, I don't get any error message. It just halts at 0% map, 0% reduce forever. However, it works (though still slowly) if I reduce the size of the cache file to less than 100 KB, but it just halts if I increase the size of the cache file. Even 120 KB is not working :-( SS …

RE: Distributing Keys across Reducers

2012-07-20 Thread Tim Broberg
Just a thought, but can you deal with the problem with increased granularity by simply making the jobs smaller? If you have enough jobs, when one takes twice as long there will be plenty of other small jobs to employ the other nodes, right? - Tim

RE: Distributing Keys across Reducers

2012-07-20 Thread Dave Shine
Yes, that is a possibility, but it will take some significant rearchitecture. I was assuming that was what I was going to have to do, until I saw the key distribution problem and thought I might be able to buy some relief by addressing that. The job runs once per day, starting at 1:00 AM EDT. I …

RE: Lines missing from output files (0.20.205.0)

2012-07-20 Thread Berry, Matt
I was operating under the assumption that there is a 1:1 ratio between Reducers and OutputFormats, just as there is a 1:1 correspondence between partitions and Reducers. However, I found this isn't the case. I created a file from each output format, named in this manner: "OutputFormat_(Partition …

Re: Comparing input hdfs file to a distributed cache files

2012-07-20 Thread Shanu Sushmita
Is there any particular thing that I should look at in the logs? Here are a few things I saw at the beginning of the log file:
2012-07-20 11:00:03,483 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2012-07-20 11:00:03,522 INFO org.apache.hadoop.m …

Re: Fail to start mapreduce tasks across nodes

2012-07-20 Thread Steve Sonnenberg
Sorry, this is my first posting and I haven't gotten a copy nor any response. Could someone please respond if you are seeing this? Thanks, Newbie On Fri, Jul 20, 2012 at 12:36 PM, Steve Sonnenberg wrote: > I have a 2-node Fedora system and in cluster mode, I have the following > issue that I can' …

Re: Fail to start mapreduce tasks across nodes

2012-07-20 Thread Shanu Sushmita
Yes, we can see it :-) SS On 20 Jul 2012, at 12:15, Steve Sonnenberg wrote: Sorry this is my first posting and I haven't gotten a copy nor any response. Could someone please respond if you are seeing this? Thanks, Newbie On Fri, Jul 20, 2012 at 12:36 PM, Steve Sonnenberg wrote: I have a 2- …

Re: Fail to start mapreduce tasks across nodes

2012-07-20 Thread Harsh J
A 2-node cluster is a fully-distributed cluster and cannot use a file:/// FileSystem, as that's not a distributed filesystem (unless it's an NFS mount). This explains why some of your tasks aren't able to locate an earlier-written file under the /tmp dir that's probably available on the JT node alone, not …
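Following Harsh's point, a real multi-node cluster would normally point core-site.xml at HDFS rather than file:///. A sketch under assumptions: the hostname and port below are placeholders, and fs.default.name is the Hadoop 1.x property name for the default filesystem.

```
<!-- core-site.xml: use a distributed filesystem, not the local one -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>
```

With this in place, input and output paths resolve on HDFS, so every TaskTracker sees the same files instead of each node's private local disk.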