Well, the reduce stage will normally take longer than the map stage, because 
the copy/shuffle/sort all happen at that time, and they are the hard part.
But before we simply say that is part of life, you need to dig deeper into 
your MR job to find out whether you can make it faster.
You are the person most familiar with your data, and you wrote the code to 
group/partition it and send it to the reducers. Even if you set up 255 
reducers, the question is: does each of them get its fair share? You need to 
read the COUNTER information of each reducer and find out how many reduce 
groups each reducer gets, how many input bytes it gets, etc.
Simple example: if you send 200G of data grouped by DATE, and all of it 
belongs to 2 days with one day holding 90% of the data, then giving the job 
255 reducers won't help. Only 2 reducers will consume data, one of them will 
consume 90% of it and take a very long time to finish, which WILL delay the 
whole MR job, while the rest of the reducers finish within seconds. In this 
case you may need to rethink what your key should be, and make sure each 
reducer gets its fair share of the data volume.
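To make the skew concrete, here is a small illustrative sketch in plain Java 
(not Hadoop code; the dates and byte counts are made up): with a DATE grouping 
key and only 2 distinct dates, at most 2 of the 255 partitions ever receive 
data, no matter how many reducers you configure.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: with 2 distinct grouping keys, at most 2 of the 255
// partitions are ever used, regardless of the configured reducer count.
public class SkewDemo {
    // Same partitioning scheme as the partitioner in this thread.
    static int partition(String groupKey, int numReducers) {
        return (groupKey.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 255;
        Map<Integer, Long> gigsPerReducer = new HashMap<>();
        // Pretend 90% of a 200G job belongs to one DATE key.
        gigsPerReducer.merge(partition("2013-08-29", numReducers), 180L, Long::sum);
        gigsPerReducer.merge(partition("2013-08-30", numReducers), 20L, Long::sum);
        // At most 2 reducers get any input; the other 253 finish instantly.
        System.out.println("reducers with data: " + gigsPerReducer.size());
    }
}
```

The counters will show exactly this pattern: a couple of reducers with huge 
input byte counts, the rest with nothing.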
After the above fix (in fact, it normally solves 90% of reducer performance 
problems, especially since with 255 reduce tasks available each one will on 
average only get about 1G of data -- good, as your huge cluster only needs to 
process 256G :-), if you want to make it even faster, check your code. Do you 
have to use String.compareTo()? Is it slow? Google "hadoop rawcomparator" to 
see if you can do something here.
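The RawComparator idea is to compare keys on their serialized bytes instead 
of deserializing each one into a String object first. A real implementation 
would extend Hadoop's WritableComparator and override its byte-array 
compare(); here is just the core idea sketched in plain Java, with no Hadoop 
dependencies (key values are made up):

```java
import java.nio.charset.StandardCharsets;

// Sketch of the idea behind a Hadoop RawComparator: compare keys on their
// serialized bytes instead of deserializing each into a String first.
// (A real one would extend org.apache.hadoop.io.WritableComparator.)
public class RawCompareSketch {
    // Unsigned lexicographic byte comparison, in the spirit of
    // WritableComparator.compareBytes.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // For ASCII keys, byte order agrees with String.compareTo order,
        // so no String objects need to be created per comparison.
        byte[] k1 = "2013-08-29|Atlanta".getBytes(StandardCharsets.UTF_8);
        byte[] k2 = "2013-08-30|Boston".getBytes(StandardCharsets.UTF_8);
        System.out.println(compareBytes(k1, k2) < 0);
    }
}
```

During the sort/merge phases each key pair is compared many times, so 
avoiding a deserialize-per-comparison can matter a lot at 800M+ records.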
After that, if you still think the reduce stage is slow, check your cluster 
system. Do the reducers spend most of their time in the copy phase, the sort 
phase, or in your reducer class? Find out where the time goes, then identify 
the solution.
Yong

Date: Fri, 30 Aug 2013 11:02:05 -0400
Subject: Re: secondary sort - number of reducers
From: adeelmahm...@gmail.com
To: user@hadoop.apache.org



my secondary sort on multiple keys seems to work fine with smaller data sets, 
but with bigger data sets (like 256 gig and 800M+ records) the map phase gets 
done pretty quickly (about 15 mins) but then the reduce phase seems to take 
forever. I am using 255 reducers.

basic idea is that my composite key has both group and sort keys in it, which 
I parse in the appropriate comparator classes to perform grouping and sorting 
.. my thinking is that the mappers are where most of the work is done:
1. mapper itself (create composite key and value)
2. records sorting
3. partitioner
if all this gets done in 15 mins then the reducer has the simple task of:
1. grouping comparator
2. reducer itself (simply output records)
and should take less time than the mappers .. instead it essentially gets 
stuck in the reduce phase .. im gonna paste my code here to see if anything 
stands out as a fundamental design issue

//////PARTITIONER
public int getPartition(Text key, HCatRecord record, int numReduceTasks) {
        //extract the group key from composite key
        String groupKey = key.toString().split("\\|")[0];
        return (groupKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}

////////////GROUP COMPARATOR
public int compare(WritableComparable a, WritableComparable b) {
        //extract and compare the group keys of two Text objects
        String thisGroupKey = ((Text) a).toString().split("\\|")[0];
        String otherGroupKey = ((Text) b).toString().split("\\|")[0];
        return thisGroupKey.compareTo(otherGroupKey);
}


////////////SORT COMPARATOR is similar to the group comparator; it runs in 
the map phase and gets done quickly


//////////REDUCER
public void reduce(Text key, Iterable<HCatRecord> records, Context context) 
throws IOException, InterruptedException {
        log.info("in reducer for key " + key.toString());
        Iterator<HCatRecord> recordsIter = records.iterator();
        //we are only interested in the first record after sorting and grouping
        if(recordsIter.hasNext()){
                HCatRecord rec = recordsIter.next();
                context.write(nw, rec);
                log.info("returned record >> " + rec.toString());
        }
}


On Fri, Aug 30, 2013 at 9:24 AM, Adeel Qureshi <adeelmahm...@gmail.com> wrote:

yup it was negative and by doing this now it seems to be working fine


On Fri, Aug 30, 2013 at 3:09 AM, Shekhar Sharma <shekhar2...@gmail.com> wrote:


Is the hash code of that key negative?

Do something like this:

return (groupKey.hashCode() & Integer.MAX_VALUE) % numParts;
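For the archive: in Java, % can return a negative remainder, so 
groupKey.hashCode() % numParts can be negative for a key whose hash code is 
negative (hence the "Illegal partition ... (-1)" in the stack trace below). 
Masking off the sign bit first keeps the result in range; note the 
parentheses matter, since % binds tighter than &. A standalone check in plain 
Java, no Hadoop needed:

```java
public class PartitionCheck {
    // Mask the sign bit before the modulo so the partition is never negative.
    static int partition(String key, int numParts) {
        return (key.hashCode() & Integer.MAX_VALUE) % numParts;
    }

    public static void main(String[] args) {
        // The key from the stack trace; its hashCode happens to be negative.
        String key = "Atlanta:GA|Atlanta:GA:1:Adeel";
        System.out.println(key.hashCode() % 4);   // negative remainder
        System.out.println(partition(key, 4));    // always in [0, 4)
    }
}
```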



Regards,

Som Shekhar Sharma

+91-8197243810





On Fri, Aug 30, 2013 at 6:25 AM, Adeel Qureshi <adeelmahm...@gmail.com> wrote:

> okay so when i specify the number of reducers e.g. in my example i m using 4
> (for a much smaller data set) it works if I use a single column in my
> composite key .. but if I add multiple columns in the composite key
> separated by a delimiter .. it then throws the illegal partition error (keys
> before the pipe are group keys and after the pipe are the sort keys and my
> partitioner only uses the group keys)
>
> java.io.IOException: Illegal partition for Atlanta:GA|Atlanta:GA:1:Adeel (-1)
>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1073)
>         at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
>         at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>         at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:39)
>         at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:1)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
> public int getPartition(Text key, HCatRecord record, int numParts) {
>         //extract the group key from composite key
>         String groupKey = key.toString().split("\\|")[0];
>         return groupKey.hashCode() % numParts;
> }
>
>
> On Thu, Aug 29, 2013 at 8:31 PM, Shekhar Sharma <shekhar2...@gmail.com>
> wrote:
>>
>> No...the partitioner decides which keys should go to which reducer...and
>> the number of reducers you need to decide...No of reducers depends on
>> factors like number of key value pairs, use case etc
>> Regards,
>> Som Shekhar Sharma
>> +91-8197243810
>>
>>
>> On Fri, Aug 30, 2013 at 5:54 AM, Adeel Qureshi <adeelmahm...@gmail.com>
>> wrote:
>> > so it cant figure out an appropriate number of reducers as it does for
>> > mappers .. in my case hadoop is using 2100+ mappers and then only 1
>> > reducer .. since im overriding the partitioner class shouldnt that
>> > decide how many reducers there should be based on how many different
>> > partition values being returned by the custom partitioner
>> >
>> >
>> > On Thu, Aug 29, 2013 at 7:38 PM, Ian Wrigley <i...@cloudera.com> wrote:
>> >>
>> >> If you don't specify the number of Reducers, Hadoop will use the
>> >> default -- which, unless you've changed it, is 1.
>> >>
>> >> Regards
>> >>
>> >> Ian.
>> >>
>> >> On Aug 29, 2013, at 4:23 PM, Adeel Qureshi <adeelmahm...@gmail.com>
>> >> wrote:
>> >>
>> >> I have implemented secondary sort in my MR job and for some reason if i
>> >> dont specify the number of reducers it uses 1 which doesnt seem right
>> >> because im working with 800M+ records and one reducer slows things down
>> >> significantly. Is this some kind of limitation with the secondary sort
>> >> that it has to use a single reducer .. that kind of would defeat the
>> >> purpose of having a scalable solution such as secondary sort. I would
>> >> appreciate any help.
>> >>
>> >> Thanks
>> >> Adeel
>> >>
>> >>
>> >> ---
>> >> Ian Wrigley
>> >> Sr. Curriculum Manager
>> >> Cloudera, Inc
>> >> Cell: (323) 819 4075
>> >>
>> >
>


                                          
