How to reduce total shuffle time

2012-08-28 Thread Gaurav Dasgupta
Hi, I have run some large and small jobs and calculated the Total Shuffle Time for each. I can see that the Total Shuffle Time is almost half of the total time the full job took to complete. My question is: how can we decrease the Total Shuffle Time? And doing so, what w

Re: How to reduce total shuffle time

2012-08-28 Thread Tsuyoshi OZAWA
It depends on the workload. Could you tell us more about your job's specifications? In the general case where reducers are the bottleneck, there are some tuning techniques, as follows: 1. Allocate more memory to reducers. This decreases the reducers' disk I/O when merging and running reduce functions. 2. Use combine fu
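
To illustrate why a combiner can shrink the shuffle, here is a minimal, Hadoop-free sketch. It assumes a word-count style job whose reduce step is a plain sum, which is commutative and associative and therefore safe to pre-aggregate on the map side. The class and method names (CombinerSketch, mapOutput, combine) are hypothetical and are not part of the Hadoop API; a real job would instead register a Reducer implementation as its combiner.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: simulates map output with and without a combiner.
class CombinerSketch {

    // Simulates one map task of a word count: emits one (word, 1) record per token.
    static List<Map.Entry<String, Integer>> mapOutput(String line) {
        List<Map.Entry<String, Integer>> records = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                records.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return records;
    }

    // Simulates the combiner: sums the counts per key before anything is shuffled.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> records) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> record : records) {
            combined.merge(record.getKey(), record.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> raw = mapOutput("to be or not to be");
        Map<String, Integer> combined = combine(raw);
        System.out.println("records shuffled without combiner: " + raw.size());
        System.out.println("records shuffled with combiner:    " + combined.size());
    }
}
```

For the sample line, six raw records collapse to four combined records. The saving grows with the number of repeated keys per map task, so a combiner may help little when keys are mostly unique.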

Re: How to reduce total shuffle time

2012-08-28 Thread Minh Duc Nguyen
Without knowing your exact workload, using a Combiner (if possible) as Tsuyoshi recommended should decrease your total shuffle time. You can also try compressing the map output so that there's less disk and network IO. Here's an example configuration using Snappy: conf.set("mapred.compress.map.o
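
As a rough sketch of the settings involved in map output compression, the following Hadoop-free Java collects the two Hadoop 1.x-era property names usually associated with it and renders them as generic command-line options. This is not a reconstruction of the snippet quoted above, which is cut off in the archive. The class and method names are hypothetical, the property names were renamed in later Hadoop releases, and the -D form assumes the job is launched through ToolRunner/GenericOptionsParser and that the Snappy native library is installed on the cluster.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: gathers map-output compression settings without needing Hadoop on the classpath.
class MapOutputCompressionSketch {

    // Hadoop 1.x-era property names (assumption: the cluster runs a release that still uses them).
    static final String COMPRESS_MAP_OUTPUT = "mapred.compress.map.output";
    static final String MAP_OUTPUT_CODEC = "mapred.map.output.compression.codec";
    static final String SNAPPY_CODEC = "org.apache.hadoop.io.compress.SnappyCodec";

    // Returns the settings in insertion order so the rendered output is stable.
    static Map<String, String> snappyMapOutputSettings() {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put(COMPRESS_MAP_OUTPUT, "true");
        settings.put(MAP_OUTPUT_CODEC, SNAPPY_CODEC);
        return settings;
    }

    // Renders the settings as "-D key=value" pairs for the hadoop command line.
    static String toGenericOptions(Map<String, String> settings) {
        StringBuilder options = new StringBuilder();
        for (Map.Entry<String, String> setting : settings.entrySet()) {
            if (options.length() > 0) {
                options.append(' ');
            }
            options.append("-D ").append(setting.getKey()).append('=').append(setting.getValue());
        }
        return options.toString();
    }

    public static void main(String[] args) {
        System.out.println(toGenericOptions(snappyMapOutputSettings()));
    }
}
```

Inside a job driver, each key/value pair would instead be passed to conf.set(key, value) before the job is submitted.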

Re: How to reduce total shuffle time

2012-08-28 Thread abhiTowson cal
Hi Gaurav, can you tell me how you calculated the total shuffle time? Apart from combiners and compression, you can also use some shuffle-sort parameters that might improve performance; I am not sure exactly which parameters to tweak. Please share if you come across any other techniques, I am

Re: How to reduce total shuffle time

2012-08-28 Thread Gaurav Dasgupta
Hi, Thanks for your replies. I will try the recommended suggestions and provide feedback. Abhi, in the JobTracker Web UI -> Job Tracker History, go to the specific job. Go to the Reduce Task List. Open the first reduce task attempt. There you can see the start time. It is the time when