Hi,
I have run some large and small jobs and calculated the Total Shuffle Time
for each. I can see that the Total Shuffle Time is almost half of the
Total Time taken by the full job to complete.
My question here is: how can we decrease the Total Shuffle Time? And in
doing so, what effect would that have on the overall job?
It depends on the workload. Could you give us more details about
your job? In the general case, where the reducers are the bottleneck, there
are some tuning techniques, as follows:
1. Allocate more memory to reducers. This decreases the disk IO of reducers
when merging and running the reduce function.
2. Use a combine function (Combiner). It pre-aggregates map output locally,
so less data has to be shuffled to the reducers.
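To make point 2 concrete, here is a rough sketch in plain Java (not the MapReduce API; the class and data below are made up for illustration) of what a combiner does: it sums each mapper's (word, 1) records per key before the shuffle, so far fewer records cross the network.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {
    // Simulates one mapper's raw output: one (word, 1) record per token.
    static List<Map.Entry<String, Integer>> mapOutput(String[] tokens) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(Map.entry(t, 1));
        }
        return out;
    }

    // Combiner: locally sums the counts per key before the shuffle.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> records) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> r : records) {
            combined.merge(r.getKey(), r.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        String[] tokens = "a b a c b a a c a b".split(" ");
        List<Map.Entry<String, Integer>> raw = mapOutput(tokens);
        Map<String, Integer> combined = combine(raw);
        // Without a combiner, all 10 raw records would be shuffled;
        // with it, only the 3 distinct keys are.
        System.out.println("raw records: " + raw.size());
        System.out.println("shuffled records: " + combined.size());
        System.out.println("count for 'a': " + combined.get("a"));
    }
}
```

In the real MapReduce API you would just call `setCombinerClass(...)` on your job with a Reducer class whose operation is associative and commutative (summing counts qualifies), and the framework applies it to map output for you.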
Without knowing your exact workload, using a Combiner (if possible) as
Tsuyoshi recommended should decrease your total shuffle time. You can also
try compressing the map output so that there's less disk and network IO.
Here's an example configuration using Snappy:

conf.set("mapred.compress.map.output", "true");
conf.set("mapred.map.output.compression.codec",
    "org.apache.hadoop.io.compress.SnappyCodec");
hi Gaurav,
Can you tell me how you calculated the total shuffle time? Apart from
combiners and compression, you can also tune some shuffle/sort
parameters that might improve performance, though I am not sure exactly
which parameters to tweak. Please share if you come across some other
techniques; I am interested as well.
Hi,
Thanks for your replies. I will try the recommended suggestions and
provide feedback.
Abhi,
In the JobTracker Web UI -> Job Tracker History, go to the specific job,
then to the Reduce Task List, and enter the first reduce task attempt. There
you can see the start time, which is when the attempt began, along with a
Shuffle Finished timestamp for when its shuffle phase completed. The
difference between the two is that attempt's shuffle time.
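The arithmetic is then just summing (Shuffle Finished - Start Time) over the reduce attempts. A minimal sketch, assuming wall-clock timestamps in a "d-MMM-yyyy HH:mm:ss" format (the exact format and the sample values below are made up, not taken from a real job page):

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class ShuffleTime {
    public static void main(String[] args) {
        // Assumed timestamp format; adjust to whatever your UI actually shows.
        DateTimeFormatter fmt =
            DateTimeFormatter.ofPattern("d-MMM-yyyy HH:mm:ss", Locale.ENGLISH);
        // Hypothetical (start, shuffleFinished) pairs for three reduce attempts.
        String[][] attempts = {
            {"10-Jun-2013 10:00:05", "10-Jun-2013 10:03:20"},
            {"10-Jun-2013 10:00:07", "10-Jun-2013 10:02:55"},
            {"10-Jun-2013 10:00:06", "10-Jun-2013 10:04:10"},
        };
        long totalSeconds = 0;
        for (String[] a : attempts) {
            LocalDateTime start = LocalDateTime.parse(a[0], fmt);
            LocalDateTime shuffleDone = LocalDateTime.parse(a[1], fmt);
            // Per-attempt shuffle time, accumulated into the job total.
            totalSeconds += Duration.between(start, shuffleDone).getSeconds();
        }
        System.out.println("Total shuffle time: " + totalSeconds + "s");
    }
}
```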