Re: slow shuffle
On Dec 5, 2008, at 2:43 PM, Songting Chen wrote:

> To summarize the slow shuffle issue:
>
> 1. I think one problem is that the Reducer starts very late in the
> process, slowing the entire job significantly. Is there a way to let
> the reducer start earlier?

http://issues.apache.org/jira/browse/HADOOP-3136 should help you there; it's pretty close to getting into 0.20.

Arun

> 2. Copying 300 files of 30K each took 3 minutes in total (after all
> maps finished). What happens behind the scenes here really puzzles me.
> (Note that sorting takes only 1 sec.)
>
> Thanks,
> -Songting
Re: slow shuffle
That's cool.

An update on issue 2: I accidentally changed the number of reducers to 1 (from 3), and the problem is gone! That one reducer overlaps well with the maps and copies the 300 small map outputs pretty fast. So when there is a large number of small map outputs, I'd use only 1 reducer. (That doesn't really make sense; there may be some internal code issue.)

Thanks,
-Songting

--- On Sat, 12/6/08, Arun C Murthy wrote:

> On Dec 5, 2008, at 2:43 PM, Songting Chen wrote:
>
>> 1. I think one problem is that the Reducer starts very late in the
>> process, slowing the entire job significantly. Is there a way to let
>> the reducer start earlier?
>
> http://issues.apache.org/jira/browse/HADOOP-3136 should help you
> there; it's pretty close to getting into 0.20.
>
> Arun
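For reference, forcing a single reduce task as described above can be expressed in the job configuration. This is only a sketch using the standard 0.x property name; calling setNumReduceTasks(1) on the JobConf in the job driver is equivalent:

```xml
<!-- Run the job with exactly one reduce task. -->
<property>
  <name>mapred.reduce.tasks</name>
  <value>1</value>
</property>
```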
Re: slow shuffle
These configuration options will be useful:

<property>
  <name>mapred.job.shuffle.merge.percent</name>
  <value>0.66</value>
  <description>The usage threshold at which an in-memory merge will be
  initiated, expressed as a percentage of the total memory allocated to
  storing in-memory map outputs, as defined by
  mapred.job.shuffle.input.buffer.percent.</description>
</property>

<property>
  <name>mapred.job.shuffle.input.buffer.percent</name>
  <value>0.70</value>
  <description>The percentage of memory to be allocated from the maximum
  heap size to storing map outputs during the shuffle.</description>
</property>

<property>
  <name>mapred.job.reduce.input.buffer.percent</name>
  <value>0.0</value>
  <description>The percentage of memory, relative to the maximum heap
  size, to retain map outputs during the reduce. When the shuffle is
  concluded, any remaining map outputs in memory must consume less than
  this threshold before the reduce can begin.</description>
</property>

How long did the shuffle take relative to the rest of the job?

Alex

On Fri, Dec 5, 2008 at 11:17 AM, Songting Chen wrote:

> We encountered a bottleneck during the shuffle phase. However, there
> is not much data to be shuffled across the network at all: less than
> 10 MB in total (the combiner aggregated most of the data). Are there
> any parameters or anything we can tune to improve the shuffle
> performance?
>
> Thanks,
> -Songting
Re: slow shuffle
It takes 50% of the total time.

--- On Fri, 12/5/08, Alex Loddengaard wrote:

> These configuration options will be useful: [...]
>
> How long did the shuffle take relative to the rest of the job?
>
> Alex
Re: slow shuffle
A little more information: we optimized our Map process quite a bit, so that the shuffle has now become the bottleneck.

1. There are 300 map tasks (one per 128 MB block), each taking about 13 sec.
2. The Reducer starts running at a very late stage (when 80% of the maps are done).
3. Copying the 300 map outputs (the shuffle) takes as long as the entire map phase, even though each map output is only about 50 KB.

--- On Fri, 12/5/08, Alex Loddengaard wrote:

> These configuration options will be useful: [...]
>
> How long did the shuffle take relative to the rest of the job?
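As a rough sanity check of the timeline described above: with the 4 nodes mentioned in the thread and the Hadoop 0.x default of 2 map slots per node (an assumption; the actual slot count isn't stated), the 300 maps run in waves:

```python
# Figures from the thread: 300 map tasks at ~13 s each on 4 nodes.
# map_slots_per_node = 2 is an assumed default, not stated in the thread.
maps, secs_per_map = 300, 13
nodes, map_slots_per_node = 4, 2

slots = nodes * map_slots_per_node            # 8 maps run at a time
waves = -(-maps // slots)                     # ceiling division: 38 waves
map_phase_secs = waves * secs_per_map         # 494 s, a bit over 8 minutes

print(waves, map_phase_secs)
```

If the shuffle really takes about as long again, the job spends roughly another eight minutes copying about 15 MB of map output, which points at per-copy scheduling overhead rather than network bandwidth.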
Re: slow shuffle
We have 4 testing data nodes, with 3 reduce tasks. The parallel.copies parameter has been increased to 20, 30, even 50, but it doesn't really help...

--- On Fri, 12/5/08, Aaron Kimball wrote:

> How many reduce tasks do you have? Look into increasing
> mapred.reduce.parallel.copies from the default of 5 to something more
> like 20 or 30.
>
> - Aaron
>
> On Fri, Dec 5, 2008 at 10:00 PM, Songting Chen wrote:
>
>> A little more information: we optimized our Map process quite a bit,
>> so that the shuffle has now become the bottleneck. [...]
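Aaron's suggestion can be applied per job through the job configuration. A sketch; the property name and its default of 5 are as described in the thread:

```xml
<!-- Per-job override of the number of parallel shuffle copier threads
     each reduce task uses (default 5, per the thread above). -->
<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>20</value>
</property>
```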
Re: slow shuffle
I think one of the issues is that the Reducer starts very late in the process, slowing the entire job significantly. Is there a way to let the reducer start earlier?

--- On Fri, 12/5/08, Songting Chen wrote:

> We have 4 testing data nodes, with 3 reduce tasks. The parallel.copies
> parameter has been increased to 20, 30, even 50, but it doesn't really
> help...
Re: slow shuffle
To summarize the slow shuffle issue:

1. I think one problem is that the Reducer starts very late in the process, slowing the entire job significantly. Is there a way to let the reducer start earlier?
2. Copying 300 files of 30K each took 3 minutes in total (after all maps finished). What happens behind the scenes here really puzzles me. (Note that sorting takes only 1 sec.)

Thanks,
-Songting

--- On Fri, 12/5/08, Songting Chen wrote:

> We have 4 testing data nodes, with 3 reduce tasks. The parallel.copies
> parameter has been increased to 20, 30, even 50, but it doesn't really
> help...
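A quick back-of-envelope on the numbers in point 2 of the summary (300 outputs of about 30 KB each, copied in about 3 minutes) shows how little of that time can be actual data transfer:

```python
# Figures taken from the thread: 300 map outputs, ~30 KB each,
# ~3 minutes to copy them all once the maps had finished.
num_outputs = 300
bytes_per_output = 30 * 1024
copy_seconds = 3 * 60

total_kb = num_outputs * bytes_per_output / 1024   # 9000 KB, under 9 MB
throughput_kb_s = total_kb / copy_seconds          # 50 KB/s effective
seconds_per_fetch = copy_seconds / num_outputs     # 0.6 s per map output

print(throughput_kb_s, seconds_per_fetch)
```

An effective 50 KB/s across four data nodes suggests the cost of scheduling each fetch dominates over bandwidth, which is consistent with the single-reducer observation reported earlier in the thread.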