Re: Performance tuning of sort

2010-06-17 Thread 李钰
Hi Jeff, Thanks a lot for your explanation. It really helps me understand the details of the job workflow. Hi all, Thanks a lot for your help. One more question: from the monitoring data I see that iowait% is quite high. Do you think this is normal, since there's a lot of data read and written, as well

Re: Performance tuning of sort

2010-06-17 Thread Jeff Zhang
The amount of data each reducer receives depends on the Partitioner. You can think of the Partitioner as a hash function and of each reducer as a bucket, so you cannot expect every bucket to hold the same number of items. A skewed data distribution will make a few reducers take much longer. 2010/6/18 李钰 : > Hi Jeff and
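
The bucket analogy above can be illustrated with a minimal Python sketch. This is not Hadoop's Partitioner API: the function names `partition` and `reducer_loads`, the use of CRC32 as the hash, and the sample keys are all assumptions made for illustration. It only shows that when one key dominates the input, the reducer that owns that key receives most of the records.

```python
import zlib
from collections import Counter


def partition(key: str, num_reducers: int) -> int:
    """Map a key to a reducer index, like a hash function choosing a bucket.

    zlib.crc32 is used (an illustrative choice) because Python's built-in
    hash() for str is randomized per process.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer_loads(keys, num_reducers):
    """Count how many records each reducer (bucket) would receive."""
    counts = Counter(partition(k, num_reducers) for k in keys)
    return [counts.get(r, 0) for r in range(num_reducers)]


# Hypothetical skewed input: one "hot" key makes up 90% of the records.
keys = ["hot"] * 900 + [f"key{i}" for i in range(100)]
loads = reducer_loads(keys, 4)
print(loads)  # one bucket holds at least 900 of the 1000 records
```

All records with the same key land in the same bucket, so adding reducers does not spread a hot key; only a different key distribution or a different partitioning scheme would.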

Re: Performance tuning of sort

2010-06-17 Thread 李钰
Hi Jeff and Amogh, Thanks for your comments! In my understanding, in the partitioning phase before spilling to disk, the threads divide the data into partitions corresponding to the number of reducers, as described in the Definitive Guide. So I think the scale of input data should be the sam

Re: Performance tuning of sort

2010-06-17 Thread Amogh Vasekar
>>Since the scale of input data and operations of each reduce task is the same, >>what may cause the execution time of reduce tasks different? You should consider looking at the copy, shuffle and reduce times separately in the JT UI to get better info. Many (dynamic) considerations like network

Re: Performance tuning of sort

2010-06-17 Thread Jeff Zhang
The input of each reducer is not the same; it depends on the input data distribution and the Partitioner. The running time of each reducer consists of three phases: copy, sort and reduce. 2010/6/18 李钰 : > Hi Todd and Jeff, > > Thanks a lot for your discussion, it's really helpful to me. I'd like to >

Re: Performance tuning of sort

2010-06-17 Thread 李钰
Hi Todd and Jeff, Thanks a lot for your discussion, it's really helpful to me. I'd like to express my special appreciation for Todd's patient explanation; you helped me see the working mechanism of SORT more clearly. And Jeff, thank you very much for reminding me that sort uses TotalOrderPartiti

Re: Performance tuning of sort

2010-06-17 Thread Todd Lipcon
On Thu, Jun 17, 2010 at 9:37 AM, Jeff Zhang wrote: > Todd, > > Why's there a sorting in map task, the sorting here seems useless in my > opinion. > > For map-only jobs there isn't. For jobs with reduce, typically the number of reduce tasks is smaller than the number of map tasks, so parallelizing
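
The snippet above is cut off, so the following is only a sketch under an assumption: that the point being made is that sorting inside each map task spreads the sorting work across many tasks, leaving each reduce task with a cheaper merge of already-sorted runs. The function names and the tiny input are hypothetical, and this is plain Python, not Hadoop code.

```python
import heapq


def map_side_sort(split):
    """Each map task sorts only its own input split (runs in parallel in a real job)."""
    return sorted(split)


def reduce_side_merge(sorted_runs):
    """The reduce task merges already-sorted runs instead of sorting from scratch."""
    return list(heapq.merge(*sorted_runs))


# Three hypothetical input splits, one per map task.
input_splits = [[9, 3, 7], [4, 8, 1], [6, 2, 5]]
runs = [map_side_sort(s) for s in input_splits]
merged = reduce_side_merge(runs)
print(merged)
```

`heapq.merge` relies on each run already being sorted, which is exactly what the map-side sort provides in this sketch.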

Re: Performance tuning of sort

2010-06-17 Thread Jeff Zhang
Todd, Why is there a sort in the map task? The sorting there seems useless to me. On Thu, Jun 17, 2010 at 9:26 AM, Todd Lipcon wrote: > On Thu, Jun 17, 2010 at 12:43 AM, Jeff Zhang wrote: > >> Your understanding of Sort is not right. The key concept of Sort is >> the TotalOrderPartitione

Re: Performance tuning of sort

2010-06-17 Thread Todd Lipcon
On Thu, Jun 17, 2010 at 12:43 AM, Jeff Zhang wrote: > Your understanding of Sort is not right. The key concept of Sort is > the TotalOrderPartitioner. Actually before the map-reduce job, client > side will do sampling of input data to estimate the distribution of > input data. And the mapper do n

Re: Performance tuning of sort

2010-06-17 Thread 李钰
Hi Jeff, Thank you very much for your reply. It really helps! I'll take a careful look at TotalOrderPartitioner. BTW, where do you think the bottleneck lies in SORT, and which parameters affect the performance of SORT most? Looking forward to your reply, thanks. Dear all, Any other comm

Re: Performance tuning of sort

2010-06-17 Thread Jeff Zhang
Your understanding of Sort is not right. The key concept of Sort is the TotalOrderPartitioner. Before the MapReduce job starts, the client samples the input data to estimate its distribution. The mappers do nothing, and each reducer fetches its data according to the TotalOr
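
The sample-then-partition idea described above can be sketched in a few lines of Python. This is not the TotalOrderPartitioner implementation: the helper names `choose_split_points` and `range_partition`, the sample size of 500, and the random integer keys are all assumptions made for illustration. It shows how split points estimated from a sample send each key to a reducer that owns a contiguous key range, so that concatenating the sorted reducer outputs gives a totally ordered result.

```python
import bisect
import random


def choose_split_points(sample, num_reducers):
    """Pick num_reducers - 1 cut points at evenly spaced quantiles of the sample."""
    ordered = sorted(sample)
    return [ordered[(i * len(ordered)) // num_reducers]
            for i in range(1, num_reducers)]


def range_partition(key, split_points):
    """Return the index of the reducer whose key range contains `key`."""
    return bisect.bisect_right(split_points, key)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
data = [rng.randint(0, 10**6) for _ in range(10_000)]

# "Client side" step: sample the input to estimate the key distribution.
sample = rng.sample(data, 500)
split_points = choose_split_points(sample, 4)

# Route every key to the reducer that owns its range.
buckets = [[] for _ in range(4)]
for key in data:
    buckets[range_partition(key, split_points)].append(key)

# Each reducer sorts its own range; concatenation is globally sorted.
total_order = [key for bucket in buckets for key in sorted(bucket)]
```

Because the cut points come from a sample, the buckets are only approximately equal in size; a larger sample generally gives a more even split.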

Performance tuning of sort

2010-06-17 Thread 李钰
Hi all, I'm doing some tuning of the sort benchmark of Hadoop. To be more specific, I'm running tests against the org.apache.hadoop.examples.Sort class. Looking through the source code, I think the map tasks are responsible for sorting the input data, and the reduce tasks just merge the map outp