Hi Jeff,
Thanks a lot for your explanation. It really helps me understand the
details of the job workflow.
Hi all,
Thanks a lot for your help. One more question: from the monitoring data I
see that iowait% is quite high. Do you think this is normal, given that a
lot of data is read and written, as well
The amount of data each reducer receives depends on the Partitioner. You can
think of the Partitioner as a hash function and each reducer as a bucket, so
you cannot expect every bucket to hold the same number of items.
A skewed data distribution will make a few reducers take much longer.
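To illustrate the bucket analogy above, here is a minimal sketch in Python. It is not Hadoop code: the function names and the toy byte-sum hash are made up for illustration (Hadoop's default HashPartitioner works from the key's hashCode() instead). It only shows that when one key dominates the input, one bucket ends up with most of the records.

```python
from collections import Counter

def hash_partition(key: str, num_reducers: int) -> int:
    """Pick a reducer bucket for a key (hypothetical toy partitioner)."""
    # Toy deterministic hash: sum of the key's UTF-8 bytes.
    # This stands in for a real hash function and is an assumption.
    return sum(key.encode("utf-8")) % num_reducers

def bucket_sizes(keys, num_reducers):
    """Count how many records land in each reducer bucket."""
    counts = Counter(hash_partition(k, num_reducers) for k in keys)
    return [counts.get(i, 0) for i in range(num_reducers)]

# Skewed input: a single hot key accounts for 90 of 100 records.
keys = ["hot"] * 90 + ["k%d" % i for i in range(10)]
sizes = bucket_sizes(keys, 4)
hot_bucket = hash_partition("hot", 4)
print(sizes, "hot key goes to bucket", hot_bucket)
```

All records with the same key must go to the same reducer, so no choice of hash function can spread the hot key across buckets.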
2010/6/18 李钰 :
> Hi Jeff and
Hi Jeff and Amogh,
Thanks for your comments! In my understanding, in the partitioning phase
before spilling to disk, the threads divide the data into partitions
corresponding to the number of reducers, as described in the Definitive
Guide. So I think the scale of input data should be the sam
>>Since the scale of input data and operations of each reduce task is the same,
>>what may cause the execution time of reduce tasks different?
You should consider looking at the copy, shuffle, and reduce times separately
in the JobTracker (JT) UI to get better information. Many (dynamic) considerations like network
The input of each reducer is not the same; it depends on the input data
distribution and the Partitioner.
The running time of each reducer consists of three phases: copy,
sort, and reduce.
2010/6/18 李钰 :
> Hi Todd and Jeff,
>
> Thanks a lot for your discussion, it's really helpful to me. I'd like to
>
Hi Todd and Jeff,
Thanks a lot for your discussion; it's really helpful to me. I'd like to
express my special appreciation for Todd's patient explanation, which helped me
see the working mechanism of SORT more clearly. And Jeff, thank you very much
for reminding me that sort uses TotalOrderPartiti
On Thu, Jun 17, 2010 at 9:37 AM, Jeff Zhang wrote:
> Todd,
>
> Why is there sorting in the map task? The sorting there seems useless in my
> opinion.
>
>
For map-only jobs there isn't. For jobs with reduce, typically the number of
reduce tasks is smaller than the number of map tasks, so parallelizing
Todd,
Why is there sorting in the map task? The sorting there seems useless in my opinion.
On Thu, Jun 17, 2010 at 9:26 AM, Todd Lipcon wrote:
> On Thu, Jun 17, 2010 at 12:43 AM, Jeff Zhang wrote:
>
>> Your understanding of Sort is not right. The key concept of Sort is
>> the TotalOrderPartitione
On Thu, Jun 17, 2010 at 12:43 AM, Jeff Zhang wrote:
> Your understanding of Sort is not right. The key concept of Sort is
> the TotalOrderPartitioner. Before the MapReduce job starts, the client
> side samples the input data to estimate its distribution.
> The mapper does n
Hi Jeff,
Thank you very much for your reply. It really helps! I'll take a careful look at
TotalOrderPartitioner.
BTW, what's your opinion of where the bottleneck lies in SORT, and which
parameters impact the performance of SORT most? Looking forward to your
reply, thanks.
Dear all,
Any other comm
Your understanding of Sort is not right. The key concept of Sort is
the TotalOrderPartitioner. Before the MapReduce job starts, the client
side samples the input data to estimate its distribution.
The mapper does nothing, and each reducer fetches its
data according to the TotalOr
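The sample-then-range-partition idea described above can be sketched roughly as follows. This is a plain Python illustration, not Hadoop's actual implementation: the helper names, the sample size, and the way split points are picked are all assumptions. It shows why no global merge is needed once each reducer sorts its own key range.

```python
import bisect
import random

def choose_split_points(sample, num_reducers):
    """Pick num_reducers - 1 boundary keys from a sorted sample (hypothetical helper)."""
    ordered = sorted(sample)
    step = len(ordered) / num_reducers
    return [ordered[int(step * i)] for i in range(1, num_reducers)]

def range_partition(key, split_points):
    """Return the index of the reducer whose key range contains `key`."""
    return bisect.bisect_right(split_points, key)

rng = random.Random(0)
data = [rng.randint(0, 10_000) for _ in range(1000)]

# "Client side": sample the input to estimate the key distribution.
sample = rng.sample(data, 100)
splits = choose_split_points(sample, 4)

# Route every key to a reducer by range; the map side does no other work.
buckets = [[] for _ in range(4)]
for key in data:
    buckets[range_partition(key, splits)].append(key)

# Each reducer sorts its own range; concatenating the outputs in
# reducer order then gives a totally ordered result.
total_order = [k for b in buckets for k in sorted(b)]
```

How evenly the buckets fill depends on how well the sample reflects the real distribution, which is why the sampling step matters for reducer balance.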
Hi all,
I'm doing some tuning of the Hadoop sort benchmark. To be more specific, I'm
running tests against the org.apache.hadoop.examples.Sort class. From looking
through the source code, I think the map tasks are responsible for
sorting the input data, and the reduce tasks just merge the map outp