About MapTask.java

2011-02-24 Thread Dongwon Kim
Hi,

 

I want to know how "MapTask.java" is implemented, especially
"MapOutputBuffer" class defined in "MapTask.java".

I've been trying to read "MapTask.java" after reading some references such
as "Hadoop: The Definitive Guide" and
"http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html", but
it's quite tough to read the code directly without detailed comments.

 

As far as I know, each intermediate (key, value) pair generated by the
user-defined map function is written by the "MapOutputBuffer" class
defined in "MapTask.java" through a call to MapOutputBuffer.collect().

However, I can't understand what each variable defined in "MapOutputBuffer"
means.

What I've understood is as follows (* please correct any misunderstanding): 

- The byte buffer "kvbuffer" is where each actual (partition, key, value)
triple is written.

- The integer array "kvindices" is the so-called "accounting buffer"; every
three consecutive elements hold the indices of the corresponding triple in
"kvbuffer".

- Another integer array, "kvoffsets", contains the indices of the triples in
"kvindices".

- "kvstart", "kvend", and "kvindex" are used to point into "kvoffsets".

- "bufstart", "bufend", "bufvoid", "bufindex", and "bufmark" are used to
point into "kvbuffer".
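
The layout described in the list above can be sketched as a toy model. This is
a simplified illustration, not the actual Hadoop code: the class name, the
fixed array sizes, the String keys and values, and the keyAt() helper are all
assumptions made for the example, and wrap-around and spilling are left out.
It follows the comment on "kvindices" quoted later in this message
("partition, k/v offsets into kvbuffer"), so the partition number is kept in
"kvindices" and only the key and value bytes go into "kvbuffer".

```java
import java.nio.charset.StandardCharsets;

/** Toy model of kvbuffer / kvindices / kvoffsets; not the real MapOutputBuffer. */
class AccountingSketch {
    // Positions inside one three-int accounting entry (names are assumptions).
    static final int PARTITION = 0, KEYSTART = 1, VALSTART = 2, ACCTSIZE = 3;

    final byte[] kvbuffer = new byte[1024];         // serialized key/value bytes
    final int[] kvindices = new int[16 * ACCTSIZE]; // partition, key offset, value offset
    final int[] kvoffsets = new int[16];            // one index into kvindices per record
    int bufindex = 0; // next free byte in kvbuffer
    int kvindex = 0;  // next free slot in kvoffsets

    /** Appends one record. No wrap-around or spill handling in this sketch. */
    void collect(String key, String value, int partition) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        byte[] v = value.getBytes(StandardCharsets.UTF_8);
        int keystart = bufindex;
        System.arraycopy(k, 0, kvbuffer, bufindex, k.length);
        bufindex += k.length;
        int valstart = bufindex;
        System.arraycopy(v, 0, kvbuffer, bufindex, v.length);
        bufindex += v.length;

        int ind = kvindex * ACCTSIZE;  // where this record's accounting entry starts
        kvoffsets[kvindex] = ind;
        kvindices[ind + PARTITION] = partition;
        kvindices[ind + KEYSTART] = keystart;
        kvindices[ind + VALSTART] = valstart;
        kvindex++;
    }

    /** Reads the key of record i by following kvoffsets -> kvindices -> kvbuffer. */
    String keyAt(int i) {
        int ind = kvoffsets[i];
        int ks = kvindices[ind + KEYSTART];
        int vs = kvindices[ind + VALSTART];
        return new String(kvbuffer, ks, vs - ks, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        AccountingSketch buf = new AccountingSketch();
        buf.collect("apple", "1", 0);
        buf.collect("pear", "1", 1);
        System.out.println(buf.keyAt(1)); // prints pear
    }
}
```

The extra level of indirection means a sort only has to reorder the small
entries of "kvoffsets"; the serialized bytes in "kvbuffer" never move.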

 

What I can't understand are the comments next to the variable definitions.

== definitions of some variables ==

private volatile int kvstart = 0;  // marks beginning of *spill*
private volatile int kvend = 0;    // marks beginning of *collectable*
private int kvindex = 0;           // marks end of *collected*
private final int[] kvoffsets;     // indices into kvindices
private final int[] kvindices;     // partition, k/v offsets into kvbuffer
private volatile int bufstart = 0; // marks beginning of *spill*
private volatile int bufend = 0;   // marks beginning of *collectable*
private volatile int bufvoid = 0;  // marks the point where we should stop
                                   // reading at the end of the buffer
private int bufindex = 0;          // marks end of *collected*
private int bufmark = 0;           // marks end of *record*
private byte[] kvbuffer;           // main output buffer


==

 

Q1)

What do the terms "spill", "collectable", and "collected" mean?

My guess is that, because map outputs continue to be written to the buffer
while a spill takes place, there must be at least two pointers: one marking
where new map outputs are written and one marking up to where data is being
spilled. But I don't know exactly what "spill", "collectable", and
"collected" mean.
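
One possible reading of the three terms, offered as a toy sketch and not as
the actual Hadoop code: "spill" is the region of records currently handed
over to be written to disk, "collectable" is the space where new records may
go, and "collected" is what has been written so far. The class and method
names are hypothetical, the ring holds plain ints instead of serialized
records, and everything runs in a single thread, whereas the real buffer is
shared between the collecting thread and a spill thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy ring buffer with the three pointers; a sketch under stated assumptions. */
class SpillPointersSketch {
    final int[] slots;
    int kvstart = 0; // beginning of the region being spilled
    int kvend = 0;   // end of the spill region, i.e. beginning of collectable space
    int kvindex = 0; // end of collected records (next slot to write)

    SpillPointersSketch(int capacity) { slots = new int[capacity]; }

    /** In this model a spill is in progress whenever kvstart and kvend differ. */
    boolean spillInProgress() { return kvstart != kvend; }

    /** Writes one record; returns false if that would overwrite unspilled data. */
    boolean collect(int record) {
        int next = (kvindex + 1) % slots.length;
        if (next == kvstart) return false; // ring is full
        slots[kvindex] = record;
        kvindex = next;
        return true;
    }

    /** Freezes everything collected so far as the spill region. */
    void startSpill() {
        if (spillInProgress()) throw new IllegalStateException("spill already running");
        kvend = kvindex;
    }

    /** "Writes out" the region [kvstart, kvend) and releases it for reuse. */
    List<Integer> finishSpill() {
        List<Integer> out = new ArrayList<>();
        for (int i = kvstart; i != kvend; i = (i + 1) % slots.length) {
            out.add(slots[i]);
        }
        kvstart = kvend; // spill region is now empty
        return out;
    }

    public static void main(String[] args) {
        SpillPointersSketch buf = new SpillPointersSketch(8);
        buf.collect(1); buf.collect(2); buf.collect(3);
        buf.startSpill();                      // records 1..3 form the spill region
        buf.collect(4);                        // collection continues past kvend
        System.out.println(buf.finishSpill()); // prints [1, 2, 3]
    }
}
```

Under this reading, the "buf*" variables would play the same roles for the
byte buffer that the "kv*" variables play for the accounting arrays.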

 

Q2)

Is it efficient to partition data first and then sort records inside each
partition?

Is this done to avoid expensive pairwise key comparisons?

 

Q3)

Are there any documents containing explanations about how such internal
classes are implemented? 

 

Thanks,



eastcirclek

 

 



Re: About MapTask.java

2011-02-24 Thread Harsh J
Hey,

On Thu, Feb 24, 2011 at 6:26 PM, Dongwon Kim wrote:
> I've been trying to read "MapTask.java" after reading some references such
> as "Hadoop definitive guide" and
> "http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html", but
> it's quite tough to directly read the code without detailed comments.

Perhaps you can add some after getting things cleared ;-)

> Q2)
>
> Is it efficient to partition data first and then sort records inside each
> partition?
>
> Is this done to avoid expensive pairwise key comparisons?

Typically you would only want sorting done inside a partitioned set,
since all of the different partitions are sent off to different
reducers. Total-order partitioning may be an exception here, perhaps.
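
To make that concrete, here is a minimal sketch of sorting by partition first
and by key second. It is only an illustration: the Rec class, the
partitionFor() helper, and the String keys are assumptions made for the
example, not the framework's own types.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: order map outputs by (partition, key) so each reducer's data is contiguous. */
class PartitionSortSketch {
    /** Hypothetical record holding only what the sort needs. */
    static final class Rec {
        final int partition;
        final String key;
        Rec(int partition, String key) { this.partition = partition; this.key = key; }
    }

    /** Hash-style partitioning; the exact formula is an assumption of this sketch. */
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    static List<Rec> sortForSpill(List<String> keys, int numReducers) {
        List<Rec> recs = new ArrayList<>();
        for (String k : keys) {
            recs.add(new Rec(partitionFor(k, numReducers), k));
        }
        // Compare the cheap int partition first; keys are compared only
        // when two records belong to the same partition.
        recs.sort((a, b) -> a.partition != b.partition
                ? Integer.compare(a.partition, b.partition)
                : a.key.compareTo(b.key));
        return recs;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("b", "a", "d", "c");
        for (Rec r : sortForSpill(keys, 2)) {
            System.out.println(r.partition + "\t" + r.key);
        }
    }
}
```

After such a sort, every partition is one contiguous, key-sorted run that can
be written out for its reducer, and keys from different partitions are never
compared with each other.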

> Q3)
>
> Are there any documents containing explanations about how such internal
> classes are implemented?

There's a very good presentation you may want to see, covering the
spill/shuffle/sort portions of the framework that you're asking about:
http://www.slideshare.net/hadoopusergroup/ordered-record-collection

HTH :)

-- 
Harsh J
www.harshj.com