Hi,

 

I'd like to understand how "MapTask.java" is implemented, especially the
"MapOutputBuffer" class defined in "MapTask.java".

I've been trying to read "MapTask.java" after going through some references
such as "Hadoop: The Definitive Guide" and
"http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html", but
it's quite tough to read the code directly without detailed comments.

 

As far as I know, each intermediate (key, value) pair generated by the
user-defined map function is written by the "MapOutputBuffer" class
defined in "MapTask.java" through a call to MapOutputBuffer.collect().

However, I can't understand what each variable defined in "MapOutputBuffer"
means.

What I've understood is as follows (* please correct any misunderstanding): 

- The byte buffer "kvbuffer" is where each actual (partition, key, value)
triple is written.

- An integer array "kvindices" serves as the "accounting buffer"; every three
elements of it hold the indices of the corresponding triple in "kvbuffer".

- Another integer array "kvoffsets" contains indices of triples in
"kvindices".

- "kvstart", "kvend", "kvindex" are used to point into "kvoffsets".

- "bufstart", "bufend", "bufvoid", "bufindex", "bufmark" are used to point
into "kvbuffer".
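
To make the layout above concrete, here is a minimal sketch of how I picture the three levels of indirection. It is my own simplified model, not the actual Hadoop code: the class name AccountingSketch, the fixed array sizes, the String keys/values, and the helper array valEnd are all made up for illustration, and there is no wrap-around or spilling. I am also assuming, going by the comment on "kvindices", that the partition number is kept in "kvindices" and only the key and value bytes go into "kvbuffer".

```java
import java.nio.charset.StandardCharsets;

// Simplified model of kvbuffer / kvindices / kvoffsets (NOT the real MapOutputBuffer).
class AccountingSketch {
    // Layout of one accounting entry in kvindices (assumed from the source comment).
    static final int PARTITION = 0, KEYSTART = 1, VALSTART = 2, ACCTSIZE = 3;

    final byte[] kvbuffer = new byte[1024];          // serialized key/value bytes
    final int[] kvindices = new int[ACCTSIZE * 16];  // partition, key start, value start
    final int[] kvoffsets = new int[16];             // indices into kvindices
    final int[] valEnd = new int[16];                // hypothetical helper: end of each value
    int bufindex = 0;                                // next free byte in kvbuffer
    int kvindex = 0;                                 // next free slot in kvoffsets

    void collect(int partition, String key, String value) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        byte[] v = value.getBytes(StandardCharsets.UTF_8);

        // 1. Append the key bytes, then the value bytes, to kvbuffer.
        int keystart = bufindex;
        System.arraycopy(k, 0, kvbuffer, bufindex, k.length);
        bufindex += k.length;
        int valstart = bufindex;
        System.arraycopy(v, 0, kvbuffer, bufindex, v.length);
        bufindex += v.length;

        // 2. Record where the pair lives in the accounting buffer.
        int ind = kvindex * ACCTSIZE;
        kvindices[ind + PARTITION] = partition;
        kvindices[ind + KEYSTART] = keystart;
        kvindices[ind + VALSTART] = valstart;
        valEnd[kvindex] = bufindex;

        // 3. Point the next kvoffsets slot at that accounting entry.
        kvoffsets[kvindex] = ind;
        kvindex++;
    }

    int partitionOf(int i) {
        return kvindices[kvoffsets[i] + PARTITION];
    }

    String keyOf(int i) {
        int ind = kvoffsets[i];
        int ks = kvindices[ind + KEYSTART];
        int vs = kvindices[ind + VALSTART];
        return new String(kvbuffer, ks, vs - ks, StandardCharsets.UTF_8);
    }

    String valueOf(int i) {
        int ind = kvoffsets[i];
        int vs = kvindices[ind + VALSTART];
        return new String(kvbuffer, vs, valEnd[ind / ACCTSIZE] - vs, StandardCharsets.UTF_8);
    }
}
```

If this picture is right, sorting would only need to swap the small integers in "kvoffsets", and the bytes in "kvbuffer" would never have to move.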

 

What I can't understand is the comments beside the variable definitions.

===================== definitions of some variables =========================

    private volatile int kvstart = 0;  // marks beginning of *spill*
    private volatile int kvend = 0;    // marks beginning of *collectable*
    private int kvindex = 0;           // marks end of *collected*
    private final int[] kvoffsets;     // indices into kvindices
    private final int[] kvindices;     // partition, k/v offsets into kvbuffer
    private volatile int bufstart = 0; // marks beginning of *spill*
    private volatile int bufend = 0;   // marks beginning of *collectable*
    private volatile int bufvoid = 0;  // marks the point where we should stop
                                       // reading at the end of the buffer
    private int bufindex = 0;          // marks end of *collected*
    private int bufmark = 0;           // marks end of *record*
    private byte[] kvbuffer;           // main output buffer

==============================================================================

 

Q1)

What do the terms "spill", "collectable", and "collected" mean?

My guess is that, because map outputs continue to be written to the buffer
while a spill takes place, there must be at least two pointers: one for where
to write new map outputs and one for how far to spill. But I don't know
exactly what "spill", "collectable", and "collected" mean.
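
Here is a small single-threaded sketch of my guess, so that it can be corrected more easily. It is only a toy model under my own assumptions, not Hadoop's implementation: the class SpillPointerSketch and its methods startSpill() and finishSpill() are hypothetical names, records are plain ints, and the real code uses a separate spill thread. In this model, the region from kvstart up to kvend is being spilled, new records are collected at kvindex, and collection must stop before running into kvstart.

```java
import java.util.ArrayList;
import java.util.List;

// Toy circular buffer with three pointers (my guess, NOT the real MapOutputBuffer).
class SpillPointerSketch {
    final int[] slots;
    int kvstart = 0;  // beginning of the region being spilled
    int kvend = 0;    // end of the spill region; records from here on are not yet spilled
    int kvindex = 0;  // next free slot, i.e. end of what has been collected

    SpillPointerSketch(int capacity) {
        slots = new int[capacity];
    }

    boolean spillInProgress() {
        return kvstart != kvend;
    }

    // Returns false when the buffer is full and the writer would have to wait.
    boolean collect(int record) {
        int kvnext = (kvindex + 1) % slots.length;
        if (kvnext == kvstart) {
            return false; // would overwrite data that has not been spilled yet
        }
        slots[kvindex] = record;
        kvindex = kvnext;
        return true;
    }

    // Freeze everything collected so far as the region to spill.
    void startSpill() {
        kvend = kvindex;
    }

    // "Write out" the frozen region, then release it for new records.
    List<Integer> finishSpill() {
        List<Integer> spilled = new ArrayList<>();
        for (int i = kvstart; i != kvend; i = (i + 1) % slots.length) {
            spilled.add(slots[i]);
        }
        kvstart = kvend;
        return spilled;
    }
}
```

In this model, collect() can still succeed between startSpill() and finishSpill(), which is the overlap of writing and spilling that I am guessing at.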

 

Q2)

Is it efficient to partition the data first and then sort the records inside
each partition?

Is this done to avoid expensive pairwise key comparisons?
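
To show what I mean, here is a minimal sketch of a sort that orders records by partition first and by key only within the same partition. This is an illustration under my own assumptions, not the comparator from "MapTask.java": the class PartitionSortSketch, the parallel arrays, and the String keys are hypothetical stand-ins.

```java
import java.util.Arrays;

// Sorts record indices by (partition, key); a sketch, NOT Hadoop's comparator.
class PartitionSortSketch {
    static Integer[] sortedOrder(int[] partitions, String[] keys) {
        Integer[] order = new Integer[keys.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> {
            // Cheap integer comparison first: different partitions never compare keys.
            if (partitions[a] != partitions[b]) {
                return Integer.compare(partitions[a], partitions[b]);
            }
            // Same partition: fall back to the (potentially expensive) key comparison.
            return keys[a].compareTo(keys[b]);
        });
        return order;
    }
}
```

For example, with partitions {1, 0, 1, 0} and keys {"b", "z", "a", "c"}, the resulting order of record indices is 3, 1, 2, 0: partition 0 comes first with its keys "c" and "z" in order, followed by partition 1 with "a" and "b".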

 

Q3)

Are there any documents containing explanations about how such internal
classes are implemented? 

 

Thanks,

----

eastcirclek

 

 
