[ 
https://issues.apache.org/jira/browse/TEZ-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-1094:
----------------------------------
    Attachment: TEZ-1094.3.patch


Removed the "finalMergeEnabled=false" case in UnorderedPartitionedKVWriter. It
isn't adding much value here. We can add it back later if it is needed.

writeLargeRecord - needs to send out an event in case of pipelined shuffle.
- Fixed. Added a test to cover this.

Empty partition details for each event are being generated from the global 
stats - which means they're cumulative. This should be generated from each 
Buffer before the buffer is reset. This likely implies storing the bitSet for 
each event that has been generated - especially for enableMerge = false, 
pipelined = false - since events are generated at the end in this case. 
(Alternate is to generate a single uber event at the end for the non pipelined 
case - but that's not part of this jira)
- Fixed. The stats are now read from each individual buffer before it is
reset, and the empty-partition BitSet is constructed from them.
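
As a rough illustration of the per-buffer approach described above, the sketch
below derives the empty-partition BitSet from one buffer's own record counts
and only then resets the buffer, so the result is not cumulative across spills.
The names BufferSketch, recordsPerPartition and emptyPartitionsThenReset are
hypothetical and are not taken from the patch.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical stand-in for a single write buffer; not the real Tez class.
final class BufferSketch {
    // Number of records written to each partition since the last reset.
    private final int[] recordsPerPartition;

    BufferSketch(int numPartitions) {
        this.recordsPerPartition = new int[numPartitions];
    }

    void recordWritten(int partition) {
        recordsPerPartition[partition]++;
    }

    // Builds the empty-partition bitset from this buffer's own counts,
    // then clears the counts so the next spill starts from zero.
    BitSet emptyPartitionsThenReset() {
        BitSet empty = new BitSet(recordsPerPartition.length);
        for (int p = 0; p < recordsPerPartition.length; p++) {
            if (recordsPerPartition[p] == 0) {
                empty.set(p);
            }
        }
        Arrays.fill(recordsPerPartition, 0);
        return empty;
    }
}
```

With global (cumulative) stats, a partition that received data in an earlier
spill would never be reported as empty again, even for a later spill that
holds nothing for it.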

Is it possible to get rid of getSpillIndex - maybe by storing the spillIndex
within SpillPathDetails? getSpillIndex would not work correctly with multiple
spill threads.
- Fixed
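
A minimal sketch of why carrying the index inside the path details helps with
multiple spill threads: the index is assigned once, at allocation time, and
travels with the paths instead of being re-derived from shared state later.
The class names, the use of plain Strings for paths, and the file-name pattern
are assumptions made for illustration only; they are not the patch's code.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified stand-in; paths are plain Strings here.
final class SpillPathDetailsSketch {
    final String outputFilePath;
    final String indexFilePath;
    final int spillIndex;

    SpillPathDetailsSketch(String outputFilePath, String indexFilePath, int spillIndex) {
        this.outputFilePath = outputFilePath;
        this.indexFilePath = indexFilePath;
        this.spillIndex = spillIndex;
    }
}

// Hands out spill paths; each result carries the index it was created with.
final class SpillAllocatorSketch {
    private final AtomicInteger nextSpill = new AtomicInteger(0);

    SpillPathDetailsSketch allocate(String dir) {
        // Atomic increment: two threads can never receive the same index.
        int idx = nextSpill.getAndIncrement();
        // The file-name pattern below is illustrative only.
        String out = dir + "/file.out_" + idx;
        return new SpillPathDetailsSketch(out, out + ".index", idx);
    }
}
```

A spill task that later needs the index for an event or a log line reads
details.spillIndex rather than asking a shared counter, which may have moved
on in the meantime.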

In the close() method
"if (currentBuffer.nextPosition == 0) {" - This will send out an event with
invalid partition information (global). More importantly, in case of
mergeEnabled=false - this is going to skip all previous events, since it
returns immediately after this event. Should this event just be generated with
a null pathComponent and a fully set BitSet?
- Fixed. Setting bits for all partitions and setting pathComponent to null in
this case.
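
The fix described above can be sketched roughly as follows: when the current
buffer holds no data at close(), the event marks every partition as empty and
carries no path component. EmptyOutputEventSketch and forEmptyBuffer are
hypothetical names; the real event payload in Tez has more fields than shown.

```java
import java.util.BitSet;

// Hypothetical, reduced view of an event payload; not the real Tez type.
final class EmptyOutputEventSketch {
    final String pathComponent;   // null: there is nothing to fetch
    final BitSet emptyPartitions; // bit p set: partition p has no data

    private EmptyOutputEventSketch(String pathComponent, BitSet emptyPartitions) {
        this.pathComponent = pathComponent;
        this.emptyPartitions = emptyPartitions;
    }

    // Event for a buffer with no records: all partitions are flagged empty.
    static EmptyOutputEventSketch forEmptyBuffer(int numPartitions) {
        BitSet all = new BitSet(numPartitions);
        all.set(0, numPartitions); // sets bits [0, numPartitions)
        return new EmptyOutputEventSketch(null, all);
    }
}
```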

calling finalSpill() ends up mixing filenames for the pipelined case. 
finalSpill will always generate the final output file name as it is today.
- For the pipelined/mergeDisabled case, we generate spill filenames ending
with "_spillId". Otherwise, the last spill event would not be retrieved on the
consumer side. If it is not pipelined, the final output file is generated as
it is today.
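
A small sketch of the naming rule described above, assuming a single boolean
for the pipelined/mergeDisabled case. SpillNamingSketch, its parameters and
the base name used in the usage note are hypothetical; only the "_spillId"
suffix comes from the comment itself.

```java
// Hypothetical helper; illustrates the naming rule only.
final class SpillNamingSketch {
    // Pipelined case: every spill, including the last one, gets a
    // "_<spillId>" suffix so the consumer can fetch each spill separately.
    // Non-pipelined case: the final output name is left unchanged.
    static String outputName(String baseName, boolean pipelined, int spillId) {
        if (pipelined) {
            return baseName + "_" + spillId;
        }
        return baseName;
    }
}
```

For example, outputName("output", true, 2) yields "output_2", while
outputName("output", false, 2) yields "output".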

I think the rest of the code to generate events may change depending on how 
emptyPartition details are handled (Likely a list of previous spills - which 
can be used to generate events rather than relying on getSpillIndex+1).

ShuffleManager: "if (shuffleInfoEventsMap.get(srcAttemptIdentifier) == null)" -
Shouldn't the following "else if attemptNumber == 0" check always be performed?
Otherwise it can let attemptNumber = 1 into the pending list.
- Fixed. Reversed the check to handle this.

ShuffleManager: On numFetchedSpills - was thinking the log would be something 
like copy(inputsDone) (spillsDone) of total numInputs complete. Otherwise it 
can be a little misleading as to how the shuffle is progressing.
- Fixed. Added numFetchedSpills to the log.

Nit: maybeWriteSpillIndex - rename to handleSpillIndex since it's doing more 
than just writing to the file.
- Fixed

Minor: spillIndex isn't really used in SpillCallable - pass in as a String for 
logging purposes, so that there isn't confusion in the future (especially in 
case of finalSpill where this is a random value)
- Removed spillIndex. It is now available as a part of SpillPathDetails.

Minor: getSpillPathDetails doesn't need to set finalOutPath and finalIndexPath 
(unless it's being used by tests?)
- Yes, it is set for testing purposes. Added a comment.


> Support pipelined data transfer for Unordered Output
> ----------------------------------------------------
>
>                 Key: TEZ-1094
>                 URL: https://issues.apache.org/jira/browse/TEZ-1094
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Siddharth Seth
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-1094.1.patch, TEZ-1094.2.patch, TEZ-1094.3.patch
>
>
> For unsorted output (and possibly for sorted output), it should be possible 
> to send data in small batches instead of waiting for everything to be 
> generated before transmitting. For now, planning on getting started with 
> UnsortedOutput / Input pairs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
