Hello Tez devs,

The OutputCommitter currently attached as a dataSink to the end vertex lets us 
finalize the output.

Context -
Currently we generate 'temp output files' at the output stage. The names of 
these files are built from unique identifiers (including task index, task 
attempt number, vertex index, and numPhysicalOutputs).

Problem -
During the output committer stage, I could not find a way to access the task 
information (task index, task attempt number) of the final output vertex.

Why do I need the task information?

  *   To recreate the paths of the 'temp output files' for final processing.
  *   To clean up after speculation. When speculation is turned on, the final 
output vertex may run several attempts of the same task, each writing similar 
temp output files that differ only in the task attempt number. Normally only 
one attempt succeeds and its output is used. A race arises when two or more 
attempts finish successfully: each leaves its temp output files behind, but 
only one attempt is registered as successful. We would like to know about the 
other attempts so that we can clean up their speculated temp output files.
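
To make the clean-up case concrete, here is a rough sketch in plain Java of 
what we would like to do once the task information is available. Everything 
in it is an assumption for illustration only: the file-name layout, the class 
and method names, and the map of successful attempts are made up, and none of 
it is a Tez API. numPhysicalOutputs is left out to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpeculatedOutputCleanup {
    // Hypothetical layout: part-v<vertexIndex>-t<taskIndex>-a<attemptNumber>.tmp
    private static final Pattern TEMP_NAME =
            Pattern.compile("part-v(\\d+)-t(\\d+)-a(\\d+)\\.tmp");

    // Rebuilds the temp file name for one task attempt (illustrative format only).
    static String tempFileName(int vertexIndex, int taskIndex, int attemptNumber) {
        return "part-v" + vertexIndex + "-t" + taskIndex + "-a" + attemptNumber + ".tmp";
    }

    // Returns the temp files written by attempts other than the one registered
    // as successful for their task. Names that do not match the layout, and
    // tasks with no registered successful attempt, are left alone.
    static List<String> staleFiles(List<String> fileNames,
                                   Map<Integer, Integer> successfulAttemptByTask) {
        List<String> stale = new ArrayList<>();
        for (String name : fileNames) {
            Matcher m = TEMP_NAME.matcher(name);
            if (!m.matches()) {
                continue;
            }
            int taskIndex = Integer.parseInt(m.group(2));
            int attemptNumber = Integer.parseInt(m.group(3));
            Integer winner = successfulAttemptByTask.get(taskIndex);
            if (winner != null && winner != attemptNumber) {
                stale.add(name);
            }
        }
        return stale;
    }
}
```

The missing piece is the map from task index to successful attempt number, 
which is exactly the information we cannot find in the committer today.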

Is there a way to access the task information of the final output vertex from 
the output committer?

Appreciate any suggestions.

Thank You,
Gurleen
