Hi Peter,

Thanks a lot for the breakdown, it all makes sense.

Unfortunately I work with companies that are stuck on older versions of
Hive, so I'm trying to find some workarounds.

I was actually able to make it mostly work. Here's what I do:

   - In configureJobConf():
      - Create a work directory with a name based on the Hive query ID,
      e.g. hdfs:///tmp/work-ABC123
      - Get some details from the TableDesc object and save them into a
      JSON file, e.g. hdfs:///tmp/work-ABC123/info.json. We need those
      details at commit time to know which external destination table to
      write the results to.
   - Each RecordWriter instance (which runs in a task) writes its own
   output file, e.g. hdfs:///tmp/work-ABC123/task-XYZ789.output
   - When using MR, the OutputCommitter.commitJob() automatically gets
   called. There we read the external destination table details from the
   info.json file, collect all the task output files, and copy them to the
   external table. Lastly we clean up the work directory.
   - When using Tez, commitInsertTable() gets called automatically. From
   there, we make an explicit call to OutputCommitter.commitJob() to
   finish the work.
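In case it helps to see the flow concretely, here's a minimal sketch of the happy path. It uses the local filesystem (java.nio.file) as a stand-in for HDFS, and the names (work-ABC123, task-XYZ789, the info.json shape, the helper method names) are just illustrative; the real code goes through Hadoop's FileSystem API and the actual TableDesc:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;

// Sketch of the commit flow: configureJobConf() sets up the work dir and
// info.json, each task writes its own output file, and commitJob() reads
// info.json, copies the task outputs to the destination, and cleans up.
public class CommitFlowSketch {

    // configureJobConf(): create the work dir and persist destination details.
    static Path setUpJob(Path base, String queryId, String destTable) throws IOException {
        Path workDir = Files.createDirectories(base.resolve("work-" + queryId));
        // In the real flow this would hold details pulled from TableDesc;
        // a single field is enough for the sketch.
        Files.write(workDir.resolve("info.json"),
                ("{\"destTable\":\"" + destTable + "\"}").getBytes());
        return workDir;
    }

    // RecordWriter: each task writes its own output file.
    static void writeTaskOutput(Path workDir, String taskId, String data) throws IOException {
        Files.write(workDir.resolve("task-" + taskId + ".output"), data.getBytes());
    }

    // OutputCommitter.commitJob(): read info.json, collect the task output
    // files, copy them to the external destination, then clean up.
    static int commitJob(Path workDir, Path destBase) throws IOException {
        String info = new String(Files.readAllBytes(workDir.resolve("info.json")));
        String destTable = info.replaceAll(".*\"destTable\":\"([^\"]+)\".*", "$1");
        Path dest = Files.createDirectories(destBase.resolve(destTable));
        int copied = 0;
        try (Stream<Path> files = Files.list(workDir)) {
            for (Path p : (Iterable<Path>) files::iterator) {
                if (p.getFileName().toString().endsWith(".output")) {
                    Files.copy(p, dest.resolve(p.getFileName()),
                            StandardCopyOption.REPLACE_EXISTING);
                    copied++;
                }
            }
        }
        // Clean up the work directory.
        try (Stream<Path> files = Files.list(workDir)) {
            for (Path p : (Iterable<Path>) files::iterator) Files.delete(p);
        }
        Files.delete(workDir);
        return copied;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("sketch");
        Path workDir = setUpJob(tmp, "ABC123", "ext_table");
        writeTaskOutput(workDir, "XYZ789", "row1");
        writeTaskOutput(workDir, "XYZ790", "row2");
        System.out.println("copied=" + commitJob(workDir, tmp.resolve("warehouse")));
        System.out.println("workDirRemoved=" + !Files.exists(workDir));
    }
}
```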

Does that approach seem sound to you?

If so, the one thing I think is still missing is the ability to clean
up the work directory at the end of the job **if the job has failed**. My
understanding is that with MR the OutputCommitter.abortJob() method would
be called. However, with Tez I'm not sure where such a hook might be. I see
there is rollbackInsertTable() but that seems to be used for another
purpose, i.e. to do something if commitInsertTable() itself fails.
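For the MR side, the failure-path cleanup I have in mind is roughly the following (again with the local filesystem standing in for HDFS; with Hadoop it would just be FileSystem.delete(workDir, true) inside OutputCommitter.abortJob()):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.Comparator;
import java.util.stream.Stream;

// Sketch of the failure-path cleanup that abortJob() would perform:
// recursively remove the work directory and any partial task outputs.
public class AbortJobSketch {

    static void abortJob(Path workDir) throws IOException {
        if (!Files.exists(workDir)) return;
        try (Stream<Path> walk = Files.walk(workDir)) {
            // Delete children before parents (deepest paths first).
            for (Path p : (Iterable<Path>) walk.sorted(Comparator.reverseOrder())::iterator) {
                Files.delete(p);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path workDir = Files.createTempDirectory("work-ABC123");
        // A partial output file left behind by a failed task.
        Files.write(workDir.resolve("task-XYZ789.output"), "partial".getBytes());
        abortJob(workDir);
        System.out.println("cleanedUp=" + !Files.exists(workDir));
    }
}
```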

Is there a place I could hook into with Tez **at the end of a job** in the
event that the job as a whole has failed? Alternatively, is there a hook
for when a job is complete, regardless of whether it has succeeded or
failed?

Thanks,

Julien