steveloughran commented on PR #5931:
URL: https://github.com/apache/hadoop/pull/5931#issuecomment-1669566586

   This is very like the staging committer's partitioned overwrite, and it is 
needed for the magic committer to support insert overwrite in Spark, so it will be 
good to have.
   
   Now that HADOOP-16570 covers the scale problems with the staging committer, I'd 
hoped the manifest committer would be safe, as its per-file data is so much 
smaller; but MAPREDUCE-7435 showed that no, you can't even hold the manifest 
(source, dest) rename lists in memory without overloading a Spark driver.  
The fix there was to stream the pending data to the local fs and read it 
back in; I think the same may be needed here too. Using the local fs avoids all S3 
writeback/reading.
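
   A minimal sketch of that spill-to-local-fs idea (all class and path names 
here are illustrative, not the actual Hadoop implementation):

   ```java
   import java.io.*;
   import java.nio.file.*;
   import java.util.*;

   /**
    * Sketch of the MAPREDUCE-7435 fix described above: rather than building
    * the whole (source, dest) rename list in driver memory, spill each pair
    * to a local temp file as it is collected, then stream the file back
    * when the renames are executed.
    */
   public class RenameListSpill {

     /** Write the pairs to a local spill file; returns the file written. */
     public static Path spill(List<String[]> pairs) throws IOException {
       Path file = Files.createTempFile("rename-list", ".bin");
       try (DataOutputStream out = new DataOutputStream(
           new BufferedOutputStream(Files.newOutputStream(file)))) {
         out.writeInt(pairs.size());          // entry count up front
         for (String[] pair : pairs) {
           out.writeUTF(pair[0]);             // source
           out.writeUTF(pair[1]);             // destination
         }
       }
       return file;
     }

     /** Stream the spill file back, one pair at a time. */
     public static int replay(Path file) throws IOException {
       int renames = 0;
       try (DataInputStream in = new DataInputStream(
           new BufferedInputStream(Files.newInputStream(file)))) {
         int count = in.readInt();
         for (int i = 0; i < count; i++) {
           String src = in.readUTF();
           String dst = in.readUTF();
           renames++;                         // real code would rename src -> dst here
         }
       }
       return renames;
     }

     public static void main(String[] args) throws IOException {
       List<String[]> pairs = new ArrayList<>();
       for (int i = 0; i < 3; i++) {
         pairs.add(new String[] {
             "s3a://bucket/__magic/task" + i + "/part-" + i,
             "s3a://bucket/dest/part-" + i });
       }
       Path file = spill(pairs);
       System.out.println("replayed=" + replay(file));   // prints replayed=3
       Files.delete(file);
     }
   }
   ```

   Memory use is then bounded by one entry plus buffer, not the full list.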
   
   The hardest bit of that PR, 
org.apache.hadoop.mapreduce.lib.output.committer.manifest.impl.EntryFileIO, 
will be on the classpath; maybe a more abstract superclass can be extracted, 
SinglePendingCommit made Writable, and then the same queue-based 
serialization used in this job commit: a pool of threads reads all the 
.pendingset files, everything is streamed to a temp file, and along the way the 
list of directories to clean up is enumerated.
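
   The queue-based serialization could look roughly like this (a simplified, 
self-contained sketch modeled on the EntryFileIO idea; the file format, class 
names, and pool size are assumptions, not the real Hadoop classes):

   ```java
   import java.io.*;
   import java.nio.file.*;
   import java.util.*;
   import java.util.concurrent.*;

   /**
    * Sketch of queue-based serialization in job commit: a pool of reader
    * threads parses .pendingset files in parallel and enqueues entries;
    * a single writer drains the queue into one local temp file, so the
    * full commit list is never held in memory at once.
    */
   public class QueuedEntryWriter {
     private static final String EOF = "\u0000EOF";   // end-of-stream sentinel

     public static int drainToFile(List<List<String>> pendingsets, Path target)
         throws Exception {
       BlockingQueue<String> queue = new LinkedBlockingQueue<>(1000);
       ExecutorService readers = Executors.newFixedThreadPool(4);
       for (List<String> set : pendingsets) {
         readers.submit(() -> {
           for (String entry : set) {
             queue.put(entry);       // stands in for parsing one .pendingset file
           }
           return null;
         });
       }
       // Once every reader has finished, enqueue the sentinel.
       Thread closer = new Thread(() -> {
         readers.shutdown();
         try {
           readers.awaitTermination(1, TimeUnit.MINUTES);
           queue.put(EOF);
         } catch (InterruptedException ignored) { }
       });
       closer.start();

       // Single writer: drain the queue into the local temp file.
       int written = 0;
       try (BufferedWriter out = Files.newBufferedWriter(target)) {
         String entry;
         while (!(entry = queue.take()).equals(EOF)) {
           out.write(entry);
           out.newLine();
           written++;
         }
       }
       closer.join();
       return written;
     }

     public static void main(String[] args) throws Exception {
       List<List<String>> sets = new ArrayList<>();
       for (int i = 0; i < 3; i++) {
         sets.add(Arrays.asList("commit-" + i + "-a", "commit-" + i + "-b"));
       }
       Path tmp = Files.createTempFile("pending", ".txt");
       System.out.println("entries=" + drainToFile(sets, tmp));  // prints entries=6
       Files.delete(tmp);
     }
   }
   ```

   The directory-cleanup enumeration could happen in the same writer loop, as 
each entry's parent path goes past exactly once.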
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
