[GitHub] [spark] dongjoon-hyun edited a comment on pull request #32518: [SPARK-35383][CORE] Improve s3a magic committer support by inferring missing configs

GitBox Wed, 12 May 2021 11:29:18 -0700


dongjoon-hyun edited a comment on pull request #32518:
URL: https://github.com/apache/spark/pull/32518#issuecomment-839992094



   Here are my thoughts.
   
   > Be aware that https://issues.apache.org/jira/browse/HADOOP-17483 turns the 
magic committer on everywhere
   
   Of course. We've been waiting for Apache Hadoop 3.3.1 as the next step 
after Hadoop 3.2.2, and we will gladly upgrade.
   
   > so this patch will make the magic committer the default on s3.
   
   This patch fills in the missing configurations only when the Spark 
configuration 
`spark.hadoop.fs.s3a.bucket.<bucket>.committer.magic.enabled=true` is provided. 
So it is orthogonal to Hadoop's default configuration.
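   
   To make the condition concrete, here is a minimal Python sketch of the 
"fill only when missing" behavior described above. It is an illustration, not 
the code in this PR: the function name is hypothetical, and the inferred keys 
and values in `INFERRED_DEFAULTS` are assumptions that should be checked 
against the merged patch. Only the per-bucket trigger key comes from this 
comment.

```python
import re

# Trigger key pattern, as quoted in the comment above.
BUCKET_MAGIC_ENABLED = re.compile(
    r"^spark\.hadoop\.fs\.s3a\.bucket\..+\.committer\.magic\.enabled$"
)

# ASSUMPTION: illustrative defaults only; verify against the actual patch.
INFERRED_DEFAULTS = {
    "spark.hadoop.fs.s3a.committer.magic.enabled": "true",
    "spark.hadoop.fs.s3a.committer.name": "magic",
}


def fill_missing_magic_committer_confs(conf):
    """Return a copy of conf with defaults added only if the trigger is set.

    Hypothetical helper: values the user already set are never overwritten.
    """
    result = dict(conf)
    triggered = any(
        BUCKET_MAGIC_ENABLED.match(key) and value.strip().lower() == "true"
        for key, value in conf.items()
    )
    if triggered:
        for key, value in INFERRED_DEFAULTS.items():
            # setdefault keeps any explicit user setting intact.
            result.setdefault(key, value)
    return result
```

   Without the per-bucket key, the input is returned unchanged, which is why 
the behavior does not depend on what Hadoop enables by default.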
   
   > I am perfectly happy with this.
   
   Thank you. Yes, for S3, this is a correct and better direction, and it is 
especially useful when we build Apache Spark from source against a provided 
Hadoop version such as 3.2.x or 3.3.0.
   
   > Note also that MAPREDUCE-7431 is adding a committer for ABFS and GCS for 
max performance on abfs and performance and correctness on gcs. (it'll work on 
HDFS too, FWIW). Those changes needed in the spark config will be needed there 
too.
   
   Also, thank you for the heads-up. Yes, we are definitely looking forward to 
seeing it. In addition to S3's offset bug, those will be beneficial to end 
users.
   
   > Now, one of the reasons that binding factory stuff is in the spark 
codebase is that it was still using some of the old MRv1 algorithms to create 
and invoke committers, rather than the V2 APIs, which automatically go through 
the factory mechanism. So the real solution here would be to find those bits of 
the spark code which use org.apache.hadoop.mapred.FileOutputCommitter and 
other stuff in the same package and see if it can be replaced with a move to 
the stuff in org.apache.hadoop.mapreduce.lib.output.
   
   Yes, but that touches a non-trivial code path at this stage and may cause 
another regression. I hope we can revisit it later.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
