dongjoon-hyun edited a comment on pull request #32518: URL: https://github.com/apache/spark/pull/32518#issuecomment-839992094
Here are my thoughts.

> Be aware that https://issues.apache.org/jira/browse/HADOOP-17483 turns the magic committer on everywhere

Of course. We've been waiting for Apache Hadoop 3.3.1 as the next step after Hadoop 3.2.2, and we will upgrade willingly.

> so this patch will make the magic committer the default on s3.

This patch fills in the missing parts only when Spark's configuration `spark.hadoop.fs.s3a.bucket.<bucket>.committer.magic.enabled=true` is provided. So it's orthogonal to the Hadoop default configuration.

> I am perfectly happy with this.

Thank you. Yes, for S3, this is a correct and better direction, and it's especially useful when we build Apache Spark from source with a provided Hadoop version like 3.2.x or 3.3.0.

> Note also that MAPREDUCE-7431 is adding a committer for ABFS and GCS for max performance on abfs and performance and correctness on gcs. (it'll work on HDFS too, FWIW). Those changes needed in the spark config will be needed there too.

Also, thank you for the heads-up. Yep, definitely, we are looking forward to seeing it. In addition to S3's offset bug, those will be beneficial to the end users.

> Now, one of the reasons that binding factory stuff is in the spark codebase is that it was still using some of the old MRv1 algorithms to create and invoke committers, rather than the V2 APIs, which automatically go through the factory mechanism. So the real solution here would to be find those bits of the spark code which uses org.apache.hadoop.mapred.FileOutputCommitter and other stuff in the same package and see if it can be replaced with a move to the stuff in org.apache.hadoop.mapreduce.lib.output.

Yes, that involves a non-trivial code path at this stage, and changing it may cause another regression. I hope we can revisit that later.
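
For illustration, the "fill in the missing parts" behavior described above can be sketched roughly as follows. This is a Python sketch under stated assumptions, not the Scala code in the patch: the function name `fill_missing_magic_committer_confs` is hypothetical, and the defaults are the S3A committer settings listed in Spark's cloud-integration documentation, assumed here to be the ones being filled in.

```python
# Hypothetical helper (not Spark's implementation): fill in committer settings
# only when some bucket has explicitly opted in to the magic committer.
BUCKET_PREFIX = "spark.hadoop.fs.s3a.bucket."
MAGIC_SUFFIX = ".committer.magic.enabled"

# Assumed defaults, taken from Spark's cloud-integration guide for S3A committers.
MAGIC_DEFAULTS = {
    "spark.hadoop.fs.s3a.committer.magic.enabled": "true",
    "spark.hadoop.fs.s3a.committer.name": "magic",
    "spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a":
        "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
}


def fill_missing_magic_committer_confs(conf):
    """Return a copy of conf with missing magic-committer settings filled in."""
    result = dict(conf)
    opted_in = any(
        key.startswith(BUCKET_PREFIX)
        and key.endswith(MAGIC_SUFFIX)
        and value.strip().lower() == "true"
        for key, value in conf.items()
    )
    if not opted_in:
        # No per-bucket opt-in: leave the configuration untouched.
        return result
    for key, value in MAGIC_DEFAULTS.items():
        # setdefault keeps anything the user has already set explicitly.
        result.setdefault(key, value)
    return result
```

Without a per-bucket `committer.magic.enabled=true` entry the configuration is returned unchanged, which is the sense in which the patch is independent of whatever Hadoop itself defaults to.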