[GitHub] [spark] dongjoon-hyun commented on pull request #32518: [SPARK-35383][CORE] Improve s3a magic committer support by inferring missing configs
dongjoon-hyun commented on pull request #32518: URL: https://github.com/apache/spark/pull/32518#issuecomment-1050425417 Hi, @itayB. If you are using Apache Spark 3.2.1 with Hadoop 3.3.1, you don't need the first one. However, you still need the Parquet recommendation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #32518: [SPARK-35383][CORE] Improve s3a magic committer support by inferring missing configs
dongjoon-hyun commented on pull request #32518: URL: https://github.com/apache/spark/pull/32518#issuecomment-840019899 Although the AppVeyor build failed due to a timeout, Jenkins passed. Merged to master. Thank you, @dbtsai, @HyukjinKwon, @steveloughran. This is part of the effort to give Apache Spark 3.2.0 better cloud support.
[GitHub] [spark] dongjoon-hyun commented on pull request #32518: [SPARK-35383][CORE] Improve s3a magic committer support by inferring missing configs
dongjoon-hyun commented on pull request #32518: URL: https://github.com/apache/spark/pull/32518#issuecomment-839992094

Here are my thoughts.

> Be aware that https://issues.apache.org/jira/browse/HADOOP-17483 turns the magic committer on everywhere

Of course, we've been waiting for Apache Hadoop 3.3.1 as the next step after Hadoop 3.2.2. We will gladly upgrade.

> so this patch will make the magic committer the default on s3.

This patch fills in the missing parts only when Spark's configuration `spark.hadoop.fs.s3a.bucket..committer.magic.enabled=true` is provided. So, it's orthogonal to the Hadoop default configuration.

> I am perfectly happy with this.

Thank you. Yes, for S3, this is a correct and better direction, and it is especially useful when we build Apache Spark from source with a provided Hadoop version like 3.2.x or 3.3.0.

> Note also that MAPREDUCE-7431 is adding a committer for ABFS and GCS for max performance on abfs and performance and correctness on gcs. (it'll work on HDFS too, FWIW). Those changes needed in the spark config will be needed there too.

Also, thank you for the heads-up. Yep, definitely, we are looking forward to seeing it. In addition to S3's offset bug, those will be beneficial to the end users.

> Now, one of the reasons that binding factory stuff is in the spark codebase is that it was still using some of the old MRv1 algorithms to create and invoke committers, rather than the V2 APIs, which automatically go through the factory mechanism. So the real solution here would be to find those bits of the spark code which uses org.apache.hadoop.mapred.FileOutputCommitter and other stuff in the same package and see if it can be replaced with a move to the stuff in org.apache.hadoop.mapreduce.lib.output.

Yes, but that touches a non-trivial code path, and changing it at this stage may cause another regression. I hope we can revisit that later.
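The "fill in only what is missing" idea in this comment can be sketched with a plain dictionary standing in for the Spark configuration. This is an illustration only, not the code from the PR: the function name `fill_missing_magic_committer_confs` and the bucket name `my-bucket` are hypothetical, the trigger condition (a per-bucket `committer.magic.enabled=true` entry) is an assumption drawn from the PR title and this comment, and the defaults are limited to the two Spark settings named elsewhere in this thread.

```python
from typing import Dict

# Key shape assumed for the per-bucket opt-in (bucket name sits in the middle).
BUCKET_PREFIX = "spark.hadoop.fs.s3a.bucket."
MAGIC_SUFFIX = ".committer.magic.enabled"

# Only the two Spark settings quoted in this thread; the real patch may set more.
INFERRED_DEFAULTS = {
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
}


def fill_missing_magic_committer_confs(conf: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of conf with committer defaults added where absent."""
    result = dict(conf)
    opted_in = any(
        key.startswith(BUCKET_PREFIX)
        and key.endswith(MAGIC_SUFFIX)
        and value.lower() == "true"
        for key, value in conf.items()
    )
    if not opted_in:
        # No bucket asked for the magic committer: leave everything untouched.
        return result
    for key, value in INFERRED_DEFAULTS.items():
        # setdefault never overwrites a value the user supplied explicitly.
        result.setdefault(key, value)
    return result


if __name__ == "__main__":
    user_conf = {
        "spark.hadoop.fs.s3a.bucket.my-bucket.committer.magic.enabled": "true",
    }
    for key, value in sorted(fill_missing_magic_committer_confs(user_conf).items()):
        print(f"{key}={value}")
```

Because the defaults are applied only after an explicit per-bucket opt-in, and never over a user-supplied value, a sketch like this stays independent of whatever Hadoop itself chooses as its default committer.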
[GitHub] [spark] dongjoon-hyun commented on pull request #32518: [SPARK-35383][CORE] Improve s3a magic committer support by inferring missing configs
dongjoon-hyun commented on pull request #32518: URL: https://github.com/apache/spark/pull/32518#issuecomment-839980053 Thank you so much for the review and comments, @steveloughran!
[GitHub] [spark] dongjoon-hyun commented on pull request #32518: [SPARK-35383][CORE] Improve s3a magic committer support by inferring missing configs
dongjoon-hyun commented on pull request #32518: URL: https://github.com/apache/spark/pull/32518#issuecomment-839841587

> LGTM. Should we eventually do this in Hadoop, cc @steveloughran and @dongjoon-hyun ?

Thank you for the review, @dbtsai. The following two are Spark configurations.

```
spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
spark.sql.sources.commitProtocolClass=org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
```
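For readers who set these two values by hand, one way to pass them is as `--conf key=value` pairs on the `spark-submit` command line. The helper below is a small sketch that only assembles such an argument list; the function name `spark_submit_args` and the application file `my_job.py` are hypothetical, and nothing here launches Spark.

```python
from typing import Dict, List

# The two Spark settings quoted in the comment above.
COMMITTER_CONFS: Dict[str, str] = {
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
}


def spark_submit_args(app: str, confs: Dict[str, str]) -> List[str]:
    """Build a spark-submit argument list with one --conf pair per setting."""
    args = ["spark-submit"]
    for key, value in confs.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)  # the application file goes last
    return args


if __name__ == "__main__":
    # my_job.py is a placeholder application name.
    print(" ".join(spark_submit_args("my_job.py", COMMITTER_CONFS)))
```

The resulting list can be handed to `subprocess.run` or joined into a shell command, whichever fits the deployment.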
[GitHub] [spark] dongjoon-hyun commented on pull request #32518: [SPARK-35383][CORE] Improve s3a magic committer support by inferring missing configs
dongjoon-hyun commented on pull request #32518: URL: https://github.com/apache/spark/pull/32518#issuecomment-839470621 Hi, @steveloughran. Could you review this PR?