GitHub user steveloughran opened a pull request: https://github.com/apache/spark/pull/19448
[SPARK-22217] [SQL] ParquetFileFormat to support arbitrary OutputCommitters

## What changes were proposed in this pull request?

`ParquetFileFormat` relaxes its requirement that the output committer class be `org.apache.parquet.hadoop.ParquetOutputCommitter` or a subclass thereof (and so, implicitly, a Hadoop `FileOutputCommitter`). It now accepts any committer implementing `org.apache.hadoop.mapreduce.OutputCommitter`.

This enables output committers which don't write to the filesystem the way `FileOutputCommitter` does to save parquet data from a dataframe: at present you cannot do this.

Because a committer which isn't a subclass of `ParquetOutputCommitter` does not write the summary metadata, the code checks whether the context has requested summary metadata by setting `parquet.enable.summary-metadata`. If it has, and the committer class isn't a parquet committer, it raises a `RuntimeException` with an error message.

(It could downgrade, of course, but raising an exception makes it clear there won't be a summary. It also makes the behaviour testable.)

## How was this patch tested?

The patch includes a test suite, `ParquetCommitterSuite`, with a new committer, `MarkingFileOutputCommitter`, which extends `FileOutputCommitter` and writes a marker file in the destination directory. The presence of the marker file can be used to verify that the new committer was used. The tests then try each combination of committer (parquet or marking) and summary setting (enabled or disabled):

| committer | summary | outcome |
|-----------|---------|---------|
| parquet | true | success |
| parquet | false | success |
| marking | false | success with marker |
| marking | true | exception |

All tests are happy.
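The check described above can be sketched roughly as follows. This is a dependency-free illustration, not the code in the patch: the four classes are stand-ins that only mirror the class hierarchy named in the description (they are not the real Hadoop or Parquet types), `CommitterCheck`, `validate` and `outcome` are hypothetical names, and treating an unset option as "no summary requested" is an assumption made for the sketch.

```scala
// Stand-in types that only model the class hierarchy named in the description.
// They are NOT the real Hadoop or Parquet classes.
class OutputCommitter
class FileOutputCommitter extends OutputCommitter
class ParquetOutputCommitter extends FileOutputCommitter
class MarkingFileOutputCommitter extends FileOutputCommitter

// Hypothetical helper object; the names here are illustrative only.
object CommitterCheck {
  // Option name taken from the pull request description.
  val SummaryKey = "parquet.enable.summary-metadata"

  // Assumption: an unset option is treated as "no summary requested".
  def summaryRequested(conf: Map[String, String]): Boolean =
    conf.get(SummaryKey).exists(_.trim.equalsIgnoreCase("true"))

  // True when the committer is ParquetOutputCommitter or a subclass of it.
  def isParquetCommitter(committer: Class[_ <: OutputCommitter]): Boolean =
    classOf[ParquetOutputCommitter].isAssignableFrom(committer)

  // Fail fast rather than silently skipping the summary.
  def validate(committer: Class[_ <: OutputCommitter],
               conf: Map[String, String]): Unit =
    if (summaryRequested(conf) && !isParquetCommitter(committer)) {
      throw new RuntimeException(
        s"Summary metadata requested via $SummaryKey, but " +
          s"${committer.getName} is not a subclass of ParquetOutputCommitter")
    }

  // Maps one committer/summary combination to "success" or "exception".
  def outcome(committer: Class[_ <: OutputCommitter], summary: Boolean): String =
    try {
      validate(committer, Map(SummaryKey -> summary.toString))
      "success"
    } catch {
      case _: RuntimeException => "exception"
    }
}
```

Calling `CommitterCheck.outcome` with each committer class and each summary setting gives the four combinations the tests try. The sketch covers only the validation step; it does not write a marker file or any parquet data.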
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/steveloughran/spark cloud/SPARK-22217-committer

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19448.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19448

----
commit e6fdbdcf4118283abd22f7b14586ed742d225657
Author: Steve Loughran <ste...@hortonworks.com>
Date:   2017-07-12T10:42:51Z

    SPARK-22217 tuning ParquetOutputCommitter to support any committer class, provided saveSummaries is disabled.

    With Tests

    Change-Id: I19872dc1c095068ed5a61985d53cb7258bd9a9bb

----