Felix Kizhakkel Jose created SPARK-31072:
--------------------------------------------
             Summary: Defaults to ParquetOutputCommitter even after configuring the committer as "partitioned"
                 Key: SPARK-31072
                 URL: https://issues.apache.org/jira/browse/SPARK-31072
             Project: Spark
          Issue Type: Bug
          Components: Java API
    Affects Versions: 2.4.5
            Reporter: Felix Kizhakkel Jose

My program's logs show that it uses ParquetOutputCommitter when I write _*Parquet*_, even after I configure it to use the "PartitionedStagingCommitter" with the following settings:
 * sparkSession.conf().set("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a", "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory");
 * sparkSession.conf().set("fs.s3a.committer.name", "partitioned");
 * sparkSession.conf().set("fs.s3a.committer.staging.conflict-mode", "append");
 * sparkSession.conf().set("spark.hadoop.parquet.mergeSchema", "false");
 * sparkSession.conf().set("spark.hadoop.parquet.enable.summary-metadata", "false");

Application log excerpt (Parquet):

20/03/06 10:15:17 INFO ParquetFileFormat: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm version is 2
20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm version is 2
20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter

But when I use _*ORC*_ as the file format, the same configuration correctly picks the "PartitionedStagingCommitter":

20/03/05 11:51:14 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
20/03/05 11:51:14 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using committer partitioned to output data to s3a:************
20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using Commmitter PartitionedStagingCommitter**********

So I am wondering: why do Parquet and ORC behave differently, and how can I use the PartitionedStagingCommitter instead of ParquetOutputCommitter?

I started investigating this because, when saving data directly to S3 with partitionBy() on two columns, I was intermittently getting file-not-found exceptions. How can I avoid this issue when writing *Parquet from Spark to S3 over s3a, without S3Guard?*
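For reference, here is a minimal, self-contained sketch of the write path described above. The configuration keys are exactly the ones listed; the class name, bucket paths, and partition column names (PartitionedCommitterRepro, s3a://my-bucket/..., col1, col2) are placeholders introduced for illustration:

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class PartitionedCommitterRepro {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("partitioned-committer-repro")
                .getOrCreate();

        // S3A committer settings from the report above.
        spark.conf().set("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
                "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory");
        spark.conf().set("fs.s3a.committer.name", "partitioned");
        spark.conf().set("fs.s3a.committer.staging.conflict-mode", "append");
        spark.conf().set("spark.hadoop.parquet.mergeSchema", "false");
        spark.conf().set("spark.hadoop.parquet.enable.summary-metadata", "false");

        // Placeholder input; any DataFrame with col1/col2 columns will do.
        Dataset<Row> df = spark.read().parquet("s3a://my-bucket/input/");

        // Writing Parquet logs ParquetOutputCommitter, as shown above;
        // swapping .parquet(...) for .orc(...) on this final write is the
        // only change between the two runs, and the ORC run picks the
        // PartitionedStagingCommitter.
        df.write()
          .mode(SaveMode.Append)
          .partitionBy("col1", "col2")
          .parquet("s3a://my-bucket/output/");

        spark.stop();
    }
}
{code}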