[ https://issues.apache.org/jira/browse/SPARK-31072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053543#comment-17053543 ]
Felix Kizhakkel Jose commented on SPARK-31072:
----------------------------------------------

[~steve_l], I have seen some issues you have addressed in this area; could you please give me some insights? All, please provide some help on this issue. A sketch of the full setup I am experimenting with follows the quoted description below.

> Default to ParquetOutputCommitter even after configuring the committer as "partitioned"
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-31072
>                 URL: https://issues.apache.org/jira/browse/SPARK-31072
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.4.5
>            Reporter: Felix Kizhakkel Jose
>            Priority: Major
>
> My program's logs say it uses ParquetOutputCommitter when I write _*Parquet*_, even after I configure "PartitionedStagingCommitter" with the following settings:
> * sparkSession.conf().set("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a", "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory");
> * sparkSession.conf().set("fs.s3a.committer.name", "partitioned");
> * sparkSession.conf().set("fs.s3a.committer.staging.conflict-mode", "append");
> * sparkSession.conf().set("spark.hadoop.parquet.mergeSchema", "false");
> * sparkSession.conf().set("spark.hadoop.parquet.enable.summary-metadata", false);
>
> Application log excerpt:
> 20/03/06 10:15:17 INFO ParquetFileFormat: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
>
> But when I use _*ORC*_ as the file format, the same configuration correctly picks "PartitionedStagingCommitter":
> 20/03/05 11:51:14 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
> 20/03/05 11:51:14 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using committer partitioned to output data to s3a:************
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using Commmitter PartitionedStagingCommitter**********
>
> So I am wondering why Parquet and ORC have different behavior?
> How can I use PartitionedStagingCommitter instead of ParquetOutputCommitter?
> I started looking into this because, when saving data directly to S3 with partitionBy() on two columns, I was intermittently getting FileNotFoundExceptions.
> So how can I avoid this issue when writing *Parquet from Spark to S3 via s3a, without S3Guard*?
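For reference, here is the sketch mentioned above: the setup I am experimenting with, based on my reading of the Hadoop S3A committer documentation. It assumes the spark-hadoop-cloud module is on the classpath, since the docs indicate Parquet additionally needs the PathOutputCommitProtocol and BindingParquetOutputCommitter settings before ParquetFileFormat stops falling back to ParquetOutputCommitter. The bucket names, input path, and partition columns below are placeholders for my real job.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class PartitionedCommitterSketch {
        public static void main(String[] args) {
            // Hadoop/S3A options are set with the spark.hadoop. prefix at session
            // build time, so they reach the Hadoop Configuration before any write.
            SparkSession spark = SparkSession.builder()
                .appName("s3a-partitioned-committer-sketch")
                .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
                        "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
                .config("spark.hadoop.fs.s3a.committer.name", "partitioned")
                .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "append")
                // Parquet-specific bindings from the spark-hadoop-cloud module; per
                // the Hadoop S3A committer docs, Parquet ignores the committer
                // factory unless these two settings are present.
                .config("spark.sql.sources.commitProtocolClass",
                        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
                .config("spark.sql.parquet.output.committer.class",
                        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
                .getOrCreate();

            // Placeholder input; my real job builds the Dataset elsewhere.
            Dataset<Row> df = spark.read().json("s3a://my-bucket/input/");

            df.write()
              .mode(SaveMode.Append)
              .partitionBy("year", "month")        // two partition columns, as in the failing run
              .parquet("s3a://my-bucket/output/"); // placeholder output path

            spark.stop();
        }
    }

If this is right, it would also explain the ORC/Parquet difference: ORC goes through the committer factory directly, while Parquet consults spark.sql.parquet.output.committer.class and so defaults to ParquetOutputCommitter.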