Felix Kizhakkel Jose created SPARK-31072:
--------------------------------------------

             Summary: Defaults to ParquetOutputCommitter even after setting the 
committer to "partitioned"
                 Key: SPARK-31072
                 URL: https://issues.apache.org/jira/browse/SPARK-31072
             Project: Spark
          Issue Type: Bug
          Components: Java API
    Affects Versions: 2.4.5
            Reporter: Felix Kizhakkel Jose


My application logs say it uses the ParquetOutputCommitter when I write 
_*Parquet*_, even after I configure the "PartitionedStagingCommitter" with the 
following settings:
 * sparkSession.conf().set("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a", "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory");
 * sparkSession.conf().set("fs.s3a.committer.name", "partitioned");
 * sparkSession.conf().set("fs.s3a.committer.staging.conflict-mode", "append");
 * sparkSession.conf().set("spark.hadoop.parquet.mergeSchema", "false");
 * sparkSession.conf().set("spark.hadoop.parquet.enable.summary-metadata", "false");
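
For reference, here is the same configuration as one self-contained snippet, set at session build time rather than on the runtime conf (a minimal sketch; the app name is a placeholder, and prefixing the fs.s3a.* keys with spark.hadoop. is my assumption for making sure they reach the Hadoop configuration):

{code:java}
import org.apache.spark.sql.SparkSession;

public class CommitterConfigSketch {
    public static void main(String[] args) {
        // Sketch: the same settings as above, applied once when the session is built.
        SparkSession sparkSession = SparkSession.builder()
                .appName("s3a-partitioned-committer-test") // placeholder name
                // Route committer creation for the s3a:// scheme through the S3A factory
                .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
                        "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
                // Ask the factory for the partitioned staging committer
                .config("spark.hadoop.fs.s3a.committer.name", "partitioned")
                .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "append")
                .config("spark.hadoop.parquet.mergeSchema", "false")
                .config("spark.hadoop.parquet.enable.summary-metadata", "false")
                .getOrCreate();
    }
}
{code}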

Application log output:

20/03/06 10:15:17 INFO ParquetFileFormat: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm version is 2
20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm version is 2
20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter

But when I use _*ORC*_ as the file format with the same configuration as above, 
it correctly picks the "PartitionedStagingCommitter":
20/03/05 11:51:14 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
20/03/05 11:51:14 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using committer partitioned to output data to s3a:************
20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using Commmitter PartitionedStagingCommitter**********

So I am wondering why Parquet and ORC have different behavior.
How can I use the PartitionedStagingCommitter instead of the ParquetOutputCommitter?
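
From reading the Hadoop S3A committer and Spark cloud-integration docs, Parquet appears to insist on a committer that subclasses ParquetOutputCommitter, so the s3a factory setting gets bypassed unless a binding commit protocol is configured. Below is a sketch of that setup; it assumes the spark-hadoop-cloud module (which provides PathOutputCommitProtocol and BindingParquetOutputCommitter) is on the classpath, which is not the case in every 2.4.x build:

{code:java}
import org.apache.spark.sql.SparkSession;

public class ParquetS3ACommitterSketch {
    public static void main(String[] args) {
        // Sketch, assuming the org.apache.spark.internal.io.cloud.* classes
        // from the spark-hadoop-cloud module are available on the classpath.
        SparkSession spark = SparkSession.builder()
                .appName("parquet-s3a-committer") // placeholder name
                // Commit through Hadoop's PathOutputCommitter factory mechanism
                .config("spark.sql.sources.commitProtocolClass",
                        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
                // Satisfies Parquet's ParquetOutputCommitter subclass check while
                // delegating the actual commit to the factory-selected committer
                .config("spark.sql.parquet.output.committer.class",
                        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
                .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
                        "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
                .config("spark.hadoop.fs.s3a.committer.name", "partitioned")
                .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "append")
                .getOrCreate();
    }
}
{code}

If that is correct, it would also explain the difference with ORC, since ORC does not impose a ParquetOutputCommitter subclass requirement and so picks up the factory-selected committer directly.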

I started looking into this because, when saving data directly to S3 with 
partitionBy() on two columns, I was intermittently getting FileNotFoundException. 
So how can I avoid this issue when writing *Parquet from Spark to S3 via s3a, 
without S3Guard?*


