Lorenzo Martini created SPARK-42439:
---------------------------------------
Summary: Job description in v2 FileWrites can have the wrong committer
Key: SPARK-42439
URL: https://issues.apache.org/jira/browse/SPARK-42439
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.3.1
Reporter: Lorenzo Martini

There is a difference in behavior between v1 writes and v2 writes in the order of events when configuring the file writer and the committer.

v1:
# writer.prepareWrite()
# committer.setupJob()

v2:
# committer.setupJob()
# writer.prepareWrite()

This happens because the `prepareWrite()` call (the one that performs `job.setOutputFormatClass(classOf[ParquetOutputFormat[Row]])`) runs as part of `createWriteJobDescription`, which is a `lazy val` inside `toBatch` and is therefore only evaluated after `committer.setupJob()` at the end of `toBatch`.

This causes issues when setting up the committer, as some job configuration may still be missing. For example, the aforementioned output format class is not yet set, so the committer is configured for a generic write instead of a Parquet write.

The fix is very simple: make the `createWriteJobDescription` call non-lazy.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
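The ordering bug described above can be illustrated with a minimal, self-contained sketch. The names below (`Job`, `prepareWrite`, `setupJob`, `toBatch`) are hypothetical stand-ins for the Spark internals, not the actual classes; the point is only how a `lazy val` defers a side effect past a later call that depends on it.

```scala
// Minimal sketch (hypothetical names, not Spark's real classes) of how a
// lazy val job description defers prepareWrite() past committer setup.
object LazyOrderingDemo {
  private val events = scala.collection.mutable.ArrayBuffer[String]()

  // A stand-in for the Hadoop Job whose output format the writer configures.
  class Job { var outputFormat: Option[String] = None }

  def prepareWrite(job: Job): Unit = {
    events += "prepareWrite"
    job.outputFormat = Some("parquet") // analogous to job.setOutputFormatClass(...)
  }

  def setupJob(job: Job): Unit = {
    // The committer inspects the job; with no output format set yet,
    // it falls back to a generic configuration.
    events += s"setupJob(format=${job.outputFormat.getOrElse("generic")})"
  }

  // lazyDescription = true mimics the buggy v2 path; false mimics the fix.
  def toBatch(lazyDescription: Boolean): Seq[String] = {
    events.clear()
    val job = new Job
    if (lazyDescription) {
      lazy val description = { prepareWrite(job); "description" } // deferred
      setupJob(job)         // runs first, sees a generic job
      events += description // lazy val is only forced here
    } else {
      val description = { prepareWrite(job); "description" }      // eager
      setupJob(job)         // now sees the parquet output format
      events += description
    }
    events.toSeq
  }

  def main(args: Array[String]): Unit = {
    println(toBatch(lazyDescription = true).mkString(", "))
    // setupJob(format=generic), prepareWrite, description
    println(toBatch(lazyDescription = false).mkString(", "))
    // prepareWrite, setupJob(format=parquet), description
  }
}
```

With the lazy description, `setupJob` observes a job with no output format and configures a generic committer; making the description eager restores the v1 ordering.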