Hi Spark users and developers,

I read the ticket [SPARK-8578] (Should ignore user defined output committer when appending data), which ignores DirectParquetOutputCommitter when append mode is selected. The rationale was that it is unsafe to use because a failed job cannot be reverted in append mode with DirectParquetOutputCommitter. Wouldn't it be better to allow users to use it at their own risk? Say, if you use DirectParquetOutputCommitter with append mode, the job fails immediately when a task fails. The user can then choose to reprocess the job entirely, which is not a big deal since failures are rare in most cases. Another approach is to provide at-least-once task semantics for append mode with DirectParquetOutputCommitter. This would produce duplicate records, but for some applications that is fine. A sketch of the opt-in I have in mind follows.
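To make this concrete, here is a minimal sketch. The committer config name and class path are the ones I know from Spark 1.x and may differ between releases; the "allowDirectCommitterOnAppend" flag is purely hypothetical and only stands in for whatever switch would let a user accept the risk:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{SQLContext, SaveMode}

    object DirectCommitterAppendSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("direct-committer-append")
        // Hypothetical flag: today Spark silently ignores the committer on
        // append; the proposal is a switch like this to opt in to the risk.
        conf.set("spark.sql.sources.allowDirectCommitterOnAppend", "true")
        // Committer class path as of Spark 1.4; may differ in other releases.
        conf.set("spark.sql.parquet.output.committer.class",
          "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
        val sc = new SparkContext(conf)
        val sqlContext = new SQLContext(sc)

        val df = sqlContext.read.parquet("s3n://bucket/input")
        // Under the proposal, a single task failure here would fail the whole
        // job immediately (no partial-revert attempt); the user either reruns
        // the job or accepts at-least-once semantics with duplicate records.
        df.write.mode(SaveMode.Append).parquet("s3n://bucket/output")
      }
    }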
The second issue is that the assumption that Overwrite mode works with DirectParquetOutputCommitter in all cases is wrong, at least when using it with S3. S3 provides eventual consistency for overwrite PUTs and DELETEs, so if you delete a directory and then recreate the same directory within a split second, the chance of hitting org.apache.hadoop.fs.FileAlreadyExistsException is very high: the delete does not take effect immediately, and creating the same file before the delete has propagated results in the above exception. Might I propose changing the code so that it actually OVERWRITEs the file instead of doing a delete followed by a create? A rough sketch of the idea is below.
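Here is the kind of change I mean, using the plain Hadoop FileSystem API. FileSystem.create(path, overwrite) is a real Hadoop method; the writeFile helper and the s3n paths are hypothetical and only stand in for Spark's actual write path:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object OverwriteSketch {
      // Hypothetical helper standing in for the real Parquet write path.
      def writeFile(fs: FileSystem, file: Path, bytes: Array[Byte]): Unit = {
        // Current behaviour (problematic on S3): delete, then create.
        //   fs.delete(file, true)
        //   val out = fs.create(file)  // may throw FileAlreadyExistsException,
        //                              // since the delete is eventually consistent
        // Proposed behaviour: one create call with overwrite = true, which maps
        // to a single overwrite PUT on S3 instead of a DELETE followed by a PUT.
        val out = fs.create(file, true) // overwrite = true
        try {
          out.write(bytes)
        } finally {
          out.close()
        }
      }

      def main(args: Array[String]): Unit = {
        val fs = FileSystem.get(new Configuration())
        writeFile(fs, new Path("s3n://bucket/output/part-00000.parquet"),
          "example".getBytes("UTF-8"))
      }
    }

Best Regards,
Jerry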