steveloughran commented on pull request #33332: URL: https://github.com/apache/spark/pull/33332#issuecomment-881396754
First: you are using the V2 FileOutputCommitter. This is dangerous, as it copies a task attempt's files during task commit. If for any reason a task is then re-executed, those files will not be cleaned up. Unless you are 100% sure that the task files are identical and you are happy with intermingled output from two task attempts, it's not safe on *any* filesystem. On S3 it is also really slow. V1 isn't safe there either, as it relies on atomic directory rename. Switch to a zero-rename committer, for both safety and performance. (I treat all support calls related to the FileOutputCommitter on S3A as WONTFIX.)

Second: looking at the output:
```
21-07-12 11:58:38 INFO org.apache.spark.sql.execution.datasources.BasicWriteTaskStatsTracker: Expected 1 files, but only saw 0. This could be due to the output format not writing empty files, or files being not immediately visible in the filesystem.
21-07-12 11:58:38 INFO org.apache.spark.mapred.SparkHadoopMapRedUtil: No need to commit output of task because needsTaskCommit=false: attempt_xxxx836_0021_m_000000_617
```
Line 2 there says the entire task attempt dir doesn't exist; that's not just the file, it's the parent dir. The Parquet output writer *did not write a file*. Either no records at all were created, or close() was never called (remember: on S3, files don't get created until close() manifests them).

Summary: worry more about needsTaskCommit() == false and why that is surfacing. That is independent of the committer. Maybe it is S3-specific (on other stores the file would be empty (nothing written) or incomplete (not closed properly)), or task abort was called before the other probes (which deletes the dir...).

Debugging strategies:
* If you've set "spark.sql.maxConcurrentOutputFileWriters" to >= 1, set it to 0. That guarantees the non-concurrent codepath is used. If the file is created then, but not in concurrent runs, it may be something to do with the concurrent write not being completed.
* Switch to an S3A committer and set org.apache.hadoop.fs.s3a.commit to log at debug.
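To make those two suggestions concrete, here is a rough sketch of the relevant settings. This assumes Spark 3.2+ with the `spark-hadoop-cloud` module on the classpath; check the Spark cloud-integration and Hadoop S3A committer docs for the exact names in your versions.

```properties
# spark-defaults.conf (sketch, not a definitive recipe)

# Disable the concurrent-writers codepath while debugging
spark.sql.maxConcurrentOutputFileWriters  0

# Bind Spark's commit protocol to the S3A committers instead of FileOutputCommitter
spark.sql.sources.commitProtocolClass     org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class  org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter

# Pick a zero-rename committer: "magic", "directory" or "partitioned"
spark.hadoop.fs.s3a.committer.name        magic
```

And the debug logging, for a log4j 1.x setup:

```properties
# log4j.properties
log4j.logger.org.apache.hadoop.fs.s3a.commit=DEBUG
```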
There's a lot of logging there.
* Turn on all logging of org.apache.hadoop.fs.s3a to see what's happening.
* Use a different cluster store as the destination of the same job.
* Turn on S3 bucket logging and see what happened there. We're adding some really good auditing there, matching S3 REST calls to FS calls and job IDs. Not yet shipping, though.

[Side issue: @HyukjinKwon have you ever thought of filing MAPREDUCE and PARQUET patches changing those internal methods tagged @Private to @LimitedPrivate("spark"), so those teams know what would break? It might give you some more stability.]

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org