steveloughran commented on pull request #33332:
URL: https://github.com/apache/spark/pull/33332#issuecomment-881396754


   First: you are using the v2 FileOutputCommitter.
   
   This is dangerous as it copies a task attempt's files to the destination during task commit. If for any reason a task is then re-executed, those files will not be cleaned up. Unless you are 100% sure that the task files are always identical and you are happy with intermingled file output from two task attempts, it's not safe on *any* filesystem.
   
   On S3 it is also really slow. V1 isn't safe there either, as it relies on an atomic directory rename.
   
   Switch to a zero-rename committer, for both safety and performance. (I treat all support calls related to the FileOutputCommitter on S3A as WONTFIX.)
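
   For reference, a minimal sketch of the switch, assuming Spark 3.x with the spark-hadoop-cloud module and Hadoop 3.1+ on the classpath (class and option names are the ones documented for the S3A committers; verify them against your versions):

   ```scala
   import org.apache.spark.sql.SparkSession

   // Sketch: route Spark SQL output through the S3A committers instead of
   // the FileOutputCommitter. Needs spark-hadoop-cloud and Hadoop 3.1+.
   val spark = SparkSession.builder()
     .appName("s3a-committer-sketch")
     // Bind Spark's commit protocol to Hadoop's PathOutputCommitter factory
     .config("spark.sql.sources.commitProtocolClass",
       "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
     .config("spark.sql.parquet.output.committer.class",
       "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
     // Hand s3a:// destinations to the S3A committer factory
     .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
       "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
     // Pick one of: "directory", "partitioned", "magic"
     .config("spark.hadoop.fs.s3a.committer.name", "directory")
     .getOrCreate()
   ```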
   
   Second: Looking at the output:
   
   ```
   21-07-12 11:58:38 INFO 
org.apache.spark.sql.execution.datasources.BasicWriteTaskStatsTracker: Expected 
1 files, but only saw 0. This could be due to the output format not writing 
empty files, or files being not immediately visible in the filesystem.
   21-07-12 11:58:38 INFO org.apache.spark.mapred.SparkHadoopMapRedUtil: No 
need to commit output of task because needsTaskCommit=false: 
attempt_xxxx836_0021_m_000000_617
   ```
   
   Line 2 there says the entire task attempt dir doesn't exist: that's not just the file missing, it's the parent dir.
   
   The parquet output writer *did not write a file*. Either no records at all were created, or close() was never called (remember: on S3, files don't get created until close() manifests them).
   
   Summary: worry more about needsTaskCommit() == false and why that is 
surfacing.
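
   For context, that probe in Hadoop's FileOutputCommitter is essentially a directory-existence check; roughly (a paraphrase of the logic, not the verbatim source):

   ```scala
   import org.apache.hadoop.fs.{FileSystem, Path}

   // Paraphrase of FileOutputCommitter.needsTaskCommit(): commit is only
   // needed if the task attempt directory exists, so a task whose attempt
   // dir was never created (or was deleted) reports needsTaskCommit=false.
   def needsTaskCommit(fs: FileSystem, taskAttemptPath: Path): Boolean =
     taskAttemptPath != null && fs.exists(taskAttemptPath)
   ```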
   
   
   That is independent of the committer; maybe it's S3-specific (on other stores the file would be empty (nothing written) or incomplete (not closed properly)), or task abort was called before the other probes (which deletes the dir...).
   
   Debugging strategies:
   
   * If you've set "spark.sql.maxConcurrentOutputFileWriters" to >= 1, set it to 0. That guarantees the non-concurrent codepath (see the sketch after this list). If the file is created then, but not in concurrent runs, it's maybe something to do with the concurrent write not being completed.
   * switch to an S3A committer and set org.apache.hadoop.fs.s3a.commit to log at debug. There's a lot of logging there.
   * turn on all logging of org.apache.hadoop.fs.s3a to see what's happening
   * use a different cluster store as the dest of the same job
   * turn on S3 bucket logging and see what happened there. We're adding some really good auditing to S3A, matching S3 REST calls to FS calls and job IDs; not yet shipping though.
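
   For the first two items, a sketch of the knobs, assuming an active SparkSession named `spark` and the log4j 1.x API that Spark 3.x still bundles (whether the SQL option can be changed at runtime may depend on your Spark version; spark-defaults.conf is the safer route):

   ```scala
   import org.apache.log4j.{Level, Logger}

   // Force the single-writer (non-concurrent) codepath while debugging
   spark.conf.set("spark.sql.maxConcurrentOutputFileWriters", "0")

   // Turn up S3A committer logging; the s3a root category is very verbose
   Logger.getLogger("org.apache.hadoop.fs.s3a.commit").setLevel(Level.DEBUG)
   Logger.getLogger("org.apache.hadoop.fs.s3a").setLevel(Level.DEBUG)
   ```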
   
   
   [Side issue: @HyukjinKwon have you ever thought of filing MAPREDUCE and PARQUET patches changing those internal methods tagged @Private to @LimitedPrivate("spark"), so those teams know what would break? It might give you some more stability.]
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
