[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

steveloughran Wed, 01 Nov 2017 06:47:05 -0700

Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/19623#discussion_r148263405

--- Diff:
sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java
---
@@ -75,8 +82,10 @@
/**
* Aborts this writing job because some data writers are failed to write
the records and aborted,
* or the Spark job fails with some unknown reasons, or {@link
#commit(WriterCommitMessage[])}
- * fails. If this method fails(throw exception), the underlying data
source may have garbage that
- * need to be cleaned manually, but these garbage should not be visible
to data source readers.
+ * fails.
+ *
+ * If an exception was thrown, the underlying data source may have
garbage that need to be
+ * cleaned manually, but these garbage should not be visible to data
source readers.
--- End diff --

call it, because commit may fail for other reasons, e.g. the committer
permits the job to commit if the output is all to empty partitions, but fails
if there is data in any of the dest partitions. the abort() call triggers the
cleanup.

1. exceptions aborts after job commit failure. must of course be
caught/swallowed. Hadoop MR doesn't do that properly, which is something
someone should fix. Probably me.
1. ideally the committers should be doing their own cleanup if the job
fails, so they can be confident the cleanup always takes place. Again: catch &
swallow exceptions.
1. Failure in job commit is very much an "outcome is unknown" state.

Couple of references
* [s3a committer lifecycle
tests](https://github.com/steveloughran/hadoop/blob/s3guard/HADOOP-13786-committer/hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/commit/AbstractITCommitProtocol.java),
including OOO workflows (abort before commit, 2x commit task, 2x commit job,
commit job -> commit task).
* [SPARK-2984](https://issues.apache.org/jira/browse/SPARK-2984), where
underlying problem of recurrent issue is people trying to write to same dest
with >1 job, and finding all jobs after first one failing as the `__temporary`
dir has been deleted in cleanup(). Arguably, the job commit/cleanup is being
overaggressive about cleanup, but given how partition merging isn't atomic with
the classic fileoutput committers, its really a manifestation of people trying
to do something they shouldn't.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...

Reply via email to