[jira] [Commented] (SPARK-21762) FileFormatWriter/BasicWriteTaskStatsTracker metrics collection fails if a new file isn't yet visible
[ https://issues.apache.org/jira/browse/SPARK-21762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16201418#comment-16201418 ]

Dongjoon Hyun commented on SPARK-21762:
---

Since this is a regression like SPARK-22258, I updated the priority.

> FileFormatWriter/BasicWriteTaskStatsTracker metrics collection fails if a new file isn't yet visible
>
>                 Key: SPARK-21762
>                 URL: https://issues.apache.org/jira/browse/SPARK-21762
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>         Environment: object stores without complete creation consistency
>                      (this includes AWS S3's caching of negative GET results)
>            Reporter: Steve Loughran
>
> The metrics collection of SPARK-20703 can trigger premature failure if the
> newly written object isn't actually visible yet, that is, if after
> {{writer.close()}} a call to {{getFileStatus(path)}} throws a
> {{FileNotFoundException}}.
>
> Strictly speaking, not having a file immediately visible goes against the
> fundamental expectations of the Hadoop FS APIs, namely fully consistent data &
> metadata across all operations, with immediate global visibility of all
> changes. However, not all object stores make that guarantee, be it for newly
> created data or for updated blobs. And so spurious FNFEs can get raised, ones
> which *should* have gone away by the time the actual task is committed. If
> they haven't, the job is in deep trouble anyway.
>
> What to do?
> # Leave as is: fail fast, and so catch blobstores/blobstore clients which
>   don't behave as required. One open issue here: will that trigger retries,
>   and what happens then?
> # Swallow the FNFE and hope the file is observable later.
> # Swallow all IOEs and hope that whatever problem the FS has is transient.
>
> Options 2 & 3 aren't going to collect metrics in the event of a FNFE, or at
> least, not the counter of bytes written.
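
Option 2 in the list above might look roughly like the following. This is only a sketch, not the code from the pull request: it uses {{java.nio.file}} as a stand-in for Hadoop's FileSystem so that it runs without dependencies, with {{Files.size}} playing the role of {{getFileStatus(path).getLen()}} and {{NoSuchFileException}} playing the role of the FNFE. The class and method names ({{LenientFileSize}}, {{sizeIfVisible}}) are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.OptionalLong;

// Hypothetical helper, not Spark code: illustrates option 2 only.
final class LenientFileSize {
    private LenientFileSize() {}

    /**
     * Returns the size of the file if it is visible, or an empty value if the
     * store reports that it does not exist (yet). Any other IOException still
     * propagates, which is what separates option 2 from option 3.
     */
    static OptionalLong sizeIfVisible(Path path) throws IOException {
        try {
            return OptionalLong.of(Files.size(path));
        } catch (NoSuchFileException e) {
            // Assumption: the file was just written and is merely not visible
            // yet, so skip the byte counter rather than fail the task.
            return OptionalLong.empty();
        }
    }

    public static void main(String[] args) throws IOException {
        // "part-00000" is an arbitrary example file name.
        Path p = Paths.get(args.length > 0 ? args[0] : "part-00000");
        OptionalLong size = sizeIfVisible(p);
        System.out.println(size.isPresent()
                ? "bytes written: " + size.getAsLong()
                : "file not visible yet; byte count skipped");
    }
}
```

Returning an empty value instead of zero keeps "not measured" distinct from "zero bytes written", which matches the note that options 2 & 3 lose the bytes-written counter when a FNFE occurs.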
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21762) FileFormatWriter/BasicWriteTaskStatsTracker metrics collection fails if a new file isn't yet visible
[ https://issues.apache.org/jira/browse/SPARK-21762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144045#comment-16144045 ]

Apache Spark commented on SPARK-21762:
--

User 'steveloughran' has created a pull request for this issue:
https://github.com/apache/spark/pull/18979

> FileFormatWriter/BasicWriteTaskStatsTracker metrics collection fails if a new file isn't yet visible
>
>                 Key: SPARK-21762
>                 URL: https://issues.apache.org/jira/browse/SPARK-21762
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>         Environment: object stores without complete creation consistency
>                      (this includes AWS S3's caching of negative GET results)
>            Reporter: Steve Loughran
>            Priority: Minor
[jira] [Commented] (SPARK-21762) FileFormatWriter/BasicWriteTaskStatsTracker metrics collection fails if a new file isn't yet visible
[ https://issues.apache.org/jira/browse/SPARK-21762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131190#comment-16131190 ]

Steve Loughran commented on SPARK-21762:

SPARK-20703 simplifies this, especially testing, as it's isolated from FileFormatWriter. Same problem exists though: if you are getting any Create inconsistency, metrics probes trigger failures which may not be present by the time task commit actually takes place.