GitHub user steveloughran opened a pull request:

    https://github.com/apache/spark/pull/18979

    [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsTracker metrics 
collection fails if a new file isn't yet visible

    ## What changes were proposed in this pull request?
    
    `BasicWriteTaskStatsTracker.getFileSize()` now catches 
`FileNotFoundException`, logs it at info level, and returns 0 as the file size.
    
    This ensures that if a newly created file isn't yet visible because the 
store does not always provide create consistency, metric collection doesn't 
cause a failure.
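
A minimal sketch of the behaviour described above, in plain Java so that it runs without Spark or Hadoop on the classpath. It is not the code in the patch: `FileSizeSketch` and `SizeLookup` are hypothetical names, and `SizeLookup` stands in for whatever file system call reports the length of a file.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.logging.Logger;

class FileSizeSketch {
    private static final Logger LOG = Logger.getLogger("FileSizeSketch");

    /** Hypothetical stand-in for a file system length lookup. */
    interface SizeLookup {
        long length(String path) throws IOException;
    }

    /** Returns the file length, or 0 if the file is not visible yet. */
    static long getFileSize(SizeLookup fs, String path) throws IOException {
        try {
            return fs.length(path);
        } catch (FileNotFoundException e) {
            // Only this exception is swallowed; any other IOException propagates.
            LOG.info("File " + path + " is not yet visible: " + e);
            return 0L;
        }
    }

    public static void main(String[] args) throws IOException {
        SizeLookup visible = p -> 42L;
        SizeLookup missing = p -> { throw new FileNotFoundException(p); };
        System.out.println(getFileSize(visible, "part-0000")); // prints 42
        System.out.println(getFileSize(missing, "part-0001")); // prints 0
    }
}
```

Catching only `FileNotFoundException` keeps every other I/O failure visible to the caller.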
    
    ## How was this patch tested?
    
    New test suite included, `BasicWriteTaskStatsTrackerSuite`. It checks 
resilience to missing files and also verifies the existing logic for gathering 
file statistics.
    
    Note that in the current implementation:
    
    1. If you call `Tracker.getFinalStats()` more than once, the file size 
count will increase by the size of the last file. This could be fixed by clearing 
the filename field inside `getFinalStats()` itself.
    
    2. If you pass an empty or null string to `Tracker.newFile(path)`, an 
`IllegalArgumentException` is raised, but only in `getFinalStats()` rather than 
in `newFile`. There's a test for this behaviour in the new suite, as it 
verifies that only FNFEs get swallowed.
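
The two notes above can be illustrated with a small stand-alone model. This is a sketch under assumptions, not the Spark class: `TrackerSketch`, the in-memory `sizes` map (a stand-in for the file system), the `pending` flag, and the `clearOnFinal` switch are all hypothetical, and `getFinalStats()` returns a plain byte count rather than a stats object.

```java
import java.util.HashMap;
import java.util.Map;

class TrackerSketch {
    private final Map<String, Long> sizes; // hypothetical stand-in for the file system
    private final boolean clearOnFinal;    // true = apply the fix suggested in note 1
    private String curFile = null;         // last path passed to newFile
    private boolean pending = false;       // true while curFile is still to be counted
    private long numBytes = 0L;

    TrackerSketch(Map<String, Long> sizes, boolean clearOnFinal) {
        this.sizes = sizes;
        this.clearOnFinal = clearOnFinal;
    }

    private long getFileSize(String path) {
        if (path == null || path.isEmpty()) {
            throw new IllegalArgumentException("empty or null path");
        }
        return sizes.getOrDefault(path, 0L); // a missing file counts as 0 bytes
    }

    void newFile(String path) {
        if (pending) {
            numBytes += getFileSize(curFile); // count the previous file
        }
        curFile = path; // not validated here, so bad paths surface later (note 2)
        pending = true;
    }

    long getFinalStats() {
        if (pending) {
            numBytes += getFileSize(curFile);
            if (clearOnFinal) {
                curFile = null; // cleared, so a repeated call adds nothing
                pending = false;
            }
        }
        return numBytes;
    }

    public static void main(String[] args) {
        Map<String, Long> sizes = new HashMap<>();
        sizes.put("a", 10L);
        sizes.put("b", 5L);

        TrackerSketch current = new TrackerSketch(sizes, false);
        current.newFile("a");
        current.newFile("b");
        System.out.println(current.getFinalStats()); // prints 15
        System.out.println(current.getFinalStats()); // prints 20: "b" counted twice

        TrackerSketch cleared = new TrackerSketch(sizes, true);
        cleared.newFile("a");
        cleared.newFile("b");
        System.out.println(cleared.getFinalStats()); // prints 15
        System.out.println(cleared.getFinalStats()); // prints 15
    }
}
```

In this model, `newFile("")` returns normally and the `IllegalArgumentException` only appears on the later `getFinalStats()` call, which mirrors note 2.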

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/steveloughran/spark 
cloud/SPARK-21762-missing-files-in-metrics

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18979.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18979
    
----
commit 8ad28b9bcd6a56b963ab57a5b4937d10f492de33
Author: Steve Loughran <ste...@hortonworks.com>
Date:   2017-08-17T19:35:35Z

    SPARK-21762 handle FNFE events in BasicWriteStatsTracker; add a suite of 
tests for various file states.
    
    Change-Id: I3269cb901a38b33e399ebef10b2dbcd51ccf9b75

commit 2a113fde1653743a3543df8ada395f320b826a3e
Author: Steve Loughran <ste...@hortonworks.com>
Date:   2017-08-17T20:01:50Z

    SPARK-21762 add tests for "" and null filenames
    
    Change-Id: I38ac11c808849e2fd91f4931f4cb5cdfad43e2af

----

