GitHub user sahilTakiar opened a pull request:

    https://github.com/apache/spark/pull/19413

    [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration throws NPE

    ## What changes were proposed in this pull request?
    
    Fix for SPARK-20466; the full description of the issue is in the JIRA. To 
summarize, `HadoopRDD` uses a metadata cache to cache `JobConf` objects. The 
cache uses soft references, which means the JVM can evict entries from the 
cache whenever there is GC pressure. `HadoopRDD#getJobConf` had a bug: it 
first checked whether the cache contained the `JobConf` and, if it did, 
fetched the `JobConf` from the cache and returned it. This check-then-get 
sequence is not safe with soft references, because the JVM can evict the 
entry between the existence check and the get call. The get then returns 
`null`, which leads to the NPE in `HadoopRDD#addLocalConfiguration`.
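
    The safe pattern described above can be sketched roughly as follows. This 
is a minimal, self-contained Java illustration, not the code in this patch: 
the class `SoftCacheSketch`, its methods, and the string stand-ins for 
`JobConf` are all hypothetical, and the explicit `clear()` call only simulates 
the JVM reclaiming a soft reference under GC pressure.

```java
import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical stand-in for a soft-reference metadata cache (not Spark's actual class).
class SoftCacheSketch {
    static final Map<String, SoftReference<Object>> CACHE = new ConcurrentHashMap<>();

    static void put(String key, Object value) {
        CACHE.put(key, new SoftReference<>(value));
    }

    // Single lookup: returns null if the key is absent or the value was reclaimed.
    static Object getCached(String key) {
        SoftReference<Object> ref = CACHE.get(key);
        return ref == null ? null : ref.get();
    }

    // Safe pattern: read once, null-check the result, rebuild on a miss.
    // There is no separate "contains" check that could go stale before the get.
    static Object getOrCreate(String key, Supplier<Object> factory) {
        Object cached = getCached(key);
        if (cached != null) {
            return cached;
        }
        Object fresh = factory.get();
        put(key, fresh);
        return fresh;
    }

    public static void main(String[] args) {
        put("jobConf", "conf-v1");
        // Simulate the GC clearing the soft reference after the entry was cached.
        CACHE.get("jobConf").clear();
        // A check-then-get caller would see the key and then read null here;
        // the single-read pattern rebuilds the value instead.
        System.out.println(getOrCreate("jobConf", () -> "conf-v2")); // prints conf-v2
    }
}
```

    The point of the sketch is that the caller holds a strong reference to the 
value before deciding whether it is usable, so a concurrent eviction can only 
cause a rebuild, never a `null` result.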
    
    ## How was this patch tested?
    
    I haven't thought of a good way to test this yet, given that the issue only 
occurs intermittently and under high GC pressure. I was thinking of using mocks 
to verify that `#getJobConf` does the right thing. I also deleted the method 
`HadoopRDD#containsCachedMetadata` so that we don't hit this issue again.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sahilTakiar/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19413.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19413
    
----
commit 680f32c311e33784e11763c109488d528178efc8
Author: Sahil Takiar <stak...@cloudera.com>
Date:   2017-10-02T20:44:23Z

    [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration throws NPE

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
