GitHub user jerryshao opened a pull request:

    https://github.com/apache/spark/pull/17617

    [SPARK-20244][Core] Handle get bytesRead from different thread in Hadoop RDD

    ## What changes were proposed in this pull request?
    
    Hadoop FileSystem's statistics are based on thread-local variables, which 
works as long as the whole RDD computation chain runs in a single thread. But if 
a child RDD creates another thread to consume the iterator obtained from a 
Hadoop RDD, the bytesRead computation will be wrong, because the iterator's 
`next()` and `close()` may then run in different threads. This can happen when 
using PySpark with PythonRDD.
    
    So this patch builds a map to track `bytesRead` per thread and sums the 
entries together. This approach is relevant to three RDDs: `HadoopRDD`, 
`NewHadoopRDD`, and `FileScanRDD`. Since I assume `FileScanRDD` cannot be 
called directly, I only fixed `HadoopRDD` and `NewHadoopRDD`.
    
    ## How was this patch tested?
    
    Unit tests and verification on a local cluster.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jerryshao/apache-spark SPARK-20244

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17617.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17617
    
----
commit d6f3c42c74ab38b0b6becc80a80b5aeda4459c40
Author: jerryshao <ss...@hortonworks.com>
Date:   2017-04-12T06:22:15Z

    Handle get bytesRead from different thread
    
    Change-Id: I8e64393151ef3eef22b868f6ae47a48ecb8694d3

----

