GitHub user jerryshao opened a pull request: https://github.com/apache/spark/pull/17617
[SPARK-20244][Core] Handle get bytesRead from different thread in Hadoop RDD

## What changes were proposed in this pull request?

Hadoop FileSystem's statistics are based on thread-local variables. This works as long as the whole RDD computation chain runs in a single thread, but if a child RDD creates another thread to consume the iterator obtained from a Hadoop RDD, the bytesRead computation becomes wrong, because the iterator's `next()` and `close()` may then run in different threads. This can happen when using PySpark with PythonRDD. This patch therefore builds a map to track `bytesRead` per thread and sums the values together. The method could apply to three RDDs: `HadoopRDD`, `NewHadoopRDD`, and `FileScanRDD`; since I assume `FileScanRDD` cannot be called directly, only `HadoopRDD` and `NewHadoopRDD` are fixed here.

## How was this patch tested?

Unit test and local cluster verification.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jerryshao/apache-spark SPARK-20244

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17617.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17617

----
commit d6f3c42c74ab38b0b6becc80a80b5aeda4459c40
Author: jerryshao <ss...@hortonworks.com>
Date: 2017-04-12T06:22:15Z

    Handle get bytesRead from different thread

    Change-Id: I8e64393151ef3eef22b868f6ae47a48ecb8694d3
----
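As a minimal sketch of the bug and the fix described above (not Spark's actual code — the helper names and numbers here are hypothetical): a thread-local counter under-counts when reads happen in a child consumer thread, while a shared per-thread map summed across threads gives the correct total.

```python
import threading

# --- Buggy behaviour: a thread-local counter, standing in for
# Hadoop FileSystem's thread-local Statistics (hypothetical names).
local_stats = threading.local()

def record_read(n):
    local_stats.bytes_read = getattr(local_stats, "bytes_read", 0) + n

def thread_local_total():
    # Only sees the *calling* thread's counter.
    return getattr(local_stats, "bytes_read", 0)

# --- Fix sketched in the PR: track bytesRead per thread in a shared
# map and sum over all threads that touched the record reader.
all_stats = {}
lock = threading.Lock()

def record_read_tracked(n):
    tid = threading.get_ident()
    with lock:
        all_stats[tid] = all_stats.get(tid, 0) + n

def tracked_total():
    with lock:
        return sum(all_stats.values())

def consumer():
    # Child thread consumes the iterator, as PythonRDD does.
    record_read(100)
    record_read_tracked(100)

t = threading.Thread(target=consumer)
t.start()
t.join()

# e.g. close() accounts for trailing bytes in the original thread.
record_read(50)
record_read_tracked(50)

print(thread_local_total())  # 50  -- the child thread's 100 bytes are lost
print(tracked_total())       # 150 -- per-thread map sums correctly
```

The map-based approach costs a lock per update, but it makes the byte count independent of which thread happens to call `next()` or `close()`.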