GitHub user JoshRosen opened a pull request: https://github.com/apache/spark/pull/15245
[SPARK-17666] Ensure that RecordReaders are closed by data source file scans

## What changes were proposed in this pull request?

This patch addresses a potential cause of resource leaks in data source file scans. As reported in [SPARK-17666](https://issues.apache.org/jira/browse/SPARK-17666), tasks which do not fully consume their input may cause file handles / network connections (e.g. S3 connections) to be leaked. Spark's `NewHadoopRDD` uses a TaskContext callback to [close its record readers](https://github.com/apache/spark/blame/master/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L208), but the new data source file scans will only close record readers once their iterators are fully consumed.

This patch modifies `RecordReaderIterator` and `HadoopFileLinesReader` to add `close()` methods and modifies all six implementations of `FileFormat.buildReader()` to register TaskContext task completion callbacks to guarantee that cleanup is eventually performed.

## How was this patch tested?

Tested manually for now.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark SPARK-17666-close-recordreader

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15245.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15245

----
commit d804e025c2b4a8799f38f2f67beba1d12e224180
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-09-26T20:28:12Z

    Add close() to RecordReaderIterator and HadoopLinesReader

commit e4b8577ed71a30f4ad220cd1a2f19a8edd596c64
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-09-26T20:29:24Z

    Register close() callbacks in all implementations of FileFormat.buildReader()

----
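The close-on-task-completion pattern described in the pull request can be illustrated with a small, Spark-free sketch. Everything here is an assumption made for illustration: `FakeTaskContext`, `CountingReader`, `ClosingIterator`, and `LeakDemo` are hypothetical names, not classes from the patch, and the real change registers its callbacks on Spark's own `TaskContext`. The sketch only shows why an idempotent `close()` plus a completion callback releases the reader even when a task stops reading early.

```scala
import java.io.Closeable
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a task context: collects callbacks and runs
// them when the task finishes, whether or not its input was drained.
final class FakeTaskContext {
  private val listeners = ArrayBuffer.empty[() => Unit]
  def addTaskCompletionListener(f: () => Unit): Unit = listeners += f
  def markTaskCompleted(): Unit = listeners.foreach(f => f())
}

// Hypothetical reader that counts how many times it has been closed.
final class CountingReader(lines: Seq[String]) extends Closeable {
  private val underlying = lines.iterator
  var closeCount = 0
  def hasNext: Boolean = underlying.hasNext
  def next(): String = underlying.next()
  override def close(): Unit = closeCount += 1
}

// Iterator that closes its reader on exhaustion and also exposes an
// idempotent close(), so a completion callback can release it early.
final class ClosingIterator(private var reader: CountingReader)
    extends Iterator[String] with Closeable {

  override def hasNext: Boolean =
    reader != null && {
      val more = reader.hasNext
      if (!more) close() // fully consumed: release the reader right away
      more
    }

  override def next(): String = {
    if (!hasNext) throw new NoSuchElementException("end of stream")
    reader.next()
  }

  override def close(): Unit =
    if (reader != null) {
      // Null out the reference even if close() throws, so later calls are no-ops.
      try reader.close() finally reader = null
    }
}

object LeakDemo {
  private def scan(consume: ClosingIterator => Unit): CountingReader = {
    val reader = new CountingReader(Seq("a", "b", "c"))
    val ctx = new FakeTaskContext
    val iter = new ClosingIterator(reader)
    // Register cleanup up front so it runs regardless of how much is read.
    ctx.addTaskCompletionListener(() => iter.close())
    consume(iter)
    ctx.markTaskCompleted()
    reader
  }

  // Task reads a single record and stops, as a limit-style query might.
  def partialScan(): CountingReader = scan(it => { it.next(); () })

  // Task drains the iterator; the later callback must not close twice.
  def fullScan(): CountingReader = scan(it => { it.toList; () })
}
```

In both `LeakDemo.partialScan()` and `LeakDemo.fullScan()` the reader ends up closed exactly once. Without the registered callback, the partial scan would leave the reader open, which is the kind of leak the pull request describes.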