GitHub user raofu opened a pull request: https://github.com/apache/spark/pull/22157
[SPARK-25126] Avoid creating Reader for all ORC files

## What changes were proposed in this pull request?

In OrcFileOperator.readSchema, a Reader is created for every file although only the first valid one is used. This uses a significant amount of memory when `paths` contains many files. In 2.3, a different code path, OrcUtils.readSchema, is used for inferring the schema of ORC files. This commit changes both functions to create Readers lazily.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/raofu/spark SPARK-25126

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22157.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22157

----

commit 5a86b3618da695431c01ddbe4bb102a45f93b3b1
Author: Rao Fu <rao@...>
Date: 2018-08-17T23:40:05Z

    [SPARK-25126] Avoid creating Reader for all ORC files
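The lazy-creation idea behind the change can be sketched as follows. This is a minimal illustration, not the actual Spark code: `tryReadSchema` and the file names are hypothetical stand-ins for opening an ORC Reader and probing its schema.

```scala
// Sketch of lazy Reader creation: only the first readable file should
// actually trigger the expensive work of opening a Reader.
object LazyFirstSketch {
  // Hypothetical stand-in: pretend only ".orc" files yield a schema.
  def tryReadSchema(path: String): Option[String] =
    if (path.endsWith(".orc")) Some(s"schema-of-$path") else None

  def firstSchema(paths: Seq[String]): Option[String] =
    // The eager version, `paths.map(tryReadSchema)`, would evaluate
    // tryReadSchema for every path. Going through an iterator makes the
    // pipeline lazy, so collectFirst stops after the first success.
    paths.iterator.map(tryReadSchema).collectFirst { case Some(s) => s }

  def main(args: Array[String]): Unit =
    println(firstSchema(Seq("part-0.txt", "part-1.orc", "part-2.orc")))
    // prints Some(schema-of-part-1.orc)
}
```

With the eager `map`, every path would be touched even though only the first valid result is kept; the iterator-based version short-circuits, which is the memory saving the PR describes.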