[GitHub] spark issue #22157: [SPARK-25126][SQL] Avoid creating Reader for all orc fil...
Github user raofu commented on the issue: https://github.com/apache/spark/pull/22157 @dongjoon-hyun, thanks a lot for the pointers! I've updated the PR description. Please let me know if there is any other information you'd like me to add. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22157: [SPARK-25126][SQL] Avoid creating Reader for all orc fil...
Github user raofu commented on the issue: https://github.com/apache/spark/pull/22157 @dongjoon-hyun Title updated. Thanks for adding the test coverage! I've merged your commit. Can you help kick off another Jenkins run? I don't think I have permission to do it.
[GitHub] spark issue #22157: [SPARK-25126] Avoid creating Reader for all orc files
Github user raofu commented on the issue: https://github.com/apache/spark/pull/22157 I fixed the test by making the first file the corrupted file. @srowen, can you help kick off a Jenkins run?
[GitHub] spark pull request #22157: [SPARK-25126] Avoid creating Reader for all orc f...
GitHub user raofu opened a pull request: https://github.com/apache/spark/pull/22157 [SPARK-25126] Avoid creating Reader for all orc files

## What changes were proposed in this pull request?

In OrcFileOperator.ReadSchema, a Reader is created for every file although only the first valid one is used. This uses a significant amount of memory when `paths` contains many files. In 2.3, a different code path, OrcUtils.readSchema, is used for inferring the schema of orc files. This commit changes both functions to create the Reader lazily.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/raofu/spark SPARK-25126

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22157.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22157

commit 5a86b3618da695431c01ddbe4bb102a45f93b3b1
Author: Rao Fu
Date: 2018-08-17T23:40:05Z

[SPARK-25126] Avoid creating Reader for all orc files

In OrcFileOperator.ReadSchema, a Reader is created for every file although only the first valid one is used. This uses a significant amount of memory when `paths` contains many files. In 2.3, a different code path, OrcUtils.readSchema, is used for inferring the schema of orc files. This commit changes both functions to create the Reader lazily.
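The memory issue described in the PR comes from Scala's strict collections: `Seq.map` builds a Reader for every path before the first result is ever inspected, while an `Iterator` evaluates on demand. A minimal sketch of the difference, assuming a hypothetical `readSchema` helper and counter that stand in for Spark's per-file Reader creation (not actual Spark code):

```scala
// Hypothetical stand-in for per-file Reader creation; the counter makes
// the number of Readers actually built observable.
var readersCreated = 0
def readSchema(path: String): Option[String] = {
  readersCreated += 1
  Some(s"schema-of-$path") // pretend every file yields a valid schema
}

val paths = Seq("part-0.orc", "part-1.orc", "part-2.orc")

// Strict: map builds a Reader for every file before headOption picks one.
readersCreated = 0
val eagerSchema = paths.map(readSchema).flatten.headOption
val eagerCount = readersCreated // all 3 paths were touched

// Lazy: an Iterator builds Readers on demand, so evaluation stops
// as soon as the first file yields a schema.
readersCreated = 0
val lazySchema = paths.iterator.map(readSchema).collectFirst { case Some(s) => s }
val lazyCount = readersCreated // only the first path was touched
```

With many files under `paths`, the lazy pipeline keeps at most one Reader alive at a time instead of materializing one per file.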
[GitHub] spark pull request #22113: [SPARK-25126] Lazily create Reader for orc files
Github user raofu commented on a diff in the pull request: https://github.com/apache/spark/pull/22113#discussion_r210473687

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala ---

    @@ -70,7 +70,7 @@ private[hive] object OrcFileOperator extends Logging {
           hdfsPath.getFileSystem(conf)
         }
    -    listOrcFiles(basePath, conf).iterator.map { path =>
    +    listOrcFiles(basePath, conf).view.map { path =>

--- End diff --

My bad. I misread the code. Sorry about the noise.
[GitHub] spark pull request #22113: [SPARK-25126] Lazily create Reader for orc files
Github user raofu closed the pull request at: https://github.com/apache/spark/pull/22113
[GitHub] spark pull request #22113: [SPARK-25126] Lazily create Reader for orc files
GitHub user raofu opened a pull request: https://github.com/apache/spark/pull/22113 [SPARK-25126] Lazily create Reader for orc files

## What changes were proposed in this pull request?

Currently a Reader is created for every orc file under the directory and then the first one with a non-empty schema is returned. Using a `view` creates each Reader lazily instead.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/raofu/spark SPARK-25126

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22113.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22113

commit 9f5aad0591b9912f5186cd2da8328b348eea5425
Author: Rao Fu
Date: 2018-08-15T20:20:45Z

[SPARK-25126] Lazily create Reader for orc files
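The `view` approach this PR proposes works because a view wraps the collection in a lazy layer: transformations such as `map` are deferred until an element is demanded, so a terminal operation like `find` short-circuits after the first hit. A small sketch under assumed names (`buildReader` is a hypothetical stand-in, not the real Spark helper):

```scala
// Hypothetical Reader construction; the counter records how many are built.
var built = 0
def buildReader(path: String): String = { built += 1; s"reader-for-$path" }

val files = List("a.orc", "b.orc", "c.orc")

// Without view: map is strict on List, so all three Readers are built
// before find examines even the first one.
built = 0
val strictHit = files.map(buildReader).find(_.nonEmpty)
val strictBuilt = built // 3

// With view: map is deferred, and find pulls elements one at a time,
// so only the first Reader is ever constructed.
built = 0
val lazyHit = files.view.map(buildReader).find(_.nonEmpty)
val lazyBuilt = built // 1
```

An `Iterator` (as in the follow-up PR #22157) gives the same on-demand behavior; the main practical difference is that a view can be traversed more than once while an iterator is single-pass.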