GitHub user gengliangwang opened a pull request: https://github.com/apache/spark/pull/21004
[SPARK-23896][SQL] Improve PartitioningAwareFileIndex

## What changes were proposed in this pull request?

Currently `PartitioningAwareFileIndex` accepts an optional parameter `userPartitionSchema`. If provided, it will combine the inferred partition schema with the parameter.

However,
1. to get `userPartitionSchema`, we need to combine inferred partition schema with `userSpecifiedSchema`
2. to get the inferred partition schema, we have to create a temporary file index.

Only after that, a final version of `PartitioningAwareFileIndex` can be created.

This can be improved by passing `userSpecifiedSchema` to `PartitioningAwareFileIndex`.

With the improvement, we can reduce redundant code and avoid parsing the file partition twice.

## How was this patch tested?

Unit test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gengliangwang/spark PartitioningAwareFileIndex

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21004.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21004

----
commit 35aff24743ff13ccd370a8e3747a3044e8a671c9
Author: Gengliang Wang <gengliang.wang@...>
Date:   2018-04-08T18:19:48Z

    improve PartitioningAwareFileIndex

----
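
For readers unfamiliar with the change, the idea under "What changes were proposed" can be illustrated with a small standalone sketch. This is not the code in the patch: `Field`, `PartitionSchemaSketch`, and `resolvePartitionSchema` are hypothetical names, the types are simplified stand-ins for Spark's schema classes, and matching columns by exact name is an assumption. It only shows how, once the file index receives `userSpecifiedSchema` directly, the partition schema might be resolved in a single pass without building a temporary index first.

```scala
// Simplified stand-in for a schema field; NOT Spark's StructField.
final case class Field(name: String, dataType: String)

object PartitionSchemaSketch {
  // Hypothetical helper: for every partition column inferred from the
  // directory layout, prefer the data type given in the user-specified
  // schema when a column with the same name exists there (assumption:
  // exact, case-sensitive name match). Columns the user did not mention
  // keep their inferred type.
  def resolvePartitionSchema(
      inferred: Seq[Field],
      userSpecified: Option[Seq[Field]]): Seq[Field] = {
    val userTypeByName: Map[String, String] =
      userSpecified.getOrElse(Seq.empty).map(f => f.name -> f.dataType).toMap
    inferred.map { f =>
      f.copy(dataType = userTypeByName.getOrElse(f.name, f.dataType))
    }
  }
}
```

With inferred partition columns `year: int, month: int` and a user-specified schema containing `id: long, year: string`, the helper would keep `month` as inferred and take the type of `year` from the user's schema. Columns that appear only in the user-specified schema, such as `id`, are not partition columns and are left out of the result.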