GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/20763
[SPARK-23523] [SQL] [BACKPORT-2.3] Fix the incorrect result caused by the rule OptimizeMetadataOnlyQuery

This PR backports https://github.com/apache/spark/pull/20684 and https://github.com/apache/spark/pull/20693 to the Spark 2.3 branch.

---

## What changes were proposed in this pull request?

```Scala
// Assumes `path` is a java.io.File pointing at an empty temporary directory.
val tablePath = new File(s"${path.getCanonicalPath}/cOl3=c/cOl1=a/cOl5=e")
Seq(("a", "b", "c", "d", "e")).toDF("cOl1", "cOl2", "cOl3", "cOl4", "cOl5")
  .write.json(tablePath.getCanonicalPath)

val df = spark.read.json(path.getCanonicalPath).select("CoL1", "CoL5", "CoL3").distinct()
df.show()
```

It generates a wrong result:

```
[c,e,a]
```

The expected result is `[a,e,c]`, since the selected columns `CoL1`, `CoL5`, `CoL3` hold `a`, `e`, `c` respectively. We have a bug in the rule `OptimizeMetadataOnlyQuery`. We should respect the attribute order in the original leaf node. This PR fixes it.

## How was this patch tested?

Added a test case.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark backport23523

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20763.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20763

----
commit b47f1d4243ec72eeab69ae619c35bbbd9f9f2e6d
Author: gatorsmile <gatorsmile@...>
Date:   2018-02-27T16:44:25Z

    [SPARK-23523][SQL] Fix the incorrect result caused by the rule OptimizeMetadataOnlyQuery

    ## What changes were proposed in this pull request?

    ```Scala
    val tablePath = new File(s"${path.getCanonicalPath}/cOl3=c/cOl1=a/cOl5=e")
    Seq(("a", "b", "c", "d", "e")).toDF("cOl1", "cOl2", "cOl3", "cOl4", "cOl5")
      .write.json(tablePath.getCanonicalPath)

    val df = spark.read.json(path.getCanonicalPath).select("CoL1", "CoL5", "CoL3").distinct()
    df.show()
    ```

    It generates a wrong result.

    ```
    [c,e,a]
    ```

    We have a bug in the rule `OptimizeMetadataOnlyQuery`. We should respect the attribute order in the original leaf node. This PR is to fix it.
    ## How was this patch tested?

    Added a test case

    Author: gatorsmile <gatorsm...@gmail.com>

    Closes #20684 from gatorsmile/optimizeMetadataOnly.

commit c0ac5ef3a1f00eee44dd50be925f983be852fe96
Author: Xingbo Jiang <xingbo.jiang@...>
Date:   2018-02-28T20:16:26Z

    [SPARK-23523][SQL][FOLLOWUP] Minor refactor of OptimizeMetadataOnlyQuery

    ## What changes were proposed in this pull request?

    Inside `OptimizeMetadataOnlyQuery.getPartitionAttrs`, avoid using `zip` to generate the attribute map. Also include other minor updates to comments and formatting.

    ## How was this patch tested?

    Existing test cases.

    Author: Xingbo Jiang <xingbo.ji...@databricks.com>

    Closes #20693 from jiangxb1987/SPARK-23523.

----
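For readers who want to see the ordering problem in isolation, here is a minimal sketch in plain Scala with no Spark dependency. It is not the Catalyst code touched by this PR. `Attr`, `AttributeOrderSketch`, `relationOutput`, and `select` are hypothetical names, and the order of `relationOutput` is an assumption chosen for illustration. The idea is only that a row of partition values is laid out in partition-directory order, so the attribute list used as its header must be in the same order; resolving each partition column by name keeps the two aligned.

```Scala
// Hypothetical stand-in for a Catalyst attribute: just a column name.
final case class Attr(name: String)

object AttributeOrderSketch {
  // Partition columns and their values in directory order: cOl3=c/cOl1=a/cOl5=e.
  val partitionCols: Seq[String] = Seq("cOl3", "cOl1", "cOl5")
  val partitionRow: Seq[String] = Seq("c", "a", "e")

  // ASSUMPTION: the relation exposes the same attributes in a different order.
  val relationOutput: Seq[Attr] = Seq(Attr("cOl1"), Attr("cOl3"), Attr("cOl5"))

  // Mismatched header: relation order used to label a row in partition order.
  val mismatchedHeader: Seq[Attr] = relationOutput

  // Aligned header: resolve each partition column by name (case-insensitively),
  // so the header follows the same order as the values in the row.
  val alignedHeader: Seq[Attr] = partitionCols.map { col =>
    relationOutput
      .find(_.name.equalsIgnoreCase(col))
      .getOrElse(sys.error(s"partition column $col not found"))
  }

  // Case-insensitive projection of one row, like select("CoL1", "CoL5", "CoL3").
  def select(header: Seq[Attr], row: Seq[String], cols: Seq[String]): Seq[String] =
    cols.map { c =>
      val i = header.indexWhere(_.name.equalsIgnoreCase(c))
      require(i >= 0, s"column $c not found")
      row(i)
    }

  val requested: Seq[String] = Seq("CoL1", "CoL5", "CoL3")
  val wrong: Seq[String] = select(mismatchedHeader, partitionRow, requested)
  val right: Seq[String] = select(alignedHeader, partitionRow, requested)

  def main(args: Array[String]): Unit = {
    println(wrong.mkString("[", ",", "]")) // [c,e,a]
    println(right.mkString("[", ",", "]")) // [a,e,c]
  }
}
```

Under the assumed ordering, the mismatched header yields the `[c,e,a]` reported in the PR description, while the name-based lookup yields `[a,e,c]`.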