[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-155707374 Also backported to branch-1.6. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/9517
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-155706699 LGTM, merged to master. Thanks!
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-155309503 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/45489/ Test PASSed.
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-155309501 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-155309294 **[Test build #45489 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45489/consoleFull)** for PR 9517 at commit [`32dfb87`](https://github.com/apache/spark/commit/32dfb87ce36a093c54d4a3dfd39ccbc00c417af9). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-155281726 **[Test build #45489 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45489/consoleFull)** for PR 9517 at commit [`32dfb87`](https://github.com/apache/spark/commit/32dfb87ce36a093c54d4a3dfd39ccbc00c417af9).
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-155281161 Merged build started.
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-155281155 Merged build triggered.
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-155280757 I used `sortBy` instead of `sortWith`.
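For readers unfamiliar with the two methods, the following is a minimal sketch of the difference, not the code from the patch. `FileEntry` is a hypothetical stand-in for Hadoop's `FileStatus`, and the paths are made up for illustration.

```scala
object SortByVsSortWith {
  // Hypothetical stand-in for Hadoop's FileStatus; only the path matters here.
  final case class FileEntry(path: String)

  val files: Seq[FileEntry] = Seq(
    FileEntry("/base/b=2/part-r-00001.parquet"),
    FileEntry("/base/b=1/part-r-00000.parquet"),
    FileEntry("/base/_metadata")
  )

  // sortBy extracts a key and relies on the implicit Ordering[String].
  val byKey: Seq[FileEntry] = files.sortBy(_.path)

  // sortWith takes an explicit "less than" predicate over two elements.
  val byPredicate: Seq[FileEntry] = files.sortWith((l, r) => l.path < r.path)
}
```

Both calls produce the same sequence here; `sortBy` is simply the shorter way to say "order by this key".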
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-155053133 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-155053016 **[Test build #45359 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45359/consoleFull)** for PR 9517 at commit [`4f47063`](https://github.com/apache/spark/commit/4f4706352c84469503ae3c3388098458b570f62f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-155021593 **[Test build #45359 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45359/consoleFull)** for PR 9517 at commit [`4f47063`](https://github.com/apache/spark/commit/4f4706352c84469503ae3c3388098458b570f62f).
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-155020044 In this commit, I added partitioned tables for the test and sorted the `FileStatus`es. There are a couple of things to mention here. Firstly, we no longer need to change `Set` to `LinkedHashSet` and `Map` to `LinkedHashMap` for this issue, since the `FileStatus`es are now sorted explicitly. However, I left those changes in because I thought it is better to keep the files in the order in which they are retrieved anyway. If that looks odd, I can revert them. Secondly, in any case, the columns of the lexicographically first file show first, which might matter for file names that start with or contain numeric values. However, I left this as it is because the order is deterministic either way.
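The remark about numeric file names can be illustrated with a small sketch. The names below are hypothetical and chosen only to show how a plain string sort differs from a numeric one; they are not taken from the patch or its tests.

```scala
object LexicographicOrder {
  // Hypothetical file names with a numeric suffix.
  val names: Seq[String] = Seq("part-2", "part-10", "part-1")

  // A plain string sort compares character by character, so "part-10"
  // comes before "part-2".
  val lexicographic: Seq[String] = names.sorted

  // Sorting by the parsed numeric suffix gives the order a reader might expect.
  val numeric: Seq[String] = names.sortBy(_.stripPrefix("part-").toInt)
}
```

Either ordering is deterministic, which is the property the fix needs; the lexicographic one just may not match the intuitive "first" file.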
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-155019607 Merged build started.
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-155019577 Merged build triggered.
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/9517#discussion_r44247117

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala ---
@@ -461,13 +461,29 @@ private[sql] class ParquetRelation(
           // You should enable this configuration only if you are very sure that for the parquet
           // part-files to read there are corresponding summary files containing correct schema.
 
+          // As filed in SPARK-11500, the order of files to touch is a matter, which might affect
+          // the ordering of the output columns. There are several things to mention here.
+          //
+          //  1. If mergeRespectSummaries config is false, then it merges schemas by reducing from
+          //     the first part-file so that the columns of the first file show first.
+          //
+          //  2. If mergeRespectSummaries config is true, then there should be, at least,
+          //     "_metadata"s for all given files. So, we can ensure the columns of the first file
+          //     show first.
+          //
+          //  3. If shouldMergeSchemas is false, but when multiple files are given, there is
+          //     no guarantee of the output order, since there might not be a summary file for the
+          //     first file, which ends up putting ahead the columns of the other files. However,
+          //     this should be okay since not enabling shouldMergeSchemas means (assumes) all the
+          //     files have the same schemas.
+
           val needMerged: Seq[FileStatus] =
             if (mergeRespectSummaries) {
               Seq()
             } else {
               dataStatuses
             }
-          (metadataStatuses ++ commonMetadataStatuses ++ needMerged).toSeq
+          needMerged ++ metadataStatuses ++ commonMetadataStatuses
--- End diff --

Yes, I think I should sort them. Judging from [this link](http://lucene.472066.n3.nabble.com/FileSystem-contract-of-listStatus-td3475540.html), relying on the order as returned does not seem to be recommended, even though the results look sorted.
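To see why the position of the part-files in `filesToTouch` affects the column order, here is a minimal sketch of a left-to-right schema merge. It is an illustration under simplifying assumptions, not Spark's implementation: a schema is reduced to a list of column names, and `merge`/`mergeAll` are hypothetical helpers.

```scala
object MergeOrder {
  // Assumption: a schema is modelled as just its ordered column names.
  type Schema = Seq[String]

  // Keep the left schema's columns in place and append unseen columns from the right.
  def merge(left: Schema, right: Schema): Schema =
    left ++ right.filterNot(c => left.contains(c))

  // Reduce from the left, so the first schema's columns always come first.
  def mergeAll(schemas: Seq[Schema]): Schema = schemas.reduceLeft(merge)

  // Column names borrowed from the test in this pull request.
  val partFiles: Seq[Schema] = Seq(Seq("a", "b"), Seq("c", "b"), Seq("d", "b"))
}
```

With this model, `mergeAll(partFiles)` and `mergeAll(partFiles.reverse)` contain the same columns in a different order, which is why the input has to be put into a fixed order before reducing.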
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/9517#discussion_r44247087

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/ParquetHadoopFsRelationSuite.scala ---
@@ -155,4 +155,22 @@ class ParquetHadoopFsRelationSuite extends HadoopFsRelationTest {
       assert(physicalPlan.collect { case p: execution.Filter => p }.length === 1)
     }
   }
+
+  test("SPARK-11500: Not deterministic order of columns when using merging schemas.") {
+    import testImplicits._
+    withSQLConf(SQLConf.PARQUET_SCHEMA_MERGING_ENABLED.key -> "true") {
+      withTempPath { dir =>
+        val pathOne = s"${dir.getCanonicalPath}/table1"
+        Seq(1, 1).zipWithIndex.toDF("a", "b").write.parquet(pathOne)
+        val pathTwo = s"${dir.getCanonicalPath}/table2"
+        Seq(1, 1).zipWithIndex.toDF("c", "b").write.parquet(pathTwo)
+        val pathThree = s"${dir.getCanonicalPath}/table3"
+        Seq(1, 1).zipWithIndex.toDF("d", "b").write.parquet(pathThree)
--- End diff --

Thanks for the comments!
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/9517#discussion_r44245271

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala ---
@@ -461,13 +461,29 @@ private[sql] class ParquetRelation(
           // You should enable this configuration only if you are very sure that for the parquet
           // part-files to read there are corresponding summary files containing correct schema.
 
+          // As filed in SPARK-11500, the order of files to touch is a matter, which might affect
+          // the ordering of the output columns. There are several things to mention here.
+          //
+          //  1. If mergeRespectSummaries config is false, then it merges schemas by reducing from
+          //     the first part-file so that the columns of the first file show first.
+          //
+          //  2. If mergeRespectSummaries config is true, then there should be, at least,
+          //     "_metadata"s for all given files. So, we can ensure the columns of the first file
+          //     show first.
+          //
+          //  3. If shouldMergeSchemas is false, but when multiple files are given, there is
+          //     no guarantee of the output order, since there might not be a summary file for the
+          //     first file, which ends up putting ahead the columns of the other files. However,
+          //     this should be okay since not enabling shouldMergeSchemas means (assumes) all the
+          //     files have the same schemas.
+
           val needMerged: Seq[FileStatus] =
             if (mergeRespectSummaries) {
               Seq()
             } else {
               dataStatuses
             }
-          (metadataStatuses ++ commonMetadataStatuses ++ needMerged).toSeq
+          needMerged ++ metadataStatuses ++ commonMetadataStatuses
--- End diff --

Does HDFS guarantee that the result of `listStatus()` is always sorted? If not, we probably need to sort these `FileStatus`es.
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/9517#discussion_r44244350

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/ParquetHadoopFsRelationSuite.scala ---
@@ -155,4 +155,22 @@ class ParquetHadoopFsRelationSuite extends HadoopFsRelationTest {
       assert(physicalPlan.collect { case p: execution.Filter => p }.length === 1)
     }
   }
+
+  test("SPARK-11500: Not deterministic order of columns when using merging schemas.") {
+    import testImplicits._
+    withSQLConf(SQLConf.PARQUET_SCHEMA_MERGING_ENABLED.key -> "true") {
+      withTempPath { dir =>
+        val pathOne = s"${dir.getCanonicalPath}/table1"
+        Seq(1, 1).zipWithIndex.toDF("a", "b").write.parquet(pathOne)
+        val pathTwo = s"${dir.getCanonicalPath}/table2"
+        Seq(1, 1).zipWithIndex.toDF("c", "b").write.parquet(pathTwo)
+        val pathThree = s"${dir.getCanonicalPath}/table3"
+        Seq(1, 1).zipWithIndex.toDF("d", "b").write.parquet(pathThree)
--- End diff --

We should probably use a partitioned table here. Directories like `base/table1`, `base/table2`, and `base/table3` are not valid partition directory names, and loading `base` as a Parquet file should throw an exception. It's not expected that this test case can pass.
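The point about directory names can be sketched as follows. Partition directories follow a `column=value` naming pattern; the helper below is a deliberately simplified, hypothetical check written for illustration and is not Spark's partition discovery code.

```scala
object PartitionDirs {
  // Simplified assumption: a partition directory looks like "<column>=<value>"
  // with both parts non-empty. Anything else is treated as not a partition directory.
  def parse(dirName: String): Option[(String, String)] =
    dirName.split("=", 2) match {
      case Array(column, value) if column.nonEmpty && value.nonEmpty =>
        Some(column -> value)
      case _ => None
    }
}
```

Under this check, names such as `table1` are rejected while a name such as `p=1` (a made-up example) is accepted, which is the distinction the review comment draws.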
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-154906491 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-154906427 **[Test build #45324 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45324/consoleFull)** for PR 9517 at commit [`bcf72d3`](https://github.com/apache/spark/commit/bcf72d3ca308f9a69993803d9c8939696c915b07). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-154891629 **[Test build #45324 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45324/consoleFull)** for PR 9517 at commit [`bcf72d3`](https://github.com/apache/spark/commit/bcf72d3ca308f9a69993803d9c8939696c915b07).
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-154891106 Merged build triggered.
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-154891112 Merged build started.
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-154890873 retest this please
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-154531328 Build finished. No test results found.
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-154531343 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45235/
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-154503748 Build finished. No test results found.
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-154503753 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45233/
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-154502425 **[Test build #45235 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45235/consoleFull)** for PR 9517 at commit [`bcf72d3`](https://github.com/apache/spark/commit/bcf72d3ca308f9a69993803d9c8939696c915b07).
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-154502199 Build started sha1 is merged.
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-154502163 Build triggered. sha1 is merged.
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-154501691 add to whitelist
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-154499906 Build triggered. sha1 is merged.
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-154499940 Build started sha1 is merged.
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-154498818 ok to test
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/9517

[SPARK-11500][SQL] Not deterministic order of columns when using merging schemas.

https://issues.apache.org/jira/browse/SPARK-11500

As filed in SPARK-11500, when schema merging is enabled, the order in which files are touched can affect the order of the output columns.

This was mostly caused by the use of `Set` and `Map`, so I replaced them with `LinkedHashSet` and `LinkedHashMap` to keep the insertion order.

Also, the reduce order is now fixed to left-to-right, and the order of `filesToTouch` is changed from `metadataStatuses ++ commonMetadataStatuses ++ needMerged` to `needMerged ++ metadataStatuses ++ commonMetadataStatuses`, so that the part-files are touched first. The part-files always have the schema in their footers, whereas the other files might not exist.

One nit: if schema merging is not enabled and multiple files are given, there is no guarantee of the output order, because there might not be a summary file for the first file, which ends up putting the columns of the other files ahead. However, I thought this should be okay, since disabling schema merging means (assumes) that all the files have the same schema.

In addition, the test code for this change only checks the names of the fields.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-11500

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9517.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #9517

commit b0e6ce2729f584a9f95996707f60eb650c2a58b9
Author: hyukjinkwon
Date: 2015-11-06T07:38:26Z

    [SPARK-11500][SQL] Not deterministic order of columns when using merging schemas.

commit 08fc91ca8d21902677e78f0adb3b36769f2cba51
Author: hyukjinkwon
Date: 2015-11-06T07:38:55Z

    [SPARK-11500][SQL] Add a test to check the deterministic order.

commit bcf72d3ca308f9a69993803d9c8939696c915b07
Author: hyukjinkwon
Date: 2015-11-06T07:40:17Z

    [SPARK-11500][SQL] Remove trailing newline.
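The idea behind the change described in this pull request can be illustrated with a small sketch. It is written in Python rather than the Scala of the actual patch, and it is not the Spark implementation: `merge_field_names` and the sample field lists are hypothetical names invented for the illustration. It relies on the fact that a Python `dict` keeps insertion order, which plays the role that `LinkedHashSet` and `LinkedHashMap` play in the patch, and it lists the part-files first, mirroring the `needMerged ++ metadataStatuses ++ commonMetadataStatuses` order.

```python
from typing import Dict, Iterable, List


def merge_field_names(schemas: Iterable[List[str]]) -> List[str]:
    """Merge field names left to right, keeping first-seen order.

    A dict preserves insertion order (guaranteed since Python 3.7), so
    the result depends only on the order of the input schemas. A plain
    set of strings offers no such guarantee.
    """
    merged: Dict[str, None] = {}
    for fields in schemas:
        for name in fields:
            # setdefault keeps the position of the first occurrence.
            merged.setdefault(name, None)
    return list(merged)


# Hypothetical field lists standing in for schemas read from file footers.
part_files = [["a", "b"], ["b", "c"]]
summary_files = [["c", "d"]]

# Part-files go first, so their columns lead the merged schema.
columns = merge_field_names(part_files + summary_files)
print(columns)
```

With these inputs the merged order is `a`, `b`, `c`, `d` on every run. Passing the summary files first would put `c` and `d` ahead, which is why the order of the file list matters as much as the choice of collection.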
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-154338582 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/9517#issuecomment-154338571 cc @liancheng