[GitHub] spark issue #17039: [SPARK-19710][SQL][TESTS] Fix ordering of rows in query ...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/17039 LGTM - merging to master! Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17039: [SPARK-19710][SQL][TESTS] Fix ordering of rows in query ...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/17039 The code didn't respect the ordering because the analyzer adds a project on top; this is for cases where you sort by a column that is not in the final projection. We could also pattern match on that, but I don't think we should be adding too much magic. Let me merge this, and we can pick this up if it ever becomes an issue again. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17039: [SPARK-19710][SQL][TESTS] Fix ordering of rows in query ...
Github user robbinspg commented on the issue: https://github.com/apache/spark/pull/17039 ok there were a couple of similar issues such as in-set-operations query 9, group-analytics.sql.out query 21 and 22 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17039: [SPARK-19710][SQL][TESTS] Fix ordering of rows in query ...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/17039 H - apparently the ORDER BY is not the top node in that plan... Lemme check --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17039: [SPARK-19710][SQL][TESTS] Fix ordering of rows in query ...
Github user robbinspg commented on the issue: https://github.com/apache/spark/pull/17039 ok so here is an example of output I'm not sure is correct: in-order-by -- !query 17 SELECT Count(DISTINCT( t1a )), t1b FROM t1 WHERE t1h NOT IN (SELECT t2h FROM t2 where t1a = t2a order by t2d DESC nulls first ) GROUP BY t1a, t1b ORDER BY t1b DESC nulls last -- !query 17 schema struct -- !query 17 output 1 10 1 10 1 16 1 6 1 8 1 NULL That is the "new" output with your change but it doesn't actually match what you'd expect from that query (it isn't t1b DESC) which would be 1 16 1 10 1 10 1 8 1 6 1 NULL --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17039: [SPARK-19710][SQL][TESTS] Fix ordering of rows in query ...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/17039 @robbinspg yeah let's use the approach that I suggested. That at least makes sure we won't have to fix this again. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17039: [SPARK-19710][SQL][TESTS] Fix ordering of rows in query ...
Github user robbinspg commented on the issue: https://github.com/apache/spark/pull/17039 @hvanhovell So I backed out the changes in this PR, implemented your change to SQLQueryTestSuite.getNormalizedResult, regenerated the golden results files and the tests all pass on my x86 and big endian platforms. results files that were changed: sql/core/src/test/resources/sql-tests/results/group-analytics.sql.out sql/core/src/test/resources/sql-tests/results/order-by-nulls-ordering.sql.out sql/core/src/test/resources/sql-tests/results/subquery/in-subquery/in-joins.sql.out sql/core/src/test/resources/sql-tests/results/subquery/in-subquery/in-order-by.sql.out sql/core/src/test/resources/sql-tests/results/subquery/in-subquery/in-set-operations.sql.out sql/core/src/test/resources/sql-tests/results/subquery/in-subquery/not-in-joins.sql.out So, should I abandon this PR and go with your solution? I can submit your change in a PR along with updated results files if you want. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17039: [SPARK-19710][SQL][TESTS] Fix ordering of rows in query ...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/17039 A small follow-up (I got the code to run). We could use the following in `SQLQueryTestSuite.getNormalizedResult`: ```scala val baseDf = session.sql(sql) val (df, isSorted) = baseDf.logicalPlan match { case Sort(ordering, true, child) => val sort = Sort(ordering ++ child.output.map(SortOrder(_, Ascending)), true, child) (Dataset.ofRows(session, sort), true) case _ => (baseDf, false) } val schema = df.schema // Get answer, but also get rid of the #1234 expression ids that show up in explain plans val answer = df.queryExecution.hiveResultString().map(_.replaceAll("#\\d+", "#x")) // If the output is not pre-sorted, sort it. if (isSorted) (schema, answer) else (schema, answer.sorted) ``` If I run this, it also changes results in the following files: - group-analytics.sql.out - in-set-operations.sql.out - order-by-nulls-ordering.sql.out --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17039: [SPARK-19710][SQL][TESTS] Fix ordering of rows in query ...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/17039 I have just taken a look at this. Modifying the sort does not really help. I am fine with this approach. LGTM --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17039: [SPARK-19710][SQL][TESTS] Fix ordering of rows in query ...
Github user robbinspg commented on the issue: https://github.com/apache/spark/pull/17039 I think that the current "order if not currently ordered" in the test suite is good for checking the set of results for unordered queries. If ordered at all then the results should be deterministic given the input data and query are part of the test otherwise it is a bad test. So... I think this PR is the way to go. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17039: [SPARK-19710][SQL][TESTS] Fix ordering of rows in query ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17039 @hvanhovell Is that possible the SQL queries are used to verify the behavior of ORDER BY? Do you think we should explicitly leave a comment to say SQLQueryTestSuite will not be used for this goal? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17039: [SPARK-19710][SQL][TESTS] Fix ordering of rows in query ...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/17039 How about the more pragmatic approach. I think relation algebra only guarantees ordering when an order by is the top level operation. Why not just check that, and if we find one, add all output columns to the order by? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17039: [SPARK-19710][SQL][TESTS] Fix ordering of rows in query ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17039 The major issue is we do not know the original intention of users' query. The query might purposely check whether the result set is sorted or not. Thus, the existing test suite design is conservative to avoid adding any sort as long as users specify the ORDER BY clause. For example, ```SQL SELECT c1, c2, sum(c1) FROM tab1 GROUP BY c1, c2 ORDER BY c1, c2 ``` In the above example, although the order by clause does not contain all the columns, the result set is always sorted. Thus, our test suite should not sort it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17039: [SPARK-19710][SQL][TESTS] Fix ordering of rows in query ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17039 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17039: [SPARK-19710][SQL][TESTS] Fix ordering of rows in query ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17039 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73508/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17039: [SPARK-19710][SQL][TESTS] Fix ordering of rows in query ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17039 **[Test build #73508 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73508/testReport)** for PR 17039 at commit [`4a4d7ad`](https://github.com/apache/spark/commit/4a4d7ad4b349e49dc4cb81235f796e360dd183f8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17039: [SPARK-19710][SQL][TESTS] Fix ordering of rows in query ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17039 **[Test build #73508 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73508/testReport)** for PR 17039 at commit [`4a4d7ad`](https://github.com/apache/spark/commit/4a4d7ad4b349e49dc4cb81235f796e360dd183f8). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17039: [SPARK-19710][SQL][TESTS] Fix ordering of rows in query ...
Github user robbinspg commented on the issue: https://github.com/apache/spark/pull/17039 Jenkins retest please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17039: [SPARK-19710][SQL][TESTS] Fix ordering of rows in query ...
Github user robbinspg commented on the issue: https://github.com/apache/spark/pull/17039 @gatorsmile I'm glad it wasn't just me that found it complex ;-) I've modified the patch to remove an unnecessary change as that query was not ordered and the test suite code handles that case. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17039: [SPARK-19710][SQL][TESTS] Fix ordering of rows in query ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17039 @hvanhovell I made a try and just realized it might be hard to automatically detect whether it is a complete sort and respect the predefined sort order. The order by clauses could be complex expressions and the output attributes of queries might be the alias of complex expressions too. Thus, maybe we still can keep the existing conservative way (i.e., as long as the test query specifies the order-by clauses, we expect the query result have deterministic orders) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17039: [SPARK-19710][SQL][TESTS] Fix ordering of rows in query ...
Github user robbinspg commented on the issue: https://github.com/apache/spark/pull/17039 @hvanhovell @gatorsmile I agree that would be a better solution however I don't know how to achieve that being unfamiliar with this code. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17039: [SPARK-19710][SQL][TESTS] Fix ordering of rows in query ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17039 @hvanhovell Yes! This is caused by partial sort + duplication values on the sorted columns. Since all these result sets are pretty small, maybe another idea is to sort the whole result when we found duplication values on the sorted columns? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17039: [SPARK-19710][SQL][TESTS] Fix ordering of rows in query ...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/17039 @robbinspg `SQLQueryTestQuite` checks if the result is sorted, if it isn't the results get sorted. The problem with this sort check is that it does not check if the output is completely sorted, and thus causing test failures when rows have the same ordering. The current PR fixes issues with the current tests, but it does not guarantee that this won't happen again. Perhaps it is an idea to make a modification to the `SQLQueryTestQuite.getNormalizedResult`, that always does a complete sort (while respecting the predefined sort order)? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17039: [SPARK-19710][SQL][TESTS] Fix ordering of rows in query ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17039 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17039: [SPARK-19710][SQL][TESTS] Fix ordering of rows in query ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17039 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73349/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17039: [SPARK-19710][SQL][TESTS] Fix ordering of rows in query ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17039 **[Test build #73349 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73349/testReport)** for PR 17039 at commit [`950415f`](https://github.com/apache/spark/commit/950415f98c4574532da9dd089e5a9b027d3683d8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17039: [SPARK-19710][SQL][TESTS] Fix ordering of rows in query ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17039 **[Test build #73349 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73349/testReport)** for PR 17039 at commit [`950415f`](https://github.com/apache/spark/commit/950415f98c4574532da9dd089e5a9b027d3683d8). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org