Github user arina-ielchiieva commented on a diff in the pull request: https://github.com/apache/drill/pull/1071#discussion_r158255284 --- Diff: exec/java-exec/src/test/java/org/apache/drill/TestUnionDistinct.java --- @@ -754,4 +756,37 @@ public void testDrill4147_1() throws Exception { } } + @Test + public void testUnionWithManyColumns() throws Exception { --- End diff -- During union we need to add two result sets and then remove the duplicates. Depending on the query cost query plans will be different, when number of columns are not large (approx. less then 40), sorting and streaming aggregate is chosen: Query `select columns[0]...columns[5] from dfs.test_1.csv union select columns[0]...columns[5] from dfs.test_2.csv`: ``` 00-00 Screen 00-01 Project 00-02 StreamAgg 00-03 Sort 00-04 UnionAll(all=[true]) 00-05 Scan table 2 00-06 Scan table 1 ``` When number of columns 40+ plan with hash aggregate is chosen since it is cheaper. Query `select columns[0]...columns[1200] from dfs.test_1.csv union select columns[0]...columns[1200] from dfs.test_2.csv`: ``` 00-00 Screen 00-01 Project 00-02 HashAgg 00-03 UnionAll(all=[true]) 00-04 Scan table 2 00-05 Scan table 1 ``` Also added in Jira more examples of generated code before and after the change.
---