Github user arina-ielchiieva commented on a diff in the pull request:
https://github.com/apache/drill/pull/1071#discussion_r158255284
--- Diff:
exec/java-exec/src/test/java/org/apache/drill/TestUnionDistinct.java ---
@@ -754,4 +756,37 @@ public void testDrill4147_1() throws Exception {
}
}
+ @Test
+ public void testUnionWithManyColumns() throws Exception {
--- End diff --
During union we need to add two result sets and then remove the duplicates.
Depending on the query cost query plans will be different, when number of
columns are not large (approx. less then 40), sorting and streaming aggregate
is chosen:
Query `select columns[0]...columns[5] from dfs.test_1.csv union select
columns[0]...columns[5] from dfs.test_2.csv`:
```
00-00 Screen
00-01 Project
00-02 StreamAgg
00-03 Sort
00-04 UnionAll(all=[true])
00-05 Scan table 2
00-06 Scan table 1
```
When number of columns 40+ plan with hash aggregate is chosen since it is
cheaper.
Query `select columns[0]...columns[1200] from dfs.test_1.csv union select
columns[0]...columns[1200] from dfs.test_2.csv`:
```
00-00 Screen
00-01 Project
00-02 HashAgg
00-03 UnionAll(all=[true])
00-04 Scan table 2
00-05 Scan table 1
```
Also added in Jira more examples of generated code before and after the
change.
---