Github user arina-ielchiieva commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1071#discussion_r158255284
  
    --- Diff: 
exec/java-exec/src/test/java/org/apache/drill/TestUnionDistinct.java ---
    @@ -754,4 +756,37 @@ public void testDrill4147_1() throws Exception {
         }
       }
     
    +  @Test
    +  public void testUnionWithManyColumns() throws Exception {
    --- End diff --
    
    During union we need to add two result sets and then remove the duplicates.
    Depending on the query cost query plans will be different, when number of 
columns are not large (approx. less then 40), sorting and streaming aggregate 
is chosen:
    Query `select columns[0]...columns[5] from dfs.test_1.csv union select 
columns[0]...columns[5] from dfs.test_2.csv`:
    ```
    00-00 Screen
    00-01 Project
    00-02 StreamAgg
    00-03 Sort
    00-04 UnionAll(all=[true])
    00-05 Scan table 2
    00-06 Scan table 1
    ```
    When number of columns 40+ plan with hash aggregate is chosen since it is 
cheaper.
    Query `select columns[0]...columns[1200] from dfs.test_1.csv union select 
columns[0]...columns[1200] from dfs.test_2.csv`:
    ```
    00-00 Screen
    00-01 Project
    00-02 HashAgg
    00-03 UnionAll(all=[true])
    00-04 Scan table 2
    00-05 Scan table 1
    ```
    Also added in Jira more examples of generated code before and after the 
change. 


---

Reply via email to