Please open a JIRA issue. Please also provide the test file (compressed) or a script to generate similar data.
Thanks!

On Tue, Jun 9, 2015 at 6:55 PM, Khurram Faraaz <[email protected]> wrote:

> A query that uses window functions takes too long to complete and return
> results. It returns close to a million records, which took 533.8 seconds
> (roughly 9 minutes).
> The input CSV file has two columns, one integer and one varchar type
> column. Please let me know if this needs to be investigated; I can report
> a JIRA to track it if required.
>
> Size of the input CSV file (about 28 MB):
>
> [root@centos-01 ~]# hadoop fs -ls /tmp/manyDuplicates.csv
> -rwxr-xr-x 3 root root 27889455 2015-06-10 01:26 /tmp/manyDuplicates.csv
>
> {code}
> select count(*) over(partition by cast(columns[1] as varchar(25)) order by cast(columns[0] as bigint)) from `manyDuplicates.csv`;
> ...
> 1,000,007 rows selected (533.857 seconds)
> {code}
>
> There are five distinct values in columns[1] in the CSV file, i.e. five
> partitions:
>
> {code}
> 0: jdbc:drill:schema=dfs.tmp> select distinct columns[1] from `manyDuplicates.csv`;
> +-----------------------+
> |        EXPR$0         |
> +-----------------------+
> | FFFFGGGGHHHHIIIIJJJJ  |
> | PPPPQQQQRRRRSSSSTTTT  |
> | AAAABBBBCCCCDDDDEEEE  |
> | UUUUVVVVWWWWXXXXZZZZ  |
> | KKKKLLLLMMMMNNNNOOOO  |
> +-----------------------+
> 5 rows selected (1.906 seconds)
> {code}
>
> Here is the count for each of those values in columns[1]:
>
> {code}
> 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'FFFFGGGGHHHHIIIIJJJJ';
> +---------+
> | EXPR$0  |
> +---------+
> | 200484  |
> +---------+
> 1 row selected (0.961 seconds)
> {code}
>
> {code}
> 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'PPPPQQQQRRRRSSSSTTTT';
> +---------+
> | EXPR$0  |
> +---------+
> | 199353  |
> +---------+
> 1 row selected (0.86 seconds)
> {code}
>
> {code}
> 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'AAAABBBBCCCCDDDDEEEE';
> +---------+
> | EXPR$0  |
> +---------+
> | 200702  |
> +---------+
> 1 row selected (0.826 seconds)
> {code}
>
> {code}
> 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'UUUUVVVVWWWWXXXXZZZZ';
> +---------+
> | EXPR$0  |
> +---------+
> | 199916  |
> +---------+
> 1 row selected (0.851 seconds)
> {code}
>
> {code}
> 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'KKKKLLLLMMMMNNNNOOOO';
> +---------+
> | EXPR$0  |
> +---------+
> | 199552  |
> +---------+
> 1 row selected (0.827 seconds)
> {code}
>
> Thanks,
> Khurram

-- 
Abdelhakim Deneche
Software Engineer
http://www.mapr.com/
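
For reference, a minimal sketch of the kind of generator script requested above, assuming Python is acceptable. It reuses the per-value row counts reported in the thread (they sum to 1,000,007); the integer column and its range are assumptions, filled with random values, since the original file is not available.

{code}
#!/usr/bin/env python3
# Sketch of a generator for a CSV similar to manyDuplicates.csv:
# ~1,000,007 rows, columns[0] an integer and columns[1] one of five
# 20-character strings. Per-value counts come from the queries in the
# thread; the integer values/range are assumptions.
import csv
import random

# (value, row count) pairs reported in the thread; they sum to 1,000,007.
COUNTS = [
    ("FFFFGGGGHHHHIIIIJJJJ", 200484),
    ("PPPPQQQQRRRRSSSSTTTT", 199353),
    ("AAAABBBBCCCCDDDDEEEE", 200702),
    ("UUUUVVVVWWWWXXXXZZZZ", 199916),
    ("KKKKLLLLMMMMNNNNOOOO", 199552),
]

def main(path="manyDuplicates.csv", seed=42):
    random.seed(seed)
    # Build the full list of partition keys, then shuffle so duplicates
    # are interleaved rather than grouped together.
    keys = [value for value, count in COUNTS for _ in range(count)]
    random.shuffle(keys)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for key in keys:
            # Integer range is an assumption; it keeps the file size in
            # the same ballpark as the reported ~28 MB.
            writer.writerow([random.randint(0, 999999), key])

if __name__ == "__main__":
    main()
{code}

The resulting file can then be copied into place with `hadoop fs -put manyDuplicates.csv /tmp/` and queried exactly as in the report above.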
