Query that uses window functions takes too long to complete and return results. It returns close to a million records, for which it took 533.8 seconds ~8 minutes Input CSV file has two columns, one integer and another varchar type column. Please let me know if this needs to be investigated and I can report a JIRA to track this if required ?
Size of the input CSV file root@centos-01 ~]# hadoop fs -ls /tmp/manyDuplicates.csv -rwxr-xr-x 3 root root 27889455 2015-06-10 01:26 /tmp/manyDuplicates.csv {code} select count(*) over(partition by cast(columns[1] as varchar(25)) order by cast(columns[0] as bigint)) from `manyDuplicates.csv`; ... 1,000,007 rows selected (533.857 seconds) {code} There are five distinct values in columns[1] in the CSV file. = [FIVE PARTITIONS] {code} 0: jdbc:drill:schema=dfs.tmp> select distinct columns[1] from `manyDuplicates.csv`; *+-----------------------+* *| ** EXPR$0 ** |* *+-----------------------+* *| *FFFFGGGGHHHHIIIIJJJJ * |* *| *PPPPQQQQRRRRSSSSTTTT * |* *| *AAAABBBBCCCCDDDDEEEE * |* *| *UUUUVVVVWWWWXXXXZZZZ * |* *| *KKKKLLLLMMMMNNNNOOOO * |* *+-----------------------+* 5 rows selected (1.906 seconds) {code} Here is the count for each of those values in columns[1] {code} 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'FFFFGGGGHHHHIIIIJJJJ'; *+---------+* *| **EXPR$0 ** |* *+---------+* *| *200484 * |* *+---------+* 1 row selected (0.961 seconds) {code} {code} 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'PPPPQQQQRRRRSSSSTTTT'; *+---------+* *| **EXPR$0 ** |* *+---------+* *| *199353 * |* *+---------+* 1 row selected (0.86 seconds) {code} {code} 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'AAAABBBBCCCCDDDDEEEE'; *+---------+* *| **EXPR$0 ** |* *+---------+* *| *200702 * |* *+---------+* 1 row selected (0.826 seconds) {code} {code} 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'UUUUVVVVWWWWXXXXZZZZ'; *+---------+* *| **EXPR$0 ** |* *+---------+* *| *199916 * |* *+---------+* 1 row selected (0.851 seconds) {code} {code} 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'KKKKLLLLMMMMNNNNOOOO'; *+---------+* *| **EXPR$0 ** |* *+---------+* *| *199552 * |* *+---------+* 1 row selected (0.827 seconds) {code} Thanks, Khurram