Query that uses window functions takes too long to complete and return
results. It returns close to a million records, for which it took 533.8
seconds ~8 minutes
Input CSV file has two columns, one integer and another varchar type
column. Please let me know if this needs to be investigated and I can
report a JIRA to track this if required ?

Size of the input CSV file

root@centos-01 ~]# hadoop fs -ls /tmp/manyDuplicates.csv

-rwxr-xr-x   3 root root   27889455 2015-06-10 01:26 /tmp/manyDuplicates.csv

{code}

select count(*) over(partition by cast(columns[1] as varchar(25)) order by
cast(columns[0] as bigint)) from `manyDuplicates.csv`;

...

1,000,007 rows selected (533.857 seconds)
{code}

There are five distinct values in columns[1] in the CSV file. = [FIVE
PARTITIONS]

{code}

0: jdbc:drill:schema=dfs.tmp> select distinct columns[1] from
`manyDuplicates.csv`;

*+-----------------------+*

*| **       EXPR$0        ** |*

*+-----------------------+*

*| *FFFFGGGGHHHHIIIIJJJJ * |*

*| *PPPPQQQQRRRRSSSSTTTT * |*

*| *AAAABBBBCCCCDDDDEEEE * |*

*| *UUUUVVVVWWWWXXXXZZZZ * |*

*| *KKKKLLLLMMMMNNNNOOOO * |*

*+-----------------------+*

5 rows selected (1.906 seconds)
{code}

Here is the count for each of those values in columns[1]

{code}

0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
`manyDuplicates.csv` where columns[1] = 'FFFFGGGGHHHHIIIIJJJJ';

*+---------+*

*| **EXPR$0 ** |*

*+---------+*

*| *200484 * |*

*+---------+*

1 row selected (0.961 seconds)

{code}


{code}

0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
`manyDuplicates.csv` where columns[1] = 'PPPPQQQQRRRRSSSSTTTT';

*+---------+*

*| **EXPR$0 ** |*

*+---------+*

*| *199353 * |*

*+---------+*

1 row selected (0.86 seconds)

{code}


{code}

0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
`manyDuplicates.csv` where columns[1] = 'AAAABBBBCCCCDDDDEEEE';

*+---------+*

*| **EXPR$0 ** |*

*+---------+*

*| *200702 * |*

*+---------+*

1 row selected (0.826 seconds)

{code}


{code}

0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
`manyDuplicates.csv` where columns[1] = 'UUUUVVVVWWWWXXXXZZZZ';

*+---------+*

*| **EXPR$0 ** |*

*+---------+*

*| *199916 * |*

*+---------+*

1 row selected (0.851 seconds)

{code}


{code}

0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
`manyDuplicates.csv` where columns[1] = 'KKKKLLLLMMMMNNNNOOOO';

*+---------+*

*| **EXPR$0 ** |*

*+---------+*

*| *199552 * |*

*+---------+*

1 row selected (0.827 seconds)
{code}

Thanks,
Khurram

Reply via email to