please open a JIRA issue. please provide the test file (compressed) or a
script to generate similar data.

Thanks!

On Tue, Jun 9, 2015 at 6:55 PM, Khurram Faraaz <kfar...@maprtech.com> wrote:

> Query that uses window functions takes too long to complete and return
> results. It returns close to a million records, for which it took 533.8
> seconds ~8 minutes
> Input CSV file has two columns, one integer and another varchar type
> column. Please let me know if this needs to be investigated and I can
> report a JIRA to track this if required ?
>
> Size of the input CSV file
>
> root@centos-01 ~]# hadoop fs -ls /tmp/manyDuplicates.csv
>
> -rwxr-xr-x   3 root root   27889455 2015-06-10 01:26
> /tmp/manyDuplicates.csv
>
> {code}
>
> select count(*) over(partition by cast(columns[1] as varchar(25)) order by
> cast(columns[0] as bigint)) from `manyDuplicates.csv`;
>
> ...
>
> 1,000,007 rows selected (533.857 seconds)
> {code}
>
> There are five distinct values in columns[1] in the CSV file. = [FIVE
> PARTITIONS]
>
> {code}
>
> 0: jdbc:drill:schema=dfs.tmp> select distinct columns[1] from
> `manyDuplicates.csv`;
>
> *+-----------------------+*
>
> *| **       EXPR$0        ** |*
>
> *+-----------------------+*
>
> *| *FFFFGGGGHHHHIIIIJJJJ * |*
>
> *| *PPPPQQQQRRRRSSSSTTTT * |*
>
> *| *AAAABBBBCCCCDDDDEEEE * |*
>
> *| *UUUUVVVVWWWWXXXXZZZZ * |*
>
> *| *KKKKLLLLMMMMNNNNOOOO * |*
>
> *+-----------------------+*
>
> 5 rows selected (1.906 seconds)
> {code}
>
> Here is the count for each of those values in columns[1]
>
> {code}
>
> 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
> `manyDuplicates.csv` where columns[1] = 'FFFFGGGGHHHHIIIIJJJJ';
>
> *+---------+*
>
> *| **EXPR$0 ** |*
>
> *+---------+*
>
> *| *200484 * |*
>
> *+---------+*
>
> 1 row selected (0.961 seconds)
>
> {code}
>
>
> {code}
>
> 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
> `manyDuplicates.csv` where columns[1] = 'PPPPQQQQRRRRSSSSTTTT';
>
> *+---------+*
>
> *| **EXPR$0 ** |*
>
> *+---------+*
>
> *| *199353 * |*
>
> *+---------+*
>
> 1 row selected (0.86 seconds)
>
> {code}
>
>
> {code}
>
> 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
> `manyDuplicates.csv` where columns[1] = 'AAAABBBBCCCCDDDDEEEE';
>
> *+---------+*
>
> *| **EXPR$0 ** |*
>
> *+---------+*
>
> *| *200702 * |*
>
> *+---------+*
>
> 1 row selected (0.826 seconds)
>
> {code}
>
>
> {code}
>
> 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
> `manyDuplicates.csv` where columns[1] = 'UUUUVVVVWWWWXXXXZZZZ';
>
> *+---------+*
>
> *| **EXPR$0 ** |*
>
> *+---------+*
>
> *| *199916 * |*
>
> *+---------+*
>
> 1 row selected (0.851 seconds)
>
> {code}
>
>
> {code}
>
> 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
> `manyDuplicates.csv` where columns[1] = 'KKKKLLLLMMMMNNNNOOOO';
>
> *+---------+*
>
> *| **EXPR$0 ** |*
>
> *+---------+*
>
> *| *199552 * |*
>
> *+---------+*
>
> 1 row selected (0.827 seconds)
> {code}
>
> Thanks,
> Khurram
>



-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>


Now Available - Free Hadoop On-Demand Training
<http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>

Reply via email to