I tried the query using the new implementation (DRILL-3200) and it's much faster: 14 seconds compared to 523 seconds with the current implementation. I didn't check the results, though.
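One cheap way to sanity-check the results without pulling a million rows to the client would be to wrap the window query in an outer aggregate and compare the (count, checksum) pair across the two builds. The sketch below is only an illustration; the JDBC URL and the nested-aggregate form of the query are my own assumptions, not something from this thread:

{code}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class VerifyWindowResults {
    public static void main(String[] args) throws Exception {
        // Assumption: ZooKeeper-based Drill JDBC URL; adjust for your cluster.
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=localhost:2181");
             Statement stmt = conn.createStatement()) {
            // Wrap the window query in an outer aggregate so only one row comes back.
            ResultSet rs = stmt.executeQuery(
                "select count(*) as cnt, sum(w) as chk from ("
                + " select count(*) over(partition by cast(columns[1] as varchar(25))"
                + " order by cast(columns[0] as bigint)) as w"
                + " from `manyDuplicates.csv`) t");
            rs.next();
            // Run once against each build: matching (cnt, chk) pairs are a cheap sanity check.
            System.out.println("cnt=" + rs.getLong("cnt") + ", chk=" + rs.getLong("chk"));
        }
    }
}
{code}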
On Tue, Jun 9, 2015 at 11:30 PM, Khurram Faraaz <[email protected]> wrote:

> JIRA 3269 is opened to track this behavior.
>
> I tried to iterate over the ResultSet from a JDBC program; I only iterated
> over the results while there were records, and no results were
> processed/printed. It still took close to nine minutes to complete
> execution.
>
> Here is a snippet of what I did from JDBC.
>
> String query = "select count(*) over(partition by cast(columns[1] as
>     varchar(25)) order by cast(columns[0] as bigint)) from
>     `manyDuplicates.csv`";
>
> ResultSet rs = stmt.executeQuery(query);
>
> while (rs.next()) {
>     System.out.println("1");
> }
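For reference, a self-contained version of that timing loop might look like the sketch below. The connection URL, driver setup, and class name are assumptions on my part, and it deliberately avoids any per-row printing so the measured time reflects query execution and fetch rather than console output:

{code}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TimeWindowQuery {
    public static void main(String[] args) throws Exception {
        String query = "select count(*) over(partition by cast(columns[1] as varchar(25))"
            + " order by cast(columns[0] as bigint)) from `manyDuplicates.csv`";
        // Assumption: ZooKeeper-based Drill JDBC URL; adjust for your cluster.
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=localhost:2181");
             Statement stmt = conn.createStatement()) {
            long start = System.nanoTime();
            ResultSet rs = stmt.executeQuery(query);
            long rows = 0;
            while (rs.next()) {
                rows++;  // drain the result set without printing anything per row
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(rows + " rows in " + elapsedMs + " ms");
        }
    }
}
{code}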
> On Tue, Jun 9, 2015 at 9:56 PM, Steven Phillips <[email protected]> wrote:
>
> > In cases like this, where you are printing millions of records in SQLLINE,
> > you should pipe the output to /dev/null or to a file, and measure the
> > performance that way. I'm guessing that most of the time in this case is
> > spent printing the output to the console, and thus really unrelated to
> > Drill performance. If piping the data to a file or /dev/null causes the
> > query to run much faster, then it probably isn't a real issue.
> >
> > Also, any time you are investigating a performance-related issue, you
> > should always check the profile. In this case, I suspect you might see
> > that most of the time is spent in the WAIT time of the SCREEN operator.
> > That would indicate that client-side processing is slowing the query down.
> >
> > On Tue, Jun 9, 2015 at 7:09 PM, Abdel Hakim Deneche <[email protected]> wrote:
> >
> > > Please open a JIRA issue, and please provide the test file (compressed)
> > > or a script to generate similar data.
> > >
> > > Thanks!
> > >
> > > On Tue, Jun 9, 2015 at 6:55 PM, Khurram Faraaz <[email protected]> wrote:
> > >
> > > > A query that uses window functions takes too long to complete and
> > > > return results. It returns close to a million records, for which it
> > > > took 533.8 seconds (~8 minutes). The input CSV file has two columns,
> > > > one integer and the other a varchar type column. Please let me know if
> > > > this needs to be investigated, and I can report a JIRA to track it if
> > > > required.
> > > >
> > > > Size of the input CSV file:
> > > >
> > > > [root@centos-01 ~]# hadoop fs -ls /tmp/manyDuplicates.csv
> > > > -rwxr-xr-x   3 root root   27889455 2015-06-10 01:26 /tmp/manyDuplicates.csv
> > > >
> > > > {code}
> > > > select count(*) over(partition by cast(columns[1] as varchar(25)) order by
> > > > cast(columns[0] as bigint)) from `manyDuplicates.csv`;
> > > > ...
> > > > 1,000,007 rows selected (533.857 seconds)
> > > > {code}
> > > >
> > > > There are five distinct values in columns[1] in the CSV file, i.e. five
> > > > partitions.
> > > >
> > > > {code}
> > > > 0: jdbc:drill:schema=dfs.tmp> select distinct columns[1] from `manyDuplicates.csv`;
> > > > +-----------------------+
> > > > |        EXPR$0         |
> > > > +-----------------------+
> > > > | FFFFGGGGHHHHIIIIJJJJ  |
> > > > | PPPPQQQQRRRRSSSSTTTT  |
> > > > | AAAABBBBCCCCDDDDEEEE  |
> > > > | UUUUVVVVWWWWXXXXZZZZ  |
> > > > | KKKKLLLLMMMMNNNNOOOO  |
> > > > +-----------------------+
> > > > 5 rows selected (1.906 seconds)
> > > > {code}
> > > >
> > > > Here is the count for each of those values in columns[1]:
> > > >
> > > > {code}
> > > > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'FFFFGGGGHHHHIIIIJJJJ';
> > > > +---------+
> > > > | EXPR$0  |
> > > > +---------+
> > > > | 200484  |
> > > > +---------+
> > > > 1 row selected (0.961 seconds)
> > > > {code}
> > > >
> > > > {code}
> > > > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'PPPPQQQQRRRRSSSSTTTT';
> > > > +---------+
> > > > | EXPR$0  |
> > > > +---------+
> > > > | 199353  |
> > > > +---------+
> > > > 1 row selected (0.86 seconds)
> > > > {code}
> > > >
> > > > {code}
> > > > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'AAAABBBBCCCCDDDDEEEE';
> > > > +---------+
> > > > | EXPR$0  |
> > > > +---------+
> > > > | 200702  |
> > > > +---------+
> > > > 1 row selected (0.826 seconds)
> > > > {code}
> > > >
> > > > {code}
> > > > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'UUUUVVVVWWWWXXXXZZZZ';
> > > > +---------+
> > > > | EXPR$0  |
> > > > +---------+
> > > > | 199916  |
> > > > +---------+
> > > > 1 row selected (0.851 seconds)
> > > > {code}
> > > >
> > > > {code}
> > > > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'KKKKLLLLMMMMNNNNOOOO';
> > > > +---------+
> > > > | EXPR$0  |
> > > > +---------+
> > > > | 199552  |
> > > > +---------+
> > > > 1 row selected (0.827 seconds)
> > > > {code}
> > > >
> > > > Thanks,
> > > > Khurram
> > >
> > > --
> > > Abdelhakim Deneche
> > > Software Engineer
> > > <http://www.mapr.com/>
> > >
> > > Now Available - Free Hadoop On-Demand Training
> > > <http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>
> >
> > --
> > Steven Phillips
> > Software Engineer
> >
> > mapr.com

--
Abdelhakim Deneche
Software Engineer
<http://www.mapr.com/>

Now Available - Free Hadoop On-Demand Training
<http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>
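Regarding the earlier request for a script to generate similar data: a sketch along the following lines should produce a CSV of the same shape described above (two columns, five distinct values in the second column, roughly 200K rows per value, ~1,000,007 rows in total). The output path, the numeric range of the first column, and the even key distribution are assumptions:

{code}
import java.io.PrintWriter;
import java.util.Random;

public class GenerateManyDuplicates {
    public static void main(String[] args) throws Exception {
        // The five partition keys and the ~1,000,007 row count come from the thread;
        // everything else (output path, value ranges) is an assumption.
        String[] keys = {
            "AAAABBBBCCCCDDDDEEEE", "FFFFGGGGHHHHIIIIJJJJ", "KKKKLLLLMMMMNNNNOOOO",
            "PPPPQQQQRRRRSSSSTTTT", "UUUUVVVVWWWWXXXXZZZZ"
        };
        Random rand = new Random(42);
        try (PrintWriter out = new PrintWriter("manyDuplicates.csv")) {
            for (int i = 0; i < 1_000_007; i++) {
                long col0 = rand.nextInt(1_000_000);           // bigint-castable first column
                String col1 = keys[rand.nextInt(keys.length)]; // roughly even key distribution
                out.println(col0 + "," + col1);
            }
        }
    }
}
{code}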
