I tried the query using the new implementation (DRILL-3200) and it's much faster: 14 seconds compared to 523 seconds with the current implementation. I didn't check the results, though.
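One cheap way to sanity-check the results without pulling a million rows to the client would be to wrap the window query in an outer aggregate and compare the (count, checksum) pair across the two builds. The sketch below is only an illustration; the JDBC URL and the nested-aggregate form of the query are my own assumptions, not something from this thread:

{code}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class VerifyWindowResults {
    public static void main(String[] args) throws Exception {
        // Assumption: ZooKeeper-based Drill JDBC URL; adjust for your cluster.
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=localhost:2181");
             Statement stmt = conn.createStatement()) {
            // Wrap the window query in an outer aggregate so only one row comes back.
            ResultSet rs = stmt.executeQuery(
                "select count(*) as cnt, sum(w) as chk from ("
                + " select count(*) over(partition by cast(columns[1] as varchar(25))"
                + " order by cast(columns[0] as bigint)) as w"
                + " from `manyDuplicates.csv`) t");
            rs.next();
            // Run once against each build: matching (cnt, chk) pairs are a cheap sanity check.
            System.out.println("cnt=" + rs.getLong("cnt") + ", chk=" + rs.getLong("chk"));
        }
    }
}
{code}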
On Tue, Jun 9, 2015 at 11:30 PM, Khurram Faraaz <[email protected]> wrote:

> JIRA 3269 is opened to track this behavior.
>
> I tried to iterate over the ResultSet from a JDBC program; I only iterated
> over the results while there were records, and no results were
> processed/printed. It still took close to nine minutes to complete
> execution.
>
> Here is a snippet of what I did from JDBC.
>
> String query = "select count(*) over(partition by cast(columns[1] as
>     varchar(25)) order by cast(columns[0] as bigint)) from
>     `manyDuplicates.csv`";
>
> ResultSet rs = stmt.executeQuery(query);
>
> while (rs.next()) {
>     System.out.println("1");
> }
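For reference, a self-contained version of that timing loop might look like the sketch below. The connection URL, driver setup, and class name are assumptions on my part, and it deliberately avoids any per-row printing so the measured time reflects query execution and fetch rather than console output:

{code}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TimeWindowQuery {
    public static void main(String[] args) throws Exception {
        String query = "select count(*) over(partition by cast(columns[1] as varchar(25))"
            + " order by cast(columns[0] as bigint)) from `manyDuplicates.csv`";
        // Assumption: ZooKeeper-based Drill JDBC URL; adjust for your cluster.
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=localhost:2181");
             Statement stmt = conn.createStatement()) {
            long start = System.nanoTime();
            ResultSet rs = stmt.executeQuery(query);
            long rows = 0;
            while (rs.next()) {
                rows++;  // drain the result set without printing anything per row
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(rows + " rows in " + elapsedMs + " ms");
        }
    }
}
{code}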
> On Tue, Jun 9, 2015 at 9:56 PM, Steven Phillips <[email protected]> wrote:
>
> > In cases like this, where you are printing millions of records in SQLLINE,
> > you should pipe the output to /dev/null or to a file, and measure the
> > performance that way. I'm guessing that most of the time in this case is
> > spent printing the output to the console, and thus really unrelated to
> > Drill performance. If piping the data to a file or /dev/null causes the
> > query to run much faster, then it probably isn't a real issue.
> >
> > Also, any time you are investigating a performance-related issue, you
> > should always check the profile. In this case, I suspect you might see
> > that most of the time is spent in the WAIT time of the SCREEN operator.
> > That would indicate that client-side processing is slowing the query down.
> >
> > On Tue, Jun 9, 2015 at 7:09 PM, Abdel Hakim Deneche <[email protected]> wrote:
> >
> > > Please open a JIRA issue, and please provide the test file (compressed)
> > > or a script to generate similar data.
> > >
> > > Thanks!
> > >
> > > On Tue, Jun 9, 2015 at 6:55 PM, Khurram Faraaz <[email protected]> wrote:
> > >
> > > > A query that uses window functions takes too long to complete and
> > > > return results. It returns close to a million records, for which it
> > > > took 533.8 seconds (~8 minutes). The input CSV file has two columns,
> > > > one integer and the other a varchar type column. Please let me know if
> > > > this needs to be investigated, and I can report a JIRA to track it if
> > > > required.
> > > >
> > > > Size of the input CSV file:
> > > >
> > > > [root@centos-01 ~]# hadoop fs -ls /tmp/manyDuplicates.csv
> > > > -rwxr-xr-x   3 root root   27889455 2015-06-10 01:26 /tmp/manyDuplicates.csv
> > > >
> > > > {code}
> > > > select count(*) over(partition by cast(columns[1] as varchar(25)) order by
> > > > cast(columns[0] as bigint)) from `manyDuplicates.csv`;
> > > > ...
> > > > 1,000,007 rows selected (533.857 seconds)
> > > > {code}
> > > >
> > > > There are five distinct values in columns[1] in the CSV file, i.e. five
> > > > partitions.
> > > >
> > > > {code}
> > > > 0: jdbc:drill:schema=dfs.tmp> select distinct columns[1] from `manyDuplicates.csv`;
> > > > +-----------------------+
> > > > |        EXPR$0         |
> > > > +-----------------------+
> > > > | FFFFGGGGHHHHIIIIJJJJ  |
> > > > | PPPPQQQQRRRRSSSSTTTT  |
> > > > | AAAABBBBCCCCDDDDEEEE  |
> > > > | UUUUVVVVWWWWXXXXZZZZ  |
> > > > | KKKKLLLLMMMMNNNNOOOO  |
> > > > +-----------------------+
> > > > 5 rows selected (1.906 seconds)
> > > > {code}
> > > >
> > > > Here is the count for each of those values in columns[1]:
> > > >
> > > > {code}
> > > > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'FFFFGGGGHHHHIIIIJJJJ';
> > > > +---------+
> > > > | EXPR$0  |
> > > > +---------+
> > > > | 200484  |
> > > > +---------+
> > > > 1 row selected (0.961 seconds)
> > > > {code}
> > > >
> > > > {code}
> > > > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'PPPPQQQQRRRRSSSSTTTT';
> > > > +---------+
> > > > | EXPR$0  |
> > > > +---------+
> > > > | 199353  |
> > > > +---------+
> > > > 1 row selected (0.86 seconds)
> > > > {code}
> > > >
> > > > {code}
> > > > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'AAAABBBBCCCCDDDDEEEE';
> > > > +---------+
> > > > | EXPR$0  |
> > > > +---------+
> > > > | 200702  |
> > > > +---------+
> > > > 1 row selected (0.826 seconds)
> > > > {code}
> > > >
> > > > {code}
> > > > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'UUUUVVVVWWWWXXXXZZZZ';
> > > > +---------+
> > > > | EXPR$0  |
> > > > +---------+
> > > > | 199916  |
> > > > +---------+
> > > > 1 row selected (0.851 seconds)
> > > > {code}
> > > >
> > > > {code}
> > > > 0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from `manyDuplicates.csv` where columns[1] = 'KKKKLLLLMMMMNNNNOOOO';
> > > > +---------+
> > > > | EXPR$0  |
> > > > +---------+
> > > > | 199552  |
> > > > +---------+
> > > > 1 row selected (0.827 seconds)
> > > > {code}
> > > >
> > > > Thanks,
> > > > Khurram
> > >
> > > --
> > > Abdelhakim Deneche
> > > Software Engineer
> > > <http://www.mapr.com/>
> > >
> > > Now Available - Free Hadoop On-Demand Training
> > > <http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>
> >
> > --
> > Steven Phillips
> > Software Engineer
> >
> > mapr.com

--
Abdelhakim Deneche
Software Engineer
<http://www.mapr.com/>

Now Available - Free Hadoop On-Demand Training
<http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>
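Regarding the earlier request for a script to generate similar data: a sketch along the following lines should produce a CSV of the same shape described above (two columns, five distinct values in the second column, roughly 200K rows per value, ~1,000,007 rows in total). The output path, the numeric range of the first column, and the even key distribution are assumptions:

{code}
import java.io.PrintWriter;
import java.util.Random;

public class GenerateManyDuplicates {
    public static void main(String[] args) throws Exception {
        // The five partition keys and the ~1,000,007 row count come from the thread;
        // everything else (output path, value ranges) is an assumption.
        String[] keys = {
            "AAAABBBBCCCCDDDDEEEE", "FFFFGGGGHHHHIIIIJJJJ", "KKKKLLLLMMMMNNNNOOOO",
            "PPPPQQQQRRRRSSSSTTTT", "UUUUVVVVWWWWXXXXZZZZ"
        };
        Random rand = new Random(42);
        try (PrintWriter out = new PrintWriter("manyDuplicates.csv")) {
            for (int i = 0; i < 1_000_007; i++) {
                long col0 = rand.nextInt(1_000_000);           // bigint-castable first column
                String col1 = keys[rand.nextInt(keys.length)]; // roughly even key distribution
                out.println(col0 + "," + col1);
            }
        }
    }
}
{code}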
