Aman Sinha created DRILL-4365:
---------------------------------

             Summary: Performance with lots of small parquet files
                 Key: DRILL-4365
                 URL: https://issues.apache.org/jira/browse/DRILL-4365
             Project: Apache Drill
          Issue Type: Bug
          Components: Storage - Parquet
    Affects Versions: 1.5.0
            Reporter: Aman Sinha

I am seeing a performance degradation on 1.5.0 compared to 1.4.0 with a query over 968 small parquet files where the total number of rows is only 1000, so just about 1 row per file. The profile shows that the parquet scan is slower. With bigger tables, I haven't seen the same issue yet (although this needs confirmation from the full performance run).

Note: this is with the default slice_target of 100K, so only 1 scan fragment was used. I will attach the dataset to this JIRA if anyone wants to reproduce it.

On 1.4.0 (multiple runs):
{noformat}
0: jdbc:drill:zk=local> select min(ss_item_sk) from dfs.tmp.ss1test ;
+---------+
| EXPR$0  |
+---------+
| 39      |
+---------+
1 row selected (2.544 seconds)

0: jdbc:drill:zk=local> select min(ss_item_sk) from dfs.tmp.ss1test ;
+---------+
| EXPR$0  |
+---------+
| 39      |
+---------+
1 row selected (2.434 seconds)
{noformat}

On 1.5.0 (multiple runs):
{noformat}
0: jdbc:drill:zk=local> select min(ss_item_sk) from dfs.tmp.ss1test ;
+---------+
| EXPR$0  |
+---------+
| 39      |
+---------+
1 row selected (3.851 seconds)

0: jdbc:drill:zk=local> select min(ss_item_sk) from dfs.tmp.ss1test ;
+---------+
| EXPR$0  |
+---------+
| 39      |
+---------+
1 row selected (3.61 seconds)
{noformat}
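
One possible way to check whether the single scan fragment plays a role: lower the planner.slice_target session option so the planner can parallelize the scan, rerun the same query, and then restore the 100K default mentioned above. This is only a sketch. The value 100 is an arbitrary assumption chosen to fall below the 1000-row total, and it has not been tried on this dataset.
{noformat}
-- Assumption: 100 is an illustrative value, not a recommended setting.
ALTER SESSION SET `planner.slice_target` = 100;

-- Same query as in the report; compare the elapsed time and the profile.
select min(ss_item_sk) from dfs.tmp.ss1test ;

-- Restore the default of 100K used in the runs above.
ALTER SESSION SET `planner.slice_target` = 100000;
{noformat}
If the gap between 1.4.0 and 1.5.0 stays the same with more scan fragments, that would suggest the extra time is spent per file rather than in the serial scan itself, though the query profile is still needed to confirm it.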