Rahul Challapalli created DRILL-5604:
----------------------------------------
Summary: Possible performance degradation with hash aggregate when the
number of distinct keys increases
Key: DRILL-5604
URL: https://issues.apache.org/jira/browse/DRILL-5604
Project: Apache Drill
Issue Type: Bug
Components: Execution - Relational Operators
Affects Versions: 1.11.0
Reporter: Rahul Challapalli
git.commit.id.abbrev=90f43bf
I tracked the runtime while gradually increasing the number of distinct keys
and keeping the total number of records constant. Below is one such test on
the TPC-DS SF1000 dataset:
{code}
0: jdbc:drill:zk=10.10.100.190:5181> select count(distinct ss_list_price) from store_sales;
+---------+
| EXPR$0  |
+---------+
| 19736   |
+---------+
1 row selected (163.345 seconds)
0: jdbc:drill:zk=10.10.100.190:5181> select count(distinct ss_net_profit) from store_sales;
+----------+
| EXPR$0   |
+----------+
| 1525675  |
+----------+
1 row selected (2094.962 seconds)
{code}
In both of the above queries, the hash aggregate code processed 2879987999
records, so the time difference must come from overheads such as hash table
resizing. The second query took ~32 minutes longer than the first, which
raises doubts about whether there is an issue somewhere.
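
To put the two runs side by side, here is a small sketch that derives the
ratios from the numbers reported above. The function and variable names are
made up for illustration, and the per-record figure is wall-clock time for the
whole query divided by the record count, so it includes scan and other
operator time and is not a per-thread hash aggregate cost.

```python
# Figures copied from the two sqlline runs in this report.
RECORDS = 2_879_987_999  # records processed by the hash aggregate in both queries

RUNS = {
    "ss_list_price": {"distinct_keys": 19_736, "seconds": 163.345},
    "ss_net_profit": {"distinct_keys": 1_525_675, "seconds": 2094.962},
}


def per_record_nanos(seconds, records=RECORDS):
    """Wall-clock nanoseconds per record for the whole query (not per thread)."""
    return seconds * 1e9 / records


def compare(fast, slow):
    """Return how much the key count and the runtime grew from `fast` to `slow`."""
    return {
        "key_ratio": slow["distinct_keys"] / fast["distinct_keys"],
        "time_ratio": slow["seconds"] / fast["seconds"],
        "extra_seconds": slow["seconds"] - fast["seconds"],
    }


if __name__ == "__main__":
    summary = compare(RUNS["ss_list_price"], RUNS["ss_net_profit"])
    print(f"distinct keys grew {summary['key_ratio']:.1f}x")
    print(f"runtime grew {summary['time_ratio']:.1f}x "
          f"(+{summary['extra_seconds'] / 60:.1f} minutes)")
    for name, run in RUNS.items():
        print(f"{name}: {per_record_nanos(run['seconds']):.1f} ns/record")
```

The point of the comparison is that the record count is identical in both
runs, so any growth in the per-record figure has to be attributed to something
that scales with the number of distinct keys.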
The dataset is too large to attach to a JIRA, and so are the logs.
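
Since the data cannot be attached, the shape of the experiment (fixed record
count, growing number of distinct keys) can be sketched on synthetic data. The
snippet below is only a toy: it uses a Python dict, not Drill's hash aggregate,
and every name in it is hypothetical. It shows the methodology, not the
behaviour of Drill itself.

```python
import random
import time


def make_records(total, distinct_keys, seed=0):
    """Build `total` values containing exactly `distinct_keys` distinct keys."""
    if distinct_keys > total:
        raise ValueError("distinct_keys cannot exceed total")
    rng = random.Random(seed)
    # Emit every key once so the distinct count is exact, then pad randomly.
    records = list(range(distinct_keys))
    records += [rng.randrange(distinct_keys) for _ in range(total - distinct_keys)]
    rng.shuffle(records)
    return records


def count_distinct(records):
    """Hash-based distinct count: one hash table probe/insert per record."""
    seen = {}
    for value in records:
        seen[value] = True
    return len(seen)


def time_distinct(total, distinct_keys):
    """Return (distinct count, elapsed seconds) for one synthetic run."""
    records = make_records(total, distinct_keys)
    start = time.perf_counter()
    result = count_distinct(records)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    TOTAL = 1_000_000  # held constant while the key count grows
    for keys in (1_000, 10_000, 100_000, 1_000_000):
        result, elapsed = time_distinct(TOTAL, keys)
        print(f"{keys:>9} distinct keys -> count={result}, {elapsed:.3f}s")
```

Only the aggregation step is timed; data generation is excluded so that the
measured difference between runs reflects the hash table work alone.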