[GitHub] [arrow-datafusion] jiangzhx opened a new issue #1246: group by high cardinality column in datafusion 10 times slower than low cardinality column

GitBox Thu, 04 Nov 2021 22:31:17 -0700


jiangzhx opened a new issue #1246:
URL: https://github.com/apache/arrow-datafusion/issues/1246



   **Describe the bug**
   Grouping by a high-cardinality column in DataFusion is about 10 times slower than grouping by a low-cardinality column.
   I also tested other OLAP engines; there, the high-cardinality query is only about 2 times slower, or less.
   
   ### [Trino](https://github.com/trinodb/trino), an OLAP engine written in Java
   low cardinality: ~1000 ms
   high cardinality: ~2000 ms
   
   
   ### [Doris](https://github.com/apache/incubator-doris/), an OLAP engine written in C++
   low cardinality: ~350 ms
   high cardinality: ~500 ms
   
   
   **To Reproduce**
   Steps to reproduce the behavior:
   Use a Parquet table with 60,000,000 rows; the data was generated with [ssb-dbgen](https://github.com/electrum/ssb-dbgen).
   
   Group by `LO_ORDERPRIORITY` (low cardinality, 5 groups):
       
       SELECT sum(LO_EXTENDEDPRICE) AS revenue FROM lineorder_flat GROUP BY LO_ORDERPRIORITY;
       5 rows in set. Query took 0.341 seconds.
   
   Group by `S_ADDRESS` (high cardinality, 20000 groups):
   
       SELECT sum(LO_EXTENDEDPRICE) AS revenue FROM lineorder_flat GROUP BY S_ADDRESS;
       20000 rows in set. Query took 2.582 seconds.
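
For reference, the slowdown factors implied by the timings in this report can be worked out with a few lines of Python. This is only a sketch of the arithmetic: the names `timings_ms` and `slowdown` are illustrative, and the Trino and Doris figures are the approximate values quoted above.

```python
# Timings reported in this issue, in milliseconds:
# (low-cardinality query, high-cardinality query).
# Trino and Doris values are approximate, as stated in the report.
timings_ms = {
    "trino": (1000.0, 2000.0),
    "doris": (350.0, 500.0),
    "datafusion": (341.0, 2582.0),
}


def slowdown(low_ms, high_ms):
    """Return how many times slower the high-cardinality query is."""
    return high_ms / low_ms


ratios = {
    engine: slowdown(low, high) for engine, (low, high) in timings_ms.items()
}

for engine, ratio in ratios.items():
    print(f"{engine}: {ratio:.2f}x")
```

With these numbers the factors come out at roughly 2.0x for Trino, 1.4x for Doris, and 7.6x for DataFusion.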
   
   
   **Expected behavior**
   The slowdown should be similar to that of the other engines (about 2 times or less).
   
   

