[GitHub] [arrow-datafusion] jiangzhx commented on issue #1246: group by high cardinality column in datafusion 10 times slower than low cardinality column

GitBox Fri, 05 Nov 2021 13:35:44 -0700


jiangzhx commented on issue #1246:
URL: 
https://github.com/apache/arrow-datafusion/issues/1246#issuecomment-961646148



   > If I recall correctly, datafusion doesn't do fine optimization about 
`group by` and `aggregate functions` at present. It's worth adding it to our 
RoadMap and doing it in the future.
   
   I tried digging into the code in Trino and Doris; they both have a streaming 
aggregate node, but I can't understand how it works.
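
For reference, a minimal sketch of what a streaming aggregate generally does, assuming the input is already sorted (or clustered) on the group key. It is not taken from the Trino or Doris code; the function name `streaming_group_count` and the sample rows are hypothetical and only illustrate the idea that one group's state is held at a time.

```python
from typing import Iterable, Iterator, Tuple


def streaming_group_count(sorted_keys: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (key, count) for each run of equal keys.

    Assumption: the input is already sorted or clustered on the key.
    Only the current group's state is kept, so memory use does not grow
    with the number of distinct keys, unlike a hash-based aggregate.
    """
    current = None
    count = 0
    started = False
    for key in sorted_keys:
        if started and key == current:
            # Same group as the previous row: update the running state.
            count += 1
        else:
            # Key changed: the previous group is complete, so emit it.
            if started:
                yield current, count
            current, count, started = key, 1, True
    # Emit the final group, if there was any input at all.
    if started:
        yield current, count


# Illustrative rows, already sorted on the group key.
rows = ["1-URGENT", "1-URGENT", "2-HIGH", "3-MEDIUM", "3-MEDIUM", "3-MEDIUM"]
result = list(streaming_group_count(rows))
```

The trade-off is that the input must arrive in key order; if it does not, a sort is needed first, which may or may not be cheaper than building a hash table over many distinct keys.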
   
   The `aggregate functions` themselves work fine: with or without 
sum(LO_EXTENDEDPRICE), the performance shows no big difference, and the 
high-cardinality query is still 5~10 times slower.
   
   low cardinality:
   
       select 1  FROM lineorder_flat group by LO_ORDERPRIORITY;
       5 rows in set. Query took 0.236 seconds.
   
   high cardinality:
   
       select 1  FROM lineorder_flat group by S_ADDRESS;
       20000 rows in set. Query took 1.429 seconds.


