alamb opened a new issue, #7064:
URL: https://github.com/apache/arrow-datafusion/issues/7064

   ### Is your feature request related to a problem or challenge?
   
   DataFusion could be made faster for queries that have a `GROUP BY <string> 
column`
   
   For example, in ClickBench Q34
   
   ```sql
   -- Q34
   SELECT "URL", COUNT(*) AS c FROM hits GROUP BY "URL" ORDER BY c DESC 
LIMIT 10;
   ```
   
   You can run this query from a datafusion checkout like this (using the code 
in https://github.com/apache/arrow-datafusion/pull/7060, which hopefully will 
be merged shortly): 
   
   ```shell
   # get data
   ./benchmarks/bench.sh data clickbench_1
   # run benchmark
   cargo run --release  --bin dfbench -- clickbench --query 34
   ```
   
   Here is the profile:
   
   (TBD)
   
   ### Describe the solution you'd like
   
   I would like a special-cased `GroupValues` implementation for the case of 
a single string (hopefully Utf8, LargeUtf8, Binary, and LargeBinary) column that:
   1. Makes no allocations per group (i.e. stores all strings in a single 
contiguous location)
   2. Avoids the Row format / copy of values
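
A minimal sketch of the two points above, using only the standard library. 
`StringGroups`, `intern`, and `value` are hypothetical names for illustration; 
this is not the DataFusion `GroupValues` trait. It assumes a simple 
linear-probing table, and "no allocations per group" holds only in the 
amortized sense, since the backing vectors still grow occasionally.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Marker for a free slot in the open-addressing table.
const EMPTY: usize = usize::MAX;

/// Hypothetical store that keeps every distinct group value in one buffer.
struct StringGroups {
    buffer: Vec<u8>,     // bytes of all distinct values, back to back
    offsets: Vec<usize>, // group i is buffer[offsets[i]..offsets[i + 1]]
    hashes: Vec<u64>,    // cached hash of each group
    slots: Vec<usize>,   // linear-probing table of group ids
}

impl StringGroups {
    fn new() -> Self {
        Self { buffer: Vec::new(), offsets: vec![0], hashes: Vec::new(), slots: vec![EMPTY; 16] }
    }

    /// Number of distinct groups seen so far.
    fn len(&self) -> usize {
        self.hashes.len()
    }

    /// Bytes of the given group, borrowed from the shared buffer.
    fn value(&self, group: usize) -> &[u8] {
        &self.buffer[self.offsets[group]..self.offsets[group + 1]]
    }

    /// Returns the group id for `value`, appending it to the buffer if it is new.
    fn intern(&mut self, value: &[u8]) -> usize {
        // Keep the table at most half full so probing always terminates.
        if (self.len() + 1) * 2 > self.slots.len() {
            self.grow();
        }
        let mut hasher = DefaultHasher::new();
        hasher.write(value);
        let hash = hasher.finish();
        let mask = self.slots.len() - 1;
        let mut i = (hash as usize) & mask;
        loop {
            let group = self.slots[i];
            if group == EMPTY {
                // New group: copy the bytes once into the shared buffer.
                let group = self.len();
                self.buffer.extend_from_slice(value);
                self.offsets.push(self.buffer.len());
                self.hashes.push(hash);
                self.slots[i] = group;
                return group;
            }
            if self.hashes[group] == hash && self.value(group) == value {
                return group;
            }
            i = (i + 1) & mask;
        }
    }

    /// Doubles the table and reinserts group ids using the cached hashes.
    fn grow(&mut self) {
        let new_len = self.slots.len() * 2;
        let mask = new_len - 1;
        let mut slots = vec![EMPTY; new_len];
        for (group, &hash) in self.hashes.iter().enumerate() {
            let mut i = (hash as usize) & mask;
            while slots[i] != EMPTY {
                i = (i + 1) & mask;
            }
            slots[i] = group;
        }
        self.slots = slots;
    }
}
```

Calling `intern` once per input row yields the group index for that row, and 
the final `buffer` / `offsets` pair already has the shape of an Arrow string 
array's values and offsets, so producing the output should not need a 
per-value copy through the Row format.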
   
   Other ideas that could make this faster:
   1. Small String optimization
   2. Special-case ASCII (to avoid UTF-8 checks for data, like TPCH, that 
contains only ASCII characters)
   
   "Small String optimization" refers to the format described in the [umbra 
paper](https://db.in.tum.de/~freitag/papers/p29-neumann-cidr20.pdf):
   
   <img width="546" alt="Screenshot 2023-07-24 at 6 38 01 AM" 
src="https://github.com/apache/arrow-datafusion/assets/490673/967c1956-85e4-46b7-ac75-75620aaa99f5">
   
   This would have to be adapted for Rust / safety, but the same general idea 
applies (inlining the first few bytes of the string into the hash table for 
quick "is it equal" comparisons, and then having an offset to an external area 
for larger strings).
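
One way the idea could look in safe Rust is sketched below. `StringView`, 
`INLINE_LEN`, and `eq_bytes` are hypothetical names, and the 16-byte layout 
(4-byte length, 4-byte prefix, 8 bytes holding either the rest of a short 
string or an offset) is an assumption modeled on the paper, with an offset 
into a shared buffer in place of the paper's raw pointer.

```rust
/// Strings up to this many bytes are stored entirely inside the header.
const INLINE_LEN: usize = 12;

/// Hypothetical 16-byte string header that could live in a hash table entry.
#[derive(Clone, Copy, Debug)]
struct StringView {
    len: u32,
    /// First (up to) four bytes of the string, zero padded.
    prefix: [u8; 4],
    /// Short strings: bytes 4..12 of the value.
    /// Long strings: little-endian offset of the full value in the external buffer.
    rest: [u8; 8],
}

impl StringView {
    /// Builds a header for `value`, spilling long values into `external`.
    fn new(value: &[u8], external: &mut Vec<u8>) -> Self {
        let mut prefix = [0u8; 4];
        let n = value.len().min(4);
        prefix[..n].copy_from_slice(&value[..n]);

        let mut rest = [0u8; 8];
        if value.len() <= INLINE_LEN {
            if value.len() > 4 {
                rest[..value.len() - 4].copy_from_slice(&value[4..]);
            }
        } else {
            // Long value: store all of it externally and remember where it starts.
            let offset = external.len() as u64;
            external.extend_from_slice(value);
            rest = offset.to_le_bytes();
        }
        Self { len: value.len() as u32, prefix, rest }
    }

    /// Compares against `value`, touching `external` only when the length
    /// and the inlined prefix both match and the string is long.
    fn eq_bytes(&self, value: &[u8], external: &[u8]) -> bool {
        let len = self.len as usize;
        if len != value.len() {
            return false;
        }
        let n = len.min(4);
        if self.prefix[..n] != value[..n] {
            return false;
        }
        if len <= INLINE_LEN {
            len <= 4 || self.rest[..len - 4] == value[4..]
        } else {
            let offset = u64::from_le_bytes(self.rest) as usize;
            &external[offset..offset + len] == value
        }
    }
}
```

In this sketch most mismatches are rejected by the length or prefix check 
without leaving the table entry, which is the quick "is it equal" path 
described above; whether that pays off on real data would need measuring.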
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   @tustvold's changes in 
https://github.com/apache/arrow-datafusion/issues/6969 and 
https://github.com/apache/arrow-datafusion/pull/7043 should make it very easy 
to code this up as a different GroupValues implementation
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
