e-dard commented on issue #1708:
URL: https://github.com/apache/arrow-datafusion/issues/1708#issuecomment-1034891519


   @alamb highlighted this thread internally and I saw a couple of interesting 
points. I work on IOx's Read Buffer, which is an in-memory columnar engine that 
currently implements Datafusion's table provider (so it only supports scans 
with predicate pushdown etc.).
   
   I have experimented with a prototype that can do grouping/aggregation 
directly on encoded columnar data (e.g., on integer representations of 
RLE/dictionary encodings) and I found a couple of things mentioned already in 
this thread:
   
   Using a `Vec<SomeEnum>` as the group key had a big overhead on hashing 
performance (as @alamb mentioned). However, in the Read Buffer's case it was 
possible to use all group columns' encoded representations (`u32`) directly [^1].
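
   To make that concrete, the gist of the change was roughly this (a minimal 
sketch, not the actual Read Buffer types; `SomeEnum` and the map values are 
purely illustrative):

   ```rust
   use std::collections::HashMap;

   // Illustrative stand-in for a "logical value" enum; hashing these is
   // comparatively expensive.
   #[derive(Hash, PartialEq, Eq)]
   enum SomeEnum {
       Str(String),
       Int(i64),
   }

   // Group key built from logical values: every probe hashes the enum variants.
   type EnumKeyedGroups = HashMap<Vec<SomeEnum>, u64>;

   // Group key built directly from the dictionary/RLE-encoded ordinals:
   // much cheaper to hash and compare.
   type EncodedKeyedGroups = HashMap<Vec<u32>, u64>;
   ```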
   
   Using `Vec<u32>` gave a significant performance improvement. Further, as a 
special-case optimisation, I found that when grouping on four or fewer columns 
there was another big bump in performance from packing the encoded group key 
values into a single `u128` and using that as the key in the hashmap (see the 
sketch below). This is where I see the similarity to using a binary 
representation of the group key.
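
   For the four-or-fewer-columns case, the packing looked something like this 
(again just a sketch; `pack_group_key` is a made-up name and it assumes each 
encoded value fits in 32 bits):

   ```rust
   use std::collections::HashMap;

   /// Pack up to four dictionary/RLE-encoded group column values into a single
   /// `u128` hash key by placing each `u32` in its own 32-bit slot.
   fn pack_group_key(encoded: &[u32]) -> Option<u128> {
       if encoded.len() > 4 {
           return None; // too many group columns: fall back to a Vec<u32> key
       }
       let mut key = 0u128;
       for (i, &v) in encoded.iter().enumerate() {
           key |= (v as u128) << (i * 32);
       }
       Some(key)
   }

   fn main() {
       // e.g. encoded ordinals for a three-column group key
       let encoded = [3u32, 1, 5];
       let mut groups: HashMap<u128, u64> = HashMap::new();
       if let Some(key) = pack_group_key(&encoded) {
           *groups.entry(key).or_insert(0) += 1; // accumulate for this group
       }
       assert_eq!(groups.len(), 1);
   }
   ```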
   
   Anyway, just some anecdotal thoughts :-). Whilst the Read Buffer can take 
advantage of some significant constraints that Datafusion can't, based on my 
experience playing around with similar ideas I suspect the direction @yjshen 
has proposed here will significantly improve grouping performance 👍 
   
   [^1]: Because all group columns in the Read Buffer are dictionary or RLE 
encoded such that the encoded representations preserve the same ordinal 
properties as the values they encode.

