Dandandan commented on pull request #68:
URL: https://github.com/apache/arrow-datafusion/pull/68#issuecomment-923086839


   Discussion on the parquet results led to the following conclusions:
   
   * Quite a bit of CPU time is spent in utf8 validation (around 20% for a full 
lineitem parquet file, or ~50% for a string column). We can use simdutf8 
(https://github.com/jorgecarleitao/arrow2/pull/426) to speed up utf8 validation 
for strings of >= 64 bytes, especially for non-ASCII data, but that doesn't 
help with the smaller strings in the TPC-H files. It seems `parquet` might not 
do the utf8 validation, or not always: 
https://github.com/apache/arrow-rs/issues/786 . 
Beyond this, `parquet2` has a larger advantage over `parquet` on optional 
fields. Based on profiling, I think there are still some opportunities for 
improving the `parquet2` performance a bit more, but I couldn't find a lot of 
low-hanging fruit.
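
To make the small-string point concrete, here is a minimal std-only sketch of two ways to validate an Arrow-style string column (a `values` byte buffer plus `offsets`). The function names are hypothetical, and this is not claimed to be what `parquet`, `parquet2` or simdutf8 do. It assumes the offsets are non-decreasing, start at 0 and end at `values.len()`. The first function calls the validator once per value, so many short strings mean many short calls; the second validates the buffer in one call and then only checks that every offset falls on a character boundary.

```rust
use std::str::Utf8Error;

/// Validate every value on its own. Value `i` is
/// `values[offsets[i]..offsets[i + 1]]`, so `offsets` has one more entry
/// than there are values. Assumes offsets are non-decreasing and in bounds.
fn validate_per_value(offsets: &[usize], values: &[u8]) -> Result<(), Utf8Error> {
    for w in offsets.windows(2) {
        // One validator call per (possibly very short) string.
        std::str::from_utf8(&values[w[0]..w[1]])?;
    }
    Ok(())
}

/// Validate the whole buffer in one call, then check that no offset splits
/// a multi-byte character. Equivalent to the per-value check only under the
/// assumption that the offsets cover the buffer from 0 to `values.len()`.
fn validate_whole_buffer(offsets: &[usize], values: &[u8]) -> bool {
    match std::str::from_utf8(values) {
        Ok(text) => offsets.iter().all(|&o| text.is_char_boundary(o)),
        Err(_) => false,
    }
}
```

With many short values, the second form gives the validator one long input instead of many short ones, which is the regime where a SIMD validator can pay off.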
   
   There are also some arrow2 updates which might affect DataFusion queries; 
I will rerun the benchmarks once they are merged:
   
   * https://github.com/jorgecarleitao/arrow2/pull/428 : makes `aHash` the 
default and adds some multiversioning to select the SIMD instructions (aes / 
sse / etc. on x86_64) used by `aHash`. Note that `aHash` was not used by 
default before; this alone already gives a 2-3x boost over the Rust default 
`SipHash`. A nightly compiler might be even faster, as it can use 
specialization in that case.
   * https://github.com/jorgecarleitao/arrow2/pull/427 : improves performance 
on utf8 kernels.
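
For readers unfamiliar with how a hasher is swapped in Rust: `HashMap` is generic over its `BuildHasher`, so a different hash function can be plugged in without changing the code that uses the map. The sketch below stays std-only, so it uses a hand-written 64-bit FNV-1a as a stand-in; it is not `aHash` and says nothing about how arrow2 wires `aHash` in. `Fnv1a`, `FastMap` and `count_keys` are hypothetical names.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// Stand-in hasher: 64-bit FNV-1a. NOT aHash, only an illustration of
/// replacing the default SipHash-based hasher.
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf29ce484222325) // FNV-1a 64-bit offset basis
    }
}

impl Hasher for Fnv1a {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x100000001b3); // FNV 64-bit prime
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

/// A `HashMap` that uses the stand-in hasher instead of the default one.
type FastMap<K, V> = HashMap<K, V, BuildHasherDefault<Fnv1a>>;

/// Count how often each key occurs, using the custom-hasher map.
fn count_keys(keys: &[&str]) -> FastMap<String, usize> {
    let mut counts: FastMap<String, usize> = HashMap::default();
    for k in keys {
        *counts.entry(k.to_string()).or_insert(0) += 1;
    }
    counts
}
```

Only the type alias changes when switching hashers; callers such as `count_keys` stay the same.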
   


