Dandandan commented on pull request #68: URL: https://github.com/apache/arrow-datafusion/pull/68#issuecomment-923086839
Discussion on the parquet results led to the following conclusions:

* Quite a bit of CPU time is spent in utf8 validation (around 20% for a full lineitem parquet file, or ~50% for a string column). We can use simdutf8 (https://github.com/jorgecarleitao/arrow2/pull/426) to speed up utf8 validation for >= 64 byte columns, especially for non-ascii data, but that doesn't help with the smaller strings in the TPC-H files. It seems `parquet` might not, or not always, do the utf8 validation: https://github.com/apache/arrow-rs/issues/786

Beyond this, `parquet2` has a larger advantage with optional fields compared to `parquet`.

Based on profiling, I think there are still some opportunities for improving the `parquet2` performance a bit more, but I couldn't find a lot of low-hanging fruit.

Also, there are some arrow2 updates which might have an effect on DataFusion queries, and I will rerun the benchmarks when they are merged:

* https://github.com/jorgecarleitao/arrow2/pull/428 : makes `aHash` the default and adds some multiversioning to select the SIMD instructions (aes / sse / etc. on x86_64) used by `aHash`. Note that `aHash` was not used by default before; this already gives a 2-3x boost over the Rust default `SipHash`. A nightly compiler might be even faster, as it can use specialization in that case.
* https://github.com/jorgecarleitao/arrow2/pull/427 : improves performance on utf8 kernels.
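To make the utf8 validation point a bit more concrete, here is a minimal std-only sketch of two ways to validate a string column stored as one values buffer plus offsets. It is an illustration under assumptions, not how `parquet`, `parquet2` or arrow2 actually do it: the function names, the `ValidationError` type and the `usize` offsets are all hypothetical. The first function validates every value separately; the second validates the covered bytes once and then only checks that each offset falls on a char boundary, which is one possible way to avoid paying a per-call cost on many small strings. No timings are claimed here.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum ValidationError {
    /// Some bytes are not valid UTF-8.
    InvalidUtf8,
    /// An offset is out of range, decreasing, or splits a multi-byte char.
    BadOffset(usize),
}

/// Per-value validation: one `std::str::from_utf8` call per string.
pub fn validate_each(values: &[u8], offsets: &[usize]) -> Result<(), ValidationError> {
    for w in offsets.windows(2) {
        // `get` returns None for out-of-range or decreasing offsets.
        let slice = values
            .get(w[0]..w[1])
            .ok_or(ValidationError::BadOffset(w[1]))?;
        std::str::from_utf8(slice).map_err(|_| ValidationError::InvalidUtf8)?;
    }
    Ok(())
}

/// Whole-buffer validation: one `from_utf8` call over the covered bytes,
/// followed by a cheap char-boundary check per offset.
pub fn validate_once(values: &[u8], offsets: &[usize]) -> Result<(), ValidationError> {
    let (first, last) = match (offsets.first(), offsets.last()) {
        (Some(&f), Some(&l)) => (f, l),
        _ => return Ok(()), // no offsets, so there are no values to check
    };
    let covered = values
        .get(first..last)
        .ok_or(ValidationError::BadOffset(last))?;
    let text = std::str::from_utf8(covered).map_err(|_| ValidationError::InvalidUtf8)?;
    for w in offsets.windows(2) {
        // The monotonicity check runs first, so `w[1] - first` cannot underflow.
        if w[0] > w[1] || !text.is_char_boundary(w[1] - first) {
            return Err(ValidationError::BadOffset(w[1]));
        }
    }
    Ok(())
}
```

Both functions accept and reject the same inputs, but they may report a different error variant: an offset that splits a multi-byte character shows up as `InvalidUtf8` in the per-value version and as `BadOffset` in the whole-buffer version. In the second version the single `from_utf8` call is also the natural place to swap in a SIMD validator such as simdutf8.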