Re: [I] Blog post about parquet vs custom file formats [datafusion]
alamb commented on issue #16149: URL: https://github.com/apache/datafusion/issues/16149#issuecomment-2976129703
> I am also curious: **Why would uncompressed Parquet be considered an optimization over Snappy-compressed Parquet?** Is the decompression overhead of Snappy significant enough to slow down read performance?

Yes, exactly this -- once you have the data locally, the cost of block decompression (e.g. Snappy) often dominates query time. Of course, using no compression comes at a tradeoff of larger files / more network transfer.
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Blog post about parquet vs custom file formats [datafusion]
JigaoLuo commented on issue #16149: URL: https://github.com/apache/datafusion/issues/16149#issuecomment-2972640469
Hi @alamb @zhuqi-lucas, I recently came across this issue, and it is very nice. Thanks. I am also curious: **Why would uncompressed Parquet be considered an optimization over Snappy-compressed Parquet?** Is the decompression overhead of Snappy significant enough to slow down read performance? I understand that compressed Parquet is typically motivated by overcoming I/O-bound limitations in databases.
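The tradeoff in this exchange (decompression cost versus file size) can be made concrete with a back-of-the-envelope model. This is only a sketch: `scan_seconds` and all of its inputs are hypothetical, the throughput figures used with it are illustrative rather than measured, and it assumes I/O and decompression run one after the other instead of overlapping.

```rust
/// Estimated seconds to scan one column chunk: transfer time plus
/// decompression time. Hypothetical model, not taken from DataFusion.
///
/// * `stored_bytes`: bytes on disk / on the wire (after compression, if any)
/// * `raw_bytes`: bytes after decompression
/// * `io_bytes_per_sec`: storage or network throughput
/// * `decompress_bytes_per_sec`: decompressor output rate, `None` if the
///   data is stored uncompressed
fn scan_seconds(
    stored_bytes: f64,
    raw_bytes: f64,
    io_bytes_per_sec: f64,
    decompress_bytes_per_sec: Option<f64>,
) -> f64 {
    // Time spent moving the stored bytes.
    let io = stored_bytes / io_bytes_per_sec;
    // Time spent producing the raw bytes; zero when nothing is compressed.
    let cpu = match decompress_bytes_per_sec {
        Some(rate) => raw_bytes / rate,
        None => 0.0,
    };
    io + cpu
}
```

With fast local reads the decompression term dominates, so the uncompressed file wins; with a slow link the transfer term dominates and compression pays for itself.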
Re: [I] Blog post about parquet vs custom file formats [datafusion]
alamb commented on issue #16149: URL: https://github.com/apache/datafusion/issues/16149#issuecomment-2948801017 Here is a related blog about doing this on clickhouse: - https://altinity.com/blog/the-future-has-arrived-parquet-on-iceberg-finally-outperforms-mergetree
Re: [I] Blog post about parquet vs custom file formats [datafusion]
alamb commented on issue #16149: URL: https://github.com/apache/datafusion/issues/16149#issuecomment-2934649736 I posted about this on twitter too in case anyone is interested: https://x.com/andrewlamb/status/1929852296323547273
Re: [I] Blog post about parquet vs custom file formats [datafusion]
zhuqi-lucas commented on issue #16149: URL: https://github.com/apache/datafusion/issues/16149#issuecomment-2934671069 Thank you @alamb!
Re: [I] Blog post about parquet vs custom file formats [datafusion]
zhuqi-lucas commented on issue #16149: URL: https://github.com/apache/datafusion/issues/16149#issuecomment-2922543426
> > Do some changes in arrow-rs clickbench benchmark:
>
> Do I understand that the changes you report are due to simply rewriting the parquet files to have a page index and be uncompressed?

Yeah @alamb, this is the reason. Thanks.
Re: [I] Blog post about parquet vs custom file formats [datafusion]
alamb commented on issue #16149: URL: https://github.com/apache/datafusion/issues/16149#issuecomment-2922493420
> Do some changes in arrow-rs clickbench benchmark:

Do I understand that the changes you report are due to simply rewriting the parquet files to have a page index and be uncompressed?
Re: [I] Blog post about parquet vs custom file formats [datafusion]
alamb commented on issue #16149: URL: https://github.com/apache/datafusion/issues/16149#issuecomment-2919981992 3x faster for Q21 is pretty neat to see
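For reference, the Q21 figure follows from the mean times reported in the benchmark comment: 117.2 ms on `main` versus 39.7 ms with the rewritten files. A minimal sketch of that arithmetic (the helper names `speedup` and `round2` are made up for illustration):

```rust
/// Ratio of baseline time to candidate time; values above 1.0 mean the
/// candidate is faster. Both times must use the same unit.
fn speedup(baseline: f64, candidate: f64) -> f64 {
    baseline / candidate
}

/// Round to two decimal places, matching the precision shown in the
/// benchmark comparison table.
fn round2(x: f64) -> f64 {
    (x * 100.0).round() / 100.0
}
```

For example, `round2(speedup(117.2, 39.7))` gives the 2.95 shown for async Q21, and the same calculation on the Q22 times (219.8 ms and 56.6 ms) gives 3.88.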
Re: [I] Blog post about parquet vs custom file formats [datafusion]
zhuqi-lucas commented on issue #16149:
URL: https://github.com/apache/datafusion/issues/16149#issuecomment-2918761743
> > A fun experiment might be to "fix" the clickbench partitioned dataset by
> >
> > - resorting and writing with page indexes (could use a bunch of DataFusion COPY commands pretty easily to do this). The sort order should be some subset of the predicate columns. Perhaps EventTime and then maybe SearchPhrase / URL.
> > - disabling compression
>
> This is very interesting, maybe we can also do this for arrow-rs clickbench benchmark to see the result.

cc @alamb @Dandandan I ran the experiment; here is the result. Most queries benefit from page_index with uncompressed parquet data:
```text
critcmp mock_customer_format main
group                              main                               mock_customer_format
-----                              ----                               --------------------
arrow_reader_clickbench/async/Q1   1.23  1025.0±96.12µs  ? ?/sec    1.00   834.2±66.60µs  ? ?/sec
arrow_reader_clickbench/async/Q10  1.00     20.5±5.75ms  ? ?/sec    1.25     25.7±6.92ms  ? ?/sec
arrow_reader_clickbench/async/Q11  1.00     22.8±4.31ms  ? ?/sec    1.03     23.6±7.31ms  ? ?/sec
arrow_reader_clickbench/async/Q12  1.86     29.7±2.88ms  ? ?/sec    1.00     16.0±2.67ms  ? ?/sec
arrow_reader_clickbench/async/Q13  1.74     36.9±2.59ms  ? ?/sec    1.00     21.2±2.74ms  ? ?/sec
arrow_reader_clickbench/async/Q14  1.33     37.2±1.88ms  ? ?/sec    1.00     27.9±3.90ms  ? ?/sec
arrow_reader_clickbench/async/Q19  1.00      3.9±0.45ms  ? ?/sec    1.21      4.7±1.11ms  ? ?/sec
arrow_reader_clickbench/async/Q20  2.59     90.6±3.54ms  ? ?/sec    1.00     34.9±2.57ms  ? ?/sec
arrow_reader_clickbench/async/Q21  2.95    117.2±5.50ms  ? ?/sec    1.00     39.7±4.13ms  ? ?/sec
arrow_reader_clickbench/async/Q22  3.88    219.8±3.14ms  ? ?/sec    1.00    56.6±10.14ms  ? ?/sec
arrow_reader_clickbench/async/Q23  1.64    215.7±3.89ms  ? ?/sec    1.00    131.7±7.86ms  ? ?/sec
arrow_reader_clickbench/async/Q24  1.80     40.7±3.22ms  ? ?/sec    1.00     22.6±3.77ms  ? ?/sec
arrow_reader_clickbench/async/Q27  2.82     98.8±4.16ms  ? ?/sec    1.00     35.1±6.48ms  ? ?/sec
arrow_reader_clickbench/async/Q28  2.37     95.6±3.30ms  ? ?/sec    1.00     40.3±7.00ms  ? ?/sec
arrow_reader_clickbench/async/Q30  1.28     43.8±4.08ms  ? ?/sec    1.00     34.3±3.55ms  ? ?/sec
arrow_reader_clickbench/async/Q36  2.35     96.6±4.54ms  ? ?/sec    1.00     41.1±6.47ms  ? ?/sec
arrow_reader_clickbench/async/Q37  1.43     58.4±8.72ms  ? ?/sec    1.00     40.9±8.38ms  ? ?/sec
arrow_reader_clickbench/async/Q38  1.02     22.6±1.55ms  ? ?/sec    1.00     22.3±1.38ms  ? ?/sec
arrow_reader_clickbench/async/Q39  1.11     24.7±0.66ms  ? ?/sec    1.00     22.2±2.25ms  ? ?/sec
arrow_reader_clickbench/async/Q40  1.22     28.7±1.41ms  ? ?/sec    1.00     23.6±1.69ms  ? ?/sec
arrow_reader_clickbench/async/Q41  1.03     21.9±1.36ms  ? ?/sec    1.00     21.3±3.00ms  ? ?/sec
arrow_reader_clickbench/async/Q42  1.00     12.3±1.00ms  ? ?/sec    1.48     18.2±2.33ms  ? ?/sec
arrow_reader_clickbench/sync/Q1    1.00    736.6±8.42µs  ? ?/sec    1.46  1077.0±53.33µs  ? ?/sec
arrow_reader_clickbench/sync/Q10
```
Re: [I] Blog post about parquet vs custom file formats [datafusion]
zhuqi-lucas commented on issue #16149: URL: https://github.com/apache/datafusion/issues/16149#issuecomment-2903175289
> A fun experiment might be to "fix" the clickbench partitioned dataset by
>
> - resorting and writing with page indexes (could use a bunch of DataFusion COPY commands pretty easily to do this). The sort order should be some subset of the predicate columns. Perhaps EventTime and then maybe SearchPhrase / URL.
> - disabling compression

This is very interesting, maybe we can also do this for arrow-rs clickbench benchmark to see the result.
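To illustrate why re-sorting matters for the page index, here is a small std-only simulation. It is a sketch, not the arrow-rs implementation: `PageStats`, `build_page_index`, and `pages_to_read` are hypothetical names, and a plain `i64` column stands in for something like EventTime. The idea is that per-page min/max ranges are narrow when the data is sorted on the predicate column, so a range predicate rules out most pages.

```rust
/// Min/max statistics for one data page, as a page index would record them.
#[derive(Debug, Clone, Copy, PartialEq)]
struct PageStats {
    min: i64,
    max: i64,
}

/// Split a column into fixed-size pages and record min/max per page.
/// Panics if `page_size` is 0 (a limitation of `chunks`).
fn build_page_index(values: &[i64], page_size: usize) -> Vec<PageStats> {
    values
        .chunks(page_size)
        .map(|page| PageStats {
            // `chunks` never yields an empty slice, so unwrap is safe here.
            min: *page.iter().min().unwrap(),
            max: *page.iter().max().unwrap(),
        })
        .collect()
}

/// Indices of pages whose [min, max] range overlaps the predicate range
/// [lo, hi]; every other page can be skipped without decoding it.
fn pages_to_read(index: &[PageStats], lo: i64, hi: i64) -> Vec<usize> {
    index
        .iter()
        .enumerate()
        .filter(|(_, stats)| stats.max >= lo && stats.min <= hi)
        .map(|(i, _)| i)
        .collect()
}
```

On 100 sorted values with 10 values per page, a predicate on the range 25 to 34 touches only two pages. If the same values are scattered across pages, every page's range overlaps the predicate and nothing can be skipped.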
Re: [I] Blog post about parquet vs custom file formats [datafusion]
alamb commented on issue #16149: URL: https://github.com/apache/datafusion/issues/16149#issuecomment-2901748283
> Interestingly, Clickbench being quite a bit faster again for 1.3 ([ClickHouse/ClickBench#376](https://github.com/ClickHouse/ClickBench/pull/376) ) seems mostly related to using predicate pushdown more effectively during Parquet decoding (which they already might have implemented for their own format).

Indeed -- unsurprisingly, the more effort that is put into parquet readers the faster they go 😆 and the open nature / widespread adoption of the format makes it easier to gather that required effort. BTW, I am working on the same for DataFusion with @zhuqi-lucas in https://github.com/apache/arrow-rs/issues/7456. I hope we will have some major improvements to share in another week or two.
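For readers unfamiliar with the term, pushing a predicate into decoding roughly means: evaluate the filter on its own column first, then decode the remaining columns only for the rows that passed. The sketch below shows that shape with std-only Rust. It is an assumption-laden illustration, not the arrow-rs or ClickHouse code: `select_rows` and `materialize_selected` are hypothetical names, and turning a `&str` into a `String` stands in for the real (much more expensive) page decoding.

```rust
/// Evaluate the predicate on the filter column and return the indices of
/// the rows that pass.
fn select_rows(filter_column: &[i64], predicate: impl Fn(i64) -> bool) -> Vec<usize> {
    filter_column
        .iter()
        .enumerate()
        .filter(|(_, value)| predicate(**value))
        .map(|(row, _)| row)
        .collect()
}

/// "Decode" a second column only for the selected rows. Rows that failed
/// the predicate are never touched, which is where the savings come from.
/// Panics if `selection` contains an index outside `encoded`.
fn materialize_selected(encoded: &[&str], selection: &[usize]) -> Vec<String> {
    selection
        .iter()
        .map(|&row| encoded[row].to_string())
        .collect()
}
```

Given a filter column `[62, 7, 62, 9]` and the predicate `v == 62`, `select_rows` returns rows 0 and 2, and only those two values of the second column are materialized.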
Re: [I] Blog post about parquet vs custom file formats [datafusion]
Dandandan commented on issue #16149: URL: https://github.com/apache/datafusion/issues/16149#issuecomment-2901389429 Interestingly, Clickbench being quite a bit faster again for 1.3 (https://github.com/ClickHouse/ClickBench/pull/376 ) seems mostly related to using predicate pushdown more effectively during Parquet decoding.
