Re: [I] Blog post about parquet vs custom file formats [datafusion]

2025-06-16 Thread via GitHub


alamb commented on issue #16149:
URL: https://github.com/apache/datafusion/issues/16149#issuecomment-2976129703

   > I am also curious: **Why would uncompressed Parquet be considered an 
optimization over Snappy-compressed Parquet?** Is the decompression overhead of 
Snappy significant enough to slow down read performance?
   
   Yes, exactly this -- once you have the data locally, the cost of block 
decompression (e.g. Snappy) often dominates query performance. Of course, 
skipping compression comes with a tradeoff: larger files and more network 
transfer.
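
One rough way to picture this tradeoff is to model scan time as transfer time plus decompression time. The sketch below is only a toy model: the function name `scan_seconds` and every number in it (data size, compression ratio, bandwidths, decompression throughput) are illustrative assumptions, not measurements from this thread.

```python
def scan_seconds(uncompressed_bytes, compression_ratio, io_bytes_per_sec,
                 decompress_bytes_per_sec=None):
    """Estimate scan time as I/O time plus decompression time.

    decompress_bytes_per_sec is throughput measured in uncompressed output
    bytes; None means the file is stored uncompressed (no CPU cost).
    """
    stored_bytes = uncompressed_bytes / compression_ratio
    io_time = stored_bytes / io_bytes_per_sec
    if decompress_bytes_per_sec is None:
        cpu_time = 0.0
    else:
        cpu_time = uncompressed_bytes / decompress_bytes_per_sec
    return io_time + cpu_time


GB = 1e9
DATA = 10 * GB          # hypothetical uncompressed column data
RATIO = 2.0             # hypothetical compression ratio
DECOMPRESS = 2 * GB     # hypothetical decompression throughput (bytes/sec)
LOCAL_DISK = 5 * GB     # hypothetical fast local storage (bytes/sec)
NETWORK = 0.1 * GB      # hypothetical slow remote link (bytes/sec)

# Data already local: I/O is cheap, so decompression is most of the cost.
local_uncompressed = scan_seconds(DATA, 1.0, LOCAL_DISK)
local_compressed = scan_seconds(DATA, RATIO, LOCAL_DISK, DECOMPRESS)

# Data fetched over a slow link: the smaller file wins despite the CPU cost.
remote_uncompressed = scan_seconds(DATA, 1.0, NETWORK)
remote_compressed = scan_seconds(DATA, RATIO, NETWORK, DECOMPRESS)

print(local_uncompressed, local_compressed)
print(remote_uncompressed, remote_compressed)
```

Under these assumed numbers the uncompressed file is faster when the data is local and slower when it has to cross a slow network, which is the tradeoff described above.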


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] Blog post about parquet vs custom file formats [datafusion]

2025-06-14 Thread via GitHub


JigaoLuo commented on issue #16149:
URL: https://github.com/apache/datafusion/issues/16149#issuecomment-2972640469

   Hi @alamb @zhuqi-lucas ,
   
   I recently came across this issue and it is very nice. Thanks.
   I am also curious: **Why would uncompressed Parquet be considered an 
optimization over Snappy-compressed Parquet?** Is the decompression overhead of 
Snappy significant enough to slow down read performance?
   
   I understand that compressed Parquet is typically motivated by overcoming 
I/O-bound limitations in databases.





Re: [I] Blog post about parquet vs custom file formats [datafusion]

2025-06-06 Thread via GitHub


alamb commented on issue #16149:
URL: https://github.com/apache/datafusion/issues/16149#issuecomment-2948801017

   Here is a related blog about doing this on ClickHouse:
   - https://altinity.com/blog/the-future-has-arrived-parquet-on-iceberg-finally-outperforms-mergetree





Re: [I] Blog post about parquet vs custom file formats [datafusion]

2025-06-03 Thread via GitHub


alamb commented on issue #16149:
URL: https://github.com/apache/datafusion/issues/16149#issuecomment-2934649736

   I posted about this on twitter too in case anyone is interested: 
https://x.com/andrewlamb/status/1929852296323547273





Re: [I] Blog post about parquet vs custom file formats [datafusion]

2025-06-03 Thread via GitHub


zhuqi-lucas commented on issue #16149:
URL: https://github.com/apache/datafusion/issues/16149#issuecomment-2934671069

   Thank you @alamb!





Re: [I] Blog post about parquet vs custom file formats [datafusion]

2025-05-30 Thread via GitHub


zhuqi-lucas commented on issue #16149:
URL: https://github.com/apache/datafusion/issues/16149#issuecomment-2922543426

   > > Do some changes in arrow-rs clickbench benchmark:
   > 
   > Do I understand that the changes you report are due to simply rewriting 
the parquet files to have a page index and be uncompressed?
   
   Yeah @alamb, that's the reason. Thanks.





Re: [I] Blog post about parquet vs custom file formats [datafusion]

2025-05-30 Thread via GitHub


alamb commented on issue #16149:
URL: https://github.com/apache/datafusion/issues/16149#issuecomment-2922493420

   > Do some changes in arrow-rs clickbench benchmark:
   
   Do I understand that the changes you report are due to simply rewriting the 
parquet files to have a page index and be uncompressed?





Re: [I] Blog post about parquet vs custom file formats [datafusion]

2025-05-29 Thread via GitHub


alamb commented on issue #16149:
URL: https://github.com/apache/datafusion/issues/16149#issuecomment-2919981992

   3x faster for Q21 is pretty neat to see
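
For reference, the ratio can be recomputed from the mean timings in the critcmp output posted in this thread. A minimal sketch using the async Q21 and Q22 rows (the variable names are just for illustration):

```python
# Mean timings in milliseconds as (main, mock_customer_format),
# copied from the async Q21 and Q22 rows of the critcmp output.
timings_ms = {
    "Q21": (117.2, 39.7),
    "Q22": (219.8, 56.6),
}

# Speedup = time on main / time with the rewritten files.
speedups = {
    query: round(main / rewritten, 2)
    for query, (main, rewritten) in timings_ms.items()
}

print(speedups)  # {'Q21': 2.95, 'Q22': 3.88}
```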





Re: [I] Blog post about parquet vs custom file formats [datafusion]

2025-05-29 Thread via GitHub


zhuqi-lucas commented on issue #16149:
URL: https://github.com/apache/datafusion/issues/16149#issuecomment-2918761743

   > > A fun experiment might be to "fix" the clickbench partitioned dataset by
   > >
   > > - resorting and writing with page indexes (could use a bunch of DataFusion 
COPY commands pretty easily to do this). The sort order should be some subset 
of the predicate columns. Perhaps EventTime and then maybe SearchPhrase / URL.
   > > - disabling compression
   > 
   > This is very interesting, maybe we can also do this for the arrow-rs 
clickbench benchmark to see the result.
   
   cc @alamb @Dandandan I ran an experiment and here are the results. Most 
queries benefit from the page index with uncompressed parquet data:
   
   
   
   ```text
   critcmp mock_customer_format main
   group                               main                              mock_customer_format
   -----                               ----                              --------------------
   arrow_reader_clickbench/async/Q1    1.23   1025.0±96.12µs  ? ?/sec    1.00    834.2±66.60µs  ? ?/sec
   arrow_reader_clickbench/async/Q10   1.00      20.5±5.75ms  ? ?/sec    1.25      25.7±6.92ms  ? ?/sec
   arrow_reader_clickbench/async/Q11   1.00      22.8±4.31ms  ? ?/sec    1.03      23.6±7.31ms  ? ?/sec
   arrow_reader_clickbench/async/Q12   1.86      29.7±2.88ms  ? ?/sec    1.00      16.0±2.67ms  ? ?/sec
   arrow_reader_clickbench/async/Q13   1.74      36.9±2.59ms  ? ?/sec    1.00      21.2±2.74ms  ? ?/sec
   arrow_reader_clickbench/async/Q14   1.33      37.2±1.88ms  ? ?/sec    1.00      27.9±3.90ms  ? ?/sec
   arrow_reader_clickbench/async/Q19   1.00       3.9±0.45ms  ? ?/sec    1.21       4.7±1.11ms  ? ?/sec
   arrow_reader_clickbench/async/Q20   2.59      90.6±3.54ms  ? ?/sec    1.00      34.9±2.57ms  ? ?/sec
   arrow_reader_clickbench/async/Q21   2.95     117.2±5.50ms  ? ?/sec    1.00      39.7±4.13ms  ? ?/sec
   arrow_reader_clickbench/async/Q22   3.88     219.8±3.14ms  ? ?/sec    1.00     56.6±10.14ms  ? ?/sec
   arrow_reader_clickbench/async/Q23   1.64     215.7±3.89ms  ? ?/sec    1.00     131.7±7.86ms  ? ?/sec
   arrow_reader_clickbench/async/Q24   1.80      40.7±3.22ms  ? ?/sec    1.00      22.6±3.77ms  ? ?/sec
   arrow_reader_clickbench/async/Q27   2.82      98.8±4.16ms  ? ?/sec    1.00      35.1±6.48ms  ? ?/sec
   arrow_reader_clickbench/async/Q28   2.37      95.6±3.30ms  ? ?/sec    1.00      40.3±7.00ms  ? ?/sec
   arrow_reader_clickbench/async/Q30   1.28      43.8±4.08ms  ? ?/sec    1.00      34.3±3.55ms  ? ?/sec
   arrow_reader_clickbench/async/Q36   2.35      96.6±4.54ms  ? ?/sec    1.00      41.1±6.47ms  ? ?/sec
   arrow_reader_clickbench/async/Q37   1.43      58.4±8.72ms  ? ?/sec    1.00      40.9±8.38ms  ? ?/sec
   arrow_reader_clickbench/async/Q38   1.02      22.6±1.55ms  ? ?/sec    1.00      22.3±1.38ms  ? ?/sec
   arrow_reader_clickbench/async/Q39   1.11      24.7±0.66ms  ? ?/sec    1.00      22.2±2.25ms  ? ?/sec
   arrow_reader_clickbench/async/Q40   1.22      28.7±1.41ms  ? ?/sec    1.00      23.6±1.69ms  ? ?/sec
   arrow_reader_clickbench/async/Q41   1.03      21.9±1.36ms  ? ?/sec    1.00      21.3±3.00ms  ? ?/sec
   arrow_reader_clickbench/async/Q42   1.00      12.3±1.00ms  ? ?/sec    1.48      18.2±2.33ms  ? ?/sec
   arrow_reader_clickbench/sync/Q1     1.00     736.6±8.42µs  ? ?/sec    1.46   1077.0±53.33µs  ? ?/sec
   arrow_reader_clickbench/sync/Q10
   ```

   [The rest of this output is truncated in the archive.]

Re: [I] Blog post about parquet vs custom file formats [datafusion]

2025-05-22 Thread via GitHub


zhuqi-lucas commented on issue #16149:
URL: https://github.com/apache/datafusion/issues/16149#issuecomment-2903175289

   > A fun experiment might be to "fix" the clickbench partitioned dataset by
   >
   > - resorting and writing with page indexes (could use a bunch of DataFusion 
COPY commands pretty easily to do this). The sort order should be some subset 
of the predicate columns. Perhaps EventTime and then maybe SearchPhrase / URL.
   > - disabling compression
   
   This is very interesting; maybe we can also do this for the arrow-rs 
clickbench benchmark to see the result.
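
To illustrate why re-sorting matters for page indexes, here is a minimal sketch. It assumes a simplified model in which the page index keeps only a (min, max) pair per fixed-size page, and a range predicate must read every page whose range overlaps it. The helper names `page_stats` and `pages_to_read` and the synthetic data are hypothetical; this is not the arrow-rs or DataFusion implementation.

```python
def page_stats(values, page_size):
    """Split a column into fixed-size pages and record (min, max) per page."""
    stats = []
    for start in range(0, len(values), page_size):
        page = values[start:start + page_size]
        stats.append((min(page), max(page)))
    return stats


def pages_to_read(stats, lo, hi):
    """Count pages whose [min, max] range overlaps the predicate lo <= x <= hi."""
    return sum(1 for page_min, page_max in stats if page_max >= lo and page_min <= hi)


# Synthetic column: a deterministic shuffle of 0..9999
# (7919 is coprime to 10000, so this is a permutation).
unsorted_values = [(i * 7919) % 10000 for i in range(10000)]
sorted_values = sorted(unsorted_values)

PAGE_SIZE = 100
unsorted_stats = page_stats(unsorted_values, PAGE_SIZE)
sorted_stats = page_stats(sorted_values, PAGE_SIZE)

# Predicate: 1000 <= x <= 1099
print("pages read, unsorted:", pages_to_read(unsorted_stats, 1000, 1099))
print("pages read, sorted:  ", pages_to_read(sorted_stats, 1000, 1099))
```

In the sorted layout the matching values fall into a single page, while in the shuffled layout nearly every page has a min/max range wide enough to overlap the predicate, so the index prunes very little.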





Re: [I] Blog post about parquet vs custom file formats [datafusion]

2025-05-22 Thread via GitHub


alamb commented on issue #16149:
URL: https://github.com/apache/datafusion/issues/16149#issuecomment-2901748283

   > Interestingly, Clickbench being quite a bit faster again for 1.3 
([ClickHouse/ClickBench#376](https://github.com/ClickHouse/ClickBench/pull/376) 
) seems mostly related to using predicate pushdown more effectively during 
Parquet decoding (which they already might have implemented for their own 
format).
   
   Indeed -- unsurprisingly, the more effort that is put into parquet readers, 
the faster they go 😆 and the open nature / widespread adoption of the format 
makes it easier to gather that required effort. 
   
   BTW, I am working on the same for DataFusion with @zhuqi-lucas in 
https://github.com/apache/arrow-rs/issues/7456
   
   I hope we will have some major improvements to share in another week or two





Re: [I] Blog post about parquet vs custom file formats [datafusion]

2025-05-22 Thread via GitHub


Dandandan commented on issue #16149:
URL: https://github.com/apache/datafusion/issues/16149#issuecomment-2901389429

   Interestingly, Clickbench being quite a bit faster again for 1.3 
(https://github.com/ClickHouse/ClickBench/pull/376 ) seems mostly related to 
using predicate pushdown more effectively during Parquet decoding.
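
As a toy illustration of the idea (not how any of the readers mentioned here are actually implemented), the sketch below compares decoding every column and then filtering against evaluating the predicate on the filter column first and decoding the other column only for matching rows. The function names, the column names, and the "values decoded" counter are assumptions made for the example; real readers work on pages and batches rather than single rows.

```python
def decode_then_filter(filter_col, payload_col, predicate):
    """Decode both columns for every row, then apply the predicate."""
    decoded = 0
    rows = []
    for key, payload in zip(filter_col, payload_col):
        decoded += 2  # one value from each column
        if predicate(key):
            rows.append(payload)
    return rows, decoded


def filter_pushdown(filter_col, payload_col, predicate):
    """Decode the filter column first, then only the selected payload rows."""
    selection = [i for i, key in enumerate(filter_col) if predicate(key)]
    decoded = len(filter_col)  # the filter column is fully decoded
    rows = [payload_col[i] for i in selection]
    decoded += len(selection)  # payload decoded only for matching rows
    return rows, decoded


# Hypothetical data: 1000 rows, predicate keeps 1 row in 100.
event_time = list(range(1000))
url = [f"url-{i}" for i in range(1000)]


def is_selected(t):
    return t % 100 == 0


rows_all, decoded_all = decode_then_filter(event_time, url, is_selected)
rows_pushed, decoded_pushed = filter_pushdown(event_time, url, is_selected)

print(decoded_all, decoded_pushed)
```

Both paths return the same rows; the pushdown path simply touches far fewer payload values when the predicate is selective.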

