Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]

2025-08-20 Thread via GitHub


conbench-apache-arrow[bot] commented on PR #46972:
URL: https://github.com/apache/arrow/pull/46972#issuecomment-3208985060

   After merging your PR, Conbench analyzed the 4 benchmarking runs that have 
been run so far on merge-commit 0ecc472c7858471f17d52fd542eef9cc20390b4d.
   
   There weren't enough matching historic benchmark results to make a call on 
whether there were regressions.
   
   The [full Conbench report](https://github.com/apache/arrow/runs/48551026382) 
has more details.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]

2025-08-20 Thread via GitHub


adamreeve merged PR #46972:
URL: https://github.com/apache/arrow/pull/46972


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]

2025-08-19 Thread via GitHub


adamreeve commented on PR #46972:
URL: https://github.com/apache/arrow/pull/46972#issuecomment-3202815553

   Hi @pitrou, I wanted to give you a chance to comment on this before merging 
but I know you've been away and probably have lots to catch up on. Are you 
happy with this change?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]

2025-07-20 Thread via GitHub


adamreeve commented on PR #46972:
URL: https://github.com/apache/arrow/pull/46972#issuecomment-3094845515

   @wgtmac could you please take a look at this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]

2025-07-10 Thread via GitHub


pitrou commented on PR #46972:
URL: https://github.com/apache/arrow/pull/46972#issuecomment-3057943304

   > Perhaps we need to slightly change the meaning of 
properties_->write_batch_size() to be the maximum number of values in a batch 
to write to a ColumnWriter. Does it make sense? @adamreeve @pitrou
   
   It definitely makes sense to me. I think that's how the Rust and Java 
implementation use it.
   
   (also, shouldn't it be discussed in 
https://github.com/apache/arrow/issues/47030 ?)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]

2025-07-10 Thread via GitHub


wgtmac commented on PR #46972:
URL: https://github.com/apache/arrow/pull/46972#issuecomment-3057746671

   > Ah I think you're thinking of the 
[`write_batch_size`](https://github.com/apache/arrow/blob/0b34e6bed40d48ae44a137afd196af94d9117e3b/cpp/src/parquet/properties.h#L160)
 parameter that's used by the Arrow API. This is a number of rows and defaults 
to 1024. I used the column writer based API rather than the Arrow API though.
   
   I just realized that large `properties_->write_batch_size()` makes it 
difficult to precisely split data pages based on 
`properties_->data_pagesize()`. To implement 
https://github.com/apache/arrow/issues/47030, we have to adjust batch size to 
satisfy the new `properties_->max_rows_per_data_page()`. Perhaps we need to 
slightly change the meaning of `properties_->write_batch_size()` to be the 
maximum number of values in a batch to write to a  ColumnWriter. Does it make 
sense? @adamreeve @pitrou 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]

2025-07-08 Thread via GitHub


adamreeve commented on PR #46972:
URL: https://github.com/apache/arrow/pull/46972#issuecomment-305325

   I looked at benchmarking unencrypted int32 data with nulls and using a 
non-plain encoding (delta binary packed). Otherwise the data layout is the same 
as my previous tests. Making the decompression buffers temporary decreases the 
max RSS with the system allocator, which is a bit surprising to me. But there 
is a slight increase in RSS and time taken with mimalloc and jemalloc.
   
   |  | System allocator | mimalloc | jemalloc |
   | --- | --- | --- | --- |
   | Baseline time (s) | 10.82 | 11.02 | 10.62 |
   | Time with temp decompress buffers (s) | 10.93 | 11.53 | 10.96 |
   
   | | System allocator | mimalloc | jemalloc |
   | --- | --- | --- | --- |
   | Baseline max RSS (MB) | 1,235 | 1,047 | 891 |
   | Max RSS with temp decompress buffers (MB) | 1,065 | 1,085 | 902 |
   
   I also looked at the memory allocations with massif, and the peak heap size 
is exactly the same, and is still dominated by the decompression buffers. 
Although my comment about them being referenced by the record batches isn't 
correct, the page buffers are still held in memory by the column reader, and 
then batches of data are decoded incrementally to Arrow arrays.
   
   So I don't think there's much reason to make the decompression buffers 
temporary and performance is generally a bit better if only the decryption 
buffers are temporary. I'm going to revert this PR back to only change the 
decryption buffers.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]

2025-07-08 Thread via GitHub


pitrou commented on PR #46972:
URL: https://github.com/apache/arrow/pull/46972#issuecomment-3048825203

   Ok, I've opened https://github.com/apache/arrow/issues/47030


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]

2025-07-07 Thread via GitHub


wgtmac commented on PR #46972:
URL: https://github.com/apache/arrow/pull/46972#issuecomment-3047145349

   FTR, we have a `max_row_group_length` to control number of rows in a single 
row group.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]

2025-07-07 Thread via GitHub


adamreeve commented on PR #46972:
URL: https://github.com/apache/arrow/pull/46972#issuecomment-3046718785

   I looked a bit closer at the `write_batch_size` parameter. This doesn't 
actually control how many values are written to a page, but just how many are 
written at once before checking whether the page size has reached the 
configured page size limit and writing out the page. From that mailing list 
thread, it sounds like other implementations have a byte based limit and a 
separate row count limit, but it doesn't look like there's a row limit in the 
C++ implementation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]

2025-07-07 Thread via GitHub


pitrou commented on PR #46972:
URL: https://github.com/apache/arrow/pull/46972#issuecomment-3045298852

   There doesn't seem to be any related regression on the benchmarks.
   I've also run this PR locally on a couple Parquet files I have lying around, 
and could not see any concerning performance drop.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]

2025-07-07 Thread via GitHub


conbench-apache-arrow[bot] commented on PR #46972:
URL: https://github.com/apache/arrow/pull/46972#issuecomment-3045093242

   Thanks for your patience. Conbench analyzed the 4 benchmarking runs that 
have been run so far on PR commit 7d639fd82e329c8047f6d1fa132eedd70c783b61.
   
   There were 50 benchmark results indicating a performance regression:
   
   - Pull Request Run on `amd64-c6a-4xlarge-linux` at [2025-07-07 
07:47:30Z](https://conbench.ursa.dev/compare/runs/6f99f7191a6f4d9ba9422d81e9ff8751...4db717258147466ea790c4c607862ffa/)
 - [`ArrayArrayKernel` (C++) with params=/size:524288/inverse_null_proportion:0, source=cpp-micro, 
suite=arrow-compute-scalar-arithmetic-benchmark](https://conbench.ursa.dev/compare/benchmarks/0686a5c8d8c8798280009d3276bb717b...0686b7c4186c7f1c80003f31bbcad96a)
 - [`BM_PlainEncodingSpacedBoolean` (C++) with params=32768/1, 
source=cpp-micro, 
suite=parquet-encoding-benchmark](https://conbench.ursa.dev/compare/benchmarks/0686a5ca85da7d6280001af9453bdf8e...0686b7c65e03786a88d9734b9efe)
   - and 48 more (see the report linked below)
   
   The [full Conbench report](https://github.com/apache/arrow/runs/45476826206) 
has more details.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]

2025-07-07 Thread via GitHub


pitrou commented on PR #46972:
URL: https://github.com/apache/arrow/pull/46972#issuecomment-3044573352

   I had already started a discussion on the Parquet dev ML about this: 
https://lists.apache.org/thread/vsxmbvnx9gy5414cfo25mnwcj17h1xyp
   
   I do think we should revisit this default page size constant, even if in 
some cases other factors make it smaller.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]

2025-07-07 Thread via GitHub


adamreeve commented on PR #46972:
URL: https://github.com/apache/arrow/pull/46972#issuecomment-3044560950

   Hmm actually it looks like it shouldn't be specific to the Arrow API, I'll 
check what's happening there.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]

2025-07-07 Thread via GitHub


adamreeve commented on PR #46972:
URL: https://github.com/apache/arrow/pull/46972#issuecomment-3044548886

   Ah I think you're thinking of the `write_batch_size` parameter that's used 
by the Arrow API. This is a number of rows and defaults to 1024. I used the 
column writer based API rather than the Arrow API though.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]

2025-07-07 Thread via GitHub


adamreeve commented on PR #46972:
URL: https://github.com/apache/arrow/pull/46972#issuecomment-3044523865

   1024 * 1024 is also the default max row group size. Maybe for some integer 
or dictionary encoded data this limit can be hit before the page size?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]

2025-07-07 Thread via GitHub


pitrou commented on PR #46972:
URL: https://github.com/apache/arrow/pull/46972#issuecomment-3044352159

   @wgtmac No, I mean other parameters.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]

2025-07-07 Thread via GitHub


wgtmac commented on PR #46972:
URL: https://github.com/apache/arrow/pull/46972#issuecomment-3044211822

   @pitrou Did you mean CDC? 
https://github.com/apache/arrow/blob/3ebe7ee1828793d0a619bcd773eb4d990ccb6b3c/cpp/src/parquet/column_writer.cc#L1418


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]

2025-07-07 Thread via GitHub


adamreeve commented on PR #46972:
URL: https://github.com/apache/arrow/pull/46972#issuecomment-3044122657

   I'm using ParquetSharp but this is a wrapper of the C++ Parquet library, and 
1 MiB is the default there: 
https://github.com/apache/arrow/blob/0b34e6bed40d48ae44a137afd196af94d9117e3b/cpp/src/parquet/properties.h#L157


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]

2025-07-07 Thread via GitHub


pitrou commented on PR #46972:
URL: https://github.com/apache/arrow/pull/46972#issuecomment-3044146060

   > I'm using ParquetSharp but this is a wrapper of the C++ Parquet library, 
and 1 MiB is the default there:
   
   Hmm, I was under the impression that another factor limited the actual page 
size produced by Parquet C++, but I can't find again what it is. @wgtmac Could 
you enlighten me here?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]

2025-07-07 Thread via GitHub


pitrou commented on PR #46972:
URL: https://github.com/apache/arrow/pull/46972#issuecomment-3043908894

   By the way, how did you get or generate your test files? A 1 MiB page size 
sounds rather large.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]

2025-07-06 Thread via GitHub


ursabot commented on PR #46972:
URL: https://github.com/apache/arrow/pull/46972#issuecomment-3043666313

   Benchmark runs are scheduled for commit 
7d639fd82e329c8047f6d1fa132eedd70c783b61. Watch 
https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A 
comment will be posted here when the runs are complete.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]

2025-07-06 Thread via GitHub


pitrou commented on PR #46972:
URL: https://github.com/apache/arrow/pull/46972#issuecomment-3043666102

   @ursabot please benchmark


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]

2025-07-06 Thread via GitHub


adamreeve commented on PR #46972:
URL: https://github.com/apache/arrow/pull/46972#issuecomment-3043538730

   Right OK that makes sense. I'm testing with plain encoded float columns so 
hadn't noticed that but yes it might be a good idea to also change the 
decompression buffers then.
   
   I did some rough benchmarks with `/usr/bin/time -v` and get the following 
results for my test case (what's described in #46971 but reading all row 
groups), taking the best out of three runs:
   
   |  | System allocator | mimalloc | jemalloc |
   | --- | --- | --- | --- |
   | Baseline time (s) | 6.92 |  6.89 | 6.68 |
   | Time with temp decrypt buffers (s) | 8.23 | 6.00 | 6.42 |
   | Time with temp decrypt and decompress buffers (s) | 6.50 |  6.08 | 6.59 |
   
   
   | | System allocator | mimalloc | jemalloc |
   | --- | --- | --- | --- |
   | Baseline max RSS (MB) | 1,556 | 1,550 | 1,128 |
   | Max RSS with temp decrypt buffers (MB) | 894 | 891 | 627 |
   | Max RSS with temp decrypt and decompress buffers (MB) | 1,823 |890 | 629 |
   
   The behaviour with mimalloc and jemalloc looks good, but the results with 
the system allocator are quite concerning. The max RSS decreases if using 
temporary decryption buffers, but actually increases quite significantly when 
also using temporary decompression buffers. I'm not sure why that would be, 
maybe this causes more memory fragmentation? (C++ heap memory management is not 
something I know a lot about...). There is also a noticeable slow-down in the 
temporary decryption buffer case with the system allocator.
   
   Maybe this is acceptable, given most users will be using mimalloc?
   
   I also tested with unencrypted data:
   
   |  | System allocator | mimalloc | jemalloc |
   | --- | --- | --- | --- |
   | Baseline time (s) | 4.07 | 4.09  | 3.93 |
   | Time with temp decompress buffers (s) | 4.08 | 4.25  | 3.98 |
   
   | | System allocator | mimalloc | jemalloc |
   | --- | --- | --- | --- |
   | Baseline max RSS (MB) | 884 | 895 | 627 |
   | Max RSS with temp decompress buffers (MB) | 954 | 913 | 660 |
   
   Based on these benchmarks alone, maybe only the decryption buffers should be 
temporary. But I've only tested with plain float data. I'll look into testing 
with more data types and encodings.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]

2025-07-02 Thread via GitHub


pitrou commented on PR #46972:
URL: https://github.com/apache/arrow/pull/46972#issuecomment-3026982839

   > I was assuming the decompression buffers remain referenced by the returned 
record batches so doing the same for those wouldn't help, but I haven't 
verified that that's true.
   
   Only in the zero-copy cases, which are quite limited (fixed-width type, 
PLAIN encoding, no nulls). And even then, we probably don't always do zero-copy.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]

2025-07-02 Thread via GitHub


adamreeve commented on PR #46972:
URL: https://github.com/apache/arrow/pull/46972#issuecomment-3026961372

   I was assuming the decompression buffers remain referenced by the returned 
record batches so doing the same for those wouldn't help, but I haven't 
verified that that's true. I'll test that too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]

2025-07-02 Thread via GitHub


pitrou commented on PR #46972:
URL: https://github.com/apache/arrow/pull/46972#issuecomment-3026863669

   Do we want to do the same for the decompression buffer? It would also be 
easier to benchmark, as compressed Parquet files are much more common than 
encrypted Parquet files.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]