Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]
conbench-apache-arrow[bot] commented on PR #46972: URL: https://github.com/apache/arrow/pull/46972#issuecomment-3208985060 After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 0ecc472c7858471f17d52fd542eef9cc20390b4d. There weren't enough matching historic benchmark results to make a call on whether there were regressions. The [full Conbench report](https://github.com/apache/arrow/runs/48551026382) has more details. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]
adamreeve merged PR #46972: URL: https://github.com/apache/arrow/pull/46972 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]
adamreeve commented on PR #46972: URL: https://github.com/apache/arrow/pull/46972#issuecomment-3202815553 Hi @pitrou, I wanted to give you a chance to comment on this before merging but I know you've been away and probably have lots to catch up on. Are you happy with this change? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]
adamreeve commented on PR #46972: URL: https://github.com/apache/arrow/pull/46972#issuecomment-3094845515 @wgtmac could you please take a look at this PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]
pitrou commented on PR #46972: URL: https://github.com/apache/arrow/pull/46972#issuecomment-3057943304 > Perhaps we need to slightly change the meaning of properties_->write_batch_size() to be the maximum number of values in a batch to write to a ColumnWriter. Does it make sense? @adamreeve @pitrou It definitely makes sense to me. I think that's how the Rust and Java implementation use it. (also, shouldn't it be discussed in https://github.com/apache/arrow/issues/47030 ?) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]
wgtmac commented on PR #46972: URL: https://github.com/apache/arrow/pull/46972#issuecomment-3057746671 > Ah I think you're thinking of the [`write_batch_size`](https://github.com/apache/arrow/blob/0b34e6bed40d48ae44a137afd196af94d9117e3b/cpp/src/parquet/properties.h#L160) parameter that's used by the Arrow API. This is a number of rows and defaults to 1024. I used the column writer based API rather than the Arrow API though. I just realized that large `properties_->write_batch_size()` makes it difficult to precisely split data pages based on `properties_->data_pagesize()`. To implement https://github.com/apache/arrow/issues/47030, we have to adjust batch size to satisfy the new `properties_->max_rows_per_data_page()`. Perhaps we need to slightly change the meaning of `properties_->write_batch_size()` to be the maximum number of values in a batch to write to a ColumnWriter. Does it make sense? @adamreeve @pitrou -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]
adamreeve commented on PR #46972: URL: https://github.com/apache/arrow/pull/46972#issuecomment-305325 I looked at benchmarking unencrypted int32 data with nulls and using a non-plain encoding (delta binary packed). Otherwise the data layout is the same as my previous tests. Making the decompression buffers temporary decreases the max RSS with the system allocator, which is a bit surprising to me. But there is a slight increase in RSS and time taken with mimalloc and jemalloc. | | System allocator | mimalloc | jemalloc | | --- | --- | --- | --- | | Baseline time (s) | 10.82 | 11.02 | 10.62 | | Time with temp decompress buffers (s) | 10.93 | 11.53 | 10.96 | | | System allocator | mimalloc | jemalloc | | --- | --- | --- | --- | | Baseline max RSS (MB) | 1,235 | 1,047 | 891 | | Max RSS with temp decompress buffers (MB) | 1,065 | 1,085 | 902 | I also looked at the memory allocations with massif, and the peak heap size is exactly the same, and is still dominated by the decompression buffers. Although my comment about them being referenced by the record batches isn't correct, the page buffers are still held in memory by the column reader, and then batches of data are decoded incrementally to Arrow arrays. So I don't think there's much reason to make the decompression buffers temporary and performance is generally a bit better if only the decryption buffers are temporary. I'm going to revert this PR back to only change the decryption buffers. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]
pitrou commented on PR #46972: URL: https://github.com/apache/arrow/pull/46972#issuecomment-3048825203 Ok, I've opened https://github.com/apache/arrow/issues/47030 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]
wgtmac commented on PR #46972: URL: https://github.com/apache/arrow/pull/46972#issuecomment-3047145349 FTR, we have a `max_row_group_length` to control number of rows in a single row group. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]
adamreeve commented on PR #46972: URL: https://github.com/apache/arrow/pull/46972#issuecomment-3046718785 I looked a bit closer at the `write_batch_size` parameter. This doesn't actually control how many values are written to a page, but just how many are written at once before checking whether the page size has reached the configured page size limit and writing out the page. From that mailing list thread, it sounds like other implementations have a byte based limit and a separate row count limit, but it doesn't look like there's a row limit in the C++ implementation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]
pitrou commented on PR #46972: URL: https://github.com/apache/arrow/pull/46972#issuecomment-3045298852 There doesn't seem to be any related regression on the benchmarks. I've also run this PR locally on a couple Parquet files I have lying around, and could not see any concerning performance drop. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]
conbench-apache-arrow[bot] commented on PR #46972: URL: https://github.com/apache/arrow/pull/46972#issuecomment-3045093242 Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit 7d639fd82e329c8047f6d1fa132eedd70c783b61. There were 50 benchmark results indicating a performance regression: - Pull Request Run on `amd64-c6a-4xlarge-linux` at [2025-07-07 07:47:30Z](https://conbench.ursa.dev/compare/runs/6f99f7191a6f4d9ba9422d81e9ff8751...4db717258147466ea790c4c607862ffa/) - [`ArrayArrayKernel` (C++) with params=/size:524288/inverse_null_proportion:0, source=cpp-micro, suite=arrow-compute-scalar-arithmetic-benchmark](https://conbench.ursa.dev/compare/benchmarks/0686a5c8d8c8798280009d3276bb717b...0686b7c4186c7f1c80003f31bbcad96a) - [`BM_PlainEncodingSpacedBoolean` (C++) with params=32768/1, source=cpp-micro, suite=parquet-encoding-benchmark](https://conbench.ursa.dev/compare/benchmarks/0686a5ca85da7d6280001af9453bdf8e...0686b7c65e03786a88d9734b9efe) - and 48 more (see the report linked below) The [full Conbench report](https://github.com/apache/arrow/runs/45476826206) has more details. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]
pitrou commented on PR #46972: URL: https://github.com/apache/arrow/pull/46972#issuecomment-3044573352 I had already started a discussion on the Parquet dev ML about this: https://lists.apache.org/thread/vsxmbvnx9gy5414cfo25mnwcj17h1xyp I do think we should revisit this default page size constant, even if in some cases other factors make it smaller. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]
adamreeve commented on PR #46972: URL: https://github.com/apache/arrow/pull/46972#issuecomment-3044560950 Hmm actually it looks like it shouldn't be specific to the Arrow API, I'll check what's happening there. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]
adamreeve commented on PR #46972: URL: https://github.com/apache/arrow/pull/46972#issuecomment-3044548886 Ah I think you're thinking of the `write_batch_size` parameter that's used by the Arrow API. This is a number of rows and defaults to 1024. I used the column writer based API rather than the Arrow API though. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]
adamreeve commented on PR #46972: URL: https://github.com/apache/arrow/pull/46972#issuecomment-3044523865 1024 * 1024 is also the default max row group size. Maybe for some integer or dictionary encoded data this limit can be hit before the page size? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]
pitrou commented on PR #46972: URL: https://github.com/apache/arrow/pull/46972#issuecomment-3044352159 @wgtmac No, I mean other parameters. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]
wgtmac commented on PR #46972: URL: https://github.com/apache/arrow/pull/46972#issuecomment-3044211822 @pitrou Did you mean CDC? https://github.com/apache/arrow/blob/3ebe7ee1828793d0a619bcd773eb4d990ccb6b3c/cpp/src/parquet/column_writer.cc#L1418 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]
adamreeve commented on PR #46972: URL: https://github.com/apache/arrow/pull/46972#issuecomment-3044122657 I'm using ParquetSharp but this is a wrapper of the C++ Parquet library, and 1 MiB is the default there: https://github.com/apache/arrow/blob/0b34e6bed40d48ae44a137afd196af94d9117e3b/cpp/src/parquet/properties.h#L157 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]
pitrou commented on PR #46972: URL: https://github.com/apache/arrow/pull/46972#issuecomment-3044146060 > I'm using ParquetSharp but this is a wrapper of the C++ Parquet library, and 1 MiB is the default there: Hmm, I was under the impression that another factor limited the actual page size produced by Parquet C++, but I can't find again what it is. @wgtmac Could you enlighten me here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]
pitrou commented on PR #46972: URL: https://github.com/apache/arrow/pull/46972#issuecomment-3043908894 By the way, how did you get or generate your test files? A 1 MiB page size sounds rather large. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]
ursabot commented on PR #46972: URL: https://github.com/apache/arrow/pull/46972#issuecomment-3043666313 Benchmark runs are scheduled for commit 7d639fd82e329c8047f6d1fa132eedd70c783b61. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]
pitrou commented on PR #46972: URL: https://github.com/apache/arrow/pull/46972#issuecomment-3043666102 @ursabot please benchmark -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]
adamreeve commented on PR #46972: URL: https://github.com/apache/arrow/pull/46972#issuecomment-3043538730 Right OK that makes sense. I'm testing with plain encoded float columns so hadn't noticed that but yes it might be a good idea to also change the decompression buffers then. I did some rough benchmarks with `/usr/bin/time -v` and get the following results for my test case (what's described in #46971 but reading all row groups), taking the best out of three runs: | | System allocator | mimalloc | jemalloc | | --- | --- | --- | --- | | Baseline time (s) | 6.92 | 6.89 | 6.68 | | Time with temp decrypt buffers (s) | 8.23 | 6.00 | 6.42 | | Time with temp decrypt and decompress buffers (s) | 6.50 | 6.08 | 6.59 | | | System allocator | mimalloc | jemalloc | | --- | --- | --- | --- | | Baseline max RSS (MB) | 1,556 | 1,550 | 1,128 | | Max RSS with temp decrypt buffers (MB) | 894 | 891 | 627 | | Max RSS with temp decrypt and decompress buffers (MB) | 1,823 |890 | 629 | The behaviour with mimalloc and jemalloc looks good, but the results with the system allocator are quite concerning. The max RSS decreases if using temporary decryption buffers, but actually increases quite significantly when also using temporary decompression buffers. I'm not sure why that would be, maybe this causes more memory fragmentation? (C++ heap memory management is not something I know a lot about...). There is also a noticeable slow-down in the temporary decryption buffer case with the system allocator. Maybe this is acceptable, given most users will be using mimalloc? I also tested with unencrypted data: | | System allocator | mimalloc | jemalloc | | --- | --- | --- | --- | | Baseline time (s) | 4.07 | 4.09 | 3.93 | | Time with temp decompress buffers (s) | 4.08 | 4.25 | 3.98 | | | System allocator | mimalloc | jemalloc | | --- | --- | --- | --- | | Baseline max RSS (MB) | 884 | 895 | 627 | | Max RSS with temp decompress buffers (MB) | 954 | 913 | 660 | Based on these benchmarks alone, maybe only the decryption buffers should be temporary. But I've only tested with plain float data. I'll look into testing with more data types and encodings. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]
pitrou commented on PR #46972: URL: https://github.com/apache/arrow/pull/46972#issuecomment-3026982839 > I was assuming the decompression buffers remain referenced by the returned record batches so doing the same for those wouldn't help, but I haven't verified that that's true. Only in the zero-copy cases, which are quite limited (fixed-width type, PLAIN encoding, no nulls). And even then, we probably don't always do zero-copy. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]
adamreeve commented on PR #46972: URL: https://github.com/apache/arrow/pull/46972#issuecomment-3026961372 I was assuming the decompression buffers remain referenced by the returned record batches so doing the same for those wouldn't help, but I haven't verified that that's true. I'll test that too. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46971: [C++][Parquet] Use temporary buffers when decrypting Parquet data pages [arrow]
pitrou commented on PR #46972: URL: https://github.com/apache/arrow/pull/46972#issuecomment-3026863669 Do we want to do the same for the decompression buffer? It would also be easier to benchmark, as compressed Parquet files are much more common than encrypted Parquet files. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
