gianm commented on code in PR #1270:
URL: https://github.com/apache/parquet-mr/pull/1270#discussion_r1544751640
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/NonBlockedCompressor.java:
##########
@@ -100,7 +102,14 @@ public synchronized void setInput(byte[] buffer, int off, int len) {
         !outputBuffer.hasRemaining(), "Output buffer should be empty. Caller must call compress()");
     if (inputBuffer.capacity() - inputBuffer.position() < len) {
-      ByteBuffer tmp = ByteBuffer.allocateDirect(inputBuffer.position() + len);
+      final int newBufferSize;
+      if (inputBuffer.capacity() == 0) {
+        newBufferSize = Math.max(INITIAL_INPUT_BUFFER_SIZE, len);
+      } else {
+        newBufferSize = Math.max(inputBuffer.position() + len, inputBuffer.capacity() * 2);
Review Comment:
Some analysis:

With doubling, peak memory usage can exceed what is really required by up to
about 1x (if we really needed `x` and started with `x - 1`, we would double to
`2x - 2`, leaving `x - 2` of unnecessary memory allocated).

If the target size is 64MB (the abnormally large size that I encountered in
the wild), starting at 4KB and doubling gets us there in 14 iterations,
allocating and deallocating about 134MB of memory in total.

We could cap each growth step at 1MB, so peak memory usage would be at most
1MB more than what is really required. If we start at 4KB, double up to 1MB,
and then grow in 1MB chunks, we get there in 71 iterations, allocating and
deallocating about 2GB of memory in total.

We could also use `* 1.2` instead of `* 2`, which would keep peak memory usage
to at most 20% more than what is really required. Starting at 4KB and growing
by 20% per allocation gets us there in 53 iterations, allocating and
deallocating about 380MB of memory in total.

Perhaps 20% growth is a good balance: it still reaches the target quickly
compared to a 1MB cap, and peak memory usage is at most 20% higher than what
is really needed. Please let me know what you think.
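
For anyone who wants to check the arithmetic, here is a rough sketch of the
kind of simulation behind these figures. `GrowthSim` and `simulate` are
hypothetical names, not part of parquet-mr. It assumes a 4KB (4096-byte)
starting buffer, reads 64MB and 1MB as binary units, and counts the initial
buffer in the running total. It only models the growth policy, not the actual
`setInput` code path.

```java
import java.util.function.LongUnaryOperator;

/** Hypothetical helper (not part of parquet-mr) that simulates buffer growth policies. */
public class GrowthSim {

  /**
   * Grows a capacity from start until it reaches target.
   * Returns {reallocations, total bytes allocated including the initial buffer, final capacity}.
   */
  public static long[] simulate(long start, long target, LongUnaryOperator next) {
    long capacity = start;
    long steps = 0;
    long total = start;
    while (capacity < target) {
      capacity = next.applyAsLong(capacity);
      steps++;
      total += capacity;
    }
    return new long[] {steps, total, capacity};
  }

  static void report(String name, long[] r) {
    System.out.printf("%-16s steps=%d total=%d bytes final=%d bytes%n", name, r[0], r[1], r[2]);
  }

  public static void main(String[] args) {
    long start = 4L * 1024;          // assumed 4KB initial buffer
    long target = 64L * 1024 * 1024; // 64MB, read here as 64 MiB (assumption)
    long cap = 1024L * 1024;         // 1MB cap on each growth step

    // Policy 1: double every time.
    report("double", simulate(start, target, c -> c * 2));
    // Policy 2: double, but never grow by more than 1MB in one step.
    report("double, 1MB cap", simulate(start, target, c -> Math.min(c * 2, c + cap)));
    // Policy 3: grow by 20% per step, using integer arithmetic.
    report("grow 20%", simulate(start, target, c -> c + Math.max(1, c / 5)));
  }
}
```

Under these assumptions the doubling and 1MB-cap policies come out at 14 and
71 iterations. The count and total for the 20% policy are sensitive to
rounding and to whether 64MB is read as a decimal or binary unit, so they may
differ slightly from the figures quoted above.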