arouel opened a new pull request, #3488:
URL: https://github.com/apache/parquet-java/pull/3488

   ### Rationale for this change
   
   `ParquetFileReader` never closes the `ColumnChunkPageReadStore` it returns 
from `readNextRowGroup()`. When a subsequent call replaces `currentRowGroup`, 
the previous instance's `ByteBufferReleaser` is abandoned without releasing the 
compressed I/O buffers and any off-heap decompressed page buffers it holds. 
With the default `HeapByteBufferAllocator` this is masked by GC, but with a 
direct `ByteBufferAllocator` it becomes a hard native memory leak that grows 
with every row group read. `InternalParquetRecordReader` works around this by 
manually closing the `PageReadStore` before each read and in its own `close()`, 
but any direct caller of `ParquetFileReader` that does not replicate this 
pattern will leak buffers.
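
To make the failure mode concrete, here is a minimal standalone model of it. None of the classes below are parquet-java types: `CountingAllocator`, `ModelPageStore`, `UnfixedModelReader` and `LeakDemo` are hypothetical stand-ins, and a single counter takes the place of real direct buffers. It only illustrates how replacing the current row group without closing it leaves allocations outstanding, and how the caller-side close pattern avoids that.

```java
// Standalone model of the leak; none of these classes are parquet-java types.
class CountingAllocator {
    int outstanding; // buffers handed out and not yet released

    void allocate() { outstanding++; }
    void release() { outstanding--; }
}

class ModelPageStore implements AutoCloseable {
    private final CountingAllocator allocator;
    private boolean released;

    ModelPageStore(CountingAllocator allocator) {
        this.allocator = allocator;
        allocator.allocate(); // stands in for the row group's I/O and page buffers
    }

    @Override
    public void close() {
        if (!released) { // a second close must not release twice
            released = true;
            allocator.release();
        }
    }
}

class UnfixedModelReader {
    private final CountingAllocator allocator;
    private ModelPageStore currentRowGroup;

    UnfixedModelReader(CountingAllocator allocator) {
        this.allocator = allocator;
    }

    // Models the behaviour described above: the previous store is replaced, not closed.
    ModelPageStore readNextRowGroup() {
        currentRowGroup = new ModelPageStore(allocator);
        return currentRowGroup;
    }
}

class LeakDemo {
    // Reads the given number of row groups and never closes any of them.
    static int readWithoutClosing(int rowGroups) {
        CountingAllocator allocator = new CountingAllocator();
        UnfixedModelReader reader = new UnfixedModelReader(allocator);
        for (int i = 0; i < rowGroups; i++) {
            reader.readNextRowGroup();
        }
        return allocator.outstanding;
    }

    // Caller-side workaround: close the previous store before each read and once at the end.
    static int readWithCallerClose(int rowGroups) {
        CountingAllocator allocator = new CountingAllocator();
        UnfixedModelReader reader = new UnfixedModelReader(allocator);
        ModelPageStore previous = null;
        for (int i = 0; i < rowGroups; i++) {
            if (previous != null) {
                previous.close();
            }
            previous = reader.readNextRowGroup();
        }
        if (previous != null) {
            previous.close();
        }
        return allocator.outstanding;
    }
}
```

In this model the outstanding count grows by one per row group when nothing is closed, and returns to zero when the caller closes each store itself.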
   
   ### What changes are included in this PR?
   
   A private `closeCurrentRowGroup()` method is added to `ParquetFileReader` 
that null-safely closes and nulls the `currentRowGroup` field. It is called in 
`readNextRowGroup()` and `readNextFilteredRowGroup()` before assigning the new 
row group, and `currentRowGroup` is included in the 
`AutoCloseables.uncheckedClose()` chain in `close()`. This brings the buffer 
lifecycle management into `ParquetFileReader` itself so all callers benefit 
automatically.
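
The shape of the change can be sketched with a small standalone model. This is not the actual diff: `TrackedRowGroup` and `FixedModelReader` are hypothetical names, and the model's `close()` simply calls `closeCurrentRowGroup()` rather than going through an `AutoCloseables.uncheckedClose()` chain. It is only meant to show the null-safe close-then-null step running before each reassignment and again on reader close.

```java
// Standalone model of the reader-side fix; not the real ParquetFileReader code.
class TrackedRowGroup implements AutoCloseable {
    boolean closed;
    int closeCalls;

    @Override
    public void close() {
        closeCalls++;
        closed = true;
    }
}

class FixedModelReader implements AutoCloseable {
    private TrackedRowGroup currentRowGroup;

    TrackedRowGroup readNextRowGroup() {
        closeCurrentRowGroup(); // release the previous row group before replacing it
        currentRowGroup = new TrackedRowGroup();
        return currentRowGroup;
    }

    // Null-safe: does nothing if no row group is currently held.
    private void closeCurrentRowGroup() {
        if (currentRowGroup != null) {
            currentRowGroup.close();
            currentRowGroup = null;
        }
    }

    @Override
    public void close() {
        closeCurrentRowGroup(); // the last row group is released with the reader
    }
}
```

Because the field is nulled after closing, calling `close()` on the model reader a second time does not reach the row group again.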
   
   ### Are these changes tested?
   
   The existing test suites in parquet-hadoop continue to pass. Additional 
tests were added to verify that `PageReadStore` buffers are properly released.
   
   ### Are there any user-facing changes?
   
   No API changes. Callers that already close the `PageReadStore` themselves 
(like `InternalParquetRecordReader`) will see a harmless double-close since 
`ColumnChunkPageReadStore.close()` is idempotent via `ByteBufferReleaser`. 
Callers that did not close the `PageReadStore` will now have their buffers 
released automatically, reducing memory usage.
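
Why a double close can be harmless is easy to show with a toy model. The sketch below assumes one possible way to get idempotence, namely a releaser that forgets its buffers once it has released them; `OnceReleaser` and `DoubleCloseStore` are hypothetical names, and this is not a description of how `ByteBufferReleaser` is implemented.

```java
// Toy model of an idempotent close; not the real ByteBufferReleaser.
class OnceReleaser {
    private int pending; // buffers tracked and still waiting to be released
    int released;        // total buffers released so far

    void track() {
        pending++;
    }

    // Releases everything tracked so far; a repeated call finds nothing left to do.
    void close() {
        released += pending;
        pending = 0;
    }
}

class DoubleCloseStore implements AutoCloseable {
    final OnceReleaser releaser = new OnceReleaser();

    DoubleCloseStore(int buffers) {
        for (int i = 0; i < buffers; i++) {
            releaser.track();
        }
    }

    @Override
    public void close() {
        releaser.close();
    }
}
```

In this model, a caller closing the store and the reader closing it again still releases each buffer exactly once.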
   
   Closes #3487
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

