Re: [PR] feat(parquet): utilize memory allocator in `serializedPageReader` [arrow-go]

via GitHub Sat, 30 Aug 2025 21:21:03 -0700


joechenrh commented on code in PR #485:
URL: https://github.com/apache/arrow-go/pull/485#discussion_r2312235038



##########
parquet/internal/encoding/typed_encoder.go:
##########
@@ -556,6 +572,36 @@ func (enc *DictByteArrayEncoder) Type() parquet.Type {
        return parquet.Types.ByteArray
 }
 
+// ByteArrayDecoderWrapper is a wrapper around a ByteArrayDecoder that ensures
+// that the decoded byte arrays are copied into a new, contiguous buffer.
+type ByteArrayDecoderWrapper struct {
+       ByteArrayDecoder
+}
+
+func (d *ByteArrayDecoderWrapper) Decode(out []parquet.ByteArray) (int, error) {
+       n, err := d.ByteArrayDecoder.Decode(out)
+       if err != nil {
+               return n, err
+       }
+       cloneByteArray(out[:n])
+       return n, nil

Review Comment:
   After inspecting the code again, I think the problem arises from the
`ColumnChunkReader.ReadBatch` function itself, which the test uses to read
data. Let me revert the change and try to fix this function separately.
   
   Other places that read data from a column chunk, like `recordReaderImpl`,
seem to have special logic for `ByteArray` and `FixedLenByteArray`: they
copy the value from the decoder into a new, separate buffer. In the code below,
`bldr` itself holds a buffer allocated from the allocator.
   
   
https://github.com/apache/arrow-go/blob/c6ce2ef4e55009a786cf04b3845eba5170c98066/parquet/file/record_reader.go#L842-L846
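To make the aliasing concern concrete, here is a minimal standalone sketch. It assumes decoded values are slices into the page buffer and that the buffer may be reused once the page is released. The names `byteArray`, `decodeAliased`, and `copyValues` are hypothetical stand-ins, not arrow-go APIs.

```go
package main

import "fmt"

// byteArray is a hypothetical stand-in for parquet.ByteArray (a []byte).
type byteArray []byte

// decodeAliased mimics a decoder whose output values point directly
// into the page buffer instead of owning their bytes (an assumption
// made for this sketch).
func decodeAliased(page []byte, lens []int) []byteArray {
	out := make([]byteArray, 0, len(lens))
	off := 0
	for _, n := range lens {
		out = append(out, byteArray(page[off:off+n]))
		off += n
	}
	return out
}

// copyValues copies every value into one new contiguous buffer, so the
// results stay valid after the page buffer is reused.
func copyValues(vals []byteArray) []byteArray {
	total := 0
	for _, v := range vals {
		total += len(v)
	}
	// Capacity is reserved up front, so append never reallocates and
	// all returned slices share the same backing array.
	buf := make([]byte, 0, total)
	out := make([]byteArray, len(vals))
	for i, v := range vals {
		start := len(buf)
		buf = append(buf, v...)
		out[i] = byteArray(buf[start:len(buf):len(buf)])
	}
	return out
}

func main() {
	page := []byte("foobar")
	aliased := decodeAliased(page, []int{3, 3})
	copied := copyValues(aliased)

	// Simulate the allocator handing the same memory to the next page.
	copy(page, "XXXXXX")

	fmt.Println("aliased:", string(aliased[0]), string(aliased[1]))
	fmt.Println("copied: ", string(copied[0]), string(copied[1]))
}
```

After the buffer is overwritten, the aliased values read `XXX` while the copied ones still read `foo` and `bar`. That is why a reader that returns page memory to an allocator would need to copy values out before releasing the page.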



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
