Adam Hooper created ARROW-6895:
----------------------------------

             Summary: parquet::arrow::ColumnReader: ByteArrayDictionaryRecordReader repeats returned values when calling `NextBatch()`
                 Key: ARROW-6895
                 URL: https://issues.apache.org/jira/browse/ARROW-6895
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
    Affects Versions: 0.15.0
         Environment: Linux 5.2.17-200.fc30.x86_64 (Docker)
            Reporter: Adam Hooper
         Attachments: bad.parquet, reset-dictionary-on-read.diff, works.parquet
For most columns, I can run a loop like:

{code:cpp}
std::unique_ptr<parquet::arrow::ColumnReader> columnReader(/*...*/);
while (nRowsRemaining > 0) {
  int n = std::min(100, nRowsRemaining);
  std::shared_ptr<arrow::ChunkedArray> chunkedArray;
  auto status = columnReader->NextBatch(n, &chunkedArray);
  // ... check status, then use `chunkedArray`
  nRowsRemaining -= n;
}
{code}

(The context is: "convert Parquet to CSV/JSON, with small memory footprint." Used in https://github.com/CJWorkbench/parquet-to-arrow)

Normally, the first {{NextBatch()}} return value looks like {{val0...val99}}; the second return value looks like {{val100...val199}}; and so on.

... but with a {{ByteArrayDictionaryRecordReader}}, that isn't the case. The first {{NextBatch()}} return value looks like {{val0...val99}}; the second return value looks like {{val0...val99, val100...val199}} (a ChunkedArray with two arrays); the third return value looks like {{val0...val99, val100...val199, val200...val299}} (a ChunkedArray with three arrays); and so on. The returned arrays are never cleared between calls.

In sum: {{NextBatch()}} on a dictionary column reader returns the wrong values.

I've attached a minimal Parquet file that reproduces this problem with the above code, and I've written a patch that fixes this one case, to illustrate where things go wrong. I don't think I understand enough edge cases to decree that my patch is a correct fix.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
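To make the failure mode concrete, here is a hypothetical, simplified sketch of the bug pattern described above. It is not actual Arrow code; {{ToyDictionaryReader}} and its methods are invented for illustration. The reader appends each batch to an internal chunk list, and because that list is never cleared, every call re-returns all earlier chunks:

{code:cpp}
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in for a dictionary record reader. It models the
// reported symptom: an internal chunk buffer that grows across calls.
class ToyDictionaryReader {
 public:
  // Buggy variant: `chunks_` is appended to but never cleared, so the
  // first call returns 1 chunk, the second returns 2, and so on.
  std::vector<std::vector<int>> NextBatch(const std::vector<int>& values) {
    chunks_.push_back(values);  // accumulate the new chunk
    return chunks_;             // BUG: also returns every earlier chunk
  }

  // Fixed variant: move the accumulated chunks out and reset the
  // internal buffer, so each call returns only its own batch.
  std::vector<std::vector<int>> NextBatchFixed(const std::vector<int>& values) {
    chunks_.push_back(values);
    std::vector<std::vector<int>> out = std::move(chunks_);
    chunks_.clear();            // reset state between batches
    return out;
  }

 private:
  std::vector<std::vector<int>> chunks_;
};
{code}

With the buggy variant, two successive calls return 1 then 2 chunks (the first batch is repeated); with the fixed variant, each call returns exactly 1 chunk, matching the behavior of non-dictionary readers.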