[ https://issues.apache.org/jira/browse/ARROW-6895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adam Hooper updated ARROW-6895:
-------------------------------
    Attachment: 01-fix-arrow-6895.diff

> [C++][Parquet] parquet::arrow::ColumnReader: ByteArrayDictionaryRecordReader
> repeats returned values when calling `NextBatch()`
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-6895
>                 URL: https://issues.apache.org/jira/browse/ARROW-6895
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 0.15.0
>         Environment: Linux 5.2.17-200.fc30.x86_64 (Docker)
>            Reporter: Adam Hooper
>            Assignee: Francois Saint-Jacques
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.16.0
>
>         Attachments: 01-fix-arrow-6895.diff, bad.parquet,
> reset-dictionary-on-read.diff, works.parquet
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> For most columns, I can run a loop like:
> {code:cpp}
> std::unique_ptr<parquet::arrow::ColumnReader> columnReader(/*...*/);
> while (nRowsRemaining > 0) {
>   int n = std::min(100, nRowsRemaining);
>   std::shared_ptr<arrow::ChunkedArray> chunkedArray;
>   auto status = columnReader->NextBatch(n, &chunkedArray);
>   // ... and then use `chunkedArray`
>   nRowsRemaining -= n;
> }
> {code}
> (The context is: "convert Parquet to CSV/JSON, with small memory footprint."
> Used in https://github.com/CJWorkbench/parquet-to-arrow)
> Normally, the first {{NextBatch()}} return value looks like {{val0...val99}};
> the second return value looks like {{val100...val199}}; and so on.
> ... but with a {{ByteArrayDictionaryRecordReader}}, that isn't the case. The
> first {{NextBatch()}} return value looks like {{val0...val99}}; the second
> return value looks like {{val0...val99, val100...val199}} (ChunkedArray with
> two arrays); the third return value looks like {{val0...val99,
> val100...val199, val200...val299}} (ChunkedArray with three arrays); and so
> on. The returned arrays are never cleared.
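
The accumulation described above can be illustrated with a small toy model. This is only a sketch and not the Arrow implementation or the attached patch: `ToyDictionaryReader`, `Chunk`, `ChunkedValues`, and the `reset_after_read` flag are hypothetical names invented for this example. It assumes the reader keeps an internal list of finished chunks and, in the buggy mode, never clears that list after handing it back to the caller.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins, not Arrow types: a Chunk is one batch of values,
// and ChunkedValues plays the role of an arrow::ChunkedArray.
using Chunk = std::vector<std::string>;
using ChunkedValues = std::vector<Chunk>;

class ToyDictionaryReader {
 public:
  // reset_after_read == false models the reported behavior;
  // reset_after_read == true models a reader that clears its chunks.
  ToyDictionaryReader(std::vector<std::string> values, bool reset_after_read)
      : values_(std::move(values)), reset_after_read_(reset_after_read) {}

  // Reads up to n further values and returns every chunk currently held.
  ChunkedValues NextBatch(std::size_t n) {
    assert(n > 0);
    const std::size_t end = std::min(pos_ + n, values_.size());
    const auto first = values_.begin() + static_cast<std::ptrdiff_t>(pos_);
    const auto last = values_.begin() + static_cast<std::ptrdiff_t>(end);
    pending_.emplace_back(first, last);  // the newly read batch
    pos_ = end;

    ChunkedValues result = pending_;
    if (reset_after_read_) {
      pending_.clear();  // without this, earlier batches are returned again
    }
    return result;
  }

 private:
  std::vector<std::string> values_;
  bool reset_after_read_;
  std::size_t pos_ = 0;
  ChunkedValues pending_;
};
```

With `reset_after_read` set to `false`, three successive `NextBatch(100)` calls on 300 values return one, two, and three chunks, which matches the pattern in the report. With it set to `true`, each call returns a single chunk holding only the new values.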
> In sum: {{NextBatch()}} on a dictionary column reader returns the wrong
> values.
> I've attached a minimal Parquet file that presents this problem with the
> above code; and I've written a patch that fixes this one case, to illustrate
> where things are wrong. I don't think I understand enough edge cases to
> decree that my patch is a correct fix.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)