Adam Hooper created ARROW-6895:
----------------------------------

             Summary: parquet::arrow::ColumnReader: 
ByteArrayDictionaryRecordReader repeats returned values when calling 
`NextBatch()`
                 Key: ARROW-6895
                 URL: https://issues.apache.org/jira/browse/ARROW-6895
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
    Affects Versions: 0.15.0
         Environment: Linux 5.2.17-200.fc30.x86_64 (Docker)
            Reporter: Adam Hooper
         Attachments: bad.parquet, reset-dictionary-on-read.diff, works.parquet

For most columns, I can run a loop like:

{code:cpp}
std::unique_ptr<parquet::arrow::ColumnReader> columnReader(/*...*/);
while (nRowsRemaining > 0) {
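    // Read at most 100 rows per call to keep memory use small.
    int n = std::min(100, nRowsRemaining);
    std::shared_ptr<arrow::ChunkedArray> chunkedArray;
    auto status = columnReader->NextBatch(n, &chunkedArray);
    if (!status.ok()) break;  // error handling elided
    // ... and then use `chunkedArray`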
    nRowsRemaining -= n;
}
{code}

(The context is: "convert Parquet to CSV/JSON, with small memory footprint." 
Used in https://github.com/CJWorkbench/parquet-to-arrow)

Normally, the first {{NextBatch()}} return value looks like {{val0...val99}}; 
the second return value looks like {{val100...val199}}; and so on.

... but with a {{ByteArrayDictionaryRecordReader}}, that isn't the case. The 
first {{NextBatch()}} return value looks like {{val0...val99}}; the second 
return value looks like {{val0...val99, val100...val199}} (ChunkedArray with 
two arrays); the third return value looks like {{val0...val99, val100...val199, 
val200...val299}} (ChunkedArray with three arrays); and so on. The arrays 
returned by earlier calls are never cleared, so each call returns them again.

In sum: {{NextBatch()}} on a dictionary column reader returns the wrong values: 
every call after the first repeats all previously returned values before the 
new batch.
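
To make the symptom concrete, here is a minimal, self-contained sketch of the 
behaviour described above. It is a toy model, not Arrow code: 
{{ToyDictionaryReader}}, {{Chunk}}, {{Batch}} and {{TotalLength}} are 
hypothetical names invented for illustration, and the "reset" switch only 
reflects my guess at the general shape of a fix (clear already-returned chunks 
before reading more). The real reader may need to handle more cases.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy stand-ins, NOT the Arrow/Parquet classes: a "chunk" is a vector of
// values and a "batch" is the list of chunks handed back to the caller.
using Chunk = std::vector<std::string>;
using Batch = std::vector<Chunk>;

// Hypothetical reader. With reset_between_batches == false it mimics the
// reported symptom: chunks from earlier calls are returned again.
class ToyDictionaryReader {
 public:
  ToyDictionaryReader(std::vector<std::string> values, bool reset_between_batches)
      : values_(std::move(values)), reset_(reset_between_batches) {}

  // Reads up to n values starting at the current position.
  Batch NextBatch(std::size_t n) {
    if (reset_) {
      pending_chunks_.clear();  // drop chunks already handed to the caller
    }
    assert(pos_ <= values_.size());
    const std::size_t end = std::min(values_.size(), pos_ + n);
    const auto first = values_.begin() + static_cast<std::ptrdiff_t>(pos_);
    const auto last = values_.begin() + static_cast<std::ptrdiff_t>(end);
    pending_chunks_.emplace_back(first, last);
    pos_ = end;
    // Without the reset above, earlier chunks are still in pending_chunks_
    // and are returned together with the new one.
    return pending_chunks_;
  }

 private:
  std::vector<std::string> values_;
  bool reset_;
  std::size_t pos_ = 0;
  Batch pending_chunks_;
};

// Total number of values across all chunks of a batch.
std::size_t TotalLength(const Batch& batch) {
  std::size_t total = 0;
  for (const Chunk& chunk : batch) total += chunk.size();
  return total;
}
```

With 300 input values and a batch size of 100, the second call on the toy 
reader without the reset returns two chunks holding 200 values in total, 
starting again at the first value. With the reset enabled, the second call 
returns a single chunk of 100 values starting at the 101st value. A caller can 
use the same kind of check against the real reader: the total length of the 
returned ChunkedArray should never exceed the requested {{n}}.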

I've attached a minimal Parquet file that presents this problem with the above 
code; and I've written a patch that fixes this one case, to illustrate where 
things are wrong. I don't think I understand enough edge cases to decree that 
my patch is a correct fix.


