[ https://issues.apache.org/jira/browse/ARROW-6895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Hooper reopened ARROW-6895:
--------------------------------

The code snippet given in the bug description still fails to read the 
{{bad.parquet}} file I uploaded.

Perhaps there were two bugs, and only one has been fixed? Please advise whether 
I should leave this bug open (since the supplied code and file still aren't 
read correctly) or open a new issue (since the GitHub patch does address the 
title of this bug).

By my reading, GitHub pull request #6206 did not add a test case for dictionary 
delta batches, such as the ones {{bad.parquet}} produces. The spec suggests the 
{{isDelta}} flag should prevent the dictionary from being cleared between 
column chunks: 
https://arrow.apache.org/docs/format/Columnar.html#dictionary-messages. So as I 
understand it, the reader must not reset its dictionary builder until it 
receives a dictionary batch for which {{isDelta == false}}.
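
To make that rule concrete, here is a minimal sketch of how I read the spec. 
The struct and function are hypothetical illustrations of the dictionary-
replacement semantics, not Arrow's actual reader internals:

{code:cpp}
#include <string>
#include <vector>

// Hypothetical simplified view of a dictionary batch, per the spec.
struct DictionaryBatch {
  bool isDelta;
  std::vector<std::string> values;
};

void OnDictionaryBatch(const DictionaryBatch& batch,
                       std::vector<std::string>* dictionary) {
  // Only a non-delta batch replaces the dictionary wholesale.
  if (!batch.isDelta) {
    dictionary->clear();
  }
  // A delta batch appends; indices into earlier entries stay valid,
  // so the reader must not reset its dictionary builder here.
  dictionary->insert(dictionary->end(),
                     batch.values.begin(), batch.values.end());
}
{code}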

> [C++][Parquet] parquet::arrow::ColumnReader: ByteArrayDictionaryRecordReader 
> repeats returned values when calling `NextBatch()`
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-6895
>                 URL: https://issues.apache.org/jira/browse/ARROW-6895
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 0.15.0
>         Environment: Linux 5.2.17-200.fc30.x86_64 (Docker)
>            Reporter: Adam Hooper
>            Assignee: Francois Saint-Jacques
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.16.0
>
>         Attachments: bad.parquet, reset-dictionary-on-read.diff, works.parquet
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> For most columns, I can run a loop like this:
> {code:cpp}
> #include <algorithm>  // std::min
> #include <memory>
>
> #include <arrow/table.h>           // arrow::ChunkedArray (in 0.15)
> #include <parquet/arrow/reader.h>  // parquet::arrow::ColumnReader
>
> std::unique_ptr<parquet::arrow::ColumnReader> columnReader(/*...*/);
> while (nRowsRemaining > 0) {
>     int n = std::min(100, nRowsRemaining);
>     std::shared_ptr<arrow::ChunkedArray> chunkedArray;
>     auto status = columnReader->NextBatch(n, &chunkedArray);
>     // ... and then use `chunkedArray`
>     nRowsRemaining -= n;
> }
> {code}
> (The context is: "convert Parquet to CSV/JSON, with small memory footprint." 
> Used in https://github.com/CJWorkbench/parquet-to-arrow)
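> For completeness, here is roughly how {{columnReader}} is obtained (a sketch 
> from memory using the 0.15-era out-parameter APIs; error handling and the 
> column index are illustrative, not exact code from my program):
> {code:cpp}
> #include <arrow/io/file.h>
> #include <arrow/memory_pool.h>
> #include <parquet/arrow/reader.h>
>
> std::shared_ptr<arrow::io::ReadableFile> file;
> arrow::io::ReadableFile::Open("bad.parquet", &file);  // check the Status in real code
>
> std::unique_ptr<parquet::arrow::FileReader> fileReader;
> parquet::arrow::OpenFile(file, arrow::default_memory_pool(), &fileReader);
>
> std::unique_ptr<parquet::arrow::ColumnReader> columnReader;
> fileReader->GetColumn(0, &columnReader);  // column 0 is just an example
> {code}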
> Normally, the first {{NextBatch()}} return value looks like {{val0...val99}}; 
> the second return value looks like {{val100...val199}}; and so on.
> ... but with a {{ByteArrayDictionaryRecordReader}}, that isn't the case. The 
> first {{NextBatch()}} return value looks like {{val0...val99}}; the second 
> looks like {{val0...val99, val100...val199}} (a ChunkedArray with two 
> arrays); the third looks like {{val0...val99, val100...val199, 
> val200...val299}} (a ChunkedArray with three arrays); and so on. The arrays 
> returned by earlier calls are never cleared.
> In sum: {{NextBatch()}} on a dictionary column reader returns the wrong 
> values.
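> To make "wrong values" concrete: a hypothetical check inside the loop above 
> (requires {{<cassert>}}) passes on the first call and fails afterwards:
> {code:cpp}
> // NextBatch(n, ...) should yield exactly n rows per call; with a
> // dictionary column the result instead accumulates all prior chunks.
> assert(chunkedArray->length() == n);  // fails from the second call onward
> {code}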
> I've attached a minimal Parquet file that reproduces this problem with the 
> above code, and I've written a patch that fixes this one case, to illustrate 
> where things go wrong. I don't understand enough of the edge cases to say 
> whether my patch is a correct general fix.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)