[ https://issues.apache.org/jira/browse/ARROW-13024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Roman Karlstetter updated ARROW-13024:
--------------------------------------
    Summary: [C++][Parquet] Decoding byte stream split encoded columns fails 
when parquet file has nulls  (was: Decoding byte stream split encoded parquet 
columns fails when file has nulls)

> [C++][Parquet] Decoding byte stream split encoded columns fails when parquet 
> file has nulls
> -------------------------------------------------------------------------------------------
>
>                 Key: ARROW-13024
>                 URL: https://issues.apache.org/jira/browse/ARROW-13024
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Parquet
>    Affects Versions: 2.0.0, 3.0.0, 4.0.0
>            Reporter: Roman Karlstetter
>            Priority: Major
>
> Reading from a parquet file fails with the following error:
> {{Data size too small for number of values (corrupted file?)}}
> This happens when a {{BYTE_STREAM_SPLIT}}-encoded column has fewer values 
> stored than the number of rows, which is the case when the column has null 
> values (definition levels are present).
> The problematic part is the condition checked in 
> {{ByteStreamSplitDecoder<DType>::SetData}}, which raises the error if the 
> number of values does not match the size of the data array.
> I'm not sure I have enough experience with the internals of the 
> encoding/decoding part of this implementation to fix this issue myself, but 
> my suggestion would be to initialize {{num_values_in_buffer_}} with 
> {{len/static_cast<int64_t>(sizeof(T))}}.
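
For context, here is a minimal standalone sketch of the suggested change. It 
is not the actual Arrow decoder: the class layout, the {{ParquetException}} 
stand-in, and the {{main()}} driver are simplified assumptions for 
illustration, while {{SetData}}, {{num_values_in_buffer_}}, the error message, 
and the {{len/static_cast<int64_t>(sizeof(T))}} expression come from the 
report.

{code:cpp}
#include <cstdint>
#include <stdexcept>

// Simplified stand-in for parquet::ParquetException (assumption, not the
// real class).
struct ParquetException : std::runtime_error {
  using std::runtime_error::runtime_error;
};

template <typename T>
class ByteStreamSplitDecoder {
 public:
  void SetData(int num_values, const std::uint8_t* data, int len) {
    data_ = data;
    len_ = len;
    // The current check (paraphrased) requires the buffer to hold exactly
    // num_values values and throws otherwise:
    //   if (num_values * static_cast<int64_t>(sizeof(T)) != len) { throw ...; }
    // With nulls, num_values counts definition levels, so the buffer
    // legitimately holds fewer than num_values values and the check fires.
    (void)num_values;
    // Suggested behavior: derive the in-buffer value count from the byte
    // length instead, keeping only a basic consistency check.
    if (len % static_cast<int>(sizeof(T)) != 0) {
      throw ParquetException(
          "Data size too small for number of values (corrupted file?)");
    }
    num_values_in_buffer_ = len / static_cast<std::int64_t>(sizeof(T));
  }

  std::int64_t num_values_in_buffer() const { return num_values_in_buffer_; }

 private:
  const std::uint8_t* data_ = nullptr;
  int len_ = 0;
  std::int64_t num_values_in_buffer_ = 0;
};

int main() {
  // A FLOAT column with 10 rows but only 7 non-null values stores
  // 7 * sizeof(float) = 28 bytes, so num_values (10) no longer matches len.
  ByteStreamSplitDecoder<float> decoder;
  std::uint8_t buffer[28] = {};
  decoder.SetData(/*num_values=*/10, buffer, /*len=*/28);  // ok with the fix
  return decoder.num_values_in_buffer() == 7 ? 0 : 1;
}
{code}

Deriving the count from {{len}} decouples the number of stored values from 
the definition-level count, which is exactly the mismatch that nulls 
introduce.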



