[ https://issues.apache.org/jira/browse/ARROW-13024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Roman Karlstetter updated ARROW-13024:
--------------------------------------
    Summary: [C++][Parquet] Decoding byte stream split encoded columns fails when parquet file has nulls  (was: Decoding byte stream split encoded parquet columns fails when file has nulls)

> [C++][Parquet] Decoding byte stream split encoded columns fails when parquet file has nulls
> --------------------------------------------------------------------------------------------
>
>                 Key: ARROW-13024
>                 URL: https://issues.apache.org/jira/browse/ARROW-13024
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Parquet
>    Affects Versions: 2.0.0, 3.0.0, 4.0.0
>            Reporter: Roman Karlstetter
>            Priority: Major
>
> Reading from a parquet file fails with the error {{Data size too small for number of values (corrupted file?)}}.
> This happens when a {{BYTE_STREAM_SPLIT}}-encoded column stores fewer values than the number of rows, which is the case when the column has null values (definition levels are present).
> The problematic part is the condition checked in {{ByteStreamSplitDecoder<DType>::SetData}}, which raises this error whenever the number of values does not match the size of the data array.
> I'm not sure I have enough experience with the internals of the encoding/decoding part of this implementation to fix the issue myself, but my suggestion would be to initialize {{num_values_in_buffer_}} with {{len/static_cast<int64_t>(sizeof(T))}}.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
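For illustration only (not part of the original report): a minimal, self-contained C++ sketch of the idea suggested in the ticket, i.e. deriving the stored value count from the buffer length instead of requiring it to equal {{num_values}}. It uses a simplified stand-in class, not the real {{ByteStreamSplitDecoder}}; the member name {{num_values_in_buffer_}} follows the ticket, everything else (class name, extra sanity checks) is assumed.

{code:cpp}
// Hypothetical, simplified model of the decoder's SetData logic (not the
// actual Arrow source). It illustrates why sizing the buffer by
// len / sizeof(T) is more robust than trusting num_values when a column
// contains nulls: num_values counts levels, but the buffer only holds the
// non-null values.
#include <cstdint>
#include <iostream>
#include <stdexcept>

template <typename T>
class ByteStreamSplitDecoderSketch {
 public:
  void SetData(int num_values, const uint8_t* data, int64_t len) {
    if (len % static_cast<int64_t>(sizeof(T)) != 0) {
      throw std::runtime_error("Data size not a multiple of the value width");
    }
    // Suggested fix from the ticket: size the buffer by len / sizeof(T)
    // rather than requiring len == num_values * sizeof(T), since nulls make
    // the stored value count smaller than num_values.
    num_values_in_buffer_ = len / static_cast<int64_t>(sizeof(T));
    // With nulls present the buffer holds at most num_values entries; a
    // larger buffer would still indicate corruption (a sanity check chosen
    // for this sketch, not necessarily what Arrow does).
    if (num_values_in_buffer_ > static_cast<int64_t>(num_values)) {
      throw std::runtime_error("Data size too large for number of values");
    }
    data_ = data;
    len_ = len;
  }

  int64_t num_values_in_buffer() const { return num_values_in_buffer_; }

 private:
  const uint8_t* data_ = nullptr;
  int64_t len_ = 0;
  int64_t num_values_in_buffer_ = 0;
};

int main() {
  // 10 rows, 3 of them null: only 7 floats are byte-stream-split encoded.
  constexpr int kNumRows = 10;
  constexpr int kNonNull = 7;
  uint8_t buffer[kNonNull * sizeof(float)] = {};

  ByteStreamSplitDecoderSketch<float> decoder;
  decoder.SetData(kNumRows, buffer, sizeof(buffer));
  std::cout << "values in buffer: " << decoder.num_values_in_buffer() << "\n";
  return 0;
}
{code}

With 10 rows and 7 non-null values the sketch reports 7 values in the buffer, whereas a strict {{len == num_values * sizeof(T)}} check would reject the page as corrupted.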