[ https://issues.apache.org/jira/browse/PARQUET-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155675#comment-17155675 ]
Eric Gorelik commented on PARQUET-1882:
---------------------------------------

Here's a minimal one.

{code:c++}
#include <arrow/io/api.h>
#include <parquet/api/writer.h>
#include <parquet/api/reader.h>

using namespace parquet;
using namespace parquet::schema;

int main() {
  auto primitiveNode = PrimitiveNode::Make("nulls", Repetition::OPTIONAL, nullptr, Type::INT32);
  NodeVector columns({ primitiveNode });
  auto rootNode = GroupNode::Make("root", Repetition::REQUIRED, columns, nullptr);

  std::shared_ptr<arrow::io::OutputStream> fileOut;
  arrow::io::FileOutputStream::Open("test.parquet", &fileOut);

  auto fileWriter = ParquetFileWriter::Open(fileOut, std::static_pointer_cast<GroupNode>(rootNode));
  auto rowGroupWriter = fileWriter->AppendRowGroup();
  auto columnWriter = static_cast<Int32Writer*>(rowGroupWriter->NextColumn());

  int32_t values[3];
  int16_t defLevels[] = { 0, 0, 0 };
  columnWriter->WriteBatch(3, defLevels, nullptr, values);

  columnWriter->Close();
  rowGroupWriter->Close();
  fileWriter->Close();
  fileOut->Close();

  ReaderProperties props = default_reader_properties();
  props.enable_buffered_stream();

  auto fileReader = ParquetFileReader::OpenFile("test.parquet", true, props);
  auto rowGroupReader = fileReader->RowGroup(0);
  auto columnReader = std::static_pointer_cast<Int32Reader>(rowGroupReader->Column(0));

  int64_t valuesRead;
  columnReader->ReadBatch(3, defLevels, nullptr, values, &valuesRead);
}
{code}

> Writing an all-null column and then reading it with buffered_stream aborts
> the process
> --------------------------------------------------------------------------
>
> Key: PARQUET-1882
> URL: https://issues.apache.org/jira/browse/PARQUET-1882
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Environment: Windows 10 64-bit, MSVC
> Reporter: Eric Gorelik
> Priority: Critical
>
> When writing a column unbuffered that contains only nulls, a 0-byte
> dictionary page gets written.
> When the resulting file is then read with buffered_stream enabled, the column
> reader takes the length of the page from the header (which is 0) and tries to
> read that many bytes from the underlying input stream.
>
> parquet/column_reader.cc, SerializedPageReader::NextPage:
>
> {code:c++}
> int compressed_len = current_page_header_.compressed_page_size;
> int uncompressed_len = current_page_header_.uncompressed_page_size;
>
> // Read the compressed data page.
> std::shared_ptr<Buffer> page_buffer;
> PARQUET_THROW_NOT_OK(stream_->Read(compressed_len, &page_buffer));
> {code}
>
> BufferedInputStream::Read, however, asserts that the number of bytes to read
> is strictly positive, so the assertion fails and aborts the process.
>
> arrow/io/buffered.cc, BufferedInputStream::Impl:
>
> {code:c++}
> Status Read(int64_t nbytes, int64_t* bytes_read, void* out) {
>   ARROW_CHECK_GT(nbytes, 0);
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)