TP Boudreau created ARROW-8127:
----------------------------------

             Summary: [C++] [Parquet] Incorrect column chunk metadata for 
multipage batch writes
                 Key: ARROW-8127
                 URL: https://issues.apache.org/jira/browse/ARROW-8127
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: TP Boudreau
            Assignee: TP Boudreau
         Attachments: multipage-batch-write.cc

When writing to a buffered column writer using PLAIN encoding, if the size of 
the batch supplied for writing exceeds the writer's page size, the resulting 
file has an incorrect data_page_offset in its column chunk metadata.  This 
causes an exception to be thrown when reading the file (the file appears too 
short to the reader).

For example, the attached code, which attempts to write a batch of 262145 
Int32s (= 1048576 + 4 bytes) using the default page size of 1048576 bytes 
(with a buffered writer, PLAIN encoding), fails on reading, throwing the 
error: "Tried reading 1048678 bytes starting at position 1048633 from file but 
only got 333".

The error is caused by the second page write tripping the conditional at 
https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L302 
in the serialized in-memory writer wrapped by the buffered writer.

The fix builds the metadata with offsets from the terminal sink rather than 
the in-memory buffered sink.  A PR is coming.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
