mapleFU commented on issue #41321: URL: https://github.com/apache/arrow/issues/41321#issuecomment-2069554288
The root cause of this invalid memory access is clear, and it does not happen when reading a valid Parquet file. The corrupt file has two row groups:

```
RowGroup1: [Meta: 3 rows] [ Levels: empty ]
RowGroup2: [Meta: 3 rows] [ Levels: data ]
```

When decoding the first row group, `num_values_` is set to 3 [1]. But because the levels are empty, no records are decoded, **and this is not checked** [2], so the read returns "0 rows". The reader then switches to the next row group [3]; during the switch the `decoder_` is cleared [4], and because `num_values_` is still non-zero, a new decoder is not created [5]. The subsequent read then segfaults [6].

Proposed fix: when reading levels, check that the number of levels read equals the count in the row-group metadata (this check is done when reading values, but reading levels does not check it).

[1] https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc#L794
[2] https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc#L1390-L1426
[3] https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/reader.cc#L472-L491
[4] https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc#L1802
[5] https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc#L699
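To make the proposed check concrete, here is a minimal standalone C++ sketch (not Arrow's actual internals): `RowGroupColumnMeta`, `ReadLevels`, and `ReadRowGroupLevelsChecked` are hypothetical stand-ins for the row-group metadata, the level decoder, and the guarded read. The point it illustrates is only that a short level read should surface an error instead of silently reporting zero rows while `num_values_` stays non-zero.

```cpp
#include <algorithm>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-row-group metadata value that the
// reader loads into num_values_ (e.g. 3 in the corrupt file from this issue).
struct RowGroupColumnMeta {
  int64_t num_values;
};

// Hypothetical level decoder: returns how many levels it could actually
// decode from the (possibly empty or truncated) levels buffer.
int64_t ReadLevels(const std::vector<int16_t>& levels_buffer,
                   int64_t requested, std::vector<int16_t>* out) {
  int64_t n = std::min<int64_t>(requested,
                                static_cast<int64_t>(levels_buffer.size()));
  out->assign(levels_buffer.begin(), levels_buffer.begin() + n);
  return n;
}

// Sketch of the proposed behaviour: fail fast when fewer levels come back
// than the row-group metadata promised, instead of returning "0 rows" and
// letting the reader advance to the next row group with stale state.
void ReadRowGroupLevelsChecked(const RowGroupColumnMeta& meta,
                               const std::vector<int16_t>& levels_buffer,
                               std::vector<int16_t>* out) {
  int64_t levels_read = ReadLevels(levels_buffer, meta.num_values, out);
  if (levels_read != meta.num_values) {
    throw std::runtime_error(
        "Corrupt row group: expected " + std::to_string(meta.num_values) +
        " levels but decoded " + std::to_string(levels_read));
  }
}
```

With the layout from this issue, `ReadRowGroupLevelsChecked({3}, /*levels_buffer=*/{}, &out)` would throw on the first row group rather than letting the reader continue into the second one with a cleared `decoder_`.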
