mapleFU commented on issue #41321:
URL: https://github.com/apache/arrow/issues/41321#issuecomment-2069554288

   The root cause of this invalid memory access is clear; it does not happen when reading a "valid" Parquet file.
   
   The problem shows up when decoding the "corrupt" file, which has two row groups:
   
   ```
   RowGroup1: [Meta: 3 rows] [ Levels: empty ]
   RowGroup2: [Meta: 3 rows] [ Levels: data ]
   ```
   
   When decoding the first row group, `num_values_` is set to 3 from the metadata [1]. But because the "levels" are empty, no records are actually decoded, **and this is not checked** [2], so the read returns 0 rows. The reader then switches to the next row group [3]; during the switch the `decoder_` is cleared [4], and because `num_values_` is still non-zero, a new decoder is never created [5]. The next read then dereferences the cleared decoder and segfaults [6]. A minimal sketch of this sequence is shown below.
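
   For illustration, here is a minimal, self-contained C++ sketch of that sequence. The class and member names (`FakeColumnReader`, `OpenPage`, `HasNextInternal`, ...) are placeholders rather than Arrow's actual internals; the numbered comments map to the references at the end of this comment.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <memory>

// Hypothetical stand-in for a level/value decoder; not Arrow's actual class.
struct FakeDecoder {
  int64_t Decode(int64_t n) { return n; }
};

// Hypothetical reader state mirroring the sequence described above.
struct FakeColumnReader {
  int64_t num_values_ = 0;                // values promised by the page metadata
  std::unique_ptr<FakeDecoder> decoder_;  // cleared when switching row groups

  void OpenPage(int64_t meta_rows) {
    // [1] the metadata claims 3 rows are available
    num_values_ = meta_rows;
    decoder_ = std::make_unique<FakeDecoder>();
  }

  // [2] the levels are empty, so 0 records are decoded, but num_values_ is
  //     never reconciled against what was actually read.
  int64_t DecodeEmptyLevels() { return 0; }

  void SwitchRowGroup() {
    // [4] switching row groups clears the decoder ...
    decoder_.reset();
  }

  bool HasNextInternal() {
    // [5] ... but because num_values_ is still > 0, the reader believes the
    //     old page still has data and never sets up a new decoder.
    return num_values_ > 0;
  }

  int64_t ReadRecords(int64_t n) {
    // [6] dereferencing the cleared decoder: this is where the real reader
    //     touches invalid memory.
    return decoder_->Decode(std::min(n, num_values_));
  }
};

int main() {
  FakeColumnReader reader;
  reader.OpenPage(/*meta_rows=*/3);  // RowGroup1: metadata claims 3 rows
  reader.DecodeEmptyLevels();        // but the levels are empty: 0 rows read
  reader.SwitchRowGroup();           // [3] move on to RowGroup2
  if (reader.HasNextInternal()) {
    // reader.ReadRecords(3);        // would dereference a null decoder_
    std::cout << "stale num_values_ = " << reader.num_values_
              << ", decoder_ is null\n";
  }
  return 0;
}
```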
   
   Fix: when reading levels, check that the number of levels read matches the row-group metadata. This check already exists when reading values, but reading levels does not perform it. A rough sketch of the check follows.
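
   A minimal sketch of such a check, assuming a hypothetical helper name and error message (the real change would live where the levels are decoded in `column_reader.cc`):

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (name and message are placeholders, not Arrow's API):
// reconcile the decoded level count with the page/row-group metadata and
// fail loudly on corrupt files instead of silently reporting 0 rows.
inline void CheckLevelsRead(int64_t levels_read, int64_t metadata_values) {
  if (levels_read != metadata_values) {
    throw std::runtime_error(
        "Corrupt page: metadata declares " + std::to_string(metadata_values) +
        " values, but only " + std::to_string(levels_read) +
        " levels were decoded");
  }
}
```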
   
   [1] 
https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc#L794
   [2] 
https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc#L1390-L1426
   [3] 
https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/reader.cc#L472-L491
   [4] 
https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc#L1802
   [5] 
https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc#L699

