ssirovica commented on issue #14007: URL: https://github.com/apache/arrow/issues/14007#issuecomment-1232469822
Awesome, it is a relief to have the problem confirmed! Looking at the code, I think it makes sense to implement the copy in `ToThrift`, since that is a single place and the problem will be present for all types: min and max are both `[]byte`, i.e. reference values. Code-gen'ing the copies into every arrow type's `Encode()` would be feasible as well, though. (A rough sketch of what I mean is below.)

> The answer is that row group statistics are also stored in the file level metadata. This means that the file writer itself ends up holding onto the stats for the row group, and since it writes an entire record as one row group, and the record is just a single contiguous byte array, the file level statistics for each row group ends up forcing it to keep around the memory for every record.

Ah! So if I understand correctly, we will be retaining the min and max for every row group until the end of the file. However, we should be able to drop these statistics once a row group is written, as we don't need to keep them around for Parquet. Would it be worth exploring a way to not keep the row group statistics long-term in the file level metadata? That would save us both the copy and the memory. With that said, it's a more complex change, and it would only be a problem when there are many row groups, many columns, and large min/max values.

I'm also happy to try and take a stab at this in a PR!
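
To illustrate the `ToThrift` idea, here is a minimal sketch of the defensive copy. The struct and helper names below are illustrative stand-ins rather than the actual thrift-generated `format.Statistics` or metadata types in the Go parquet module; the point is just that copying the `[]byte` min/max at conversion time means the retained metadata holds only the small stat values instead of aliasing the record's contiguous buffer.

```go
package main

import "fmt"

// thriftStatistics stands in for the thrift-generated statistics struct,
// where min/max are plain []byte fields.
type thriftStatistics struct {
	Min, Max           []byte
	MinValue, MaxValue []byte
}

// copyBuf is a hypothetical helper: it returns a fresh slice so the thrift
// metadata no longer references the caller's (potentially huge) record buffer.
func copyBuf(b []byte) []byte {
	if b == nil {
		return nil
	}
	out := make([]byte, len(b))
	copy(out, b)
	return out
}

// toThrift sketches the proposed fix: instead of aliasing the min/max slices
// (which may be views into the record's contiguous byte array), copy them
// once at conversion time.
func toThrift(min, max []byte) *thriftStatistics {
	return &thriftStatistics{
		Min:      copyBuf(min),
		Max:      copyBuf(max),
		MinValue: copyBuf(min),
		MaxValue: copyBuf(max),
	}
}

func main() {
	// Pretend this is a large record buffer; min/max are slices into it.
	record := make([]byte, 1<<20)
	record[0], record[len(record)-1] = 'a', 'z'

	stats := toThrift(record[:1], record[len(record)-1:])
	// Only the 1-byte copies are retained; the big record buffer can be freed.
	fmt.Println(len(stats.Min), len(stats.Max))
}
```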
