zeroshade commented on issue #14007: URL: https://github.com/apache/arrow/issues/14007#issuecomment-1232410957
Not only does it sound plausible, but I had just come to the same conclusion myself!

> I think what's missing is understanding why the statistics aren't being released on write of a RG and then resolving that over copying. The copy example above was just to see if we're on the right track.

The answer is that row group statistics are also stored in the file-level metadata. This means the file writer itself ends up holding onto the stats for each row group, and since it writes an entire record as one row group, and the record is just a single contiguous byte array, the file-level statistics for each row group end up forcing it to keep the memory for every record alive. If you look at the `ToThrift` method, it just propagates the slices for the min and max values, so even the encoded thrift metadata keeps a reference into the record's memory.

I believe the solution will be to break the reference and actually perform the copy, either in the `ToThrift` method or in the `Encode` methods. I'm going to sleep on it and see if I can figure out the best approach as far as when to break the reference and copy the min and max values. If you have any thoughts on it, let me know. Otherwise I'll poke at this more in the morning and see what makes the most sense.

This is definitely a tricky little memory bug you've discovered here, thanks for filing it!