zeroshade commented on issue #14007:
URL: https://github.com/apache/arrow/issues/14007#issuecomment-1232410957

   Not only does it sound plausible, but I had just come to the same conclusion 
myself!
   
   > I think what's missing is understanding why the statistics aren't being 
released on write of a RG and then resolving that over copying. The copy 
example above was just to see if we're on the right track.
   
   The answer is that row group statistics are also stored in the file-level 
metadata. This means the file writer itself ends up holding onto the stats for 
every row group, and since it writes each record as a single row group, and the 
record is just one contiguous byte array, the file-level statistics end up 
keeping the memory for every record alive. If you look at the `ToThrift` 
method, it's just propagating the slices for the min and max values, so even 
the encoded thrift metadata still holds references into the original record 
buffer.
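
   To make the retention concrete, here's a minimal Go sketch (the `stats` 
struct and buffer size are hypothetical, purely for illustration) of how a 
few-byte sub-slice keeps an entire record buffer alive:

```go
package main

import (
	"fmt"
	"runtime"
)

// stats mimics row-group statistics that hold min/max as sub-slices of
// the record's backing buffer. Hypothetical struct for illustration,
// not the library's actual type.
type stats struct {
	min, max []byte
}

func buildStats() stats {
	record := make([]byte, 64<<20) // stand-in for a 64 MiB record buffer
	// The tiny min/max slices share record's backing array, so the
	// whole 64 MiB stays reachable for as long as the stats do.
	return stats{min: record[:4], max: record[len(record)-4:]}
}

func main() {
	s := buildStats()
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	// HeapAlloc stays around 64 MiB even though buildStats returned,
	// because s.min and s.max pin the record buffer.
	fmt.Printf("heap in use: %d MiB\n", m.HeapAlloc>>20)
	runtime.KeepAlive(s)
}
```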
   
   I believe the solution will be to break the reference and actually perform 
the copy, either in the `ToThrift` method or in the `Encode` methods. I'm gonna 
sleep on it and see if I can figure out the best approach for when to break the 
reference and copy the min and max values. If you have any thoughts on it, let 
me know. Otherwise I'll poke at this more in the morning and see what makes the 
most sense.
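
   For reference, here's a rough sketch of what breaking the reference could 
look like; `cloneBytes`, `rowGroupStats`, and `breakStatsRefs` are hypothetical 
names for illustration, not the actual code:

```go
package main

// cloneBytes copies a byte slice into freshly allocated memory so the
// result no longer shares the original backing array.
func cloneBytes(v []byte) []byte {
	if v == nil {
		return nil
	}
	out := make([]byte, len(v))
	copy(out, v)
	return out
}

// rowGroupStats and breakStatsRefs are illustrative names showing
// where the copy could happen before the file writer retains the
// row group's metadata.
type rowGroupStats struct{ Min, Max []byte }

func breakStatsRefs(s *rowGroupStats) {
	s.Min = cloneBytes(s.Min)
	s.Max = cloneBytes(s.Max)
}

func main() {
	record := make([]byte, 1<<20)
	s := rowGroupStats{Min: record[:4], Max: record[len(record)-4:]}
	breakStatsRefs(&s)
	// record's backing array is now collectible once record goes out
	// of scope; s holds independent copies of the min and max bytes.
	_ = s
}
```

   The cost of copying is a small allocation per column per row group, which 
should be negligible next to keeping whole records pinned in memory.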
   
   This is definitely a tricky little memory bug you've discovered here; 
thanks for filing it!

