Re: [I] Data set which is much bigger than RAM [datafusion]

via GitHub Sun, 07 Jul 2024 04:25:27 -0700


alamb commented on issue #10897:
URL: https://github.com/apache/datafusion/issues/10897#issuecomment-2212415404


   Thanks @korowa -- this analysis makes sense (i.e. there is some constant 
overhead per active partition).
   
   @Smotrov does this match your dataset? That is, how many partitions (i.e. 
output files) does your query create?
   
   Some other ideas for improvements:
   1. Account for this per-partition overhead in the memory manager. This would 
not reduce the memory required, but the query would fail with an error rather 
than silently using too much memory.
   2. Implement early flushing when memory use gets too high (force a flush of 
the currently open files), though this might produce very many small files for 
a highly partitioned dataset.
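
   To make the two ideas concrete, here is a rough, self-contained sketch. It 
is not DataFusion code and does not use DataFusion's memory pool API: 
`PartitionedWriter`, `WriteError`, and `PER_WRITER_OVERHEAD` are hypothetical 
names, and the 1 KiB overhead is an arbitrary number picked for illustration. 
The writer charges a fixed cost for every newly opened partition (idea 1) and, 
if `flush_early` is set, closes all open files before giving up (idea 2).

```rust
use std::collections::HashMap;

/// Hypothetical fixed cost (bytes) of keeping one partition writer open.
const PER_WRITER_OVERHEAD: usize = 1024;

#[derive(Debug, PartialEq)]
enum WriteError {
    ResourcesExhausted { requested: usize, available: usize },
}

/// Toy model of a partitioned sink with a memory budget.
struct PartitionedWriter {
    budget: usize,
    used: usize,
    /// Open partitions and the bytes currently buffered for each.
    open: HashMap<String, usize>,
    /// (partition, bytes) for every file closed by an early flush.
    flushed_files: Vec<(String, usize)>,
    flush_early: bool,
}

impl PartitionedWriter {
    fn new(budget: usize, flush_early: bool) -> Self {
        Self {
            budget,
            used: 0,
            open: HashMap::new(),
            flushed_files: Vec::new(),
            flush_early,
        }
    }

    /// Bytes to reserve: the payload plus the fixed overhead if this
    /// partition does not have an open writer yet (idea 1).
    fn needed_for(&self, partition: &str, bytes: usize) -> usize {
        let overhead = if self.open.contains_key(partition) {
            0
        } else {
            PER_WRITER_OVERHEAD
        };
        bytes + overhead
    }

    fn write(&mut self, partition: &str, bytes: usize) -> Result<(), WriteError> {
        // Idea 2: try to free memory by closing all open files first.
        if self.flush_early && self.used + self.needed_for(partition, bytes) > self.budget {
            self.flush_all();
        }
        // Recompute: after a flush the partition must be reopened.
        let needed = self.needed_for(partition, bytes);
        if self.used + needed > self.budget {
            // Idea 1: fail explicitly instead of exceeding the budget.
            return Err(WriteError::ResourcesExhausted {
                requested: needed,
                available: self.budget - self.used,
            });
        }
        self.used += needed;
        *self.open.entry(partition.to_string()).or_insert(0) += bytes;
        Ok(())
    }

    /// Close every open file, releasing buffers and per-writer overhead.
    fn flush_all(&mut self) {
        for (partition, bytes) in self.open.drain() {
            self.flushed_files.push((partition, bytes));
        }
        self.used = 0;
    }
}
```

   With a 4 KiB budget and 1 KiB writes, a third partition is rejected when 
`flush_early` is off. With it on, the write succeeds but the first two 
partitions have already been closed as small files, which is the trade-off 
mentioned in item 2.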
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

