alamb commented on issue #10897: URL: https://github.com/apache/datafusion/issues/10897#issuecomment-2212415404
Thanks @korowa -- this analysis makes sense (that is, there is some constant overhead per active partition).

@Smotrov, does this match your dataset? That is, how many partitions (i.e. files) does your query create?

Some other ideas for improvements:

1. Account for this overhead in the memory manager. This would not reduce the memory required, but it would cause the query to error rather than use too much memory.
2. Implement early flushing when too much memory is in use (force-flush the currently open files), though this might produce very many small files for a highly partitioned dataset.
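
To make the two ideas concrete, here is a rough standalone sketch in plain Rust. It does not use DataFusion's actual memory pool or writer APIs: `PartitionBudget`, `BudgetError`, `try_write`, and `flush_largest` are hypothetical names invented for illustration, and the fixed per-partition overhead is an assumed constant rather than a measured value. `try_write` charges the overhead when a partition is first opened and errors if the budget would be exceeded (idea 1); `flush_largest` closes the partition with the most buffered bytes to free memory (idea 2).

```rust
use std::collections::HashMap;

/// Hypothetical error type: the write would exceed the budget.
#[derive(Debug, PartialEq)]
pub enum BudgetError {
    Exhausted { requested: usize, available: usize },
}

/// Hypothetical budget that charges a constant overhead per open partition
/// in addition to the bytes buffered for it.
#[derive(Debug)]
pub struct PartitionBudget {
    limit_bytes: usize,
    per_partition_overhead: usize, // assumed constant, for illustration only
    open: HashMap<String, usize>,  // partition name -> buffered bytes
}

impl PartitionBudget {
    pub fn new(limit_bytes: usize, per_partition_overhead: usize) -> Self {
        Self { limit_bytes, per_partition_overhead, open: HashMap::new() }
    }

    /// Total accounted memory: overhead for every open partition plus buffers.
    pub fn used(&self) -> usize {
        self.open.len() * self.per_partition_overhead
            + self.open.values().sum::<usize>()
    }

    /// Idea 1: account for the overhead, and error instead of over-allocating.
    pub fn try_write(&mut self, partition: &str, bytes: usize) -> Result<(), BudgetError> {
        // Opening a new partition costs the fixed overhead on top of the data.
        let overhead = if self.open.contains_key(partition) {
            0
        } else {
            self.per_partition_overhead
        };
        let requested = overhead + bytes;
        let available = self.limit_bytes.saturating_sub(self.used());
        if requested > available {
            return Err(BudgetError::Exhausted { requested, available });
        }
        *self.open.entry(partition.to_string()).or_insert(0) += bytes;
        Ok(())
    }

    /// Idea 2: force-flush the open partition with the most buffered bytes.
    /// Returns its name and the bytes freed (buffer plus overhead).
    pub fn flush_largest(&mut self) -> Option<(String, usize)> {
        let name = self
            .open
            .iter()
            .max_by_key(|(_, buffered)| **buffered)
            .map(|(name, _)| name.clone())?;
        let buffered = self.open.remove(&name)?;
        Some((name, buffered + self.per_partition_overhead))
    }
}
```

A caller could retry a failed `try_write` after `flush_largest`, which trades memory for more (and smaller) output files, the drawback noted in idea 2.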