Kimahriman commented on PR #52303:
URL: https://github.com/apache/spark/pull/52303#issuecomment-3319305232

   > > The user-facing API change is just necessary to solve four memory 
issues:
   > > 
   > > * Having to collect all input data in memory on the JVM side
   > > * Having to collect all input data in memory on the Python side
   > > * Having to collect all output data on the Python side
   > > * Having to collect all output data on the JVM side
   > 
   > @Kimahriman based on my observations, issue 1) occurs most frequently, 
   > and the new iterator API is not enough to resolve 3) and 4) unless we 
   > also apply batching in the output of UDFs (e.g., applyInArrow supports 
   > N rows -> M rows computation, which can output groups larger than the input).
   
   Yeah, my iterator API update includes returning an iterator, which solves (3) 
and (4). For us these are the most frequent issues we encounter and the main 
reason I made the PR; they are also harder to track down and debug if you don't 
have strict Python memory enabled.
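
   To illustrate the idea (not the actual PR code): a UDF that both consumes and 
yields an iterator of batches keeps memory bounded on both sides, even when the 
output is larger than the input. The function names and batch sizes below are 
hypothetical; real applyInArrow works on Arrow RecordBatches rather than plain lists.

   ```python
   from typing import Iterator, List

   def apply_in_batches(batches: Iterator[List[int]],
                        out_batch_size: int = 3) -> Iterator[List[int]]:
       """Hypothetical iterator-style UDF wrapper: pulls one input batch at a
       time (so neither side collects all input) and yields bounded output
       batches (so neither side collects all output), even for N -> M
       expansion where the output is larger than the input."""
       out: List[int] = []
       for batch in batches:          # lazy pull: addresses issues (1) and (2)
           for row in batch:
               out.extend([row, row * 10])   # N -> M expansion per row
               if len(out) >= out_batch_size:
                   yield out[:out_batch_size]  # bounded flush: issues (3), (4)
                   out = out[out_batch_size:]
       if out:
           yield out                  # flush the remainder

   # usage: two small input batches stream through without full materialization
   result = list(apply_in_batches(iter([[1, 2], [3]]), out_batch_size=3))
   ```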
    
   > I have updated the PR to remove the old non-batching code path, would you 
mind taking a look again? thanks!
   
   Yes, it looks a lot cleaner now! And it looks like 
https://github.com/apache/spark/pull/52391 is adding maxBytesPerBatch support 
to `BaseStreamingArrowWriter`, so it should be simple to consolidate the two 
paths after that.
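
   For context, the byte-bounded flushing that maxBytesPerBatch implies could be 
sketched like this. This is my own illustrative sketch, not the code in #52391; 
the size estimate and names are hypothetical stand-ins for Arrow's per-batch 
size accounting.

   ```python
   from typing import Iterable, Iterator, List

   def bytes_bounded_batches(rows: Iterable[str],
                             max_bytes_per_batch: int) -> Iterator[List[str]]:
       """Hypothetical maxBytesPerBatch-style writer loop: flush the current
       batch once adding the next row would exceed the byte limit."""
       batch: List[str] = []
       size = 0
       for row in rows:
           row_size = len(row.encode())   # crude stand-in for Arrow size accounting
           if batch and size + row_size > max_bytes_per_batch:
               yield batch                # flush before exceeding the limit
               batch, size = [], 0
           batch.append(row)
           size += row_size
       if batch:
           yield batch                    # flush the final partial batch

   # usage: three 2-byte rows with a 4-byte limit split into two batches
   result = list(bytes_bounded_batches(["aa", "bb", "cc"], max_bytes_per_batch=4))
   ```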


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

