kennknowles opened a new issue, #19228:
URL: https://github.com/apache/beam/issues/19228

   Currently, for storage level MEMORY_ONLY, Beam does not coder-ify the data. 
This lets Spark keep the data in memory, avoiding the serialization round trip. 
Unfortunately, the logic is fairly coarse: as soon as you switch to 
MEMORY_AND_DISK, Beam coder-ifies the data even when Spark might have chosen 
to keep the data in memory, incurring the serialization overhead anyway.
   
    
   
   Ideally, Beam would serialize the data lazily, only as Spark chooses to spill 
to disk. This would be a change in behavior when using Beam, but luckily Spark 
has a solution for folks who want data serialized in memory: 
MEMORY_AND_DISK_SER will keep the data serialized.
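   For reference, the Spark runner exposes a `storageLevel` pipeline option 
(defaulting to `MEMORY_ONLY`) whose values mirror Spark's `StorageLevel` names. 
A minimal sketch of opting into serialized in-memory storage explicitly, 
assuming a hypothetical pipeline class `org.example.MyPipeline` launched via 
Maven:

   ```shell
   # Sketch, not a definitive invocation: the main class and build tool are
   # assumptions; --storageLevel is a SparkPipelineOptions option.
   mvn exec:java -Dexec.mainClass=org.example.MyPipeline \
     -Dexec.args="--runner=SparkRunner --storageLevel=MEMORY_AND_DISK_SER"
   ```

   With lazy serialization as proposed above, `MEMORY_AND_DISK` could match 
`MEMORY_ONLY` performance while data fits in memory, and users who want 
serialized in-memory storage would pass `MEMORY_AND_DISK_SER` as shown.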
   
   Imported from Jira 
[BEAM-5775](https://issues.apache.org/jira/browse/BEAM-5775). Original Jira may 
contain additional context.
   Reported by: mikekap.
