Mike Kaplinskiy created BEAM-5775:
-------------------------------------

             Summary: Make the Spark runner not serialize data unless Spark is 
spilling to disk
                 Key: BEAM-5775
                 URL: https://issues.apache.org/jira/browse/BEAM-5775
             Project: Beam
          Issue Type: Improvement
          Components: runner-spark
            Reporter: Mike Kaplinskiy
            Assignee: Amit Sela


Currently, for storage level MEMORY_ONLY, Beam does not encode the data with 
its coders ("coder-ify" it). This lets Spark keep the deserialized objects in 
memory and avoids the serialization round trip. Unfortunately the logic is 
fairly coarse: as soon as you switch to MEMORY_AND_DISK, Beam coder-ifies the 
data even though Spark might have chosen to keep it in memory, so the 
serialization overhead is paid whether or not anything is spilled.
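
The coarse decision described above can be modelled in a few lines. This is 
only an illustrative sketch: StorageLevelSketch, Level and encodesWithCoder are 
hypothetical names, not Beam or Spark APIs, and only the two storage levels 
named in this issue are modelled.

```java
// Sketch only: a hypothetical model of the decision described in this issue,
// not code from the Beam Spark runner.
final class StorageLevelSketch {
    // Two of Spark's storage level names, modelled as a plain enum for illustration.
    enum Level { MEMORY_ONLY, MEMORY_AND_DISK }

    // Current behaviour per this issue: only MEMORY_ONLY skips coder encoding.
    static boolean encodesWithCoder(Level level) {
        return level != Level.MEMORY_ONLY;
    }

    public static void main(String[] args) {
        for (Level level : Level.values()) {
            System.out.println(level + " -> encodes with coder: " + encodesWithCoder(level));
        }
    }
}
```

The point of the sketch is that the choice depends only on the configured 
storage level, not on whether Spark actually spills a given block.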

Ideally, Beam would serialize the data lazily, only when Spark chooses to spill 
to disk. This would be a change in behavior for Beam users, but luckily Spark 
already has a solution for those who want the data serialized in memory: 
MEMORY_AND_DISK_SER keeps the data in serialized form.
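
One possible way to get lazy serialization is a wrapper that holds the plain 
value in memory and applies the coder only when Java serialization is actually 
invoked. The sketch below assumes Spark would go through Java serialization 
when it spills a block; SketchCoder, CountingStringCoder, LazilyEncoded and 
LazyEncodingDemo are hypothetical names invented for illustration, and 
SketchCoder is a simplified stand-in, not Beam's Coder API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified stand-in for a coder (not Beam's Coder API).
interface SketchCoder<T> extends Serializable {
    byte[] encode(T value);
    T decode(byte[] bytes);
}

// Example coder that counts encode calls so the laziness is observable.
final class CountingStringCoder implements SketchCoder<String> {
    static final AtomicInteger ENCODE_CALLS = new AtomicInteger();

    @Override
    public byte[] encode(String value) {
        ENCODE_CALLS.incrementAndGet();
        return value.getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}

// Holds the plain value; the coder runs only inside writeObject/readObject.
final class LazilyEncoded<T> implements Serializable {
    private transient T value;          // never written by default serialization
    private final SketchCoder<T> coder; // written by defaultWriteObject

    LazilyEncoded(T value, SketchCoder<T> coder) {
        this.value = value;
        this.coder = coder;
    }

    T get() {
        return value;
    }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        byte[] bytes = coder.encode(value); // encoding happens here, on demand
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        value = coder.decode(bytes);
    }
}

final class LazyEncodingDemo {
    // Simulates a spill and read-back with a Java serialization round trip.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T roundTrip(T obj) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In this sketch, constructing a LazilyEncoded value and calling get() never 
touches the coder; it is invoked once per serialization, which is the 
behavior the issue asks for when Spark keeps the data in memory.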


