Mike Kaplinskiy created BEAM-5775:
-------------------------------------

             Summary: Make the spark runner not serialize data unless spark is spilling to disk
                 Key: BEAM-5775
                 URL: https://issues.apache.org/jira/browse/BEAM-5775
             Project: Beam
          Issue Type: Improvement
          Components: runner-spark
            Reporter: Mike Kaplinskiy
            Assignee: Amit Sela

Currently, for storage level MEMORY_ONLY, Beam does not coder-ify the data (that is, it does not encode it with Beam coders). This lets Spark keep the data in memory and avoid the serialization round trip. Unfortunately the logic is fairly coarse: as soon as you switch to MEMORY_AND_DISK, Beam coder-ifies the data even though Spark might have chosen to keep it in memory, so the serialization overhead is incurred regardless.

Ideally Beam would serialize the data lazily, only when Spark chooses to spill to disk. This would be a change in behavior when using Beam, but luckily Spark has a solution for folks who want data serialized in memory: MEMORY_AND_DISK_SER will keep the data serialized.
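
To make the difference concrete, here is a rough sketch of the decision as described above. It is plain Java with no Beam or Spark dependency, and the class and method names (CachePolicySketch, encodesUpFrontCurrent, encodesUpFrontProposed) are made up for illustration; they are not the runner's actual code. It assumes the current check is "anything other than MEMORY_ONLY gets encoded" and that the proposed check would key off Spark's serialized storage levels, i.e. the ones whose names contain _SER.

```java
public final class CachePolicySketch {
    private CachePolicySketch() {}

    /**
     * Current behavior as described in this issue (assumed to be an exact
     * match on the level name): only MEMORY_ONLY skips the coder round trip.
     */
    public static boolean encodesUpFrontCurrent(String storageLevel) {
        return !"MEMORY_ONLY".equals(storageLevel);
    }

    /**
     * Proposed behavior (hypothetical): encode up front only when the user
     * asked for a serialized level such as MEMORY_AND_DISK_SER, and otherwise
     * leave serialization to Spark if and when it spills.
     */
    public static boolean encodesUpFrontProposed(String storageLevel) {
        return storageLevel.contains("_SER");
    }

    public static void main(String[] args) {
        String[] levels = {"MEMORY_ONLY", "MEMORY_AND_DISK", "MEMORY_AND_DISK_SER"};
        for (String level : levels) {
            // Print both decisions side by side for each storage level.
            System.out.printf("%-20s current=%-5b proposed=%b%n",
                    level,
                    encodesUpFrontCurrent(level),
                    encodesUpFrontProposed(level));
        }
    }
}
```

The only level whose answer changes among the three is MEMORY_AND_DISK. One open question the sketch does not address is which serializer Spark would use when it spills data that Beam has not already encoded.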