[ https://issues.apache.org/jira/browse/BEAM-5775?focusedWorklogId=226562&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-226562 ]
ASF GitHub Bot logged work on BEAM-5775:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 12/Apr/19 10:03
            Start Date: 12/Apr/19 10:03
    Worklog Time Spent: 10m
      Work Description: iemejia commented on pull request #6714: [BEAM-5775] Spark: implement a custom class to lazily encode values for persistence.
URL: https://github.com/apache/beam/pull/6714#discussion_r274826081

 ##########
 File path: runners/spark/src/main/java/org/apache/beam/runners/spark/translation/GroupCombineFunctions.java
 ##########
 @@ -182,119 +176,4 @@
          .map(TranslationUtils.fromPairFunction())
          .map(TranslationUtils.toKVByWindowInValue());
    }
 -
 -  /**
 -   * Wrapper around accumulated (combined) value with custom lazy serialization. Serialization is
 -   * done through given coder and it is performed within on-serialization callbacks {@link
 -   * #writeObject(ObjectOutputStream)} and {@link KryoAccumulatorSerializer#write(Kryo, Output,
 -   * SerializableAccumulator)}. Both Spark's serialization mechanisms (Java Serialization, Kryo) are
 -   * supported. Materialization of accumulated value is done when value is requested to avoid
 -   * serialization of the coder itself.
 -   *
 -   * @param <AccumT>
 -   */
 -  public static class SerializableAccumulator<AccumT> implements Serializable {

 Review comment:
   Huge +1

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
-------------------

    Worklog Id:     (was: 226562)
    Time Spent: 9h  (was: 8h 50m)

> Make the spark runner not serialize data unless spark is spilling to disk
> -------------------------------------------------------------------------
>
>                 Key: BEAM-5775
>                 URL: https://issues.apache.org/jira/browse/BEAM-5775
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-spark
>            Reporter: Mike Kaplinskiy
>            Assignee: Mike Kaplinskiy
>            Priority: Minor
>              Labels: triaged
>          Time Spent: 9h
>  Remaining Estimate: 0h
>
> Currently for storage level MEMORY_ONLY, Beam does not coder-ify the data.
> This lets Spark keep the data in memory avoiding the serialization round
> trip. Unfortunately the logic is fairly coarse - as soon as you switch to
> MEMORY_AND_DISK, Beam coder-ifys the data even though Spark might have chosen
> to keep the data in memory, incurring the serialization overhead.
>
> Ideally Beam would serialize the data lazily - as Spark chooses to spill to
> disk. This would be a change in behavior when using beam, but luckily Spark
> has a solution for folks that want data serialized in memory -
> MEMORY_AND_DISK_SER will keep the data serialized.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
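
The lazy-encoding idea in the quoted issue description can be illustrated with a minimal sketch, assuming plain Java serialization only. All names here (`SimpleCoder`, `CountingStringCoder`, `LazilyEncodedValue`, `LazyEncodingDemo`) are hypothetical and invented for illustration; they are not Beam's `Coder` API and not the class from pull request #6714. The removed `SerializableAccumulator` Javadoc also mentions Kryo support and avoiding serialization of the coder itself; this sketch does neither and simply serializes the coder alongside the value to stay short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a coder; NOT the real Beam Coder interface. */
interface SimpleCoder<T> extends Serializable {
  byte[] encode(T value) throws IOException;

  T decode(byte[] bytes) throws IOException;
}

/** Hypothetical UTF-8 string coder that counts how often encode() runs. */
final class CountingStringCoder implements SimpleCoder<String> {
  static final AtomicInteger ENCODE_CALLS = new AtomicInteger();

  @Override
  public byte[] encode(String value) {
    ENCODE_CALLS.incrementAndGet();
    return value.getBytes(StandardCharsets.UTF_8);
  }

  @Override
  public String decode(byte[] bytes) {
    return new String(bytes, StandardCharsets.UTF_8);
  }
}

/** Keeps the value in object form; the coder runs only when serialization is triggered. */
final class LazilyEncodedValue<T> implements Serializable {
  private final SimpleCoder<T> coder; // simplification: serialized together with the value
  private transient T value; // skipped by default serialization, written via the coder instead

  LazilyEncodedValue(T value, SimpleCoder<T> coder) {
    this.value = value;
    this.coder = coder;
  }

  T get() {
    return value; // in-memory access never touches the coder
  }

  // Called by Java serialization, e.g. when a framework writes the object out.
  private void writeObject(ObjectOutputStream out) throws IOException {
    out.defaultWriteObject(); // writes the coder field
    byte[] bytes = coder.encode(value);
    out.writeInt(bytes.length);
    out.write(bytes);
  }

  private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
    in.defaultReadObject(); // restores the coder field
    byte[] bytes = new byte[in.readInt()];
    in.readFully(bytes);
    value = coder.decode(bytes);
  }
}

final class LazyEncodingDemo {
  /** Serializes and deserializes the wrapper, standing in for a spill to disk. */
  @SuppressWarnings("unchecked")
  static <T> LazilyEncodedValue<T> roundTrip(LazilyEncodedValue<T> wrapped)
      throws IOException, ClassNotFoundException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
      out.writeObject(wrapped);
    }
    try (ObjectInputStream in =
        new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
      return (LazilyEncodedValue<T>) in.readObject();
    }
  }

  public static void main(String[] args) throws Exception {
    LazilyEncodedValue<String> wrapped =
        new LazilyEncodedValue<>("hello", new CountingStringCoder());
    System.out.println(wrapped.get() + " encodes=" + CountingStringCoder.ENCODE_CALLS.get());
    LazilyEncodedValue<String> copy = roundTrip(wrapped);
    System.out.println(copy.get() + " encodes=" + CountingStringCoder.ENCODE_CALLS.get());
  }
}
```

In this sketch the encode counter only advances inside `roundTrip`, which plays the role of Spark deciding to spill; reading the value through `get()` while it stays in memory costs no encoding.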