[ https://issues.apache.org/jira/browse/BEAM-1444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ismaël Mejía reassigned BEAM-1444:
----------------------------------

    Assignee: (was: Amit Sela)

> Flatten of Bounded and Unbounded repeats the union with the RDD for each
> micro-batch.
> --------------------------------------------------------------------------------------
>
>                 Key: BEAM-1444
>                 URL: https://issues.apache.org/jira/browse/BEAM-1444
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-spark
>            Reporter: Amit Sela
>            Priority: Major
>
> Flatten of BOUNDED and UNBOUNDED PCollections in the Spark runner is
> implemented by applying {{SparkContext#union(RDD...)}} inside a
> {{DStream.transform()}}, which causes the same RDD to be unioned into each
> micro-batch, multiplying its content in the resulting stream (once per batch).
> Spark does not seem to provide an out-of-the-box implementation for this.
> One approach I tried was to create a stream from a Queue (a single-RDD
> stream), but this is not an option since it breaks checkpointing.
> Another approach would be to create a custom {{InputDStream}} that does this.
> An important note: the challenge is to find a solution that holds up under
> checkpointing and recovery from failure.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
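The multiplication described in the issue can be modeled without Spark in a few lines of Python. This is a minimal sketch of the behavior, not the runner's actual code: `bounded` stands in for the bounded PCollection's RDD, `micro_batches` for the unbounded DStream, and the list comprehension plays the role of the per-batch `DStream.transform()` that re-applies the union.

```python
# Hypothetical stand-ins for the RDDs involved (illustrative names only).
bounded = [1, 2, 3]                 # the bounded PCollection's single RDD
micro_batches = [[10], [20], [30]]  # successive micro-batches of the stream

# DStream.transform() runs once per micro-batch, so a union performed inside
# it re-adds the same bounded RDD to every batch.
flattened_per_batch = [batch + bounded for batch in micro_batches]

# Collecting the whole stream shows the bounded content repeated once per
# batch: 3 batches yield 3 copies of each bounded element.
all_elements = [x for batch in flattened_per_batch for x in batch]
print(all_elements)  # [10, 1, 2, 3, 20, 1, 2, 3, 30, 1, 2, 3]
```

The correct semantics of Flatten would emit each bounded element exactly once across the lifetime of the stream, which is why the union cannot simply live inside the per-batch transform.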