[jira] [Updated] (SPARK-13349) adding a split and union to a streaming application causes big performance hit

krishna ramachandran (JIRA) Tue, 16 Feb 2016 15:37:33 -0800

     [ 
https://issues.apache.org/jira/browse/SPARK-13349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


krishna ramachandran updated SPARK-13349:
-----------------------------------------
    Summary: adding a split and union to a streaming application causes big 
performance hit  (was: enabling cache causes out of memory error. Caching 
DStream helps reduce processing time in a streaming application but get out of 
memory errors)

> adding a split and union to a streaming application causes big performance hit
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-13349
>                 URL: https://issues.apache.org/jira/browse/SPARK-13349
>             Project: Spark
>          Issue Type: Improvement
>    Affects Versions: 1.4.1
>            Reporter: krishna ramachandran
>            Priority: Critical
>             Fix For: 1.4.2
>
>
> We have a streaming application containing approximately 12 stages every 
> batch, running in streaming mode (4 sec batches). Each stage persists output 
> to cassandra
> the pipeline stages 
> stage 1
> ---> receive Stream A --> map --> filter -> (union with another stream B) --> 
> map --> groupbykey --> transform --> reducebykey --> map
> we go thro' few more stages of transforms and save to database. 
> Around stage 5, we union the output of Dstream from stage 1 (in red) with 
> another stream (generated by split during stage 2) and save that state
> It appears the whole execution thus far is repeated which is redundant (I can 
> see this in execution graph & also performance -> processing time). 
> Processing time per batch nearly doubles or triples.
> This additional & redundant processing cause each batch to run as much as 2.5 
> times slower compared to runs without the union - union for most batches does 
> not alter the original DStream (union with an empty set). If I cache the 
> DStream (red block output), performance improves substantially but hit out of 
> memory errors within few hours.
> What is the recommended way to cache/unpersist in such a scenario? there is 
> no dstream level "unpersist"
> setting "spark.streaming.unpersist" to true and 
> streamingContext.remember("duration") did not help. Still seeing out of 
> memory errors



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (SPARK-13349) adding a split and union to a streaming application causes big performance hit

Reply via email to