[ https://issues.apache.org/jira/browse/SPARK-13349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15154555#comment-15154555 ]

krishna ramachandran commented on SPARK-13349:
----------------------------------------------

Hi Sean,
I posted to user@, but hit two problems:

1) not much traction
2) though I registered multiple times, I keep getting this message at Nabble:

  "This post has NOT been accepted by the mailing list yet"

The message I posted is pasted below.

This is not just a question - it is a bug.

We have a streaming application containing approximately 12 jobs every batch, 
running in streaming mode (4 sec batches). Each job has several 
transformations and one action (output to Cassandra), which triggers the 
execution of the job (DAG). 

For example, the first job: 

job 1 
---> receive stream A --> map --> filter --> (union with another stream B) --> 
map --> groupByKey --> transform --> reduceByKey --> map 
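
Roughly, in code, job 1 looks like the sketch below. The socket sources, key 
extraction, and aggregation bodies are simplified placeholders, not our real 
sources or logic: 

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Job1Sketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("job1-sketch").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(4)) // 4-sec batches

    // placeholder sources standing in for stream A and stream B
    val streamA = ssc.socketTextStream("localhost", 9999)
    val streamB = ssc.socketTextStream("localhost", 9998)

    val job1Output = streamA
      .map(_.trim)                              // map
      .filter(_.nonEmpty)                       // filter
      .union(streamB)                           // union with stream B
      .map(line => (line.split(",")(0), 1))     // map to key/value pairs
      .groupByKey()                             // groupByKey
      .transform(rdd => rdd.mapValues(_.sum))   // transform
      .reduceByKey(_ + _)                       // reduceByKey
      .map { case (k, n) => s"$k:$n" }          // map

    job1Output.print() // stand-in for the save-to-Cassandra action

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}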

Likewise we go through a few more transforms and save to the database 
(job 2, job 3, ...). 

Recently we added a new transformation further downstream, in which we union 
the output DStream of job 1 (the pipeline above) with the output of a new 
transformation (job 5). It appears the whole execution up to that point is 
repeated, which is redundant (I can see this in the execution graph and also 
in the per-batch processing time). 

That is, with this additional transformation (a union with a stream processed 
upstream), each batch runs as much as 2.5 times slower than runs without the 
union. If I cache the DStream from job 1, performance improves substantially, 
but we hit out-of-memory errors within a few hours. 
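
The caching we tried amounts to the following (job1Output is an illustrative 
name for the DStream produced by job 1; the MEMORY_AND_DISK_SER variant is an 
untested alternative I am considering, not something from our runs): 

{code}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.DStream

// job1Output: the DStream produced by job 1 (name is illustrative)
def cacheJob1(job1Output: DStream[(String, Int)]): DStream[(String, Int)] = {
  // cache() uses the DStream default level (serialized in memory); this
  // is the variant that improved performance but eventually hit OOM:
  //   job1Output.cache()

  // an alternative worth trying (an assumption, not something we ran):
  // keep blocks serialized and spill to disk instead of failing on OOM
  job1Output.persist(StorageLevel.MEMORY_AND_DISK_SER)
}
{code}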

What is the recommended way to cache/unpersist in such a scenario? There is 
no DStream-level "unpersist". Setting "spark.streaming.unpersist" to true and 
calling streamingContext.remember(duration) did not help. 
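
Concretely, the two knobs we tried look like this (the app name and the 
one-minute duration are illustrative): 

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("streaming-app") // illustrative name
  // unpersist RDDs generated by DStreams once Spark decides they are
  // no longer needed (true is already the default)
  .set("spark.streaming.unpersist", "true")

val ssc = new StreamingContext(conf, Seconds(4))

// keep each batch's RDDs around at least this long before they become
// eligible for cleanup; the one-minute value is illustrative
ssc.remember(Minutes(1))
{code}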

> adding a split and union to a streaming application causes a big performance hit
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-13349
>                 URL: https://issues.apache.org/jira/browse/SPARK-13349
>             Project: Spark
>          Issue Type: Improvement
>    Affects Versions: 1.4.1
>            Reporter: krishna ramachandran
>            Priority: Critical
>
> We have a streaming application containing approximately 12 jobs every batch, 
> running in streaming mode (4 sec batches). Each job writes output to Cassandra 
> and can contain several stages.
> job 1
> ---> receive stream A --> map --> filter --> (union with another stream B) --> 
> map --> groupByKey --> transform --> reduceByKey --> map
> We go through a few more jobs of transforms and save to the database. 
> Around stage 5, we union the output DStream from job 1 (the pipeline above) 
> with another stream (generated by a split during job 2) and save that state.
> It appears the whole execution thus far is repeated, which is redundant (I 
> can see this in the execution graph and also in the per-batch processing 
> time). Processing time per batch nearly doubles or triples.
> This additional, redundant processing causes each batch to run as much as 
> 2.5 times slower than runs without the union; for most batches the union 
> does not alter the original DStream (it is a union with an empty set). If I 
> cache the DStream from job 1, performance improves substantially, but we 
> hit out-of-memory errors within a few hours.
> What is the recommended way to cache/unpersist in such a scenario? There is 
> no DStream-level "unpersist".
> Setting "spark.streaming.unpersist" to true and 
> streamingContext.remember(duration) did not help. Still seeing out-of-memory 
> errors.


