[ 
https://issues.apache.org/jira/browse/SPARK-32734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32734:
------------------------------------

    Assignee:     (was: Apache Spark)

> RDD actions in DStream.transfrom delays batch submission
> --------------------------------------------------------
>
>                 Key: SPARK-32734
>                 URL: https://issues.apache.org/jira/browse/SPARK-32734
>             Project: Spark
>          Issue Type: Bug
>          Components: DStreams
>    Affects Versions: 3.0.0
>            Reporter: Liechuan Ou
>            Priority: Major
>              Labels: pull-request-available
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> h4. Issue
> Some spark applications have batch creation delay after running for some 
> time. For instance, Batch 10:03 is submitted at 10:06. In spark UI, the 
> latest batch doesn't match current time.
>   
> ||Clock||BatchTime||
> |10:00|10:00|
> |10:02|10:01|
> |10:04|10:02|
> |10:06|10:03|
> h4. Investigation
> We observe such applications share a commonality that rdd actions exist in 
> dstream.transfrom. Those actions will be executed in dstream.compute, which 
> is called by JobGenerator. JobGenerator runs with a single thread event loop 
> so any synchronized operations will block event processing.
> h4. Proposal
> delegate dstream.compute to JobSchduler
>  
> {code:java}
> // class ForEachDStream
> override def generateJob(time: Time): Option[Job] = {
>   val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
>     parent.getOrCompute(time).foreach(rdd => foreachFunc(rdd, time))
>   }
>   Some(new Job(time, jobFunc))
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to