[ https://issues.apache.org/jira/browse/SPARK-32734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-32734:
------------------------------------

    Assignee:     (was: Apache Spark)

> RDD actions in DStream.transform delays batch submission
> --------------------------------------------------------
>
>                 Key: SPARK-32734
>                 URL: https://issues.apache.org/jira/browse/SPARK-32734
>             Project: Spark
>          Issue Type: Bug
>          Components: DStreams
>    Affects Versions: 3.0.0
>            Reporter: Liechuan Ou
>            Priority: Major
>              Labels: pull-request-available
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> h4. Issue
> Some Spark applications develop a batch creation delay after running for
> some time. For instance, batch 10:03 is submitted at 10:06. In the Spark UI,
> the latest batch time lags behind the current time.
>
> ||Clock||BatchTime||
> |10:00|10:00|
> |10:02|10:01|
> |10:04|10:02|
> |10:06|10:03|
> h4. Investigation
> The affected applications have one thing in common: they run RDD actions
> inside DStream.transform. Those actions are executed in DStream.compute,
> which is called by JobGenerator. JobGenerator runs a single-threaded event
> loop, so any synchronous (blocking) operation stalls event processing.
> h4. Proposal
> Delegate DStream.compute to JobScheduler.
>
> {code:java}
> // class ForEachDStream
> override def generateJob(time: Time): Option[Job] = {
>   // getOrCompute is called inside jobFunc, so DStream.compute runs when the
>   // job executes rather than on the JobGenerator event loop.
>   val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
>     parent.getOrCompute(time).foreach(rdd => foreachFunc(rdd, time))
>   }
>   Some(new Job(time, jobFunc))
> }
> {code}
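
The blocking effect described under Investigation can be illustrated with a small toy model. This is only a sketch and does not use any Spark classes: EventLoopSketch and pickupTimes are hypothetical names, time is a virtual clock in minutes, and the 2-minute inline cost is an assumption chosen to match the table in the issue. It also assumes that the Clock column corresponds to the moment the single-threaded loop starts handling a batch.

```scala
object EventLoopSketch {
  /**
   * Toy model of a single-threaded event loop (not Spark code).
   * A timer posts one event per batch every `intervalMin` minutes; the loop
   * handles events strictly one at a time, and each handler occupies the
   * loop for `inlineCostMin` minutes (e.g. an RDD action run inline).
   *
   * Returns (batchTime, clock at which the loop starts handling that batch).
   */
  def pickupTimes(batches: Int, intervalMin: Int, inlineCostMin: Int): Seq[(Int, Int)] = {
    var loopFreeAt = 0 // virtual clock at which the loop thread is next idle
    (0 until batches).map { k =>
      val batchTime = k * intervalMin                 // when the timer posts the event
      val startedAt = math.max(batchTime, loopFreeAt) // the event waits while the loop is busy
      loopFreeAt = startedAt + inlineCostMin          // the handler blocks the loop this long
      (batchTime, startedAt)
    }
  }
}
```

Under these assumptions, EventLoopSketch.pickupTimes(4, 1, 2) gives pickup clocks of 0, 2, 4 and 6 for batch times 0, 1, 2 and 3, the same drift as in the table. With an inline cost of 0, which loosely models moving the expensive work off the event loop, each batch is picked up at its own batch time.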