[ https://issues.apache.org/jira/browse/SPARK-32735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17189765#comment-17189765 ]
Apache Spark commented on SPARK-32735:
--------------------------------------

User 'Olwn' has created a pull request for this issue: https://github.com/apache/spark/pull/29578

> RDD actions in DStream.transform don't show at batch page
> ---------------------------------------------------------
>
>                 Key: SPARK-32735
>                 URL: https://issues.apache.org/jira/browse/SPARK-32735
>             Project: Spark
>          Issue Type: Bug
>          Components: DStreams, Web UI
>    Affects Versions: 3.0.0
>            Reporter: Liechuan Ou
>            Priority: Major
>              Labels: pull-request-available
>
> h4. Issue
> {code:scala}
> val lines = ssc.socketTextStream("localhost", 9999)
> val words = lines.flatMap(_.split(" "))
> val mappedStream = words.transform { rdd =>
>   val c = rdd.count()
>   rdd.map(x => s"$c $x")
> }
> mappedStream.foreachRDD(rdd => rdd.foreach(println))
> {code}
> Every batch, two Spark jobs are created. Only the second one is associated with the streaming output operation and shown on the batch page.
> h4. Investigation
> The first action, rdd.count(), is invoked by JobGenerator.generateJobs. The batch time and output op id are not available in the Spark context at that point, because they are only set later by JobScheduler.
> h4. Proposal
> Delegate dstream.getOrCompute to JobScheduler so that all RDD actions run in a Spark context with the correct local properties.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
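The timing issue described under Investigation can be illustrated with a small, Spark-free sketch. Spark keeps local properties such as the output op id in a thread-local map (via SparkContext.setLocalProperty); the helpers below are hypothetical stand-ins for that mechanism, not Spark APIs, and the property name is borrowed from Spark's streaming internals for flavor. The point is only the ordering: an action triggered eagerly during job generation sees no properties, while one run by the scheduler (after it sets them) does.

```python
import threading

# Hypothetical stand-in for SparkContext's thread-local property map.
_local = threading.local()

def set_local_property(key, value):
    props = getattr(_local, "props", None)
    if props is None:
        props = _local.props = {}
    props[key] = value

def get_local_property(key):
    return getattr(_local, "props", {}).get(key)

def run_action(name):
    # An "RDD action": records which output op id it sees when submitted.
    return (name, get_local_property("spark.streaming.internal.outputOpId"))

# generateJobs phase: the user's transform closure runs here, before the
# scheduler has set any batch/output-op properties -- so it sees None.
eager = run_action("rdd.count() in transform")

# JobScheduler phase: the scheduler sets the properties, then runs the
# output operation -- so this action is attributed to output op 0.
set_local_property("spark.streaming.internal.outputOpId", "0")
scheduled = run_action("foreachRDD output op")

print(eager)      # ('rdd.count() in transform', None)
print(scheduled)  # ('foreachRDD output op', '0')
```

This is why the UI cannot associate the first job with the batch: its properties were never set. The proposal moves the eager computation (getOrCompute) into the scheduler phase, so both actions run after the properties exist.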