[ https://issues.apache.org/jira/browse/SPARK-3292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14155550#comment-14155550 ]
Tathagata Das commented on SPARK-3292:
--------------------------------------

I mentioned this in the PR, but I am adding it here as well. Not returning an RDD can mess up a lot of the logic and semantics. For example, if there is a transform() followed by updateStateByKey(), the result will be unpredictable. updateStateByKey expects the previous batch to have a state RDD. If it does not find any state RDD, it will assume that this is the start of the streaming computation and effectively initialize again, forgetting the previous states from 2 batches ago. So this change is incorrect.

Regarding the original problem of creating too many empty files, you can filter those out by explicitly doing the save yourself:

    dstream.foreachRDD { (rdd, time) =>
      // Save only when the batch contains at least one record.
      if (rdd.take(1).size == 1) {
        rdd.saveAsHadoopFile(....)
      }
    }

> Shuffle Tasks run incessantly even though there's no inputs
> -----------------------------------------------------------
>
>                 Key: SPARK-3292
>                 URL: https://issues.apache.org/jira/browse/SPARK-3292
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>    Affects Versions: 1.0.2
>            Reporter: guowei
>
> Such as repartition, groupBy, join and cogroup.
> For example, if I want to save the shuffle outputs as Hadoop files, many
> empty files are generated too, even though there are no inputs.
> It's too expensive.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)