[ https://issues.apache.org/jira/browse/SPARK-6847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14951593#comment-14951593 ]
Jack Hu commented on SPARK-6847: -------------------------------- Hi [~glyton.camilleri] You can check whether there are two dstreams in the DAG need to be checkpointed (updateStateByKey, reduceByKeyAndWindow), it yes, you can workaround this to use some output for the previous DStream which needs to checkpointed. {code} val d1 = input.updateStateByKey(func) val d2 = d1.map(...).updateStateByKey(func) d2.foreachRDD(rdd => print(rdd.count)) /// workaround the stack over flow listed in this JIRA d1.foreachRDD(rdd => rdd.foreach(_ => Unit)) {code} > Stack overflow on updateStateByKey which followed by a dstream with > checkpoint set > ---------------------------------------------------------------------------------- > > Key: SPARK-6847 > URL: https://issues.apache.org/jira/browse/SPARK-6847 > Project: Spark > Issue Type: Bug > Components: Streaming > Affects Versions: 1.3.0 > Reporter: Jack Hu > Labels: StackOverflowError, Streaming > > The issue happens with the following sample code: uses {{updateStateByKey}} > followed by a {{map}} with checkpoint interval 10 seconds > {code} > val sparkConf = new SparkConf().setAppName("test") > val streamingContext = new StreamingContext(sparkConf, Seconds(10)) > streamingContext.checkpoint("""checkpoint""") > val source = streamingContext.socketTextStream("localhost", 9999) > val updatedResult = source.map( > (1,_)).updateStateByKey( > (newlist : Seq[String], oldstate : Option[String]) => > newlist.headOption.orElse(oldstate)) > updatedResult.map(_._2) > .checkpoint(Seconds(10)) > .foreachRDD((rdd, t) => { > println("Deep: " + rdd.toDebugString.split("\n").length) > println(t.toString() + ": " + rdd.collect.length) > }) > streamingContext.start() > streamingContext.awaitTermination() > {code} > From the output, we can see that the dependency will be increasing time over > time, the {{updateStateByKey}} never get check-pointed, and finally, the > stack overflow will happen. > Note: > * The rdd in {{updatedResult.map(_._2)}} get check-pointed in this case, but > not the {{updateStateByKey}} > * If remove the {{checkpoint(Seconds(10))}} from the map result ( > {{updatedResult.map(_._2)}} ), the stack overflow will not happen -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org