[ https://issues.apache.org/jira/browse/SPARK-24036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453209#comment-16453209 ]
Jungtaek Lim commented on SPARK-24036:
--------------------------------------

It may help to share what I've observed about continuous mode so far.

* It relies on an iterator hack to form logical batches (epochs) within the stream.
** Although the iterator behaves differently from a normal one, existing operators are left untouched, on the assumption that all operators are chained and fit into a single stage.
** Under this assumption, only WriteToContinuousDataSourceExec needs to know how to deal with the iterator hack.
** The assumption rules out repartitioning, which most stateful operators require.
* Because of the hack, epoch markers do not actually flow through downstream operators.
** Markers flowing downstream are mandatory for applying distributed snapshots, but that might require a non-trivial change to the existing model: each stateful operator would have to handle its own checkpoint and store it in a distributed manner, and the coordinator would have to verify that snapshots from all tasks were taken correctly.
** This change would be unnecessary for batch and would make the existing model much more complicated.
** It would raise latency concerns, since each operator has to stop processing while taking a snapshot. (I guess sending or storing the snapshot could still be done asynchronously.)
** If an operator has more than one upstream, it has to align the sequences from its upstreams so that the snapshot contains only data belonging to the epoch.

So extending continuous mode to support stateful exactly-once is a huge challenge with the existing model (I don't mean end-to-end exactly-once, since that also depends on the sink). I'd like to see a follow-up idea or design doc showing the direction of continuous mode: whether to keep relying on this assumption and explore further (which may need more hacks and workarounds), or to discard the assumption and redesign.

Most features are already supported in micro-batch mode, so I would also like to know the goal of continuous mode. Is it to cover all or most of the features supported in micro-batch mode?
Or is the goal of continuous mode only to cover low latency use cases?

> Stateful operators in continuous processing
> -------------------------------------------
>
>                 Key: SPARK-24036
>                 URL: https://issues.apache.org/jira/browse/SPARK-24036
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.4.0
>            Reporter: Jose Torres
>            Priority: Major
>
> The first iteration of continuous processing in Spark 2.3 does not work with
> stateful operators.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org