[ 
https://issues.apache.org/jira/browse/SPARK-24036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453209#comment-16453209
 ] 

Jungtaek Lim commented on SPARK-24036:
--------------------------------------

Maybe better to share what I've observed from continuous mode so far.
 * It leverages iterator hack to make logical batch (epoch) in stream.
 ** While iterator works different from normal, it doesn't touch existing 
operators by putting assumption that all operators are chained and fit to 
single stage.
 ** With this assumption, only WriteToContinuousDataSourceExec needs to know 
how to deal with iterator hack.
 ** Above assumption requires no repartition, which most of stateful operators 
need to deal with.
 * Based on the hack, actually it doesn't put epoch marker flow through 
downstreams.
 ** To apply distributed snapshot it is mandatory, but it might require 
non-trivial change of existing model, since checkpoint should be handled from 
each stateful operator and stored in distributed manner, and coordinator should 
be able to check snapshots from all tasks are taken correctly.
 ** This would be unnecessary change for batch, and making existing model being 
much complicated.
 ** This would bring latency concerns, since each operator should stop 
processing while taking a snapshot. (I guess sending or storing snapshot still 
could be done asynchronously.)
 ** If there're more than one upstreams, it should arrange sequences between 
upstreams to take a snapshot with only proper data within epoch.

So there is a huge challenge with existing model to extend continuous mode to 
support stateful exactly-once (not about end-to-end exactly once, since it also 
depends on sink), and I'd like to see the follow-up idea/design doc around 
continuous mode to see the direction of continuous mode: whether relying on 
such assumption and try to explore (may need to have more hacks/workarounds), 
or willing to discard assumption and redesign.

Most of features are supported with micro-batch manner, so also would like to 
see the goal of continuous mode. Is it to cover all or most of features being 
supported with micro-batch? Or is the goal of continuous mode only to cover low 
latency use cases?

 

> Stateful operators in continuous processing
> -------------------------------------------
>
>                 Key: SPARK-24036
>                 URL: https://issues.apache.org/jira/browse/SPARK-24036
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.4.0
>            Reporter: Jose Torres
>            Priority: Major
>
> The first iteration of continuous processing in Spark 2.3 does not work with 
> stateful operators.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to