[ 
https://issues.apache.org/jira/browse/FLINK-21642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tzu-Li (Gordon) Tai closed FLINK-21642.
---------------------------------------
    Fix Version/s: statefun-3.0.0
         Assignee: Igal Shilman
       Resolution: Fixed

flink-statefun/master: d46a4511ecdc8ad6bf16d977b51d3ced85f403b4

> RequestReplyFunction recovery fails with a remote SDK
> -----------------------------------------------------
>
>                 Key: FLINK-21642
>                 URL: https://issues.apache.org/jira/browse/FLINK-21642
>             Project: Flink
>          Issue Type: Bug
>          Components: Stateful Functions
>            Reporter: Igal Shilman
>            Assignee: Igal Shilman
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: statefun-3.0.0
>
>
> While extending our smoke e2e test to use the remote SDKs, I stumbled upon 
> a bug in the RequestReplyFunction: we get an unknown state exception after 
> recovery.
> The exact scenario that triggers this bug is:
>  # There is a request in flight.
>  # A failure occurs that causes the job to restart.
>  # On restore, we start with no managed state.
>  # But we try to re-send exactly the same ToFunction message to the SDK.
>  # That ToFunction contains state definitions from the previous attempt 
> (before the failure).
>  # The SDK processes this message normally (it has all the state definitions 
> that it knows).
>  # The SDK responds with a state mutation.
>  # The PersistedRemoteFunctionValues fails with unknown state.
>  
> We need to treat the in-flight ToFunction message as a retryBatch, instead of 
> re-sending it as-is.
>  
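
The scenario above can be reproduced with a small, self-contained model. This is only a sketch under stated assumptions, not the actual fix from the commit: the ToFunction, ManagedState, sdkHandle, resendAsIs and retryBatch names below are hypothetical simplifications, and the "SDK reports a missing state, the runtime registers it and retries" handshake is assumed for illustration. It shows why re-sending the stale message fails against empty restored state, and how rebuilding the message from the restored state avoids the failure.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class RetryBatchSketch {

    /** Hypothetical, simplified ToFunction: attached state names plus batched invocations. */
    static final class ToFunction {
        final Set<String> stateNames;
        final List<String> invocations;

        ToFunction(Set<String> stateNames, List<String> invocations) {
            this.stateNames = new HashSet<>(stateNames);
            this.invocations = new ArrayList<>(invocations);
        }
    }

    /** Hypothetical stand-in for the runtime-side registered state values. */
    static final class ManagedState {
        private final Map<String, String> values = new HashMap<>();

        void register(String name) { values.putIfAbsent(name, ""); }

        Set<String> names() { return new HashSet<>(values.keySet()); }

        String get(String name) { return values.get(name); }

        void mutate(String name, String value) {
            if (!values.containsKey(name)) {
                throw new IllegalStateException("unknown state: " + name);
            }
            values.put(name, value);
        }
    }

    /** Simulated SDK: returns the state it mutates, or null if the message lacks that state. */
    static String sdkHandle(ToFunction message, String neededState) {
        return message.stateNames.contains(neededState) ? neededState : null;
    }

    /** Problematic recovery: the in-flight message is re-sent unchanged. */
    static void resendAsIs(ToFunction inFlight, ManagedState restored, String neededState) {
        // The stale message still carries the old state definitions, so the SDK answers
        // with a mutation for a state the restored runtime has never registered.
        String mutated = sdkHandle(inFlight, neededState);
        restored.mutate(mutated, "1");
    }

    /** Retry-batch recovery: keep the invocations, rebuild the state part from restored state. */
    static void retryBatch(ToFunction inFlight, ManagedState restored, String neededState) {
        ToFunction retry = new ToFunction(restored.names(), inFlight.invocations);
        String mutated = sdkHandle(retry, neededState);
        if (mutated == null) {
            // Assumed handshake: the SDK reports the missing state, the runtime registers it
            // and sends the batch again with the updated state definitions.
            restored.register(neededState);
            retry = new ToFunction(restored.names(), inFlight.invocations);
            mutated = sdkHandle(retry, neededState);
        }
        restored.mutate(mutated, "1");
    }

    public static void main(String[] args) {
        Set<String> oldStates = new HashSet<>();
        oldStates.add("seen");
        List<String> batch = new ArrayList<>();
        batch.add("hello");
        ToFunction inFlight = new ToFunction(oldStates, batch);

        try {
            resendAsIs(inFlight, new ManagedState(), "seen");
        } catch (IllegalStateException e) {
            System.out.println("as-is resend failed: " + e.getMessage());
        }

        ManagedState restored = new ManagedState();
        retryBatch(inFlight, restored, "seen");
        System.out.println("retry batch applied, seen=" + restored.get("seen"));
    }
}
```

In this model the only difference between the two paths is where the state definitions in the outgoing message come from: the stale in-flight message, or the state the runtime actually holds after restore.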



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
