[
https://issues.apache.org/jira/browse/FLINK-21642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tzu-Li (Gordon) Tai closed FLINK-21642.
---------------------------------------
Fix Version/s: statefun-3.0.0
Assignee: Igal Shilman
Resolution: Fixed
flink-statefun/master: d46a4511ecdc8ad6bf16d977b51d3ced85f403b4
> RequestReplyFunction recovery fails with a remote SDK
> -----------------------------------------------------
>
> Key: FLINK-21642
> URL: https://issues.apache.org/jira/browse/FLINK-21642
> Project: Flink
> Issue Type: Bug
> Components: Stateful Functions
> Reporter: Igal Shilman
> Assignee: Igal Shilman
> Priority: Major
> Labels: pull-request-available
> Fix For: statefun-3.0.0
>
>
> While extending our smoke e2e test to use the remote SDKs, I stumbled upon
> a bug in the RequestReplyFunction. We get an unknown state exception after
> recovery.
> The exact scenario that triggers the bug is:
> # There was a request in flight.
> # A failure occurs that causes the job to restart.
> # On restore, we start with no managed state.
> # But we try to re-send exactly the same ToFunction message to the SDK.
> # That ToFunction contains state definitions from the previous attempt
> (before the failure).
> # The SDK processes this message normally (the message contains all the
> state definitions that the SDK knows about).
> # The SDK responds with a state mutation.
> # PersistedRemoteFunctionValues fails with an unknown state exception.
>
> We need to treat the ToFunction message as a retryBatch, instead of
> re-sending it as-is.
>
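The scenario above can be sketched with a small toy model. This is only an illustration, not the StateFun implementation: every class and function name here (ManagedState, sdk_handle, resend_as_is, resend_as_retry_batch) is hypothetical. It also assumes that a "retry batch" means rebuilding the request from the state that is currently registered, and that the SDK can report which state definitions are missing from a request.

```python
# Hypothetical, simplified model of the recovery scenario.
# None of these names are real StateFun classes or APIs.

class UnknownStateError(Exception):
    pass


class ManagedState:
    """Stands in for the runtime's registry of known state definitions."""

    def __init__(self):
        self.values = {}  # state name -> value; empty right after a restore

    def register(self, names):
        for name in names:
            self.values.setdefault(name, None)

    def to_function(self, invocations):
        # Build a request that carries only the states registered right now.
        return {"states": sorted(self.values), "invocations": list(invocations)}

    def apply_mutations(self, mutations):
        for name, value in mutations.items():
            if name not in self.values:
                raise UnknownStateError(name)
            self.values[name] = value


def sdk_handle(request, declared=("seen",)):
    """Toy remote SDK: reports missing state definitions, else mutates state."""
    missing = [n for n in declared if n not in request["states"]]
    if missing:
        return {"incomplete_context": missing}  # assumed reply shape
    return {"mutations": {"seen": len(request["invocations"])}}


def resend_as_is(state, in_flight):
    # Buggy path: the stale request still lists states from before the
    # failure, so the SDK answers with a mutation the runtime cannot place.
    reply = sdk_handle(in_flight)
    state.apply_mutations(reply["mutations"])


def resend_as_retry_batch(state, in_flight):
    # Assumed fix: keep only the invocations and rebuild the request from
    # the state that is registered after the restore.
    request = state.to_function(in_flight["invocations"])
    reply = sdk_handle(request)
    if "incomplete_context" in reply:
        state.register(reply["incomplete_context"])
        request = state.to_function(in_flight["invocations"])
        reply = sdk_handle(request)
    state.apply_mutations(reply["mutations"])
```

In this model, calling resend_as_is with a freshly restored (empty) ManagedState raises UnknownStateError, mirroring the last step of the scenario, while resend_as_retry_batch first lets the runtime learn the state definitions and only then applies the mutation.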