[
https://issues.apache.org/jira/browse/FLINK-6633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16056390#comment-16056390
]
Stefan Richter commented on FLINK-6633:
---------------------------------------
Thanks for reporting this. First, let me clarify that occasional re-uploads of
already existing sst files are expected. This happens when the previous
checkpoint was not yet confirmed to the backend: the next checkpoint then
cannot reference sst files from the unconfirmed predecessor. However, in such
a case the {{SharedStateRegistry}} de-duplicates against the original file if
it actually got confirmed after all, and only the first registered copy of an
sst file survives.
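To illustrate the de-duplication described above, here is a minimal sketch in
Java. It is not the actual {{SharedStateRegistry}} code: the class
{{SketchRegistry}}, its {{register}} method and the example paths are all
hypothetical, and I assume that a shared file is identified by a plain string
key.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for illustration only;
// this is NOT Flink's SharedStateRegistry.
class SketchRegistry {

    // Key (assumed here: the sst file name) -> path of the first registered copy.
    private final Map<String, String> firstCopy = new HashMap<>();

    /**
     * Registers an uploaded file under the given key and returns the path
     * that the checkpoint should reference from now on.
     */
    String register(String key, String uploadedPath) {
        String existing = firstCopy.putIfAbsent(key, uploadedPath);
        if (existing == null) {
            // First copy seen for this key: it becomes the surviving original.
            return uploadedPath;
        }
        // A copy is already registered: the re-upload is a duplicate, so the
        // caller can discard uploadedPath and reference the original instead.
        return existing;
    }
}
```

With this sketch, registering {{000027.sst}} a second time from a later
checkpoint returns the path of the first upload, so the re-uploaded copy can
be dropped.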
Placeholders in the serializer are a true bug. I think it would be very helpful
if you could log a bit more. In particular, all interactions with the
{{SharedStateRegistry}} are relevant, most importantly those that happen when
the externalized checkpoint is loaded and re-registered with the registry
after restart. At all times, the registry should contain only real files and
never placeholders, because part of its purpose is to replace placeholders
with their originals. You could introduce a precondition for that and see if
it is ever violated.
I would expect a violation, because 000027.sst was detected as a new file to
upload by the backend, so the only way it could become a placeholder is if,
for any reason, a placeholder got registered and was mistakenly used for file
de-duplication against a non-duplicate file (000027.sst). Can you provide a log
that contains: triggered checkpoints, register/unregister interactions with the
registry (inputs and result), completed checkpoints as received by the backend,
the files that were written for the externalized checkpoints and the state of
the shared registry after the restores? That would be very helpful to track
down this problem.
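The precondition and the logging of registry interactions suggested above
could look roughly like the following Java sketch. All names in it
({{Handle}}, {{FileHandle}}, {{PlaceholderHandle}}, {{CheckedRegistry}}) are
hypothetical stand-ins and not Flink classes; the sketch only assumes that a
placeholder is resolved against an already registered original with the same
key.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal handle types, for illustration only.
interface Handle {}

final class FileHandle implements Handle {
    final String path;
    FileHandle(String path) { this.path = path; }
}

final class PlaceholderHandle implements Handle {}

// Hypothetical registry sketch; NOT Flink's SharedStateRegistry.
class CheckedRegistry {

    private final Map<String, Handle> entries = new HashMap<>();

    /** Registers a handle and returns the handle the checkpoint should use. */
    Handle register(String key, Handle handle) {
        Handle result = entries.get(key);
        if (result == null) {
            // Precondition: only real files may be stored. A placeholder
            // without a registered original would otherwise end up in the
            // registry and later be used for de-duplication.
            if (handle instanceof PlaceholderHandle) {
                throw new IllegalStateException(
                    "Placeholder registered without original, key=" + key);
            }
            entries.put(key, handle);
            result = handle;
        }
        // Log inputs and result of every interaction with the registry.
        System.out.println("register key=" + key
            + " input=" + handle.getClass().getSimpleName()
            + " result=" + result.getClass().getSimpleName());
        return result;
    }
}
```

If the exception ever fires after restoring from the externalized checkpoint,
the log line just before it shows which key was registered as a placeholder
without an original.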
> Register with shared state registry before adding to CompletedCheckpointStore
> -----------------------------------------------------------------------------
>
> Key: FLINK-6633
> URL: https://issues.apache.org/jira/browse/FLINK-6633
> Project: Flink
> Issue Type: Sub-task
> Components: State Backends, Checkpointing
> Affects Versions: 1.3.0
> Reporter: Stefan Richter
> Assignee: Stefan Richter
> Priority: Blocker
> Fix For: 1.3.0
>
>
> Introducing placeholders for previously existing shared state requires a
> change so that shared state is first registered with the
> {{SharedStateRegistry}} (thereby being consolidated) and only after that
> added to a {{CompletedCheckpointStore}}, so that the consolidated checkpoint
> is written to stable storage.