Cool thnx Paris.
On Thu, May 19, 2016 at 9:48 PM, Paris Carbone wrote:
Sure, in practice you can set a threshold of retries, since an operator
implementation could cause this indefinitely, or some other reason could make
snapshotting generally infeasible. If I recall correctly, that threshold exists
in the Flink configuration.
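A retry threshold of this kind could look roughly like the sketch below. It is a plain-Python illustration, not Flink's API: `run_checkpoints`, `attempt` and `max_consecutive_failures` are hypothetical names, and the policy (give up after too many failures in a row) is an assumption about what such a threshold would do.

```python
from typing import Callable


def run_checkpoints(attempt: Callable[[int], bool],
                    rounds: int,
                    max_consecutive_failures: int) -> dict:
    """Trigger `rounds` checkpoint attempts (hypothetical helper, not Flink API).

    `attempt(epoch)` returns True when the snapshot for that epoch completed.
    The run is reported as failed once more than `max_consecutive_failures`
    attempts fail in a row, so the user is told instead of silently losing
    snapshotting.
    """
    consecutive = 0
    completed = []
    for epoch in range(1, rounds + 1):
        if attempt(epoch):
            completed.append(epoch)
            consecutive = 0  # a completed snapshot resets the counter
        else:
            consecutive += 1
            if consecutive > max_consecutive_failures:
                return {"status": "failed", "completed": completed,
                        "failed_at": epoch}
    return {"status": "ok", "completed": completed, "failed_at": None}
```

Isolated failures are tolerated here because a later complete snapshot supersedes them; only a sustained run of failures is surfaced.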
On 19 May 2016, at 20:42, Stavros
The problem here is different though: if something keeps failing
(permanently), in practice someone needs to be notified. If the user loses
snapshotting, he must know.
On Thu, May 19, 2016 at 9:36 PM, Abhishek R. Singh <
abhis...@tetrationanalytics.com> wrote:
> I was wondering how checkpoints
Hi Abhishek,
I don’t see the problem there (also this is unrelated to the snapshotting
protocol).
Intuitively, if you submit a copy of your state (full or delta) for a snapshot
version/epoch to a store backend and validate the full snapshot for that
version when you eventually receive the
I was wondering how checkpoints can be async? Because your state is constantly
mutating. You probably need versioned state, or immutable data structs?
-Abhishek-
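One way to picture an asynchronous checkpoint over mutating state is to take a short synchronous copy and persist that copy in the background. The sketch below assumes a full copy per epoch; `CountingOperator` and its in-memory `store` are hypothetical stand-ins, not Flink classes or a real store backend.

```python
import copy
import threading


class CountingOperator:
    """Toy operator that counts keys (hypothetical, for illustration only)."""

    def __init__(self):
        self.state = {}     # live, constantly mutating state
        self.store = {}     # epoch -> persisted snapshot (stand-in backend)
        self._writers = []  # background persistence threads

    def process(self, key):
        self.state[key] = self.state.get(key, 0) + 1

    def snapshot(self, epoch):
        # Synchronous part: freeze a copy of the state for this epoch.
        frozen = copy.deepcopy(self.state)
        # Asynchronous part: write the frozen copy while processing continues.
        writer = threading.Thread(target=self._persist, args=(epoch, frozen))
        writer.start()
        self._writers.append(writer)

    def _persist(self, epoch, frozen):
        self.store[epoch] = frozen

    def wait_for_writers(self):
        for writer in self._writers:
            writer.join()
```

Records processed after `snapshot(epoch)` returns only touch the live state, so the persisted version for that epoch stays consistent without blocking the operator for the duration of the write.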
Invalidations are not necessarily exposed (I hope). Think of it as implementing
TCP: you don’t have to warn the user that packets are lost, since eventually a
packet will be received at the other side in an eventually synchronous system.
Snapshots follow the same paradigm. Hope that helps.
On
Yes, that's what I was thinking, thnx. When people hear exactly once they
think: are you sure, there is something hidden there... because theory is
theory :)
So if you keep getting invalidated snapshots but data passes through
operators, you issue a warning or fail the pipeline and return an exception
In that case, typically a timeout invalidates the whole snapshot (all states
for the same epoch) until eventually we have a full complete snapshot.
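The confirm-or-invalidate behaviour described here can be sketched as a pending checkpoint that waits for an acknowledgement from every task and is dropped as a whole once its deadline passes. This is a simplified model with a caller-supplied logical clock; `PendingCheckpoint` and its fields are hypothetical names, not Flink's implementation.

```python
class PendingCheckpoint:
    """Tracks acknowledgements for one epoch (hypothetical sketch)."""

    def __init__(self, epoch, tasks, started_at, timeout):
        self.epoch = epoch
        self.waiting = set(tasks)          # tasks that have not acked yet
        self.deadline = started_at + timeout
        self.states = {}                   # task -> submitted state
        self.status = "pending"

    def expire_if_late(self, now):
        # A timeout invalidates the whole snapshot: every state collected
        # for this epoch is discarded, not only the slow task's.
        if self.status == "pending" and now > self.deadline:
            self.status = "invalidated"
            self.states.clear()
        return self.status

    def ack(self, task, state, now):
        self.expire_if_late(now)
        if self.status != "pending" or task not in self.waiting:
            return self.status
        self.waiting.discard(task)
        self.states[task] = state
        if not self.waiting:
            # Confirmed only when all parallel tasks have reported.
            self.status = "complete"
        return self.status
```

A slow or blocked task therefore never yields a partial checkpoint: the epoch either completes with all states or is invalidated, and a later epoch is attempted.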
On 19 May 2016, at 20:26, Stavros Kontopoulos wrote:
"Checkpoints are only confirmed
True, if you like formal modelling and stuff like that you can think of it as a
more relaxed/abortable operation (e.g. like abortable consensus) which yields
the same guarantees and works ok in semi-synchronous distributed systems (as in
the case of Flink).
On 19 May 2016, at 20:22, Stavros
"Checkpoints are only confirmed if all parallel subtasks successfully
created a valid snapshot of the state." as stated by Robert. So to rephrase
my question... how is it confirmed that all snapshots have finished, and
what happens if some task is very slow... or is blocked?
If you have N tasks
Hey, thnx for the links. There are assumptions though, like reliable
channels... since you rely on TCP in practice, and if a checkpoint fails or
is very slow then you need to deal with it. That's why I asked previously
what happens then.. 3pc does not need assumptions I think, but engineering
is
Regarding your last question,
If a checkpoint expires it just gets invalidated and a new complete checkpoint
will eventually occur that can be used for recovery. If I am wrong, or
something has changed please correct me.
Paris
On 19 May 2016, at 20:14, Paris Carbone
Hi Stavros,
Currently, rollback failure recovery in Flink works at the pipeline level, not
at the task level (see Millwheel [1]). It further builds on replayable stream
logs (e.g. Kafka); thus, there is no need for 3pc or backup in the pipeline
sources. You can also check this presentation [2]
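To illustrate why a replayable log removes the need for backup at the sources, here is a minimal single-task sketch: a checkpoint stores the input offset together with the state, and recovery restores both and re-reads the log from that offset. The `process` function, the list-based `store` and the `crash_at` switch are hypothetical; a real pipeline would restore all tasks to the same epoch.

```python
def process(log, store, crash_at=None, checkpoint_every=3):
    """Sum the records of a replayable log (hypothetical sketch).

    `store` is a list of (offset, total) checkpoints standing in for durable
    storage. Processing resumes from the latest checkpoint, if there is one.
    """
    offset, total = store[-1] if store else (0, 0)
    while offset < len(log):
        if offset == crash_at:
            raise RuntimeError("simulated task failure")
        total += log[offset]
        offset += 1
        if offset % checkpoint_every == 0:
            # Offset and state are saved together, so they always agree.
            store.append((offset, total))
    return total
```

After a failure, calling `process(log, store)` again rolls back to the last checkpoint and replays the records after its offset, so each record affects the final state exactly once even though some were read twice.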
Cool thnx. So if a checkpoint expires the pipeline will block or fail in
total or only the specific task related to the operator (running along with
the checkpoint task) or nothing happens?
On Tue, May 17, 2016 at 3:49 PM, Robert Metzger wrote:
Hi Stavros,
I haven't implemented our checkpointing mechanism and I didn't participate
in the design decisions while implementing it, so I cannot compare it in
detail to other approaches.
From a "does it work" perspective: Checkpoints are only confirmed if all
parallel subtasks successfully
Hi,
I was looking into the details of the Flink snapshotting algorithm, also
mentioned here:
http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
https://blog.acolyer.org/2015/08/19/asynchronous-distributed-snapshots-for-distributed-dataflows/