Hi Yu,

bq* I thought the FLIP aims at resolving some *existing* confusion, i.e.
the durability mystery to users.

I think it might help to highlight specific stumbling blocks users have
today and why I believe this change addresses those issues. Some frequent
things I've heard over the past several years include:

1) "We use RocksDB because we don't need fault tolerance."
2) "We don't use RocksDB because we don't want to manage an external
database."
3) Believing RocksDB is reading and writing directly with S3 or HDFS (vs.
local disk)
4) Believing FsStateBackend spills to disk or has anything to do with the
local filesystem
5) Pointing RocksDB at network-attached storage, believing that the state
backend needs to be fault-tolerant

This question from the ml is very representative of where users are
struggling[1]. Many of these questions were not from new users but from
organizations that were in production! Just yesterday I was on the phone
with a company that didn't realize they were in production without
checkpointing; honestly, you would be shocked how often this happens. The
current state backend abstraction is to complex for many of our users. What
all these questions have in common is misunderstanding the relationship
between how data is stored locally on TMs vs how checkpoints make that
state durable.

The FLIP aims actively help users by allowing them to reason about state
backends separately from checkpoint durability. In the future, a state
backend only defines where and how state is stored locally on the TM while
checkpoint storage defines where and how checkpoints are stored for
recovery. To be concrete I think the JavaDoc for setCheckpointStorage would
be something like:

```java
/**
 * CheckpointStorage defines how checkpoint snapshots are persisted for
fault tolerance
*. Various implementations  store their checkpoints in different fashions
and have different requirements and
 * availability guarantees.
 *
 *<p>For example, JobManagerCheckpointStorage stores checkpoints in the
memory of the JobManager.
 * It is lightweight and without additional dependencies but is not highly
available
 * and only supports small state sizes. This checkpoint storage policy is
convenient for
 * local testing and development.
 *
 *<p>FileSystemCheckpointStorage stores checkpoints in a filesystem. For
systems like
 * HDFS, NFS Drives, S3, and GCS, this storage policy supports large state
size,
 * in the magnitude of many terabytes while providing a highly available
foundation
 * for stateful applications. This checkpoint storage policy is recommended
for most
 * production deployments.
 */
void setCheckpointStorage(CheckpointStorage storage) {}
```

Seth

[1]
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/State-Storage-Questions-td37919.html
[2] Also naming, but we're aligned here

On Thu, Sep 24, 2020 at 10:24 AM Yu Li <car...@gmail.com> wrote:

> And to make it clear, I'm +1 on the idea of decoupling state backends with
> checkpointing. I don't have any question about making it clear that
> heap/RocksDB is where we serve the routine state read/write and where to
> put the checkpoint data is another story. My only concern lies in the newly
> introduced setCheckpointStorage API and how we define its semantics, and
> not sure whether it's due to my ignorance.
>
> Best Regards,
> Yu
>
>
> On Thu, 24 Sep 2020 at 23:11, Yu Li <car...@gmail.com> wrote:
>
> > *bq. What new confusion would be introduced here?*
> > No *new* confusion introduced, but as mentioned at the very beginning of
> > the motivation ("Apache Flink's durability story is a mystery to many
> > users"), I thought the FLIP aims at resolving some *existing*
> > confusions, i.e. the durability mystery to users.
> >
> > For me, I'm not 100% clear about how to write the javadoc of the
> > setCheckpointStorage API. Would it be like "specify where the checkpoint
> > data is stored"? If so, do we need to explain the fact that when a
> > checkpoint path is given, JM will also persist the checkpoint data to
> DFS?
> > It's true that such confusion also exists today, but would the
> introduction
> > of the new API expose it further?
> >
> > IMHO we need to document the newly introduced API / classes and their
> > semantics clearly in the FLIP to make sure everyone is on the same page,
> > but if we feel such work / discussions are all details and only need to
> > happen at the documenting and release note phase, it's also fine to me.
> >
> > And if I'm the only one who has such questions / concerns on the new
> > `setCheckpointStorage` API and most of others feel its semantic is sound
> > and clear, then please just ignore me and move on.
> >
> > Thanks.
> >
> > Best Regards,
> > Yu
> >
> >
> > On Wed, 23 Sep 2020 at 17:08, Stephan Ewen <se...@apache.org> wrote:
> >
> >> I am confused now with the concerns here. This is very much from the
> user
> >> perspective (which is partially also the developer perspective which is
> >> the
> >> sign of an intuitive abstraction).
> >>
> >> Of course, there will be docs describing what JMCheckpointStorage and
> >> FsCheckpointStorage are.
> >> And having release notes that describe that
> >> RocksDBStateBackend("s3://...")
> >> now corresponds to a combination of "RocksDBBackend" and
> >> "FsCheckpointStorage" is also straightforward.
> >>
> >> We said to keep the old RocksDBStateBackend class and let it implement
> >> both
> >> interfaces such that the old code still works exactly as before.
> >>
> >> What new confusion would be introduced here?
> >> Understanding the difference between JMCheckpointStorage and
> >> FsCheckpointStorage was always necessary when one needed to understand
> the
> >> difference between MemoryStateBackend and FsStateBackend. It should be
> >> easier to define this after this change, because it is the only thing
> that
> >> we describe when explaining what checkpoint storage to use (rather than
> >> also having the choice of index structure coupled to that).
> >>
> >>
> >> On Wed, Sep 23, 2020 at 10:39 AM Aljoscha Krettek <aljos...@apache.org>
> >> wrote:
> >>
> >> > On 23.09.20 04:40, Yu Li wrote:
> >> > > To be specific, with the old API users don't need to set checkpoint
> >> > > storage, instead they only need to pass the checkpoint path w/o
> caring
> >> > > about the storage. The new APIs are forcing users to set the storage
> >> so
> >> > > they have to know the difference between different storages. It's
> not
> >> an
> >> > > implementation change, but an API change that users have to
> understand
> >> > and
> >> > > follow-up.
> >> >
> >> > I think the main point of the FLIP is to make it more obvious to users
> >> > what is happening.
> >> >
> >> > With current Flink, they would do a `setStateBackend(new
> >> > FsStateBackend(<path>))`. What the user is actually "saying" with this
> >> > is: I want to keep state on heap but store checkpoints in DFS. They
> are
> >> > not actually changing the "State Backend", the thing that keeps state
> in
> >> > operators, but only where state is checkpointed. The thing that is
> used
> >> > for local state storage in operators is still the "Heap Backend".
> >> >
> >> > With the proposed FLIP, a user would do a `setCheckpointStorage(new
> >> > FsStorage(<path>))`. Which makes it obvious that they're changing
> where
> >> > checkpoints are stored but not the actual "State Backend", which is
> >> > still "Heap Backend" (the default).
> >> >
> >> > I do understand Yu's point, though, that this will be confusing for
> >> > current Flink users. They are used to setting a "State Backend"
> if/when
> >> > they want to change the storage location. To fit the new model they
> >> > would have to change the call from `setStateBackend()` to
> >> > `setCheckpointStorage()`.
> >> >
> >> > I think we need to life with this short-term confusion because in the
> >> > long run the proposed split between checkpoint location and state
> >> > backend makes sense and will be more straightforward for users to
> >> > understand.
> >> >
> >> > Best,
> >> > Aljoscha
> >> >
> >> >
> >>
> >
>

Reply via email to