Hi Devs,
@Users: I'm cc'ing the user ML to see if there are any users that are
relying on this feature. Please comment here if that is the case.
I'd like to discuss the deprecation and eventual removal of UnionList
Operator State, aka Operator State with Union Redistribution. If you
don't know what I'm talking about you can take a look in the
documentation: [1]. It's not documented thoroughly because it started
out as mostly an internal feature.
The immediate main reason for removing this is also mentioned in the
documentation: "Do not use this feature if your list may have high
cardinality. Checkpoint metadata will store an offset to each list
entry, which could lead to RPC framesize or out-of-memory errors." The
insidious part of this limitation is that you will only notice that
there is a problem when it is too late. Checkpointing will still work
and a program can continue when the state size is too big. The system
will only fail when trying to restore from a snapshot that has union
state that is too big. This could be fixed by working around that issue
but I think there are more long-term issues with this type of state.
I think we need to deprecate and remove API for state that is not tied
to a key. Keyed state is easy to reason about, the system can
re-partition state and also re-partition records and therefore scale the
system in and out. Operator state, on the other hand is not tied to a
key but an operator. This is a more "physical" concept, if you will,
that potentially ties business logic closer to the underlying runtime
execution model, which in turns means less degrees of freedom for the
framework, that is Flink. This is future work, though, but we should
start with deprecating union list state because it is the potentially
most dangerous type of state.
We currently use this state type internally in at least the
StreamingFileSink, FlinkKafkaConsumer, and FlinkKafkaProducer. However,
we're in the process of hopefully getting rid of it there with our work
on sources and sinks. Before we fully remove it, we should of course
signal this to users by deprecating it.
What do you think?
Best,
Aljoscha