Re: [Discuss] Semantics of event time for state TTL

Kostas Kloudas Mon, 08 Apr 2019 03:04:28 -0700

Hi all,

For GDPR: I am not sure about the regulatory requirements of GDPR but I
would assume that the time for deletion starts counting from the time an
organisation received the data (i.e. the wall-clock ingestion time of the
data), and not the "event time" of the data. In other case, an organisaton
may be violating GDPR by just receiving e.g. 1 year old data of a user
whole deletion policy is "you are allowed to keep them for 6 months".


Now for the discussion in this thread, I think that the scenario:

* Timestamp stored: Event timestamp
* Timestamp to check expiration: Processing Time

has the underlying assumption that there is a relationship between
event-time and processing time, which is not necessarily the case.
Event-time, although we call it "time", is just another user-defined column
or attribute of the data and can be anything. It is not an "objective" and
independently evolving attribute like wall-clock time. I am not sure what
could be the solution, as out-of-orderness can always lead to arbitrary,
non-reproducible and difficult to debug behaviour (e.g. a super-early
element that arrives out-of-order and, as the succeeding elements set the
timestamp to lower values, it gets deleted by the state backend, although
the user-level windowing logic would expect it to be there).

Given that last point made above, and apart from the semantics of the
proposed feature, I think that we should also discuss if it is a good idea
to have event time TTL implemented in state backend level in the first
place. Personally, I am not so convinced that this is a good idea, as we
introduce another (potentially competing) mechanism for handling event
time, apart from the user program. An example can be the one that I
described above. And this also defeats one of the main advantages of event
time, in my opinion, which is reproducability of the results.

I may be wrong, but I would appreciate any opinions on this.

Cheers,
Kostas

On Mon, Apr 8, 2019 at 11:12 AM Aljoscha Krettek <aljos...@apache.org>
wrote:

> Oh boy, this is an interesting pickle.
>
> For *last-access-timestamp*, I think only *event-time-of-current-record*
> makes sense. I’m looking at this from a GDPR/regulatory compliance
> perspective. If you update a state, by say storing the event you just
> received in state, you want to use the exact timestamp of that event to to
> expiration. Both *max-timestamp-of-data-seen-so-far* and *last-watermark*
> suffer from problems in edge cases: if the timestamp of an event you
> receive is quite a bit earlier than other timestamps that we have seen so
> far (i.e. the event is late) we would artificially lengthen the TTL of that
> event (which is stored in state) and would therefore break regulatory
> requirements. Always using the timestamp of an event doesn’t suffer from
> that problem.
>
> For *expiration-check-time*, both *last-watermark* and
> *current-processing-time* could make sense but I’m leaning towards
> *processing-time*. The reason is again the GDPR/compliance view: if we have
> an old savepoint with data that should have been expired by now but we
> re-process it with *last-watermark* expiration, this means that we will get
> to “see” that state even though we shouldn’t allowed to be. If we use
> *current-processing-time* for expiration, we wouldn’t have that problem
> because that old data (according to their event-time timestamp) would be
> properly cleaned up and access would be prevented.
>
> To sum up:
> last-access-timestamp: event-time of event
> expiration-check-time: processing-time
>
> What do you think?
>
> Aljoscha
>
> > On 6. Apr 2019, at 01:30, Konstantin Knauf <konstan...@ververica.com>
> wrote:
> >
> > Hi Andrey,
> >
> > I agree with Elias. This would be the most natural behavior. I wouldn't
> add
> > additional slightly different notions of time to Flink.
> >
> > As I can also see a use case for the combination
> >
> > * Timestamp stored: Event timestamp
> > * Timestamp to check expiration: Processing Time
> >
> > we could (maybe in a second step) add the possibility to mix and match
> time
> > characteristics for both aspects.
> >
> > Cheers,
> >
> > Konstantin
> >
> > On Thu, Apr 4, 2019 at 7:59 PM Elias Levy <fearsome.lucid...@gmail.com>
> > wrote:
> >
> >> My 2c:
> >>
> >> Timestamp stored with the state value: Event timestamp
> >> Timestamp used to check expiration: Last emitted watermark
> >>
> >> That follows the event time processing model used elsewhere is Flink.
> >> E.g. events are segregated into windows based on their event time, but
> the
> >> windows do not fire until the watermark advances past the end of the
> window.
> >>
> >>
> >> On Thu, Apr 4, 2019 at 7:55 AM Andrey Zagrebin <and...@ververica.com>
> >> wrote:
> >>
> >>> Hi All,
> >>>
> >>> As you might have already seen there is an effort tracked in
> FLINK-12005
> >>> [1] to support event time scale for state with time-to-live (TTL) [2].
> >>> While thinking about design, we realised that there can be multiple
> >>> options
> >>> for semantics of this feature, depending on use case. There is also
> >>> sometimes confusion because of event time out-of-order nature in
> Flink. I
> >>> am starting this thread to discuss potential use cases of this feature
> and
> >>> their requirements for interested users and developers. There was
> already
> >>> discussion thread asking about event time for TTL and it already
> contains
> >>> some thoughts [3].
> >>>
> >>> There are two semantical cases where we use time for TTL feature at the
> >>> moment. Firstly, we store timestamp of state last access/update.
> Secondly,
> >>> we use this timestamp and current timestamp to check expiration and
> >>> garbage
> >>> collect state at some point later.
> >>>
> >>> At the moment, Flink supports *only processing time* for both
> timestamps:
> >>> state *last access and current timestamp*. It is basically current
> local
> >>> system unix epoch time.
> >>>
> >>> When it comes to event time scale, we also need to define what Flink
> >>> should
> >>> use for these two timestamps. Here I will list some options and their
> >>> possible pros&cons for discussion. There might be more depending on use
> >>> case.
> >>>
> >>> *Last access timestamp (stored in backend with the actual state
> value):*
> >>>
> >>>   - *Event timestamp of currently being processed record.* This seems
> to
> >>>   be the simplest option and it allows user-defined timestamps in state
> >>>   backend. The problem here might be instability of event time which
> can
> >>> not
> >>>   only increase but also decrease if records come out of order. This
> can
> >>> lead
> >>>   to rewriting the state timestamp to smaller value which is unnatural
> >>> for
> >>>   the notion of time.
> >>>   - *Max event timestamp of records seen so far for this record key.*
> >>> This
> >>>   option is similar to the previous one but it tries to fix the notion
> of
> >>>   time to make it always increasing. Maintaining this timestamp has
> also
> >>>   performance implications because the previous timestamp needs to be
> >>> read
> >>>   out to decide whether to rewrite it.
> >>>   - *Last emitted watermark*. This is what we usually use for other
> >>>   operations to trigger some actions in Flink, like timers and windows
> >>> but it
> >>>   can be unrelated to the record which actually triggers the state
> >>> update.
> >>>
> >>> *Current timestamp to check expiration:*
> >>>
> >>>   - *Event timestamp of last processed record.* Again quite simple but
> >>>   unpredictable option for out-of-order events. It can potentially lead
> >>> to
> >>>   undesirable expiration of late buffered data in state without
> control.
> >>>   - *Max event timestamp of records seen so far for operator backend.*
> >>> Again
> >>>   similar to previous one, more stable but still user does not have too
> >>> much
> >>>   control when to expire state.
> >>>   - *Last emitted watermark*. Again, this is what we usually use for
> >>> other
> >>>   operations to trigger some actions in Flink, like timers and windows.
> >>> It
> >>>   also gives user some control to decide when state is expired (up to
> >>> which
> >>>   point in event time) by emitting certain watermark. It is more
> >>> flexible but
> >>>   complicated. If some watermark emitting strategy is already used for
> >>> other
> >>>   operations, it might be not optimal for TTL and delay state cleanup.
> >>>   - *Current processing time.* This option is quite simple, It would
> mean
> >>>   that user just decides which timestamp to store but it will expire in
> >>> real
> >>>   time. For data privacy use case, it might be better because we want
> >>> state
> >>>   to be unavailable in particular real moment of time since the
> >>> associated
> >>>   piece of data was created in event time. For long term approximate
> >>> garbage
> >>>   collection, it might be not a problem as well. For quick expiration,
> >>> the
> >>>   time skew between event and processing time can lead again to
> premature
> >>>   deletion of late data and user cannot delay it.
> >>>
> >>> We could also make this behaviour configurable. Another option is to
> make
> >>> time provider pluggable for users. The interface can give users context
> >>> (currently processed record, watermark etc) and ask them which
> timestamp
> >>> to
> >>> use. This is more complicated though.
> >>>
> >>> Looking forward for your feedback.
> >>>
> >>> Best,
> >>> Andrey
> >>>
> >>> [1] https://issues.apache.org/jira/browse/FLINK-12005
> >>> [2]
> >>>
> >>>
> https://docs.google.com/document/d/1SI_WoXAfOd4_NKpGyk4yh3mf59g12pSGNXRtNFi-tgM
> >>> [3]
> >>>
> >>>
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/State-TTL-in-Flink-1-6-0-td22509.html
> >>>
> >>
> >
> > --
> >
> > Konstantin Knauf | Solutions Architect
> >
> > +49 160 91394525
> >
> > <https://www.ververica.com/>
> >
> > Follow us @VervericaData
> >
> > --
> >
> > Join Flink Forward <https://flink-forward.org/> - The Apache Flink
> > Conference
> >
> > Stream Processing | Event Driven | Real Time
> >
> > --
> >
> > Data Artisans GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
> >
> > --
> > Data Artisans GmbH
> > Registered at Amtsgericht Charlottenburg: HRB 158244 B
> > Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen
>
>

Re: [Discuss] Semantics of event time for state TTL

Reply via email to