Implement task local recovery on TaskManager restart for Signifyd

Colman OSullivan Wed, 04 Aug 2021 20:05:25 -0700

Hello!

At Signifyd we use machine learning to protect our customers from credit
card fraud. Efficiently calculating feature values for our models based on
historical data is one of the primary challenges we face, and we’re meeting
it with Flink.


We need our system to be highly available and to recover large
per-TaskManager state quickly, even when a TaskManager fails. Sadly,
judging by this thread
<https://lists.apache.org/thread.html/r6880752fa918ca5515bd95a1656f1dd31c3547ed3a1f28741cc68391%40%3Cuser.flink.apache.org%3E>,
task-local recovery isn’t currently supported when a TaskManager restarts,
even when TaskManagers are guaranteed to be using the same persistent storage.

That same thread also proposes a couple of solutions, the simplest of which
is to persist the slot allocations of a TaskExecutor and use them to
re-initialize the TaskExecutor on restart, so that it can offer its slots to
the jobs it remembers.
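
To make that idea concrete, here is a rough, language-neutral sketch in
Python of what "persist the slot allocations and reload them on restart"
could look like. None of this is Flink's actual API: SlotAllocation,
SlotAllocationStore and slots_to_offer are hypothetical names invented for
illustration, and a real implementation would live in Flink's Java runtime.

```python
import json
import os
import tempfile
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SlotAllocation:
    """Hypothetical record of one slot assignment (not a Flink class)."""
    slot_index: int
    job_id: str
    allocation_id: str


class SlotAllocationStore:
    """Hypothetical file-backed store on the TaskManager's persistent volume."""

    def __init__(self, path):
        self.path = path

    def save(self, allocations):
        # Write to a temp file, then rename, so a crash mid-write
        # never leaves a truncated snapshot behind.
        directory = os.path.dirname(self.path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump([asdict(a) for a in allocations], f)
        os.replace(tmp_path, self.path)

    def load(self):
        # A missing file means a fresh start with nothing to recover.
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return [SlotAllocation(**entry) for entry in json.load(f)]


def slots_to_offer(store):
    """On restart, group the remembered slots by the job they belonged to."""
    offers = {}
    for allocation in store.load():
        offers.setdefault(allocation.job_id, []).append(allocation.slot_index)
    return offers
```

The assumption here is that the snapshot file sits on the same persistent
storage as the local state, so a restarted process can find both and offer
each remembered slot back to the job that previously held it.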

We’d love to have this functionality sooner rather than later, so I’d like
to know whether anyone experienced with Flink development would be
interested in implementing this on our behalf as a paid project and
contributing it to mainline Flink.

If so, please get in touch, making sure to CC nikhil.bys...@signifyd.com,
nia.schm...@signifyd.com and myself.

Thanks!

Colman O'Sullivan

Software Engineering Manager,

Link^ Team, Signifyd.

^ The ‘f’ is silent ;)
