Hello!

At Signifyd we use machine learning to protect our customers from credit card fraud. Efficiently calculating feature values for our models based on historical data is one of the primary challenges we face, and we’re meeting it with Flink.

We need our system to be highly available and to recover large per-TaskManager state quickly, even in the event of TaskManager failure. Sadly, judging by this thread <https://lists.apache.org/thread.html/r6880752fa918ca5515bd95a1656f1dd31c3547ed3a1f28741cc68391%40%3Cuser.flink.apache.org%3E>, task-local recovery isn’t currently supported in this scenario, even when TaskManagers are guaranteed to be using the same persistent storage.

That same thread also proposes a couple of solutions, the simplest of which is to persist the slot allocations of a TaskExecutor and use them to re-initialize the TaskExecutor on restart, so that it can offer its slots to the jobs it remembers.

We’d love to have this functionality sooner rather than later, so I’d like to know whether anyone experienced with Flink development is interested in implementing this on our behalf as a paid project and contributing it to mainline Flink. If so, please get in touch, making sure to CC nikhil.bys...@signifyd.com, nia.schm...@signifyd.com and myself.

Thanks!

Colman O'Sullivan
Software Engineering Manager, Link^ Team, Signifyd

^ The ‘f’ is silent ;)