Hi all,

I’m running a complex (batch) workflow that has a step where it trains fastText models.
This is very CPU-intensive, to the point where it will use all available processing power on a server. The Flink configuration I’m using is one TaskManager per server, with N slots == available cores.

So what I’d like to do is ensure that if I have N of these training operators running in parallel on N TaskManagers, slot assignment happens such that each TM has one such operator.

Unfortunately, what typically happens now is that most/all of these operators get assigned to the same TM, which then struggles to stay alive under that load.

I haven’t seen any solution to this, though I can imagine some helicopter stunts that could work around the issue.

Any suggestions?

Thanks,

-- Ken

PS - I took a look through the list of FLIPs <https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals>, and didn’t see anything that covered this. I imagine it would need to be something like YARN’s support for per-node vCore capacity and per-task vCore requirements, but on a per-TM/per-operator basis.

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra