Hi all,

I’m running a complex (batch) workflow that includes a step where it trains
fastText models.

This is very CPU-intensive, to the point where it will use all available 
processing power on a server.

The Flink configuration I’m using is one TaskManager (TM) per server, with the
number of slots per TM equal to the number of available cores.

So what I’d like to do is ensure that if I have N of these training operators 
running in parallel on N TaskManagers, slot assignment happens such that each 
TM has one such operator.

Unfortunately, what typically happens now is that most/all of these operators 
get assigned to the same TM, which then struggles to stay alive under that load.
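
To make the problem concrete, here is a toy Python model contrasting the two
placements. It is only a sketch and not Flink's actual scheduler: the function
names and the 4-TM/8-slot numbers are made up for illustration.

```python
from collections import Counter

def assign_packed(num_tasks, num_tms, slots_per_tm):
    """Toy model of what I see: fill one TM's slots before using the next TM."""
    assert num_tasks <= num_tms * slots_per_tm, "not enough slots"
    return [task // slots_per_tm for task in range(num_tasks)]

def assign_spread(num_tasks, num_tms, slots_per_tm):
    """Toy model of what I want: round-robin the tasks across the TMs."""
    assert num_tasks <= num_tms * slots_per_tm, "not enough slots"
    return [task % num_tms for task in range(num_tasks)]

def max_trainers_per_tm(assignment):
    """Largest number of training operators that land on any single TM."""
    return max(Counter(assignment).values())

# Hypothetical cluster: 4 TMs with 8 slots each, 4 training operators.
num_tms, slots_per_tm = 4, 8
packed = assign_packed(num_tms, num_tms, slots_per_tm)
spread = assign_spread(num_tms, num_tms, slots_per_tm)

print(max_trainers_per_tm(packed))  # 4: every trainer on the same TM
print(max_trainers_per_tm(spread))  # 1: one trainer per TM
```

With the packed placement, one TM ends up hosting all four CPU-bound trainers
while the other three sit idle for that step.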

I haven’t seen any solution to this, though I can imagine some helicopter 
stunts that could work around the issue.

Any suggestions?

Thanks,

— Ken

PS - I took a look through the list of FLIPs 
<https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals>,
 and didn’t see anything that covered this. I imagine it would need to be 
something like YARN’s support for per-node vCore capacity and per-task vCore 
requirements, but on a per-TM/per-operator basis.

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra
