I know this seems a silly question but I am trying to figure out optimal set up for our flink jobs. We are using standalone cluster with 5 jobs. Each job has 3 asynch operators with Executors with thread counts of 20,20,100. Source is kafka and cassandra and rest sinks exist. Currently we are using parallelism = 1. So, at max load a single job spans at least 140 threads. Also we are using netty based libraries for cassandra and restcalls . (As I can see in thread dump flink also uses netty server).
What we see is that total thread count adds up to ~ 500 for a single job. The issue we faced is, all of a sudden all jobs began to fail in production and we saw that it was mainly due to ulimit user process. All jobs did started in one server in cluster ( I do not know why, as it is a cluster with 3 members). It was set to around 1500 in that server. We then set a higher value and problems seem to go away. Can you recommend an optional prod setting for standalone cluster? Or should there be a max limit on threads spawned by a single job? Regards -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/