There have been a couple of instances where one of our TaskManagers (TMs)
was quarantined (the cause is irrelevant to this discussion), and we had to
bounce the TM to bring sanity back to the cluster. There have been
discussions around this, and I am trying to distill them. My questions are:


* Based on https://issues.apache.org/jira/browse/FLINK-3347, is it
advisable to set taskmanager.exit-on-fatal-akka-error to true?

* Is akka.ask.timeout relevant here? We could increase the value beyond
10s, but in your experience is that more of a "mask the issue" exercise,
or is 10s generally a low value that *should* be increased?

* Is it possible to track memory/resource consumption per job, or is there
some effort under way toward that, for a multi-job setup, which is very
common with Flink?

* Is there an effort to monitor RocksDB usage (off-heap and so on)?
It seems to be a black box to the user as of today.
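
For concreteness, the two settings from the first two questions would look
roughly like this in flink-conf.yaml. This is only a sketch: the 30 s value
is an illustrative assumption, not something we run today and not a
recommendation.

```yaml
# Sketch only; values are illustrative assumptions.

# Have the TaskManager process exit on a fatal Akka error (e.g. when it is
# quarantined) instead of staying up in a broken state (see FLINK-3347).
taskmanager.exit-on-fatal-akka-error: true

# Raise the ask timeout above the 10 s we have now (hypothetical value).
akka.ask.timeout: 30 s
```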

Thank you and regards.
