I addressed the issues.

On Mon, Jan 5, 2015 at 3:16 PM, Stephan Ewen <se...@apache.org> wrote:
> Hi!
>
> Thanks for clarifying. Here are some thoughts:
>
> 1) The akka URL should go through a non-exposed mechanism, true. In fact,
> using the global configuration for the local embedded mode at all seems to
> be a bad design that we should get rid of.

Sooner or later we should also get rid of the global configuration.

>
> 2) Okay, so we keep our own heartbeats in place as a means for metric
> reports. At some point, we can avoid having the JobManager actor watch the
> TaskManager actor then, it seems.
>

That is right. We can choose depending on which one performs better.

>
> 3) re: transport failure detector - makes sense
>
> 4) Yes, let's have a single timeout value that defines the ask timeout, tcp
> timeout, and the interval of the watch failure detector, and allow
> overriding them by specifying the options.
>

I chose the following heuristics for the moment:

ask timeout = tcp timeout = startup timeout = death watch pause = 10 * interval of death watch

We can see how they behave and adapt if necessary.

>
> Stephan
>
>
> On Mon, Jan 5, 2015 at 10:30 AM, Till Rohrmann <trohrm...@apache.org>
> wrote:
>
> > Hi,
> >
> > you are right, the new implementation still lacks a lot of documentation,
> > which makes understanding the code harder than necessary.
> >
> > On Sun, Jan 4, 2015 at 10:28 PM, Stephan Ewen <se...@apache.org> wrote:
> >
> > > Hi!
> > >
> > > Since the new distributed infrastructure is built on Akka, some internal
> > > concepts have changed now.
> > > I think that this is currently not really documented anywhere.
> > >
> > > @Till Can you elaborate on the questions here:
> > >
> > > - What is the Akka URL in the global configuration
> > > ("jobmanager.akka.url")?
> > > From the perspective of the global configuration, don't we simply have
> > > the address and port of the actor system?
> > >
> >
> > The jobmanager.akka.url is used to overwrite the default akka url
> > generation which is akka.tcp://${HOSTNAME}:${PORT}. This is necessary in
> > cases where we do not have remote actor systems but a single local one, as
> > in the case of local execution, and thus have to use a different url
> > scheme. In case of a single actor system, the url would be
> > akka://${ACTORSYSTEMNAME}. So in fact this configuration option is only
> > used internally and should not be configured by the user. To make it
> > fail-safe we should probably use a non-exposed mechanism.
> >
> > >
> > > - We currently have multiple competing failure-detection mechanisms: For
> > > one, the job manager actor watches the task manager actors. Also, we
> > > still have the manual heartbeats in place. Shouldn't we remove the old
> > > manual heartbeats and have the instance manager watch the task manager
> > > actors?
> > >
> >
> > It's right that we still have the old heartbeats in place but they are
> > stripped down. Currently, they are only used to update the
> > lastReceivedHeartBeat field in the Instance object. Consequently, they
> > could be simply removed at the price of no longer showing the time since
> > the last heartbeat in the web interface. The failure detection mechanism
> > is currently realized exclusively by using Akka's death watch, meaning
> > that the JobManager watches the TaskManagers and vice versa. I also
> > thought that some people wanted to piggyback on the heartbeat message to
> > do monitoring. Therefore I kept it for the moment. But I guess that a
> > dedicated monitoring message would be better.
> >
> > >
> > > - There are transport heartbeats and watch heartbeats. I could not find
> > > a good explanation of what the transport heartbeats are. Also, the
> > > heartbeat interval is very large (1000 s) by default, so I am wondering
> > > what their purpose is.
> > >
> >
> > Yes, you're right that Akka has a lot of little knobs to turn and twist
> > and some of them are more obvious than others. The transport failure
> > detector is Akka's own mechanism to detect lost messages. This is
> > necessary for UDP but not for TCP since it has its own failure detector.
> > In order to decrease the unnecessary network traffic, I set the heartbeat
> > pause and heartbeat interval of the transport failure detector to these
> > high numbers.
> >
> > >
> > > - There are many different timeouts:
> > > -> startup timeout
> > >
> >
> > That is the timeout for creating an actor system.
> >
> > >
> > > -> watch heartbeat timeout
> > >
> >
> > This timeout is used for the death watch. But the detector is actually
> > controlled by akka.watch.heartbeat.interval, akka.watch.heartbeat.pause
> > and akka.watch.threshold. In [1] it is described what these parameters do.
> >
> > >
> > > -> ask timeout
> > >
> >
> > That is the general timeout which is used for all futures once the actor
> > system has been started.
> >
> > >
> > > -> TCP timeout
> > >
> >
> > The TCP timeout is the timeout which is used by Netty for all outbound
> > connections.
> >
> > >
> > > How do they relate / interact? Does it make sense to define them
> > > relative to one another?
> > >
> >
> > For the sake of simplicity and usability, it is a good idea to derive the
> > individual timeouts by means of some heuristics from a single timeout
> > value. Maybe we could use these heuristics as default values but still
> > allow the user to define these values himself if he wants to.
> >
> > >
> > > I think it makes a lot of sense to document these points somewhere.
> > >
> >
> > I'll add an overview and details of the implementation to the internals
> > section of the documentation.
> >
> > >
> > > Greetings,
> > > Stephan
> > >
> >
> > [1]
> > http://doc.akka.io/docs/akka/snapshot/scala/remoting.html#watching-remote-actors
> >
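
To make the timeout heuristic from this thread concrete (ask timeout = tcp timeout = startup timeout = death watch pause = 10 * interval of death watch), here is a minimal sketch of deriving all values from a single base timeout while still allowing individual overrides. It is only an illustration: the names Timeouts and derive_timeouts and all field names are hypothetical and not taken from the actual code, and values are assumed to be in seconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timeouts:
    """Derived timeout values in seconds (hypothetical field names)."""
    ask_timeout: float
    tcp_timeout: float
    startup_timeout: float
    death_watch_pause: float
    death_watch_interval: float


def derive_timeouts(base_timeout, **overrides):
    """Derive every timeout from one base value.

    Heuristic: ask = tcp = startup = death watch pause = base,
    and death watch interval = base / 10.
    Keyword arguments replace individual derived values.
    """
    if base_timeout <= 0:
        raise ValueError("base_timeout must be positive")
    values = {
        "ask_timeout": base_timeout,
        "tcp_timeout": base_timeout,
        "startup_timeout": base_timeout,
        "death_watch_pause": base_timeout,
        "death_watch_interval": base_timeout / 10,
    }
    # Reject misspelled option names instead of silently ignoring them.
    unknown = set(overrides) - set(values)
    if unknown:
        raise KeyError(f"unknown timeout option(s): {sorted(unknown)}")
    values.update(overrides)
    return Timeouts(**values)
```

For example, derive_timeouts(100.0) gives a death watch interval of 10.0 with all other values at 100.0, and derive_timeouts(100.0, tcp_timeout=30.0) changes only the TCP timeout. Note that an override is taken as-is, so overriding the pause or the interval on its own breaks the 10x relation between the two.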