On Tue, Jan 17, 2012 at 10:11 AM, Lars Ellenberg <lars.ellenb...@linbit.com> wrote: > On Mon, Jan 16, 2012 at 11:42:32PM +1100, Andrew Beekhof wrote: >> >>> http://developer.gnome.org/glib/2.30/glib-The-Main-Event-Loop.html#GSourceFuncs >> >>> >> >>> iiuc, mainloop does something similar to (oversimplified): >> >>> timeout = -1; /* infinity */ >> >>> for s in all GSource >> >>> tmp_timeout = -1; >> >>> s->prepare(s, &tmp_timeout) >> >>> if (tmp_timeout >= 0 && tmp_timeout < timeout) >> >>> timeout = tmp_timeout; >> >>> >> >>> poll(GSource fd set, n, timeout); >> >> >> >> I'm looking at the glib code again now, and it still looks to me like >> >> the trigger and signal sources do not appear in this fd set. >> >> Their setup functions would have to have called g_source_add_poll() >> >> somewhere, which they don't. >> >> >> >> So I'm still not seeing why its a trigger or signal sources' fault >> >> that glib is doing a never ending call to poll(). >> >> poll() is going to get called regardless of whether our prepare >> >> function returns true or not. >> >> >> >> Looking closer, crm_trigger_prepare() returning TRUE results in: >> >> ready_source->flags |= G_SOURCE_READY; >> >> >> >> which in turn causes: >> >> context->timeout = 0; >> >> >> >> which is essentially what adding >> >> if (trig->trigger) >> >> *timeout = 0; >> >> >> >> to crm_trigger_prepare() was intended to achieve. >> >> >> >> Shouldn't the fd, ipc or wait sources (who do call g_source_add_poll() >> >> and could therefor cause poll() to block forever) have a sane timeout >> >> in their prepare functions? > > Probably should, but they usually have not. > The reasoning probably is, each GSource is responsible for *itself* only.
Well no, because this forces trigger to care about whether there is a fd based GSource too and what timeout, if any, is set. > > That is why first all sources are prepared. > > If no non-fd, non-pollable source feels the need to reduce the > *timeout to something finite in its prepare(), so be it. So something that doesn't use poll at all should set a timeout for poll, that doesn't sound right :-) > > Besides, what is sane? 1 second? 5? 120? 240? > > That's why G_CH_prepare_int() sets the *timeout to 1000, > and why I suggest to set it to 0 if prepare already knows that the > trigger is set, and to some finite amount to avoid getting stuck in > poll, in case no timeout or outher source source is active which also > set some finite timeout. > > BTW, if you have an *idle* sources, prepare should set timeout to 0. > > For those interested, all described below > http://developer.gnome.org/glib/2.30/glib-The-Main-Event-Loop.html#GSourceFuncs > > "For idle sources, the prepare and check functions always return TRUE to > indicate that the source is always ready to be processed. The prepare > function also returns a timeout value of 0 to ensure that the poll() > call doesn't block (since that would be time wasted which could have > been spent running the idle function)." > > "... timeout sources ... returns a timeout value to ensure that the > poll() call doesn't block too long ..." > > "... file descriptor sources ... timeout to -1 to indicate that is does > not mind how long the poll() call blocks ... " > >> >> Or is it because the signal itself is interrupting some essential part >> >> of G_CH_prepare_int() and friends? > > In the provided strace, it looks like the SIGTERM > is delivered while calling some G_CH_prepare_int, > the ->prepare() used by G_main_add_IPC_Channel. > > Since the signal sources are of higher priority, > we probably are passt those already in this iteration, > we will only notice the trigger in the next check(), > after the poll. > > So it is vital for any non-pollable source such as signals > to set a finite timeout in their prepare(), > even if we also mark that signal siginterrupt(). > >> >>> for s in all GSource >> >>> if s->check(s) >> >>> s->dispatch(s, ...) >> >>> >> >>> And at some stage it also orders by priority, of course. >> >>> >> >>> Also compare with the comment above /* Sigh... */ in glue >> >>> G_SIG_prepare(). >> >>> >> >>> BTW, the mentioned race between signal delivery and mainloop already >> >>> doing the poll stage could potentially be solved by using >> >>> cl_signal_set_interrupt(SIGTERM, 1), > > As I just wrote above, that race is not solved at all. > Only the (necessarily set) finite timeout of the poll > would be shortened in that case. > >> >> But I can't escape the feeling that calling this just masks the >> >> underlying "why is there a never-ending call to poll() in the first >> >> place" issue. >> >> G_CH_prepare_int() and friends /should/ be setting timeouts so that >> >> poll() can return and any sources created by g_idle_source_new() can >> >> execute. >> > >> > Actually, thinking further, I'm pretty convinced that poll() with an >> > infinite timeout is the default mode of operation for mainloops with >> > cluster-glue's IPC and FD sources. >> > And that this is not a good thing :) > > Well, if there are *only* pollable sources, it is. > If there are any other sources, they should have set > their limit on what they think is an acceptable timeout > int their prepare(). > >> Far too late, brain shutting down. > > ;-) > >> ...not a good thing, because it breaks the idle stuff, > > see above, explanation on developer.gnome.org, > idle stuff is expected to set timeout 0 (or "just a few ms"). > >> but most of all because it requires /all/ external events to come out >> of that poll() call. > > If you have any non-pollable event sources, > they need to set a finite timeout in their prepare(). > >> >>> which would mean we can condense the prepare to > > Nope, I have already corrected myself, not sufficient. > > And, actually, if the prepare() returns true, > it will be dispatched() right there, anyways, I think, > which makes the conditional *timeout = 0 useless. > > So what G_SIG_prepare() does it just the right thing: > unconditionally set *timeout = 1000 for a maximum > signal callback dispatch delay of one second. > > (plus whatever time it takes to call all the other prepare > functions, plut whatever time it takes to return from the > dispatch functions... so better not spend too much time there, > that's why my first suspicion in this thread was those blocking > IPC calls in some dispatch callbacks) > > Lars > > > _______________________________________________ > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org _______________________________________________ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org