On Tue, Dec 01, 2009 at 02:30:08PM +0000, Michael Hanselmann wrote:
> 2009/12/1 Iustin Pop <[email protected]>:
> > On Tue, Dec 01, 2009 at 01:09:50PM +0000, Michael Hanselmann wrote:
> >> Yes, for the initial connect. However, the HTTP client disables read
> >> timeouts after connecting (see
> >> lib/http/client.py:HttpClientRequestExecutor.READ_TIMEOUT and
> >> HttpClientRequestExecutor._ReadResponse). Otherwise it would time out
> >> for long-running RPCs, depending on how the timeout is chosen. Hence
> >> the “while a request is being handled” above.
> >
> > I think you're confusing protocol-level (L7) timeouts with TCP-level
> > (L4) timeouts; this is about timeouts at TCP stack level which the
> > application doesn't see (and doesn't care about).
>
> Updated design doc after discussing this offline.
>
> @@ -55,14 +55,24 @@ categories (e.g. fast and slow), this is not reliable.
>
>  If a node has an issue or the network connection fails while a request
>  is being handled, the master daemon can wait for a long time for the
> -connection to time out (due to the operating system's underlying TCP
> -keep-alive packets or timeouts). While the settings for keep-alive
> -packets can be changed using Linux-specific socket options, we don't
> -consider them reliable and responsive enough for our case.
> +connection to time out (e.g. due to the operating system's underlying
> +TCP keep-alive packets or timeouts). While the settings for keep-alive
> +packets can be changed using Linux-specific socket options, we prefer to
> +use application-level timeouts because these cover both machine down and
> +unresponsive node daemon cases.
>
> >> Actually, it probably should be something like "%s-%s-%s" %
> >> (time.time(), pid, unique_id). Otherwise, if the node daemon is
> >> restarted, function calls can collide again. A UUID would be even
> >> better, but probably be too expensive.
> >> The exact format or composition
> >> of the function call ID should not be part of this rather high-level
> >> proposal.
> >
> > Well, I argue again that design docs should include low-level decisions
> > rather than leave them to be made arbitrarily at patch writing time ;-)
>
> I suggest we use "${pid}:${time}:${random}":
>
> $ python -m timeit -s 'import os, time, random' '"%s:%0.6f:%s" %
> (os.getpid(), time.time(), random.getrandbits(16))'
> 100000 loops, best of 3: 7.8 usec per loop
>
> $ PYTHONPATH=. python -m timeit -s 'from ganeti import utils'
> 'utils.NewUUID()'
> 10000 loops, best of 3: 58.3 usec per loop
>
> Are you okay with this choice?

Hmm, what pid is that? I was thinking parent_pid+child_pid+... so that we
protect from both child pid recycling and parent restart.

> > With the async functionality, OK then. But this was not mentioned initially,
> > which is why I asked.
>
> Added a note about async I/O.
>
> >> If we handle SIGINT/SIGTERM, it could wait for its child processes.
> >> Otherwise the function processes just run to the end. I don't think we
> >> should kill them, otherwise things get even more complicated with
> >> signal handling (assuming root won't send signals).
> >
> > OK… I think we should at least try to handle them.
>
> Okay, added this:
>
> +On process termination (e.g. after having been sent a ``SIGTERM`` or
> +``SIGINT`` signal), ``ganeti-noded`` should send ``SIGTERM`` to all
> +function processes and wait for all of them to terminate.
>
> Will re-send the whole proposal as it changed in quite a few places.

Thanks!
iustin
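
P.S. For illustration only, a minimal sketch of what a call ID built from
parent_pid+child_pid plus the time and random fields suggested above could
look like. The helper name new_call_id, the field order and the ":"
separator are assumptions made up for this example, not the format agreed
on in this thread or anything that exists in the code base:

```python
import os
import random
import time


def new_call_id(parent_pid=None, child_pid=None):
    """Compose a function call ID (hypothetical helper, assumed format).

    The parent pid changes when the daemon is restarted, the child pid
    distinguishes concurrently running function processes, and the
    timestamp plus 16 random bits guard against pid recycling.
    """
    if parent_pid is None:
        # Assumes the ID is generated inside the forked function process,
        # so the parent is the daemon that forked it.
        parent_pid = os.getppid()
    if child_pid is None:
        child_pid = os.getpid()
    return "%d:%d:%0.6f:%d" % (parent_pid, child_pid, time.time(),
                               random.getrandbits(16))
```

Passing the pids explicitly (e.g. new_call_id(daemon_pid, fork_result)) would
let the parent generate the ID instead; which side does it is left open here.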
