On 12/23/2015 08:35 PM, Morgan Fainberg wrote:
On Wed, Dec 23, 2015 at 10:32 AM, Jay Pipes <[email protected]> wrote:

    On 12/23/2015 12:27 PM, Lars Kellogg-Stedman wrote:

        I've been looking into the startup constraints involved when
        launching Nova services with systemd using Type=notify (which
        causes systemd to wait for an explicit notification from the
        service before considering it to be "started").  Some services
        (e.g., nova-conductor) will happily "start" even if the backing
        database is currently unavailable (and will enter a retry loop
        waiting for the database).

        Other services -- specifically, nova-scheduler -- will block waiting
        for the database *before* providing systemd with the necessary
        notification.
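
For readers unfamiliar with the Type=notify handshake mentioned above, here is a minimal sketch of how a service can send the readiness notification without any helper library. It assumes the standard NOTIFY_SOCKET environment variable that systemd sets for such units; the function name sd_notify is chosen here for illustration and is not taken from the Nova code.

```python
import os
import socket


def sd_notify(state: str) -> bool:
    """Send a state string such as "READY=1" to systemd's notify socket.

    Returns False when NOTIFY_SOCKET is unset (the process was not started
    by systemd with Type=notify), True once the datagram has been sent.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    # A leading "@" denotes a Linux abstract-namespace socket, which is
    # addressed with a leading NUL byte instead.
    if path.startswith("@"):
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode("utf-8"), path)
    return True
```

A service that blocks on the database before calling something like sd_notify("READY=1") will keep systemd waiting, which is the behavior described for nova-scheduler.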

        nova-scheduler blocks because it wants to initialize a list of
        available aggregates (in
        scheduler.host_manager.HostManager.__init__), which it gets by
        calling objects.AggregateList.get_all.

        Does it make sense to block service startup at this stage?  The
        database disappearing during runtime isn't a hard error -- we will
        retry and reconnect when it comes back -- so should the same
        situation at startup be a hard error?  As an operator, I am more
        interested in "did my configuration files parse correctly?" at
        startup, and would generally prefer the service to start (and
        permit any dependent services to start) even when the database
        isn't up (because that's probably a situation of which I am
        already aware).


    If your configuration file parsed correctly but has the wrong
    database connection URI, what good is the service in an active
    state? It won't be able to do anything at all.

    This is why I think it's better to have hard checks, such as for
    database connections, on startup and not have services active if
    they won't be able to do anything useful.


Are you advocating that scheduler bails out and ceases to run or that it
doesn't mark itself as active? I am in favour of the second scenario but
not the first. There are cases where it would be nice to start the
scheduler and have it at least report "hey I can't contact the DB" but
not mark itself active, but continue to run and on <interval> report/try
to reconnect.

I am in favor of the service not starting at all if the database cannot be connected to in a "test connection" scenario.

It isn't clear which level of "hard check" you're advocating in your
response and I want to clarify for the sake of conversation.

If the scheduler cannot contact the database, it cannot do anything useful at all. I don't see the point of having the service daemon "up" if it cannot do anything useful.

Most monitoring tooling (Nagios or nginx for simple load balancing) and distributed service management (Zookeeper) look at whether a service is responding on some port to determine if the service is up. If the service responds on said port, but cannot do anything useful, the information is less than useful...it's harmful, IMHO.

For errors that are recoverable, sure, keep the service up and running and retry the recoverable condition. But bad configuration is not a recoverable error, and in that case I don't think the service should be started at all.
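
A rough sketch of that distinction, for illustration only: the exception classes and the wait_for_database helper below are hypothetical and do not exist in Nova or oslo.db. The sketch also assumes the two failure kinds can be told apart, which may not always hold in practice (a wrong hostname can look just like an outage).

```python
import time


class ConfigError(Exception):
    """Hypothetical: the connection settings themselves are invalid."""


class TransientError(Exception):
    """Hypothetical: the database is temporarily unreachable."""


def wait_for_database(connect, retries=5, delay=1.0, sleep=time.sleep):
    """Call connect() until it succeeds and return its result.

    ConfigError is not caught, so it propagates immediately (fail fast).
    TransientError is retried up to `retries` attempts, then re-raised.
    """
    for attempt in range(1, retries + 1):
        try:
            return connect()
        except TransientError:
            if attempt == retries:
                raise
            sleep(delay)
```

A startup path could call this before reporting itself as ready, so that a misconfigured service exits right away while a temporarily unreachable database is only retried.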

Hope that clears things up.

Best,
-jay

        It would be relatively easy to have the scheduler lazy-load the list
        of aggregates on first reference, rather than at __init__.
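
As an illustration of the lazy-load idea, a minimal sketch follows. The LazyAggregates class and its loader argument are hypothetical stand-ins, not the actual HostManager code; in Nova the loader would presumably wrap the objects.AggregateList.get_all call.

```python
class LazyAggregates:
    """Hypothetical holder that fetches aggregates on first access."""

    def __init__(self, loader):
        # loader is any zero-argument callable that performs the DB query.
        self._loader = loader
        self._aggregates = None

    @property
    def aggregates(self):
        # Query only once, the first time the list is actually needed.
        # If the loader raises, nothing is cached and the next access
        # tries again.
        if self._aggregates is None:
            self._aggregates = self._loader()
        return self._aggregates
```

With this shape, constructing the object never touches the database, so service startup would not block on it.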


    Sure, but if the root cause of the issue is a misconfigured
    connection string, then that lazy-load will just bomb out and the
    scheduler will be useless anyway. I'd rather fail early and fast
    here than fail late.

    Best,
    -jay

        I'm not familiar enough with the nova code to know if there
        would be any undesirable implications of this behavior.  We're
        already punting initializing the list of instances to an
        asynchronous task in order to avoid blocking service startup.

        Does it make sense to permit nova-scheduler to complete service
        startup in the absence of the database (and then retry the
        connection in the background)?



        
__________________________________________________________________________
        OpenStack Development Mailing List (not for usage questions)
        Unsubscribe:
        [email protected]?subject:unsubscribe
        <http://[email protected]?subject:unsubscribe>
        http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

