Hi Joe,

On 26/02/2013, at 1:39 PM, Joe Gordon <j...@cloudscaling.com> wrote:
>
>
> On Mon, Feb 25, 2013 at 6:14 PM, Sam Morrison <sorri...@gmail.com> wrote:
> Hi Joe,
>
> On 26/02/2013, at 11:19 AM, Joe Gordon <j...@cloudscaling.com> wrote:
>
>> On Sun, Feb 24, 2013 at 3:31 PM, Sam Morrison <sorri...@gmail.com> wrote:
>> I have been playing with the AggregateInstanceExtraSpecs filter and can't
>> get it to work.
>>
>> In our staging environment it works fine with 4 compute nodes, I have 2
>> aggregates to split them into 2.
>>
>> When I try to do the same in our production environment which has 80 compute
>> nodes (splitting them again into 2 aggregates) it doesn't work.
>>
>> nova-scheduler starts to go very slow, I scheduled an instance and gave up
>> after 5 minutes, it seemed to be taking ages and the host was at 100% cpu.
>> Also got about 500 messages in rabbit that were unacknowledged.
>>
>>
>> what does the nova-scheduler log say? Where is the unacknowledged rabbitmq
>> messages sent from?
>
> Logs are below. Note the large time gap between selecting a host, this is
> pretty much instantaneous without this filter.
>
> Can't figure out how to see an unacknowledged message in rabbit but my guess
> is it is the compute service updates from all the compute nodes. These aren't
> happening and I think this is the reason that the attempts to schedule
> further down are rejected with "is disabled or has not been heard from in a
> while"
>
> Do you see anything that could be an issue? Flags we use for scheduler are
> below also:
>
> Thanks for your help,
> Sam
>
>
> It looks like the scheduler issues are related to the rabbitmq issues.
> "host 'qh2-rcc77' ... is disabled or has not been heard from in a while"
>
> What does 'nova host-list' say? the clocks must all be synced up?
>

Yeah all the clocks are synced up fine. Doing a nova-manage service list gives me all :-) and updated at is correct.

We only have one nova-scheduler. It gets locked up and goes at 100% CPU.
nova-scheduler seems to take the compute service updates off the queue while this is happening, but it doesn't ack them and, going by the logs, doesn't process them either. This is why I suspect the hosts are eventually being rejected with a "not been heard from in a while" message.

I believe that is only a symptom, though. The real issue is nova-scheduler locking up: it seems to take 30-60 seconds to process each host and determine whether it passes the filters.

Does that make sense? Any other ideas on how to debug?

Cheers,
Sam

_______________________________________________
Mailing list: https://launchpad.net/~openstack
Post to     : openstack@lists.launchpad.net
Unsubscribe : https://launchpad.net/~openstack
More help   : https://help.launchpad.net/ListHelp
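
One way to check the suspicion that a single filter accounts for the 30-60 seconds per host is to time each filter call separately. The following is only a minimal, standalone sketch of that idea, not nova code: the `host_passes(host, spec)` interface, the two filter classes, the host dictionaries and the `time_filters` helper are all hypothetical names made up for illustration, and the sleep merely simulates an expensive per-host lookup.

```python
import time
from collections import defaultdict


class CheapFilter:
    """Hypothetical filter that only compares values already in memory."""

    def host_passes(self, host, spec):
        return host["free_ram_mb"] >= spec.get("ram_mb", 0)


class SlowAggregateFilter:
    """Hypothetical filter that does expensive work for every host."""

    def host_passes(self, host, spec):
        time.sleep(0.001)  # stands in for a costly per-host lookup (assumption)
        return host["aggregate_metadata"].get("class") == spec.get("class")


def time_filters(filters, hosts, spec):
    """Apply filters to each host in order, accumulating seconds per filter.

    A host is dropped at the first filter it fails, so later filters are
    neither run nor timed for that host.
    """
    seconds = defaultdict(float)
    passing = []
    for host in hosts:
        ok = True
        for f in filters:
            start = time.perf_counter()
            result = f.host_passes(host, spec)
            seconds[type(f).__name__] += time.perf_counter() - start
            if not result:
                ok = False
                break
        if ok:
            passing.append(host["name"])
    return passing, dict(seconds)


# 80 made-up hosts split evenly across two aggregates, as in the thread.
hosts = [
    {
        "name": f"node{i}",
        "free_ram_mb": 4096,
        "aggregate_metadata": {"class": "a" if i % 2 == 0 else "b"},
    }
    for i in range(80)
]
spec = {"class": "a", "ram_mb": 2048}

passing, seconds = time_filters([CheapFilter(), SlowAggregateFilter()], hosts, spec)
for name, total in sorted(seconds.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {total:.3f}s total over {len(hosts)} hosts")
```

If one filter dominates the totals, the next question is what it does per host that scales badly with 80 nodes. For a process that is already pinned at 100% CPU, a profiler such as Python's standard `cProfile` module may give the same answer without modifying any code.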