[Openstack-operators] [neutron][connection tracking] OVS connection tracking for a DNS VNF

2018-02-11 Thread Ajay Kalambur (akalambu)
Hi
Has anyone had experience running a DNS VNF on OpenStack? These VNFs typically
handle a very high volume of DNS lookups, which translates into UDP entries in
the conntrack table.
Under load this can lead to the nf_conntrack table becoming full.
The default conntrack maximum on most systems is 65536, and some forums suggest
increasing it to a very large value to cope with DNS at that scale.
The question I have: is there a way to disable OVS connection tracking on a
per-port basis in Neutron?
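(For context, the closest per-port control I am aware of is disabling port
security on the VNF port, which takes it out of the security-group firewall
path, and with it the conntrack entries, entirely. A rough sketch, with a
made-up port name, assuming the port can live without security groups and
allowed address pairs:

    openstack port set --no-security-group --disable-port-security dns-vnf-port

That of course also removes all security-group filtering for that port.)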

Also, for folks running this in production: do you get it working by tweaking
ip_conntrack_max and the UDP conntrack timeouts?
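(The sort of tuning I mean, as a sketch; the values below are placeholders, not
recommendations:

    sysctl -w net.netfilter.nf_conntrack_max=1048576
    sysctl -w net.netfilter.nf_conntrack_udp_timeout=10
    sysctl -w net.netfilter.nf_conntrack_udp_timeout_stream=60

i.e. raise the table size and shrink the UDP timeouts.)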

Ajay




Re: [Openstack-operators] [openstack][placement] Placement API service catalog

2017-06-10 Thread Ajay Kalambur (akalambu)
Hi Curtis
Thanks for the help. You were spot on in pointing out the issue: I had
copy-pasted the previous nova-api haproxy config and forgot to update the port.
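(In case anyone else hits this, the fix is just making sure the placement
frontend on 8778 points at backends that actually serve placement on 8778
rather than at nova-api on 8774. A rough sketch of the corrected haproxy
stanza; the backend names and addresses here are made up:

    listen placement_api
        bind 15.0.0.42:8778
        balance source
        server controller-1 192.168.10.11:8778 check
        server controller-2 192.168.10.12:8778 check

with one backend line per controller actually running placement.)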


Thanks again 
Ajay




On 6/10/17, 1:52 PM, "Curtis" <serverasc...@gmail.com> wrote:

>On Sat, Jun 10, 2017 at 11:56 AM, Ajay Kalambur (akalambu)
><akala...@cisco.com> wrote:
>> Hi
>> I made all the changes as documented in
>> https://docs.openstack.org/ocata/install-guide-ubuntu/nova-controller-install.html
>> https://docs.openstack.org/ocata/install-guide-ubuntu/nova-compute-install.html
>>
>> The issue I'm facing is that when nova-compute comes up and queries the
>> placement API, it gets a status 300 error code:
>> 2017-06-10 10:48:27.236 33 ERROR nova.scheduler.client.report
>> [req-18ea91e0-a210-42af-a560-5c7697a20604 - - - - -] Failed to create
>> resource provider record in placement API for UUID
>> d2067675-062b-4550-8631-d23a3b13343b. Got 300: {"choices": [{"status":
>> "SUPPORTED", "media-types": [{"base": "application/json", "type":
>> "application/vnd.openstack.compute+json;version=2"}], "id": "v2.0", "links":
>> [{"href": "http://15.0.0.42:8778/v2/resource_providers", "rel": "self"}]},
>> {"status": "CURRENT", "media-types": [{"base": "application/json", "type":
>> "application/vnd.openstack.compute+json;version=2.1"}], "id": "v2.1",
>> "links": [{"href": "http://15.0.0.42:8778/v2.1/resource_providers;, "rel":
>> "self"}]}]}.
>>
>
>If I do this against an Ocata placement API:
>
>$ OS_TOKEN=$(openstack token issue -f value -c id)
>$ curl -s -H "X-Auth-Token: $OS_TOKEN" http://:8778/
>{"versions": [{"min_version": "1.0", "max_version": "1.4", "id": "v1.0"}]}
>
>Is your loadbalancer listening on 8778 but pointing at your nova-api
>port, maybe? (Just a random guess.)
>
>Thanks,
>Curtis.
>
>>
>>
>> The symptoms suggest the service catalog is messed up, since I get this
>> error even if I stop the placement API.
>>
>> Now when I look at the keystone service catalog, it seems fine:
>>
>> | placement | placement | RegionOne                                    |
>> |           |           |   publicURL: https://172.29.86.12:8778       |
>> |           |           |   internalURL: http://15.0.0.42:8778         |
>> |           |           |   adminURL: http://15.0.0.42:8778            |
>>
>> | nova      | compute   | RegionOne                                    |
>> |           |           |   publicURL: https://172.29.86.12:8774/v2.1  |
>> |           |           |   internalURL: http://15.0.0.42:8774/v2.1    |
>> |           |           |   adminURL: http://15.0.0.42:8774/v2.1       |
>>
>>
>> Not sure what I am doing wrong here
>>
>> Also nova-status upgrade check returns an error
>> nova-status upgrade check
>> Option "verbose" from group "DEFAULT" is deprecated for removal.  Its value
>> may be silently ignored in the future.
>> {u'versions': [{u'status': u'SUPPORTED', u'updated':
>> u'2011-01-21T11:33:21Z', u'links': [{u'href': u'http://15.0.0.42:8778/v2/',
>> u'rel': u'self'}], u'min_version': u'', u'version': u'', u'id': u'v2.0'},
>> {u'status': u'CURRENT', u'updated': u'2013-07-23T11:33:21Z', u'links':
>> [{u'href': u'http://15.0.0.42:8778/v2.1/', u'rel': u'self'}],
>> u'min_version': u'2.1', u'version': u'2.42', u'id': u'v2.1'}]}
>> Error:
>> Traceback (most recent call last):
>>   File "/usr/lib/python2.7/site-packages/nova/cmd/status.py", line 457, in
>> main
>> ret = fn(*fn_args, **fn_kwargs)
>>   File "/usr/lib/python2.7/site-packages/nova/cmd/status.py", line 387, in
>> check
>> result = func(self)
>>   File "/usr/lib/python2.7/site-packages/nova/cmd/status.py", line 202, in
>> _check_placement
>> max_version = float(versions["versions"][0]["max_version"])
>> KeyError: 'max_version'
>>
>> This is with Ocata
>>
>> Ajay
>>
>>
>
>
>
>-- 
>Blog: serverascode.com


[Openstack-operators] [openstack][placement] Placement API service catalog

2017-06-10 Thread Ajay Kalambur (akalambu)
Hi
I made all the changes as documented in
https://docs.openstack.org/ocata/install-guide-ubuntu/nova-controller-install.html
https://docs.openstack.org/ocata/install-guide-ubuntu/nova-compute-install.html

The issue I'm facing is that when nova-compute comes up and queries the
placement API, it gets a status 300 error code:
2017-06-10 10:48:27.236 33 ERROR nova.scheduler.client.report 
[req-18ea91e0-a210-42af-a560-5c7697a20604 - - - - -] Failed to create resource 
provider record in placement API for UUID d2067675-062b-4550-8631-d23a3b13343b. 
Got 300: {"choices": [{"status": "SUPPORTED", "media-types": [{"base": 
"application/json", "type": 
"application/vnd.openstack.compute+json;version=2"}], "id": "v2.0", "links": 
[{"href": "http://15.0.0.42:8778/v2/resource_providers", "rel": "self"}]},
{"status": "CURRENT", "media-types": [{"base": "application/json", "type": 
"application/vnd.openstack.compute+json;version=2.1"}], "id": "v2.1", "links": 
[{"href": "http://15.0.0.42:8778/v2.1/resource_providers", "rel": "self"}]}]}.



The symptoms suggest the service catalog is messed up, since I get this error
even if I stop the placement API.

Now when I look at the keystone service catalog, it seems fine:

| placement | placement | RegionOne                                    |
|           |           |   publicURL: https://172.29.86.12:8778       |
|           |           |   internalURL: http://15.0.0.42:8778         |
|           |           |   adminURL: http://15.0.0.42:8778            |

| nova      | compute   | RegionOne                                    |
|           |           |   publicURL: https://172.29.86.12:8774/v2.1  |
|           |           |   internalURL: http://15.0.0.42:8774/v2.1    |
|           |           |   adminURL: http://15.0.0.42:8774/v2.1       |


I am not sure what I am doing wrong here.

Also, nova-status upgrade check returns an error:
nova-status upgrade check
Option "verbose" from group "DEFAULT" is deprecated for removal.  Its value may 
be silently ignored in the future.
{u'versions': [{u'status': u'SUPPORTED', u'updated': u'2011-01-21T11:33:21Z', 
u'links': [{u'href': u'http://15.0.0.42:8778/v2/', u'rel': u'self'}], 
u'min_version': u'', u'version': u'', u'id': u'v2.0'}, {u'status': u'CURRENT', 
u'updated': u'2013-07-23T11:33:21Z', u'links': [{u'href': 
u'http://15.0.0.42:8778/v2.1/', u'rel': u'self'}], u'min_version': u'2.1', 
u'version': u'2.42', u'id': u'v2.1'}]}
Error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/nova/cmd/status.py", line 457, in main
    ret = fn(*fn_args, **fn_kwargs)
  File "/usr/lib/python2.7/site-packages/nova/cmd/status.py", line 387, in check
    result = func(self)
  File "/usr/lib/python2.7/site-packages/nova/cmd/status.py", line 202, in _check_placement
    max_version = float(versions["versions"][0]["max_version"])
KeyError: 'max_version'

This is with Ocata

Ajay



Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

2016-04-21 Thread Ajay Kalambur (akalambu)
riodic_task 
context, filters, use_slave=True)

2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task   File "/usr/lib/python2.7/site-packages/nova/objects/base.py", line 161, in wrapper
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task     args, kwargs)
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task   File "/usr/lib/python2.7/site-packages/nova/conductor/rpcapi.py", line 335, in object_class_action
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task     objver=objver, args=args, kwargs=kwargs)
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 156, in call
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task     retry=self.retry)
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task     timeout=timeout, retry=retry)
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 381, in send
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task     retry=retry)
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 370, in _send
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task     result = self._waiter.wait(msg_id, timeout)
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 274, in wait
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task     message = self.waiters.get(msg_id, timeout=timeout)
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 180, in get
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task     'to message ID %s' % msg_id)
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task MessagingTimeout: Timed out waiting for a reply to message ID c0c46bd3ebfb4441981617e089c5a18d



From: Ajay Kalambur <akala...@cisco.com>
Date: Thursday, April 21, 2016 at 12:11 PM
To: "Kris G. Lindgren" <klindg...@godaddy.com>, "openstack-operators@lists.openstack.org" <openstack-operators@lists.openstack.org>
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

Thanks Kris, that's good information. I will try out your suggestions.
Ajay


From: "Kris G. Lindgren" <klindg...@godaddy.com>
Date: Thursday, April 21, 2016 at 12:08 PM
To: Ajay Kalambur <akala...@cisco.com>, "openstack-operators@lists.openstack.org" <openstack-operators@lists.openstack.org>
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

We just use heartbeat, but from what I recall other people have had good luck
with both set. I would keep them if they are already set, maybe just dial down
how aggressive they are. One thing I should mention is that if you have a large
number of RPC workers, enabling heartbeats will increase CPU consumption by
about 1-2% per worker (in our experience), since it is now doing something with
rabbitmq every few seconds. This can also increase load on the rabbitmq side.
For us, having a stable rabbit environment is well worth the tradeoff.

___
Kris Lindgren
Senior Linux Systems Engineer
GoDaddy

From: "Ajay Kalambur (akalambu)" <akala...@cisco.com>
Date: Thursday, April 21, 2016 at 1:04 PM
To: "Kris G. Lindgren" <klindg...@godaddy.com>, "openstack-operators@lists.openstack.org" <openstack-operators@lists.openstack.org>
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

Do you recommend both or can I do away with the system timers and just keep the 
heartbeat?
Ajay


From: "Kris G. Lindgren" <klindg...@godaddy.com>
Date: Thursday, April 21, 2016 at 11:54 AM
To:

Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

2016-04-21 Thread Ajay Kalambur (akalambu)
We are seeing issues only on the client side as of now.
But we do have net.ipv4.tcp_retries2 = 3 set.

Ajay

From: "Edmund Rhudy (BLOOMBERG/ 731 LEX)" <erh...@bloomberg.net>
Reply-To: "Edmund Rhudy (BLOOMBERG/ 731 LEX)" <erh...@bloomberg.net>
Date: Thursday, April 21, 2016 at 12:11 PM
To: Ajay Kalambur <akala...@cisco.com>
Cc: "openstack-operators@lists.openstack.org" <openstack-operators@lists.openstack.org>
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

Are you seeing issues only on the client side, or anything on the broker side? 
We were having issues with nodes not successfully reconnecting and ended up 
making a number of changes on the broker side to improve resiliency (upgrading 
to RabbitMQ 3.5.5 or higher, reducing net.ipv4.tcp_retries2 to evict failed 
connections faster, configuring heartbeats in RabbitMQ to detect failed clients 
more quickly).

From: akala...@cisco.com
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo
Do you recommend both or can I do away with the system timers and just keep the 
heartbeat?
Ajay


From: "Kris G. Lindgren" <klindg...@godaddy.com>
Date: Thursday, April 21, 2016 at 11:54 AM
To: Ajay Kalambur <akala...@cisco.com>, "openstack-operators@lists.openstack.org" <openstack-operators@lists.openstack.org>
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

Yea, that only fixes part of the issue. The other part is getting the OpenStack
messaging code itself to figure out that the connection it is using is no
longer valid. Heartbeats by themselves solved 90%+ of our issues with rabbitmq
and nodes being disconnected and never reconnecting.

___
Kris Lindgren
Senior Linux Systems Engineer
GoDaddy

From: "Ajay Kalambur (akalambu)" <akala...@cisco.com>
Date: Thursday, April 21, 2016 at 12:51 PM
To: "Kris G. Lindgren" <klindg...@godaddy.com>, "openstack-operators@lists.openstack.org" <openstack-operators@lists.openstack.org>
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

Trying that now. I had aggressive system keepalive timers before

net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 5


From: "Kris G. Lindgren" <klindg...@godaddy.com>
Date: Thursday, April 21, 2016 at 11:50 AM
To: Ajay Kalambur <akala...@cisco.com>, "openstack-operators@lists.openstack.org" <openstack-operators@lists.openstack.org>
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

Do you have rabbitmq/oslo.messaging heartbeats enabled?

If you aren't using heartbeats, it will take a long time for the nova-compute
agent to figure out that it is actually no longer attached to anything. The
heartbeat does periodic checks against rabbitmq and will catch this state and
reconnect.
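(For reference, the oslo.messaging knobs this refers to look roughly like the
following in nova.conf on the computes; the values shown are the usual defaults
from that era and are meant as a sketch, not a recommendation:

    [oslo_messaging_rabbit]
    heartbeat_timeout_threshold = 60
    heartbeat_rate = 2

Setting heartbeat_timeout_threshold to 0 is what disables heartbeats.)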

___
Kris Lindgren
Senior Linux Systems Engineer
GoDaddy

From: "Ajay Kalambur (akalambu)" <akala...@cisco.com>
Date: Thursday, April 21, 2016 at 11:43 AM
To: "openstack-operators@lists.openstack.org" <openstack-operators@lists.openstack.org>
Subject: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo


Hi
I am seeing on Kilo that if I bring down one controller node, sometimes some
computes report down forever.
I need to restart the compute service on the compute node to recover. It looks
like oslo is not reconnecting in nova-compute.
Here is the trace from nova-compute:
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 156, in call
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     retry=self.retry)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/s

Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

2016-04-21 Thread Ajay Kalambur (akalambu)
Thanks Kris, that's good information. I will try out your suggestions.
Ajay


From: "Kris G. Lindgren" <klindg...@godaddy.com>
Date: Thursday, April 21, 2016 at 12:08 PM
To: Ajay Kalambur <akala...@cisco.com>, "openstack-operators@lists.openstack.org" <openstack-operators@lists.openstack.org>
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

We just use heartbeat, but from what I recall other people have had good luck
with both set. I would keep them if they are already set, maybe just dial down
how aggressive they are. One thing I should mention is that if you have a large
number of RPC workers, enabling heartbeats will increase CPU consumption by
about 1-2% per worker (in our experience), since it is now doing something with
rabbitmq every few seconds. This can also increase load on the rabbitmq side.
For us, having a stable rabbit environment is well worth the tradeoff.

___
Kris Lindgren
Senior Linux Systems Engineer
GoDaddy

From: "Ajay Kalambur (akalambu)" <akala...@cisco.com>
Date: Thursday, April 21, 2016 at 1:04 PM
To: "Kris G. Lindgren" <klindg...@godaddy.com>, "openstack-operators@lists.openstack.org" <openstack-operators@lists.openstack.org>
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

Do you recommend both or can I do away with the system timers and just keep the 
heartbeat?
Ajay


From: "Kris G. Lindgren" <klindg...@godaddy.com>
Date: Thursday, April 21, 2016 at 11:54 AM
To: Ajay Kalambur <akala...@cisco.com>, "openstack-operators@lists.openstack.org" <openstack-operators@lists.openstack.org>
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

Yea, that only fixes part of the issue. The other part is getting the OpenStack
messaging code itself to figure out that the connection it is using is no
longer valid. Heartbeats by themselves solved 90%+ of our issues with rabbitmq
and nodes being disconnected and never reconnecting.

___
Kris Lindgren
Senior Linux Systems Engineer
GoDaddy

From: "Ajay Kalambur (akalambu)" <akala...@cisco.com>
Date: Thursday, April 21, 2016 at 12:51 PM
To: "Kris G. Lindgren" <klindg...@godaddy.com>, "openstack-operators@lists.openstack.org" <openstack-operators@lists.openstack.org>
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

Trying that now. I had aggressive system keepalive timers before

net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 5


From: "Kris G. Lindgren" <klindg...@godaddy.com>
Date: Thursday, April 21, 2016 at 11:50 AM
To: Ajay Kalambur <akala...@cisco.com>, "openstack-operators@lists.openstack.org" <openstack-operators@lists.openstack.org>
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

Do you have rabbitmq/oslo.messaging heartbeats enabled?

If you aren't using heartbeats, it will take a long time for the nova-compute
agent to figure out that it is actually no longer attached to anything. The
heartbeat does periodic checks against rabbitmq and will catch this state and
reconnect.

___
Kris Lindgren
Senior Linux Systems Engineer
GoDaddy

From: "Ajay Kalambur (akalambu)" <akala...@cisco.com>
Date: Thursday, April 21, 2016 at 11:43 AM
To: "openstack-operators@lists.openstack.org" <openstack-operators@lists.openstack.org>
Subject: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo


Hi
I am seeing on Kilo that if I bring down one controller node, sometimes some
computes report down forever.
I need to 

Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

2016-04-21 Thread Ajay Kalambur (akalambu)
Do you recommend both or can I do away with the system timers and just keep the 
heartbeat?
Ajay


From: "Kris G. Lindgren" <klindg...@godaddy.com>
Date: Thursday, April 21, 2016 at 11:54 AM
To: Ajay Kalambur <akala...@cisco.com>, "openstack-operators@lists.openstack.org" <openstack-operators@lists.openstack.org>
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

Yea, that only fixes part of the issue. The other part is getting the OpenStack
messaging code itself to figure out that the connection it is using is no
longer valid. Heartbeats by themselves solved 90%+ of our issues with rabbitmq
and nodes being disconnected and never reconnecting.

___
Kris Lindgren
Senior Linux Systems Engineer
GoDaddy

From: "Ajay Kalambur (akalambu)" <akala...@cisco.com>
Date: Thursday, April 21, 2016 at 12:51 PM
To: "Kris G. Lindgren" <klindg...@godaddy.com>, "openstack-operators@lists.openstack.org" <openstack-operators@lists.openstack.org>
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

Trying that now. I had aggressive system keepalive timers before

net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 5


From: "Kris G. Lindgren" <klindg...@godaddy.com>
Date: Thursday, April 21, 2016 at 11:50 AM
To: Ajay Kalambur <akala...@cisco.com>, "openstack-operators@lists.openstack.org" <openstack-operators@lists.openstack.org>
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

Do you have rabbitmq/oslo.messaging heartbeats enabled?

If you aren't using heartbeats, it will take a long time for the nova-compute
agent to figure out that it is actually no longer attached to anything. The
heartbeat does periodic checks against rabbitmq and will catch this state and
reconnect.

___
Kris Lindgren
Senior Linux Systems Engineer
GoDaddy

From: "Ajay Kalambur (akalambu)" <akala...@cisco.com>
Date: Thursday, April 21, 2016 at 11:43 AM
To: "openstack-operators@lists.openstack.org" <openstack-operators@lists.openstack.org>
Subject: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo


Hi
I am seeing on Kilo that if I bring down one controller node, sometimes some
computes report down forever.
I need to restart the compute service on the compute node to recover. It looks
like oslo is not reconnecting in nova-compute.
Here is the trace from nova-compute:
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 156, in call
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     retry=self.retry)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     timeout=timeout, retry=retry)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 350, in send
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     retry=retry)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 339, in _send
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     result = self._waiter.wait(msg_id, timeout)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 243, in wait
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     message = self.waiters.get(msg_id, timeout=timeout)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 149, in get
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     'to message ID %s' % msg_id)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.driv

[Openstack-operators] [oslo]nova compute reconnection Issue Kilo

2016-04-21 Thread Ajay Kalambur (akalambu)

Hi
I am seeing on Kilo that if I bring down one controller node, sometimes some
computes report down forever.
I need to restart the compute service on the compute node to recover. It looks
like oslo is not reconnecting in nova-compute.
Here is the trace from nova-compute:
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 156, in call
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     retry=self.retry)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     timeout=timeout, retry=retry)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 350, in send
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     retry=retry)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 339, in _send
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     result = self._waiter.wait(msg_id, timeout)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 243, in wait
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     message = self.waiters.get(msg_id, timeout=timeout)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 149, in get
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     'to message ID %s' % msg_id)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db MessagingTimeout: Timed out waiting for a reply to message ID e064b5f6c8244818afdc5e91fff8ebf1


Any thoughts? I am at stable/kilo for oslo.

Ajay



[Openstack-operators] [openstack-operators] Fernet key rotation

2016-03-19 Thread Ajay Kalambur (akalambu)
Hi
In a multi-node HA production deployment, does key rotation require a keystone
process restart, or should we just run the fernet rotation on one node and
distribute the keys without restarting any process?
I presume keystone can handle the rotation without a restart?

I also assume this key rotation can happen without a maintenance window.

What do folks typically do in production, and how often do you rotate keys?
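(For reference, the kind of workflow I am asking about, as a sketch; the key
repository path is the default and the host names are made up:

    # on one designated node
    keystone-manage fernet_rotate --keystone-user keystone --keystone-group keystone
    # then push the whole key repository to the other controllers
    rsync -a --delete /etc/keystone/fernet-keys/ controller-2:/etc/keystone/fernet-keys/
    rsync -a --delete /etc/keystone/fernet-keys/ controller-3:/etc/keystone/fernet-keys/

run on whatever schedule people have found sensible.)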

Ajay



[Openstack-operators] Keystone token HA

2015-12-17 Thread Ajay Kalambur (akalambu)
Hi
If we deploy keystone using memcached as the token backend, we see that bringing
down one of three memcache servers results in some tokens getting invalidated.
Does memcached not support replication of tokens?
So if we want HA for keystone tokens, should we use the SQL backend instead?
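(The kind of configuration I mean, roughly; the option names are from the
Kilo-era keystone.conf and should be double-checked against your release, and
the host names are made up:

    [token]
    driver = keystone.token.persistence.backends.memcache.Token

    [memcache]
    servers = controller-1:11211,controller-2:11211,controller-3:11211

versus switching the [token] driver to the SQL backend.)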

Ajay



[Openstack-operators] Keystone audit logs with haproxy

2015-11-24 Thread Ajay Kalambur (akalambu)
Hi
I have a deployment where keystone sits behind an haproxy node, so
authentication requests are made to a VIP. The problem is that when there is an
authentication failure, we cannot track the remote IP that failed the login:
all authentication failures show the VIP's IP, since haproxy forwards the
request to a backend keystone server.

How do we use a load balancer like haproxy and still track the remote IP for
authentication failures? Right now every authentication failure shows up with
the remote IP set to the VIP's IP.
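(To frame the question, a sketch of the kind of workaround I have been looking
at; names and addresses are made up. haproxy can pass the original client
address along in an X-Forwarded-For header, and the web server in front of
keystone can then log that header instead of the peer address:

    # haproxy.cfg
    listen keystone_public
        bind 192.0.2.10:5000
        mode http
        option forwardfor
        server keystone-1 192.168.10.11:5000 check

    # Apache, if keystone runs under mod_wsgi
    LogFormat "%{X-Forwarded-For}i %l %u %t \"%r\" %>s %b" keystone_xff

Is that what people do, or is there a better approach?)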



Ajay



[Openstack-operators] Rabbit HA queues

2015-09-01 Thread Ajay Kalambur (akalambu)
Hi
How is the rabbit_ha_queues parameter used in configuration files like
nova.conf, neutron.conf, cinder.conf, etc.?

What happens if the queues are set to mirrored by policy on the RabbitMQ node
but rabbit_ha_queues is set to False on the client side?
[root@j10-controller-1 /]# rabbitmqctl list_policies
Listing policies ...
/ ha-all all {"ha-mode":"all","ha-sync-mode":"automatic"} 0
...done.
[root@j10-controller-1 /]
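(For reference, a policy like the one listed above is normally created with
something along these lines, against the default vhost:

    rabbitmqctl set_policy ha-all "" '{"ha-mode":"all","ha-sync-mode":"automatic"}'

which is entirely server-side.)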


I have rabbit_ha_queues=False set in nova.conf and neutron.conf, and from what I
can see the queues still appear to be mirrored. So why is this parameter needed
in nova.conf, neutron.conf, etc.?

This is the Juno release.



[Openstack-operators] Control exchange configuration

2015-09-01 Thread Ajay Kalambur (akalambu)
Hi
When we configure the control_exchange parameter in each of the OpenStack
components, it defaults to "openstack".
Is there a recommendation to have separate exchanges per component, or should we
just use the single "openstack" exchange in RabbitMQ?

Is there any impact of using one versus the other?
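(Concretely, I mean the per-service setting along these lines; the exchange
names here are just illustrative:

    # e.g. nova.conf
    [DEFAULT]
    control_exchange = nova

    # e.g. neutron.conf
    [DEFAULT]
    control_exchange = neutron

versus leaving everything on the default "openstack" exchange.)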


Ajay
