[Openstack-operators] User Survey: Pre-launch check

2017-01-09 Thread Heidi Joy Tretheway
ARE WE OFFERING THE RIGHT ANSWERS?
Thanks for your input over the past month on revising the User Survey. Many of 
our technical questions, such as “which network drivers do you use?” or “which 
container tools do you use?”, have answers that change with the speed of 
innovation.

We’re in the final stages of preparing the OpenStack User Survey for launch in 
a couple of weeks, but before it goes live, we wanted to share the final 
question and answer content with you: 
https://docs.google.com/spreadsheets/d/1c8Osptcse1geEeHBHwZ-fxQA-OQ_dzwAHSnBD3CNbEQ/edit#gid=0
 


We’re primarily looking for errors in the answer choices (or missing answer 
choices) for the questions in rows 67-85 (please see column E for the answer 
options). Please feel free to leave feedback via a comment on the spreadsheet. 

USER SURVEY RESULTS
Also, we’re looking for volunteers to help produce the survey. Can you spend a 
few hours analyzing survey comments, reviewing charts, or reading a draft report? Be part 
of one of our most talked-about community assets by joining the User Survey 
work group (both technical and non-technical people are welcome!). 

Sign up here: https://goo.gl/forms/F0c1a9NOR1zK7sUJ3 




Heidi Joy Tretheway
Senior Marketing Manager, OpenStack Foundation
503 816 9769  | Skype: heidi.tretheway




___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] RabbitMQ 3.6.x experience?

2017-01-09 Thread Mike Dorman
Great info, thanks so much for this.  We, too, turned off stats collection 
some time ago (and haven't really missed it).

Tomáš, what minor version of 3.6 are you using?  We would probably go to 3.6.6 
if we upgrade.

Thanks again all!
Mike


On 1/9/17, 2:34 AM, "Ricardo Rocha"  wrote:

Same here, running 3.6.5 on (some of) the rabbit clusters.

It's been stable over the last month (fingers crossed!), though:
* gave up on stats collection (collect_statistics_interval set to 6, which
makes it not so useful)
* can still make it very sick with a couple of misconfigured clients
(rabbit_retry_interval=1 and rabbit_retry_backoff=60 currently
everywhere).

Some data from the neutron rabbit cluster (3 vm nodes, not all infra
currently talks to neutron):

* connections: ~8k
* memory used per node: 2.5GB, 1.7GB, 0.1GB (the last one is less used
due to a previous net partition, I believe)
* rabbit hiera configuration
rabbitmq::cluster_partition_handling: 'autoheal'
rabbitmq::config_kernel_variables:
  inet_dist_listen_min: 41055
  inet_dist_listen_max: 41055
rabbitmq::config_variables:
  collect_statistics_interval: 6
  reverse_dns_lookups: true
  vm_memory_high_watermark: 0.8
rabbitmq::environment_variables:
  SERVER_ERL_ARGS: "'+K true +A 128 +P 1048576'"
rabbitmq::tcp_keepalive: true
rabbitmq::tcp_backlog: 4096

* package versions

erlang-kernel-18.3.4.4-1
rabbitmq-server-3.6.5-1

It's stable enough to keep scaling it up in the next couple months and
see how it goes.

Cheers,
  Ricardo

On Mon, Jan 9, 2017 at 3:54 AM, Sam Morrison  wrote:
> We’ve been running 3.6.5 for some time now and it’s working well.
>
> 3.6.1 - 3.6.3 are unusable, we had lots of issues with stats DB and other
> weirdness.
>
> Our setup is a 3 physical node cluster with around 9k connections, average
> around the 300 messages/sec delivery. We have the stats sample rate set to
> default and it is working fine.
>
> Yes we did have to restart the cluster to upgrade.
>
> Cheers,
> Sam
>
>
>
> On 6 Jan 2017, at 5:26 am, Matt Fischer  wrote:
>
> Mike,
>
> I did a bunch of research and experiments on this last fall. We are running
> Rabbit 3.5.6 on our main cluster and 3.6.5 on our Trove cluster which has
> significantly less load (and criticality). We were going to upgrade to 3.6.5
> everywhere but in the end decided not to, mainly because there was little
> perceived benefit at the time. Our main issue is unchecked memory growth at
> random times. I ended up making several config changes to the stats
> collector and then we also restart it after every deploy and that solved it
> (so far).
>
> I'd say these were my main reasons for not going to 3.6 for our control
> nodes:
>
> In 3.6.x they re-wrote the stats processor to make it parallel. In every 3.6
> release since then, Pivotal has fixed bugs in this code. Then finally they
> threw up their hands and said "we're going to make a complete rewrite in
> 3.7/4.x" (you need to look through issues on Github to find this discussion)
> Out of the box with the same configs 3.6.5 used more memory than 3.5.6,
> since this was our main issue, I consider this a negative.
> Another issue is the ancient version of erlang we have with Ubuntu Trusty
> (which we are working on) which made upgrades more complex/impossible
> depending on the version.
>
> Given those negatives, the main one being that I didn't think there would be
> too many more fixes to the parallel statsdb collector in 3.6, we decided to
> stick with 3.5.6. In the end the devil we know is better than the devil we
> don't and I had no evidence that 3.6.5 would be an improvement.
>
> I did decide to leave Trove on 3.6.5 because this would give us some bake-in
> time; if 3.5.x became untenable we'd at least have had it up and running in
> production and some data on it.
>
> If statsdb is not a concern for you, I think this changes the math and maybe
> you should use 3.6.x. I would however recommend at least going to 3.5.6,
> it's been better than 3.3/3.4 was.
>
> No matter what you do definitely read all the release notes. There are some
> upgrades which require an entire cluster shutdown. The upgrade to 3.5.6 did
> not require this IIRC.
>
> Here's the hiera for our rabbit settings which I assume you can translate:
>
> rabbitmq::cluster_partition_handling: 'autoheal'
> rabbitmq::config_variables:
>   'vm_memory_high_watermark': '0.6'
>   'collect_statistics_interval': 3
> rabbitmq::config_management_variables:
>   'rates_mode': 'none'
> rabbitmq::file_limit: '65535'

Re: [Openstack-operators] [nova] Automatically disabling compute service on RBD EMFILE failures

2017-01-09 Thread Daniel P. Berrange
On Sat, Jan 07, 2017 at 12:04:25PM -0600, Matt Riedemann wrote:
> A few weeks ago someone in the operators channel was talking about issues
> with ceph-backed nova-compute and OSErrors for too many open files causing
> issues.
> 
> We have a bug reported that's very similar sounding:
> 
> https://bugs.launchpad.net/nova/+bug/1651526
> 
> During the periodic update_available_resource audit, the call to RBD to get
> disk usage fails with the EMFILE OSError. Since this is in a periodic task it
> doesn't cause any direct operations to fail, but it will cause issues with
> scheduling because that host is effectively down; however, nothing sets the
> service to down (disabled).
> 
> I had proposed a solution in the bug report that we could automatically
> disable the service for that host when this happens, and then automatically
> enable the service again if/when the next periodic task run is successful.
> Disabling the service would take that host out of contention for scheduling
> and may also trigger an alarm for the operator to investigate the failure
> (although if there are EMFILE errors from the ceph cluster I'm guessing
> alarms should already be going off).
> 
> Anyway, I wanted to see how hacky of an idea this is. We already
> automatically enable/disable the service from the libvirt driver when the
> connection to libvirt itself drops via an event callback. This would be
> similar, albeit less sophisticated, as it's not using an event-listening
> mechanism; we'd have to maintain some local state in memory to know if we
> need to enable/disable the service again. And it seems pretty
> hacky/one-offish to handle this just for the RBD failure, but maybe we just
> generically handle it for any EMFILE error when collecting disk usage in the
> resource audit?
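
For illustration, a minimal sketch of the auto-disable idea described above:
catch EMFILE from the disk-usage call in the periodic task, disable the
compute service, and re-enable it once a later run succeeds. The
get_disk_usage and service_api callables are hypothetical stand-ins, not
Nova's actual internals.

import errno

class DiskUsageAudit(object):
    """Sketch only: wrap the periodic disk-usage call and toggle the service."""

    def __init__(self, get_disk_usage, service_api):
        self.get_disk_usage = get_disk_usage  # e.g. the RBD usage call that hits EMFILE
        self.service_api = service_api        # hypothetical enable/disable helper
        self._disabled_by_audit = False       # local in-memory state, as described above

    def run(self, host):
        try:
            usage = self.get_disk_usage(host)
        except OSError as exc:
            if exc.errno != errno.EMFILE:
                raise
            if not self._disabled_by_audit:
                self.service_api.disable(host, reason='EMFILE in resource audit')
                self._disabled_by_audit = True
            return None
        # A later successful run brings the host back into scheduling.
        if self._disabled_by_audit:
            self.service_api.enable(host)
            self._disabled_by_audit = False
        return usage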

Presumably this deployment was using the default Linux file limits,
which are at a ridiculously low value of 1024. Ceph with 900 OSDs
will potentially need 900 file descriptors, not really leaving any
slack for Nova to do other work. I'd be willing to bet there are other
scenarios in which Nova would hit the 1024 FD limit under high usage,
not merely Ceph. So perhaps, regardless of whether Ceph is used, we
should just recommend that you always run Nova with 4096 FDs, check
that in initialize() on startup, and log a warning if the limit is
lower than this.
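
As a rough sketch of that startup check (illustrative only; the function
name and the 4096 threshold are assumptions, not Nova's actual API):

import logging
import resource

LOG = logging.getLogger(__name__)
RECOMMENDED_NOFILE = 4096  # suggested floor from the discussion above

def warn_if_fd_limit_too_low():
    # The RLIMIT_NOFILE soft limit is what the process can actually use.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft < RECOMMENDED_NOFILE:
        LOG.warning('open file descriptor limit is %d; at least %d is '
                    'recommended (e.g. for RBD-backed compute nodes)',
                    soft, RECOMMENDED_NOFILE)

if __name__ == '__main__':
    logging.basicConfig(level=logging.WARNING)
    warn_if_fd_limit_too_low()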

With pretty much all distros using systemd, it would be nice if Nova
shipped a standard systemd unit file, which could then also contain
the recommended higher FD limit so people get sane limits out of the
box.
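
For example, a systemd drop-in along these lines raises the limit out of
the box (the unit name is distro-dependent, so treat the path as an
assumption):

# /etc/systemd/system/nova-compute.service.d/limits.conf
[Service]
LimitNOFILE=4096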

Regards,
Daniel
-- 
|: http://berrange.com  -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://entangle-photo.org  -o- http://search.cpan.org/~danberr/ :|

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] RabbitMQ 3.6.x experience?

2017-01-09 Thread Ricardo Rocha
Same here, running 3.6.5 on (some of) the rabbit clusters.

It's been stable over the last month (fingers crossed!), though:
* gave up on stats collection (collect_statistics_interval set to 6, which
makes it not so useful)
* can still make it very sick with a couple of misconfigured clients
(rabbit_retry_interval=1 and rabbit_retry_backoff=60 currently
everywhere).
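
For reference, those two options live in the oslo.messaging config of the
clients talking to rabbit; a sketch of the stanza (the [oslo_messaging_rabbit]
section name is assumed for the oslo.messaging releases of that era):

# nova.conf / neutron.conf etc. -- section name assumed
[oslo_messaging_rabbit]
rabbit_retry_interval = 1
rabbit_retry_backoff = 60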

Some data from the neutron rabbit cluster (3 vm nodes, not all infra
currently talks to neutron):

* connections: ~8k
* memory used per node: 2.5GB, 1.7GB, 0.1GB (the last one is less used
due to a previous net partition, I believe)
* rabbit hiera configuration
rabbitmq::cluster_partition_handling: 'autoheal'
rabbitmq::config_kernel_variables:
  inet_dist_listen_min: 41055
  inet_dist_listen_max: 41055
rabbitmq::config_variables:
  collect_statistics_interval: 6
  reverse_dns_lookups: true
  vm_memory_high_watermark: 0.8
rabbitmq::environment_variables:
  SERVER_ERL_ARGS: "'+K true +A 128 +P 1048576'"
rabbitmq::tcp_keepalive: true
rabbitmq::tcp_backlog: 4096

* package versions

erlang-kernel-18.3.4.4-1
rabbitmq-server-3.6.5-1

It's stable enough to keep scaling it up in the next couple months and
see how it goes.

Cheers,
  Ricardo

On Mon, Jan 9, 2017 at 3:54 AM, Sam Morrison  wrote:
> We’ve been running 3.6.5 for some time now and it’s working well.
>
> 3.6.1 - 3.6.3 are unusable, we had lots of issues with stats DB and other
> weirdness.
>
> Our setup is a 3 physical node cluster with around 9k connections, average
> around the 300 messages/sec delivery. We have the stats sample rate set to
> default and it is working fine.
>
> Yes we did have to restart the cluster to upgrade.
>
> Cheers,
> Sam
>
>
>
> On 6 Jan 2017, at 5:26 am, Matt Fischer  wrote:
>
> Mike,
>
> I did a bunch of research and experiments on this last fall. We are running
> Rabbit 3.5.6 on our main cluster and 3.6.5 on our Trove cluster which has
> significantly less load (and criticality). We were going to upgrade to 3.6.5
> everywhere but in the end decided not to, mainly because there was little
> perceived benefit at the time. Our main issue is unchecked memory growth at
> random times. I ended up making several config changes to the stats
> collector and then we also restart it after every deploy and that solved it
> (so far).
>
> I'd say these were my main reasons for not going to 3.6 for our control
> nodes:
>
> In 3.6.x they re-wrote the stats processor to make it parallel. In every 3.6
> release since then, Pivotal has fixed bugs in this code. Then finally they
> threw up their hands and said "we're going to make a complete rewrite in
> 3.7/4.x" (you need to look through issues on Github to find this discussion)
> Out of the box with the same configs 3.6.5 used more memory than 3.5.6,
> since this was our main issue, I consider this a negative.
> Another issue is the ancient version of erlang we have with Ubuntu Trusty
> (which we are working on) which made upgrades more complex/impossible
> depending on the version.
>
> Given those negatives, the main one being that I didn't think there would be
> too many more fixes to the parallel statsdb collector in 3.6, we decided to
> stick with 3.5.6. In the end the devil we know is better than the devil we
> don't and I had no evidence that 3.6.5 would be an improvement.
>
> I did decide to leave Trove on 3.6.5 because this would give us some bake-in
> time; if 3.5.x became untenable we'd at least have had it up and running in
> production and some data on it.
>
> If statsdb is not a concern for you, I think this changes the math and maybe
> you should use 3.6.x. I would however recommend at least going to 3.5.6,
> it's been better than 3.3/3.4 was.
>
> No matter what you do definitely read all the release notes. There are some
> upgrades which require an entire cluster shutdown. The upgrade to 3.5.6 did
> not require this IIRC.
>
> Here's the hiera for our rabbit settings which I assume you can translate:
>
> rabbitmq::cluster_partition_handling: 'autoheal'
> rabbitmq::config_variables:
>   'vm_memory_high_watermark': '0.6'
>   'collect_statistics_interval': 3
> rabbitmq::config_management_variables:
>   'rates_mode': 'none'
> rabbitmq::file_limit: '65535'
>
> Finally, if you do upgrade to 3.6.x please report back here with your
> results at scale!
>
>
> On Thu, Jan 5, 2017 at 8:49 AM, Mike Dorman  wrote:
>>
>> We are looking at upgrading to the latest RabbitMQ in an effort to ease
>> some cluster failover issues we’ve been seeing.  (Currently on 3.4.0)
>>
>>
>>
>> Anyone been running 3.6.x?  And what has been your experience?  Any
>> gotchas to watch out for?
>>
>>
>>
>> Thanks,
>>
>> Mike
>>
>>
>>
>>
>> ___
>> OpenStack-operators mailing list
>> OpenStack-operators@lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>
>
> ___
> 

Re: [Openstack-operators] RabbitMQ 3.6.x experience?

2017-01-09 Thread Tomáš Vondra
We have upgraded to RabbitMQ 3.6, and it resulted in one node crashing about 
every week with out-of-memory errors. To avoid this, we had to turn off the 
message rate collection. So no throughput graphs until it gets fixed. Avoid 
this version if you can.
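
For reference, this is roughly how rate/stats collection can be switched off
using the same puppet hiera keys quoted elsewhere in this thread (a sketch;
the exact keys depend on the puppet rabbitmq module in use):

rabbitmq::config_management_variables:
  'rates_mode': 'none'          # stop computing message rates
rabbitmq::config_variables:
  'collect_statistics': 'none'  # optionally disable stats collection entirely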

Tomas

 

From: Sam Morrison [mailto:sorri...@gmail.com] 
Sent: Monday, January 09, 2017 3:55 AM
To: Matt Fischer
Cc: OpenStack Operators
Subject: Re: [Openstack-operators] RabbitMQ 3.6.x experience?

 

We’ve been running 3.6.5 for some time now and it’s working well.

 

3.6.1 - 3.6.3 are unusable, we had lots of issues with stats DB and other 
weirdness. 

 

Our setup is a 3 physical node cluster with around 9k connections, average 
around the 300 messages/sec delivery. We have the stats sample rate set to 
default and it is working fine.

 

Yes we did have to restart the cluster to upgrade.

 

Cheers,

Sam

 

 

 

On 6 Jan 2017, at 5:26 am, Matt Fischer  wrote:

 

Mike,

 

I did a bunch of research and experiments on this last fall. We are running 
Rabbit 3.5.6 on our main cluster and 3.6.5 on our Trove cluster which has 
significantly less load (and criticality). We were going to upgrade to 3.6.5 
everywhere but in the end decided not to, mainly because there was little 
perceived benefit at the time. Our main issue is unchecked memory growth at 
random times. I ended up making several config changes to the stats collector 
and then we also restart it after every deploy and that solved it (so far). 

 

I'd say these were my main reasons for not going to 3.6 for our control nodes:

*   In 3.6.x they re-wrote the stats processor to make it parallel. In 
every 3.6 release since then, Pivotal has fixed bugs in this code. Then finally 
they threw up their hands and said "we're going to make a complete rewrite in 
3.7/4.x" (you need to look through issues on Github to find this discussion)
*   Out of the box with the same configs 3.6.5 used more memory than 3.5.6, 
since this was our main issue, I consider this a negative.
*   Another issue is the ancient version of erlang we have with Ubuntu 
Trusty (which we are working on) which made upgrades more complex/impossible 
depending on the version.

Given those negatives, the main one being that I didn't think there would be 
too many more fixes to the parallel statsdb collector in 3.6, we decided to 
stick with 3.5.6. In the end the devil we know is better than the devil we 
don't and I had no evidence that 3.6.5 would be an improvement.

 

I did decide to leave Trove on 3.6.5 because this would give us some bake-in 
time; if 3.5.x became untenable we'd at least have had it up and running in 
production and some data on it.

 

If statsdb is not a concern for you, I think this changes the math and maybe 
you should use 3.6.x. I would however recommend at least going to 3.5.6, it's 
been better than 3.3/3.4 was.

 

No matter what you do definitely read all the release notes. There are some 
upgrades which require an entire cluster shutdown. The upgrade to 3.5.6 did not 
require this IIRC.

 

Here's the hiera for our rabbit settings which I assume you can translate:

 

rabbitmq::cluster_partition_handling: 'autoheal'

rabbitmq::config_variables:

  'vm_memory_high_watermark': '0.6'

  'collect_statistics_interval': 3

rabbitmq::config_management_variables:

  'rates_mode': 'none'

rabbitmq::file_limit: '65535'

 

Finally, if you do upgrade to 3.6.x please report back here with your results 
at scale!

 

 

On Thu, Jan 5, 2017 at 8:49 AM, Mike Dorman  wrote:

We are looking at upgrading to the latest RabbitMQ in an effort to ease some 
cluster failover issues we’ve been seeing.  (Currently on 3.4.0)

 

Anyone been running 3.6.x?  And what has been your experience?  Any gotchas to 
watch out for?

 

Thanks,

Mike

 


___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators