Brice Figureau <brice-pup...@daysofwonder.com> writes:

> On Wed, 2011-01-26 at 10:11 -0500, Micah Anderson wrote:
>> Brice Figureau <brice-pup...@daysofwonder.com> writes:
>> 
>> > On Tue, 2011-01-25 at 17:11 -0500, Micah Anderson wrote:
>> >> Brice Figureau <brice-pup...@daysofwonder.com> writes:
>> >>
>> >> All four of my mongrels are constantly pegged, doing 40-50% of the CPU
>> >> each, occupying all available CPUs. They never settle down. I've got 74
>> >> nodes checking in now, it doesn't seem like its that many, but perhaps
>> >> i've reached a tipping point with my puppetmaster (its a dual 1ghz,
>> >> 2gigs of ram machine)?
>> >
>> > The puppetmaster is mostly CPU bound. Since you have only 2 CPUs, you
>> > shouldn't try to achieve a concurrency of 4 (which your mongrels are
>> > trying to do), otherwise more than one request will be accepted by a
>> > single mongrel process and the threads will contend for the CPU. The
>> > bad news is that the ruby MRI uses green threads, so the second thread
>> > will only run when the first one sleeps, does I/O or relinquishes the
>> > CPU voluntarily. In other words, it will only run once the first
>> > thread has finished its compilation.
>> 
>> Ok, that is a good thing to know. I wasn't aware that ruby was not able
>> to do that.
>> 
>> > Now you have 74 nodes with a worst-case compilation time of 75s (which
>> > is a lot), so that translates to 74*75 = 5550s of compilation time.
>> > With a concurrency of 2, that's still 2775s of compilation time per
>> > round of <insert your default sleep time here>. With the default 30min
>> > of sleep time (1800s) and assuming perfect scheduling, that's still
>> > longer than a round of sleep time, which means that you will never
>> > finish compiling before the first node asks for a catalog again.
>> 
>> I'm using 60 minutes of sleep time, i.e. 3600 seconds per round. With a
>> concurrency of 2 and 2775s of compile time per round, that does keep me
>> under the 3600 seconds... assuming scheduling is perfect, which it very
>> likely is not.
>> 
>> > And I'm talking only about compilation. If your manifests use file
>> > sourcing, you must also add this to the equation.
>> 
>> As explained, I set up your nginx method for offloading file sourcing.
>> 
>> > Another explanation of the issue is swapping. You mention your server
>> > has 2GiB of RAM. Are you sure your 4 mongrel processes, after some time,
>> > still fit in physical RAM (along with the other things running on the
>> > server)?
>> > Maybe your server is constantly swapping.
>> 
>> I'm actually doing fine on memory, not dipping into swap. I've watched
>> i/o to see if I could identify either a swap or disk problem, but didn't
>> notice very much happening there. The CPU usage of the mongrel processes
>> is pretty much where everything is spending its time. 
>> 
>> I've been wondering if I have some loop in a manifest or something that
>> is causing them to just spin.
>
> I don't think that's the problem. There could be some ruby internals
> issue at play here, but I doubt something in your manifests creates a
> loop.
>
> What is strange is that you mentioned that the very first catalog
> compilations were fine, but then the compilation time increases.

Yes, and it increases quite rapidly. Interestingly, the first few
compile times are basically in the range I was seeing before things
started to tip over (the last few days). I've been racking my brain for
anything I might have changed, but so far I've come up with nothing.

>> > So there are several things you can do to get better performance:
>> > * reduce the number of nodes that check in at a single time (ie increase
>> > sleep time)
>> 
>> I've already reduced to once per hour, but I could consider reducing it
>> more. 
>
> That would be interesting. It would help us find out whether the
> problem is too much load/concurrency from your clients or a problem in
> the master itself.

I'll need to set up mcollective to do that, I believe.

Right now I'm setting up a cronjob like this:

"<%= scope.function_fqdn_rand(['59']) %> * * * *"

which results in a cronjob (on one host):
6 * * * * root /usr/sbin/puppetd --onetime --no-daemonize 
--config=/etc/puppet/puppet.conf --color false | grep -E 
'(^err:|^alert:|^emerg:|^crit:)'

> BTW, what's the load on the server?

The server is dedicated to the puppetmaster. When I had four mongrels
running, the load was basically at 4 constantly. Now that I've backed it
down to 2 mongrels, it's:

11:57:41 up 58 days, 21:20,  2 users,  load average: 2.31, 1.97, 2.02

>> Not swapping.
>
> OK, good.

Just to confirm this... vmstat shows no si/so activity and very high
numbers in the CPU user column. Very little bi/bo, and low sys values.
Context switches are a bit high... this clearly points to the processes
eating CPU, not to any disk/memory/swap problem.

>> >   + Reduce the number of mongrel instances, to artificially reduce the
>> > concurrency (this is counter-intuitive, I know)
>> 
>> Ok, I'm backing off to two mongrels to see how well that works.
>
> Let me know if that changes something.

Doesn't seem to help. Compile times start out low and keep inching up
(they started at 27 seconds and are now at 120 seconds).

>> >   + use a "better" ruby interpreter like Ruby Enterprise Edition (for
>> > several reasons this one has a better GC and a smaller memory footprint).
>> 
>> I'm pretty sure my problem isn't memory, so I'm not sure if these will
>> help much.
>
> Well, having a better GC means the ruby interpreter becomes faster at
> allocating and recycling objects. In the end that means the overall
> memory footprint can be better, but it also means it spends much less
> time doing garbage collection (ie the CPU goes to your code rather
> than to tidying up).

That could be interesting. I haven't tried REE or JRuby on Debian
before; I suppose it's worth a try.

>> >> 3. tried to upgrade rails from 2.3.5 (the debian version) to 2.3.10
>> >> 
>> >>    I didn't see any appreciable difference here. I ended up going back to
>> >> 2.3.5 because that was the packaged version.
>> >
>> > Since you seem to use Debian, make sure you use either the latest ruby
>> > lenny backports (or REE) as they fixed an issue with pthreads and CPU
>> > consumption:
>> > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=579229
>> 
>> I'm using Debian Squeeze, which has the same version you are mentioning
>> from lenny backports (2.3.5).
>
> I was talking about the ruby1.8 package, not rails. Make sure you use
> the squeeze version or the lenny-backports one.

Yep, I'm using the squeeze ruby1.8, which is 1.8.7.302-2

>> >> 5. tried to cache catalogs through adding a http front-end cache and
>> >> expiring that cache when manifests are updated[1] 
>> >> 
>> >>    I'm not sure this works at all.
>> >
>> > This should have helped, because it would prevent the puppetmaster
>> > from even being called. You might check your nginx configuration then.

It wasn't really caching before, because of the nginx parameter you
pointed out in a previous message. But now it seems like it is:

find /var/cache/nginx/cache -type f |wc
     29      29    1769
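
For anyone following along, the cache side of the config is roughly
along these lines (zone name, sizes and expiry here are illustrative
placeholders, not my exact values):

    # http{} level: where cached responses live on disk -- this is the
    # directory the find command above is counting files in
    proxy_cache_path  /var/cache/nginx/cache  levels=1:2
                      keys_zone=puppetcache:10m
                      max_size=500m  inactive=120m;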

> What version of nginx are you using?

0.7.67-3

> Make that:
> ssl_verify_client       optional;
>
> And remove the second server{} block, and make sure your clients do not
> use a different ca_port. But only if you use nginx >= 0.7.64

Ok, that second server block was for the cert request... but it sounds
like if I switch the verify to optional, I don't need it. I'm sure the
clients aren't using a different ca_port (except for the initial node
bootstrap). I've changed that and removed the block.

> If you used ssl_verify_client as I explained above, this should be:
> proxy_set_header           X-Client-Verify   $ssl_client_verify

Changed.
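
For the archives, here's roughly how those two pieces fit together in
the single remaining server{} block (the upstream name, cert paths and
header names are placeholders -- the header names in particular have to
match ssl_client_header / ssl_client_verify_header in puppet.conf):

    server {
        listen                  8140;
        ssl                     on;
        ssl_certificate         /var/lib/puppet/ssl/certs/master.pem;
        ssl_certificate_key     /var/lib/puppet/ssl/private_keys/master.pem;
        ssl_client_certificate  /var/lib/puppet/ssl/ca/ca_crt.pem;

        # accept requests with or without a client cert, so the separate
        # cert-request server{} block is no longer needed
        ssl_verify_client       optional;

        location / {
            proxy_pass         http://puppet_mongrel;  # upstream of the mongrel ports
            proxy_set_header   Host              $host;
            proxy_set_header   X-Real-IP         $remote_addr;
            # tell the master whether/which client cert was verified
            proxy_set_header   X-Client-Verify   $ssl_client_verify;
            proxy_set_header   X-Client-DN       $ssl_client_s_dn;
        }
    }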

>>     # we handle catalog differently
>>     # because we want to cache them
>>     location /production/catalog {
>
> Warning: this ^^ will work only if your nodes are in the "production"
> environment. Adjust for your environments.

/etc/puppet/puppet.conf has:

environment = production

I do occasionally use development environments, but rarely enough that
not having caching is ok.
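
For completeness, the cached catalog location now looks roughly like
this (the cache key deliberately ignores the facts query string so that
repeat requests from the same node can actually hit the cache); the
commented-out regex variant is an untested sketch for the day I want
the other environments cached too:

    location /production/catalog {
        proxy_pass         http://puppet_mongrel;
        proxy_set_header   Host              $host;
        proxy_set_header   X-Client-Verify   $ssl_client_verify;
        proxy_cache        puppetcache;      # zone from proxy_cache_path
        proxy_cache_key    "$host$uri";      # one entry per node catalog URI
        proxy_cache_valid  200 60m;          # purged early when manifests change
    }

    # location ~ ^/[^/]+/catalog {
    #     ... same proxy/cache directives, covers any environment ...
    # }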

> You already have a location '/' above.
> Are you sure nginx is correctly using this configuration?
> Try:
>  nginx -t
> it will check your configuration

Hm, good catch. nginx -t seems ok with it, but I've removed the extra
location '/' just in case.

> This server{} wouldn't be needed if you use the ssl_verify_client as
> explained above.

Removed.

>> >> 7. set --http_compression
>> >>    
>> >>    I'm not sure if this actually hurts the master or not (because it has
>> >>    to now occupy the CPU compressing catalogs?)
>> >
>> > This is a client option, and you need the collaboration of nginx for it
>> > to work. This will certainly add more burden on your master CPU, because
>> > nginx now has to gzip everything you're sending.
>> 
>> Yeah, I have gzip compression turned on in nginx, but I don't really
>> need it and my master could use the break.
>
> Actually your nginx is only compressing text/plain documents, so it
> won't compress your catalogs anyway.

Ah, interesting! Well, again... I'm turning it off on the nodes, it's
not needed.
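
(For future reference: if I ever do want compressed catalogs, my
understanding is it would take something like the following on the
nginx side -- the pson/yaml MIME types are my guess at what the master
actually emits, so they'd need to be checked against the response
Content-Type first.)

    gzip            on;
    # text/html is always compressed when gzip is on; anything else has
    # to be listed explicitly, which is why text/plain alone doesn't
    # catch the catalogs
    gzip_types      text/plain text/pson text/yaml application/x-yaml;
    gzip_min_length 1024;   # don't bother compressing tiny responses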

>> >> 8. tried to follow the introspection technique[2] 
>> >> 
>> >>    this wasn't so easy to do, I had to operate really fast, because if I
>> >>    was too slow the thread would exit, or it would get hung up on:
>> >> 
>> >> [Thread 0xb6194b70 (LWP 25770) exited]
>> >> [New Thread 0xb6194b70 (LWP 25806)]
>> >
>> > When you attach gdb, how many threads are running?
>> 
>> I'm not sure, how can I determine that? I just had the existing 4
>> mongrel processes.
>
> Maybe you can first try to display the full C backtrace for all threads:
> thread apply all bt
>
> Then resume everything, and 2 to 5 seconds later take another snapshot
> with the command above. Comparing the two traces might help us
> understand what the process is doing.

Now that I've fixed up the nginx.conf and caching is actually happening,
catalog compiles are coming in at 10s, 14s, 19s, 10s, 25s, 8s and things
haven't fallen over yet, so it's much better right now. I'm going to let
this run for an hour or two, and if things are still bad I'll look at
the thread traces.

m
