Re: [Puppet Users] Puppetserver 6.0.2 timeouts in the puppetserver log and on the agent side

2019-02-12 Thread Henrik Lindberg

On 2019-02-11 21:59, Mike Sharpton wrote:

Hello Henrik,

The heap at 4GB is as high as I would raise it; as you say, GC 
becomes costly with big heaps.  The memory usage ramps up quite quickly 
to well above the configured max heap within minutes.  It comes up to 
about 5.8GB of usage quickly, as we manage many resources on many nodes.  
We do not have many environments.  We normally have a production branch 
and only use a preprod branch to move changes up our environment, with a 
module that manages the puppet.conf on our nodes.  We will keep looking, 
but I don't see a smoking gun.  Anyone else have any ideas?  Puppet 4 
was able to handle this load with only 4 JRuby workers and 4 Puppet 
servers.  Thanks for your help,


Mike


I suppose you have read this: 
https://puppet.com/docs/puppetserver/6.0/tuning_guide.html


I read your description again. I think it would be of value to look at 
the stack trace you get when the timeout occurs, to figure out what it 
is that is timing out.
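
If the log does not include the full trace, a thread dump taken while a 
run is hanging can show what the server is blocked on. A minimal sketch, 
assuming the JDK tools are installed; the pgrep pattern is illustrative:

  # dump all JVM thread stacks to see what requests are blocked on
  jstack -l $(pgrep -f puppet-server-release.jar) > /tmp/ps-threads.txt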

Best,
- henrik



On Monday, February 11, 2019 at 2:06:11 PM UTC-6, Henrik Lindberg wrote:

On 2019-02-11 14:42, Mike Sharpton wrote:
 > Hey all,
 >
 > We have recently upgraded our environment from Puppetserver 4.2.2 to
 > Puppetserver 6.0.2.  We are running a mix of Puppet 4 and Puppet 6
 > agents until we can get them all upgraded to 6.  We have around 6000
 > nodes, and we had 4 Puppetservers, but we added two more due to
 > capacity issues with Puppet 6.  The load is MUCH higher with Puppet 6.
 > To the question, I am seeing longer and longer agent run times after
 > about two days of the services running.  The only error in the logs
 > that seems to have any relation to this is this string.
 >
 > 2019-02-11T04:32:28.409-06:00 ERROR [qtp1148783071-4075] [p.r.core]
 > Internal Server Error: java.io.IOException:
 > java.util.concurrent.TimeoutException: Idle timeout expired: 30001/3 ms
 >
 > After I restart the puppetserver service, this goes away for about two
 > days.  I think Puppetserver is dying a slow death under this load (load
 > average of around 5-6).  We are running Puppetserver on VMs that are
 > 10X8GB and using 6 JRuby workers per Puppetserver and a 4GB heap.  I
 > have not seen any OOM exceptions and the process never crashes.  Has
 > anyone else seen anything like this?  I did some Googling and didn't
 > find a ton of relevant stuff.  Perhaps we need to upgrade to the latest
 > version to see if this helps?  Even more capacity?  Seems silly.
 > Thanks in advance!

There may be a slow memory leak that over time makes the server busy with
non-productive work (scanning for garbage on an ever-increasing heap).
If you were to increase capacity you would risk only pushing the two days
out to a couple more, but not actually solving the issue.

Try to look at server memory usage over the two days.

Also, naturally, upgrade to latest and make sure modules are updated as
well.

Do you by any chance have many environments with different versions of
Ruby code? The environment isolation command "puppet generate types" may
be of help if that is the case, as loaded Ruby resource types become
sticky in memory.

- henrik

 > Mike


-- 
Visit my Blog "Puppet on the Edge"
http://puppet-on-the-edge.blogspot.se/



Re: [Puppet Users] Puppetserver 6.0.2 timeouts in the puppetserver log and on the agent side

2019-02-12 Thread Mike Sharpton
Hello Justin,

We were thinking the same thing about the JRuby workers.  Perhaps we will 
lower them back to 4 and lower the heap size back to 3GB, which worked fine 
before, now that we have added 2 more Puppet servers.  The behavior we see 
is failing Puppet runs like this on random modules.

Could not evaluate: Could not retrieve file metadata for 
puppet:///modules/modulename/resourcename: SSL_connect returned=6 errno=0 
state=unknown state

Our guess is that something took far too long to answer.  Reports are 
fine, PuppetDB is fine; things always make it there.  We see the failures.  
It is likely we have a herd of agents that checks in at once and sometimes 
makes the situation worse, though if that were the whole cause, this 
should happen every 30 minutes, and it doesn't.  I can't think of any 
other server settings we are managing besides JRuby instances, heap size, 
and a tmp dir for Java to work with (/tmp is noexec here).  We are using 
this JVM, and we do not have any custom tuning.

openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
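
For reference, the heap size and Java tmp dir described above typically 
live in JAVA_ARGS in the service's defaults file. A minimal sketch, 
assuming a stock EL layout; the tmp-dir path is illustrative:

  # /etc/sysconfig/puppetserver (on Debian: /etc/default/puppetserver)
  # 4GB fixed heap, plus a tmp dir that is not mounted noexec
  JAVA_ARGS="-Xms4g -Xmx4g -Djava.io.tmpdir=/opt/puppetlabs/server/tmp"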

We will try adjusting the JRuby/heap ratio now that we have more 
Puppetservers.  We consistently see all JRuby instances being utilized even 
when set at 6.  Another thing we may consider is doing Puppet runs every 
45 minutes instead of 30, which will lower load as well.  Thanks for 
your thoughts,

Mike
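
The run frequency mentioned above is the agent's runinterval setting. A 
minimal sketch of the change, assuming interval-based agent runs rather 
than an external cron trigger:

  # puppet.conf on each agent
  [agent]
  runinterval = 45m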


On Monday, February 11, 2019 at 5:21:06 PM UTC-6, Justin Stoller wrote:
>
>
>
> On Mon, Feb 11, 2019 at 5:42 AM Mike Sharpton wrote:
>
>> Hey all,
>>
>> We have recently upgraded our environment from Puppetserver 4.2.2 to 
>> Puppetserver 6.0.2.  We are running a mix of Puppet 4 and Puppet 6 agents 
>> until we can get them all upgraded to 6.  We have around 6000 nodes, and we 
>> had 4 Puppetservers, but we added two more due to capacity issues with 
>> Puppet 6.  The load is MUCH higher with Puppet 6.  To the question, I am 
>> seeing longer and longer agent run times after about two days of the 
>> services running.  The only error in the logs that seems to have any 
>> relation to this is this string.
>>
>> 2019-02-11T04:32:28.409-06:00 ERROR [qtp1148783071-4075] [p.r.core] 
>> Internal Server Error: java.io.IOException: 
>> java.util.concurrent.TimeoutException: Idle timeout expired: 30001/3 ms
>>
>>
>> After I restart the puppetserver service, this goes away for about two 
>> days.  I think Puppetserver is dying a slow death under this load (load 
>> average of around 5-6).  We are running Puppetserver on VMs that are 
>> 10X8GB and using 6 JRuby workers per Puppetserver and a 4GB heap.  I have 
>> not seen any OOM exceptions and the process never crashes.  Has anyone else 
>> seen anything like this?  I did some Googling and didn't find a ton of 
>> relevant stuff.  Perhaps we need to upgrade to the latest version to see if 
>> this helps?  Even more capacity?  Seems silly.  Thanks in advance!
>>
>
> Off the top of my head:
> 1. Have you tried lowering the JRuby-worker-to-JVM-heap ratio? (I would 
> try 1GB per worker to see if it really is worker performance.)
> 2. That error is most likely from Jetty (it can be tuned with 
> idle-timeout-milliseconds [1]). Are agent runs failing with a 500 from the 
> server when that happens? Are clients failing to post their facts or 
> reports in a timely manner? Is Puppet Server failing its connections to 
> PuppetDB?
> 3. Are you managing any other server settings? Having a low 
> max-requests-per-instance is problematic for newer servers (they more 
> aggressively compile/optimize the Ruby code the worker loads, so with 
> shorter lifetimes the server does a bunch of work only to throw it away 
> and start over, and that can cause much more load).
> 4. What version of Java are you using, and do you have any custom tuning 
> of Java that maybe doesn't work well with newer servers? Server 5+ only 
> has support for Java 8 and will use more non-heap memory/code cache for 
> those new optimizations mentioned above.
>
> HTH,
> Justin
>
>
> 1. 
> https://github.com/puppetlabs/trapperkeeper-webserver-jetty9/blob/master/doc/jetty-config.md#idle-timeout-milliseconds
>
>
>> Mike
>>


Re: [Puppet Users] Puppetserver 6.0.2 timeouts in the puppetserver log and on the agent side

2019-02-11 Thread Justin Stoller
On Mon, Feb 11, 2019 at 5:42 AM Mike Sharpton  wrote:

> Hey all,
>
> We have recently upgraded our environment from Puppetserver 4.2.2 to
> Puppetserver 6.0.2.  We are running a mix of Puppet 4 and Puppet 6 agents
> until we can get them all upgraded to 6.  We have around 6000 nodes, and we
> had 4 Puppetservers, but we added two more due to capacity issues with
> Puppet 6.  The load is MUCH higher with Puppet 6.  To the question, I am
> seeing longer and longer agent run times after about two days of the
> services running.  The only error in the logs that seems to have any
> relation to this is this string.
>
> 2019-02-11T04:32:28.409-06:00 ERROR [qtp1148783071-4075] [p.r.core]
> Internal Server Error: java.io.IOException:
> java.util.concurrent.TimeoutException: Idle timeout expired: 30001/3 ms
>
>
> After I restart the puppetserver service, this goes away for about two
> days.  I think Puppetserver is dying a slow death under this load (load
> average of around 5-6).  We are running Puppetserver on VMs that are
> 10X8GB and using 6 JRuby workers per Puppetserver and a 4GB heap.  I have
> not seen any OOM exceptions and the process never crashes.  Has anyone else
> seen anything like this?  I did some Googling and didn't find a ton of
> relevant stuff.  Perhaps we need to upgrade to the latest version to see if
> this helps?  Even more capacity?  Seems silly.  Thanks in advance!
>

Off the top of my head:
1. Have you tried lowering the JRuby-worker-to-JVM-heap ratio? (I would
try 1GB per worker to see if it really is worker performance.)
2. That error is most likely from Jetty (it can be tuned with
idle-timeout-milliseconds [1]; a config sketch follows the link below).
Are agent runs failing with a 500 from the server when that happens? Are
clients failing to post their facts or reports in a timely manner? Is
Puppet Server failing its connections to PuppetDB?
3. Are you managing any other server settings? Having a low
max-requests-per-instance is problematic for newer servers (they more
aggressively compile/optimize the Ruby code the worker loads, so with
shorter lifetimes the server does a bunch of work only to throw it away
and start over, and that can cause much more load).
4. What version of Java are you using, and do you have any custom tuning
of Java that maybe doesn't work well with newer servers? Server 5+ only
has support for Java 8 and will use more non-heap memory/code cache for
those new optimizations mentioned above.

HTH,
Justin


1.
https://github.com/puppetlabs/trapperkeeper-webserver-jetty9/blob/master/doc/jetty-config.md#idle-timeout-milliseconds
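
A minimal sketch of where the settings named above live, assuming stock
conf.d paths; the values shown are illustrative, not recommendations:

  # /etc/puppetlabs/puppetserver/conf.d/webserver.conf
  webserver: {
      # Jetty idle timeout; the 30001 ms in the log suggests the 30s default
      idle-timeout-milliseconds: 120000
  }

  # /etc/puppetlabs/puppetserver/conf.d/puppetserver.conf
  jruby-puppet: {
      # point 1: roughly 1GB of heap per JRuby worker
      max-active-instances: 4
      # point 3: keep this high (or unset) to avoid constant worker recycling
      max-requests-per-instance: 100000
  }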


> Mike
>



Re: [Puppet Users] Puppetserver 6.0.2 timeouts in the puppetserver log and on the agent side

2019-02-11 Thread Mike Sharpton
Hello Henrik,

The heap at 4GB is as high as I would raise it; as you say, GC 
becomes costly with big heaps.  The memory usage ramps up quite quickly to 
well above the configured max heap within minutes.  It comes up to about 
5.8GB of usage quickly, as we manage many resources on many nodes.  We do 
not have many environments.  We normally have a production branch and only 
use a preprod branch to move changes up our environment, with a module that 
manages the puppet.conf on our nodes.  We will keep looking, but I don't 
see a smoking gun.  Anyone else have any ideas?  Puppet 4 was able to 
handle this load with only 4 JRuby workers and 4 Puppet servers.  Thanks 
for your help,

Mike
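
A note on the resident-memory numbers above: a JVM's resident set is 
expected to exceed -Xmx, since metaspace, the JIT code cache, and thread 
stacks live outside the heap (Justin's point 4 elsewhere in this thread 
touches on this). If non-heap growth is the concern, those pools can be 
capped explicitly; a sketch with illustrative sizes:

  # extra JAVA_ARGS; sizes are illustrative, not recommendations
  -XX:MaxMetaspaceSize=512m
  -XX:ReservedCodeCacheSize=512m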

On Monday, February 11, 2019 at 2:06:11 PM UTC-6, Henrik Lindberg wrote:
>
> On 2019-02-11 14:42, Mike Sharpton wrote: 
> > Hey all, 
> > 
> > We have recently upgraded our environment from Puppetserver 4.2.2 to 
> > Puppetserver 6.0.2.  We are running a mix of Puppet 4 and Puppet 6 
> > agents until we can get them all upgraded to 6.  We have around 6000 
> > nodes, and we had 4 Puppetservers, but we added two more due to capacity 
> > issues with Puppet 6.  The load is MUCH higher with Puppet 6.  To the 
> > question, I am seeing longer and longer agent run times after about two 
> > days of the services running.  The only error in the logs that seems to 
> > have any relation to this is this string. 
> > 
> > 2019-02-11T04:32:28.409-06:00 ERROR [qtp1148783071-4075] [p.r.core] 
> > Internal Server Error: java.io.IOException: 
> > java.util.concurrent.TimeoutException: Idle timeout expired: 30001/3 ms 
> > 
> > 
> > After I restart the puppetserver service, this goes away for about two 
> > days.  I think Puppetserver is dying a slow death under this load (load 
> > average of around 5-6).  We are running Puppetserver on VMs that are 
> > 10X8GB and using 6 JRuby workers per Puppetserver and a 4GB heap.  I 
> > have not seen any OOM exceptions and the process never crashes.  Has 
> > anyone else seen anything like this?  I did some Googling and didn't 
> > find a ton of relevant stuff.  Perhaps we need to upgrade to the latest 
> > version to see if this helps?  Even more capacity?  Seems silly.  Thanks 
> > in advance! 
> > 
>
> There may be a slow memory leak that over time makes the server busy with 
> non-productive work (scanning for garbage on an ever-increasing heap). 
> If you were to increase capacity you would risk only pushing the two days 
> out to a couple more, but not actually solving the issue. 
>
> Try to look at server memory usage over the two days. 
>
> Also, naturally, upgrade to latest and make sure modules are updated as 
> well. 
>
> Do you by any chance have many environments with different versions of 
> Ruby code? The environment isolation command "puppet generate types" may 
> be of help if that is the case, as loaded Ruby resource types become 
> sticky in memory. 
>
> - henrik 
>
> > Mike 
> > 
> -- 
>
> Visit my Blog "Puppet on the Edge" 
> http://puppet-on-the-edge.blogspot.se/ 
>
>



Re: [Puppet Users] Puppetserver 6.0.2 timeouts in the puppetserver log and on the agent side

2019-02-11 Thread Henrik Lindberg

On 2019-02-11 14:42, Mike Sharpton wrote:

Hey all,

We have recently upgraded our environment from Puppetserver 4.2.2 to 
Puppetserver 6.0.2.  We are running a mix of Puppet 4 and Puppet 6 
agents until we can get them all upgraded to 6.  We have around 6000 
nodes, and we had 4 Puppetservers, but we added two more due to capacity 
issues with Puppet 6.  The load is MUCH higher with Puppet 6.  To the 
question, I am seeing longer and longer agent run times after about two 
days of the services running.  The only error in the logs that seems to 
have any relation to this is this string.


2019-02-11T04:32:28.409-06:00 ERROR [qtp1148783071-4075] [p.r.core] 
Internal Server Error: java.io.IOException: 
java.util.concurrent.TimeoutException: Idle timeout expired: 30001/3 ms



After I restart the puppetserver service, this goes away for about two 
days.  I think Puppetserver is dying a slow death under this load (load 
average of around 5-6).  We are running Puppetserver on VMs that are 
10X8GB and using 6 JRuby workers per Puppetserver and a 4GB heap.  I 
have not seen any OOM exceptions and the process never crashes.  Has 
anyone else seen anything like this?  I did some Googling and didn't 
find a ton of relevant stuff.  Perhaps we need to upgrade to the latest 
version to see if this helps?  Even more capacity?  Seems silly.  Thanks 
in advance!




There may be a slow memory leak that over time makes the server busy with 
non-productive work (scanning for garbage on an ever-increasing heap).
If you were to increase capacity you would risk only pushing the two days 
out to a couple more, but not actually solving the issue.


Try to look at server memory usage over the two days.
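
One way to watch that over time, assuming the JDK tools are available on 
the server; the pgrep pattern is illustrative:

  # print heap occupancy and GC time figures every 60s
  jstat -gcutil $(pgrep -f puppet-server-release.jar) 60000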

Also, naturally, upgrade to latest and make sure modules are updated as 
well.


Do you by any chance have many environments with different versions of 
Ruby code? The environment isolation command "puppet generate types" may 
be of help if that is the case, as loaded Ruby resource types become 
sticky in memory.
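
A minimal sketch of running it, to be repeated per environment and 
whenever an environment's types change:

  # write type metadata so environments don't share loaded Ruby types
  puppet generate types --environment production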


- henrik


Mike




-- 
Visit my Blog "Puppet on the Edge"
http://puppet-on-the-edge.blogspot.se/



[Puppet Users] Puppetserver 6.0.2 timeouts in the puppetserver log and on the agent side

2019-02-11 Thread Mike Sharpton
Hey all,

We have recently upgraded our environment from Puppetserver 4.2.2 to 
Puppetserver 6.0.2.  We are running a mix of Puppet 4 and Puppet 6 agents 
until we can get them all upgraded to 6.  We have around 6000 nodes, and we 
had 4 Puppetservers, but we added two more due to capacity issues with 
Puppet 6.  The load is MUCH higher with Puppet 6.  To the question, I am 
seeing longer and longer agent run times after about two days of the 
services running.  The only error in the logs that seems to have any 
relation to this is this string.

2019-02-11T04:32:28.409-06:00 ERROR [qtp1148783071-4075] [p.r.core] 
Internal Server Error: java.io.IOException: 
java.util.concurrent.TimeoutException: Idle timeout expired: 30001/3 ms


After I restart the puppetserver service, this goes away for about two 
days.  I think Puppetserver is dying a slow death under this load (load 
average of around 5-6).  We are running Puppetserver on VMs that are 
10X8GB and using 6 JRuby workers per Puppetserver and a 4GB heap.  I have 
not seen any OOM exceptions and the process never crashes.  Has anyone else 
seen anything like this?  I did some Googling and didn't find a ton of 
relevant stuff.  Perhaps we need to upgrade to the latest version to see if 
this helps?  Even more capacity?  Seems silly.  Thanks in advance!

Mike
