Re: [Dovecot] POP3 error

2011-03-10 Thread Thierry de Montaudry
On 09 Mar 2011, at 20:16, Timo Sirainen wrote:

> On 8.3.2011, at 19.42, Thierry de Montaudry wrote:
> 
>>> So... if the httpd process is the one consuming all of the CPU, doesn't
>>> it stand to reason that it might be something to do with one of your web
>>> apps, and not dovecot?
>>> 
>> But then why was it fine with 1.1.13, which never had once this problem in 2 
>> years? or is 2.0.9 slower, or consuming more resources to create the problem?
> 
> One possibility is that maybe v2.0 works a bit differently.. Maybe it causes 
> webmail to use a new feature that wasn't yet in v1.1, which causes more CPU?
Yes, possibly. I will investigate which features the webmail might be using 
now that it did not use previously.
> 
> I also just heard that apparently this "Resource temporarily unavailable" can 
> happen if service imap/pop3-login's client_limit is too large. I'm not really 
> sure why, but you could try reducing them to e.g. 50.
I reduced the limits with process_limit. I'm wondering if I should use 
client_limit as well, but I couldn't find much documentation. Could you shed 
any light on that?
> 
> Do you remember how high the CPU usage was at peak times in v1.1? Has that 
> changed? Is the problem maybe that v2.0 just fails in a different way by 
> logging these failures, where v1.1 wouldn't even accept as many incoming 
> connections?
v1.1 was about the same as now: load average between 3 and 4 from 9pm to 4am, 
no change on that side. It looks like the new version only reaches some limit 
when there are spikes.
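
On the process_limit versus client_limit question, here is a minimal sketch of how the two settings might combine. It assumes the Dovecot 2.0 semantics as commonly described (process_limit caps the number of processes of a service, client_limit caps simultaneous connections per process, and with service_count = 1 each login process serves a single connection). None of this is confirmed in the thread, and the function name and sample numbers are hypothetical.

```python
def max_login_connections(process_limit: int, client_limit: int,
                          service_count: int = 1) -> int:
    """Rough upper bound on simultaneous connections for one login service.

    Assumption (not confirmed in this thread): with service_count = 1 each
    process serves one connection, so client_limit has no effect; otherwise
    each process may hold up to client_limit connections.
    """
    if process_limit < 1 or client_limit < 1:
        raise ValueError("limits must be positive")
    if service_count == 1:
        return process_limit
    return process_limit * client_limit


# Hypothetical numbers: 10 login processes, client_limit reduced to 50,
# and service_count = 0 (processes are reused for many connections).
print(max_login_connections(process_limit=10, client_limit=50, service_count=0))
```

Under these assumptions, lowering only process_limit and lowering only client_limit both shrink the same product, which may be why either change appears to help.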



Re: [Dovecot] POP3 error

2011-03-09 Thread Timo Sirainen
On 8.3.2011, at 19.42, Thierry de Montaudry wrote:

>> So... if the httpd process is the one consuming all of the CPU, doesn't
>> it stand to reason that it might be something to do with one of your web
>> apps, and not dovecot?
>> 
> But then why was it fine with 1.1.13, which never had once this problem in 2 
> years? or is 2.0.9 slower, or consuming more resources to create the problem?

One possibility is that maybe v2.0 works a bit differently.. Maybe it causes 
webmail to use a new feature that wasn't yet in v1.1, which causes more CPU?

I also just heard that apparently this "Resource temporarily unavailable" can 
happen if service imap/pop3-login's client_limit is too large. I'm not really 
sure why, but you could try reducing them to e.g. 50.

Do you remember how high the CPU usage was at peak times in v1.1? Has that 
changed? Is the problem maybe that v2.0 just fails in a different way by 
logging these failures, where v1.1 wouldn't even accept as many incoming 
connections?



Re: [Dovecot] POP3 error

2011-03-08 Thread Attila Nagy

On 03/08/2011 10:37 PM, Robert Schetterer wrote:
> On 08.03.2011 21:38, Attila Nagy wrote:
>> On 03/08/2011 06:51 PM, Charles Marcus wrote:
>>> On 2011-03-08 12:42 PM, Thierry de Montaudry wrote:
>>>> On 08 Mar 2011, at 19:37, Charles Marcus wrote:
>>>>> So... if the httpd process is the one consuming all of the CPU, doesn't
>>>>> it stand to reason that it might be something to do with one of your
>>>>> web apps, and not dovecot?
>>>> But then why was it fine with 1.1.13, which never had once this
>>>> problem in 2 years? or is 2.0.9 slower, or consuming more resources
>>>> to create the problem?
>>> You don't see how it might be possible that 2.0.x does something that
>>> 1.1.x didn't do that your webmail app might not like, without it being a
>>> dovecot bug?
>>>
>>> I'm not saying it is or it isn't, but I'd look there first - see if an
>>> update is available for your webmail app... since you were running an
>>> ancient version of dovecot, maybe you're also running an ancient version
>>> of it too?
>>>
>> I can see similar problems (subject: "Restarting dovecot-auth stops
>> authentication"), on a different OS, and nothing common in the webmail
>> area.
>>
>> I think this is clearly related to Dovecot. It handles load very badly
>> (well, it seems at least on common OS settings), doesn't just slow down,
>> but starts to refuse clients.
>> It seems to be obvious that the interprocess socket communication is
>> where it fails, so this is what needs to be investigated.
>> Sadly, doing this on a machine, which cries for a deep breath already is
>> not always easy.
>
> you might upgrade to the latest 2.x code
> as it might possible your using more stuff
> then you had in older versions, after all there was a long performance
> thread on this list , look for it in archives

I'm running the latest 2.x code (well, sort of: I haven't upgraded to 
2.0.10 because of the LDAP bug, so I have both .9 and .11). I've never 
run 1.x on these machines.
I've run qmail and courier. They are pretty different in their 
architecture: this kind of thing (unix socket communication 
between persistently running daemons) is not involved, so there can't be a 
problem where, for example, five thousand connections are made at the 
same moment to a single socket/process.
There, there will be five thousand forks/execs, which won't fail with 
connection refused; they will be served as fast as the machine can 
handle them (modulo available memory/file descriptors/etc. of course).
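
The scenario of many simultaneous connections to one socket can be sketched as follows. "Resource temporarily unavailable" is the message text for EAGAIN, and Linux's connect(2) documents EAGAIN for non-blocking UNIX domain sockets when the connection cannot be completed immediately. The sketch assumes Linux behaviour and a listener that never calls accept(). The function name, socket path and backlog value are hypothetical, and whether this is what Dovecot actually hits is not confirmed in this thread.

```python
import errno
import os
import socket
import tempfile


def first_connect_error(backlog: int = 1, attempts: int = 20):
    """Open non-blocking connections to a UNIX listener that never accepts.

    Returns the errno of the first connect() that fails, or None if all
    attempts succeeded.
    """
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "demo.sock")
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    clients = []
    try:
        server.bind(path)
        server.listen(backlog)  # deliberately never accept()
        for _ in range(attempts):
            client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            client.setblocking(False)
            clients.append(client)
            try:
                client.connect(path)
            except OSError as exc:
                return exc.errno
        return None
    finally:
        for client in clients:
            client.close()
        server.close()
        if os.path.exists(path):
            os.unlink(path)
        os.rmdir(tmpdir)


if __name__ == "__main__":
    err = first_connect_error()
    print("no error" if err is None else os.strerror(err))
```

A fork-per-connection design avoids this particular queue because each connection is handed to a new process rather than waiting on a shared listener, which matches the contrast drawn above.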




Re: [Dovecot] POP3 error

2011-03-08 Thread Attila Nagy

On 03/08/2011 09:58 PM, Charles Marcus wrote:
>> I think this is clearly related to Dovecot. It handles load very badly
> Whoa, pardner, fyi, there are many, many installations humming along
> smoothly.

No offense. It may be more correct to say that it handles badly those 
situations where the OS can't promptly deliver resources to Dovecot, such as 
saturated disk I/O and the like.
I can't see such problems with moderate load, and maybe there aren't so 
many installations that handle a lot of traffic. I don't know.
I don't think it's a bug; currently it seems to me to be a 
tuning/configuration issue. But maybe it's a common design-related 
issue that is not yet fully explored.

>> (well, it seems at least on common OS settings), doesn't just slow down,
>> but starts to refuse clients.
> Maybe there is a bug somewhere that only becomes evident under certain
> circumstances, but it is also possibly due to config problems caused by...

Sure.


Re: [Dovecot] POP3 error

2011-03-08 Thread Robert Schetterer
On 08.03.2011 21:38, Attila Nagy wrote:
>  On 03/08/2011 06:51 PM, Charles Marcus wrote:
>> On 2011-03-08 12:42 PM, Thierry de Montaudry wrote:
>>> On 08 Mar 2011, at 19:37, Charles Marcus wrote:
>>>> So... if the httpd process is the one consuming all of the CPU, doesn't
>>>> it stand to reason that it might be something to do with one of your
>>>> web apps, and not dovecot?
>>> But then why was it fine with 1.1.13, which never had once this
>>> problem in 2 years? or is 2.0.9 slower, or consuming more resources
>>> to create the problem?
>> You don't see how it might be possible that 2.0.x does something that
>> 1.1.x didn't do that your webmail app might not like, without it being a
>> dovecot bug?
>>
>> I'm not saying it is or it isn't, but I'd look there first - see if an
>> update is available for your webmail app... since you were running an
>> ancient version of dovecot, maybe you're also running an ancient version
>> of it too?
>>
> I can see similar problems (subject: "Restarting dovecot-auth stops
> authentication"), on a different OS, and nothing common in the webmail
> area.
> 
> I think this is clearly related to Dovecot. It handles load very badly
> (well, it seems at least on common OS settings), doesn't just slow down,
> but starts to refuse clients.
> It seems to be obvious that the interprocess socket communication is
> where it fails, so this is what needs to be investigated.
> Sadly, doing this on a machine, which cries for a deep breath already is
> not always easy.

You might upgrade to the latest 2.x code,
as it is possible that you are using more features
than you did in the older version. There was also a long performance
thread on this list; look for it in the archives.

-- 
Best Regards

MfG Robert Schetterer

Germany/Munich/Bavaria


Re: [Dovecot] POP3 error

2011-03-08 Thread Charles Marcus
On 2011-03-08 3:38 PM, Attila Nagy wrote:
>  On 03/08/2011 06:51 PM, Charles Marcus wrote:
>> You don't see how it might be possible that 2.0.x does something that
>> 1.1.x didn't do that your webmail app might not like, without it being a
>> dovecot bug?
>>
>> I'm not saying it is or it isn't, but I'd look there first - see if an
>> update is available for your webmail app... since you were running an
>> ancient version of dovecot, maybe you're also running an ancient version
>> of it too?

> I can see similar problems (subject: "Restarting dovecot-auth stops
> authentication"), on a different OS, and nothing common in the webmail
> area.

Similar problem? I just read that entire thread, and there was
absolutely no mention of high resource usage, and it was the 4th or 5th
email before you finally provided system details (which should always be
provided in the first email to save time) and Timo noticed that you had
changed some defaults that you shouldn't have... so I don't think that
thread qualifies as being anywhere near similar.

> I think this is clearly related to Dovecot. It handles load very badly

Whoa, pardner, fyi, there are many, many installations humming along
smoothly.

> (well, it seems at least on common OS settings), doesn't just slow down,
> but starts to refuse clients.

Maybe there is a bug somewhere that only becomes evident under certain
circumstances, but it is also possibly due to config problems caused by...



-- 

Best regards,

Charles


Re: [Dovecot] POP3 error

2011-03-08 Thread Attila Nagy

On 03/08/2011 06:51 PM, Charles Marcus wrote:
> On 2011-03-08 12:42 PM, Thierry de Montaudry wrote:
>> On 08 Mar 2011, at 19:37, Charles Marcus wrote:
>>> So... if the httpd process is the one consuming all of the CPU, doesn't
>>> it stand to reason that it might be something to do with one of your web
>>> apps, and not dovecot?
>> But then why was it fine with 1.1.13, which never had once this
>> problem in 2 years? or is 2.0.9 slower, or consuming more resources
>> to create the problem?
> You don't see how it might be possible that 2.0.x does something that
> 1.1.x didn't do that your webmail app might not like, without it being a
> dovecot bug?
>
> I'm not saying it is or it isn't, but I'd look there first - see if an
> update is available for your webmail app... since you were running an
> ancient version of dovecot, maybe you're also running an ancient version
> of it too?

I can see similar problems (subject: "Restarting dovecot-auth stops 
authentication"), on a different OS, and nothing common in the webmail area.


I think this is clearly related to Dovecot. It handles load very badly 
(well, it seems at least on common OS settings), doesn't just slow down, 
but starts to refuse clients.
It seems to be obvious that the interprocess socket communication is 
where it fails, so this is what needs to be investigated.
Sadly, doing this on a machine, which cries for a deep breath already is 
not always easy.


Re: [Dovecot] POP3 error

2011-03-08 Thread Charles Marcus
On 2011-03-08 12:42 PM, Thierry de Montaudry wrote:
> On 08 Mar 2011, at 19:37, Charles Marcus wrote:
>> So... if the httpd process is the one consuming all of the CPU, doesn't
>> it stand to reason that it might be something to do with one of your web
>> apps, and not dovecot?

> But then why was it fine with 1.1.13, which never had once this
> problem in 2 years? or is 2.0.9 slower, or consuming more resources
> to create the problem?

You don't see how it might be possible that 2.0.x does something that
1.1.x didn't do that your webmail app might not like, without it being a
dovecot bug?

I'm not saying it is or it isn't, but I'd look there first - see if an
update is available for your webmail app... since you were running an
ancient version of dovecot, maybe you're also running an ancient version
of it too?

-- 

Best regards,

Charles


Re: [Dovecot] POP3 error

2011-03-08 Thread Charles Marcus
On 2011-03-08 12:30 PM, Thierry de Montaudry wrote:
> On 08 Mar 2011, at 19:11, Charles Marcus wrote:
>> The reason I asked about your webmail server is you had specifically
>> said that it was the httpd process that was consuming all of the CPU...

> Yes, because they were in the top of the top list.

And they were on the top of the list because... they were consuming all
of the CPU?

-- 

Best regards,

Charles


Re: [Dovecot] POP3 error

2011-03-08 Thread Thierry de Montaudry

On 08 Mar 2011, at 19:37, Charles Marcus wrote:

> On 2011-03-08 12:30 PM, Thierry de Montaudry wrote:
>> On 08 Mar 2011, at 19:11, Charles Marcus wrote:
>>> The reason I asked about your webmail server is you had specifically
>>> said that it was the httpd process that was consuming all of the CPU...
> 
>> Yes, because they were in the top of the top list.
> 
> So... if the httpd process is the one consuming all of the CPU, doesn't
> it stand to reason that it might be something to do with one of your web
> apps, and not dovecot?
> 
But then why was it fine with 1.1.13, which never had once this problem in 2 
years? or is 2.0.9 slower, or consuming more resources to create the problem?



Re: [Dovecot] POP3 error

2011-03-08 Thread Charles Marcus
On 2011-03-08 12:30 PM, Thierry de Montaudry wrote:
> On 08 Mar 2011, at 19:11, Charles Marcus wrote:
>> The reason I asked about your webmail server is you had specifically
>> said that it was the httpd process that was consuming all of the CPU...

> Yes, because they were in the top of the top list.

So... if the httpd process is the one consuming all of the CPU, doesn't
it stand to reason that it might be something to do with one of your web
apps, and not dovecot?

-- 

Best regards,

Charles


Re: [Dovecot] POP3 error

2011-03-08 Thread Thierry de Montaudry

On 08 Mar 2011, at 19:12, Charles Marcus wrote:

> On 2011-03-08 12:00 PM, Thierry de Montaudry wrote:
>> but moving from dovecot 1.10.13 to 2.0.9
> 
> First time I thought it was a typo and ignored it...
> 
> There has never been a version 1.10.xxx
> 
> Maybe you mean 1.0.13?
>  

Sorry, my mistake: 1.1.13, the version shipped with CentOS 5.



Re: [Dovecot] POP3 error

2011-03-08 Thread Thierry de Montaudry

On 08 Mar 2011, at 19:11, Charles Marcus wrote:

> On 2011-03-08 11:49 AM, Thierry de Montaudry wrote:
>> Using HastyMail2-1.0. But the problem only started when we moved to
>> dovecot 2.0.9 (from 1.10.13), without changing anything else on any
>> of our 7 machines, and now it's happening randomly on any of them. So
>> that's why I suspect it has to do with dovecot.
> 
> Or an interaction of the new version of Dovecot and HastyMail.
> 
> The reason I asked about your webmail server is you had specifically
> said that it was the httpd process that was consuming all of the CPU...
> 

Yes, because they were at the top of the top list.


Re: [Dovecot] POP3 error

2011-03-08 Thread Charles Marcus
On 2011-03-08 12:00 PM, Thierry de Montaudry wrote:
> but moving from dovecot 1.10.13 to 2.0.9

First time I thought it was a typo and ignored it...

There has never been a version 1.10.xxx

Maybe you mean 1.0.13?

-- 

Best regards,

Charles


Re: [Dovecot] POP3 error

2011-03-08 Thread Charles Marcus
On 2011-03-08 11:49 AM, Thierry de Montaudry wrote:
> Using HastyMail2-1.0. But the problem only started when we moved to
> dovecot 2.0.9 (from 1.10.13), without changing anything else on any
> of our 7 machines, and now it's happening randomly on any of them. So
> that's why I suspect it has to do with dovecot.

Or an interaction of the new version of Dovecot and HastyMail.

The reason I asked about your webmail server is you had specifically
said that it was the httpd process that was consuming all of the CPU...

-- 

Best regards,

Charles


Re: [Dovecot] POP3 error

2011-03-08 Thread Thierry de Montaudry

On 08 Mar 2011, at 18:26, Chris Wilson wrote:

> Hi Thierry,
> 
> On Tue, 8 Mar 2011, Thierry de Montaudry wrote:
>> On 08 Mar 2011, at 13:24, Chris Wilson wrote:
>>> 
>>>> top - 11:10:14 up 14 days, 12:04,  2 users,  load average: 55.04, 29.13, 14.55
>>>> Tasks: 474 total,  60 running, 414 sleeping,   0 stopped,   0 zombie
>>>> Cpu(s): 99.6%us,  0.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.1%si,  0.0%st
>>>> Mem:  16439812k total, 16353268k used,    86544k free,    33268k buffers
>>>> Swap:  4192956k total,      140k used,  4192816k free,  8228744k cached
>> 
>> As you can see the numbers (55.04, 29.13, 14.55) the load was busy 
>> getting higher when I took this snapshot and this was not a normal 
>> situation. Usually this machine's load is only between 1 and 4, which is 
>> quite ok for a quad core. It only happens when dovecot start generating 
>> errors, and pop3, imap and http get stuck.  It went up to 200, and I was 
>> still able to stop web and mail daemons, then restart them, and 
>> everything was back to normal.
> 
> I don't have a definite answer, but I remember that there has been a 
> long-running bug in the Linux kernel with schedulers behaving badly under 
> heavy writes:
> 
> "One of the problems commonly talked about in our forums and elsewhere is 
> the poor responsiveness of the Linux desktop when dealing with significant 
> disk activity on systems where there is insufficient RAM or the disks are 
> slow. The GUI basically drops to its knees when there is too much disk 
> activity..." [http://www.phoronix.com/scan.php?page=news_item&px=ODQ3Mw] 
> (note, it's not just the GUI, all other tasks can starve when a disk I/O 
> queue builds up).
> 
> "There are a few options to tune the linux IO scheduler that can help a 
> bunch... Typically CFQ stalls too long under heavy writes, especially if 
> your disk subsystem sucks, so particularly if you have several spindles 
> deadline is worth a try." [http://communities.vmware.com/thread/82544]
> 
> "I run Ubuntu on a moderately powerful quad-core x86-64 system and the 
> desktop response is basically crippled whenever something is reading or 
> writing large files as fast as it can (at normal priority)... For example, 
> cat /path/to/LARGE_FILE > /dev/null ... Everything else gets completely 
> unusable because of the I/O latency."
> [https://bugs.launchpad.net/ubuntu/+source/linux/+bug/343371]
> 
> "I was just running mkfs.ext4 -b 4096 -E stride=128 -E stripe-width=128 -O 
> ^has_journal /dev/sdb2 on my SSD18M connected via USB1.1, and the result 
> was, well, absolutely, positively _DEVASTATING_. The entire system became 
> _FULLY_ unresponsive, not even switching back down to tty1 via Ctrl-Alt-F1 
> worked (took 20 seconds for even this key to be respected)." 
> [http://lkml.org/lkml/2010/4/4/86]
> 
> "This regression has been around since about the 2.6.18 timeframe and has 
> eluded a lot of testing to isolate the root cause. The most promising fix 
> is in the VM subsystem (mm) where the LRU scan has been changed to favor 
> keeping executable pages active longer. Most of these symptoms come down 
> to VM thrashing to make room for I/O pages. The key change/commit is 
> ab4754d24a0f2e05920170c845bd84472814c6, "vmscan: make mapped executable 
> pages the first class citizen"... This change was merged into the 2.6.31r1 
> kernel." 
> [https://bugs.launchpad.net/ubuntu/+source/linux/+bug/131094/comments/235]
> 
> One possible cause is that writing to a slow device can block the write 
> queue for other devices, causing the machine to come to a standstill when 
> there's plenty of useful work that it could be doing.
> 
> This could cause a cascading failure in your server as soon as disk 
> I/O write load goes over a certain point, a bit like a swap death. I'm not 
> sure if the fact that you're using NFS makes a difference; perhaps only if 
> you memory-map files?
> 
> You could test this by booting with the NOOP or anticipatory scheduler 
> instead of the default CFQ to see if it makes any difference.
> 
> Cheers, Chris.

Hi Chris,

Thanks for your (long) comment and technical details. But since we changed 
nothing on the 7 machines apart from moving from dovecot 1.10.13 to 2.0.9, and 
our traffic has not increased, I don't want to start changing tricky system 
settings when everything worked fine for almost 2 years. And the fact that all 
mail is stored on multiple NFS servers, with every machine having 16G of RAM, 
makes me think that it's not an I/O problem.
I thought it might be the system running out of resources, but there is nothing 
about it in the logs...
For now, we might consider reverting to 1.10.13, but that would mean losing the 
new features that made us upgrade, so not good.




Re: [Dovecot] POP3 error

2011-03-08 Thread Thierry de Montaudry

On 08 Mar 2011, at 18:14, Charles Marcus wrote:

> On 2011-03-08 10:40 AM, Thierry de Montaudry wrote:
>> On 08 Mar 2011, at 13:24, Chris Wilson wrote:
>>> There's nothing to debug in dovecot here. Your server is overloaded
>>> by about 55 times. Buy 55 times as many servers or do something
>>> about your webmail interface (maybe a separate webmail cluster).
> 
>> As you can see the numbers (55.04, 29.13, 14.55) the load was busy
>> getting higher when I took this snapshot and this was not a normal
>> situation. Usually this machine's load is only between 1 and 4, which
>> is quite ok for a quad core. It only happens when dovecot start
>> generating errors, and pop3, imap and http get stuck.  It went up to
>> 200, and I was still able to stop web and mail daemons, then restart
>> them, and everything was back to normal.
> 
> What is your webmail server (and version)? Maybe it is buggy?
> 
Using HastyMail2-1.0. But the problem only started when we moved to dovecot 
2.0.9 (from 1.10.13), without changing anything else on any of our 7 machines, 
and now it's happening randomly on any of them. So that's why I suspect it has 
to do with dovecot.



Re: [Dovecot] POP3 error

2011-03-08 Thread Eric Shubert

On 03/08/2011 09:26 AM, Chris Wilson wrote:
> Hi Thierry,
>
> On Tue, 8 Mar 2011, Thierry de Montaudry wrote:
>> On 08 Mar 2011, at 13:24, Chris Wilson wrote:
>>>> top - 11:10:14 up 14 days, 12:04,  2 users,  load average: 55.04, 29.13, 14.55
>>>> Tasks: 474 total,  60 running, 414 sleeping,   0 stopped,   0 zombie
>>>> Cpu(s): 99.6%us,  0.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.1%si,  0.0%st
>>>> Mem:  16439812k total, 16353268k used,    86544k free,    33268k buffers
>>>> Swap:  4192956k total,      140k used,  4192816k free,  8228744k cached
>>
>> As you can see the numbers (55.04, 29.13, 14.55) the load was busy
>> getting higher when I took this snapshot and this was not a normal
>> situation. Usually this machine's load is only between 1 and 4, which is
>> quite ok for a quad core. It only happens when dovecot start generating
>> errors, and pop3, imap and http get stuck.  It went up to 200, and I was
>> still able to stop web and mail daemons, then restart them, and
>> everything was back to normal.
>
> I don't have a definite answer, but I remember that there has been a
> long-running bug in the Linux kernel with schedulers behaving badly under
> heavy writes:
>
> "One of the problems commonly talked about in our forums and elsewhere is
> the poor responsiveness of the Linux desktop when dealing with significant
> disk activity on systems where there is insufficient RAM or the disks are
> slow. The GUI basically drops to its knees when there is too much disk
> activity..." [http://www.phoronix.com/scan.php?page=news_item&px=ODQ3Mw]
> (note, it's not just the GUI, all other tasks can starve when a disk I/O
> queue builds up).
>
> "There are a few options to tune the linux IO scheduler that can help a
> bunch... Typically CFQ stalls too long under heavy writes, especially if
> your disk subsystem sucks, so particularly if you have several spindles
> deadline is worth a try." [http://communities.vmware.com/thread/82544]
>
> "I run Ubuntu on a moderately powerful quad-core x86-64 system and the
> desktop response is basically crippled whenever something is reading or
> writing large files as fast as it can (at normal priority)... For example,
> cat /path/to/LARGE_FILE > /dev/null ... Everything else gets completely
> unusable because of the I/O latency."
> [https://bugs.launchpad.net/ubuntu/+source/linux/+bug/343371]
>
> "I was just running mkfs.ext4 -b 4096 -E stride=128 -E stripe-width=128 -O
> ^has_journal /dev/sdb2 on my SSD18M connected via USB1.1, and the result
> was, well, absolutely, positively _DEVASTATING_. The entire system became
> _FULLY_ unresponsive, not even switching back down to tty1 via Ctrl-Alt-F1
> worked (took 20 seconds for even this key to be respected)."
> [http://lkml.org/lkml/2010/4/4/86]
>
> "This regression has been around since about the 2.6.18 timeframe and has
> eluded a lot of testing to isolate the root cause. The most promising fix
> is in the VM subsystem (mm) where the LRU scan has been changed to favor
> keeping executable pages active longer. Most of these symptoms come down
> to VM thrashing to make room for I/O pages. The key change/commit is
> ab4754d24a0f2e05920170c845bd84472814c6, "vmscan: make mapped executable
> pages the first class citizen"... This change was merged into the 2.6.31r1
> kernel."
> [https://bugs.launchpad.net/ubuntu/+source/linux/+bug/131094/comments/235]
>
> One possible cause is that writing to a slow device can block the write
> queue for other devices, causing the machine to come to a standstill when
> there's plenty of useful work that it could be doing.
>
> This could cause a cascading failure in your server as soon as disk
> I/O write load goes over a certain point, a bit like a swap death. I'm not
> sure if the fact that you're using NFS makes a difference; perhaps only if
> you memory-map files?
>
> You could test this by booting with the NOOP or anticipatory scheduler
> instead of the default CFQ to see if it makes any difference.
>
> Cheers, Chris.


You can change it on the fly with:
`echo noop > /sys/block/${DEVICE}/queue/scheduler`
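
Before and after switching, it may help to confirm which scheduler is active. The kernel lists the available schedulers in the same sysfs file and marks the active one with square brackets. A minimal sketch, where the function names and the sample device name are hypothetical:

```python
from pathlib import Path


def active_scheduler(contents: str) -> str:
    """Return the scheduler name that the kernel marks with [brackets]."""
    for name in contents.split():
        if name.startswith("[") and name.endswith("]"):
            return name[1:-1]
    raise ValueError(f"no active scheduler marked in {contents!r}")


def read_active_scheduler(device: str) -> str:
    """Read the active scheduler for a block device such as 'sda'."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return active_scheduler(path.read_text())


# Sample file content with CFQ active, as on kernels of that era.
print(active_scheduler("noop anticipatory deadline [cfq]"))
```

On a live machine, `read_active_scheduler("sda")` would read the real file; the device name has to match an entry under /sys/block.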

--
-Eric 'shubes'



Re: [Dovecot] POP3 error

2011-03-08 Thread Chris Wilson
Hi Thierry,

On Tue, 8 Mar 2011, Thierry de Montaudry wrote:
> On 08 Mar 2011, at 13:24, Chris Wilson wrote:
> >
> >> top - 11:10:14 up 14 days, 12:04,  2 users,  load average: 55.04, 29.13, 14.55
> >> Tasks: 474 total,  60 running, 414 sleeping,   0 stopped,   0 zombie
> >> Cpu(s): 99.6%us,  0.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.1%si,  0.0%st
> >> Mem:  16439812k total, 16353268k used,    86544k free,    33268k buffers
> >> Swap:  4192956k total,      140k used,  4192816k free,  8228744k cached
>
> As you can see the numbers (55.04, 29.13, 14.55) the load was busy 
> getting higher when I took this snapshot and this was not a normal 
> situation. Usually this machine's load is only between 1 and 4, which is 
> quite ok for a quad core. It only happens when dovecot start generating 
> errors, and pop3, imap and http get stuck.  It went up to 200, and I was 
> still able to stop web and mail daemons, then restart them, and 
> everything was back to normal.

I don't have a definite answer, but I remember that there has been a 
long-running bug in the Linux kernel with schedulers behaving badly under 
heavy writes:

"One of the problems commonly talked about in our forums and elsewhere is 
the poor responsiveness of the Linux desktop when dealing with significant 
disk activity on systems where there is insufficient RAM or the disks are 
slow. The GUI basically drops to its knees when there is too much disk 
activity..." [http://www.phoronix.com/scan.php?page=news_item&px=ODQ3Mw] 
(note, it's not just the GUI, all other tasks can starve when a disk I/O 
queue builds up).

"There are a few options to tune the linux IO scheduler that can help a 
bunch... Typically CFQ stalls too long under heavy writes, especially if 
your disk subsystem sucks, so particularly if you have several spindles 
deadline is worth a try." [http://communities.vmware.com/thread/82544]

"I run Ubuntu on a moderately powerful quad-core x86-64 system and the 
desktop response is basically crippled whenever something is reading or 
writing large files as fast as it can (at normal priority)... For example, 
cat /path/to/LARGE_FILE > /dev/null ... Everything else gets completely 
unusable because of the I/O latency."
[https://bugs.launchpad.net/ubuntu/+source/linux/+bug/343371]

"I was just running mkfs.ext4 -b 4096 -E stride=128 -E stripe-width=128 -O 
^has_journal /dev/sdb2 on my SSD18M connected via USB1.1, and the result 
was, well, absolutely, positively _DEVASTATING_. The entire system became 
_FULLY_ unresponsive, not even switching back down to tty1 via Ctrl-Alt-F1 
worked (took 20 seconds for even this key to be respected)." 
[http://lkml.org/lkml/2010/4/4/86]

"This regression has been around since about the 2.6.18 timeframe and has 
eluded a lot of testing to isolate the root cause. The most promising fix 
is in the VM subsystem (mm) where the LRU scan has been changed to favor 
keeping executable pages active longer. Most of these symptoms come down 
to VM thrashing to make room for I/O pages. The key change/commit is 
ab4754d24a0f2e05920170c845bd84472814c6, "vmscan: make mapped executable 
pages the first class citizen"... This change was merged into the 2.6.31r1 
kernel." 
[https://bugs.launchpad.net/ubuntu/+source/linux/+bug/131094/comments/235]

One possible cause is that writing to a slow device can block the write 
queue for other devices, causing the machine to come to a standstill when 
there's plenty of useful work that it could be doing.

This could cause a cascading failure in your server as soon as disk 
I/O write load goes over a certain point, a bit like a swap death. I'm not 
sure if the fact that you're using NFS makes a difference; perhaps only if 
you memory-map files?

You could test this by booting with the NOOP or anticipatory scheduler 
instead of the default CFQ to see if it makes any difference.

Cheers, Chris.
-- 
Aptivate | http://www.aptivate.org | Phone: +44 1223 760887
The Humanitarian Centre, Fenner's, Gresham Road, Cambridge CB1 2ES

Aptivate is a not-for-profit company registered in England and Wales
with company number 04980791.


Re: [Dovecot] POP3 error

2011-03-08 Thread Charles Marcus
On 2011-03-08 10:40 AM, Thierry de Montaudry wrote:
> On 08 Mar 2011, at 13:24, Chris Wilson wrote:
>> There's nothing to debug in dovecot here. Your server is overloaded
>> by about 55 times. Buy 55 times as many servers or do something
>> about your webmail interface (maybe a separate webmail cluster).

> As you can see the numbers (55.04, 29.13, 14.55) the load was busy
> getting higher when I took this snapshot and this was not a normal
> situation. Usually this machine's load is only between 1 and 4, which
> is quite ok for a quad core. It only happens when dovecot start
> generating errors, and pop3, imap and http get stuck.  It went up to
> 200, and I was still able to stop web and mail daemons, then restart
> them, and everything was back to normal.

What is your webmail server (and version)? Maybe it is buggy?

-- 

Best regards,

Charles


Re: [Dovecot] POP3 error

2011-03-08 Thread Thierry de Montaudry

On 08 Mar 2011, at 13:24, Chris Wilson wrote:

> Hi Thierry,
> 
> On Tue, 8 Mar 2011, Thierry de Montaudry wrote:
>> On 07 Mar 2011, at 19:15, Timo Sirainen wrote:
>>> On Mon, 2011-03-07 at 19:03 +0200, Thierry de Montaudry wrote:
>>>>> Mar  7 11:19:51 xxx dovecot: pop3-login: Error: 
>>>>> net_connect_unix(pop3) failed: Resource temporarily unavailable
>>>>> ..
>>>> As it is happening at least once a day, is there anything I can do to 
>>>> trace it? and whatever I'll do, will it slow down those machines?
>>> 
>>> Set verbose_proctitle=yes (won't slow down) and get list of all 
>>> Dovecot processes when it happens. And check how much user and system 
>>> CPU it's using and what the load is.
>> 
>> Got the same problem this morning, here is the CPU usage and ps aux for 
>> dovecot. plus the different error I could pick up in the log, most of 
>> them are repeated a couple of times.
>> 
>> I suspect it's a problem with system resources, but can't find any message 
>> to tell me what. Mail is stored on 17 NFS servers (CentOS), plus 3 
>> servers for indexes only.
>> 
>> CPU load is very high, but mainly from httpd running our webmail 
>> interface, which uses the local imap server.
> [...]
>> top - 11:10:14 up 14 days, 12:04,  2 users,  load average: 55.04, 29.13, 
>> 14.55
>> Tasks: 474 total,  60 running, 414 sleeping,   0 stopped,   0 zombie
>> Cpu(s): 99.6%us,  0.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.1%si,  
>> 0.0%st
>> Mem:  16439812k total, 16353268k used,    86544k free,    33268k buffers
>> Swap:  4192956k total,  140k used,  4192816k free,  8228744k cached
> 
> You're lucky this server is still alive and that you could even run top 
> and ps on it.
> 
> There's nothing to debug in dovecot here. Your server is overloaded by 
> about 55 times. Buy 55 times as many servers or do something about your 
> webmail interface (maybe a separate webmail cluster).
> 
> Cheers, Chris.
> 
As you can see from the numbers (55.04, 29.13, 14.55), the load was still 
climbing when I took this snapshot, and this was not a normal situation. Usually 
this machine's load is only between 1 and 4, which is quite ok for a quad core. 
It only happens when dovecot starts generating errors, and pop3, imap and http 
get stuck.  It went up to 200, and I was still able to stop web and mail 
daemons, then restart them, and everything was back to normal.
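
To put those figures in proportion, here is a minimal sketch of the load-per-core arithmetic. It assumes the four cores of the quad core mentioned above, and it treats roughly 1.0 per core as "fully busy", which is a common rule of thumb and not something established in this thread. The function name is made up for illustration.

```python
def load_per_core(load_average: float, cores: int) -> float:
    """Return a load average divided by the number of CPU cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_average / cores


# Load averages from the top snapshot quoted in this thread.
# CORES = 4 is an assumption taken from "quad core" above.
CORES = 4
snapshot = {"1 min": 55.04, "5 min": 29.13, "15 min": 14.55}
per_core = {window: load_per_core(value, CORES) for window, value in snapshot.items()}

for window, value in per_core.items():
    print(f"{window}: load per core {value:.2f}")
```

On these numbers the one-minute load works out to roughly 14 per core during the spike, against 0.25 to 1.0 per core for the usual load of 1 to 4.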



Re: [Dovecot] POP3 error

2011-03-08 Thread Thierry de Montaudry
On 07 Mar 2011, at 19:15, Timo Sirainen wrote:

> On Mon, 2011-03-07 at 19:03 +0200, Thierry de Montaudry wrote:
>>> Mar  7 11:19:51 xxx dovecot: pop3-login: Error: net_connect_unix(pop3) 
>>> failed: Resource temporarily unavailable
>>> ..
>> As it is happening at least once a day, is there anything I can do to trace 
>> it? and whatever I'll do, will it slow down those machines?
> 
> Set verbose_proctitle=yes (won't slow down) and get list of all Dovecot
> processes when it happens. And check how much user and system CPU it's
> using and what the load is.
> 
Got the same problem this morning. Here is the CPU usage and ps aux for 
dovecot, plus the different errors I could pick up in the log; most of them are 
repeated a couple of times.
I suspect it is a problem with system resources, but can't find any message to 
tell me what. Mail is stored on 17 NFS servers (CentOS), plus 3 servers for 
indexes only.
CPU load is very high, but mainly from httpd running our webmail interface, 
which uses the local imap server.

Mar  8 11:08:02 xxx dovecot: imap-login: Error: net_connect_unix(imap) failed: 
Resource temporarily unavailable
Mar  8 11:08:02 xxx dovecot: pop3-login: Error: net_connect_unix(pop3) failed: 
Resource temporarily unavailable
Mar  8 11:08:52 xxx dovecot: pop3-login: Error: master(pop3): Auth request 
timed out (received 0/12 bytes)
Mar  8 11:12:54 xxx dovecot: pop3(xyz@wm): Error: 
net_connect_unix(/var/run/dovecot/dict) failed: Connection refused
Mar  8 11:12:55 xxx dovecot: pop3-login: Error: read(pop3) failed: Connection 
reset by peer
Mar  8 11:12:56 xxx dovecot: pop3-login: Error: net_connect_unix(pop3) failed: 
Connection refused
Mar  8 11:12:59 xxx dovecot: pop3(xyz@wm): Error: 
net_connect_unix(/var/run/dovecot/dict) failed: Connection refused


top - 11:10:14 up 14 days, 12:04,  2 users,  load average: 55.04, 29.13, 14.55
Tasks: 474 total,  60 running, 414 sleeping,   0 stopped,   0 zombie
Cpu(s): 99.6%us,  0.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:  16439812k total, 16353268k used,    86544k free,    33268k buffers
Swap:  4192956k total,  140k used,  4192816k free,  8228744k cached


vmail      313  0.0  0.0  24660  2260 ?        S    10:47   0:00 dovecot/imap 
[gabs002@wm 127.0.0.1 APPEND]
vmail     1376  0.0  0.0  24432  2136 ?        S    10:48   0:00 dovecot/imap 
[phillippapi@wm 127.0.0.1 LOGOUT UID COPY]
vmail     1738  0.0  0.0  24432  2196 ?        S    10:49   0:00 dovecot/imap 
[herlo@wm 127.0.0.1 APPEND]
vmail     2053  0.0  0.0  24588  2188 ?        S    10:49   0:00 dovecot/imap 
[kifi@wm 127.0.0.1 APPEND]
vmail     3224  0.0  0.0  24592  2192 ?        S    10:50   0:00 dovecot/imap 
[briankajengo@wm 127.0.0.1 APPEND]
vmail     3267  0.0  0.0  24664  2268 ?        S    10:50   0:00 dovecot/imap 
[gabs002@wm 127.0.0.1 APPEND]
vmail     4023  0.0  0.0  24572  2168 ?        S    10:50   0:00 dovecot/imap 
[mmakutloano@hm 127.0.0.1 APPEND]
vmail     4025  0.0  0.0  24592  2188 ?        S    10:50   0:00 dovecot/imap 
[buhlungum@wm 127.0.0.1 APPEND]
vmail     4066  0.0  0.0  24424  2192 ?        S    10:50   0:00 dovecot/imap 
[mowee@xm 127.0.0.1 APPEND]
vmail     4181  0.0  0.0  24648  2212 ?        S    10:50   0:00 dovecot/imap 
[sophieh@wm 127.0.0.1 APPEND]
vmail     4399  0.0  0.0  24620  2224 ?        S    10:51   0:00 dovecot/imap 
[tcc.dbn@wm 127.0.0.1 APPEND]
vmail     4866  0.0  0.0  24592  2196 ?        S    10:51   0:00 dovecot/imap 
[kifi@wm 127.0.0.1 APPEND]
vmail     5049  0.0  0.0  24584  2228 ?        S    10:51   0:00 dovecot/imap 
[malinga@sm 127.0.0.1 APPEND]
vmail     5961  0.0  0.0  24588  2192 ?        S    10:52   0:00 dovecot/imap 
[briankajengo@wm 127.0.0.1 APPEND]
vmail     6819  0.0  0.0  24624  2268 ?        S    10:52   0:00 dovecot/imap 
[ferns2004@wm 127.0.0.1 APPEND]
vmail     6832  0.0  0.0  24636  2308 ?        S    10:52   0:00 dovecot/imap 
[lib@mm 127.0.0.1 APPEND]
vmail     6854  0.0  0.0  24496  2216 ?        S    10:52   0:00 dovecot/imap 
[amawele@wm 127.0.0.1 UID]
vmail     7164  0.0  0.0  24620  2224 ?        S    10:53   0:00 dovecot/imap 
[tcc.dbn@wm 127.0.0.1 APPEND]
vmail     8441  0.0  0.0  24440  2124 ?        S    10:54   0:00 dovecot/imap 
[apheeha@wm 127.0.0.1 APPEND]
root      8736  0.0  0.0  61736  2940 ?        S    07:05   0:00 dovecot/auth 
[0 wait, 0 passdb, 0 userdb]
vmail     9559  0.0  0.0  24588  2192 ?        S    10:54   0:00 dovecot/imap 
[lib@mm 127.0.0.1 APPEND]
vmail     9716  0.0  0.0  24628  2224 ?        S    10:55   0:00 dovecot/imap 
[buhlungum@wm 127.0.0.1 APPEND]
vmail     9939  0.0  0.0  24624  2224 ?        S    10:55   0:00 dovecot/imap 
[tcc.dbn@wm 127.0.0.1 APPEND]
vmail    12112  0.0  0.0  24592  2200 ?        S    10:56   0:00 dovecot/imap 
[lib@mm 127.0.0.1 APPEND]
vmail    12558  0.0  0.0  24592  2196 ?        S    10:57   0:00 dovecot/imap 
[kifi@wm 127.0.0.1 APPEND]
vmail    13437  0.0  0.0  2  2128 ?        S    10:57   0:00 dovecot/imap 
[pmagqibelo@ut 127

Re: [Dovecot] POP3 error

2011-03-07 Thread Timo Sirainen
On Mon, 2011-03-07 at 19:03 +0200, Thierry de Montaudry wrote:
> > Mar  7 11:19:51 xxx dovecot: pop3-login: Error: net_connect_unix(pop3) 
> > failed: Resource temporarily unavailable
> > ..
> As it is happening at least once a day, is there anything I can do to trace 
> it? and whatever I'll do, will it slow down those machines?

Set verbose_proctitle=yes (won't slow down) and get list of all Dovecot
processes when it happens. And check how much user and system CPU it's
using and what the load is.
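
Once verbose_proctitle=yes is set, the process list can be summarised rather than read by eye. The sketch below is only an illustration: it assumes the process titles look like the `dovecot/imap [user ip COMMAND]` entries posted elsewhere in this thread, the sample users are made up, and the helper name `summarise` is hypothetical.

```python
import re
from collections import Counter

# Assumed title layout, taken from the ps listing in this thread:
#   dovecot/imap [user@domain 127.0.0.1 APPEND]
# \s+ before "[" tolerates the line wrap seen in the mailed output.
PROC_RE = re.compile(
    r"dovecot/(?P<service>imap|pop3)\s+"
    r"\[(?P<user>\S+) (?P<ip>\S+)(?: (?P<state>[^\]]*))?\]"
)


def summarise(ps_output: str):
    """Count imap/pop3 processes per (service, command) and per user."""
    by_command = Counter()
    by_user = Counter()
    for match in PROC_RE.finditer(ps_output):
        state = match.group("state") or "(none)"
        by_command[(match.group("service"), state)] += 1
        by_user[match.group("user")] += 1
    return by_command, by_user


# Made-up sample in the same shape as `ps aux | grep dovecot` output.
SAMPLE = """\
vmail  313 0.0 0.0 24660 2260 ? S 10:47 0:00 dovecot/imap [user1@wm 127.0.0.1 APPEND]
vmail 3267 0.0 0.0 24664 2268 ? S 10:50 0:00 dovecot/imap [user1@wm 127.0.0.1 APPEND]
vmail 6854 0.0 0.0 24496 2216 ? S 10:52 0:00 dovecot/imap [user2@wm 127.0.0.1 UID]
root  8736 0.0 0.0 61736 2940 ? S 07:05 0:00 dovecot/auth [0 wait, 0 passdb, 0 userdb]
"""

by_command, by_user = summarise(SAMPLE)
for key, count in by_command.most_common():
    print(key, count)
```

The auth process is skipped on purpose, since its title has a different layout.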




Re: [Dovecot] POP3 error

2011-03-07 Thread Thierry de Montaudry

On 07 Mar 2011, at 17:17, Timo Sirainen wrote:

> On Mon, 2011-03-07 at 13:40 +0200, Thierry de Montaudry wrote:
>>>>> Mar  7 11:19:51 xxx dovecot: pop3-login: Error: net_connect_unix(pop3) 
>>>>> failed: Resource temporarily unavailable
>>>>> ..
>>> Do you see any warning messages in logs containing "client connections are 
>>> being dropped"?
>>> 
>> I did not see it on any machines. 
> 
> Hmh. Could you upgrade to 2.0.11? It splits the two causes of "Resource
> temporarily unavailable" errors to two separate error messages. It would
> help figuring out the problem.
> 
>> But for this specific one, I got the following after those errors, before 
>> restarting dovecot:
>> 
>> Mar  7 11:20:09 web4 dovecot: pop3(x@y): Error: 
>> net_connect_unix(/var/run/dovecot/dict) failed: Connection refused
>> Mar  7 11:20:11 web4 dovecot: pop3-login: Error: read(pop3) failed: 
>> Connection reset by peer
>> Mar  7 11:20:11 web4 dovecot: pop3-login: Error: read(pop3) failed: 
>> Connection reset by peer
>> Mar  7 11:20:11 web4 dovecot: pop3-login: Error: read(pop3) failed: 
>> Connection reset by peer
>> Mar  7 11:20:11 web4 dovecot: pop3-login: Error: read(pop3) failed: 
>> Connection reset by peer
> 
> This looks like processes started dying.
> 
As it is happening at least once a day, is there anything I can do to trace it? 
And whatever I do, will it slow down those machines?




Re: [Dovecot] POP3 error

2011-03-07 Thread Timo Sirainen
On Mon, 2011-03-07 at 13:40 +0200, Thierry de Montaudry wrote:
> >>> Mar  7 11:19:51 xxx dovecot: pop3-login: Error: net_connect_unix(pop3) 
> >>> failed: Resource temporarily unavailable
> >>> ..
> > Do you see any warning messages in logs containing "client connections are 
> > being dropped"?
> > 
> I did not see it on any machines. 

Hmh. Could you upgrade to 2.0.11? It splits the two causes of "Resource
temporarily unavailable" errors to two separate error messages. It would
help figuring out the problem.

> But for this specific one, I got the following after those errors, before 
> restarting dovecot:
> 
> Mar  7 11:20:09 web4 dovecot: pop3(x@y): Error: 
> net_connect_unix(/var/run/dovecot/dict) failed: Connection refused
> Mar  7 11:20:11 web4 dovecot: pop3-login: Error: read(pop3) failed: 
> Connection reset by peer
> Mar  7 11:20:11 web4 dovecot: pop3-login: Error: read(pop3) failed: 
> Connection reset by peer
> Mar  7 11:20:11 web4 dovecot: pop3-login: Error: read(pop3) failed: 
> Connection reset by peer
> Mar  7 11:20:11 web4 dovecot: pop3-login: Error: read(pop3) failed: 
> Connection reset by peer

This looks like processes started dying.




Re: [Dovecot] POP3 error

2011-03-07 Thread Thierry de Montaudry

On 07 Mar 2011, at 12:01, Timo Sirainen wrote:

> On 7.3.2011, at 11.51, Thierry de Montaudry wrote:
> 
>> Since we upgraded to 2.0.9 (from 1.10 stock CentOS release), we are getting 
>> some errors with pop3. When the machines get busy, now and then it starts 
>> with the following:
>>> Mar  7 11:19:51 xxx dovecot: pop3-login: Error: net_connect_unix(pop3) 
>>> failed: Resource temporarily unavailable
>> And it generates hundreds of those before the machine dies, with the web 
>> server getting stuck as well on imap sessions, even though there are no imap 
>> error messages.
> 
> Do you see any warning messages in logs containing "client connections are 
> being dropped"?
> 
I did not see it on any machines. But for this specific one, I got the 
following after those errors, before restarting dovecot:

Mar  7 11:20:09 web4 dovecot: pop3(x@y): Error: 
net_connect_unix(/var/run/dovecot/dict) failed: Connection refused
Mar  7 11:20:11 web4 dovecot: pop3-login: Error: read(pop3) failed: Connection 
reset by peer
Mar  7 11:20:11 web4 dovecot: pop3-login: Error: read(pop3) failed: Connection 
reset by peer
Mar  7 11:20:11 web4 dovecot: pop3-login: Error: read(pop3) failed: Connection 
reset by peer
Mar  7 11:20:11 web4 dovecot: pop3-login: Error: read(pop3) failed: Connection 
reset by peer
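
Since the same error repeats many times, a tally per service and message makes the pattern easier to see. This is a minimal sketch that assumes the `dovecot: <service>: Error: <message>` layout of the lines above, with each log entry on a single unwrapped line. The function name `tally_errors` is made up for illustration.

```python
import re
from collections import Counter

# "dovecot: pop3-login: Error: ..." or "dovecot: pop3(user): Error: ..."
LINE_RE = re.compile(
    r"dovecot: (?P<service>[\w-]+)(?:\([^)]*\))?: Error: (?P<message>.+)"
)


def tally_errors(lines):
    """Count Dovecot error lines grouped by (service, message)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match.group("service"), match.group("message").strip())] += 1
    return counts


# The log entries quoted above, joined back onto one line each.
LOG = [
    "Mar  7 11:20:09 web4 dovecot: pop3(x@y): Error: "
    "net_connect_unix(/var/run/dovecot/dict) failed: Connection refused",
] + [
    "Mar  7 11:20:11 web4 dovecot: pop3-login: Error: read(pop3) failed: "
    "Connection reset by peer"
] * 4

counts = tally_errors(LOG)
for (service, message), count in counts.most_common():
    print(f"{count:4d}  {service}: {message}")
```

On a real system the list would come from the syslog file instead of a literal; the path is not given in this thread, so it is left out here.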



Re: [Dovecot] POP3 error

2011-03-07 Thread Timo Sirainen
On 7.3.2011, at 11.51, Thierry de Montaudry wrote:

> Since we upgraded to 2.0.9 (from 1.10 stock CentOS release), we are getting 
> some errors with pop3. When the machines get busy, now and then it starts with 
> the following:
>> Mar  7 11:19:51 xxx dovecot: pop3-login: Error: net_connect_unix(pop3) 
>> failed: Resource temporarily unavailable
> And it generates hundreds of those before the machine dies, with the web 
> server getting stuck as well on imap sessions, even though there are no imap 
> error messages.

Do you see any warning messages in logs containing "client connections are 
being dropped"?



Re: [Dovecot] POP3 Error

2009-03-09 Thread Jeff Grossman

On 3/9/2009 8:07 PM, Mark Sapiro wrote:
> Jeff Grossman wrote:
>
>> I just looked over my logs and noticed the following error:
>>
>> Mar  9 19:07:34 apple dovecot: Panic: POP3(april): Trying to allocate 0
>> bytes
>> Mar  9 19:07:34 apple dovecot: POP3(april): Raw backtrace: pop3
>> [0x492952] -> pop3 [0x4929d3] -> pop3 [0x4920e6] -> pop3 [0x49cb8d] ->
>> pop3(client_create+0x452)
>>   [0x41aef2] -> pop3(main+0x393) [0x41c9e3] ->
>> /lib/libc.so.6(__libc_start_main+0xe6) [0x7f6bb46e11a6] -> pop3 [0x41a1b9]
>> Mar  9 19:07:34 apple dovecot: child 9616 (pop3) killed with signal 6
>
> It's a known problem. The fix is at
> .
>

Thank you.  Applying the fix right now.

Jeff


Re: [Dovecot] POP3 Error

2009-03-09 Thread Mark Sapiro
Jeff Grossman wrote:

> I just looked over my logs and noticed the following error:
> 
> Mar  9 19:07:34 apple dovecot: Panic: POP3(april): Trying to allocate 0 
> bytes
> Mar  9 19:07:34 apple dovecot: POP3(april): Raw backtrace: pop3 
> [0x492952] -> pop3 [0x4929d3] -> pop3 [0x4920e6] -> pop3 [0x49cb8d] -> 
> pop3(client_create+0x452)
>   [0x41aef2] -> pop3(main+0x393) [0x41c9e3] -> 
> /lib/libc.so.6(__libc_start_main+0xe6) [0x7f6bb46e11a6] -> pop3 [0x41a1b9]
> Mar  9 19:07:34 apple dovecot: child 9616 (pop3) killed with signal 6

It's a known problem. The fix is at
.

-- 
Mark Sapiro                           The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan