Re: [Dovecot] Very High Load on Dovecot 2 and Errors in mail.err.

2012-06-20 Thread Urban Loesch

Hi,

Yesterday I disabled inotify as mentioned in the previous post,
and that works for me as well. Thanks to all for the hint.

On 20.06.2012 08:35, Jesper Dahl Nyerup wrote:
> On Jun 11  23:37, Jesper Dahl Nyerup wrote:
>> We're still chasing the root cause in the kernel or the VServer patch
>> set. We'll of course make sure to post our findings here, and I'd very
>> much appreciate to hear about other people's progress.
>
> We still haven't found a solution, but here's what we've got thus far:
>
>  - The issue is not VServer specific. We're able to reproduce it on
>    recent vanilla kernels.
>
>  - The issue has a strong correlation with the number of processor cores
>    in the machine. The behavior is impossible to provoke on a dual core
>    workstation, but is very widespread on 16 or 24 core machines.


For the record:
I see the problem on 2 different machines with different CPUs:
- PE2950 with 2x Intel Xeon X5450 3.00GHz (8 CPUs); the problem occurs less
  often here than on the PER610
- PER610 with 2x Intel Xeon X5650 2.67GHz (24 CPUs)



> One of my colleagues has written a snippet of code that reproduces and
> exposes the problem, and we've sent this to the Inotify maintainers and
> the kernel mailing list, hoping that someone more familiar with the code
> will be quicker to figure out what is broken.
>
> If anyone's interested - either in following the issue or the code
> snippet that reproduces it - here's the post:
> http://thread.gmane.org/gmane.linux.kernel/1315430


I can confirm what you described on the kernel mailing list: the higher the
number of CPUs, the worse it gets.



> As this is clearly a kernel issue, we're going to try to keep the
> discussion there, and I'll probably not follow up here, until the issue
> has been resolved.
>
> Jesper.


Thanks
Urban


Re: [Dovecot] Very High Load on Dovecot 2 and Errors in mail.err.

2012-06-19 Thread Jesper Dahl Nyerup
On Jun 11  23:37, Jesper Dahl Nyerup wrote:
> We're still chasing the root cause in the kernel or the VServer patch
> set. We'll of course make sure to post our findings here, and I'd very
> much appreciate to hear about other people's progress.

We still haven't found a solution, but here's what we've got thus far:

 - The issue is not VServer specific. We're able to reproduce it on
   recent vanilla kernels.

 - The issue has a strong correlation with the number of processor cores
   in the machine. The behavior is impossible to provoke on a dual core
   workstation, but is very widespread on 16 or 24 core machines.

One of my colleagues has written a snippet of code that reproduces and
exposes the problem, and we've sent this to the Inotify maintainers and
the kernel mailing list, hoping that someone more familiar with the code
will be quicker to figure out what is broken.

If anyone's interested - either in following the issue or the code
snippet that reproduces it - here's the post:
http://thread.gmane.org/gmane.linux.kernel/1315430

As this is clearly a kernel issue, we're going to try to keep the
discussion there, and I'll probably not follow up here, until the issue
has been resolved.

Jesper.




Re: [Dovecot] Very High Load on Dovecot 2 and Errors in mail.err.

2012-06-11 Thread Timo Sirainen
On 12.6.2012, at 0.37, Jesper Dahl Nyerup wrote:

>> Yeah. Looks like a kernel bug. You could try if it goes away by disabling 
>> inotify in Dovecot. Either recompile with "configure --with-notify=none" or 
>> maybe you can disable inotify globally with:
>> 
>> echo 0 > /proc/sys/fs/inotify/max_user_watches
>> echo 0 > /proc/sys/fs/inotify/max_user_instances
> 
> I can confirm that this removes the symptoms, and that it doesn't affect
> the service. Obviously IDLEing users are now only notified upon polling
> of the file system, but the I/O overhead of doing this seems minimal.

It actually doesn't increase I/O overhead at all. Dovecot always does polling, 
even with inotify, since inotify doesn't necessarily work with shared 
filesystems (e.g. NFS). The main difference is that users don't get immediate 
notifications of new mails now, but have to wait for 
mailbox_idle_check_interval.



Re: [Dovecot] Very High Load on Dovecot 2 and Errors in mail.err.

2012-06-11 Thread Jesper Dahl Nyerup
On Jun 11  14:51, Timo Sirainen wrote:
> On 11.6.2012, at 11.09, Jesper Dahl Nyerup wrote:
>
> > In short, as far as we can tell, all the processes in D state appear to
> > be waiting to close the file handle they got from their inotify_init(),
> > and eventually all these close()s go through almost simultaneously.
> 
> Yeah. Looks like a kernel bug. You could try if it goes away by disabling 
> inotify in Dovecot. Either recompile with "configure --with-notify=none" or 
> maybe you can disable inotify globally with:
> 
> echo 0 > /proc/sys/fs/inotify/max_user_watches
> echo 0 > /proc/sys/fs/inotify/max_user_instances

I can confirm that this removes the symptoms, and that it doesn't affect
the service. Obviously IDLEing users are now only notified upon polling
of the file system, but the I/O overhead of doing this seems minimal.

It may be important to note that even though the load on our servers
surpasses 2000, both Dovecot and the server as a whole remain responsive
and keep servicing requests, up until the point where Dovecot reaches its
configured maximum number of child processes.

We're still chasing the root cause in the kernel or the VServer patch
set. We'll of course make sure to post our findings here, and I'd very
much appreciate hearing about other people's progress.

Jesper.




Re: [Dovecot] Very High Load on Dovecot 2 and Errors in mail.err.

2012-06-11 Thread Timo Sirainen
On 11.6.2012, at 11.09, Jesper Dahl Nyerup wrote:

> Stracing the processes in D state from before they hang has just
> revealed something interesting, however, pointing to an issue with
> inotify rather than epoll.
> 
> [snip]
> [...]
> 15414 23:27:36 inotify_init()   = 12 <0.000024>
> [...]
> 15414 23:27:36 close(12 <unfinished ...>
> 15414 23:28:51 <... close resumed> )= 0 <74.593917>
> 15414 23:28:51 close(9 <unfinished ...>
> 15414 23:28:51 <... close resumed> )= 0 <0.000080>
> 15414 23:28:51 exit_group(0)= ?
> [/snip]
> 
> In short, as far as we can tell, all the processes in D state appear to
> be waiting to close the file handle they got from their inotify_init(),
> and eventually all these close()s go through almost simultaneously.

Yeah. Looks like a kernel bug. You could try if it goes away by disabling 
inotify in Dovecot. Either recompile with "configure --with-notify=none" or 
maybe you can disable inotify globally with:

echo 0 > /proc/sys/fs/inotify/max_user_watches
echo 0 > /proc/sys/fs/inotify/max_user_instances
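
Before and after applying the two echo commands above, it can help to check what the limits are currently set to. The following is only a minimal sketch in Python, not anything shipped with Dovecot; the function names are made up for illustration, and it assumes the two sysctl files exist under /proc/sys/fs/inotify as named above.

```python
from pathlib import Path

# Directory holding the inotify sysctls named in the message above.
INOTIFY_SYSCTL_DIR = Path("/proc/sys/fs/inotify")
LIMIT_NAMES = ("max_user_watches", "max_user_instances")


def read_inotify_limits(sysctl_dir=INOTIFY_SYSCTL_DIR):
    """Return {name: int or None} for the two inotify limits.

    None means the file could not be read or parsed (for example on a
    kernel without inotify, or on a non-Linux system).
    """
    limits = {}
    for name in LIMIT_NAMES:
        try:
            limits[name] = int((Path(sysctl_dir) / name).read_text().strip())
        except (OSError, ValueError):
            limits[name] = None
    return limits


def inotify_disabled(limits):
    """True only if every limit was readable and is set to 0."""
    return all(value == 0 for value in limits.values())


if __name__ == "__main__":
    current = read_inotify_limits()
    for name, value in current.items():
        print(f"{name} = {value}")
    print("inotify limits zeroed:", inotify_disabled(current))
```

Values written with echo do not survive a reboot, so a check like this is mostly useful to confirm which state a given server is in while testing.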


Re: [Dovecot] Very High Load on Dovecot 2 and Errors in mail.err.

2012-06-11 Thread Jesper Dahl Nyerup
On May 20  16:29, Urban Loesch wrote:
> I checked my kernel and the patch mentioned in
> https://bugzilla.redhat.com/show_bug.cgi?id=681578
> 
> (comment 31) is not applied. It comes in version 3.0.30 and 3.2.17.
> 
> I will see what tomorrow happens under more load.
> If I have the problem again, I give 3.2.17 a chance.

We've seen similar behavior on a similar system with a similar workload.

We've tried a 3.0.31 - after the epoll patch was applied upstream -
without seeing a difference. Right now we're running a 3.3.7 with
vs2.3.3.4, and this has reduced the problem quite a bit, but not
eliminated it completely.

Stracing the processes in D state from before they hang has just
revealed something interesting, however, pointing to an issue with
inotify rather than epoll.

[snip]
[...]
15414 23:27:36 inotify_init()   = 12 <0.000024>
[...]
15414 23:27:36 close(12 <unfinished ...>
15414 23:28:51 <... close resumed> )= 0 <74.593917>
15414 23:28:51 close(9 <unfinished ...>
15414 23:28:51 <... close resumed> )= 0 <0.000080>
15414 23:28:51 exit_group(0)= ?
[/snip]

In short, as far as we can tell, all the processes in D state appear to
be waiting to close the file handle they got from their inotify_init(),
and eventually all these close()s go through almost simultaneously.
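
For anyone who wants to look for the same pattern in their own traces, here is a rough Python sketch that picks out slow system calls from strace output. It assumes the trace was taken with per-call timing, so each completed call ends in a duration like <74.593917>; the function name and the one-second threshold are arbitrary choices for illustration, not part of any tool.

```python
import re

# Duration that strace appends to a completed call, e.g. "<74.593917>".
DURATION_RE = re.compile(r"<(\d+\.\d+)>\s*$")


def slow_syscalls(lines, threshold=1.0):
    """Yield (pid, seconds, line) for calls that took longer than threshold.

    The pid is taken from the first column, as in traces that follow
    several processes at once.
    """
    for line in lines:
        match = DURATION_RE.search(line)
        if match is None:
            continue  # unfinished calls and exit lines carry no duration
        seconds = float(match.group(1))
        if seconds > threshold:
            yield line.split()[0], seconds, line.strip()


# Two lines taken from the trace above.
SAMPLE = [
    "15414 23:28:51 <... close resumed> )= 0 <74.593917>",
    "15414 23:28:51 exit_group(0)= ?",
]

if __name__ == "__main__":
    for pid, seconds, line in slow_syscalls(SAMPLE):
        print(f"pid {pid}: {seconds:.1f}s  {line}")
```

Run over a full trace, this only reports the resumed close() calls that blocked, which makes it easier to see whether they all return at about the same moment.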

Right now we're trawling for locking issues related to inotify, focusing
mainly on the VServer patch set. I would very much appreciate
updates on your - or anyone else's - findings and progress.

Yours,

Jesper Nyerup.




Re: [Dovecot] Very High Load on Dovecot 2 and Errors in mail.err.

2012-05-20 Thread Urban Loesch

Hi Javier,

thanks for your help.

On 20.05.2012 13:58, Javier Miguel Rodríguez wrote:

> I know that you are NOT running RHEL / CentOS, but this problem with
> more than 1000 child processes bit us hard. Read this Red Hat kernel
> Bugzilla entry (Timo has comments inside):
>
> https://bugzilla.redhat.com/show_bug.cgi?id=681578
>
> Maybe you are hitting the same limit?



Yes, maybe.
The only strange thing is that I don't see any errors in my Dovecot logs,
such as "Panic: epoll_ctl" or anything similar.

I checked my kernel, and the patch mentioned in
https://bugzilla.redhat.com/show_bug.cgi?id=681578
(comment 31) is not applied. It is included in versions 3.0.30 and 3.2.17.

I will see what happens tomorrow under more load.
If the problem occurs again, I will give 3.2.17 a chance.

thanks
Urban



> Regards
>
> Javier
>
> On 20/05/2012 11:59, Urban Loesch wrote:
>
>> On 19.05.2012 21:05, Timo Sirainen wrote:
>>
>>> On Wed, 2012-05-16 at 08:59 +0200, Urban Loesch wrote:
>>>
>>>> The Server was running about 1 year without any problems. 15Min Load was
>>>> between 0,5 and max 8. No high IOWAIT. CPU Idletime about 98%.
>>> ..
>>>> # iostat -k
>>>> Linux 3.0.28-vs2.3.2.3-rol-em64t (mailstore4) 16.05.2012 _x86_64_ (24 CPU)
>>>
>>> Did you change the kernel just before it broke? I'd try another version.
>>
>> The first time it broke was with kernel 2.6.38.8-vs2.3.0.37-rc17.
>> Then I tried 3.0.28 and it broke again.
>> On Friday evening I disabled the cgroup feature completely, and so far
>> it seems to work normally.
>> But this could be because it is the weekend and there are not many
>> connections active right now. So I have to wait until Monday. If it
>> happens again I will try version 3.2.17.
>>
>> On the other hand, it could be that the server is overloaded, because
>> this problem happens only when more than 1000 tasks are active. That
>> sounds strange to me, because it has been working without problems for
>> a year and we made no changes. Also, there were almost always more than
>> 1000 tasks active over the last year and we had no problems.
>>
>> thanks
>> Urban




Re: [Dovecot] Very High Load on Dovecot 2 and Errors in mail.err.

2012-05-20 Thread Javier Miguel Rodríguez
 

I know that you are NOT running RHEL / CentOS, but this problem with
more than 1000 child processes bit us hard. Read this Red Hat kernel
Bugzilla entry (Timo has comments inside):

https://bugzilla.redhat.com/show_bug.cgi?id=681578

Maybe you are hitting the same limit?

Regards 

Javier 

On 20/05/2012 11:59, Urban Loesch wrote:

> On 19.05.2012 21:05, Timo Sirainen wrote:
>
>> On Wed, 2012-05-16 at 08:59 +0200, Urban Loesch wrote:
>>
>>> The Server was running about 1 year without any problems. 15Min Load was
>>> between 0,5 and max 8. No high IOWAIT. CPU Idletime about 98%.
>> ..
>>> # iostat -k
>>> Linux 3.0.28-vs2.3.2.3-rol-em64t (mailstore4) 16.05.2012 _x86_64_ (24 CPU)
>>
>> Did you change the kernel just before it broke? I'd try another version.
>
> The first time it broke was with kernel 2.6.38.8-vs2.3.0.37-rc17.
> Then I tried 3.0.28 and it broke again.
> On Friday evening I disabled the cgroup feature completely, and so far
> it seems to work normally.
> But this could be because it is the weekend and there are not many
> connections active right now. So I have to wait until Monday. If it
> happens again I will try version 3.2.17.
>
> On the other hand, it could be that the server is overloaded, because
> this problem happens only when more than 1000 tasks are active. That
> sounds strange to me, because it has been working without problems for
> a year and we made no changes. Also, there were almost always more than
> 1000 tasks active over the last year and we had no problems.
>
> thanks
> Urban

 

Re: [Dovecot] Very High Load on Dovecot 2 and Errors in mail.err.

2012-05-20 Thread Urban Loesch



On 19.05.2012 21:05, Timo Sirainen wrote:
> On Wed, 2012-05-16 at 08:59 +0200, Urban Loesch wrote:
>
>> The Server was running about 1 year without any problems. 15Min Load was
>> between 0,5 and max 8.
>> No high IOWAIT. CPU Idletime about 98%.
> ..
>> #  iostat -k
>> Linux 3.0.28-vs2.3.2.3-rol-em64t (mailstore4)   16.05.2012  _x86_64_  (24 CPU)
>
> Did you change the kernel just before it broke? I'd try another version.





The first time it broke was with kernel 2.6.38.8-vs2.3.0.37-rc17.
Then I tried 3.0.28 and it broke again.
On Friday evening I disabled the cgroup feature completely, and so far
it seems to work normally.
But this could be because it is the weekend and there are not many
connections active right now. So I have to wait until Monday. If it
happens again I will try version 3.2.17.

On the other hand, it could be that the server is overloaded, because
this problem happens only when more than 1000 tasks are active. That
sounds strange to me, because it has been working without problems for
a year and we made no changes. Also, there were almost always more than
1000 tasks active over the last year and we had no problems.


thanks
Urban


Re: [Dovecot] Very High Load on Dovecot 2 and Errors in mail.err.

2012-05-19 Thread Timo Sirainen
On Wed, 2012-05-16 at 08:59 +0200, Urban Loesch wrote:

> The Server was running about 1 year without any problems. 15Min Load was 
> between 0,5 and max 8.
> No high IOWAIT. CPU Idletime about 98%.
..
> #  iostat -k
> Linux 3.0.28-vs2.3.2.3-rol-em64t (mailstore4)   16.05.2012  _x86_64_  (24 CPU)

Did you change the kernel just before it broke? I'd try another version.




[Dovecot] Very High Load on Dovecot 2 and Errors in mail.err.

2012-05-15 Thread Urban Loesch

Hi,

I have a Dell PE R610 (32 GB RAM, 2x six-core CPUs and about 1.4 TB RAID 10)
serving 20,000 mail accounts behind 2 Dovecot IMAP/POP3 proxies on Debian
Lenny.

The server ran for about 1 year without any problems. The 15-minute load was
between 0.5 and at most 8.
No high I/O wait. CPU idle time about 98%.

But since yesterday morning the system load on the server has risen
above 500, which I think is very high. The strange thing is that there was
no I/O wait and the CPU idle time stayed at about 98% the whole time.

The total number of IMAP sessions is about 300 to 600.

Current vmstat and iostat:

#  vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0      0 27040576 635460 3495600    0    0    14    15    2   31  0  0 99  0
 0  0      0 27040320 635468 3496064    0    0   804   455 1383 1281  0  0 98  1
 0  0      0 27047016 634964 3489312    0    0   216   156 1841 1292  1  0 98  1
 0  0      0 27047140 635028 3489012    0    0   240   619 1629 1658  0  0 96  3
 0  0      0 27047264 635120 3489172    0    0    92     0 1069  881  0  0 100  0
 0  0      0 27047388 635120 3489256    0    0     0    46 1404 1265  0  0 100  0
 0  0      0 27047512 635136 3489312    0    0   128   471 1539 1354  0  0 99  1
 0  0      0 27047388 635156 3489384    0    0    12   360 1108  952  0  0 99  0
 0  0      0 27047516 635160 3489408    0    0   104    12  893  677  0  0 99  0
^C

#  iostat -k
Linux 3.0.28-vs2.3.2.3-rol-em64t (mailstore4)   16.05.2012  _x86_64_  (24 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,08    0,00    0,09    0,34    0,00   99,49

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              67,35       636,28      1080,48   31337690   53215361
dm-0             73,53       591,55       893,93   29134837   44027496
dm-1             15,06        42,40       181,79    2088493    8953661
drbd1            51,21       316,06       277,94   15566325   13689004
drbd0             9,22        23,50        80,38    1157641    3958796

Current Load:
08:55:44 up 14:04,  3 users,  load average: 19,64, 14,47, 10,49
 Load is increasing.


The only strange thing I can see is this:
# ps -ostat,pid,time,wchan='WCHAN-------------',cmd ax  |grep D
STAT   PID     TIME WCHAN------------- CMD
D    18713 00:00:00 synchronize_srcu   dovecot/imap
D    18736 00:00:00 synchronize_srcu   dovecot/imap
D    18775 00:00:05 synchronize_srcu   dovecot/imap
D    20330 00:00:00 synchronize_srcu   dovecot/imap
D    20357 00:00:00 synchronize_srcu   dovecot/imap
D    20422 00:00:00 synchronize_srcu   dovecot/imap
D    20687 00:00:00 synchronize_srcu   dovecot/imap
S+   20913 00:00:00 pipe_wait          grep D

There are many imap processes in D state, and their number is increasing.
I think they are blocked and waiting for some event,
but I have no idea which event they are waiting for.
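
To watch how the number of blocked processes develops, the ps listing can be summarised per wait channel. The snippet below is only a sketch in Python: the function name is made up, and it assumes input in the same column order as the ps call above (stat, pid, time, wchan, cmd) with one header line.

```python
from collections import Counter


def count_d_state_by_wchan(ps_output):
    """Count processes in uninterruptible sleep (STAT starting with "D")
    per wait channel, given `ps -o stat,pid,time,wchan,cmd` style text."""
    counts = Counter()
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 4)
        if len(fields) < 5:
            continue  # blank or truncated line
        stat, _pid, _time, wchan, _cmd = fields
        if stat.startswith("D"):
            counts[wchan] += 1
    return counts


# Shortened sample modelled on the listing above.
SAMPLE = """\
STAT   PID     TIME WCHAN              CMD
D    18713 00:00:00 synchronize_srcu   dovecot/imap
D    18736 00:00:00 synchronize_srcu   dovecot/imap
D    18775 00:00:05 synchronize_srcu   dovecot/imap
S+   20913 00:00:00 pipe_wait          grep D
"""

if __name__ == "__main__":
    for wchan, count in count_d_state_by_wchan(SAMPLE).most_common():
        print(f"{count:5d}  {wchan}")
```

Unlike a plain grep for the letter D, this looks only at the STAT column, so the grep process itself and commands that merely contain a "D" are not counted.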

And there are many entries like this in the "mail.err" log:
May 16 08:52:11 dcot-rolmail-1.rolmail.net dovecot: master: Error: service(imap): Process 17468 is ignoring idle SIGINT
May 16 08:52:24 dcot-rolmail-1.rolmail.net dovecot: master: Error: service(imap): Process 20307 is ignoring idle SIGINT
May 16 08:52:25 dcot-rolmail-1.rolmail.net dovecot: master: Error: service(imap): Process 20318 is ignoring idle SIGINT
May 16 08:52:26 dcot-rolmail-1.rolmail.net dovecot: master: Error: service(imap): Process 18964 is ignoring idle SIGINT
May 16 08:52:28 dcot-rolmail-1.rolmail.net dovecot: master: Error: service(imap): Process 19244 is ignoring idle SIGINT
May 16 08:54:22 dcot-rolmail-1.rolmail.net dovecot: master: Error: service(imap): Process 21177 is ignoring idle SIGINT
May 16 08:54:41 dcot-rolmail-1.rolmail.net dovecot: master: Error: service(imap): Process 20647 is ignoring idle SIGINT
May 16 08:55:10 dcot-rolmail-1.rolmail.net dovecot: master: Error: service(imap): Process 18836 is ignoring idle SIGINT
May 16 08:55:17 dcot-rolmail-1.rolmail.net dovecot: master: Error: service(imap): Process 18857 is ignoring idle SIGINT
May 16 08:55:19 dcot-rolmail-1.rolmail.net dovecot: master: Error: service(imap): Process 21176 is ignoring idle SIGINT
May 16 08:55:24 dcot-rolmail-1.rolmail.net dovecot: master: Error: service(imap): Process 20688 is ignoring idle SIGINT
May 16 08:55:25 dcot-rolmail-1.rolmail.net dovecot: master: Error: service(imap): Process 20973 is ignoring idle SIGINT
May 16 08:56:44 dcot-rolmail-1.rolmail.net dovecot: master: Error: service(imap): Process 20326 is ignoring idle SIGINT
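
To see whether these errors come in bursts, they can be counted per minute. This is a rough Python sketch, not a Dovecot tool: the function name is invented, and the pattern assumes each entry sits on a single line in the syslog format shown above (in this mail the entries may appear wrapped).

```python
import re
from collections import Counter

# Captures "Mon DD HH:MM" plus the service name and the process id.
SIGINT_RE = re.compile(
    r"^(\w{3}\s+\d+ \d{2}:\d{2}):\d{2} \S+ dovecot: master: Error: "
    r"service\((\w+)\): Process (\d+) is ignoring idle SIGINT"
)


def sigint_errors_per_minute(log_lines):
    """Return a Counter mapping "Mon DD HH:MM" to the number of
    "ignoring idle SIGINT" errors logged in that minute."""
    counts = Counter()
    for line in log_lines:
        match = SIGINT_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Three entries from the log excerpt above, each on one line.
SAMPLE = [
    "May 16 08:52:11 dcot-rolmail-1.rolmail.net dovecot: master: Error: "
    "service(imap): Process 17468 is ignoring idle SIGINT",
    "May 16 08:52:24 dcot-rolmail-1.rolmail.net dovecot: master: Error: "
    "service(imap): Process 20307 is ignoring idle SIGINT",
    "May 16 08:54:22 dcot-rolmail-1.rolmail.net dovecot: master: Error: "
    "service(imap): Process 21177 is ignoring idle SIGINT",
]

if __name__ == "__main__":
    for minute, count in sorted(sigint_errors_per_minute(SAMPLE).items()):
        print(minute, count)
```

Comparing these per-minute counts with the times at which the D-state processes pile up would show whether the two symptoms appear together.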


Do you have any idea how I can troubleshoot this problem?



Some more technical information about the system:
# uname -a
Linux mailstore4 3.0.28-vs2.3.2.3-rol-em64t #1 SMP Thu May 3 09:31:08 CEST 2012 x86_64 GNU/Linux

Dovecot packages installed on the backend:
#  dpkg -l |grep dovecot
ii  debian-dovecot-auto-keyring 2010.01.30