Re: [Dovecot] Dotlock dovecot-uidlist errors / NFS / High Load

2011-03-24 Thread list
Timo & friends,

We have not posted in a long time regarding this issue, but I wanted to
update everyone that it was indeed an NTP issue.  We have since put
together stratum 2 servers on physical hardware (as opposed to virtual
machines) which now act as the source for our cluster of dovecot machines
and everything is working flawlessly.  I just wanted to thank everyone
who helped us troubleshoot this issue and show my gratitude for such a
great community!

Michael

On Fri, 28 Jan 2011 03:40:38 +0200, Timo Sirainen t...@iki.fi wrote:
 On Tue, 2011-01-25 at 12:34 -0600, l...@airstreamcomm.net wrote:
 
 Jan 25 11:30:11 1295976611 POP3(4eagles): Warning: Created dotlock file's
 timestamp is different than current time (1295976643 vs 1295976607):
 /mail/4/e/4eagles/Maildir/dovecot-uidlist
 
 We added the epoch time to the log_timestamp setting to compare it to the
 dotlock error, and as you can see the Created dotlock epoch time is 32
 seconds in the future compared to the epoch time of the log event.  At this
 point I hope you can help us understand where a timestamp from the future
 might be generated from.
 
 The first timestamp comes from the .lock file's ctime as reported by
 fstat(). If that's in the future, then the possibilities are:
 
 a) Bug in the kernel's NFS client code (unlikely)
 b) Bug in NFS server reporting or setting wrong ctime
 c) NFS server's clock went into future
 
 No possibility of it being Dovecot's fault anymore.



Re: [Dovecot] Dotlock dovecot-uidlist errors / NFS / High Load

2011-01-27 Thread Timo Sirainen
On Tue, 2011-01-25 at 12:34 -0600, l...@airstreamcomm.net wrote:
 
 Jan 25 11:30:11 1295976611 POP3(4eagles): Warning: Created dotlock file's
 timestamp is different than current time (1295976643 vs 1295976607):
 /mail/4/e/4eagles/Maildir/dovecot-uidlist
 
 We added the epoch time to the log_timestamp setting to compare it to the
 dotlock error, and as you can see the Created dotlock epoch time is 32
 seconds in the future compared to the epoch time of the log event.  At this
 point I hope you can help us understand where a timestamp from the future
 might be generated from. 

The first timestamp comes from the .lock file's ctime as reported by
fstat(). If that's in the future, then the possibilities are:

a) Bug in the kernel's NFS client code (unlikely)
b) Bug in NFS server reporting or setting wrong ctime
c) NFS server's clock went into future

No possibility of it being Dovecot's fault anymore.
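
A quick way to check this yourself from one of the Dovecot servers (just a
sketch; assumes GNU stat, and /mail is only an example path for your NFS
mount):

touch /mail/ctime-test
echo "ctime: $(stat -c %Z /mail/ctime-test)  local clock: $(date +%s)"
rm /mail/ctime-test

If those two numbers differ by more than a second or two, the NFS server's
clock or its ctime reporting is off.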




Re: [Dovecot] Dotlock dovecot-uidlist errors / NFS / High Load

2011-01-25 Thread list
On Tue, 25 Jan 2011 01:11:53 +0200, Timo Sirainen t...@iki.fi wrote:
 On 25.1.2011, at 1.06, l...@airstreamcomm.net wrote:
 
 Multi-server setup that tries to flush NFS caches:
 
 dotlock_use_excl = no # only needed with NFSv2, NFSv3+ supports O_EXCL and
 it's faster
 
 You're probably using NFSv3, right? Then this isn't needed.
 
 Also note the tries word. It doesn't work perfectly, although in your
 case it seems to be working worse than expected. Still, these NFS problems are
 the reason I created director: http://wiki2.dovecot.org/Director

We are using NFSv3, and for five months the system worked with four
Dovecot servers and three Postfix servers all accessing the same NFS server
simultaneously.  We cannot pick out a change in our network or in the
virtual environment where our machines reside that would have impacted the
system this drastically.  We have also confirmed that the clocks on all
systems accessing the NFS server, and on the NFS server itself, are within 1
second of each other.  It's confounding us why the logs show such strange
timestamps:

Jan 25 11:30:11 1295976611 POP3(4eagles): Warning: Created dotlock file's
timestamp is different than current time (1295976643 vs 1295976607):
/mail/4/e/4eagles/Maildir/dovecot-uidlist

We added the epoch time to the log_timestamp setting to compare it to the
dotlock error, and as you can see the Created dotlock epoch time is 32
seconds in the future compared to the epoch time of the log event.  At this
point I hope you can help us understand where a timestamp from the future
might be generated from.
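
To spell out the arithmetic from that log line:

1295976643 - 1295976611 = 32   (dotlock ctime vs. the log event's epoch time)
1295976643 - 1295976607 = 36   (dotlock ctime vs. what Dovecot saw as "now")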

As for the director, we will be considering the option after doing some
heavy testing.

Thanks,

Michael



Re: [Dovecot] Dotlock dovecot-uidlist errors / NFS / High Load

2011-01-24 Thread list
Timo,

Thanks for the quick reply!  We are building an RPM with the patch and
will test it this week and report back with our findings.

We are grateful for your help and the chance to communicate directly with
the author of Dovecot!

Michael

On Thu, 20 Jan 2011 23:18:16 +0200, Timo Sirainen t...@iki.fi wrote:
 On Thu, 2011-01-20 at 08:32 -0600, l...@airstreamcomm.net wrote:
 
 Created dotlock file's timestamp is different than current time
 (1295480202 vs 1295479784): /mail/user/Maildir/dovecot-uidlist
 
 Hmm. This may be a bug that happens when dotlocking has to wait for a
 long time for dotlock. See if
 http://hg.dovecot.org/dovecot-1.2/raw-rev/9a50a9dc905f fixes this.



Re: [Dovecot] Dotlock dovecot-uidlist errors / NFS / High Load

2011-01-24 Thread list
On Mon, 24 Jan 2011 08:41:31 -0600, l...@airstreamcomm.net wrote:
 Timo,
 
 Thanks for the quick reply!  We are building an RPM with the patch and
 will test it this week and report back with our findings.
 
 We are grateful for your help and the chance to communicate directly with
 the author of Dovecot!
 
 Michael
 
 On Thu, 20 Jan 2011 23:18:16 +0200, Timo Sirainen t...@iki.fi wrote:
 On Thu, 2011-01-20 at 08:32 -0600, l...@airstreamcomm.net wrote:
 
 Created dotlock file's timestamp is different than current time
 (1295480202 vs 1295479784): /mail/user/Maildir/dovecot-uidlist
 
 Hmm. This may be a bug that happens when dotlocking has to wait for a
 long time for dotlock. See if
 http://hg.dovecot.org/dovecot-1.2/raw-rev/9a50a9dc905f fixes this.

Timo,

We tested the patch you suggested with no success.  We are seeing
timestamps straying hundreds of seconds apart, which does not reflect
any perceivable drift on our systems' clocks (the drift is negligible,
sub 1 second).  Currently we run against two stratum 2 servers
that get their time from two stratum 1 servers, and per Stan's suggestions
earlier we are rebuilding the stratum 2 machines on bare metal hardware
(not in virtual machines) to be sure the clocks stay stable.  I guess
what I am asking is whether you have ever seen an issue similar to this
and whether NTP actually played a role.  I have spent some time reviewing
the mailing list archives and have not found a definitive answer from
other Dovecot users' experiences.

Thanks again,

Michael

Here's the error showing a difference of 239 seconds:

Created dotlock file's timestamp is different than current time
(1295900137 vs 1295899898): /mail/username/Maildir/dovecot-uidlist

Here's the output of dovecot -n after patching:

# 1.2.16: /etc/dovecot.conf
# OS: Linux 2.6.18-92.el5 x86_64 CentOS release 5.5 (Final) 
protocols: imap pop3
listen(default): *:143
listen(imap): *:143
listen(pop3): *:110
shutdown_clients: no
login_dir: /var/run/dovecot/login
login_executable(default): /usr/libexec/dovecot/imap-login
login_executable(imap): /usr/libexec/dovecot/imap-login
login_executable(pop3): /usr/libexec/dovecot/pop3-login
login_process_per_connection: no
login_process_size: 128
login_processes_count: 4
login_max_processes_count: 256
login_max_connections: 386
first_valid_uid: 300
mail_location: maildir:~/Maildir
mmap_disable: yes
dotlock_use_excl: no
mail_nfs_storage: yes
mail_nfs_index: yes
mail_executable(default): /usr/libexec/dovecot/imap
mail_executable(imap): /usr/libexec/dovecot/imap
mail_executable(pop3): /usr/libexec/dovecot/pop3
mail_plugin_dir(default): /usr/lib64/dovecot/imap
mail_plugin_dir(imap): /usr/lib64/dovecot/imap
mail_plugin_dir(pop3): /usr/lib64/dovecot/pop3
auth default:
  username_format: %Ln
  worker_max_count: 50
  passdb:
driver: pam
  userdb:
driver: passwd




Re: [Dovecot] Dotlock dovecot-uidlist errors / NFS / High Load

2011-01-24 Thread Harlan Stenn
Michael wrote:

 We tested the patch you suggested with no success.  We are seeing
 timestamps straying hundreds of seconds apart, which does not reflect
 any perceivable drift on our systems' clocks (the drift is negligible,
 sub 1 second).  Currently we run against two stratum 2 servers that get
 their time from two stratum 1 servers, and per Stan's suggestions
 earlier we are rebuilding the stratum 2 machines on bare metal hardware
 (not in virtual machines) to be sure the clocks stay stable.  I guess
 what I am asking is whether you have ever seen an issue similar to this
 and whether NTP actually played a role.  I have spent some time
 reviewing the mailing list archives and have not found a definitive
 answer from other Dovecot users' experiences.

(posting from my work account...)

Barring strange hardware or OS behavior, NTP is generally really good at
what it does.

There are known problems with some hardware and some OSes (and virtual
environments in particular).

See http://support.ntp.org/bin/view/Support/TroubleshootingNTP for more
information.

And just to try the direct approach, if correct time and NTP are
important to you, please join the NTP Forum.  We need the support.

--
Harlan Stenn st...@ntp.org
http://ntpforum.isc.org  - be a member!


Re: [Dovecot] Dotlock dovecot-uidlist errors / NFS / High Load

2011-01-24 Thread Timo Sirainen
On 24.1.2011, at 23.06, l...@airstreamcomm.net wrote:

 We tested the patch you suggested with no success.  We are seeing
 timestamps straying hundreds of seconds apart, which does not reflect
 any perceivable drift on our systems' clocks (the drift is negligible,
 sub 1 second).
..
 Created dotlock file's timestamp is different than current time
 (1295900137 vs 1295899898): /mail/username/Maildir/dovecot-uidlist

I see you have:

 dotlock_use_excl: no

There was another reason why the above error could happen with that setting. 
http://hg.dovecot.org/dovecot-1.2/rev/ab81fbb195e2 fixes it.

But anyway, the main problem you have is that Dovecot is waiting hundreds
of seconds for the uidlist lock. It shouldn't. That means your NFS server is
answering Dovecot way too slowly.
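
A crude way to see that latency from one of your Dovecot servers (a sketch
only; assumes the mail NFS mount is at /mail, adjust to your setup):

time sh -c 'touch /mail/latency-test && rm /mail/latency-test'

If that regularly takes seconds rather than milliseconds, the lock waits are
coming from the NFS side.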



Re: [Dovecot] Dotlock dovecot-uidlist errors / NFS / High Load

2011-01-24 Thread Harlan Stenn
I may have missed something - if this is NFS related, you are running
NTP on *all* of your machines (clients and servers) that are involved in
this NFS share, right?

NFS cares that the time is sync'd on the computers it works with.
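
An easy way to verify (hostnames here are placeholders; assumes ntpd is
running on every box):

for h in nfs-server dovecot1 dovecot2 postfix1; do
  echo "== $h"; ssh $h 'ntpq -pn'
done

The offset column should be no more than a few milliseconds on every host.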

H


Re: [Dovecot] Dotlock dovecot-uidlist errors / NFS / High Load

2011-01-24 Thread list
On Tue, 25 Jan 2011 00:13:53 +0200, Timo Sirainen t...@iki.fi wrote:
 On 24.1.2011, at 23.06, l...@airstreamcomm.net wrote:
 
 We tested the patch you suggested with no success.  We are seeing
 timestamps straying hundreds of seconds apart, which does not reflect
 any perceivable drift on our systems' clocks (the drift is negligible,
 sub 1 second).
 ..
 Created dotlock file's timestamp is different than current time
 (1295900137 vs 1295899898): /mail/username/Maildir/dovecot-uidlist
 
 I see you have:
 
 dotlock_use_excl: no
 
 There was another reason why the above error could happen with that
 setting. http://hg.dovecot.org/dovecot-1.2/rev/ab81fbb195e2 fixes it.
 
 But anyway, the main problem you have is that Dovecot is waiting hundreds
 of seconds for the uidlist lock. It shouldn't. That means your NFS server
 is answering Dovecot way too slowly.

Timo,

Thanks for the quick reply.  

Per the Dovecot wiki page on NFS (http://wiki.dovecot.org/NFS):

Multi-server setup that tries to flush NFS caches:

mmap_disable = yes
dotlock_use_excl = no # only needed with NFSv2, NFSv3+ supports O_EXCL and
it's faster
mail_nfs_storage = yes # v1.1+ only
mail_nfs_index = yes # v1.1+ only

We configured our systems to match these settings, as we are trying to use
a multi-server setup.  

We certainly have not excluded our NFS server as the culprit for the long
wait, but at this point we are running two servers against that NFS share,
a Postfix machine and a Dovecot machine, both accessing the disk without
error or any latency.  It's only when we add a second Dovecot machine to
the mix that we start to see the dotlock errors.  In our configuration we
have a load balancer that is distributing traffic automatically to the POP
and IMAP services on our cluster of machines.  Once we add a new server to
the mix the load balancer automatically directs new traffic to it, and we
start seeing the errors:

Jan 24 13:20:39 10.123.128.105 dovecot: pop3-login: Login:
user=mpbixler, method=PLAIN, rip=68.65.35.174, lip=64.33.128.105
Jan 24 13:20:42 10.123.128.105 dovecot: POP3(mpbixler): Disconnected:
Logged out top=0/0, retr=0/0, del=0/8234, size=1450555980
Jan 24 13:40:46 10.123.128.105 dovecot: pop3-login: Login:
user=mpbixler, method=PLAIN, rip=68.65.35.174, lip=64.33.128.105
Jan 24 13:40:47 10.123.128.105 dovecot: POP3(mpbixler): Disconnected:
Logged out top=0/0, retr=0/0, del=0/8234, size=1450555980
Jan 24 14:14:28 10.123.128.108 dovecot: pop3-login: Login:
user=mpbixler, method=PLAIN, rip=68.65.35.174, lip=64.33.128.108
Jan 24 14:14:28 10.123.128.108 dovecot: POP3(mpbixler): Created dotlock
file's timestamp is different than current time (1295900068 vs 1295899828):
/mail/m/p/mpbixler/Maildir/dovecot-uidlist
Jan 24 14:14:28 10.123.128.108 dovecot: POP3(mpbixler): Created dotlock
file's timestamp is different than current time (1295900068 vs 1295899828):
/mail/m/p/mpbixler/Maildir/dovecot-uidlist
Jan 24 14:14:32 10.123.128.108 dovecot: POP3(mpbixler): Created dotlock
file's timestamp is different than current time (1295900072 vs 1295899833):
/mail/m/p/mpbixler/Maildir/dovecot-uidlist
Jan 24 14:14:33 10.123.128.108 dovecot: POP3(mpbixler): Disconnected:
Logged out top=0/0, retr=1/7449, del=0/8235, size=1450563412
Jan 24 14:34:36 10.123.128.105 dovecot: pop3-login: Login:
user=mpbixler, method=PLAIN, rip=68.65.35.174, lip=64.33.128.105
Jan 24 14:34:40 10.123.128.105 dovecot: POP3(mpbixler): Disconnected:
Logged out top=0/0, retr=1/10774, del=0/8236, size=1450574168
Jan 24 14:54:53 10.123.128.105 dovecot: pop3-login: Login:
user=mpbixler, method=PLAIN, rip=68.65.35.174, lip=64.33.128.105
Jan 24 14:54:54 10.123.128.105 dovecot: POP3(mpbixler): Disconnected:
Logged out top=0/0, retr=0/0, del=0/8236, size=1450574168


As you can see here, our current mail server 10.123.128.105 is handling
the traffic without incident, and once we add 10.123.128.108 to the cluster
the dovecot service errors out.  From the last time the user was logged in
at 13:40:47 to the login on the new server at 14:14:28 is about 34 minutes;
from my understanding of the code it should not be possible for a lock file
to exist that long, nor is such a delay reflected in the timestamps in the
error log.


We are pretty stumped at this point, as our NFS server appears to be running
smoothly.  All the packet captures we have run show the NFS-related traffic
moving back and forth quickly and without error.

Thanks,

Michael



Re: [Dovecot] Dotlock dovecot-uidlist errors / NFS / High Load

2011-01-24 Thread Timo Sirainen
On 25.1.2011, at 1.06, l...@airstreamcomm.net wrote:

 Multi-server setup that tries to flush NFS caches:
 
 dotlock_use_excl = no # only needed with NFSv2, NFSv3+ supports O_EXCL and
 it's faster

You're probably using NFSv3, right? Then this isn't needed.

Also note the tries word. It doesn't work perfectly, although in your case it
seems to be working worse than expected. Still, these NFS problems are the
reason I created director: http://wiki2.dovecot.org/Director

Re: [Dovecot] Dotlock dovecot-uidlist errors / NFS / High Load

2011-01-20 Thread Stan Hoeppner
l...@airstreamcomm.net put forth on 1/20/2011 8:32 AM:

 Secondly we thought the issues were due to NTP as the time stamps vary so
 widely, so we rebuilt our NTP servers and found closer stratum 1 source
 clocks to synchronize to, hoping it would alleviate the problem, but the
 dotlock errors returned after about 12 hours.  We have fcntl locking set in
 our configuration file, but it is our understanding from looking at the
 source code that this file is locked with dotlock.
 
 Any help troubleshooting is appreciated.

From your description it sounds as if you're ntpd syncing each of the 4 servers
against an external time source, first stratum 2/3 sources, then stratum 1
sources in an attempt to cure this problem.

In a clustered server environment, _always_ run a local physical box/router ntpd
server (preferably two) that queries a set of external sources and serves
your internal machines' queries.  With RTTs all on your LAN, and using the same
internal time sources for every query, this clock drift issue should be
eliminated.  Obviously, when you first set this up, stop ntpd and run ntpdate to
get an initial time sync on each cluster host.
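
A minimal ntp.conf for such a local server might look like this (the pool
hostnames are examples only; pick sources appropriate for your region and
adjust the restrict line to your internal network):

server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst
server 2.pool.ntp.org iburst
driftfile /var/lib/ntp/drift
restrict 10.0.0.0 mask 255.0.0.0 nomodify notrap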

If the problem persists after setting this up, and we're dealing with bare metal
cluster member servers, then I'd guess you've got a failed/defective clock chip
on one host.  If this is Linux, you can work around that by changing the local
time source.  There are something like 5 options.  Google for Linux time source
or similar.  Or, simply replace the hardware--RTC chip, mobo, etc.

If any of these cluster members are virtual machines, regardless of hypervisor,
I'd recommend disabling ntpd and cron'ing ntpdate to run once every 5
minutes, or once a minute, whatever it takes to get the times to remain
synced, against your local ntpd server mentioned above.  I got to the point with
VMware ESX where I could make any Linux distro VM (2.4 or 2.6 kernel) stay within
one minute a month before needing a manual ntpdate against our local time source.
The time required to get to that point is a total waste.  Cron'ing ntpdate as I
mentioned is the quick, reliable way to solve this issue if you're using VMs.
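
For example, a root crontab entry on each VM along these lines (the server
name is a placeholder for your local ntpd box):

*/5 * * * * /usr/sbin/ntpdate -s -u ntp1.example.internal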

-- 
Stan


Re: [Dovecot] Dotlock dovecot-uidlist errors / NFS / High Load

2011-01-20 Thread Timo Sirainen
On Thu, 2011-01-20 at 08:32 -0600, l...@airstreamcomm.net wrote:

 Created dotlock file's timestamp is different than current time
 (1295480202 vs 1295479784): /mail/user/Maildir/dovecot-uidlist

Hmm. This may be a bug that happens when dotlocking has to wait for a
long time for dotlock. See if
http://hg.dovecot.org/dovecot-1.2/raw-rev/9a50a9dc905f fixes this.





Re: [Dovecot] Dotlock dovecot-uidlist errors / NFS / High Load

2011-01-20 Thread Gregory Finch
On 2011-01-20 8:57 AM, Stan Hoeppner wrote:
 l...@airstreamcomm.net put forth on 1/20/2011 8:32 AM:

 Secondly we thought the issues were due to NTP as the time stamps vary so
 widely, so we rebuilt our NTP servers and found closer stratum 1 source
 clocks to synchronize to, hoping it would alleviate the problem, but the
 dotlock errors returned after about 12 hours.  We have fcntl locking set in
 our configuration file, but it is our understanding from looking at the
 source code that this file is locked with dotlock.

 Any help troubleshooting is appreciated.
 From your description it sounds as if you're ntpd syncing each of the 4
 servers against an external time source, first stratum 2/3 sources, then
 stratum 1 sources in an attempt to cure this problem.

 In a clustered server environment, _always_ run a local physical box/router
 ntpd server (preferably two) that queries a set of external sources and
 serves your internal machines' queries.  With RTTs all on your LAN, and using
 the same internal time sources for every query, this clock drift issue should
 be eliminated.  Obviously, when you first set this up, stop ntpd and run
 ntpdate to get an initial time sync on each cluster host.
You're much better off running one ntp server than two. With just two
servers providing time, if they drift from one another, for whatever
reason, there is no way to tell which one has the correct time. If you
need to ensure the time is correct, peer at least 3 machines together,
then they can take care of themselves if one drifts.
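
In ntp.conf terms, that is something like the following on each of the three
boxes (hostnames are placeholders; each host peers with the other two):

# on ntp1
peer ntp2.example.internal
peer ntp3.example.internal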

-Greg





Re: [Dovecot] Dotlock dovecot-uidlist errors / NFS / High Load

2011-01-20 Thread Stan Hoeppner
Gregory Finch put forth on 1/20/2011 3:27 PM:

 You're much better off running one ntp server than two. With just two
 servers providing time, if they drift from one another, for whatever
 reason, there is no way to tell which one has the correct time. If you
 need to ensure the time is correct, peer at least 3 machines together,
 then they can take care of themselves if one drifts.

The reason for two local ntpd servers instead of one is simply redundancy in
case one dies.  In your scenario, you first have to realize you have a drifting
server.  Having 3 servers won't help in this case.  Once you identify the drift,
you simply run a manual ntpdate against external servers and you know instantly
which of your local servers is drifting.

-- 
Stan