Re: [Dovecot] director monitoring?

2011-08-05 Thread Kelsey Cummings
On Fri, Aug 05, 2011 at 11:12:03AM +0200, Jan-Frode Myklebust wrote:
> On Thu, Jun 02, 2011 at 12:29:10PM -0700, Kelsey Cummings wrote:
> I'm using a hacked-up version of poolmon.  The only important changes
> are that it actually logs into the real server rather than just making a
> connection to it, that it has heuristics to prevent the real servers
> from flapping, and that it adds a timeout to scan_host so that if a real
> server blocks after the connection is established it won't hang indefinitely.
> 
> Could you share your hacks ? :-)

Sure.  You'll probably want to change the regex at line 194 to match
whatever your server says after the login is complete.  My postlogin
script puts out some extra info that I'm looking for instead of the
default.  Otherwise, YMMV; it works for me so far.

http://kgc.users.sonic.net/imapdmon
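For readers who can't fetch the script, the idea described above (a full IMAP login with a timeout, rather than a bare TCP connect) can be sketched roughly like this in Python. This is a hypothetical standalone version, not the actual Perl script: the host, credentials, and the `* OK` / tagged-response matching are placeholders you'd adapt to whatever your server (or postlogin script) actually prints.

```python
import socket

def check_imap_login(host, port=143, user="monitor", password="secret",
                     timeout=10):
    """Health-check a backend by completing a real IMAP LOGIN.

    A bare TCP connect can succeed while the server is too busy to log
    anyone in; the timeout also covers a backend that accepts the
    connection and then hangs.
    """
    sock = None
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.settimeout(timeout)  # bound every read, not just the connect
        f = sock.makefile("rb")
        if not f.readline().startswith(b"* OK"):
            return False
        sock.sendall(b"a1 LOGIN %s %s\r\n"
                     % (user.encode(), password.encode()))
        while True:
            line = f.readline()
            if not line:          # server closed the connection
                return False
            if line.startswith(b"a1 "):
                return line.startswith(b"a1 OK")
    except OSError:               # refused, timed out, reset, etc.
        return False
    finally:
        if sock is not None:
            sock.close()
```

A monitor would call this per backend and drop any backend that returns False, with some hold-down logic layered on top to avoid the flapping mentioned above.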


-- 
Kelsey Cummings - k...@corp.sonic.net  sonic.net, inc.
System Architect  2260 Apollo Way
707.522.1000  Santa Rosa, CA 95407


Re: [Dovecot] director monitoring?

2011-08-05 Thread Jan-Frode Myklebust
On Thu, Jun 02, 2011 at 10:37:23AM +0200, Cor Bosman wrote:
> We use a setup as seen at http://grab.by/agCb for about 30,000 
> simultaneous(!) IMAP connections. 

Are you doing NFS against the NetApp(s)? I've always assumed that
maildir wouldn't work on NFS (too-slow fstat()s), but I'd be interested
to learn otherwise.  Could you say something about how many email accounts
and how many files you have in your maildirs?


  -jf


Re: [Dovecot] director monitoring?

2011-08-05 Thread Jan-Frode Myklebust
On Thu, Jun 02, 2011 at 12:29:10PM -0700, Kelsey Cummings wrote:
> I'm using a hacked-up version of poolmon.  The only important changes
> are that it actually logs into the real server rather than just making a
> connection to it, that it has heuristics to prevent the real servers
> from flapping, and that it adds a timeout to scan_host so that if a real
> server blocks after the connection is established it won't hang indefinitely.

Could you share your hacks ? :-)

We're often seeing poolmon not noticing when our backend servers are
hanging on a busy filesystem. They're probably too busy to complete a login,
but not busy enough to fail a connect, so a poolmon that does a full login
sounds interesting.



  -jf


Re: [Dovecot] director monitoring?

2011-06-02 Thread Kelsey Cummings
On Thu, Jun 02, 2011 at 10:37:23AM +0200, Cor Bosman wrote:
> We use a setup as seen at http://grab.by/agCb for about 30,000 
> simultaneous(!) IMAP connections. 

This might as well be a diagram of my network, although, if I remember
correctly, you're running quite a few more NetApp clusters than I am. ;)

> We have 2 Foundry loadbalancers. They check the health of the directors. We 
> have 3 directors, and each one runs Brandon's poolmon script 
> (https://github.com/brandond/poolmon). This script removes real servers out 
> of the director pool. The dovecot imap servers are monitored with nagios just 
> to tell us when they're down. 

I'm using a hacked-up version of poolmon.  The only important changes
are that it actually logs into the real server rather than just making a
connection to it, that it has heuristics to prevent the real servers
from flapping, and that it adds a timeout to scan_host so that if a real
server blocks after the connection is established it won't hang indefinitely.

> This setup has been absolutely rock solid for us. I have not touched the 
> whole system since November and we have not seen any more corruption of 
> metadata, which is the whole reason for the directors.  Kudos to Timo for 
> fixing this difficult problem.

That is always good to hear!

I'd be a lot happier if I were able to monitor the directors and make
sure that they were connected and correctly synced with each other, even
as a protection against human error rather than anticipated software failure.

-- 
Kelsey Cummings - k...@corp.sonic.net  sonic.net, inc.
System Architect  2260 Apollo Way
707.522.1000  Santa Rosa, CA 95407


Re: [Dovecot] director monitoring?

2011-06-02 Thread Cor Bosman
We use a setup as seen at http://grab.by/agCb for about 30,000 simultaneous(!) 
IMAP connections. 

We have 2 Foundry load balancers. They check the health of the directors. We 
have 3 directors, and each one runs Brandon's poolmon script 
(https://github.com/brandond/poolmon). This script removes unhealthy real 
servers from the director pool. The Dovecot IMAP servers are monitored with 
Nagios just to tell us when they're down. 
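For context, the pool manipulation poolmon does boils down to the `doveadm director add` and `doveadm director remove` commands. A minimal sketch of that step, with the vhost count of 100 as an illustrative weight only, and the `run` callable injectable so the command can be inspected without a live director:

```python
import subprocess

def set_backend_enabled(backend_ip, enabled, vhost_count=100,
                        run=subprocess.run):
    """Pull a backend out of the director pool, or put it back,
    the way a poolmon-style monitor does after a failed or
    recovered health check."""
    if enabled:
        cmd = ["doveadm", "director", "add", backend_ip, str(vhost_count)]
    else:
        cmd = ["doveadm", "director", "remove", backend_ip]
    return run(cmd, check=True)
```

Removing a backend only stops new users being assigned to it; consult the doveadm director documentation for how existing user mappings are handled on your version.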

This setup has been absolutely rock solid for us. I have not touched the whole 
system since November and we have not seen any more corruption of metadata, 
which is the whole reason for the directors.  Kudos to Timo for fixing this 
difficult problem.

Cor





[Dovecot] director monitoring?

2011-06-01 Thread Kelsey Cummings
I'm working the kinks out of a new director-based setup for the eventual
migration away from Courier.  At this point, with everything basically
working, I'm trying to ensure that things are properly monitored, and I've
run into an issue.  There doesn't appear to be a way, apart from the logs,
to get Dovecot to tell you whether it is (or is not) connected and properly
synced with the other director servers in the ring.  This seems like an
important piece of information; without it, it isn't apparent how
you would be able to tell that your director servers have lost track of
each other.
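One workaround, in the absence of a direct "am I synced?" query, is to compare the backend mapping each director reports via `doveadm director status` and alert when they disagree. A rough sketch, with the caveat that the column layout assumed here (header row, then ip / vhosts / users per line) is illustrative and should be verified against your Dovecot version's actual output (the users column is ignored, since live user counts legitimately differ between directors):

```python
def parse_backends(status_output):
    """Parse `doveadm director status` output into {ip: vhost_count}."""
    backends = {}
    for line in status_output.splitlines()[1:]:   # skip the header row
        fields = line.split()
        if len(fields) >= 2:
            backends[fields[0]] = int(fields[1])  # ip -> vhost weight
    return backends

def ring_in_sync(status_outputs):
    """True if every director reports the same backend mapping."""
    views = [parse_backends(o) for o in status_outputs]
    return all(v == views[0] for v in views)
```

How you collect the per-director output (ssh, a local cron on each host reporting in, etc.) is a deployment choice; the comparison itself is the easy part.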

I'm also curious what people are doing to health-check their director
servers when they are running load balancers upstream of them.
It doesn't seem like a good idea to let the load balancers check
all the way through to the real servers, since a failure on the target
real server could lead to a director being dropped from the
pool (and if so, the other directors would most likely be dropped
as well).  Otherwise, the health-check failure tolerance at the load
balancer must be greater than the director's tolerance for failure of
the real servers: a dead director could stay in the pool for
longer than desired, or at least long enough to be sure that it isn't a
transient failure on the real server behind it.

A better method would seem to be for the load balancers to query the
director for the number of active back-end servers and, so long as it is
over a given threshold, to assume that the director is otherwise able to
do its job, relying on external monitoring to pick up internal failures
where Dovecot isn't able to successfully proxy the connection to one of
the real servers.
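The thresholded check suggested above is small enough to sketch: count the backend rows in `doveadm director status` output and require a minimum. As before, the assumed output shape (one header line, then one line per backend) is illustrative; adapt the parsing to what your Dovecot version actually prints.

```python
def director_usable(status_output, min_backends=2):
    """Call the director healthy if it still sees at least
    `min_backends` active real servers in its pool."""
    rows = [l for l in status_output.splitlines()[1:] if l.strip()]
    return len(rows) >= min_backends
```

A load balancer's external-script health check (or a tiny HTTP wrapper around this) could then drop a director only when its own view of the pool has collapsed, rather than on any single real-server failure.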

So, how are people doing this in the real world?

-- 
Kelsey Cummings - k...@corp.sonic.net  sonic.net, inc.
System Architect  2260 Apollo Way
707.522.1000  Santa Rosa, CA 95407