Re: Problems with load balancing cluster on GFS

2008-06-06 Thread Jens Hoffrichter
Hello Jorey

2008/6/5 Jorey Bump [EMAIL PROTECTED]:

 At first I thought that this was a problem related to entropy, but it
 even persisted after I turned off allowapop, and unconfigured
 everything relating to TLS (as SSL/TLS will be handled completely by
 the perdition, we don't need it)

 To rule it out completely, watch it during your test:

  watch -n 0 'cat /proc/sys/kernel/random/entropy_avail'

 It might start blocking when it gets as low as 100 (healthy seems to be
 above 1000). If you're at the console (not a remote terminal), type on the
 keyboard to add entropy and see if this helps. If it does, you may have a
 cyrus-sasl that uses /dev/random (the default). Check the source RPM to
 verify, and adjust it to use /dev/urandom to stop the blocking.
Thanks for that hint; I didn't know you could monitor available
entropy that way. That is very useful to know :)

But it doesn't seem to be related to entropy. Though entropy on one of
the nodes is usually quite low (between 100 and 300), it never drops
below the 100 mark, and when I ran a load test, that node and another
one failed, and the failing one had more than 3000 entropy available.

To rule it out completely I started rngd on all the nodes, feeding
from /dev/urandom (I know, not perfect, but better than nothing ;) ),
but that didn't change anything. I also checked the compilation
settings for my cyrus-sasl package; it already uses /dev/urandom as
its entropy source. So I think I can mostly rule entropy out.
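
For reference, the stopgap described above can be reproduced like this. This is only a sketch: rngd options differ between rng-tools versions, so check `man rngd` on your distribution.

```shell
# Feed the kernel entropy pool from /dev/urandom via rngd (rng-tools).
# Note: this only prevents reads from /dev/random from blocking; it does
# not add any real entropy.
rngd -r /dev/urandom -o /dev/random

# Watch the pool level while the load test runs:
watch -n 1 'cat /proc/sys/kernel/random/entropy_avail'
```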

But thanks for the input.

Regards,
Jens

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html


Re: Problems with load balancing cluster on GFS

2008-06-06 Thread Jorey Bump
Jens Hoffrichter wrote, at 06/06/2008 09:46 AM:

 But it doesn't seem to be related to entropy. Though on one of the
 nodes entropy is usually quite low (between 100 and 300), it never
 drops below the 100 mark, and when running a load test, that node and
 another failed, and on the one failing was more than 3000 entropy
 available.
 
 To rule it out completely I started rngd on all the nodes, feeding
 from /dev/urandom (I know, not perfect, but better than nothing ;) ),
 but that didn't change anything. And I checked the compilation
 settings for my cyrus-sasl package, it already takes /dev/urandom as
 entropy source. So I think I can rule it out mostly

Yeah, it shouldn't lock with urandom. You might want to play around with 
poptimeout and popminpoll, to see if that has any effect on your load 
balancing test. Is jakarta-jmeter distributing these logins among enough 
different users to simulate real-world conditions? What do your 
imap/debug logs say when the lockup occurs?
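
For reference, both options live in imapd.conf. A minimal sketch with illustrative values (both are in minutes; check imapd.conf(5) for your version's defaults):

```
# /etc/imapd.conf fragment (illustrative values, not recommendations)
# Drop an idle POP3 connection after 10 minutes:
poptimeout: 10
# Require at least 1 minute between POP logins per user:
popminpoll: 1
```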

While I support POP3, I encourage all of my users to use IMAP, so I 
don't have many problems with pop3d (except for brute force attacks, 
which I solved by increasing sasl_minimum_layer, but that won't help you 
here).





Re: Problems with load balancing cluster on GFS

2008-06-06 Thread Klaus Steinberger



I'm seeing some weird behaviour with the pop3 daemon on a GFS HA
cluster with load balancing.


I would not advise running cyrus-imapd on top of GFS. Even with the
best possible tuning, GFS is very slow with small files (the typical
workload of cyrus-imapd), and it runs into heavy locking under that
kind of load. So don't do it.


What I'm currently doing:

I run virtual machines with Xen on top of a RH Cluster (using
Scientific Linux). rgmanager handles the failover of the Xen instances
very well, so I just run one VM with cyrus-imapd. (This cluster handles
all of my DMZ servers, e.g. it runs VMs for static webpages, Typo3 and
so on; currently around 15 VMs.) The cluster is a 3-node setup with
SAN storage.


A logical volume is exported to the Xen VM, and inside this volume I
create another volume group. A logical volume is created for
/var/spool/imap, which is formatted as plain ext3. No cluster locking
is necessary, as just one virtual machine accesses this volume. And
because the volume group lives inside the VM, I can also use snapshots
(not possible with clvm).
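
A hypothetical sketch of this layout as seen from inside the guest; the device name, VG/LV names and sizes below are placeholders, not the actual setup:

```shell
# The LV exported by the host shows up in the guest as a plain disk,
# e.g. /dev/xvdb; make it a PV and build a guest-local VG on it.
pvcreate /dev/xvdb
vgcreate vg_mail /dev/xvdb

# Carve out the mail spool and format it as ordinary ext3; no cluster
# locking is needed because only this one VM touches the volume.
lvcreate -n lv_spool -L 200G vg_mail
mkfs.ext3 /dev/vg_mail/lv_spool
mount /dev/vg_mail/lv_spool /var/spool/imap

# Because the VG lives inside the VM, plain LVM snapshots work
# (leave some free space in the VG for them):
lvcreate -s -n spool_snap -L 10G /dev/vg_mail/lv_spool
```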


The current size of my IMAP server is 2500 users, with 250 GByte of
mailboxes used (growing and growing).


I don't see how to avoid a murder setup if you need more than one  
machine running cyrus-imapd in parallel.


Sincerely,
Klaus




Re: Problems with load balancing cluster on GFS

2008-06-06 Thread Jens Hoffrichter
Hello,

2008/6/6 Jorey Bump [EMAIL PROTECTED]:

 Yeah, it shouldn't lock with urandom. You might want to play around with
 poptimeout and popminpoll, to see if that has any effect on your load
 balancing test. Is jakarta-jmeter distributing these logins among enough
 different users to simulate real-world conditions? What do your imap/debug
 logs say when the lockup occurs?
Yes, I have configured jmeter to use all 100 mailbox users in a
round-robin fashion, so this should be close to a real-world setup.

The log simply stops saying anything, especially about pop3 connections.

But I think I have solved the current problem:

The problem appears to be related to the Berkeley DB environment in
/var/lib/imap/db. Although I don't use that format, as all of the
databases are configured as skiplist, cyrus still initializes the
environment on every connection. And if some other process has locked
the database, it does a futex call on the mmapped region and goes to
sleep. The problem seems to be that on GFS it never gets a signal that
the database has been unlocked, and it stays asleep forever.
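
A diagnostic sketch for confirming this kind of hang (the PID is a placeholder; pgrep, /proc/<pid>/wchan and the strace options are standard Linux tooling):

```shell
# Show what each pop3d is blocked on in the kernel; hung ones should
# report a futex wait symbol such as futex_wait_queue_me.
for pid in $(pgrep pop3d); do
    printf '%s: %s\n' "$pid" "$(cat /proc/$pid/wchan 2>/dev/null)"
done

# Attach to one hung process and watch its futex calls. Note that
# attaching interrupts the blocked syscall, which matches the
# observation that attaching strace woke the processes up.
strace -tt -e trace=futex -p <PID>
```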

I discovered this today when I systematically strace'd (with strace
-p, which apparently sends some kind of signal to the process) all
pop3d processes on one of the hanging machines, and suddenly
everything started to work again, including the hanging node. A closer
examination showed that each process then retries the futex call,
acquires the lock and just continues.

My solution was to disable bdb at compile time, and now everything
works like a charm, though performance is not yet where I expected it
to be. But I'm not sure whether that is down to my load balancing test
or to the cluster config :)

 While I support POP3, I encourage all of my users to use IMAP, so I don't
 have many problems with pop3d (except for brute force attacks, which I
 solved by increasing sasl_minimum_layer, but that won't help you here).
Not an option here: the customer I'm building the cluster for offers
only POP3 to the outside, and IMAP only for the internal webmail app.
So POP3 HAS to run ;)

Regards,
Jens



Re: Problems with load balancing cluster on GFS

2008-06-06 Thread Jens Hoffrichter
Hello Klaus,

2008/6/6 Klaus Steinberger [EMAIL PROTECTED]:

 I'm seeing some weird behaviour with the pop3 daemon on a GFS HA
 cluster with load balancing.

 I would not advise running cyrus-imapd on top of GFS. GFS is even with the
 best tuning possible very slow regarding small files (the typical load type
 of a cyrus-imapd). GFS runs into heavy locking with that type of load. So
 don't do it.
Thanks for the advice, but currently I am tied to that setup: we are
operating on a schedule and are nearly going live, and I just can't
afford to redo everything at the moment. But I will monitor
performance very closely, I will have a fallback plan in case it just
doesn't do what I expect it to do, and I will start with a low load on
it. If you guys are interested in the setup, I will keep you updated
on how things progress :)

  Current size of my Imap Server is 2500 users and currently 250 GByte of
 Mailboxes used (growing and growing).
Well, we will be talking about something in the range of 50k+
mailboxes, so a single machine is just out of the question, and some
sort of standby will be needed. I didn't do the concept for this
system, though; I'm just the one who has to implement it ;)

 I don't see how to avoid a murder setup if you need more than one machine
 running cyrus-imapd in parallel.
Well, there are other possibilities I have seen, especially together
with perdition and an LDAP server (which we have here anyway). But
that is more in the region of an active-passive setup instead of an
active-active one. And I must admit that I don't know murder that
well, only that it logs very little when delivering a mail ;)

I don't think I can easily move away from the current setup I'm
working on, but I will monitor it very closely. As I said in the other
mail, I have solved the problem I had, but the performance is behind
my expectations, so I will need to do some more testing to confirm
whether I can go live with this cluster.

Regards,
Jens



Re: Problems with load balancing cluster on GFS

2008-06-06 Thread John Madden
   Current size of my Imap Server is 2500 users and currently 250 GByte of
  Mailboxes used (growing and growing).
 Well, we will be talking about something in the range of above 50k
 mailboxes, so a single machine is just out of question. And some sort
 of standby will be needed. I didn't do the concept for this system,
 though, I'm just the one who has to implement it ;)

A single machine is not out of the question for that number of
mailboxes, but it perhaps is for the amount of traffic driven by your
users' behavior -- that's what you need to determine. We happily run
350k mailboxes on a single system, with the determining factor being
I/O contention during mail delivery. Depending on your storage, you
won't necessarily be able to fix that contention by running multiple
machines. I wouldn't count out a single machine with lots of
(relatively small) storage pools to build up performance.

John




-- 
John Madden
Sr. UNIX Systems Engineer
Ivy Tech Community College of Indiana
[EMAIL PROTECTED]




Problems with load balancing cluster on GFS

2008-06-05 Thread Jens Hoffrichter
Hello everyone,

I hope this is the correct mailing list to post this problem on.

I'm seeing some weird behaviour with the pop3 daemon on a GFS HA
cluster with load balancing.

The general situation is as follows:

I have 3 servers here, each installed with CentOS 5.1 and the latest
RedHat cluster suite. Every server runs cyrus 2.3.12p2 from the Invoca
distribution.
The servers share two common partitions for data storage on a SAN:
one 1 GB partition mounted on /var/lib/imap, and one 1.2 TB partition
mounted on /var/spool/imap. On the /var/lib/imap partition I have set
up the following directories so that they point to individual
directories for each node: backup, proc and socket. The backup
directory was split out separately because some cron.daily entries
locked each other up during the night, rendering the cluster useless.
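
One hedged way to get such per-node directories on a shared partition is bind mounts, since the mount table is local to each node. The directory naming scheme below is an assumption for illustration, not the actual one used:

```shell
# On each node: keep private backup/proc/socket trees on the shared
# /var/lib/imap partition, selected by this node's hostname.
node=$(hostname -s)
for d in backup proc socket; do
    mkdir -p "/var/lib/imap/${d}.${node}"
    mount --bind "/var/lib/imap/${d}.${node}" "/var/lib/imap/${d}"
done
```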

In front of the three backend servers is a load balancer, which
balances pop3, imap, lmtp and timsieved on a round robin basis to each
node.

The load balancer is used (or will be used ;) ) by two perdition
servers which connect to the pop or imap port on the LB, which
distributes them to a running node.

The idea behind this is that we can shut down any node without a
noticeable service interruption, and we only have one backend system
instead of several. We want to migrate away from a murder-based setup,
so any comments in that direction won't be very useful to me at this
stage ;)

The problematic behaviour I see at the moment:

I have migrated ~100 test mailboxes from the old backend system, and
I'm in the process of performing load tests on the new system to get
an impression of what the performance will be and whether we are on
the right track. Of the mailboxes, around 80 are empty, 10 are
moderately filled and 10 are filled to their maximum quota, which is
about the distribution we will be seeing after putting the system live.

The load test is performed with jakarta-jmeter from apache.org, which
picks one of the mailboxes and performs either a POP3 or an IMAP login
to the backend through the load balancer. The distribution is roughly
5 POP3 logins for every IMAP login, at a rate of about 5 logins/sec.
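
Not the jmeter test plan itself, but a small Python sketch of that login mix (the function and user names are made up; the 100 mailboxes, round-robin selection and 5:1 POP3:IMAP ratio are from the description above):

```python
from itertools import cycle

def login_schedule(usernames, n, pop_per_imap=5):
    """Yield n (protocol, user) pairs: users are picked round-robin,
    with pop_per_imap POP3 logins for every IMAP login."""
    users = cycle(usernames)
    for i in range(n):
        proto = "imap" if i % (pop_per_imap + 1) == pop_per_imap else "pop3"
        yield proto, next(users)

# 100 test mailboxes, 600 logins -> 500 POP3 and 100 IMAP,
# and every mailbox is hit exactly 6 times.
mailboxes = [f"testuser{i:03d}" for i in range(100)]
plan = list(login_schedule(mailboxes, 600))
```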

Some 30 to 60 seconds into the test, the pop3d on a random one of the
backend servers stops working. It still accepts connections, but no
longer sends a banner. The load balancer still recognizes it as
working (as the port is still open), so one after another all my
connections hit the malfunctioning server and the test basically
stalls.

A restart of the cyrus service cures the problem for another 30 to 60
seconds. If I simply stop the one offending server, so it won't be
used by the LB anymore, the test usually finishes without a problem.

At first I thought that this was a problem related to entropy, but it
persisted even after I turned off allowapop and unconfigured
everything relating to TLS (as SSL/TLS will be handled completely by
perdition, we don't need it).

My personal guess is that it is somehow related to the port tests by
the load balancer, as a connection from the load balancer is normally
the last thing I see in the log of the offending backend server. The
port tests are easy to distinguish, as the LB just opens a TCP
connection and instantly resets it before reading any data from the
pop3d, not even waiting for the banner. After this happens, there are
no more log entries regarding pop3d, nor log entries from the master
about spawning new pop3 processes.
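
The probe behaviour described (connect, then reset without reading) can be reproduced with an abortive close. A sketch, not the load balancer's actual code:

```python
import socket
import struct

def port_probe(host, port, timeout=2.0):
    """Open a TCP connection and abort it immediately with an RST,
    without reading any data (no waiting for the POP3 banner).
    Returns True if the connect itself succeeded."""
    try:
        s = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return False
    # SO_LINGER with l_onoff=1, l_linger=0 turns close() into an
    # abortive close: the kernel sends RST instead of the normal FIN.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                 struct.pack("ii", 1, 0))
    s.close()
    return True

# Demo against a throwaway local listener standing in for pop3d:
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]
ok = port_probe("127.0.0.1", port)
srv.close()
```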

My second guess was that it is related to locking, but the IMAP server
just continues to run fine, and doesn't have a problem.

At the moment I'm running out of ideas of where to look, and my
knowledge of cyrus debugging is quite limited (never had such a
problem before ;) ), so any ideas or pointers on how to debug the
problem would be appreciated.

Oh yes, I tried to strace the pop3d: the pop3d that generates the
last log entry normally receives a SIGPIPE, as the endpoint is no
longer connected to it.

It looks a bit like master doesn't recognize that there is a problem
with spawning new children, and keeps assigning new connections to a
dysfunctional pop3d.

Any ideas, hints or questions will be greatly appreciated; if
information is missing I will provide what I can :)

Thanks in advance!

Regards,
Jens



Re: Problems with load balancing cluster on GFS

2008-06-05 Thread Jorey Bump
Jens Hoffrichter wrote, at 06/05/2008 04:03 PM:

 At first I thought that this was a problem related to entropy, but it
 even persisted after I turned off allowapop, and unconfigured
 everything relating to TLS (as SSL/TLS will be handled completely by
 the perdition, we don't need it)

To rule it out completely, watch it during your test:

   watch -n 0 'cat /proc/sys/kernel/random/entropy_avail'

It might start blocking when it gets as low as 100 (healthy seems to be 
above 1000). If you're at the console (not a remote terminal), type on 
the keyboard to add entropy and see if this helps. If it does, you may 
have a cyrus-sasl that uses /dev/random (the default). Check the source 
RPM to verify, and adjust it to use /dev/urandom to stop the blocking.


