Re: Problems with load balancing cluster on GFS
Hello Jorey,

2008/6/5 Jorey Bump [EMAIL PROTECTED]:
>> At first I thought that this was a problem related to entropy, but it
>> even persisted after I turned off allowapop, and unconfigured everything
>> relating to TLS (as SSL/TLS will be handled completely by the perdition,
>> we don't need it)
>
> To rule it out completely, watch it during your test:
>
>  watch -n 0 'cat /proc/sys/kernel/random/entropy_avail'
>
> It might start blocking when it gets as low as 100 (healthy seems to be
> above 1000). If you're at the console (not a remote terminal), type on
> the keyboard to add entropy and see if this helps. If it does, you may
> have a cyrus-sasl that uses /dev/random (the default). Check the source
> RPM to verify, and adjust it to use /dev/urandom to stop the blocking.

Thanks for that hint, I didn't know that you could monitor available entropy that way; that is very useful to know :)

But it doesn't seem to be related to entropy. Though entropy is usually quite low on one of the nodes (between 100 and 300), it never drops below the 100 mark, and when running a load test, that node and another failed; the failing one had more than 3000 entropy available. To rule it out completely I started rngd on all the nodes, feeding from /dev/urandom (I know, not perfect, but better than nothing ;) ), but that didn't change anything. And I checked the compilation settings for my cyrus-sasl package; it already uses /dev/urandom as its entropy source. So I think I can mostly rule it out.

But thanks for the input.

Regards,
Jens

----
Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
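For anyone following along, the entropy check Jorey suggests can be wrapped in a small script instead of watching it by hand. This is a minimal sketch assuming a Linux /proc interface; the 100 and 1000 thresholds come from this thread, not from any kernel documentation:

```shell
#!/bin/sh
# Read the kernel's available entropy (Linux-specific path from the thread).
entropy=$(cat /proc/sys/kernel/random/entropy_avail)

# Thresholds as suggested in the thread: around 100 is where readers of
# /dev/random may start to block; above roughly 1000 looks healthy.
if [ "$entropy" -lt 100 ]; then
    echo "LOW entropy: $entropy (reads from /dev/random may block)"
elif [ "$entropy" -lt 1000 ]; then
    echo "marginal entropy: $entropy"
else
    echo "healthy entropy: $entropy"
fi
```

Run it from cron or a monitoring agent during the load test to catch a dip you might miss with `watch`.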
Re: Problems with load balancing cluster on GFS
Jens Hoffrichter wrote, at 06/06/2008 09:46 AM:
> But it doesn't seem to be related to entropy. Though on one of the nodes
> entropy is usually quite low (between 100 and 300), it never drops below
> the 100 mark, and when running a load test, that node and another failed,
> and on the one failing was more than 3000 entropy available. To rule it
> out completely I started rngd on all the nodes, feeding from /dev/urandom
> (I know, not perfect, but better than nothing ;) ), but that didn't
> change anything. And I checked the compilation settings for my cyrus-sasl
> package, it already takes /dev/urandom as entropy source. So I think I
> can rule it out mostly

Yeah, it shouldn't lock with urandom. You might want to play around with poptimeout and popminpoll, to see if that has any effect on your load balancing test. Is jakarta-jmeter distributing these logins among enough different users to simulate real-world conditions? What do your imap/debug logs say when the lockup occurs?

While I support POP3, I encourage all of my users to use IMAP, so I don't have many problems with pop3d (except for brute force attacks, which I solved by increasing sasl_minimum_layer, but that won't help you here).
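For reference, the two options Jorey mentions live in imapd.conf. The values below are just the stock defaults as a starting point for experimenting, not recommendations; both are expressed in minutes:

```
# /etc/imapd.conf (fragment)
poptimeout: 10    # idle timeout for POP3 connections, in minutes
popminpoll: 0     # minimum minutes between POP logins per user; 0 disables
```

Raising popminpoll can throttle aggressive pollers, which may change the shape of a round-robin login test like the one described here.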
Re: Problems with load balancing cluster on GFS
> I'm seeing some weird behaviour with the pop3 daemon on a GFS HA cluster
> with load balancing.

I would not advise running cyrus-imapd on top of GFS. Even with the best tuning possible, GFS is very slow with small files (the typical load of a cyrus-imapd), and it runs into heavy locking under that kind of load. So don't do it.

What I'm currently doing: on top of a RH Cluster (using Scientific Linux) I run virtual machines with Xen. rgmanager handles failover of the Xen instances very well, so I just run one VM with a cyrus-imapd. (This cluster handles all of my DMZ servers; e.g. it runs VMs for static webpages, Typo3 and so on, currently around 15 VMs.)

The cluster is a 3-node setup with SAN storage. A logical volume is exported to the Xen VM, and inside this volume I again create a volume group. A logical volume is created for /var/spool/imap, which is formatted simply as ext3. No cluster locking is necessary, as just one virtual machine accesses this volume. Because the volume group is inside the VM, I can also use snapshots (not possible on clvm).

The current size of my IMAP server is 2500 users and currently 250 GByte of mailboxes used (growing and growing).

I don't see how to avoid a murder setup if you need more than one machine running cyrus-imapd in parallel.

Sincerely,
Klaus

This message was sent using IMP, the Internet Messaging Program.
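Klaus's layout can be outlined with stock LVM commands. The device names and sizes below are placeholders, and this needs root and real block devices, so treat it as a sketch of the design rather than something to paste:

```shell
# Inside the Xen guest: the SAN logical volume appears as a plain disk
# (/dev/xvdb is a placeholder name).
pvcreate /dev/xvdb
vgcreate vg_mail /dev/xvdb
lvcreate -n lv_spool -L 200G vg_mail     # size is illustrative

# Plain ext3 is enough: only one VM ever writes here, so no cluster
# filesystem or cluster locking is needed.
mkfs.ext3 /dev/vg_mail/lv_spool
mount /dev/vg_mail/lv_spool /var/spool/imap

# Because the volume group lives inside the guest, ordinary LVM
# snapshots work (which clvm does not offer):
lvcreate -s -n spool_snap -L 10G /dev/vg_mail/lv_spool
```

The design choice here is to push the HA problem down to the hypervisor layer (rgmanager restarting the whole VM elsewhere) so the mail spool itself stays on a simple, fast, single-writer filesystem.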
Re: Problems with load balancing cluster on GFS
Hello,

2008/6/6 Jorey Bump [EMAIL PROTECTED]:
> Yeah, it shouldn't lock with urandom. You might want to play around with
> poptimeout and popminpoll, to see if that has any effect on your load
> balancing test. Is jakarta-jmeter distributing these logins among enough
> different users to simulate real-world conditions? What do your
> imap/debug logs say when the lockup occurs?

Yes, I have configured jmeter to use all of those 100 mailbox users in round-robin fashion, so this should be close to a real-world setup. The log simply stops saying anything, especially about pop3 connections.

But I think I have solved the current problem: it appears to be related to the Berkeley DB environment in /var/lib/imap/db. Although I don't use that format, as all of the databases are configured to use skiplist, cyrus still initializes the environment on every connection. And if some other process has locked the database, it makes a futex call on the mmap region and goes to sleep. The problem seems to be that on GFS it never gets the signal that the database has been unlocked, and it stays sleeping forever.

I discovered this today when I systematically strace'd (with strace -p, which apparently sends some kind of signal to the process) all pop3d processes on one of the hanging machines, and suddenly everything started to work again, including the hanging node. Closer examination showed that each process then makes the futex call again, gets the lock, and just continues.

My solution was to disable bdb while compiling, and everything works like a charm now, though the performance is not yet where I expected it to be. But I'm not sure whether that is down to my load-balancing test or the cluster config :)

> While I support POP3, I encourage all of my users to use IMAP, so I
> don't have many problems with pop3d (except for brute force attacks,
> which I solved by increasing sasl_minimum_layer, but that won't help
> you here).

Not an option here; the customer I'm building the cluster for supports only POP3 to the outside, and IMAP only for the internal webmail app. So POP3 HAS to run ;)

Regards,
Jens
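For anyone hitting the same hang, the two steps Jens describes look roughly like this. The configure flag name is an assumption that varies between cyrus-imapd versions, so check ./configure --help for yours first:

```shell
# 1. The diagnostic that accidentally woke the daemons: attaching strace
#    delivers a signal to each futex-blocked pop3d, which then retries
#    the futex and proceeds.
for pid in $(pgrep pop3d); do
    timeout 2 strace -p "$pid" -e trace=futex
done

# 2. The durable fix: rebuild cyrus-imapd with Berkeley DB disabled so
#    no BDB environment is initialized on each connection.
#    (--without-bdb is an assumed flag name; verify with ./configure --help)
./configure --without-bdb
make
```

Note this is a workaround, not a GFS fix: it simply removes the per-connection BDB environment whose futex never gets woken across the cluster filesystem.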
Re: Problems with load balancing cluster on GFS
Hallo Klaus,

2008/6/6 Klaus Steinberger [EMAIL PROTECTED]:
>> I'm seeing some weird behaviour with the pop3 daemon on a GFS HA
>> cluster with load balancing.
>
> I would not advise running cyrus-imapd on top of GFS. GFS is even with
> the best tuning possible very slow regarding small files (the typical
> load type of a cyrus-imapd). GFS runs into heavy locking with that type
> of load. So don't do it.

Thanks for the advice, but currently I am tied to that setup, because we are operating on a schedule and are nearly going live with it, and I just can't afford to redo everything at the moment. But I will monitor performance very closely, have a fallback plan in case it just doesn't do what I expect, and start with a low load on it. If you guys are interested in the setup, I will keep you updated on how things progress :)

> Current size of my Imap Server is 2500 users and currently 250 GByte of
> Mailboxes used (growing and growing).

Well, we will be talking about something in the range of more than 50k mailboxes, so a single machine is just out of the question, and some sort of standby will be needed. I didn't design this system, though; I'm just the one who has to implement it ;)

> I don't see how to avoid a murder setup if you need more than one
> machine running cyrus-imapd in parallel.

Well, there are other possibilities I have seen, especially together with perdition and an LDAP server (which we have here anyway). But that is more in the region of an active-passive setup instead of an active-active one. And I must admit that I don't know murder that well, only that it logs very little when delivering a mail ;)

I don't think I can easily move away from the current setup I'm working on, but I will monitor it very closely. As I said in the other mail, I have solved the problem I had, but the performance is behind my expectations. So I will need to do some more testing to confirm whether I can go live with this cluster.

Regards,
Jens
Re: Problems with load balancing cluster on GFS
>> Current size of my Imap Server is 2500 users and currently 250 GByte
>> of Mailboxes used (growing and growing).
>
> Well, we will be talking about something in the range of above 50k
> mailboxes, so a single machine is just out of question. And some sort
> of standby will be needed. I didn't do the concept for this system,
> though, I'm just the one who has to implement it ;)

A single machine is not out of the question for that number of mailboxes, but it perhaps is for the amount of traffic driven by your user behavior -- that's what you need to determine. We happily run 350k mailboxes on a single system, with the determining factor being I/O contention during mail delivery. Depending on your storage, you won't necessarily be able to fix that contention by running multiple machines. I wouldn't count out a single machine with lots of (relatively small) storage pools to build performance.

John

--
John Madden
Sr. UNIX Systems Engineer
Ivy Tech Community College of Indiana
[EMAIL PROTECTED]
Problems with load balancing cluster on GFS
Hello everyone,

I hope this is the correct mailing list to post this problem on. I'm seeing some weird behaviour with the pop3 daemon on a GFS HA cluster with load balancing. The general situation is as follows:

I have 3 servers here, each installed with CentOS 5.1 and the latest RedHat cluster. On every server is a cyrus 2.3.12p2 from the Invoca distribution. The servers share two common partitions for data storage on a SAN: one 1 GB partition mounted on /var/lib/imap, and one 1.2 TB partition mounted on /var/spool/imap. On the /var/lib/imap partition I have set up the following directories so they point to individual directories for each node: backup, proc and socket. The backup directory was made separate because some cron.daily entries locked each other up in the night, rendering the cluster useless.

In front of the three backend servers is a load balancer, which balances pop3, imap, lmtp and timsieved on a round-robin basis to each node. The load balancer is used (or will be used ;) ) by two perdition servers which connect to the pop or imap port on the LB, which distributes them to a running node. The idea behind this is that we can shut down any node without a notable service interruption, and we only have one backend system instead of several. We want to migrate away from a murder-based setup, so any comments in that direction won't be very useful for me at this stage ;)

The problematic behaviour I see at the moment: I have migrated ~100 test mailboxes from the old backend system, and I'm in the process of performing load tests on the new system to get an impression of what the performance will be, and whether we are on the right track. Of the mailboxes, around 80 are empty, 10 are moderately filled and 10 are filled to the maximum storage, which is about the distribution we will be talking about after putting the system live.

The load test is performed with jakarta-jmeter from apache.org, which chooses one of the mailboxes and performs either a POP3 or an IMAP login to the backend, using the load balancer. The distribution is roughly 5 POP3 logins for 1 IMAP login, at a rate of about 5 logins/sec.

After 30 to 60 seconds into the test, one of the backend servers' pop3ds will randomly stop working. It is still accepting connections, but doesn't send a banner anymore. This is recognized by the load balancer as working (as the port is still open), so one after another all my connections hit the malfunctioning server and the test basically stalls. A restart of the cyrus service stops the problem for another 30 to 60 seconds. If I just stop the one offending server, so it won't be used by the LB anymore, the test usually finishes without a problem.

At first I thought this was a problem related to entropy, but it persisted even after I turned off allowapop and unconfigured everything relating to TLS (as SSL/TLS will be handled completely by perdition, we don't need it).

My personal guess is that it is somehow related to the port tests by the load balancer, as a connection from the load balancer is normally the last thing I see in the log of the offending backend server. The port tests are easily distinguishable, as the LB just opens a TCP connection and instantly resets it before it reads any data from the pop3d, not even waiting for a banner. After this happens, there are no more log entries regarding pop3d, nor log entries from the master that it has spawned new pop3 processes.

My second guess was that it is related to locking, but the IMAP server just continues to run fine and doesn't have a problem. At the moment I'm running out of ideas where to look, and my knowledge of cyrus debugging is quite limited (never had such a problem before ;) ), so any ideas or pointers on how to debug the problem would be appreciated.

Oh yes, I tried to strace the pop3d, and the pop3d which generates the last log entry normally gets a SIGPIPE, as the endpoint isn't connected to the pop3d anymore. It looks a bit like master doesn't recognize that there is a problem spawning new children, and assigns new connections to a dysfunctional pop3d.

Any ideas, hints or questions will be greatly appreciated; if information is missing I will provide what I can :)

Thanks in advance!

Regards,
Jens
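One way to make a balancer notice this failure mode is to health-check the POP3 greeting rather than the bare TCP handshake. A rough sketch of such a probe (host and port are placeholders, and how you plug it into your load balancer depends entirely on the product):

```shell
#!/bin/sh
# Succeed only if the backend actually sends a POP3 greeting, not merely
# accepts the connection. backend1 and 110 are placeholder values.
banner=$(printf 'QUIT\r\n' | nc -w 5 backend1 110 | head -n 1)
case "$banner" in
    +OK*) exit 0 ;;   # healthy: greeting received
    *)    exit 1 ;;   # port open but daemon wedged (or unreachable)
esac
```

A probe like this would have flagged the wedged pop3d described above, instead of the LB continuing to route traffic to a port that accepts but never greets.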
Re: Problems with load balancing cluster on GFS
Jens Hoffrichter wrote, at 06/05/2008 04:03 PM:
> At first I thought that this was a problem related to entropy, but it
> even persisted after I turned off allowapop, and unconfigured everything
> relating to TLS (as SSL/TLS will be handled completely by the perdition,
> we don't need it)

To rule it out completely, watch it during your test:

 watch -n 0 'cat /proc/sys/kernel/random/entropy_avail'

It might start blocking when it gets as low as 100 (healthy seems to be above 1000). If you're at the console (not a remote terminal), type on the keyboard to add entropy and see if this helps. If it does, you may have a cyrus-sasl that uses /dev/random (the default). Check the source RPM to verify, and adjust it to use /dev/urandom to stop the blocking.