Re: load balancing at fastmail.fm
On Thu, May 10, 2007 at 09:49:01AM -0400, Nik Conwell wrote:
> On Jan 12, 2007, at 10:43 PM, Rob Mueller wrote:
> > Yep, this means we need quite a bit more software to manage the setup, but now that it's done, it's quite nice and works well. For maintenance, we can safely fail all masters off a server in a few minutes, about 10-30 seconds a store. Then we can take the machine down, do whatever we want, bring it back up, wait for replication to catch up again, then fail any masters we want back on to the server.
>
> Just curious how you do this - do you just stop the masters and then change the proxy to point to the replica? Webmail users shouldn't notice this but don't the desktop IMAP clients notice?

We use IPaddr2 from linux-ha to bind the master IP address and replica IP address to each machine based on the database entry saying which slot is the master. That way we don't need to change anything else, you just connect to the master IP address. It also means every slot can just bind to the standard ports on its IP address.

As you can imagine, there's a lot of templating and custom config and init scripts going on here - but it all works nicely once you're set up! The failover scripts also run sync_client on leftover log files and other consistency checks.

Bron.

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
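[For context: IPaddr2 is the linux-ha resource agent that attaches a service IP address to an interface. In a heartbeat v1 style setup, the per-slot binding Bron describes might look roughly like the fragment below; the node name, address, and interface are invented for illustration, and FastMail's actual scripts drive the assignment from their database rather than a static file.]

```
# /etc/ha.d/haresources sketch (heartbeat v1 style; names/addresses invented).
# serverA normally holds slot 1's master IP; heartbeat moves the address
# to the peer on failure or manual failover.
serverA  IPaddr2::10.0.1.11/24/eth0
```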
Re: load balancing at fastmail.fm
On Jan 12, 2007, at 10:43 PM, Rob Mueller wrote:
> Yep, this means we need quite a bit more software to manage the setup, but now that it's done, it's quite nice and works well. For maintenance, we can safely fail all masters off a server in a few minutes, about 10-30 seconds a store. Then we can take the machine down, do whatever we want, bring it back up, wait for replication to catch up again, then fail any masters we want back on to the server.

Just curious how you do this - do you just stop the masters and then change the proxy to point to the replica? Webmail users shouldn't notice this but don't the desktop IMAP clients notice?
Re: load balancing at fastmail.fm
> I agree that storage and replication are orthogonal issues. However, if a lump of storage is no longer a single point of failure then you don't have to invest (or gamble) quite as much to make that storage perfect.

Yes, that old maxim that each extra 9 in 99.99... reliability costs 10 times more.

Also a nice thing about replication is allowing controlled upgrades/changes to systems with almost no visible user downtime by controlled failing of all the masters off a particular machine. Nice.

Rob
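[As a back-of-envelope illustration of that maxim: each additional nine divides the permitted downtime by ten, while, by the rule of thumb above, multiplying the cost by ten. A small sketch of the downtime side of the trade:]

```python
# Allowed downtime per year for a given availability target.
MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_minutes(availability: float) -> float:
    """Minutes of downtime per year permitted at the given availability."""
    return (1 - availability) * MINUTES_PER_YEAR

# Each extra nine buys a 10x reduction in allowed downtime:
for avail in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{avail:.5f} -> {downtime_minutes(avail):8.1f} min/year")
```

[Two nines allows roughly 5256 minutes (3.7 days) of downtime a year; five nines allows about five minutes.]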
Re: load balancing at fastmail.fm
> So you have multiple SAN's? Or your SAN is still a potential SPOF?

Multiple SANs.

> Nice if you can afford it :)

Very ;)
Re: load balancing at fastmail.fm
On Mon, 12 Feb 2007, urgrue wrote:
> SAN really has nothing to do with replication. You have your data somewhere (local or external disks, local/ext raid, NAS, SAN, etc), and you've got your various replication options (file-level, block-level, via client, via server, etc).

I agree that storage and replication are orthogonal issues. However, if a lump of storage is no longer a single point of failure then you don't have to invest (or gamble) quite as much to make that storage perfect.

Software is rarely perfect, as the early history of replication in Cyrus 2.3 demonstrates. If the software isn't itself a single point of failure then it can at least be monitored and fixed. On which note I should pass my thanks to Bron Gondwana, Wes Craig and anyone else who has been working on replication there.

> None of these are a replacement for backups.

Absolutely, I agree. Enterprise storage and replication are both just strategies to reduce the frequency with which you need to resort to backup.

--
David Carter                          Email: [EMAIL PROTECTED]
University Computing Service,         Phone: (01223) 334502
New Museums Site, Pembroke Street,    Fax: (01223) 334679
Cambridge UK. CB2 3QH.
Re: load balancing at fastmail.fm
> I agree of course about avoiding SPOFs, but I do like a multi-tiered approach, I mean multiple lines of defense. I use SAN for its speed, reliability, and ease of administration, but naturally I replicate everything on the SAN and have "true" backups as well.

So you have multiple SAN's? Or your SAN is still a potential SPOF? Nice if you can afford it :)

Rob
Re: load balancing at fastmail.fm
On Mon, 12 Feb 2007, urgrue wrote:
> > If it's using block level replication, how does it offer instant recovery on filesystem corruption? Does it track every block written to disk, and can thus roll back to effectively what was on disk at a particular instant in time, so you then just remount the filesystem and the replay of the journal should restore to a good state?
>
> Yes. I may be wrong but to my understanding at least NetApp has this capability.

No, NetApp takes snapshots of the filesystems on a schedule (hourly, daily, weekly, etc), and you can read files off of those snapshots. You cannot get any more granular than that.

David Lang
Re: load balancing at fastmail.fm
> If it's using block level replication, how does it offer instant recovery on filesystem corruption? Does it track every block written to disk, and can thus roll back to effectively what was on disk at a particular instant in time, so you then just remount the filesystem and the replay of the journal should restore to a good state?

Yes. I may be wrong but to my understanding at least NetApp has this capability.

> With file based replication, about your only way of failure is the replication software going crazy blowing both sides away somehow, which given that the protocol is strictly designed to be one way, seems extremely unlikely that anything will happen to the master side.

I agree of course about avoiding SPOFs, but I do like a multi-tiered approach, I mean multiple lines of defense. I use SAN for its speed, reliability, and ease of administration, but naturally I replicate everything on the SAN and have "true" backups as well.
Re: load balancing at fastmail.fm
> Fastmail don't use SAN, as I understand they use external raid arrays. There are many ways to lose your data, one of these being filesystem error, others being software bugs and human error. Block-level replication (typically used in SANs) is very fast and uses few resources but doesn't protect from filesystem error (although it can offer instant recovery).

If it's using block level replication, how does it offer instant recovery on filesystem corruption? Does it track every block written to disk, and can thus roll back to effectively what was on disk at a particular instant in time, so you then just remount the filesystem and the replay of the journal should restore to a good state?

> File-level replication is somewhat more resilient and easier to monitor, but is just as prone to human errors, bugs, misconfigurations, etc.

Any replication system is prone to human errors and bugs, the most common one being "split brain syndrome", which is pretty much possible with any replication system regardless of which approach it uses if you stuff up. Which is why good tools and automation that ensure you can't stuff it up are really important! :)

> There will be horror stories for every given system in the world. Generally speaking ext3 is very reliable, but naturally no filesystem is going to remove the need for replication and no replication system is going to remove the need for backups.

Indeed. Which is what we have, a replicated setup with nightly incremental backups. And things like filesystem or LVM snapshots are NOT backups, they're still relying on the integrity of your filesystem, rather than being on completely separate storage.

The main thing we were trying to avoid was single points of failure. With a SAN, you generally have a very reliable, though very expensive, central data store, but it's still a single point of failure, and even better you're dealing with some closed system you have to rely on a vendor to support. That may or may not be a good thing depending on your point of view. You still have the SAN as a single point of failure.

With block based replication, you get the hardware redundancy, but you still have the filesystem as a single point of failure. If the master end gets corrupted (eg http://oss.sgi.com/projects/xfs/faq.html#dir2) the other end replicates the corruption.

With file based replication, about your only way of failure is the replication software going crazy blowing both sides away somehow, which given that the protocol is strictly designed to be one way, seems extremely unlikely that anything will happen to the master side.

Rob

PS. As a separate observation, if you're looking to get performance out of cyrus with a large number of users in a significantly busy environment, don't use ext3. We've been using reiserfs for years, but after the SUSE announcement, decided to try ext3 again on a machine. We had to switch it back to reiserfs; the load difference and visible performance difference for our users was quite large. And yes, we tried with dirindex and various journal options. None of them came close to matching the load and response times of our standard reiser mount options: noatime,nodiratime,notail,data=journal. But read these first:

http://www.irbs.net/internet/info-cyrus/0412/0042.html
http://lists.andrew.cmu.edu/pipermail/info-cyrus/2006-October/024119.html
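[For concreteness, Rob's reiser mount options as an /etc/fstab line; the device and mount point here are placeholders:]

```
# Illustrative fstab entry only; device and mount point are made up.
/dev/sdb1  /var/spool/imap  reiserfs  noatime,nodiratime,notail,data=journal  0 0
```

[notail disables ReiserFS tail packing, trading some space for speed, and data=journal journals file data as well as metadata, which helps crash consistency for a mail spool at some write cost.]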
Re: load balancing at fastmail.fm
On 2/12/07 11:01 AM, David Carter wrote:
> I would be surprised if NFS worked given that it is only an approximation to a "real" Unix filesystem. Cyrus really hammers the filesystem.

NFS does not work with cyrus. Been there, done that, didn't like the end of that movie at all.

dn
Re: load balancing at fastmail.fm
David Carter wrote:
> Why do you need NFS? The whole point of a SAN is distributed access to storage after all :).

SAN distributes the disk, not the filesystem. I presume in this case he's not using the SAN for its multiple-client-access features but just because it's fast/reliable.

> Some of my colleagues who run a SAN have had no end of grief. At which point you are dependent on the abilities of the vendor to diagnose and fix problems. It was this experience that encouraged me to try application level replication with lots of small servers in the first place. At least that way I can keep a close eye on what the various copies are up to.

SAN really has nothing to do with replication. You have your data somewhere (local or external disks, local/ext raid, NAS, SAN, etc), and you've got your various replication options (file-level, block-level, via client, via server, etc). None of these are a replacement for backups.

> A SAN doesn't protect you if your filesystem decides to explode: I believe that Fastmail have direct experience of this. Two independent copies of the data allow you to keep running a service for the hours that an fsck typically takes to complete with file-per-msg stores on large modern disks. It also means rather less stress if the fsck fails to complete.

Fastmail don't use SAN; as I understand they use external raid arrays. There are many ways to lose your data, one of these being filesystem error, others being software bugs and human error. Block-level replication (typically used in SANs) is very fast and uses few resources but doesn't protect from filesystem error (although it can offer instant recovery). File-level replication is somewhat more resilient and easier to monitor, but is just as prone to human errors, bugs, misconfigurations, etc.

> I've heard horror stories about all the common Linux filesystems and I've personally watched fsck.ext3 (supposedly the safest option) unravel a filesystem, with thousands of entries left in lost+found. ZFS looks nice.

There will be horror stories for every given system in the world. Generally speaking ext3 is very reliable, but naturally no filesystem is going to remove the need for replication and no replication system is going to remove the need for backups.
Re: load balancing at fastmail.fm
On Mon, 12 Feb 2007, Marten Lehmann wrote:
> because NFS is the only standard network file protocol. I don't want to load a proprietary driver into the kernel to access a SAN device.

Fair enough, although NFS is likely to be really rather slow compared to a block device which just happens to be accessed via a fibre channel link. I would be surprised if NFS worked given that it is only an approximation to a "real" Unix filesystem. Cyrus really hammers the filesystem.

> > I've heard horror stories about all the common Linux filesystems and I've personally watched fsck.ext3 (supposedly the safest option) unravel a filesystem, with thousands of entries left in lost+found.
>
> ext3 with journal? I have never experienced this.

It was in a RAID set which had had a dodgy disk, but there was a definite urk moment when I saw what fsck had done. Fortunately not critical data.

> > ZFS looks nice.
>
> Well, but you are on your own because this project for linux is pretty young.

I don't have any problem with OpenSolaris, though it would be a little amusing given that we moved from Solaris to Linux about 4 years back.

--
David Carter                          Email: [EMAIL PROTECTED]
University Computing Service,         Phone: (01223) 334502
New Museums Site, Pembroke Street,    Fax: (01223) 334679
Cambridge UK. CB2 3QH.
Re: load balancing at fastmail.fm
Hello,

> Why do you need NFS?

because NFS is the only standard network file protocol. I don't want to load a proprietary driver into the kernel to access a SAN device.

> The whole point of a SAN is distributed access to storage after all :).

So where's the point? SANs usually have redundant network devices to access the redundant disk array behind them.

> It depends how much you trust your SAN.

Sure, but at some level you always have to trust something.

> A SAN doesn't protect you if your filesystem decides to explode:

Well, there are inode based SANs and file based SANs. If I'm just splitting an inode based SAN, I could also use internal disks, which give me more control. But with file based SANs I can actually store files (through NFS). And a lot of SANs offer the possibility to do snapshots or replicate their data file-based to another SAN. So you have very high redundancy and availability.

My idea was that Cyrus does lock and mmap indices and databases, but not the actual message files. So these message files could be stored in the SAN with very high redundancy, whereas the metadata which needs to be mmapped remains on the blade with internal disks, so in case of problems you could at least restore the messages from the SAN (and its snapshots if you accidentally deleted something) and rebuild the indices.

> I've heard horror stories about all the common Linux filesystems and I've personally watched fsck.ext3 (supposedly the safest option) unravel a filesystem, with thousands of entries left in lost+found.

ext3 with journal? I have never experienced this.

> ZFS looks nice.

Well, but you are on your own because this project for linux is pretty young.

Regards
Marten
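[What Marten sketches here - mmapped metadata on local disk while message files live on big redundant storage - is close to what Cyrus's metapartition options (mentioned elsewhere in this thread) support. An illustrative imapd.conf fragment, with invented paths; check the imapd.conf man page for your version before relying on the exact option names:]

```
# imapd.conf sketch: message spool on one volume, hot mmapped
# metadata on another. Paths are illustrative only.
partition-default: /san/spool/imap
metapartition-default: /var/imap/meta
metapartition_files: header index cache expunge squat
```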
Re: load balancing at fastmail.fm
On Mon, 12 Feb 2007, Marten Lehmann wrote:
> what do you think about moving the mailspool to a central SAN storage shared via NFS and having several blades to manage the mmapped files like seen state, quota etc.?

Why do you need NFS? The whole point of a SAN is distributed access to storage after all :).

> So still only one server is responsible for a certain set of mailboxes, but these SAN boxes have nice backup and redundancy features which are hard to get with common servers

It depends how much you trust your SAN. Some of my colleagues who run a SAN have had no end of grief. At which point you are dependent on the abilities of the vendor to diagnose and fix problems. It was this experience that encouraged me to try application level replication with lots of small servers in the first place. At least that way I can keep a close eye on what the various copies are up to.

A SAN doesn't protect you if your filesystem decides to explode: I believe that Fastmail have direct experience of this. Two independent copies of the data allow you to keep running a service for the hours that an fsck typically takes to complete with file-per-msg stores on large modern disks. It also means rather less stress if the fsck fails to complete.

I've heard horror stories about all the common Linux filesystems and I've personally watched fsck.ext3 (supposedly the safest option) unravel a filesystem, with thousands of entries left in lost+found. ZFS looks nice.

--
David Carter                          Email: [EMAIL PROTECTED]
University Computing Service,         Phone: (01223) 334502
New Museums Site, Pembroke Street,    Fax: (01223) 334679
Cambridge UK. CB2 3QH.
Re: load balancing at fastmail.fm
On 2/12/07 5:41 AM, Marten Lehmann wrote:
> Hello,
>
> what do you think about moving the mailspool to a central SAN storage shared via NFS and having several blades to manage the mmapped files like seen state, quota etc.? So still only one server is responsible for a certain set of mailboxes, but these SAN boxes have nice backup and redundancy features which are hard to get with common servers and there shouldn't be mmap problems as long as all indices remain on the blade on a separate metadata-partition.

Cyrus and NFS don't get along due to locking issues; I believe this is covered in the docs. I tried this about a year ago and it spewed errors at an impressive rate. Instead, you might want to check out the Cyrus IMAP Aggregator design page, which allows you to distribute mailboxes across multiple servers:

http://asg.web.cmu.edu/cyrus/ag.html

dn
Re: load balancing at fastmail.fm
Hello,

what do you think about moving the mailspool to a central SAN storage shared via NFS and having several blades to manage the mmapped files like seen state, quota etc.? So still only one server is responsible for a certain set of mailboxes, but these SAN boxes have nice backup and redundancy features which are hard to get with common servers, and there shouldn't be mmap problems as long as all indices remain on the blade on a separate metadata-partition.

Regards
Marten
Re: load balancing at fastmail.fm
> Thanks, that was interesting reading. Is there any specific reason you didn't opt for a cluster filesystem?

Internal knowledge mostly. We were very familiar with the performance and overall usage implications of local filesystems on the locally attached SATA-to-SCSI RAID boxes we've been using for a while. The setup, performance, maintenance, etc. of cluster filesystems would involve learning and using entirely new technologies that we didn't know much about, that are complex, and that we probably wouldn't fully trust.

A high usage environment like ours stresses software and finds bugs you don't expect. Our backup system was regularly crashing our kernel NFS server in multiple versions of the linux kernel. Even when we tried LVM, we managed to find subtle bugs that seemed to cause filesystem corruption (search for "LVM reiserfs corruption"; there are old reports in the kernel mailing lists). These are both supposedly well tested, well used and well understood technologies that still show their corner case problems when you push them. I'd hate to see cluster filesystems pushed, given their inherent complexity.

Rob
Re: load balancing at fastmail.fm
Thanks, that was interesting reading. Is there any specific reason you didn't opt for a cluster filesystem?

Rob Mueller wrote:
> > May I ask how you are doing the actual replication, technically speaking? shared fs, drbd, something over imap?
>
> We're using the replication engine in cyrus 2.3
>
> http://blog.fastmail.fm/?p=576
>
> Rob
Re: load balancing at fastmail.fm
> May I ask how you are doing the actual replication, technically speaking? shared fs, drbd, something over imap?

We're using the replication engine in cyrus 2.3:

http://blog.fastmail.fm/?p=576

Rob
Re: load balancing at fastmail.fm
May I ask how you are doing the actual replication, technically speaking? shared fs, drbd, something over imap?

Rob Mueller wrote:
> > as fastmail.fm seems to be a very big setup of cyrus nodes, I would be interested to know how you organized load balancing and managing disk space.
> >
> > Did you setup servers for a maximum of lets say 1000 mailboxes and then you use a new server? Or do you use a murder installation so you can move mailboxes to another server once a certain server gets too much load? Or do you have a big SAN storage with good mmap support behind an arbitrary amount of cyrus nodes?
>
> We don't use a murder setup. Two main reasons.
>
> 1) Murder wasn't very mature when we started
> 2) The main advantage murder gives you is a set of proxies (imap/pop/lmtp) to connect users to the appropriate backends, which we ended up using other software for, and a unified mailbox namespace if you want to do mailbox sharing, something we didn't really need either. Also the unified namespace needs a global mailboxes.db somewhere. As it was, because the skiplist backend mmaps the entire mailboxes.db file into memory, and we had multiple machines with 100M+ mailboxes.db files, I didn't really like the idea of dealing with a 500M+ mailboxes.db file.
>
> We don't use shared SAN storage. When we started out we didn't have that much money, so purchasing an expensive SAN unit wasn't an option.
>
> What we have has evolved over time to our current point. Basically we now have a hardware set that is quite nicely balanced with regard to spool IO vs metadata IO vs CPU, and a storage configuration that gives us replication with good failure capability, but without having to waste lots of hardware on just having replica machines.
>
> IMAP/POP frontend - We used to use perdition, but have now changed to nginx (http://blog.fastmail.fm/?p=592). As you can read from the linked blog post, nginx is great.
>
> LMTP delivery - We use a custom written perl daemon that forwards lmtp deliveries from postfix to the appropriate backend server. It also does the spam scanning, virus checking and a bunch of other in house stuff.
>
> Servers - We use servers with attached SATA-to-SCSI RAID units with battery backed up caches. We have a mix of large drives for the email spool, and smaller faster drives for meta-data. That's the reason we sponsored the metapartition config options (http://cyrusimap.web.cmu.edu/imapd/changes.html).
>
> Replication - We initially started with pairs of machines, half of each being a replica and half a master, replicating between each other, but that meant on a failure, one machine became fully loaded with masters. Masters take a much bigger IO hit than replicas. Instead we went with a system we call "slots" and "stores". Each machine is divided into a set of "slots". "Slots" from different machines are then paired as a replicated "store" with a master and replica. So say you have 20 slots per machine (half master, half replica), and 10 machines; then if one machine fails, on average you only have to distribute one more master slot to each of the other machines. Much better on IO. Some more details in this blog post on our replication trials: http://blog.fastmail.fm/?p=576
>
> Yep, this means we need quite a bit more software to manage the setup, but now that it's done, it's quite nice and works well. For maintenance, we can safely fail all masters off a server in a few minutes, about 10-30 seconds a store. Then we can take the machine down, do whatever we want, bring it back up, wait for replication to catch up again, then fail any masters we want back on to the server.
>
> Unfortunately most of this software is in house and quite specific to our setup; it's not very "generic" (e.g. it assumes particular disk layouts and sizes, machines, database tables, hostnames, etc), so it's not something we're going to release.
>
> Rob
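[The nginx frontend quoted above works as an IMAP/POP proxy: for each login, nginx queries an auth_http service, which authenticates the user and returns the backend server to proxy the connection to. A minimal sketch of the idea using nginx's mail proxy module - the auth endpoint address and ports are invented, and the auth service itself is site-specific software you write:]

```
# Sketch of nginx mail proxying (addresses are placeholders).
# The auth_http service maps each user to their backend mail server.
mail {
    auth_http  127.0.0.1:8080/auth;   # hypothetical local auth service

    server {
        listen    143;
        protocol  imap;
        proxy     on;
    }
    server {
        listen    110;
        protocol  pop3;
        proxy     on;
    }
}
```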
Re: load balancing at fastmail.fm
> as fastmail.fm seems to be a very big setup of cyrus nodes, I would be interested to know how you organized load balancing and managing disk space.
>
> Did you setup servers for a maximum of lets say 1000 mailboxes and then you use a new server? Or do you use a murder installation so you can move mailboxes to another server once a certain server gets too much load? Or do you have a big SAN storage with good mmap support behind an arbitrary amount of cyrus nodes?

We don't use a murder setup. Two main reasons.

1) Murder wasn't very mature when we started
2) The main advantage murder gives you is a set of proxies (imap/pop/lmtp) to connect users to the appropriate backends, which we ended up using other software for, and a unified mailbox namespace if you want to do mailbox sharing, something we didn't really need either. Also the unified namespace needs a global mailboxes.db somewhere. As it was, because the skiplist backend mmaps the entire mailboxes.db file into memory, and we had multiple machines with 100M+ mailboxes.db files, I didn't really like the idea of dealing with a 500M+ mailboxes.db file.

We don't use shared SAN storage. When we started out we didn't have that much money, so purchasing an expensive SAN unit wasn't an option.

What we have has evolved over time to our current point. Basically we now have a hardware set that is quite nicely balanced with regard to spool IO vs metadata IO vs CPU, and a storage configuration that gives us replication with good failure capability, but without having to waste lots of hardware on just having replica machines.

IMAP/POP frontend - We used to use perdition, but have now changed to nginx (http://blog.fastmail.fm/?p=592). As you can read from the linked blog post, nginx is great.

LMTP delivery - We use a custom written perl daemon that forwards lmtp deliveries from postfix to the appropriate backend server. It also does the spam scanning, virus checking and a bunch of other in house stuff.

Servers - We use servers with attached SATA-to-SCSI RAID units with battery backed up caches. We have a mix of large drives for the email spool, and smaller faster drives for meta-data. That's the reason we sponsored the metapartition config options (http://cyrusimap.web.cmu.edu/imapd/changes.html).

Replication - We initially started with pairs of machines, half of each being a replica and half a master, replicating between each other, but that meant on a failure, one machine became fully loaded with masters. Masters take a much bigger IO hit than replicas. Instead we went with a system we call "slots" and "stores". Each machine is divided into a set of "slots". "Slots" from different machines are then paired as a replicated "store" with a master and replica. So say you have 20 slots per machine (half master, half replica), and 10 machines; then if one machine fails, on average you only have to distribute one more master slot to each of the other machines. Much better on IO. Some more details in this blog post on our replication trials: http://blog.fastmail.fm/?p=576

Yep, this means we need quite a bit more software to manage the setup, but now that it's done, it's quite nice and works well. For maintenance, we can safely fail all masters off a server in a few minutes, about 10-30 seconds a store. Then we can take the machine down, do whatever we want, bring it back up, wait for replication to catch up again, then fail any masters we want back on to the server.

Unfortunately most of this software is in house and quite specific to our setup; it's not very "generic" (e.g. it assumes particular disk layouts and sizes, machines, database tables, hostnames, etc), so it's not something we're going to release.

Rob
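[The IO arithmetic behind the slot scheme can be checked with a toy model - this is an illustration of the reasoning above, not FastMail's software. With N machines each holding S master slots, losing one machine spreads its S masters over the N-1 survivors, i.e. only S/(N-1) extra masters each:]

```python
# Toy model of the "slots and stores" failover arithmetic.
# Each machine runs `masters_per_machine` master slots; when one machine
# dies, its masters are spread evenly over the surviving machines.

def extra_masters_per_survivor(machines: int, masters_per_machine: int) -> float:
    """Average extra master slots each surviving machine picks up
    when a single machine fails."""
    return masters_per_machine / (machines - 1)

# Rob's example: 10 machines, 20 slots each (10 masters, 10 replicas):
# each survivor picks up only about one extra master.
print(extra_masters_per_survivor(machines=10, masters_per_machine=10))
```

[Contrast the paired-machine case: machines=2 puts all 10 extra masters on the lone survivor, which is exactly the full-load failover the slot scheme avoids.]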