Re: load balancing at fastmail.fm

2007-05-11 Thread Bron Gondwana
On Thu, May 10, 2007 at 09:49:01AM -0400, Nik Conwell wrote:
> 
> On Jan 12, 2007, at 10:43 PM, Rob Mueller wrote:
> 
> >Yep, this means we need quite a bit more software to manage the  
> >setup, but now that it's done, it's quite nice and works well. For  
> >maintenance, we can safely fail all masters off a server in a few  
> >minutes, about 10-30 seconds a store. Then we can take the machine  
> >down, do whatever we want, bring it back up, wait for replication  
> >to catch up again, then fail any masters we want back on to the  
> >server.
> 
> Just curious how you do this - do you just stop the masters and then 
> change the proxy to point to the replica?  Webmail users shouldn't  
> notice this but don't the desktop IMAP clients notice?

We use IPaddr2 from linux-ha to bind the master IP address and replica
IP address to each machine, based on the database entry saying which
slot is the master.  That way we don't need to change anything else;
clients just connect to the master IP address.

It also means every slot can just bind to the standard ports on its
IP address.
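
For illustration only (Fastmail's failover scripts are in-house, and the
address, netmask and interface below are invented), the linux-ha IPaddr2
resource script can be driven by hand to move a per-slot service IP:

    # on the machine taking over as master for a slot:
    /etc/ha.d/resource.d/IPaddr2 10.10.7.1/24/eth0 start

    # on the machine giving the slot up:
    /etc/ha.d/resource.d/IPaddr2 10.10.7.1/24/eth0 stop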

As you can imagine, there's a lot of templating and custom config and
init scripts going on here - but it all works nicely once you're set
up!

The failover scripts also run sync_client on leftover log files and
other consistency checks.
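
As an example (log file path invented; see the sync_client(8) man page for
the exact options in your version), replaying a leftover rolling-replication
log before promoting the replica might look like:

    sync_client -v -f /var/imap/sync/log-leftover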

Bron.

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html


Re: load balancing at fastmail.fm

2007-05-10 Thread Nik Conwell


On Jan 12, 2007, at 10:43 PM, Rob Mueller wrote:

Yep, this means we need quite a bit more software to manage the  
setup, but now that it's done, it's quite nice and works well. For  
maintenance, we can safely fail all masters off a server in a few  
minutes, about 10-30 seconds a store. Then we can take the machine  
down, do whatever we want, bring it back up, wait for replication  
to catch up again, then fail any masters we want back on to the  
server.


Just curious how you do this - do you just stop the masters and then  
change the proxy to point to the replica?  Webmail users shouldn't  
notice this but don't the desktop IMAP clients notice?





Re: load balancing at fastmail.fm

2007-02-13 Thread Rob Mueller
I agree that storage and replication are orthogonal issues. However, if a 
lump of storage is no longer a single point of failure then you don't have 
to invest (or gamble) quite as much to make that storage perfect.


Yes, that old maxim that each extra 9 in 99.99... reliability costs 10 times 
more.


Another nice thing about replication is that it allows controlled
upgrades/changes to systems with almost no visible user downtime, by failing
all the masters off a particular machine in a controlled way. Nice.


Rob




Re: load balancing at fastmail.fm

2007-02-13 Thread urgrue



So do you have multiple SANs? Or is your SAN still a potential SPOF?

Multiple SANs.


Nice if you can afford it :)


Very ;)



Re: load balancing at fastmail.fm

2007-02-13 Thread David Carter

On Mon, 12 Feb 2007, urgrue wrote:

SAN really has nothing to do with replication. You have your data 
somewhere (local or external disks, local/ext raid, NAS, SAN, etc), and 
you've got your various replication options (file-level, block-level, via 
client, via server, etc).


I agree that storage and replication are orthogonal issues. However, if a 
lump of storage is no longer a single point of failure then you don't have 
to invest (or gamble) quite as much to make that storage perfect.


Software is rarely perfect, as the early history of replication in Cyrus 
2.3 demonstrates. If the software isn't itself a single point of failure 
then it can at least be monitored and fixed. On which note I should pass 
my thanks to Bron Gondwana, Wes Craig and anyone else who has been working 
on replication there.



None of these are a replacement for backups.


Absolutely, I agree. Enterprise storage and replication are both just 
strategies to reduce the frequency with which you need to resort to backups.


--
David Carter Email: [EMAIL PROTECTED]
University Computing Service,Phone: (01223) 334502
New Museums Site, Pembroke Street,   Fax:   (01223) 334679
Cambridge UK. CB2 3QH.



Re: load balancing at fastmail.fm

2007-02-12 Thread Rob Mueller


I agree of course about avoiding SPOFs, but I do like a multi-tiered 
approach, I mean multiple lines of defense. I use SAN for its speed, 
reliability, and ease of administration, but naturally I replicate 
everything on the SAN and have "true" backups as well.


So do you have multiple SANs? Or is your SAN still a potential SPOF?

Nice if you can afford it :)

Rob




Re: load balancing at fastmail.fm

2007-02-12 Thread David Lang

On Mon, 12 Feb 2007, urgrue wrote:

If it's using block level replication, how does it offer instant recovery 
on filesystem corruption? Does it track every block written to disk, and 
can thus roll back to effectively what was on disk at a particular instant 
in time, so you then just remount the filesystem and the replay of the 
journal should restore to a good state?
Yes. I may be wrong but to my understanding at least NetApp has this 
capability.


No, NetApp takes snapshots of the filesystems on a schedule (hourly, daily, 
weekly, etc), and you can read files off of those snapshots. You cannot get 
any more granular than that.


David Lang



Re: load balancing at fastmail.fm

2007-02-12 Thread urgrue


If it's using block level replication, how does it offer instant 
recovery on filesystem corruption? Does it track every block written 
to disk, and can thus roll back to effectively what was on disk at a 
particular instant in time, so you then just remount the filesystem 
and the replay of the journal should restore to a good state?
Yes. I may be wrong but to my understanding at least NetApp has this 
capability.




With file-based replication, about the only failure mode is the 
replication software going crazy and blowing both sides away somehow; 
given that the protocol is strictly designed to be one-way, it seems 
extremely unlikely that anything would happen to the master side.
I agree of course about avoiding SPOFs, but I do like a multi-tiered 
approach, I mean multiple lines of defense. I use SAN for its speed, 
reliability, and ease of administration, but naturally I replicate 
everything on the SAN and have "true" backups as well.




Re: load balancing at fastmail.fm

2007-02-12 Thread Rob Mueller

Fastmail don't use a SAN; as I understand it, they use external RAID arrays.
There are many ways to lose your data, one of these being filesystem 
error, others being software bugs and human error. Block-level replication 
(typically used in SANs) is very fast and uses few resources but doesn't 
protect from filesystem error (although it can offer instant recovery).


If it's using block level replication, how does it offer instant recovery on 
filesystem corruption? Does it track every block written to disk, and can 
thus roll back to effectively what was on disk at a particular instant in 
time, so you then just remount the filesystem and the replay of the journal 
should restore to a good state?


File-level replication is somewhat more resilient and easier to monitor, 
but is just as prone to human errors, bugs, misconfigurations, etc.


Any replication system is prone to human errors and bugs, the most common 
one being "split brain syndrome", which is pretty much possible with any 
replication system regardless of which approach it uses, if you stuff up. 
Which is why good tools and automation that ensure you can't stuff it up are 
really important! :)


There will be horror stories for every given system in the world. 
Generally speaking ext3 is very reliable, but naturally no filesystem is 
going to remove the need for replication and no replication system is 
going to remove the need for backups.


Indeed. Which is what we have: a replicated setup with nightly incremental 
backups. And things like filesystem or LVM snapshots are NOT backups; 
they still rely on the integrity of your filesystem, rather than living on 
completely separate storage.


The main thing we were trying to avoid was single points of failure.

With a SAN, you generally have a very reliable, though very expensive, 
central data store, but it's still a single point of failure, and, even 
better, you're dealing with some closed system you have to rely on a vendor 
to support. That may or may not be a good thing depending on your point of 
view. Either way, the SAN itself remains a single point of failure.


With block-based replication, you get the hardware redundancy, but you still 
have the filesystem as a single point of failure. If the master end gets 
corrupted (eg http://oss.sgi.com/projects/xfs/faq.html#dir2), the other end 
replicates the corruption.


With file-based replication, about the only failure mode is the replication 
software going crazy and blowing both sides away somehow; given that the 
protocol is strictly designed to be one-way, it seems extremely unlikely 
that anything would happen to the master side.


Rob

PS. As a separate observation, if you're looking to get performance out of 
cyrus with a large number of users in a significantly busy environment, 
don't use ext3. We've been using reiserfs for years, but after the SUSE 
announcement we decided to try ext3 again on a machine. We had to switch it 
back to reiserfs; the load difference and visible performance difference for 
our users was quite large. And yes, we tried dir_index and various 
journal options. None of them came close to matching the load and response 
times of our standard reiser mount options 
(noatime,nodiratime,notail,data=journal), but read these first:


http://www.irbs.net/internet/info-cyrus/0412/0042.html
http://lists.andrew.cmu.edu/pipermail/info-cyrus/2006-October/024119.html
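
For reference, an /etc/fstab line using those mount options (device and
mount point invented for illustration):

    /dev/sdb1  /var/spool/imap  reiserfs  noatime,nodiratime,notail,data=journal  0 0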





Re: load balancing at fastmail.fm

2007-02-12 Thread David Newman

On 2/12/07 11:01 AM, David Carter wrote:
> 
> I would be surprised if NFS worked given that it is only an approximation
> to a "real" Unix filesystem. Cyrus really hammers the filesystem.

NFS does not work with cyrus. Been there, done that, didn't like the end
of that movie at all.

dn




Re: load balancing at fastmail.fm

2007-02-12 Thread urgrue



David Carter wrote:

Why do you need NFS?

The whole point of a SAN is distributed access to storage after all :).


A SAN distributes the disk, not the filesystem. I presume in this case he's 
not using the SAN for its multiple-client-access features but just 
because it's fast/reliable.




Some of my colleagues who run a SAN have had no end of grief. At which 
point you are dependent on the abilities of the vendor to diagnose and 
fix problems. It was this experience that encouraged me to try 
application level replication with lots of small servers in the first 
place. At least that way I can keep a close eye on what the various 
copies are up to.


SAN really has nothing to do with replication. You have your data 
somewhere (local or external disks, local/ext raid, NAS, SAN, etc), and 
you've got your various replication options (file-level, block-level, via 
client, via server, etc).

None of these are a replacement for backups.



A SAN doesn't protect you if your filesystem decides to explode: I 
believe that Fastmail have direct experience of this. Two independent 
copies of the data allow you to keep running a service for the hours 
that an fsck typically takes to complete with file-per-message stores on 
large modern disks. It also means rather less stress if the fsck fails 
to complete.


Fastmail don't use a SAN; as I understand it, they use external RAID arrays.
There are many ways to lose your data, one of these being filesystem 
error, others being software bugs and human error. Block-level 
replication (typically used in SANs) is very fast and uses few resources 
but doesn't protect from filesystem error (although it can offer instant 
recovery). File-level replication is somewhat more resilient and easier 
to monitor, but is just as prone to human errors, bugs, 
misconfigurations, etc.


I've heard horror stories about all the common Linux filesystems and 
I've personally watched fsck.ext3 (supposedly the safest option) 
unravel a filesystem, with thousands of entries left in lost+found. 
ZFS looks nice.




There will be horror stories for every given system in the world. 
Generally speaking ext3 is very reliable, but naturally no filesystem is 
going to remove the need for replication and no replication system is 
going to remove the need for backups.








Re: load balancing at fastmail.fm

2007-02-12 Thread David Carter

On Mon, 12 Feb 2007, Marten Lehmann wrote:

because NFS is the only standard network file protocol. I don't want to 
load a proprietary driver into the kernel to access a SAN device.


Fair enough, although NFS is likely to be really rather slow compared to a 
block device which just happens to be accessed via a fibre channel link.


I would be surprised if NFS worked given that it is only an approximation to 
a "real" Unix filesystem. Cyrus really hammers the filesystem.


I've heard horror stories about all the common Linux filesystems and I've 
personally watched fsck.ext3 (supposedly the safest option) unravel a 
filesystem, with thousands of entries left in lost+found.


ext3 with journal? I have never experienced this.


It was in a RAID set which had had a dodgy disk, but there was a definite 
urk moment when I saw what fsck had done. Fortunately not critical data.



ZFS looks nice.


Well, but you are on your own because this project for linux is pretty 
young.


I don't have any problem with OpenSolaris, though it would be a little 
amusing given that we moved from Solaris to Linux about 4 years back.


--
David Carter Email: [EMAIL PROTECTED]
University Computing Service,Phone: (01223) 334502
New Museums Site, Pembroke Street,   Fax:   (01223) 334679
Cambridge UK. CB2 3QH.



Re: load balancing at fastmail.fm

2007-02-12 Thread Marten Lehmann

Hello,


Why do you need NFS?


because NFS is the only standard network file protocol. I don't want to 
load a proprietary driver into the kernel to access a SAN device.



The whole point of a SAN is distributed access to storage after all :).


So where's the point? SANs usually have redundant network devices to 
access the redundant disk array behind them.



It depends how much you trust your SAN.


Sure, but at some level you always have to trust something.


A SAN doesn't protect you if your filesystem decides to explode:


Well, there are inode-based SANs and file-based SANs. If I'm just 
splitting up an inode-based SAN, I could also use internal disks, which 
give me more control. But with file-based SANs I can actually store files 
(through NFS). And a lot of SANs offer the possibility to do snapshots 
or to replicate their data file-based to another SAN, so you get very 
high redundancy and availability. My idea was that Cyrus locks and 
mmaps indices and databases, but not the actual message files. So the 
message files could be stored on the SAN with very high redundancy, 
whereas the metadata that needs to be mmapped remains on the blade with 
internal disks. In case of problems you could at least restore the 
messages from the SAN (and its snapshots, if you accidentally deleted 
something) and rebuild the indices.



I've heard horror stories about all the common Linux 
filesystems and I've personally watched fsck.ext3 (supposedly the safest 
option) unravel a filesystem, with thousands of entries left in 
lost+found.


ext3 with journal? I have never experienced this.


ZFS looks nice.


Well, but you are on your own because this project for linux is pretty 
young.


Regards
Marten



Re: load balancing at fastmail.fm

2007-02-12 Thread David Carter

On Mon, 12 Feb 2007, Marten Lehmann wrote:

what do you think about moving the mailspool to a central SAN storage 
shared via NFS and having several blades to manage the mmapped files 
like seen state, quota etc.?


Why do you need NFS?

The whole point of a SAN is distributed access to storage after all :).

So still only one server is responsible for a certain set of mailboxes, 
but these SAN boxes have nice backup and redundancy features which are 
hard to get with common servers


It depends how much you trust your SAN.

Some of my colleagues who run a SAN have had no end of grief. At which 
point you are dependent on the abilities of the vendor to diagnose and fix 
problems. It was this experience that encouraged me to try application 
level replication with lots of small servers in the first place. At least 
that way I can keep a close eye on what the various copies are up to.


A SAN doesn't protect you if your filesystem decides to explode: I believe 
that Fastmail have direct experience of this. Two independent copies of 
the data allow you to keep running a service for the hours that an fsck 
typically takes to complete with file-per-message stores on large modern 
disks. It also means rather less stress if the fsck fails to complete. 
I've heard horror stories about all the common Linux filesystems and I've 
personally watched fsck.ext3 (supposedly the safest option) unravel a 
filesystem, with thousands of entries left in lost+found. ZFS looks nice.


--
David Carter Email: [EMAIL PROTECTED]
University Computing Service,Phone: (01223) 334502
New Museums Site, Pembroke Street,   Fax:   (01223) 334679
Cambridge UK. CB2 3QH.



Re: load balancing at fastmail.fm

2007-02-12 Thread David Newman

On 2/12/07 5:41 AM, Marten Lehmann wrote:
> Hello,
> 
> what do you think about moving the mailspool to a central SAN storage
> shared via NFS and having several blades to manage the mmapped files
> like seen state, quota etc.? So still only one server is responsible for
> a certain set of mailboxes, but these SAN boxes have nice backup and
> redundancy features which are hard to get with common servers and there
> shouldn't be mmap problems as long as all indices remain on the blade on
> a separate metadata-partition.

Cyrus and NFS don't get along due to locking issues; I believe this is
covered in the docs. I tried this about a year ago and it spewed errors
at an impressive rate.

Instead, you might want to check out the Cyrus IMAP Aggregator design
page; the aggregator lets you distribute mailboxes across multiple servers:

http://asg.web.cmu.edu/cyrus/ag.html

dn




Re: load balancing at fastmail.fm

2007-02-12 Thread Marten Lehmann

Hello,

what do you think about moving the mailspool to a central SAN storage 
shared via NFS and having several blades to manage the mmapped files 
like seen state, quota etc.? So only one server is still responsible for 
a certain set of mailboxes, but these SAN boxes have nice backup and 
redundancy features which are hard to get with common servers, and there 
shouldn't be mmap problems as long as all indices remain on the blade on 
a separate metadata partition.


Regards
Marten



Re: load balancing at fastmail.fm

2007-01-16 Thread Rob Mueller



Thanks, that was interesting reading.
Is there any specific reason you didn't opt for a cluster filesystem?


Internal knowledge mostly. We were very familiar with the performance and 
overall usage implications of local filesystems on the locally attached 
SATA-to-SCSI RAID boxes that we've been using for a while.


The setup, performance, maintenance, etc. of cluster filesystems would 
involve learning and using entirely new technologies that we didn't know 
much about, that are complex, and that we probably wouldn't trust fully. A 
high-usage environment like ours stresses software and finds bugs you don't 
expect. Our backup system was crashing our kernel NFS server regularly in 
multiple versions of the linux kernel. Even when we tried LVM, we managed 
to find subtle bugs that seemed to cause filesystem corruption 
(search for LVM reiserfs corruption; there are old reports in kernel mailing 
lists). These are both supposedly well tested, well used and well understood 
technologies, yet when you push them they still show their corner-case 
problems. I'd hate to see how cluster filesystems behave when pushed, given 
their inherent complexity.


Rob




Re: load balancing at fastmail.fm

2007-01-16 Thread urgrue
Thanks, that was interesting reading.
Is there any specific reason you didn't opt for a cluster filesystem?

Rob Mueller wrote:
> 
> 
>> May I ask how you are doing the actual replication, technically
>> speaking? shared fs, drbd, something over imap?
> 
> We're using the replication engine in cyrus 2.3
> 
> http://blog.fastmail.fm/?p=576
> 
> Rob
> 




Re: load balancing at fastmail.fm

2007-01-16 Thread Rob Mueller




May I ask how you are doing the actual replication, technically
speaking? shared fs, drbd, something over imap?


We're using the replication engine in cyrus 2.3

http://blog.fastmail.fm/?p=576
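
For anyone curious, a minimal sketch of what a Cyrus 2.3 replication pair
can look like (hostnames and credentials invented; check the imapd.conf(5)
and cyrus.conf(5) man pages for your version):

    # master imapd.conf
    sync_log: 1
    sync_host: replica1.internal.example.com
    sync_authname: replman
    sync_password: secret

    # master cyrus.conf, START section: rolling replication
    syncclient   cmd="sync_client -r"

    # replica cyrus.conf, SERVICES section
    syncserver   cmd="sync_server" listen="csync"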

Rob




Re: load balancing at fastmail.fm

2007-01-16 Thread urgrue
May I ask how you are doing the actual replication, technically
speaking? shared fs, drbd, something over imap?


Rob Mueller wrote:
>> as fastmail.fm seems to be a very big setup of cyrus nodes, I would be
>> interested to know how you organized load balancing and managing disk
>> space.
>>
>> Did you set up servers for a maximum of, let's say, 1000 mailboxes and
>> then use a new server? Or do you use a murder installation so you
>> can move mailboxes to another server once a certain server gets too much
>> load? Or do you have a big SAN storage with good mmap support behind
>> an arbitrary number of cyrus nodes?
> 
> We don't use a murder setup. Two main reasons.
> 1) Murder wasn't very mature when we started
> 2) The main advantage murder gives you is a set of proxies
> (imap/pop/lmtp) to connect users to the appropriate backends, which we
> ended up using other software for, and a unified mailbox namespace if
> you want to do mailbox sharing, something we didn't really need either.
> Also, the unified mailbox namespace needs a global mailboxes.db somewhere.
> As it was, the skiplist backend mmaps the entire mailboxes.db file into
> memory, and since we had multiple machines with 100M+ mailboxes.db files,
> I didn't really like the idea of dealing with a 500M+ mailboxes.db file.
> 
> We don't use a shared SAN storage. When we started out we didn't have
> that much money, so purchasing an expensive SAN unit wasn't an option.
> 
> What we have has evolved over time to our current point. Basically we
> now have a hardware set that is quite nicely balanced with regard to
> spool IO vs metadata IO vs CPU, and a storage configuration that gives
> us replication with good failure capability, but without having to waste
> lots of hardware on just having replica machines.
> 
> IMAP/POP frontend - We used to use perdition, but have now changed to
> nginx (http://blog.fastmail.fm/?p=592). As you can read from the linked
> blog post, nginx is great.
> 
> LMTP delivery - We use a custom written perl daemon that forwards lmtp
> deliveries from postfix to the appropriate backend server. It also does
> the spam scanning, virus checking and a bunch of other in house stuff.
> 
> Servers - We use servers with attached SATA-to-SCSI RAID units with
> battery backed up caches. We have a mix of large drives for the email
> spool, and smaller faster drives for meta-data. That's the reason we
> sponsored the metapartition config options
> (http://cyrusimap.web.cmu.edu/imapd/changes.html).
> 
> Replication - We initially started with pairs of machines, half of each
> being a replica and half a master replicating between each other, but
> that meant on a failure, one machine became fully loaded with masters.
> Masters take a much bigger IO hit than replicas. Instead we went with a
> system we call "slots" and "stores". Each machine is divided into a set
> of "slots". "slots" from different machines are then paired as a
> replicated "store" with a master and replica. So say you have 20 slots
> per machine (half master, half replica), and 10 machines, then if one
> machine fails, on average you only have to distribute one more master
> slot to each of the other machines. Much better on IO. Some more details
> in this blog post on our replication trials...
> http://blog.fastmail.fm/?p=576
> 
> Yep, this means we need quite a bit more software to manage the setup,
> but now that it's done, it's quite nice and works well. For maintenance,
> we can safely fail all masters off a server in a few minutes, about
> 10-30 seconds a store. Then we can take the machine down, do whatever we
> want, bring it back up, wait for replication to catch up again, then
> fail any masters we want back on to the server.
> 
> Unfortunately most of this software is in house and quite specific to
> our setup, it's not very "generic" (e.g. it assumes particular disk
> layouts and sizes, machines, database tables, hostnames, etc) to manage
> and track it all, so it's not something we're going to release.
> 
> Rob
> 
> 




Re: load balancing at fastmail.fm

2007-01-12 Thread Rob Mueller
as fastmail.fm seems to be a very big setup of cyrus nodes, I would be 
interested to know how you organized load balancing and managing disk 
space.


Did you set up servers for a maximum of, let's say, 1000 mailboxes and then 
use a new server? Or do you use a murder installation so you can move 
mailboxes to another server once a certain server gets too much load? Or do 
you have a big SAN storage with good mmap support behind an arbitrary number 
of cyrus nodes?


We don't use a murder setup. Two main reasons.
1) Murder wasn't very mature when we started
2) The main advantage murder gives you is a set of proxies (imap/pop/lmtp) 
to connect users to the appropriate backends, which we ended up using other 
software for, and a unified mailbox namespace if you want to do mailbox 
sharing, something we didn't really need either. Also, the unified mailbox 
namespace needs a global mailboxes.db somewhere. As it was, the skiplist 
backend mmaps the entire mailboxes.db file into memory, and since we had 
multiple machines with 100M+ mailboxes.db files, I didn't really like the 
idea of dealing with a 500M+ mailboxes.db file.


We don't use a shared SAN storage. When we started out we didn't have that 
much money, so purchasing an expensive SAN unit wasn't an option.


What we have has evolved over time to our current point. Basically we now 
have a hardware set that is quite nicely balanced with regard to spool IO vs 
metadata IO vs CPU, and a storage configuration that gives us replication 
with good failure capability, but without having to waste lots of hardware 
on just having replica machines.


IMAP/POP frontend - We used to use perdition, but have now changed to nginx 
(http://blog.fastmail.fm/?p=592). As you can read from the linked blog post, 
nginx is great.
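
For reference, a minimal fragment of an nginx IMAP/POP3 proxy configuration
(the auth service address is invented; the routing decision is made by
whatever the auth_http service returns in its Auth-Server/Auth-Port headers):

    mail {
        # HTTP service that looks up which backend store holds the user
        auth_http  127.0.0.1:9000/auth;

        server {
            listen    143;
            protocol  imap;
        }
        server {
            listen    110;
            protocol  pop3;
        }
    }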


LMTP delivery - We use a custom written perl daemon that forwards lmtp 
deliveries from postfix to the appropriate backend server. It also does the 
spam scanning, virus checking and a bunch of other in house stuff.
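
Their daemon is in-house Perl and does much more than routing, but the
forwarding half of the idea can be sketched in a few lines of Python (the
lookup tables, hostname and port below are invented for illustration):

    import smtplib

    # user -> store, and store -> (backend host, LMTP port); in practice
    # these would come from the central database, not hard-coded dicts
    STORE_FOR_USER = {"user@example.com": "store12"}
    BACKEND_FOR_STORE = {"store12": ("imap3.internal.example.com", 2003)}

    def forward_lmtp(sender, recipient, message_bytes):
        # deliver the already-accepted message to the backend holding the user
        host, port = BACKEND_FOR_STORE[STORE_FOR_USER[recipient]]
        with smtplib.LMTP(host, port) as backend:
            backend.sendmail(sender, [recipient], message_bytes)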


Servers - We use servers with attached SATA-to-SCSI RAID units with 
battery-backed caches. We have a mix of large drives for the email spool, and 
smaller faster drives for meta-data. That's the reason we sponsored the 
metapartition config options 
(http://cyrusimap.web.cmu.edu/imapd/changes.html).
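
A rough idea of what that split looks like in imapd.conf (paths invented;
see imapd.conf(5) for the exact option names in your version):

    # spool partition on the big/slower drives
    partition-default: /var/spool/imap

    # matching metadata partition on the small/fast drives
    metapartition-default: /var/imap/meta
    metapartition_files: header index cache expunge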


Replication - We initially started with pairs of machines, half of each being 
a replica and half a master replicating between each other, but that meant 
on a failure, one machine became fully loaded with masters. Masters take a 
much bigger IO hit than replicas. Instead we went with a system we call 
"slots" and "stores". Each machine is divided into a set of "slots". "Slots" 
from different machines are then paired as a replicated "store" with a 
master and replica. So say you have 20 slots per machine (half master, half 
replica) and 10 machines; then if one machine fails, on average each of the 
other machines only has to pick up about one extra master slot. Much 
better on IO. Some more details in this blog post on our replication 
trials... http://blog.fastmail.fm/?p=576
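
A rough Python sketch of the pairing idea (machine names and counts invented;
this is not Fastmail's actual tooling), showing that when one machine fails
its master slots are promoted across many machines rather than onto a single
partner:

    from itertools import cycle

    machines = ["imap%d" % i for i in range(1, 11)]   # 10 backend machines
    slots_per_machine = 20                            # half masters, half replicas

    # each store pairs a master slot on one machine with a replica slot elsewhere
    stores = []
    replica_targets = cycle(machines)
    for m in machines:
        for _ in range(slots_per_machine // 2):
            r = next(replica_targets)
            while r == m:
                r = next(replica_targets)
            stores.append((m, r))

    failed = "imap1"
    promoted = [r for m, r in stores if m == failed]
    print({h: promoted.count(h) for h in sorted(set(promoted))})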


Yep, this means we need quite a bit more software to manage the setup, but 
now that it's done, it's quite nice and works well. For maintenance, we can 
safely fail all masters off a server in a few minutes, about 10-30 seconds a 
store. Then we can take the machine down, do whatever we want, bring it back 
up, wait for replication to catch up again, then fail any masters we want 
back on to the server.


Unfortunately most of this software is in-house and quite specific to our 
setup; it's not very "generic" (e.g. it assumes particular disk layouts and 
sizes, machines, database tables, hostnames, etc. to manage and track it 
all), so it's not something we're going to release.


Rob

