Re: choosing a file system
Please note: this is a long thread I started a while ago about file systems. If you want to ask a question about something you read in the thread, but which is not strictly about file systems, please be kind enough to start a new thread rather than replying here.

Dom

2009/1/19 Jorey Bump :
> Andrew McNamara wrote, at 01/19/2009 01:29 AM:
>>> Yeah, except Postfix encodes the inode of the queue files in its queue
>>> IDs, so it gets very confused if you do this. Same with restoring
>>> queues from backups.
>>
>> You should be able to get away with this if, when moving the queue to
>> another machine, you move the queued mail from the hold, incoming, active
>> and deferred directories into the maildrop directory on the target
>> instance.
>>
>> This (somewhat old, but still correct, I think) message from Wietse
>> might shed more light on it:
>>
>>   Date: Thu, 12 Sep 2002 20:33:08 -0400 (EDT)
>>   From: wie...@porcupine.org (Wietse Venema)
>>   Subject: Re: postfix migration
>>
>>   > I want to migrate postfix to another machine. What are the steps
>>   > so that I won't lose mail in the process?
>>
>>   This is the safe procedure.
>>
>>   1) On the old machine, stop Postfix.
>>
>>   2) On the old machine, run as super-user:
>>
>>          postsuper -r ALL
>>
>>      This moves all queue files to the maildrop queue.
>>
>>   3) On the old machine, back up /var/spool/postfix/maildrop
>>
>>   4) On the new machine, make sure Postfix works.
>>
>>   5) On the new machine, stop Postfix.
>>
>>   6) On the new machine, restore /var/spool/postfix/maildrop
>>
>>   7) On the new machine, start Postfix.
>>
>>   There are ways to skip the "postsuper -r ALL" step, and copy the
>>   incoming + active + deferred + bounce + defer + flush + hold
>>   directories to the new machine, but that would be safe only with
>>   an empty queue on the new machine.
> This has become somewhat off-topic for this list, but you might be able
> to simply sync the entire Postfix queue to the backup machine, and run
> postsuper -s before starting Postfix on the backup. From the postsuper
> man page:
>
>   -s  Structure check and structure repair. This should be done
>       once before Postfix startup.
>
>       Rename files whose name does not match the message file inode
>       number. This operation is necessary after restoring a mail
>       queue from a different machine, or from backup media.
>
> The important thing to keep in mind is that Postfix embeds the inode
> number in the filename simply to keep the name unique while the message
> resides on the filesystem. Obviously, this approach breaks when the
> files are copied to another filesystem. Renaming them appropriately on
> the new destination ensures no files will be overwritten as the queue is
> processed or new messages enter the queue. Of course, the scheme I
> proposed earlier requires that once the backup Postfix is brought up, it
> must be impossible for the primary to begin resyncing files to the same
> location on the backup if it becomes active again (or refuses to die a
> graceful death). Certainly tricky, but it sounds like the use case is to
> preserve the queue in case of a total failure, just to make sure the
> mail goes out (even if it means it goes out twice).

--
Dominique LALOT
Ingénieur Systèmes et Réseaux
http://annuaire.univmed.fr/showuser?uid=lalot

Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
Re: choosing a file system
Andrew McNamara wrote, at 01/19/2009 01:29 AM:
>> Yeah, except Postfix encodes the inode of the queue files in its queue
>> IDs, so it gets very confused if you do this. Same with restoring
>> queues from backups.
>
> You should be able to get away with this if, when moving the queue to
> another machine, you move the queued mail from the hold, incoming, active
> and deferred directories into the maildrop directory on the target
> instance.
>
> This (somewhat old, but still correct, I think) message from Wietse
> might shed more light on it:
>
>   Date: Thu, 12 Sep 2002 20:33:08 -0400 (EDT)
>   From: wie...@porcupine.org (Wietse Venema)
>   Subject: Re: postfix migration
>
>   > I want to migrate postfix to another machine. What are the steps
>   > so that I won't lose mail in the process?
>
>   This is the safe procedure.
>
>   1) On the old machine, stop Postfix.
>
>   2) On the old machine, run as super-user:
>
>          postsuper -r ALL
>
>      This moves all queue files to the maildrop queue.
>
>   3) On the old machine, back up /var/spool/postfix/maildrop
>
>   4) On the new machine, make sure Postfix works.
>
>   5) On the new machine, stop Postfix.
>
>   6) On the new machine, restore /var/spool/postfix/maildrop
>
>   7) On the new machine, start Postfix.
>
>   There are ways to skip the "postsuper -r ALL" step, and copy the
>   incoming + active + deferred + bounce + defer + flush + hold
>   directories to the new machine, but that would be safe only with
>   an empty queue on the new machine.

This has become somewhat off-topic for this list, but you might be able to simply sync the entire Postfix queue to the backup machine, and run postsuper -s before starting Postfix on the backup. From the postsuper man page:

  -s  Structure check and structure repair. This should be done
      once before Postfix startup.

      Rename files whose name does not match the message file inode
      number. This operation is necessary after restoring a mail
      queue from a different machine, or from backup media.
The important thing to keep in mind is that Postfix embeds the inode number in the filename simply to keep the name unique while the message resides on the filesystem. Obviously, this approach breaks when the files are copied to another filesystem. Renaming them appropriately on the new destination ensures no files will be overwritten as the queue is processed or new messages enter the queue. Of course, the scheme I proposed earlier requires that once the backup Postfix is brought up, it must be impossible for the primary to begin resyncing files to the same location on the backup if it becomes active again (or refuses to die a graceful death). Certainly tricky, but it sounds like the use case is to preserve the queue in case of a total failure, just to make sure the mail goes out (even if it means it goes out twice).
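The renaming that `postsuper -s` performs can be illustrated with a small sketch. This is a toy model only: real Postfix queue IDs also encode arrival time, queue files live in hashed subdirectories, and postsuper handles edge cases this ignores. The hex-inode naming convention here is an assumption for illustration, not Postfix's actual code.

```python
import os

def repair_queue_names(queue_dir):
    """Rename files whose name doesn't match their inode number -
    roughly what `postsuper -s` does after a queue is copied to a
    different filesystem (toy sketch, not Postfix's implementation)."""
    renamed = []
    for name in os.listdir(queue_dir):
        path = os.path.join(queue_dir, name)
        inode = os.stat(path).st_ino      # inode on *this* filesystem
        want = "%X" % inode               # hex inode, queue-ID style
        if name != want:
            os.rename(path, os.path.join(queue_dir, want))
            renamed.append((name, want))
    return renamed
```

Because inode numbers are unique within a filesystem, the repaired names cannot collide with one another, which is exactly why the old names stop being trustworthy after the files are copied somewhere else.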
Re: choosing a file system
> Yeah, except Postfix encodes the inode of the queue files in its queue
> IDs, so it gets very confused if you do this. Same with restoring
> queues from backups.

You should be able to get away with this if, when moving the queue to another machine, you move the queued mail from the hold, incoming, active and deferred directories into the maildrop directory on the target instance.

This (somewhat old, but still correct, I think) message from Wietse might shed more light on it:

  Date: Thu, 12 Sep 2002 20:33:08 -0400 (EDT)
  From: wie...@porcupine.org (Wietse Venema)
  Subject: Re: postfix migration

  > I want to migrate postfix to another machine. What are the steps
  > so that I won't lose mail in the process?

  This is the safe procedure.

  1) On the old machine, stop Postfix.

  2) On the old machine, run as super-user:

         postsuper -r ALL

     This moves all queue files to the maildrop queue.

  3) On the old machine, back up /var/spool/postfix/maildrop

  4) On the new machine, make sure Postfix works.

  5) On the new machine, stop Postfix.

  6) On the new machine, restore /var/spool/postfix/maildrop

  7) On the new machine, start Postfix.

  There are ways to skip the "postsuper -r ALL" step, and copy the
  incoming + active + deferred + bounce + defer + flush + hold
  directories to the new machine, but that would be safe only with
  an empty queue on the new machine.

--
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/
Re: choosing a file system
On Sat, Jan 10, 2009 at 02:35:53PM -0500, Jorey Bump wrote:
> Bron Gondwana wrote, at 01/10/2009 04:56 AM:
>
>> So - no filesystem is sacred. Except for bloody out1 with its 1000+
>> queued postfix emails and no replication. It's been annoying me for
>> over a year now, because EVERYTHING ELSE is replicated. We've got
>> some new hardware in place, so I'm investigating drbd as an option
>> here. Not convinced. It still puts us at the mercy of a filesystem
>> crash.
>>
>> I'd prefer a higher level replication solution, but I don't know
>> any product that replicates outbound mail queues nicely between
>> multiple machines in a way that guarantees that every mail will be
>> delivered at least once, and if there's a machine failure the only
>> possible failure mode is that the second machine isn't aware that
>> the message hasn't been delivered yet, so delivers it again. That's
>> what I want.
>
> You could regularly rsync or rdiff-backup your Postfix queue directory
> to another machine where Postfix lies dormant, but with a similar
> configuration. In the event of a machine failure, you can start up
> Postfix on the backup, which may even be able to function as a complete
> replacement (submission, MX, delivery over LMTP). There is still
> opportunity for minor race conditions, and automating failover needs to
> be worked out, but it's better than nothing.

Yeah, except Postfix encodes the inode of the queue files in its queue IDs, so it gets very confused if you do this. Same with restoring queues from backups.

My searches of the postfix mailing list archives have shown similar questions being asked a couple of times, but nobody has come up with a really good solution so far. We do keep in-house patches against postfix - I think we apply 6 at the moment. So I'm happy to make small changes to support this :)

> Jorey ( big fan of Bron's occasional parenthetical sig comments!
> )

Bron ( I try ;) )
Re: choosing a file system
Bron Gondwana wrote, at 01/10/2009 04:56 AM:
> So - no filesystem is sacred. Except for bloody out1 with its 1000+
> queued postfix emails and no replication. It's been annoying me for
> over a year now, because EVERYTHING ELSE is replicated. We've got
> some new hardware in place, so I'm investigating drbd as an option
> here. Not convinced. It still puts us at the mercy of a filesystem
> crash.
>
> I'd prefer a higher level replication solution, but I don't know
> any product that replicates outbound mail queues nicely between
> multiple machines in a way that guarantees that every mail will be
> delivered at least once, and if there's a machine failure the only
> possible failure mode is that the second machine isn't aware that
> the message hasn't been delivered yet, so delivers it again. That's
> what I want.

You could regularly rsync or rdiff-backup your Postfix queue directory to another machine where Postfix lies dormant, but with a similar configuration. In the event of a machine failure, you can start up Postfix on the backup, which may even be able to function as a complete replacement (submission, MX, delivery over LMTP). There is still opportunity for minor race conditions, and automating failover needs to be worked out, but it's better than nothing.

Jorey ( big fan of Bron's occasional parenthetical sig comments! )
Re: choosing a file system
On Fri, Jan 09, 2009 at 05:20:02PM +0200, Janne Peltonen wrote:
> I've even been playing a little with userland ZFS, but it's far from
> usable in production (was a nice little toy, though, and a /lot/ faster
> than could be believed).

Yeah - zfs-on-fuse is not something I'd want to trust production data to.

> I think other points concerning why not to change to another OS
> completely for the benefits available in ZFS were already covered by
> Bron, so I'm not going to waste bandwidth any more with this matter. :)

I did get a bit worked up about it ;)

Thankfully, I don't get confronted with fsck prompts very often, because my response to "fsck required" is pretty simple these days :)

a) If it's a system partition - reinstall. Takes 10 minutes from start to finish (ok, 15 on some of the bigger servers, POST being the extra) and doesn't blat data partitions. Our machines are installed using FAI to bring the base operating system up and install the "fastmail-server" Debian package, which pulls in all the packages we use as dependencies. It then checks out the latest subversion repository and does "make -C conf install", which sets up everything else. This is all per-role and per-machine, configured in a config file which contains lots of little micro languages optimised for being easy to read in a 'diff -u', since that's what our subversion commit hook emails us.

b) If it's a cyrus partition, nuke the data and meta partitions and re-sync all users from the replicated pair.

c) If it's a VFS partition, nuke it and let the automated balancing script fill it back up in its own time (this is the nicest one, all key-value based with sha1. I know I'll probably have to migrate the whole thing to sha3 at some stage, but I'm happy to wait until that's finalised).

d) Oh yeah, mysql. That's replicated between two machines as well, and dumped with ibbackup every night. If we lose one of these we restore from the previous night's backup and let replication catch up.
It's never happened (yet) on the primary pair - I've had to rebuild a few slaves though, so the process is well tested.

So - no filesystem is sacred. Except for bloody out1 with its 1000+ queued postfix emails and no replication. It's been annoying me for over a year now, because EVERYTHING ELSE is replicated. We've got some new hardware in place, so I'm investigating drbd as an option here. Not convinced. It still puts us at the mercy of a filesystem crash.

I'd prefer a higher level replication solution, but I don't know any product that replicates outbound mail queues nicely between multiple machines in a way that guarantees that every mail will be delivered at least once, and if there's a machine failure the only possible failure mode is that the second machine isn't aware that the message hasn't been delivered yet, so delivers it again. That's what I want.

I'd also like a replication mode for our IMAP server that guaranteed the message was actually committed to disk on both machines before returning OK to the lmtpd or imapd. That's a whole lot of work though.

(We actually lost an entire external drive unit the other day, and had to move replicas to new machines. ZFS wouldn't have helped here; the failure was hardware. We would still have had perfectly good filesystems that were offline. Can't serve up emails while offline.)

Bron.
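The key-value, sha1-based store Bron describes lends itself to a simple self-checking design: name each blob after the hash of its contents, and a userland scrub is just re-hashing. A minimal sketch follows; all names and the flat-directory layout are hypothetical, not FastMail's actual system.

```python
import hashlib
import os

class BlobStore:
    """Minimal content-addressed file store: each blob lives under the
    hex SHA-1 of its contents, so corruption is detectable by
    re-hashing (illustrative sketch only)."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        with open(os.path.join(self.root, key), "wb") as f:
            f.write(data)
        return key

    def get(self, key: str) -> bytes:
        with open(os.path.join(self.root, key), "rb") as f:
            data = f.read()
        if hashlib.sha1(data).hexdigest() != key:
            raise IOError("corrupt blob %s - restore from replica" % key)
        return data

    def scrub(self):
        """Return the keys whose bytes no longer hash to their name."""
        bad = []
        for key in os.listdir(self.root):
            with open(os.path.join(self.root, key), "rb") as f:
                if hashlib.sha1(f.read()).hexdigest() != key:
                    bad.append(key)
        return bad
```

A nice property of this scheme is the one Bron exploits: any blob flagged by the scrub can be refetched from a replica or backup by its key alone, since the key *is* the checksum.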
Re: choosing a file system
On Fri, Jan 09, 2009 at 08:41:38AM -0600, Scott Lambert wrote:
> On Fri, Jan 09, 2009 at 10:54:10AM +0200, Janne Peltonen wrote:
>> So have I. But in the current Cyrus installation, I'm stuck with Linux,
>> so I concentrated on what's available on Linux. Moreover, I don't want
>> to use non-free operating systems - if anything, I've become more
>> ideological with age... I'd happily use /any/ free unix variant that ran
>> ZFS, but.
>
> Well, fire up your test environment and start playing with FreeBSD. ZFS
> and DTrace and "free." The better ZFS is in 8-CURRENT. Apparently,
> you have to tweak things a bit on 7-STABLE still. But, by the time you
> (third person non-specific) get comfortable with "not Linux", 8 may be
> -STABLE.

OK, I was oversimplifying things. ZFS isn't actually non-free as such, it's just GPL-incompatible. And the "but" above did include quite a lot of things, like for instance us being committed to Red Hat / CentOS. If I haven't even been able to change the Linux distribution here, just how hard do you think it would be to change the unix variant... For example, our SAN vendor says their product supports Red Hat, and I'm already treading on thin ice by using CentOS.

I've even been playing a little with userland ZFS, but it's far from usable in production (it was a nice little toy, though, and a /lot/ faster than could be believed).

I think other points concerning why not to change to another OS completely for the benefits available in ZFS were already covered by Bron, so I'm not going to waste bandwidth any more on this matter. :)

--Janne
--
Janne Peltonen
PGP Key ID: 0x9CFAC88B
Please consider membership of the Hospitality Club (http://www.hospitalityclub.org)
Re: choosing a file system
On Fri, Jan 09, 2009 at 10:54:10AM +0200, Janne Peltonen wrote:
> So have I. But in the current Cyrus installation, I'm stuck with Linux,
> so I concentrated on what's available on Linux. Moreover, I don't want
> to use non-free operating systems - if anything, I've become more
> ideological with age... I'd happily use /any/ free unix variant that ran
> ZFS, but.

Well, fire up your test environment and start playing with FreeBSD. ZFS and DTrace and "free." The better ZFS is in 8-CURRENT. Apparently, you have to tweak things a bit on 7-STABLE still. But, by the time you (third person non-specific) get comfortable with "not Linux", 8 may be -STABLE.

--
Scott Lambert  KC5MLE  Unix SysAdmin
lamb...@lambertfam.org
Re: choosing a file system
Bron Gondwana wrote:
> On Thu, Jan 08, 2009 at 10:13:25PM -0800, Robert Banz wrote:
>> (notice, didn't mention AIX. I've got my standards ;)
>
> Hey - I have a friend who _likes_ AIX. There are odd people in the
> world.

We at Uppsala University have been running cyrus on AIX for a little more than 10 years. Back then, there was no acceptable alternative to the AIX LVM, and AIX still is, as I see it, very competent when it comes to handling disk I/O.

About three years ago I thought the time was ready for running cyrus on a shared file system. We had a large installation of IBM SANFS on another system that was performing well at the time, so we purchased SANFS and four RS/6000 servers for cyrus, and needed to take it into production rather abruptly after a filesystem crash on the old cyrus server. However, the time was not quite ready for running cyrus on SANFS. After about two years of SANFS problems (although at no time did SANFS get corrupted - it was very stable, it just could not handle the load; sometimes we needed to restart the filesystem several times a week) we decided to follow the examples published on this list by the FastMail guys. I did not dare to go with GPFS when SANFS was discontinued.

We've since split our 20 million cyrus files onto eight IBM blade servers, each running an AIX virtualization server that handles the SAN connections for the 6 virtual Red Hat servers running on it. We run one cyrus instance on each Red Hat server: 3 primary servers and 3 replicas on each blade. We thus have 24 primary servers and 24 replicas, with about 1 million cyrus files each. We did some tests on which file system to choose, but there was not much difference, so we decided on ext3. We also have 4 additional blades running Debian, 2 for LVS and 2 for Nginx, and about 10TB of SAN disk dedicated to cyrus.

The system has been running very nicely for six months now. So I guess this is a success story inspired by FastMail.
But I still would not choose anything other than AIX for our TSM servers.
Re: choosing a file system
Nic Bernstein wrote:
> PS - This has been a very interesting thread to read. Some of us just
> don't have the exposure to large systems like the participants in this
> thread have, and this can be very educational.

It's actually been helpful to us as well. All of our mail backends are currently Solaris with SAN storage using vxfs. We're considering a move to Linux, but which filesystem to choose is still a major unanswered question for us. After reading this entire thread, it makes me realize that I've been taking vxfs for granted. It's been rock solid and the performance is fine.

Thanks,

Dave
--
Dave McMurtrie, SPE
Email Systems Team Leader
Carnegie Mellon University, Computing Services
Re: choosing a file system
On 01/09/2009 12:59 AM, Bron Gondwana wrote:
> On Thu, Jan 08, 2009 at 10:13:25PM -0800, Robert Banz wrote:
>>> There's a significant upfront cost to learning a whole new system
>>> for one killer feature, especially if it comes along with significant
>>> regressions in lots of other features (like a non-sucky userland
>>> out of the box).
>>
>> The "non-sucky" userland comment is simply a matter of preference, and
>> bait for a religious war, which I'm not going to bite.
>
> Well, yeah. Point. Though most Solaris admins I know tend to pull in
> gnu or bsd utilities pretty quickly. I'll take that one back, it was
> baity.

So at the risk of entering into a flame war, I must say I am surprised that no one has mentioned Nexenta/OS: http://www.nexenta.org/os

They have bolted the Ubuntu/Debian userland onto OpenSolaris to give the Linux lovers out there a linuxy experience with access to all of that shiny new Solaris bling, such as zfs and dtrace. You may want to give it a look-see.

>> Patching is always an issue on any OS, and you do have the choice of
>> running X applications remotely (booting an entire graphic
>> environment!?), and many other tools available such as pca to help you
>> patch on Solaris, which provide many of the features that you're used
>> to.
>
> And I'm seeing there are quite a few third party tools that people have
> written to ease the pain of patch management on Solaris (I believe it's
> actually one of the nicer unixes to manage patches on, but when you're
> used to apt-get, there's a whole world of WTFery in manually downloading
> and applying patch sets - especially when you get permission denied on a
> bunch of things that the tool has just suggested as being missing)

Oh yeah, apt-get included.

Cheers,
   -nic

PS - This has been a very interesting thread to read. Some of us just don't have the exposure to large systems like the participants in this thread have, and this can be very educational.

--
Nic Bernstein                      n...@onlight.com
Onlight llc.                       www.onlight.com
2266 North Prospect Avenue #610    v.
414.272.4477
Milwaukee, Wisconsin 53202-6306    f. 414.290.0335
Re: choosing a file system
On Thu, Jan 08, 2009 at 10:13:25PM -0800, Robert Banz wrote:
>> There's a significant upfront cost to learning a whole new system
>> for one killer feature, especially if it comes along with significant
>> regressions in lots of other features (like a non-sucky userland
>> out of the box).
>
> The "non-sucky" userland comment is simply a matter of preference, and
> bait for a religious war, which I'm not going to bite.

Well, yeah. Point. Though most Solaris admins I know tend to pull in gnu or bsd utilities pretty quickly. I'll take that one back, it was baity.

> What I will say is that switching between Solaris, Linux, IRIX, Ultrix,
> FreeBSD, HP-UX, OSF/1 -- any *nix variant, should not be considered a
> stumbling block. Your comment shows the narrow-mindedness of the current
> Linux culture; many of us were brought up supporting and using a
> collection of these platforms at any one time.

There's a switching cost, particularly if you don't have any experience with a new system. You have to consider that cost when making an upgrade choice. I agree that ZFS is better than anything currently available on Linux - but the question is "does that outweigh the disadvantages of learning and supporting a new platform?".

There are basically two worthwhile things on Solaris: ZFS and DTrace. Other things - fork behaviour caused us pain recently; it's just not as cheap as on Linux, and forking from a big process caused lots of swapping because even though it was execing pretty quickly, it had to commit the memory first. Oops. There are downsides to Linux's overcommit, but having to add complexity to our backup manager because forking for every backup was too expensive was annoying.

(I do take offence to being considered narrow-minded for not blindly following the latest fashion and wanting to switch everything over to Solaris because it has the latest bling - I've considered it, but the numbers just don't add up. We have something that works, is reliable and is fast.
Our redundancy is just at a different level.)

Hey - back on topic for cyrus. We store sha1s of message files in the index file now. We don't have checksums on index files (yet - I have crc32 patches half finished somewhere), but we're at a point where userland scrubs are possible. Along with replication, you can restore any damaged file from the replica. Actually, with our backup system, you can even pull the original file from the backup, knowing its sha1, because it gets recalculated again and checked during the backup phase.

> (notice, didn't mention AIX. I've got my standards ;)

Hey - I have a friend who _likes_ AIX. There are odd people in the world.

> Patching is always an issue on any OS, and you do have the choice of
> running X applications remotely (booting an entire graphic
> environment!?), and many other tools available such as pca to help you
> patch on Solaris, which provide many of the features that you're used
> to.

I take it you haven't run X applications remotely from the other side of the world before? I'd hardly call it "running". Crawling, maybe. My current approach is to run up a vncserver on a box in the same colo and run X applications remotely to there. It's significantly less painful, and also gives me a place to run an iceweasel to talk to the web interfaces of things that won't talk to me any other way. Uploading firmware via the web from locally is similarly less sucky than pushing it out from Australia.
And I'm seeing there are quite a few third-party tools that people have written to ease the pain of patch management on Solaris (I believe it's actually one of the nicer unixes to manage patches on, but when you're used to apt-get, there's a whole world of WTFery in manually downloading and applying patch sets - especially when you get permission denied on a bunch of things that the tool has just suggested as being missing).

In short - I'm not sold on the value to FastMail of at least two of us (bus factor) learning to maintain Solaris to the level that we'd want for running something so core to our operations as the IMAP servers.

Bron ( happy to either stop the flamewar or take it off list at this point. I don't think we're contributing anything meaningful any more )
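The per-record crc32 idea Bron mentions for index files boils down to storing a checksum alongside each record and verifying it on read, so a userland scrub can flag damage before the application consumes it. A hypothetical sketch of the mechanism (this is not the Cyrus cyrus.index format):

```python
import struct
import zlib

def pack_record(data: bytes) -> bytes:
    """Append a big-endian CRC32 trailer so a reader (or a userland
    scrub) can detect a torn or corrupted record later (hypothetical
    sketch, not Cyrus's actual index layout)."""
    return data + struct.pack(">I", zlib.crc32(data) & 0xFFFFFFFF)

def unpack_record(blob: bytes) -> bytes:
    """Verify and strip the CRC32 trailer; raise on mismatch so the
    caller can fall back to the replica's copy of the record."""
    data, (crc,) = blob[:-4], struct.unpack(">I", blob[-4:])
    if zlib.crc32(data) & 0xFFFFFFFF != crc:
        raise ValueError("record failed CRC check - restore from replica")
    return data
```

The design choice here is the same one behind the sha1-per-message scheme: the checksum doesn't prevent corruption, it just makes corruption detectable, and replication provides the known-good copy to repair from.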
Re: choosing a file system
On Thu, Jan 08, 2009 at 08:01:04AM -0800, Vincent Fox wrote:
> (Summary of filesystem discussion)
>
> You left out ZFS.
>
> Sometimes Linux admins remind me of Windows admins.

I didn't. --clip-- Btrfs is in such early development that I don't know yet what to say about it, but the problem of ZFS being incompatible with the GPL might be mitigated by it. --clip--

> I have adminned a half-dozen UNIX variants professionally but
> keep running into admins who only do ONE and for whom every
> problem is solved with "how can I do this with one OS only?"

So have I. But in the current Cyrus installation, I'm stuck with Linux, so I concentrated on what's available on Linux. Moreover, I don't want to use non-free operating systems - if anything, I've become more ideological with age... I'd happily use /any/ free unix variant that ran ZFS, but.

> Dark Ages now for terabytes of mail volume I'd throw a professional fit.
> Even the idea that I need to tune my filesystem for inodes and to avoid it
> wanting to fsck on reboot #20 or whatever seems like caveman discussion.
> Any of them offer cheap and nearly-instant snapshots & online scrubbing?
> No? Then why use it for large numbers of files of an important nature?

Because there isn't a free FS that does those things (yet). And there are free systems that do enough...

> I love Linux, I surely do. Virtually everything of an appliance nature here
> will probably shift over to it in the long run I think, and for good reasons.
> But filesystem is one area where the bazaar model has fallen into a very
> deep rut and can't muster the energy to climb out.

Really? Btrfs /does/ appear promising to me. I might be wrong, though.

> So far ZFS ticking along with no problems and low iostat numbers
> with everything in one big pool. I have separate fs for data, imap, mail
> but haven't seen any need to carve the mail spool into chunks at all.
> There were initial problems noted here in the mailing lists way back
> in Solaris 10u3, but that was solved with the fsync patch and since then
> it's been like butter. Mail-store systems - nobody ever needs to look
> at them because it "just works".

Well, that's nice. It's a shame they made it GPL-incompatible.

BR
--
Janne Peltonen
PGP Key ID: 0x9CFAC88B
Please consider membership of the Hospitality Club (http://www.hospitalityclub.org)
Re: choosing a file system
> There's a significant upfront cost to learning a whole new system
> for one killer feature, especially if it comes along with significant
> regressions in lots of other features (like a non-sucky userland
> out of the box).

The "non-sucky" userland comment is simply a matter of preference, and bait for a religious war, which I'm not going to bite.

What I will say is that switching between Solaris, Linux, IRIX, Ultrix, FreeBSD, HP-UX, OSF/1 -- any *nix variant -- should not be considered a stumbling block. Your comment shows the narrow-mindedness of the current Linux culture; many of us were brought up supporting and using a collection of these platforms at any one time.

(notice, didn't mention AIX. I've got my standards ;)

Patching is always an issue on any OS, and you do have the choice of running X applications remotely (booting an entire graphic environment!?), and many other tools available such as pca to help you patch on Solaris, which provide many of the features that you're used to.

-rob
Re: choosing a file system
Bron Gondwana wrote:
> BUT - if someone is asking "what's the best filesystem to use
> on Linux" and gets told ZFS, and by the way you should switch
> operating systems and ditch all the rest of your custom setup/
> experience, then you're as bad as a Linux weenie saying "just
> use Cyrus on Linux" in a "how should I tune NTFS on my
> Exchange server" discussion.

Point taken. We can go around that circle all day long, but I *am* saying there are other UNIX OSes out there than just Linux, and quite frankly it blows my mind sometimes how people fall into ruts. Numerous times in my career I have had to switch some application from AIX to HP-UX, or IRIX to Linux. The differing flavors of UNIX are not so different to me as they perhaps are to others. Particularly when it's a single app on a dedicated server, I usually find it odd how people get stuck on something and won't change. Or they take the safe institutional path and never fight it. Collect your paycheck and go home at 4.

I sleep very well at night knowing the Cyrus mail-stores are on ZFS. Once in a while I run a scrub just for fun. No futzing around.

This was no cakewalk. I was pushing a boulder up a hill, particularly when we ran head-first into the ZFS fsync bottleneck at the start of Fall quarter. Managers said we needed a crash program to convert everything to Linux or Exchange or whatever. I dug into the bugs instead, and Sun got us an interim patch to fix it, and we moved on. Now, as I said, it's like butter and one of those setups nobody thinks about.

There are always excuses why you will stick with "established" practice even if it's antiquated and full of aches and pains, and I fought that and won. It seems to me there is no bigger deal than having a RELIABLE filesystem for the mail-store, and this is where every other filesystem I have worked with since 1989 has been a frigging nightmare. Everything from bad controllers to double-disk failures in RAID-5 sets keeps me wondering: am I paranoid ENOUGH?
I'll be all over btrfs when it hits beta. I'm not married to ZFS. But I'm quite unashamedly looking down my nose at any filesystem that leaves me possibly staring at an fsck prompt. I've done enough of that in my career already; it's time to move beyond 30+ years' worth of cruft atop antique designs that seemed tolerable when a huge disk was 20 gigs.
Re: choosing a file system
On Thu, Jan 08, 2009 at 08:01:04AM -0800, Vincent Fox wrote:
> (Summary of filesystem discussion)
>
> You left out ZFS.

Just to come back to this - I should say that I'm a big fan of ZFS and what Sun have done with filesystem design. Despite the issues we've had with that machine, I know it's great for people who are using it...

BUT - if someone is asking "what's the best filesystem to use on Linux" and gets told ZFS, and by the way you should switch operating systems and ditch all the rest of your custom setup/experience, then you're as bad as a Linux weenie saying "just use Cyrus on Linux" in a "how should I tune NTFS on my Exchange server" discussion.

From the original post:

Message-ID: <1617f8010812300849k1c7c878bl2f17e8d4287c1...@mail.gmail.com>

"zfs (but we should switch to solaris or freebsd and throw away our costly SAN)"

I'd love to do some load testing on a ZFS box with our setup at some point. There would be some advantages, though I suspect having one big mailboxes.db vs the lots of little ones we have would be a point of contention - and fine-grained skiplist locking is still very much a wishlist item. I'd want to take some time testing it before unleashing it on the world!

Bron.
Re: choosing a file system
On Thu, Jan 08, 2009 at 08:57:18PM -0800, Robert Banz wrote:
>
> On Jan 8, 2009, at 4:46 PM, Bron Gondwana wrote:
>
>> On Thu, Jan 08, 2009 at 08:01:04AM -0800, Vincent Fox wrote:
>>> (Summary of filesystem discussion)
>>>
>>> You left out ZFS.
>>>
>>> Sometimes Linux admins remind me of Windows admins.
>>>
>>> I have adminned a half-dozen UNIX variants professionally but
>>> keep running into admins who only do ONE and for whom every
>>> problem is solved with "how can I do this with one OS only?"

There's a significant upfront cost to learning a whole new system for one killer feature, especially if it comes along with significant regressions in lots of other features (like a non-sucky userland out of the box). Applying patches on Solaris seems to be a choice between incredibly low-level command line tools or booting up a whole graphical environment on a machine in a datacentre on the other side of the world.

>> We run one zfs machine.  I've seen it report issues on a scrub
>> only to not have them on the second scrub.  While it looks shiny
>> and great, it's also relatively new.
>
> You'd be surprised how unreliable disks and the transport between the
> disk and host can be. This isn't a ZFS problem, but a statistical
> certainty as we're pushing a large number of bits down the wire.
>
> You can, with a large enough corpus, have on-disk data corruption, or
> data corruption that appeared in flight to the disk, or in the
> controller, that your standard disk CRCs can't correct for. As we keep
> pushing the limits, data integrity checking at the filesystem layer --
> before the information is presented for your application to consume --
> has basically become a requirement.
>
> BTW, the reason that the first scrub saw the error, and the second scrub
> didn't, is that the first scrub fixed it -- that's the job of a ZFS
> scrub.

# zpool status -v rpool
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub in progress for 0h0m, 0.69% done, 1h40m to go
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c5t0d0s0  ONLINE       0     0     0
            c5t4d0s0  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        //dev/dsk

--- if that's an "error that the scrub fixed" then it's a really badly written error message. The same error didn't exist the next scrub, which was what confused me.

Bron.
Re: choosing a file system
On Jan 8, 2009, at 4:46 PM, Bron Gondwana wrote:

> On Thu, Jan 08, 2009 at 08:01:04AM -0800, Vincent Fox wrote:
>> (Summary of filesystem discussion)
>>
>> You left out ZFS.
>>
>> Sometimes Linux admins remind me of Windows admins.
>>
>> I have adminned a half-dozen UNIX variants professionally but
>> keep running into admins who only do ONE and for whom every
>> problem is solved with "how can I do this with one OS only?"
>
> We run one zfs machine.  I've seen it report issues on a scrub
> only to not have them on the second scrub.  While it looks shiny
> and great, it's also relatively new.

You'd be surprised how unreliable disks and the transport between the disk and host can be. This isn't a ZFS problem, but a statistical certainty as we're pushing a large number of bits down the wire.

You can, with a large enough corpus, have on-disk data corruption, or data corruption that appeared in flight to the disk, or in the controller, that your standard disk CRCs can't correct for. As we keep pushing the limits, data integrity checking at the filesystem layer -- before the information is presented for your application to consume -- has basically become a requirement.

BTW, the reason that the first scrub saw the error, and the second scrub didn't, is that the first scrub fixed it -- that's the job of a ZFS scrub.

-rob
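The end-to-end integrity idea is easy to demonstrate outside ZFS: keep a checksum for every block when you write, and verify on every read, so corruption anywhere along the path is caught before the application sees the data. A minimal sketch in Python (illustrative only -- ZFS does this per-record, inside the filesystem, with self-healing on redundant pools):

```python
import hashlib

BLOCK = 4096  # per-block checksum granularity (illustrative)

def write_blocks(path, blocks):
    """Write data blocks and return a sidecar list of per-block SHA-256
    digests -- the 'checksum kept apart from the data' idea."""
    sums = []
    with open(path, "wb") as f:
        for b in blocks:
            f.write(b)
            sums.append(hashlib.sha256(b).hexdigest())
    return sums

def verify_blocks(path, sums):
    """Re-read every block and return the indexes whose content no longer
    matches its digest -- corruption the drive's own CRCs may have missed."""
    bad = []
    with open(path, "rb") as f:
        for i, expect in enumerate(sums):
            if hashlib.sha256(f.read(BLOCK)).hexdigest() != expect:
                bad.append(i)
    return bad
```

Flip a single byte in the file behind the filesystem's back and verify_blocks pinpoints the damaged block, which is exactly what a scrub does pool-wide.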
Re: choosing a file system
On Thu, 08 Jan 2009 20:03 -0500, "Dale Ghent" wrote:
> On Jan 8, 2009, at 7:46 PM, Bron Gondwana wrote:
>
> > We run one zfs machine.  I've seen it report issues on a scrub
> > only to not have them on the second scrub.  While it looks shiny
> > and great, it's also relatively new.
>
> Wait, weren't you just crowing about ext4? The filesystem that was
> marked GA in the Linux kernel release that happened just a few weeks
> ago? You also sound pretty enthusiastic, rather than cautious, when
> talking about btrfs and tux3.

I was saying I find it interesting. I wouldn't seriously consider using it for production mail stores just yet. But I have been testing it on my laptop, where I'm running an offlineimap replicated copy of my mail. I wouldn't consider btrfs for production yet either, and tux3 isn't even on the radar. They're interesting to watch though, as is ZFS.

I also said (or at least meant) that if you have commercial support, ext4 is probably going to be the next evolutionary step from ext3.

> ZFS, and anyone who even remotely seriously follows Solaris would know
> this, has been GA for 3 years now. For someone who doesn't have their
> nose buried in Solaris much or with any serious attention span, I
> guess it could still seem new.

Yeah, it's true - but I've heard anecdotes of people losing entire zpools due to bugs. Google turns up things like:

http://www.techcrunch.com/2008/01/15/joyent-suffers-major-downtime-due-to-zfs-bug/

which points to this thread:

http://www.opensolaris.org/jive/thread.jspa?threadID=49020&tstart=0

and finally this comment:

http://www.joyeur.com/2008/01/16/strongspace-and-bingodisk-update#c008480

Not something I would want happening to my entire universe, which is why having ~280 separate filesystems (at the moment) with our email spread across them means that a rare filesystem bug is only likely to affect a single store if it bites - and we can restore one store's worth of users a lot quicker than the whole system.
It's the same reason we prefer Cyrus replication (and put a LOT of work into making it stable - check this mailing list from a couple of years ago; I wrote most of the patches that stabilised replication between 2.3.3 and 2.3.8).

If all your files are on a single filesystem, then a rare bug only has to hit once. A frequent bug, on the other hand - well, you'll know about them pretty fast... :) None of the filesystems mentioned have frequent bugs (except btrfs and probably tux3 - but they ship with big fat warnings all over).

> As for your x4500, I can't tell if those syslog lines you pasted were
> from Aug. 2008 or 2007, but certainly since 2007 the Marvell SATA
> driver has seen some huge improvements to work around some pretty
> nasty bugs in the Marvell chipset. If you still have that x4500, and
> have not applied the current patch for the marvell88sx driver, I
> highly suggest doing so. Problems with that chip are some of the
> reasons Sun switched to the LSI 1068E as the controller in the x4540.

I think it was 2007 actually. We haven't had any trouble with it for a while, but then it does relatively little. The big zpool is just used for backups, which are pretty much one .tar.gz and one .sqlite3 file per user - and the .sqlite3 file is just indexing the .tar.gz file; we can rebuild it by reading the tar file if needed.

As a counterpoint to some of the above, we had an issue with Linux where there was a bug in 64-bit writev handling of mmaped space. If you were doing a writev with an mmaped buffer that crossed a page boundary and the following page wasn't mapped in, it would inject spurious zero bytes into the output where the start of the next page belonged. It took me a few days to prove it was the kernel and create a repeatable test case, and then, backwards and forwards with Linus and a couple of other developers, we fixed it and tested it _that_day_.
I don't know anyone with even unobtanium-level support with a commercial vendor who has actually had that sort of turnaround.

This caused pretty massive file corruption, especially of our skiplist files, but bits of every other meta file too. Luckily, as per above, we had only upgraded one machine. We generally do that with new kernels or software versions - upgrade one production machine and watch it for a bit. We also test things on testbed machines first, but you always find something different on production.

The mmap-over-boundaries case was pretty rare - only a few per day would actually cause a crash; the others were silent corruption that wasn't detected at the time. If something like this had hit a machine with no replica, we would have been seriously screwed. Since it only hit one machine, we could apply the fix and re-replicate all the damaged data from the other machine. No actual data loss.

Bron.

--
Bron Gondwana
br...@fastmail.fm
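The shape of the I/O that tripped that kernel bug is simple to sketch: hand writev() a buffer that lives in an mmap'd region and straddles a page boundary. The snippet below (purely illustrative - the bug itself was fixed long ago) just demonstrates the access pattern; on a correct kernel the bytes come through intact, with no zeros injected at the page edge:

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE

# A file two pages long, filled with a known byte, mapped into memory.
src = tempfile.TemporaryFile()
src.write(b"\xab" * (2 * PAGE))
src.flush()
m = mmap.mmap(src.fileno(), 2 * PAGE)

# A writev() whose single buffer crosses from one mapped page into the
# next -- the pattern that triggered the spurious-zero-bytes bug.
span = memoryview(m)[PAGE - 16 : PAGE + 16]
out = tempfile.TemporaryFile()
os.writev(out.fileno(), [span])

out.seek(0)
copied = out.read()
```

Under the buggy kernel, the bytes belonging to the not-yet-faulted-in second page could arrive as zeros; a correct kernel copies all 32 bytes faithfully.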
Re: choosing a file system
On Jan 8, 2009, at 7:46 PM, Bron Gondwana wrote:

> We run one zfs machine.  I've seen it report issues on a scrub
> only to not have them on the second scrub.  While it looks shiny
> and great, it's also relatively new.

Wait, weren't you just crowing about ext4? The filesystem that was marked GA in the Linux kernel release that happened just a few weeks ago? You also sound pretty enthusiastic, rather than cautious, when talking about btrfs and tux3.

ZFS, and anyone who even remotely seriously follows Solaris would know this, has been GA for 3 years now. For someone who doesn't have their nose buried in Solaris much, or with any serious attention span, I guess it could still seem new.

As for your x4500, I can't tell if those syslog lines you pasted were from Aug. 2008 or 2007, but certainly since 2007 the Marvell SATA driver has seen some huge improvements to work around some pretty nasty bugs in the Marvell chipset. If you still have that x4500, and have not applied the current patch for the marvell88sx driver, I highly suggest doing so. Problems with that chip are some of the reasons Sun switched to the LSI 1068E as the controller in the x4540.

/dale
Re: choosing a file system
On Thu, Jan 08, 2009 at 08:01:04AM -0800, Vincent Fox wrote: > (Summary of filesystem discussion) > > You left out ZFS. > > Sometimes Linux admins remind me of Windows admins. > > I have adminned a half-dozen UNIX variants professionally but > keep running into admins who only do ONE and for whom every > problem is solved with "how can I do this with one OS only?" We run one zfs machine. I've seen it report issues on a scrub only to not have them on the second scrub. While it looks shiny and great, it's also relatively new. Besides, we had a disk _fail_ early on in our x4500 - Sun shipped a replacement drive, but the kernel was unable to recognise it: --- "Nothing odd about how it snaps in. We can see the connectors in the slot - they seem fine as far as we can tell. The drive's 'ok' light is on and the blue led lit." Which suggests the server thinks the drive is fine, but the dmesg data definitely suggests it isn't. I've also included the output of hdadm display below as well, which shows that currently it thinks the drive is not present, even though the last thing reported in the dmesg log is that the device was connected. Aug 14 21:59:13 backup1 SATA device attached at port 0 Aug 14 21:59:13 backup1 sata: [ID 663010 kern.info] +/p...@2,0/pci1022,7...@8/pci11ab,1...@1 : The output of hdadm display shows that the machine definitely thinks the drive is NOT connected. --- Sun's response was to wait for the next kernel upgrade - there was a bug that made that channel unusable even after a reboot. > So far ZFS ticking along with no problems and low iostat numbers > with everything in one big pool. I have separate fs for data, imap, mail > but haven't seen any need to carve mail spool into chunks at all. > There were initial problems noted here in the mailing lists way back > in Solaris 10u3 but that was solved with the fsync patch and since then > it's been like butter. Mail-store systems nobody ever needs to look > at them because it "just works". 
I'd sure hate to lose the entire basket, say due to an unknown bug in zfs. Besides, I _know_ Debian quite well. We don't have any Solaris experience in our team. The documentation looks quite good, but it's still a lot of things that work differently.

I tell you what, maintaining Solaris and using the Solaris userland feels like going back 20 years - and the whole "need a sunsolve password and only get some patches - permission denied on others" crap. I don't need that.

So while I appreciate that ZFS has some advantages, I'd have to say that they need to be weighed against the rest of the system, and the "all the eggs in a relatively new basket" argument.

Also, the response we've had from Linus when we find kernel issues has been absolutely fantastic.

Bron ( Debian on the Solaris kernel would be interesting... )
Re: choosing a file system
On Thu, Jan 08, 2009 at 05:20:00PM +0200, Janne Peltonen wrote:
> If I'm still following after reading through all this discussion,
> everyone who is actually using ReiserFS (v3) appears to be very content
> with it, even with very large installations. Apparently the fact that
> ReiserFS uses the BKL in places doesn't hurt performance too badly, even
> with multi core systems? Another thing I don't recall being mentioned
> was fragmentation - ext3 appears to have a problem with it, in typical
> Cyrus usage, but how does ReiserFS compare to it?

Yeah, I'm surprised the BKL hasn't hurt us more.

Fragmentation - yeah, it does hurt performance a bit. We run a patch which causes a skiplist checkpoint every time it runs a "recovery", which includes every restart. We also tune skiplists to checkpoint more frequently in everyday use. This helps reduce meta fragmentation.

For data fragmentation - we don't care. Honestly. Data IO is so rare. The main time it matters is if someone does a body search.

Which leaves... index files. The worst case is files that are only ever appended to, with no records ever deleted. Each time you expunge a mailbox (even with delayed expunge) it causes a complete rewrite of the cyrus.index file.

I also wrote a filthy little script (attached) which can repack cyrus meta directories. I'm not 100% certain that it's problem-free though, so I only run it on replicas. Besides, it's not "protected" like most of our auto-system functions, which check the database to see if the machine is reporting high load problems and choke themselves until the load drops back down again.

> I'm using this happily, with 50k users, 24 distinct mailspools of 240G
> each. Full backups take quite a while to complete (~2 days), but normal
> usage is quite fast. There is the barrier problem, of course... I'm
> using noatime (implying nodiratime) and data=ordered, since
> data=writeback resulted in corrupted skiplist files on crash, while
> data=ordered mostly didn't.
Yeah, full backups. Ouch. I think the last time we had to do that it took somewhat over a week. Mainly CPU limited on the backup server, which is doing a LOT of gzipping!

Our incremental backups take about 4 hours. We could probably speed this up a little more, but given that it's now down from about 12 hours two weeks ago, I'm happy. We were actually rate limited by Perl 'unpack' and hash creation, believe it or not! I wound up rewriting Cyrus::IndexFile to provide a raw interface, and unpacking just the fields that I needed. I also asserted index file version == 10 in the backup library so I can guarantee the offsets are correct.

I've described our backup system here before - it's _VERY_ custom, based on a deep understanding of the Cyrus file structures. In this case it's definitely worth it - it allows us to do partial mailbox recoveries with flags intact.

Unfortunately, "seen" information is much trickier. I've been tempted for a while to patch Cyrus's seen support to store seen information for the user's own folders in the cyrus.index file, and only seen information for unowned folders in the user.seen files. The way it works now seems optimised for the uncommon case at the expense of the common. That always annoys me!

> Ext4 just got stable, so there is no real world Cyrus user experience on
> it. Among other things, it contains an online defragmenter. Journal
> checksumming might also help around the write barrier problem on LVM
> logical volumes, if I've understood correctly.

Yeah, it's interesting. Local fiddling suggests it's worse for my Maildir performance than even btrfs, and btrfs feels more jerky than reiser3, so I stick with reiser3.

> Reiser4 might have a future, at least Andrew Morton's -mm patch contains
> it and there are people developing it. But I don't know if it ever will
> be included in the "standard" kernel tree.

Yeah, the mailing list isn't massively active at the moment either... I do keep an eye on it.
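The "unpack only the fields you need" trick mentioned above (the fix for the Perl unpack rate limit) translates to any fixed-width record format: instead of decoding every field of every record, read the raw bytes and pull out individual fields by offset. A sketch in Python with struct (the record layout here is made up for illustration - it is NOT the real cyrus.index format):

```python
import struct

RECORD_SIZE = 80  # hypothetical fixed record size; not the real cyrus.index layout

def uid_and_modseq(record):
    """Decode just two fields by offset instead of unpacking the whole
    record -- much cheaper when you only need a couple of values per record."""
    (uid,) = struct.unpack_from(">I", record, 0)     # uint32 at offset 0 (assumed)
    (modseq,) = struct.unpack_from(">Q", record, 8)  # uint64 at offset 8 (assumed)
    return uid, modseq
```

Asserting the file format version up front, as described above, is what makes hard-coding offsets like these safe.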
> Btrfs is in so early development that I don't know yet what to say about
> it, but the fact of ZFS's being incompatible with GPL might be mitigated
> by this.

Yeah, btrfs looks interesting. Especially with their work on improving locking - even on my little dual-processor laptop (yay, multi-core processors) I would expect to see an improvement when they merge the new locking code.

> I'm going to continue using ext3 for now, and probably ext4 when it's
> available from a certain commercial enterprise Linux vendor (personally,
> I'd be using Debian, but the department has an official policy of using
> RH / CentOS). I'm eagerly waiting for btrfs to appear... I probably /would/
> switch to ReiserFS for now, if RH cluster would support ReiserFS FS
> resources. Hmm, maybe I should just start hacking... On the other hand,
> the upgrade path from ext3 to ext4 is quite easy, and I don't know yet
> which would be better, ReiserFS or ext4.

Sounds sane. If vendor support matters, then ext4 is probably the way to go.
Re: choosing a file system
(Summary of filesystem discussion)

You left out ZFS.

Sometimes Linux admins remind me of Windows admins.

I have adminned a half-dozen UNIX variants professionally but keep running into admins who only do ONE and for whom every problem is solved with "how can I do this with one OS only?"

I admin numerous Linux systems in our data center (a Perdition proxy in front of Cyrus, for one), but frankly, if you wanted me to go back into the filesystem Dark Ages now, for terabytes of mail volume, I'd throw a professional fit. Even the idea that I need to tune my filesystem for inodes, and to keep it from wanting to fsck on reboot #20 or whatever, seems like caveman discussion. Do any of them offer cheap and nearly-instant snapshots & online scrubbing? No? Then why use them for a large number of files of an important nature?

I love Linux, I surely do. Virtually everything of an appliance nature here will probably shift over to it in the long run, I think, and for good reasons. But the filesystem is one area where the bazaar model has fallen into a very deep rut and can't muster the energy to climb out.

So far ZFS is ticking along with no problems and low iostat numbers, with everything in one big pool. I have separate filesystems for data, imap, and mail, but haven't seen any need to carve the mail spool into chunks at all. There were initial problems noted here in the mailing lists way back in Solaris 10u3, but that was solved with the fsync patch, and since then it's been like butter. Mail-store systems: nobody ever needs to look at them because it "just works".
Re: choosing a file system
Hm.

ReiserFS: If I'm still following after reading through all this discussion, everyone who is actually using ReiserFS (v3) appears to be very content with it, even with very large installations. Apparently the fact that ReiserFS uses the BKL in places doesn't hurt performance too badly, even with multi-core systems? Another thing I don't recall being mentioned was fragmentation - ext3 appears to have a problem with it, in typical Cyrus usage, but how does ReiserFS compare? Also, the write barrier problem mentioned in response to my earlier post on ext3 would apparently be there with ReiserFS too, wouldn't it?

GFS: Nobody mentioned using GFS, which /is/ a clustered file system and as such probably overkill if it's only mounted on one node at a time, but I'm curious... the overhead of a clustered FS is that all metadata operations take a long time, because there is a lot of cluster-wide locking. But how many metadata operations are there, after all, in Cyrus? Also, GFS is one of the two file systems available when using RH clustering...

Ext3: I'm using this happily, with 50k users, 24 distinct mailspools of 240G each. Full backups take quite a while to complete (~2 days), but normal usage is quite fast. There is the barrier problem, of course... I'm using noatime (implying nodiratime) and data=ordered, since data=writeback resulted in corrupted skiplist files on crash, while data=ordered mostly didn't. Also, ext3 is the other FS available when using RH clustering. (Of course, it isn't a clustered FS, so it is only available when using the cluster in active-passive mode.)

XFS: There was someone using this, too, and happy with it.

JFS: Mm, apparently no comments on this - none positive, at least.

Future: Ext4 just got stable, so there is no real-world Cyrus user experience with it. Among other things, it contains an online defragmenter.
Journal checksumming might also help around the write barrier problem on LVM logical volumes, if I've understood correctly.

Reiser4 might have a future; at least Andrew Morton's -mm patch contains it and there are people developing it. But I don't know if it will ever be included in the "standard" kernel tree.

Btrfs is in such early development that I don't know yet what to say about it, but the fact of ZFS's being incompatible with the GPL might be mitigated by it.

Conclusion: I'm going to continue using ext3 for now, and probably ext4 when it's available from a certain commercial enterprise Linux vendor (personally, I'd be using Debian, but the department has an official policy of using RH / CentOS). I'm eagerly waiting for btrfs to appear... I probably /would/ switch to ReiserFS for now, if RH cluster supported ReiserFS FS resources. Hmm, maybe I should just start hacking... On the other hand, the upgrade path from ext3 to ext4 is quite easy, and I don't know yet which would be better, ReiserFS or ext4.

--
Janne Peltonen
PGP Key ID: 0x9CFAC88B
Please consider membership of the Hospitality Club (http://www.hospitalityclub.org)
Re: choosing a file system
>> We've found that splitting the data up into more volumes + more cyrus
>> instances seems to help as well because it seems to reduce overall
>> contention points in the kernel + software (eg filesystem locks spread
>> across multiple mounts, db locks are spread across multiple dbs, etc)
>
> Makes sense. Single cyrus env here, might consider that in the future. At
> that point though, I'd probably consider Murder or similar.

That should work fine as well. I believe murder just does two main things:

1. It merges the mailboxes.db from each instance into each other instance, so you end up with one giant single namespace.

2. It proxies everything (imap/pop/lmtp) as needed to the appropriate instance if it's not the local one.

We don't use murder as we don't really need (1), and we do (2) ourselves with a combination of nginx and a custom lmtpproxy tool.

Rob
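Routing IMAP/POP connections to the right backend instance, as described in (2) above, is what nginx's mail proxy module does: nginx asks an HTTP auth service which backend a given user lives on, then proxies the connection there. A minimal illustrative config (the auth endpoint address is hypothetical; the auth service itself must return Auth-Server/Auth-Port headers per user):

```nginx
mail {
    # Hypothetical auth callback: for each login it returns headers
    # naming the backend cyrus instance holding that user's mailboxes.
    auth_http http://127.0.0.1:8080/auth;

    server {
        listen   143;
        protocol imap;
    }

    server {
        listen   110;
        protocol pop3;
    }
}
```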
Re: choosing a file system
2009/1/5 Patrick Boutilier:

> David Lang wrote:
>> On Sat, 3 Jan 2009, Rob Mueller wrote:
>>
>>>> But the new Solid-State-Disks seem very promising. They are claimed
>>>> to give 30x the throughput of a 15k rpm disk. If IO improves by 30
>>>> times that should make all these optimizations unnecessary.
>>>> As my boss used to tell me ... Good hardware always compensates for
>>>> not-so-good software.
>>>
>>> What we've found is that the meta-data (eg mailbox.db, seen db's, quota
>>> files, cyrus.* files) use WAY more IO than the email data, but only use
>>> 1/20th the space.
>>>
>>> By separating the meta data onto RAID1 10k/15k RPM drives, and the email
>>> data onto RAID5/6 7.2k RPM drives, you can get a good balance of
>>> space/speed.
>>
>> how do you move the cyrus* files onto other drives?
>
> metapartition_files and metapartition-default imapd.conf options in
> cyrus-imapd 2.3.x

So, then, maybe we can easily store pure email data on an NFS appliance, keeping metadata on a traditional filesystem, which can be synced using low-level tools.

Dom

--
Dominique LALOT
Ingénieur Systèmes et Réseaux
http://annuaire.univmed.fr/showuser?uid=lalot
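For reference, splitting the metadata onto its own partition with those options looks roughly like this in imapd.conf (the paths are hypothetical, and the metapartition_files value is one plausible list of cyrus.* file types to relocate - check your cyrus-imapd 2.3.x documentation for the exact set supported):

```
# imapd.conf -- paths are hypothetical
partition-default:     /var/spool/imap        # message files: big, slower RAID5/6
metapartition-default: /var/spool/imapmeta    # cyrus.* metadata: fast RAID1
metapartition_files:   header index cache expunge squat
```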
Re: choosing a file system
David Lang wrote:
> On Sat, 3 Jan 2009, Rob Mueller wrote:
>
>>> But the new Solid-State-Disks seem very promising. They are claimed to
>>> give 30x the throughput of a 15k rpm disk. If IO improves by 30 times
>>> that should make all these optimizations unnecessary.
>>> As my boss used to tell me ... Good hardware always compensates for
>>> not-so-good software.
>>
>> What we've found is that the meta-data (eg mailbox.db, seen db's, quota
>> files, cyrus.* files) use WAY more IO than the email data, but only use
>> 1/20th the space.
>>
>> By separating the meta data onto RAID1 10k/15k RPM drives, and the email
>> data onto RAID5/6 7.2k RPM drives, you can get a good balance of
>> space/speed.
>
> how do you move the cyrus* files onto other drives?

metapartition_files and metapartition-default imapd.conf options in cyrus-imapd 2.3.x
Re: choosing a file system
On Sat, 3 Jan 2009, Rob Mueller wrote:

>> But the new Solid-State-Disks seem very promising. They are claimed to
>> give 30x the throughput of a 15k rpm disk. If IO improves by 30 times
>> that should make all these optimizations unnecessary.
>> As my boss used to tell me ... Good hardware always compensates for
>> not-so-good software.
>
> What we've found is that the meta-data (eg mailbox.db, seen db's, quota
> files, cyrus.* files) use WAY more IO than the email data, but only use
> 1/20th the space.
>
> By separating the meta data onto RAID1 10k/15k RPM drives, and the email
> data onto RAID5/6 7.2k RPM drives, you can get a good balance of
> space/speed.

how do you move the cyrus* files onto other drives?

David Lang
Re: choosing a file system
> $ mount | wc -l
> 92

Wow.

> We've found that splitting the data up into more volumes + more cyrus
> instances seems to help as well because it seems to reduce overall
> contention points in the kernel + software (eg filesystem locks spread
> across multiple mounts, db locks are spread across multiple dbs, etc)

Makes sense. Single cyrus env here, might consider that in the future. At that point though, I'd probably consider Murder or similar.

> Also one thing I did fail to mention, was that for the data volumes, you
> should definitely be using the "notail" mount option. Unfortunately that's
> not the default, and I think it probably should be. Tail packing is neat
> for saving space, but it reduces the average meta-data density, which makes
> "stating" lots of files in a directory a lot slower. I think that's what
> you might have been seeing. Of course you also mounted "noatime,nodiratime"
> on both?

Yes, we were using notail,noatime,nodiratime.

John

--
John Madden
Sr. UNIX Systems Engineer
Ivy Tech Community College of Indiana
jmad...@ivytech.edu
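For a reiserfs mail spool, those options go in /etc/fstab; a sketch with a hypothetical device and mount point (as noted later in the thread, noatime already implies nodiratime, so listing both is redundant but harmless):

```
# /etc/fstab -- device and mount point are hypothetical
/dev/sdb1  /var/spool/imap  reiserfs  notail,noatime,nodiratime,data=ordered  0  2
```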
Re: choosing a file system
> On the other hand, XFS was the only Linux filesystem capable of handling
> our 5 million files (at that time, we're now at 33 million) we had in
> those days with an acceptable performance. Ext3 was way too slow with
> directories with > 1000 files (but many things have changed from kernel
> 2.4.x to nowadays kernels)

It has; not with Cyrus, but with another application we had to build a 'hashed' directory structure to avoid the many-files-in-a-directory situation. That isn't true anymore, though: ext3 now performs well with very large directories.
Re: choosing a file system
> > I had the feeling whatever optimizations done at the FS level would
> > give us a max of 5-10% benefit.
> > We migrated from ext3 to reiserfs on our cyrus servers with 30k
> > mailboxes. I am not sure I saw a great benefit in terms of the
> > iowait. At peak times I always see an iowait of 40-60%
>
> To be honest, that's not what we saw in our ext3 <-> reiserfs tests.
> What mount options are you using? Are you using the mount options I
> mentioned?
>
> noatime,nodiratime,notail,data=ordered

FYI, noatime implies nodiratime: you can set nodiratime on its own, but noatime alone already covers directory access times, so specifying both is redundant.

> > But the new Solid-State-Disks seem very promising. They are claimed
> > to give 30x the throughput of a 15k rpm disk. If IO improves by 30
> > times that should make all these optimizations unnecessary.
> > As my boss used to tell me ... Good hardware always compensates for
> > not-so-good software.
>
> What we've found is that the meta-data (eg mailbox.db, seen dbs, quota
> files, cyrus.* files) uses WAY more IO than the email data, but only
> about 1/20th the space.

Ditto. The meta-data is very much the hot spot for I/O.

> By separating the meta-data onto RAID1 10k/15k RPM drives, and the
> email data onto RAID5/6 7.2k RPM drives, you can get a good balance of
> space/speed.

Agreed.
Re: choosing a file system
Henrique de Moraes Holschuh wrote:
> Ext4, I never tried. Nor reiser3. I may have to; we will build a brand
> new Cyrus spool (small, just 5K users) next month, and the XFS unlink
> [lack of] performance worries me.

Nobody likes deletes. Even databases used to mark deleted space merely as "deleted" until a vacuum (Postgres) or another periodic maintenance command was run. Cyrus offers a similar construct named "delayed expunge".

Before we migrated our mail system to Solaris 10, it ran on Linux 2.4 with XFS on an FC SAN device. Deletes were extremely slow, so we had to delay the expunges until the weekend; even at night they were too slow and congested the IO too much.

On the other hand, XFS was the only Linux filesystem capable of handling the 5 million files we had in those days (we're now at 33 million) with acceptable performance. Ext3 was way too slow with directories with > 1000 files (but many things have changed from kernel 2.4.x to today's kernels), and IBM JFS was not stable (it crashed during a high-load test, which was an immediate k.o.). We were reluctant to use Reiser then, as it was "too new" in 2001.
Re: choosing a file system
> Ext4, I never tried. Nor reiser3. I may have to, we will build a brand
> new Cyrus spool (small, just 5K users) next month, and the XFS unlink
> [lack of] performance worries me.

From what I can tell, all filesystems seem to have relatively poor unlink performance, and unlinks often cause excessive contention and IO for what you'd think they should be doing. And it's not just filesystems: SQL deletes in MySQL InnoDB are way slower than you'd expect as well. Maybe deletes in general are just not as optimised a path, or there's something tricky about making atomic deletes work; I admit I've never really looked into it.

Anyway, that's part of the reason we sponsored Ken to create the "delayed expunge" mode for cyrus, which allows us to delay unlinks to the weekends, when IO load due to other things is lowest.

---
Added support for "delayed" expunge, in which messages are removed
from the mailbox index at the time of the EXPUNGE (hiding them from
the client), but the message files and cache entries are left behind,
to be purged at a later time by cyr_expire. This reduces the amount
of I/O that takes place at the time of EXPUNGE and should result in
greater responsiveness for the client, especially when expunging a
large number of messages. The new expunge_mode option in imapd.conf
controls whether expunges are "immediate" or "delayed". Development
sponsored by FastMail.
---

Rob
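For reference, switching on the delayed mode described in the changelog looks like this; the option name comes from the changelog itself, while the cyr_expire path, schedule, and 7-day retention are only examples:

```
# /etc/imapd.conf
expunge_mode: delayed

# crontab (path and schedule are examples): physically remove messages
# expunged more than 7 days ago, early on Saturday morning, when IO
# load from other things is lowest
# 0 3 * * 6  cyrus  /usr/lib/cyrus/bin/cyr_expire -X 7
```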
Re: choosing a file system
> Running multiple cyrus instances with different dbs? How do we do that?
> I have seen the ultimate io-contention point is the mailboxes.db file.
> And that has to be single.
> Do you mean dividing the users to different cyrus instances. That is a
> maintenance issue IMHO.

As Bron said, yes it is, but if you have more than one machine's worth of users, you have maintenance issues anyway. So rather than just one instance per machine, we run multiple instances per machine. The only issue it really introduces is that folder sharing between arbitrary users isn't possible (unless you use murder to join all the instances together again, which we don't); only users within an instance can share.

> I had the feeling whatever optimizations done at the FS level would
> give us a max of 5-10% benefit.
> We migrated from ext3 to reiserfs on our cyrus servers with 30k
> mailboxes. I am not sure I saw a great benefit in terms of the iowait.
> At peak times I always see an iowait of 40-60%

To be honest, that's not what we saw in our ext3 <-> reiserfs tests. What mount options are you using? Are you using the mount options I mentioned?

noatime,nodiratime,notail,data=ordered

> But the new Solid-State-Disks seem very promising. They are claimed to
> give 30x the throughput of a 15k rpm disk. If IO improves by 30 times
> that should make all these optimizations unnecessary.
> As my boss used to tell me ... Good hardware always compensates for
> not-so-good software.

What we've found is that the meta-data (eg mailbox.db, seen dbs, quota files, cyrus.* files) uses WAY more IO than the email data, but only about 1/20th the space. By separating the meta-data onto RAID1 10k/15k RPM drives, and the email data onto RAID5/6 7.2k RPM drives, you can get a good balance of space/speed.

Rob
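The meta/data split described here maps onto the metapartition options available in Cyrus 2.3 and later. A sketch with hypothetical paths; check your distribution's imapd.conf man page for the exact file classes supported:

```
# /etc/imapd.conf
# message files on big, slower RAID5/6 spindles
partition-default: /var/spool/imap

# cyrus.* metadata on fast RAID1 10k/15k RPM spindles
metapartition-default: /var/spool/imapmeta
metapartition_files: header index cache expunge squat
```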
Re: choosing a file system
On Wed, 31 Dec 2008, Adam Tauno Williams wrote:
> > I never really got the point of the data=writeback mode. Sure, it
> > increases throughput, but so does disabling the journal completely,
> > and it seems to me the end result as concerns data integrity is
> > exactly the same.
>
> The *filesystem* is recoverable as the meta-data is journaled.
> *Contents* of files may be lost/corrupted. I'm fine with that since a
> serious abend usually leaves the state of the data in a questionable
> state anyway for reasons other than the filesystem; I want something I
> can safely (and quickly) remount and investigate/restore. It is a
> trade-off.

Err... you guys had better read the recent threads on LKML where Pavel goes really hard on the data-safety holes in ext3 and the Linux VFS (and POSIX).

Short answer: ext3 without barriers (you can also disable the disk write cache, in which case barriers are not needed) does not deserve the name "safe".

At which point *I* personally prefer XFS, which suffers just as badly from the lack of barriers on a disk with an enabled write cache, but performs better than ext3 on most workloads AND has delayed write allocation.

Ext4, I never tried. Nor reiser3. I may have to; we will build a brand new Cyrus spool (small, just 5K users) next month, and the XFS unlink [lack of] performance worries me.

-- 
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond where
the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh
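On kernels of that era, ext3 did not enable write barriers by default, so they had to be requested at mount time. A sketch of the two options Henrique mentions; the device name is hypothetical:

```
# /etc/fstab -- ext3 with write barriers explicitly enabled
/dev/sdb1  /var/spool/imap  ext3  barrier=1,data=ordered,noatime  0  2

# alternative: leave barriers off but disable the disk's write cache,
# e.g. for an ATA disk:
#   hdparm -W0 /dev/sdb
```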
Re: choosing a file system
On Sat, Jan 03, 2009 at 11:46:41AM +0530, ram wrote:
> Running multiple cyrus instances with different dbs ? How do we do that.
> I have seen the ultimate io-contention point is the mailboxes.db file.
> And that has to be single.

Yeah, mailboxes.db access kinda sucks like that. If you're making any changes, it locks the entire DB with a single write lock.

I did consider fine-grained mailboxes.db locking at one point. It's definitely doable with fcntl locking, which is what Cyrus is using on our machines, but it would require some small format changes to skiplist. Somewhere in a checkout I have cyrusdb_skiplist2.c, which contains a bunch of checksumming code and the start of the new format. I got sidetracked and never finished it, though.

All our cyrus instances are installed on completely separate drives, entirely self-contained on those external units, so we can plug them into a new machine and go. The init scripts are in /etc/init.d/, but they are generated from templates which pull their configuration from a central file. We can create a new pair of cyrus instances by adding a single line like this to a config file:

    store$n slot$s1 slot$s2

where $n, $s1 and $s2 are just numbers. Slots are numbered as %d%02d from the server and partition numbers. (This will break if we ever have over 100 slots on a machine, but I'm happy to renumber at that point. Our biggest so far is 40; when I set this up, the biggest was 8. Future-proofing something so easily reconfigurable would just have meant more typing in the meanwhile.)

> Do you mean dividing the users to different cyrus instances. That is a
> maintenance issue IMHO.

It's amazing what you can do with good tools. Besides, if your site is already bigger than any one single machine, then you already have the issue; might as well be smart about it.
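The %d%02d slot-numbering scheme can be illustrated with a short sketch (function and config-line names are hypothetical, not FastMail's actual tooling):

```python
def slot_name(server, partition):
    """Slot numbers combine server and partition as %d%02d,
    e.g. server 4, partition 7 -> slot407."""
    return "slot%d%02d" % (server, partition)

def store_line(store, server, partitions):
    """One central-config line per store: 'storeN slotA slotB ...'"""
    slots = " ".join(slot_name(server, p) for p in partitions)
    return "store%d %s" % (store, slots)

# e.g. store 1 with a pair of slots on server 4
line = store_line(1, 4, [1, 2])
```

The %02d field is what caps the scheme at 100 partitions per server.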
As I said upthread somewhere, moving a user is pretty easy for us:

    use ME::User;

    my $UserName     = shift;
    my $TargetServer = shift;

    my $User = ME::User->new_find($UserName);
    $User->MoveUser($TargetServer);

> But the new Solid-State-Disks seem very promising. They are claimed to
> give 30x the throughput of a 15k rpm disk. If IO improves by 30 times
> that should make all these optimizations unnecessary.
> As my boss used to tell me ... Good hardware always compensates for
> not-so-good software.

Yeah, that would be nice. Modulo the rewrite cost, of course. Note that mailboxes.db is a skiplist file; skiplists make a lot of random updates, 4 bytes at a time, when you append a record. Imagine what that costs if your minimum rewrite block is larger than the size of the whole file. You'd be better off going to a flat-file DB, and I'm not kidding you here: running "recovery" at startup time would take days on a reasonably sized DB. Check out the seeks and rewrites that baby does. (OK, so if your filesystem isn't mounted writeback, it would probably only rewrite twice, when you actually did the fsyncs. So much for rhetorical devices.)

Bron ( rambling again )
Re: choosing a file system
On Sat, 2009-01-03 at 13:21 +1100, Rob Mueller wrote:
> > Now see, I've had almost exactly the opposite experience. Reiserfs
> > seemed to start out well and work consistently until the filesystem
> > reached a certain size (around 160GB, ~30m files), at which point
> > backing it up would start to take too long, and at around 180GB would
> > take nearly a week. This forced us to move to ext3, and it doesn't
> > seem to degrade that way. We did, however, also move from a single
> > partition to 8 of them, so that obviously has some effect as well.
>
> As you noted, changing two variables at once doesn't help you determine
> which was the problem!
>
> Multiple partitions will definitely allow more parallelism, which
> definitely helps speed things up, and is one of the other things we
> have done over time. Basically we went from a few large volumes to
> hundreds of 300G(data)/15G(meta) volumes. One of our machines has 40
> data volumes + 40 meta-data volumes + the standard FS mounts.
>
> $ mount | wc -l
> 92
>
> We've found that splitting the data up into more volumes + more cyrus
> instances seems to help as well because it seems to reduce overall
> contention points in the kernel + software (eg filesystem locks spread
> across multiple mounts, db locks are spread across multiple dbs, etc)

Running multiple cyrus instances with different dbs? How do we do that? I have seen that the ultimate io-contention point is the mailboxes.db file, and that has to be single. Do you mean dividing the users among different cyrus instances? That is a maintenance issue IMHO.

I had the feeling whatever optimizations were done at the FS level would give us a max of 5-10% benefit. We migrated from ext3 to reiserfs on our cyrus servers with 30k mailboxes, and I am not sure I saw a great benefit in terms of the iowait; at peak times I always see an iowait of 40-60%.

But the new solid-state disks seem very promising. They are claimed to give 30x the throughput of a 15k rpm disk.
If IO improves by 30 times, that should make all these optimizations unnecessary. As my boss used to tell me: good hardware always compensates for not-so-good software.

> Also, one thing I did fail to mention was that for the data volumes you
> should definitely be using the "notail" mount option. Unfortunately
> that's not the default, and I think it probably should be. Tail packing
> is neat for saving space, but it reduces the average meta-data density,
> which makes "stating" lots of files in a directory a lot slower. I
> think that's what you might have been seeing. Of course you also
> mounted "noatime,nodiratime" on both?
>
> I think that's another problem with a lot of filesystem benchmarks, not
> finding out what the right mount "tuning" options are for your
> benchmark. Arguing that "the default should be fine" is clearly wrong,
> because every sane person uses "noatime", so you're already doing some
> tuning, so you should find out what's best for the filesystem you are
> trying.
>
> For the record, we use:
>
> noatime,nodiratime,notail,data=ordered
>
> On all our reiserfs volumes.
>
> Rob
Re: choosing a file system
> Now see, I've had almost exactly the opposite experience. Reiserfs
> seemed to start out well and work consistently until the filesystem
> reached a certain size (around 160GB, ~30m files), at which point
> backing it up would start to take too long, and at around 180GB would
> take nearly a week. This forced us to move to ext3, and it doesn't seem
> to degrade that way. We did, however, also move from a single partition
> to 8 of them, so that obviously has some effect as well.

As you noted, changing two variables at once doesn't help you determine which was the problem!

Multiple partitions will definitely allow more parallelism, which definitely helps speed things up, and is one of the other things we have done over time. Basically we went from a few large volumes to hundreds of 300G(data)/15G(meta) volumes. One of our machines has 40 data volumes + 40 meta-data volumes + the standard FS mounts.

$ mount | wc -l
92

We've found that splitting the data up into more volumes + more cyrus instances seems to help as well, because it seems to reduce overall contention points in the kernel + software (eg filesystem locks spread across multiple mounts, db locks spread across multiple dbs, etc).

Also, one thing I did fail to mention was that for the data volumes you should definitely be using the "notail" mount option. Unfortunately that's not the default, and I think it probably should be. Tail packing is neat for saving space, but it reduces the average meta-data density, which makes "stating" lots of files in a directory a lot slower. I think that's what you might have been seeing. Of course you also mounted "noatime,nodiratime" on both?

I think that's another problem with a lot of filesystem benchmarks: not finding out what the right mount "tuning" options are for your benchmark.
Arguing that "the default should be fine" is clearly wrong, because every sane person uses "noatime", so you're already doing some tuning; you should find out what's best for the filesystem you are trying.

For the record, we use:

noatime,nodiratime,notail,data=ordered

on all our reiserfs volumes.

Rob
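Those options translate to an fstab line like the following; the device and mount point are hypothetical:

```
# /etc/fstab -- reiserfs tuned for a Cyrus spool
/dev/sdc1  /var/spool/imap  reiserfs  noatime,nodiratime,notail,data=ordered  0  2
```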
Re: choosing a file system
> Now from our experience, I can tell you that ext3 really does poorly on
> this workload compared to reiserfs. We had two identical servers, one
> all reiserfs and one all ext3. The ext3 one started out ok, but over
> the course of a few weeks/months it started getting worse and worse,
> and was eventually being completely crushed by IO load. The machine
> running reiserfs had no problems at all, even though it had more users
> on it and was growing at the same rate as the other machine.

Now see, I've had almost exactly the opposite experience. Reiserfs seemed to start out well and work consistently until the filesystem reached a certain size (around 160GB, ~30m files), at which point backing it up would start to take too long; at around 180GB it would take nearly a week. This forced us to move to ext3, and it doesn't seem to degrade that way. We did, however, also move from a single partition to 8 of them, so that obviously has some effect as well.

John
-- 
John Madden
Sr. UNIX Systems Engineer
Ivy Tech Community College of Indiana
jmad...@ivytech.edu
Re: choosing a file system
On Fri, Jan 02, 2009 at 04:19:52PM +1100, Rob Mueller wrote:
> http://lkml.org/lkml/2008/6/17/9

Ahh, that week. *sigh*. Not strictly a reiserfs problem, of course; that one would have affected everyone.

Speaking of which, Linus did point out in that thread that the way Cyrus does IO (mmap for reads, fseek/fwrite for writes) is totally insane and guaranteed to hit every bug in existence. Normal people just use mmap for both. Does anyone actually run Cyrus on anything that doesn't support writable mmap these days?

Bron.
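The mixed IO pattern described here (read through an mmap, write through the file descriptor) looks roughly like the following sketch. On Linux the unified page cache keeps the two views coherent, but historically this path was far less exercised than pure mmap, which is Linus's point:

```python
import mmap
import os
import tempfile

# scratch file standing in for a Cyrus index file (hypothetical)
fd, path = tempfile.mkstemp()
os.write(fd, b"\0" * 16)

with open(path, "r+b") as f:
    view = mmap.mmap(f.fileno(), 0)   # reads go through the mapping
    before = bytes(view[0:4])
    # ...while writes go through the descriptor, Cyrus-style
    os.lseek(f.fileno(), 0, os.SEEK_SET)
    os.write(f.fileno(), b"ABCD")
    after = bytes(view[0:4])          # the kernel must keep these coherent
    view.close()

os.close(fd)
os.unlink(path)
```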
Re: choosing a file system
> There are /lots/ of (comparative) tests done: The most recent I could
> find with a quick Google is here:
>
> http://www.phoronix.com/scan.php?page=article&item=ext4_benchmarks

Almost every filesystem benchmark I've ever seen is effectively useless for comparing what's best for a cyrus mail server. They try to show the maximums/minimums of a bunch of discrete operation types (eg streaming IO, creating files, deleting files, lots of small random reads, etc) running on near-empty volumes. What none of them shows is what happens to a filesystem when it's a real-world cyrus mail spool/index:

* 100,000's of directories
* 10,000,000's of files
* 1-1,000,000 files per directory
* files continuously being created and deleted (emails)
* data being appended to existing files (cyrus.* files)
* lots of fsync calls all over the place (every lmtp append has multiple
  fsyncs, as well as various imap actions)
* run over the course of multiple years of continuous operation
* with a filesystem that's 60-90% full, depending on your usage levels

There are serious fragmentation issues going on here that no benchmark even comes close to simulating.

Now, from our experience I can tell you that ext3 really does poorly on this workload compared to reiserfs. We had two identical servers, one all reiserfs and one all ext3. The ext3 one started out ok, but over the course of a few weeks/months it started getting worse and worse, and was eventually being completely crushed by IO load. The machine running reiserfs had no problems at all, even though it had more users on it and was growing at the same rate as the other machine. Yes, we did have directory indexing enabled (we had it turned on from the start), and we tried different data modes like data=writeback and data=ordered, but that didn't help either. To be honest, I don't know why exactly, and working out what's causing IO bottlenecks is not easy. We just went back to reiserfs.

Some previous comments I've made:
http://www.irbs.net/internet/info-cyrus/0412/0042.html
http://lists.andrew.cmu.edu/pipermail/info-cyrus/2006-October/024119.html

> The problem with reiserfs is... well. The developers have explicitly
> stated that the development of v3 has come to its end, and there was the

In this particular case, I'm actually almost happy about this! Reiserfs has been very stable for us for at least 5 years, and I'm almost glad no one is touching it, because invariably people working on something will introduce new weird edge-case bugs. This one is from a while back, but it demonstrates how apparently just adding 'some "sparse" endian annotations' caused a bug:

http://oss.sgi.com/projects/xfs/faq.html#dir2

That one was really nasty; even the xfs_repair tool couldn't fix it for a while!

Having said that, there have been some bugs over the last few years with reiserfs. However, the kernel developers will still help with fixes if you find a bug and can trace it down.

http://blog.fastmail.fm/2007/09/21/reiserfs-bugs-32-bit-vs-64-bit-kernels-cache-vs-inode-memory/
http://lkml.org/lkml/2005/7/12/396
http://lkml.org/lkml/2008/6/17/9

Rob
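A toy version of the spool workload described above (many small files, continuous create/append/delete, fsyncs everywhere) might look like this; a real spool adds years of fragmentation that no short run can reproduce, so this is a sketch of the access pattern only:

```python
import os
import random
import tempfile

def churn(spool, messages=200):
    """Create, append, fsync, and delete small files, crudely imitating
    lmtp delivery, cyrus.* appends, and expunge on a mail spool."""
    live = []
    for i in range(messages):
        path = os.path.join(spool, "msg.%d" % i)
        with open(path, "wb") as f:
            f.write(os.urandom(512))             # small message body
            f.flush()
            os.fsync(f.fileno())                 # lmtp-style fsync
        live.append(path)
        with open(os.path.join(spool, "cyrus.index"), "ab") as idx:
            idx.write(b"%d\n" % i)               # append-heavy metadata file
            idx.flush()
            os.fsync(idx.fileno())
        if live and random.random() < 0.3:
            os.unlink(live.pop(random.randrange(len(live))))  # expunge
    return len(live)

spool = tempfile.mkdtemp()
remaining = churn(spool)
```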
Re: choosing a file system
On Wed, Dec 31, 2008 at 07:47:31AM -0500, Nik Conwell wrote:
> On Dec 30, 2008, at 4:43 PM, Shawn Nock wrote:
> [...]
> > a scripted rename of mailboxes to balance partition utilization when
> > we add another partition.
>
> Just curious - how do you stop people from accessing their mailboxes
> during the time they are being renamed and moved to another partition?

All access goes via an nginx proxy. We use the proc directory contents to detect currently active connections and terminate them, after blocking all new logins in the authentication daemon. Once the mailboxes are fully moved, logins are enabled again.

Bron.
Re: choosing a file system
Nik Conwell wrote:
> On Dec 30, 2008, at 4:43 PM, Shawn Nock wrote:
> [...]
>> a scripted rename of mailboxes to balance partition utilization when
>> we add another partition.
>
> Just curious - how do you stop people from accessing their mailboxes
> during the time they are being renamed and moved to another partition?

We don't really bother. We run the script overnight (over several nights) to minimize storage utilization, and we haven't run into a problem. I haven't looked at the code in a while, but as I recall the rename operation is fairly atomic. In short: it doesn't take long to move a box. The worst thing I could imagine would be a momentary outage for a single user ("Mailbox does not exist" or similar). This sort of error, if it does occur in the wild, would clear almost immediately.

Shawn
-- 
Shawn Nock (OpenPGP: 0xFF7D08A3)
Unix Systems Group; UITS
University of Arizona
nock at email.arizona.edu
Re: choosing a file system
On Wed, 31 Dec 2008, Adam Tauno Williams wrote:
> On Wed, 2008-12-31 at 11:47 +0100, LALOT Dominique wrote:
>> Thanks for everybody. That was an interesting thread. Nobody seems to
>> use a NetApp appliance, maybe due to NFS architecture problems.
>
> Personally, I'd never use NFS for anything. Over the years I've had way
> too many NFS-related problems on other things to ever want to try it
> again.

NFS has some very interesting capabilities and limitations. It's really bad for multiple processes writing to the same file (the cyrus.* files, for example) and for atomic actions (writing the message files, for example).

There are ways you can configure it that will work, but unless you already have a big NFS server you are probably much better off using a mechanism that makes the drives look more like local drives (SAN, iSCSI, etc), or trying one of the cluster filesystems, which have different tradeoffs than NFS does.

>> I believe I'll look at ext4, which seemed to be available in the last
>> kernel, and also at Solaris, but we are not enough people to support
>> another OS.
>
> We've used Cyrus on XFS for almost a year, no problems.
>
> In regards to ext3, I'd pay attention to the vintage of problem reports
> and performance issues; the ext3 of several years ago is not the ext3
> of today, many improvements have been made. "data=writeback" mode can
> help performance quite a bit, as well as enabling "dir_index" if it
> isn't already (did it ever become the default?). The periodic fsck can
> also be disabled via tune2fs. I only point this out since, if you
> already have an ext3 setup, trying the above is painless and might buy
> you something.

It's definitely worth testing different filesystems. I last did a test about two years ago and confirmed XFS as my choice. I have one instance of cyrus still running on ext3, and as a user I definitely notice it in the performance.
David Lang
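The atomicity problem mentioned above is usually worked around with the write-to-temp-then-rename pattern sketched below. rename() is atomic on local filesystems; over NFS the rename itself is atomic on the server, though close-to-open caching still complicates concurrent writers:

```python
import os
import tempfile

def atomic_write(path, data):
    """Write to a temporary file in the same directory, fsync it, then
    rename over the destination, so readers never observe a
    half-written file."""
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname)   # same dir: rename stays on one fs
    try:
        os.write(fd, data)
        os.fsync(fd)                          # durable before it becomes visible
    finally:
        os.close(fd)
    os.rename(tmp, path)                      # atomic replacement
```

This is essentially how maildir-style delivery avoids partially written messages.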
Re: choosing a file system
Ah, the saga of Hans Reiser. That, unfortunately, is the downfall of Reiserfs: his company has disappeared, and a void has been left by his absence.

The Reiser4 patch set is still current against Linux kernel 2.6.28 (see http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/). However, I think http://en.wikipedia.org/wiki/Reiser4 pretty much sums up the future of Reiser4.

That said, I haven't really run into show-stopping bugs on Reiserfs3 in quite some time (with excellent hardware). Replace it with dodgy hardware, though, and things change.

I haven't looked at btrfs with Cyrus yet; perhaps I'll do that sometime soon.

On Dec 31, 2008, at 6:20 AM, Janne Peltonen wrote:
> On Wed, Dec 31, 2008 at 04:58:57AM -0800, Scott Likens wrote:
>> I would not discount using reiserfs (v3) by any means. It's still by
>> far a better choice for a filesystem with Cyrus than Ext3 or Ext4. I
>> haven't really seen anyone do any tests with Ext4, but I imagine it
>> should be about par for the course for Ext3.
>
> There are /lots/ of (comparative) tests done: The most recent I could
> find with a quick Google is here:
>
> http://www.phoronix.com/scan.php?page=article&item=ext4_benchmarks
>
> The problem with reiserfs is... well. The developers have explicitly
> stated that the development of v3 has come to its end, and there was
> the long argument between Hans Reiser and the kernel developers about
> whether v4 could be included in the kernel. When Hans Reiser was
> charged with murder (not the crow or Cyrus variant), his company
> assured that the development (of v4) would continue, but the last time
> I tried to find out anything about the project, it appeared more or
> less dead. Of course, the current reiserfs (v3) is very stable, but if
> you run into any issues, there really isn't a developer you can
> contact (or send patches to, if you figure out the bug).
> --Janne
> --
> Janne Peltonen PGP Key ID: 0x9CFAC88B
> Please consider membership of the Hospitality Club
> (http://www.hospitalityclub.org)
Re: choosing a file system
On Wed, 2008-12-31 at 15:46 +0200, Janne Peltonen wrote:
> On Wed, Dec 31, 2008 at 07:38:21AM -0500, Adam Tauno Williams wrote:
> > In regards to ext3 I'd pay attention to the vintage of problem
> > reports and performance issues; ext3 of several years ago is not the
> > ext3 of today, many improvements have been made. "data=writeback"
> > mode can help performance quite a bit, as well as enabling
> > "dir_index" if it isn't already (did it ever become the default?).
> > The periodic fsck can also be disabled via tune2fs. I only point
> > this out since, if you already have any ext3 setup, trying the above
> > is painless and might buy you something.
>
> I wouldn't call data=writeback painless. I had it on in the testing
> phase of our current Cyrus installation, and when the filesystem had
> to be forcibly unmounted for any reason (yes, there are reasons), the
> amount of corruption in those files that happened to be active during
> the unmount - well, it wasn't a nice sight. And the files weren't
> recoverable, except from backup.
>
> I never really got the point of the data=writeback mode. Sure, it
> increases throughput, but so does disabling the journal completely,
> and it seems to me the end result as concerns data integrity is
> exactly the same.

The *filesystem* is recoverable, as the meta-data is journaled; the *contents* of files may be lost or corrupted. I'm fine with that, since a serious abend usually leaves the data in a questionable state anyway, for reasons other than the filesystem; I want something I can safely (and quickly) remount and investigate/restore. It is a trade-off.
Re: choosing a file system
On Wed, Dec 31, 2008 at 04:58:57AM -0800, Scott Likens wrote:
> I would not discount using reiserfs (v3) by any means. It's still by
> far a better choice for a filesystem with Cyrus than Ext3 or Ext4. I
> haven't really seen anyone do any tests with Ext4, but I imagine it
> should be about par for the course for Ext3.

There are /lots/ of (comparative) tests done: the most recent I could find with a quick Google is here:

http://www.phoronix.com/scan.php?page=article&item=ext4_benchmarks

The problem with reiserfs is... well. The developers have explicitly stated that the development of v3 has come to its end, and there was the long argument between Hans Reiser and the kernel developers about whether v4 could be included in the kernel. When Hans Reiser was charged with murder (not the crow or Cyrus variant), his company assured that the development (of v4) would continue, but the last time I tried to find out anything about the project, it appeared more or less dead. Of course, the current reiserfs (v3) is very stable, but if you run into any issues, there really isn't a developer you can contact (or send patches to, if you figure out the bug).

--Janne
-- 
Janne Peltonen PGP Key ID: 0x9CFAC88B
Please consider membership of the Hospitality Club
(http://www.hospitalityclub.org)
Re: choosing a file system
On Tue, 2008-12-30 at 17:49 +0100, LALOT Dominique wrote:
> Once, there was a bad shutdown corrupting ext3fs and we spent 6 hours on an fsck.

Actually, I have been using reiserfs for over 2 years on cyrus-imapd. It performs great, even with a really big number of files in the IMAP spool folders. But I don't know how it will perform on EMC. 4 years ago I tried ext3. It was a disaster, slow as hell. Reiser4 was once used too, and it did even better than reiserfs, but after 2 months of stable running it got a kernel oops because of the FS, and I switched back to reiserfs.

--
Teresa
Re: choosing a file system
On Wed, Dec 31, 2008 at 07:38:21AM -0500, Adam Tauno Williams wrote:
> In regards to ext3 I'd pay attention to the vintage of problem reports and performance issues; ext3 of several years ago is not the ext3 of today, many improvements have been made. "data=writeback" mode can help performance quite a bit, as well as enabling "dir_index" if it isn't already (did it ever become the default?). The periodic fsck can also be disabled via tune2fs. I only point this out since, if you already have an ext3 setup, trying the above is painless and might buy you something.

I wouldn't call data=writeback painless. I had it on in the testing phase of our current Cyrus installation, and if the filesystem had to be forcibly unmounted for any reason (yes, there are reasons), the amount of corruption in those files that happened to be active during the unmount - well, it wasn't a nice sight. And the files weren't recoverable, except from backup.

I never really got the point of the data=writeback mode. Sure, it increases throughput, but so does disabling the journal completely, and it seems to me the end result as concerns data integrity is exactly the same.

--Janne
--
Janne Peltonen
PGP Key ID: 0x9CFAC88B
Please consider membership of the Hospitality Club (http://www.hospitalityclub.org)
Re: choosing a file system
> -- Nik Conwell is rumored to have mumbled on 31. Dezember 2008 07:47:31 -0500 regarding Re: choosing a file system:
> > Just curious - how do you stop people from accessing their mailboxes during the time they are being renamed and moved to another partition?

I moved a few thousand mailboxes in a similar fashion (summer of 2007) and encountered no problems. New message deliveries were nicely "frozen" by Cyrus while the target Inbox was being renamed/moved.

Question: would it, stability-wise, make a difference if the mail data and metadata were split, allocating the metadata partitions on SAN-based LUNs and storing messages in NAS (NFS) space? In other words: are the Cyrus-over-NFS inconveniences confined to the cyrus.* files?

Rationale: NAS space can, typically, be "grown" more easily than SAN space. This could be an advantage for older server OSes and filesystems...

Eric Luyten, Brussels Free University Computing Centre
(Cyrus 2.2, 58k users, 2.3 TB)
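Eric's data/metadata split is supported natively in Cyrus 2.3 and later via meta partitions (his signature mentions 2.2, which predates the feature). A minimal imapd.conf sketch - the paths are assumptions:

```
partition-default: /nfs/data/imap
metapartition-default: /san/meta/imap
metapartition_files: header index cache expunge squatter
```

With this, message files live under the NFS-backed data partition while the cyrus.* metadata files, which are the lock-sensitive part, stay on the SAN LUN.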
Re: choosing a file system
-- Nik Conwell is rumored to have mumbled on 31. Dezember 2008 07:47:31 -0500 regarding Re: choosing a file system:

> Just curious - how do you stop people from accessing their mailboxes during the time they are being renamed and moved to another partition?

I just do a grep on the username in the proc directory - if there is no process for that user, I figure it's safe enough to move the mailbox. This approach has worked well so far. I experimented with accessing a mailbox while it was being moved and that seemed to be OK as well, i.e. it failed while the operation was in progress.

--
Sebastian Hagedorn - RZKR-R1 (Flachbau), Zi. 18, Robert-Koch-Str. 10
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
Universität zu Köln / Cologne University - Tel. +49-221-478-5587
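Sebastian's check can be sketched roughly like this. Cyrus writes one file per active session under its configdirectory's proc/ subdirectory; the path and the example username here are assumptions:

```shell
# Sketch: refuse to move a mailbox while the user has an active session,
# by grepping the Cyrus proc directory (path is an assumption).
PROCDIR=${PROCDIR:-/var/imap/proc}

user_is_idle() {
    # each file under proc/ describes one running service instance;
    # a match on the username means an active IMAP/POP session
    ! grep -rqs "$1" "$PROCDIR"
}

if user_is_idle "jdoe"; then
    echo "safe to move jdoe"
else
    echo "jdoe has active sessions; skipping"
fi
```

Note this is only a heuristic: a session can still start between the check and the rename, which is why it is reassuring that (as Sebastian observed) a concurrent access simply fails while the move is in progress.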
Re: choosing a file system
Hi,

I would not discount using reiserfs (v3) by any means. It's still by far a better choice of filesystem for Cyrus than Ext3 or Ext4. I haven't really seen anyone do any tests with Ext4, but I imagine it should be about par for the course for Ext3.

As far as NFS goes... NFS isn't itself that bad; it's just that people tend to find ways to use NFS in an incorrect manner that only ends up leading to failure.

Scott

On Dec 31, 2008, at 2:47 AM, LALOT Dominique wrote:
> [Full quotes of Dominique's reply and of Bron Gondwana's setup description trimmed; both appear as original messages elsewhere in this thread.]
Re: choosing a file system
On Dec 30, 2008, at 4:43 PM, Shawn Nock wrote:
[...]
> a scripted rename of mailboxes to balance partition utilization when we add another partition.

Just curious - how do you stop people from accessing their mailboxes during the time they are being renamed and moved to another partition?

-nik
Information Technology
Systems Programming
Boston University
Re: choosing a file system
On Wed, 2008-12-31 at 11:47 +0100, LALOT Dominique wrote:
> Thanks to everybody. That was an interesting thread. Nobody seems to use a NetApp appliance, maybe due to NFS architecture problems.

Personally, I'd never use NFS for anything. Over the years I've had way too many NFS-related problems on other things to ever want to try it again.

> I believe I'll look at ext4, which seemed to be available in the latest kernel, and also at Solaris, but we don't have enough staff to support another OS.

We've used Cyrus on XFS for almost a year, no problems.

In regards to ext3 I'd pay attention to the vintage of problem reports and performance issues; ext3 of several years ago is not the ext3 of today, many improvements have been made. "data=writeback" mode can help performance quite a bit, as well as enabling "dir_index" if it isn't already (did it ever become the default?). The periodic fsck can also be disabled via tune2fs. I only point this out since, if you already have an ext3 setup, trying the above is painless and might buy you something.
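Adam's three ext3 knobs boil down to a couple of commands. A dry-run sketch (the device name is an assumption, and DRY_RUN=1 only prints what would be run; note Janne's warning elsewhere in this thread about what data=writeback does to open files after a crash):

```shell
# Sketch of the ext3 tuning Adam describes. /dev/sdb1 is an assumption;
# with DRY_RUN=1 (the default here) the commands are only printed.
DEV=${DEV:-/dev/sdb1}
DRY_RUN=${DRY_RUN:-1}

run() { if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi; }

run tune2fs -O dir_index "$DEV"   # hashed directory lookups
                                  # (existing dirs need an offline 'e2fsck -fD' to be reindexed)
run tune2fs -c 0 -i 0 "$DEV"      # disable the periodic mount-count and interval fscks

# data=writeback is a mount option and cannot be switched on a mounted
# ext3 filesystem; set it in /etc/fstab and remount, e.g.:
#   /dev/sdb1  /var/spool/imap  ext3  data=writeback,noatime  0 0
```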
Re: choosing a file system
Thanks to everybody. That was an interesting thread. Nobody seems to use a NetApp appliance, maybe due to NFS architecture problems. I believe I'll look at ext4, which seemed to be available in the latest kernel, and also at Solaris, but we don't have enough staff to support another OS.

Dom

And Happy New Year!

2008/12/31 Bron Gondwana:
> [Full quote of Bron's setup description trimmed; it appears as an original message elsewhere in this thread.]

--
Dominique LALOT
Ingénieur Systèmes et Réseaux
http://annuaire.univmed.fr/showuser?uid=lalot
Re: choosing a file system
On Tue, Dec 30, 2008 at 02:43:14PM -0700, Shawn Nock wrote:
> Bron and the fastmail guys could tell you more about reiserfs... we've used RH&SuSE/reiserfs/EMC for quite a while and we are very happy.

Yeah, sure could :)

You can probably find plenty of stuff from me in the archives about our setup - the basic things are:

* separate metadata on RAID1 10kRPM (or 15kRPM in the new boxes) drives.
* data files on RAID5 big slow drives - data IO isn't a limiting factor
* 300Gb "slots" with 15Gb associated meta drives, like this:

/dev/sdb6   14016208   8080360   5935848  58% /mnt/meta6
/dev/sdb7   14016208   8064848   5951360  58% /mnt/meta7
/dev/sdb8   14016208   8498812   5517396  61% /mnt/meta8
/dev/sdd2  292959500 248086796  44872704  85% /mnt/data6
/dev/sdd3  292959500 242722420  50237080  83% /mnt/data7
/dev/sdd4  292959500 248840432  44119068  85% /mnt/data8

as you can see, that balances out pretty nicely. We also store per-user bayes databases on the associated meta drives.

We balance our disk usage by moving users between stores when usage reaches 88% on any partition. We get emailed if it goes above 92% and paged if it goes above 95%.

Replication. We have multiple "slots" on each server, and since they are all the same size, we have replication pairs spread pretty randomly around the hosts, so the failure of any one drive unit (SCSI attached SATA) or imap server doesn't significantly overload any one other machine. By using Cyrus replication rather than, say, DRBD, a filesystem corruption should only affect a single partition, which won't take so long to fsck.

Moving users is easy - we run a sync_server on the Cyrus master, and just create a custom config directory with symlinks into the tree on the real server and a rewritten piece of mailboxes.db so we can rename them during the move if needed. It's all automatic.

We also have a "CheckReplication" perl module that can be used to compare two ends to make sure everything is the same.
It does full per-message flags checks, random sha1 integrity checks, etc. Does require a custom patch to expose the GUID (as DIGEST.SHA1) via IMAP.

I lost an entire drive unit on the 26th. It stopped responding. 8 x 1TB drives in it.

I tried rebooting everything, then switched the affected stores over to their replicas. Total downtime for those users of about 15 minutes, because I tried the reboot first just in case (there's a chance that some messages were delivered and not yet replicated, so it's better not to bring up the replica uncleanly until you're sure there's no other choice).

In the end I decided that it wasn't recoverable quickly enough to be viable, so chose new replica pairs for the slots that had been on that drive unit (we keep some empty space on our machines for just this eventuality) and started up another handy little script "sync_all_users" which runs sync_client -u for every user, then starts the rolling sync_client again at the end. It took about 16 hours to bring everything back to fully replicated again.

Bron.
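Bron's "sync_all_users" is a FastMail-internal script, but the idea he describes - one sync_client -u run per user, then restart rolling replication - can be sketched like this. The user-list extraction from ctl_mboxlist output is an assumption, and DRY_RUN=1 only prints the commands:

```shell
# Hypothetical sketch of a "sync_all_users" helper: replicate every
# user once, then resume rolling replication.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi; }

list_users() {
    # dump mailboxes.db and keep the top-level user.* entries
    # (dump format assumed: mailbox name in the first column)
    ctl_mboxlist -d 2>/dev/null | awk '{print $1}' \
        | sed -n 's/^user\.\([^.]*\)$/\1/p' | sort -u
}

for u in $(list_users); do
    run sync_client -u "$u"
done
run sync_client -r   # restart the rolling sync
```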
Re: choosing a file system
LALOT Dominique wrote:
> Hello,
>
> We are using cyrus-imap for a long time. Our architecture is a SAN from EMC and thanks to our "DELL support" we are obliged to install redhat. The only option we have is to use ext3fs on rather old kernels. We have 4000 accounts for staff and 2 for students
> The system is rather fast and reliable. BUT..

We support ~8000 faculty and staff and ~45000 students, on 16x 250G reiserfs 'partitions' from an EMC CX500 array. Reiserfs has proven to handle the load much better than ext3 (which we tested... it was a disaster). We've been using reiserfs since RedHat Linux 7.x. We also tested an early xfs patchset... but it was prone to corruption (but that was years ago).

> Once, there was a bad shutdown corrupting ext3fs and we spent 6 hours on an fsck.
> Next we discovered that our backup system was going slower and slower. We just pointed out that it was due to fragmentation, and guess what, there's no online defrag tool for ext3.

We've only had to reiserfsck a partition once (with --rebuild-tree, eek!). It took a while, but the data was intact... it beats restoring from tape.

We don't defragment (as such). In an attempt to speed up overnight backups we once did a scripted rename of mailboxes to spare partitions. Since then we have given up on filesystem-based backup and simply do a block-level backup in combination with partition snapshots. Keeping the cyrus partition size low has limited many of our problems, and we do a scripted rename of mailboxes to balance partition utilization when we add another partition.

Bron and the fastmail guys could tell you more about reiserfs... we've used RH&SuSE/reiserfs/EMC for quite a while and we are very happy. Except those loony folks who want Exchange...
Shawn
--
Shawn Nock (OpenPGP: 0xFF7D08A3)
Unix Systems Group; UITS
University of Arizona
nock at email.arizona.edu
Re: choosing a file system
On Tue, 30 Dec 2008, LALOT Dominique wrote:
> Hello,
>
> We are using cyrus-imap for a long time. Our architecture is a SAN from EMC and thanks to our "DELL support" we are obliged to install redhat. The only option we have is to use ext3fs on rather old kernels. We have 4000 accounts for staff and 2 for students
> The system is rather fast and reliable. BUT..

We also have a Dell/EMC SAN (currently a CX500, but upgrading to a CX4-240 soon). I'd like to dispel any rumors about SAN support, though. Dell will support pretty much any combination of software and hardware that has been validated by EMC. This includes RedHat, Suse, and Solaris that I'm aware of, plus more I'm sure. Now, if you want to get support for the operating system itself from Dell, then you are probably limited to RedHat. I know a lot of folks like to get their entire environment supported by a single vendor, but that can really limit your choices too.

We run Solaris 10 and Debian Linux with our CX500. Dell helped us set up the Emulex HBA in the Solaris 10 boxes and connected it to the SAN. During the initial setup of the SAN, I installed Suse Enterprise on one of our servers so I could see what they did to install the Qlogic HBA and set up the SAN connection. After they left, I blew it away and installed Debian Linux. It's not "supported" by Dell/EMC, but this is all standardized hardware and software. It works great with the kernel-included Qlogic drivers and even with standard Linux multipathing.

> Once, there was a bad shutdown corrupting ext3fs and we spent 6 hours on an fsck.
> Next we discovered that our backup system was going slower and slower. We just pointed out that it was due to fragmentation, and guess what, there's no online defrag tool for ext3.

How did you determine that it was due to fragmentation? We use ext3 here as well, so I'm curious.
> I'm looking for other solutions:
> ext4fs (does somebody use such filesystem?), xfs
> zfs (but we should switch to solaris or freebsd and throw away our costly SAN)

No need to throw away your SAN if you switch to another OS, see above. :)

Andy
Re: choosing a file system
John,

No, that was due to fragmentation. A fresh copy (one night to copy, then 2 hours to backup - 6 times faster than before) solved that problem. There's a filefrag utility, and for some mailboxes fragmentation was over 60%. I have 3 500Mo spools at the moment, and one is left for the copy. You copy your data first, then small files get deleted at random and the holes are filled at random. Ext4 is said to do delayed allocation, in order to have a decent idea of the file size when writing to disk.

Dom

2008/12/30 John Madden:
> [Full quote of John's message trimmed; it appears as an original message elsewhere in this thread.]

--
Dominique LALOT
Ingénieur Systèmes et Réseaux
http://annuaire.univmed.fr/showuser?uid=lalot
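The filefrag check Dom mentions can be scripted to spot the worst offenders. A sketch: the spool path and the 32-extent threshold are arbitrary assumptions, and filefrag comes from e2fsprogs:

```shell
# Sketch: flag heavily fragmented files in the spool with filefrag.
# SPOOL and THRESHOLD are assumptions.
SPOOL=${SPOOL:-/var/spool/imap}
THRESHOLD=${THRESHOLD:-32}

# parse the extent count out of a line like "/path: 12 extents found"
frag_extents() { awk '{print $2}'; }

find "$SPOOL" -type f 2>/dev/null | while read -r f; do
    n=$(filefrag "$f" 2>/dev/null | frag_extents)
    [ -n "$n" ] && [ "$n" -gt "$THRESHOLD" ] && echo "$f: $n extents"
done
```

Running this before and after a "fresh copy" like Dom's would quantify how much the copy actually defragmented the spool.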
Re: choosing a file system
Robert Banz wrote:
> At my last job, we had explored a Dell/EMC SAN at one point. Those folks don't seem to understand the idea that Fibre Channel is a well established standard -- they only expect you to connect their supported stack of hardware and software, otherwise they don't wanna talk.

Regarding support as described in the support contract, you are right - but I have had many EMC big-iron SAN devices running without a problem with Solaris 10. You have to adapt scsi_vhci.conf if you want symmetric multipathing, as Sun does not recognize many of the FC devices out there that can handle symmetric links.

ZFS with SAN devices is perfectly OK. We have 33 million files on our (single!) ZFS mail pool, running gzip compression (Solaris 10 Patch 137137-09 resp. 137138-09). Our Tivoli Storage Manager (TSM) backup runs every night for approximately three hours. Within these 3 hours it scans all files. We do a zfs snapshot every day and we keep 14 days of snapshots to restore mailboxes. We are not conservative enough to run scrub regularly; the last time I did was last week, without any error.

A happy and successful 2009 for all of you!

Pascal
--
Pascal Gienger
pas...@southbrain.com
http://southbrain.com/
Re: choosing a file system
We run Solaris 10 on our Cyrus mail-store backends. The mail is stored in a ZFS pool. The ZFS pool is composed of 4 SAN volumes in RAID-10. The active and failover servers of each backend pair have "fiber multipath" enabled, so their dual connections to the SAN switch ensure that if an HBA or SAN switch fails there is no downtime.

Once a month we run a scrub while the systems are online. Never having to run fsck EVER AGAIN is a good thing. The scrub is run during a weekend, and not during a backup window, to be paranoid, since it does keep the disks busy for some hours - but it never impacts performance.

Using ZFS also allows us easy & CHEAP snapshots. We keep 14 days' worth of snapshots in the pool and that handles 99% of restore requests. We also run backups to tape once a week from the most recent snapshot.

LALOT Dominique wrote:
> [Full quote of Dominique's original question trimmed; see his original message in this thread.]
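The daily-snapshot, 14-day-retention scheme described above could be driven by a small cron script. A dry-run sketch: the pool/filesystem name is an assumption, DRY_RUN=1 only prints the commands, and the date arithmetic assumes GNU date:

```shell
# Hypothetical sketch of the snapshot rotation described above.
FS=${FS:-mailpool/spool}
KEEP=${KEEP:-14}
DRY_RUN=${DRY_RUN:-1}

run() { if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi; }

# take today's snapshot, named by date
run zfs snapshot "$FS@$(date +%Y-%m-%d)"

# drop the snapshot that has just aged out (GNU date for the arithmetic)
run zfs destroy "$FS@$(date -d "$KEEP days ago" +%Y-%m-%d)"

# the monthly online scrub is a separate cron entry, against the pool:
run zpool scrub "${FS%%/*}"
```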
Re: choosing a file system
> Once, there was a bad shutdown corrupting ext3fs and we spent 6 hours on an fsck.
> Next we discovered that our backup system was going slower and slower. We just pointed out that it was due to fragmentation, and guess what, there's no online defrag tool for ext3.

Sure it isn't due to the number of files on those filesystems? File-level backups will slow down linearly as the filesystems grow, of course. I "solve" this by adding more spools (up to 8 at the moment with about 350k mailboxes) so they can be backed up in parallel. All on ext3.

John
--
John Madden
Sr. UNIX Systems Engineer
Ivy Tech Community College of Indiana
jmad...@ivytech.edu
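John's approach of backing up several spools in parallel can be sketched with plain shell job control. The spool paths are assumptions, and run_backup is a stand-in for whatever file-level backup command is actually used:

```shell
# Sketch: run one backup job per spool partition in parallel, then wait.
# run_backup is a stand-in for the real backup command.
run_backup() { echo "backing up $1"; }

backup_all() {
    for spool in "$@"; do
        run_backup "$spool" &
    done
    wait   # block until every background job finishes
}

backup_all /var/spool/imap1 /var/spool/imap2 /var/spool/imap3
```

The wall-clock win only materializes if the spools sit on independent spindles/LUNs; parallel jobs against one disk just seek-thrash.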
Re: choosing a file system
On Dec 30, 2008, at 9:06 AM, Pascal Gienger wrote:
> LALOT Dominique wrote:
>> zfs (but we should switch to solaris or freebsd and throw away our costly SAN)
>
> Why that? SAN volumes are running very fine with Solaris 10 hosts (SPARC and x86). You have extended multipathing (symmetric and asymmetric) onboard. Solaris accepts nearly all Q-Logic FC cards (according to my experience).

At my last job, we had explored a Dell/EMC SAN at one point. Those folks don't seem to understand the idea that Fibre Channel is a well established standard -- they only expect you to connect their supported stack of hardware and software, otherwise they don't wanna talk.

-rob
Re: choosing a file system
LALOT Dominique wrote:
> zfs (but we should switch to solaris or freebsd and throw away our costly SAN)

Why that? SAN volumes are running very fine with Solaris 10 hosts (SPARC and x86). You have extended multipathing (symmetric and asymmetric) onboard. Solaris accepts nearly all Q-Logic FC cards (according to my experience).

Pascal
Re: choosing a file system
On Dec 30, 2008, at 8:49 AM, LALOT Dominique wrote:
> [Full quote of Dominique's original question trimmed; see his message below.]

Run Solaris, but keep a machine on the SAN with that old version of RedHat that you can use to replicate any problems you have? ;)

-rob
choosing a file system
Hello,

We have been using cyrus-imap for a long time. Our architecture is a SAN from EMC, and thanks to our "DELL support" we are obliged to install RedHat. The only option we have is to use ext3fs on rather old kernels. We have 4000 accounts for staff and 2 for students. The system is rather fast and reliable. BUT..

Once, there was a bad shutdown corrupting ext3fs and we spent 6 hours on an fsck. Next we discovered that our backup system was going slower and slower. We eventually determined that it was due to fragmentation, and guess what, there's no online defrag tool for ext3.

I'm looking for other solutions:
ext4fs (does somebody use such a filesystem?), xfs
zfs (but we would have to switch to Solaris or FreeBSD and throw away our costly SAN)
use a NetApp appliance (are you using such a device? NFS seems to be tricky with cyrus..)

Thanks for your advice

Dom
--
Dominique LALOT
Ingénieur Systèmes et Réseaux
http://annuaire.univmed.fr/showuser?uid=lalot