Re: Can we build a proper email cluster? (was: Re: Why is debian.org email so unreliable?)
On Sat, 16 Oct 2004 14:00:57 +0200, Marcin wrote in message <[EMAIL PROTECTED]>:

> On Sat, Oct 16, 2004 at 09:29:32PM +1000, Russell Coker wrote:
> > On Fri, 15 Oct 2004 23:33, Arnt Karlsen <[EMAIL PROTECTED]> wrote:
> > > > On Fri, 15 Oct 2004 03:19, Arnt Karlsen <[EMAIL PROTECTED]> wrote:
> > > > > > Increasing the number of machines increases the probability
> > > > > > of one machine failing for any given time period. Also it
> > > > > > makes it more difficult to debug problems as you can't
> > > > > > always be certain of which machine was involved.
> > > > >
> > > > > ..very true, even for aero engines. The reason the airlines
> > > > > like 2, 3 or even 4 rather than one jet.
> > > >
> > > > You seem to have entirely misunderstood what I wrote.
> > >
> > > ..really? Compare with your average automobile accident and
> > > see who has the more adequate safety philosophy.
> >
> > If one machine has a probability of failure of 0.1 over a particular
> > time period then the probability of at least one machine failing if
> > there are two servers in the cluster over that same time period is
> > 1-0.9*0.9 == 0.19.
>
> But do we really care about whether a "machine" fails? I'd rather say
> that what we want to minimize is the _service_ downtime.
>
> With one machine, the possibility of the service being unavailable is
> 0.1. With two machines it's equal to the possibility of both machines
> failing at the same time, so it's 0.1*0.1 == 0.01, as long as the
> possibilities are independent (not sure if that's the right translation
> of the term).
>
> Or am I wrong in the first sentence?
>
> Otherwise, I'd say that the increase of availability is worth the
> additional debugging effort :-)

..email is a lot like Zeppelin transportation: even if the service stops, nothing is lost but propulsion, unlike common jet airliners, which promptly drop out of the sky to ditch in the drink or on the rocks unless the aircrew manages another Gimli glide.
-- 
..med vennlig hilsen = with Kind Regards from Arnt... ;-)
...with a number of polar bear hunters in his ancestry...
  Scenarios always come in sets of three:
  best case, worst case, and just in case.

-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED] with a subject of "unsubscribe". Trouble? Contact [EMAIL PROTECTED]
Re: Can we build a proper email cluster? (was: Re: Why is debian.org email so unreliable?)
On Sat, 16 Oct 2004 21:29:32 +1000, Russell wrote in message <[EMAIL PROTECTED]>:

> On Fri, 15 Oct 2004 23:33, Arnt Karlsen <[EMAIL PROTECTED]> wrote:
> > > On Fri, 15 Oct 2004 03:19, Arnt Karlsen <[EMAIL PROTECTED]> wrote:
> > > > > Increasing the number of machines increases the probability of
> > > > > one machine failing for any given time period. Also it makes
> > > > > it more difficult to debug problems as you can't always be
> > > > > certain of which machine was involved.
> > > >
> > > > ..very true, even for aero engines. The reason the airlines
> > > > like 2, 3 or even 4 rather than one jet.
> > >
> > > You seem to have entirely misunderstood what I wrote.
> >
> > ..really? Compare with your average automobile accident and
> > see who has the more adequate safety philosophy.
>
> If one machine has a probability of failure of 0.1 over a particular
> time period then the probability of at least one machine failing if
> there are two servers in the cluster over that same time period is
> 1-0.9*0.9 == 0.19.
>
> > [EMAIL PROTECTED], "2 boxes watching each other" or some such, will give
> > that "Ok, I'll have a look some time next week" peace of mind,
> > and we don't need symmetric power here, one big and one or
> > more small ones will do fine
>
> Have you ever actually run an ISP?

..no, I'm an aeronautical engineer and I like Zeppelins. ;-)

-- 
..med vennlig hilsen = with Kind Regards from Arnt... ;-)
...with a number of polar bear hunters in his ancestry...
  Scenarios always come in sets of three:
  best case, worst case, and just in case.
Re: Documentation of big "mail systems"?
On Sat, 16 Oct 2004, Russell Coker wrote:
> Getting servers that each have 200G or 300G of storage is easy. Local

Make it a few TBs...

> though). Having multiple back-end servers with local disks reduces the risks
> (IMHO). There's less cables for idiots to trip over or otherwise break

It depends on the total number of disks you end up with in your server room, I think. If that number grows too large, very good SAN hardware will decrease the chances of service downtime, because the SAN hardware and disks are better protected against catastrophic failures and do predictive failure analysis properly. And a big SAN is much easier to manage than hundreds of servers with many disks each, of different types.

OTOH, any SAN hardware that does not have at least full double redundancy is NOT a good idea at all.

> one back-end server go down and take out 1/7 of the mail boxes would be
> annoying, but a lot less annoying than a SAN problem taking it all out.

And that can happen, too. I think it did happen to an ISP around here, to their big, expensive as all heck EMC hardware. It is rather rare, and the service contract usually states that the SAN manufacturer is responsible for damages caused by such long downtimes...

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh
Re: Documentation of big "mail systems"?
On Sat, Oct 16, 2004 at 09:41:43PM +1000, Russell Coker wrote:
> There's less cables for idiots to trip over or otherwise break
> (don't ask),

I dare to ask :-)

Marcin

-- 
Marcin Owsiany <[EMAIL PROTECTED]> http://marcin.owsiany.pl/
GnuPG: 1024D/60F41216 FE67 DA2D 0ACA FC5E 3F75 D6F6 3A0D 8AA0 60F4 1216
Re: Can we build a proper email cluster? (was: Re: Why is debian.org email so unreliable?)
On Sat, Oct 16, 2004 at 09:29:32PM +1000, Russell Coker wrote:
> On Fri, 15 Oct 2004 23:33, Arnt Karlsen <[EMAIL PROTECTED]> wrote:
> > > On Fri, 15 Oct 2004 03:19, Arnt Karlsen <[EMAIL PROTECTED]> wrote:
> > > > > Increasing the number of machines increases the probability of one
> > > > > machine failing for any given time period. Also it makes it more
> > > > > difficult to debug problems as you can't always be certain of
> > > > > which machine was involved.
> > > >
> > > > ..very true, even for aero engines. The reason the airlines like
> > > > 2, 3 or even 4 rather than one jet.
> > >
> > > You seem to have entirely misunderstood what I wrote.
> >
> > ..really? Compare with your average automobile accident and
> > see who has the more adequate safety philosophy.
>
> If one machine has a probability of failure of 0.1 over a particular time
> period then the probability of at least one machine failing if there are two
> servers in the cluster over that same time period is 1-0.9*0.9 == 0.19.

But do we really care about whether a "machine" fails? I'd rather say that what we want to minimize is the _service_ downtime.

With one machine, the possibility of the service being unavailable is 0.1. With two machines it's equal to the possibility of both machines failing at the same time, so it's 0.1*0.1 == 0.01, as long as the possibilities are independent (not sure if that's the right translation of the term).

Or am I wrong in the first sentence?

Otherwise, I'd say that the increase of availability is worth the additional debugging effort :-)

Marcin

-- 
Marcin Owsiany <[EMAIL PROTECTED]> http://marcin.owsiany.pl/
GnuPG: 1024D/60F41216 FE67 DA2D 0ACA FC5E 3F75 D6F6 3A0D 8AA0 60F4 1216
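
The two figures in this exchange (0.19 and 0.01) answer different questions: the first is the chance that at least one machine needs attention, the second is the chance that every machine is down at once. A minimal Python sketch of both, assuming failures are independent, that any single surviving machine can carry the service, and using the illustrative per-machine probability of 0.1 from the thread (the function names are hypothetical, chosen only for this example):

```python
def p_any_fails(p: float, n: int) -> float:
    """Probability that at least one of n independent machines fails."""
    return 1 - (1 - p) ** n


def p_all_fail(p: float, n: int) -> float:
    """Probability that all n independent machines fail in the same period.

    Under the assumption that one surviving machine is enough to keep the
    service up, this is the probability of a service outage.
    """
    return p ** n


if __name__ == "__main__":
    p = 0.1  # illustrative per-machine failure probability from the thread
    for n in (1, 2, 3):
        print(f"n={n}: any fails = {p_any_fails(p, n):.3f}, "
              f"all fail = {p_all_fail(p, n):.3f}")
```

For n=2 this reproduces both numbers quoted above. Adding machines raises the first figure (more repair and debugging work) while lowering the second (less service downtime), but only to the extent that the independence assumption holds; shared power, network or storage would weaken it.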
Re: Documentation of big "mail systems"?
On Fri, 15 Oct 2004 20:08, Paul Dwerryhouse <[EMAIL PROTECTED]> wrote:
> On Fri, Oct 15, 2004 at 06:56:21PM +1000, Russell Coker wrote:
> > The machines were all running 2.4.2x last time I was there, but they
> > may be moving to 2.6.x now.
>
> All the stores, relays and proxies are still on 2.4.x, but the LDAP
> servers are now on 2.6.x (mainly because I could, not for any technical
> reason. At the time I upgraded them I had enough redundancy to go around
> that the downtime didn't affect anything).

In that case you should get the 4/4 kernel patch; it will make a huge improvement to your LDAP rebuild times, which can come in handy in an emergency. From memory I had the slave machines rebuilding in about 15 minutes. I expect that I could get it down to 5 minutes with a 4/4 kernel, and less if the machine has 6G of RAM or more. For 4/4 the easiest thing to do is probably to get the Fedora kernel.

> Four perdition/apache/imp servers now, rather than three. The webmail is
> rather popular now, and three servers couldn't cut it on their own
> anymore.

Is there any way to optimise PHP for speed? Maybe PHP5 is worth trying?

> Seven backend mailstores now, and I really want an eighth, but can't get
> anyone to pay for it.

I still think that using a umem device for journals is the right thing to do. You should be able to double performance by putting a umem device in each machine. It'll cost less than half as much as a new server to put a umem device in each machine, and give much more performance.

I recall that none of those machines was even close to running out of disk space. You could probably handle the current load with 4 back-end machines if you used umem devices.
-- 
http://www.coker.com.au/selinux/ My NSA Security Enhanced Linux packages
http://www.coker.com.au/bonnie++/ Bonnie++ hard drive benchmark
http://www.coker.com.au/postal/ Postal SMTP/POP benchmark
http://www.coker.com.au/~russell/ My home page
Re: Documentation of big "mail systems"?
On Sat, 16 Oct 2004 02:02, Christoph Moench-Tegeder <[EMAIL PROTECTED]> wrote:
> ## Henrique de Moraes Holschuh ([EMAIL PROTECTED]):
> > > So, now we would like Russell to explain why he does not like SAN.
> >
> > He probably doesn't advocate using SAN instead of local disks if you do
> > not have a good reason to use SAN. If that's it, I *do* agree with him.
> > Don't use SANs just for the heck of it. Even external JBOD enclosures
> > are a bad idea if you don't need them.
>
> Of course. Buying SAN for a single mailserver is not worth the money.
> Think of money per gigabyte and the extra trouble of managing your
> SAN, local disks are much easier to handle.

Exactly. Getting servers that each have 200G or 300G of storage is easy. Local storage is expected to be faster than SAN (never had a chance to benchmark it though). Having multiple back-end servers with local disks reduces the risks (IMHO). There's less cables for idiots to trip over or otherwise break (don't ask), and no single point of failure for the entire network. Having one back-end server go down and take out 1/7 of the mail boxes would be annoying, but a lot less annoying than a SAN problem taking it all out.

For recovery I would prefer to have a spare system and hot-swap disks. If there's a serious problem then swap the disks into an identical machine that's already connected. Down time is the time taken to get a taxi to the server room.

-- 
http://www.coker.com.au/selinux/ My NSA Security Enhanced Linux packages
http://www.coker.com.au/bonnie++/ Bonnie++ hard drive benchmark
http://www.coker.com.au/postal/ Postal SMTP/POP benchmark
http://www.coker.com.au/~russell/ My home page
Re: Can we build a proper email cluster? (was: Re: Why is debian.org email so unreliable?)
On Fri, 15 Oct 2004 23:33, Arnt Karlsen <[EMAIL PROTECTED]> wrote:
> > On Fri, 15 Oct 2004 03:19, Arnt Karlsen <[EMAIL PROTECTED]> wrote:
> > > > Increasing the number of machines increases the probability of one
> > > > machine failing for any given time period. Also it makes it more
> > > > difficult to debug problems as you can't always be certain of
> > > > which machine was involved.
> > >
> > > ..very true, even for aero engines. The reason the airlines like
> > > 2, 3 or even 4 rather than one jet.
> >
> > You seem to have entirely misunderstood what I wrote.
>
> ..really? Compare with your average automobile accident and
> see who has the more adequate safety philosophy.

If one machine has a probability of failure of 0.1 over a particular time period then the probability of at least one machine failing if there are two servers in the cluster over that same time period is 1-0.9*0.9 == 0.19.

> [EMAIL PROTECTED], "2 boxes watching each other" or some such, will give
> that "Ok, I'll have a look some time next week" peace of mind,
> and we don't need symmetric power here, one big and one or
> more small ones will do fine

Have you ever actually run an ISP?

-- 
http://www.coker.com.au/selinux/ My NSA Security Enhanced Linux packages
http://www.coker.com.au/bonnie++/ Bonnie++ hard drive benchmark
http://www.coker.com.au/postal/ Postal SMTP/POP benchmark
http://www.coker.com.au/~russell/ My home page