Re: Can we build a proper email cluster? (was: Re: Why is debian.org email so unreliable?)

2004-10-16 Thread Russell Coker
On Fri, 15 Oct 2004 23:33, Arnt Karlsen [EMAIL PROTECTED] wrote:
> > On Fri, 15 Oct 2004 03:19, Arnt Karlsen [EMAIL PROTECTED] wrote:
> > > > Increasing the number of machines increases the probability of one
> > > > machine failing for any given time period.  Also it makes it more
> > > > difficult to debug problems as you can't always be certain of
> > > > which machine was involved.
> > >
> > > ..very true, even for aero engines.  The reason the airlines like
> > > 2, 3 or even 4 rather than one jet.
> >
> > You seem to have entirely misunderstood what I wrote.
>
> ..really?   Compare with your average automobile accident and
> see who has the more adequate safety philosophy.

If one machine has a probability of failure of 0.1 over a particular time 
period then the probability of at least one machine failing if there are two 
servers in the cluster over that same time period is 1-0.9*0.9 == 0.19.
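
A minimal sketch of that arithmetic, assuming the failures are independent
(the function name and figures simply restate the example above):

# Probability that at least one of n machines fails, when each machine
# independently fails with probability p over the same time period.
def p_any_failure(p, n):
    return 1 - (1 - p) ** n

print(p_any_failure(0.1, 1))   # ~0.1  for a single machine
print(p_any_failure(0.1, 2))   # ~0.19, i.e. 1 - 0.9*0.9 as above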

> [EMAIL PROTECTED], 2 boxes watching each other or some such, will give
> that "Ok, I'll have a look some time next week" peace of mind,
> and we don't need symmetric power here, one big and one or
> more small ones will do fine

Have you ever actually run an ISP?

-- 
http://www.coker.com.au/selinux/   My NSA Security Enhanced Linux packages
http://www.coker.com.au/bonnie++/  Bonnie++ hard drive benchmark
http://www.coker.com.au/postal/    Postal SMTP/POP benchmark
http://www.coker.com.au/~russell/  My home page





Re: Documentation of big mail systems?

2004-10-16 Thread Russell Coker
On Sat, 16 Oct 2004 02:02, Christoph Moench-Tegeder [EMAIL PROTECTED] wrote:
> ## Henrique de Moraes Holschuh ([EMAIL PROTECTED]):
> > > So, now we would like Russel to explain why he does not like SAN.
> >
> > He probably doesn't advocate using SAN instead of local disks if you do
> > not have a good reason to use SAN.  If that's it, I *do* agree with him.
> > Don't use SANs just for the heck of it.  Even external JBOD enclosures
> > are a bad idea if you don't need them.
>
> Of course. Buying SAN for a single mailserver is not worth the money.
> Think of money per gigabyte and the extra trouble of managing your
> SAN, local disks are much easier to handle.

Exactly.

Getting servers that each have 200G or 300G of storage is easy.  Local storage 
is expected to be faster than SAN (never had a chance to benchmark it 
though).  Having multiple back-end servers with local disks reduces the risks 
(IMHO).  There's less cables for idiots to trip over or otherwise break 
(don't ask), and no single point of failure for the entire network.  Having 
one back-end server go down and take out 1/7 of the mail boxes would be 
annoying, but a lot less annoying than a SAN problem taking it all out.
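
A back-of-the-envelope sketch of that blast-radius argument; the per-store
failure probability below is a made-up illustrative number, and only the
count of seven back-end stores comes from the setup described here:

# With n independent back-end stores, a single failure only takes out 1/n
# of the mailboxes; a problem on a shared SAN takes out all of them.
n_stores = 7
p_store = 0.1                                      # illustrative, per store

p_some_store_down = 1 - (1 - p_store) ** n_stores  # ~0.52: more likely overall...
blast_radius_store = 1.0 / n_stores                # ...but each hit is only 1/7
blast_radius_san = 1.0                             # a SAN problem takes it all

print("P(some store fails) = %.2f, each incident hits %.0f%% of mailboxes"
      % (p_some_store_down, 100 * blast_radius_store))
print("a SAN failure hits %.0f%% of mailboxes" % (100 * blast_radius_san))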

For recovery I would prefer to have a spare system and hot-swap disks.  If 
there's a serious problem then swap the disks into an identical machine 
that's already connected.  Down time is the time taken to get a taxi to the 
server room.

-- 
http://www.coker.com.au/selinux/   My NSA Security Enhanced Linux packages
http://www.coker.com.au/bonnie++/  Bonnie++ hard drive benchmark
http://www.coker.com.au/postal/    Postal SMTP/POP benchmark
http://www.coker.com.au/~russell/  My home page





Re: Documentation of big mail systems?

2004-10-16 Thread Russell Coker
On Fri, 15 Oct 2004 20:08, Paul Dwerryhouse [EMAIL PROTECTED] wrote:
> On Fri, Oct 15, 2004 at 06:56:21PM +1000, Russell Coker wrote:
> > The machines were all running 2.4.2x last time I was there, but they
> > may be moving to 2.6.x now.
>
> All the stores, relays and proxies are still on 2.4.x, but the LDAP
> servers are now on 2.6.x (mainly because I could, not for any technical
> reason. At the time I upgraded them I had enough redundancy to go around
> that the downtime didn't affect anything).

In that case you should get the 4/4 kernel patch; it will make a huge 
improvement to your LDAP rebuild times, which can come in handy in an 
emergency.  From memory I had the slave machines rebuilding in about 15 
minutes; I expect that I could get it down to 5 minutes with a 4/4 kernel, 
and less if the machine has 6G of RAM or more.

For 4/4 the easiest thing to do is probably to get the Fedora kernel.

> Four perdition/apache/imp servers now, rather than three. The webmail is
> rather popular now, and three servers couldn't cut it on their own
> anymore.

Is there any way to optimise PHP for speed?  Maybe PHP5 is worth trying?

> Seven backend mailstores now, and I really want an eighth, but can't get
> anyone to pay for it.

I still think that using a umem device for journals is the right thing to do.  
You should be able to double performance by putting a umem device in each 
machine, and doing so will cost less than half as much as a new server.

I recall that none of those machines was even close to running out of disk 
space.  You could probably handle the current load with 4 back-end machines 
if you used umem devices.

-- 
http://www.coker.com.au/selinux/   My NSA Security Enhanced Linux packages
http://www.coker.com.au/bonnie++/  Bonnie++ hard drive benchmark
http://www.coker.com.au/postal/    Postal SMTP/POP benchmark
http://www.coker.com.au/~russell/  My home page





Re: Can we build a proper email cluster? (was: Re: Why is debian.org email so unreliable?)

2004-10-16 Thread Marcin Owsiany
On Sat, Oct 16, 2004 at 09:29:32PM +1000, Russell Coker wrote:
> On Fri, 15 Oct 2004 23:33, Arnt Karlsen [EMAIL PROTECTED] wrote:
> > > On Fri, 15 Oct 2004 03:19, Arnt Karlsen [EMAIL PROTECTED] wrote:
> > > > > Increasing the number of machines increases the probability of one
> > > > > machine failing for any given time period.  Also it makes it more
> > > > > difficult to debug problems as you can't always be certain of
> > > > > which machine was involved.
> > > >
> > > > ..very true, even for aero engines.  The reason the airlines like
> > > > 2, 3 or even 4 rather than one jet.
> > >
> > > You seem to have entirely misunderstood what I wrote.
> >
> > ..really?   Compare with your average automobile accident and
> > see who has the more adequate safety philosophy.
>
> If one machine has a probability of failure of 0.1 over a particular time
> period then the probability of at least one machine failing if there are two
> servers in the cluster over that same time period is 1-0.9*0.9 == 0.19.

But do we really care about whether a machine fails? I'd rather say
that what we want to minimize is the _service_ downtime.

With one machine, the probability of the service being unavailable is
0.1. With two machines it's equal to the probability of both machines
failing at the same time, so it's 0.1*0.1 == 0.01, as long as the
failures are independent (not sure if that's the right translation
of the term).

Or am I wrong in the first sentence?

Otherwise, I'd say that the increase of availability is worth the
additional debugging effort :-)
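
A small sketch contrasting the two quantities being argued about here, using
the 0.1 figure from the thread and assuming independent failures and that
either machine alone can carry the service:

# Two different questions about a 2-machine cluster where each machine
# fails with probability p over some time period:
p = 0.1

p_any_machine_fails = 1 - (1 - p) ** 2   # ~0.19: more hardware to fix and debug
p_service_down = p ** 2                  # ~0.01: both machines down at once

print("P(at least one machine fails) = %.2f" % p_any_machine_fails)
print("P(service is unavailable)     = %.2f" % p_service_down)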

Marcin
-- 
Marcin Owsiany [EMAIL PROTECTED] http://marcin.owsiany.pl/
GnuPG: 1024D/60F41216  FE67 DA2D 0ACA FC5E 3F75  D6F6 3A0D 8AA0 60F4 1216





Re: Documentation of big mail systems?

2004-10-16 Thread Marcin Owsiany
On Sat, Oct 16, 2004 at 09:41:43PM +1000, Russell Coker wrote:
> There's less cables for idiots to trip over or otherwise break
> (don't ask),

I dare to ask :-)

Marcin
-- 
Marcin Owsiany [EMAIL PROTECTED] http://marcin.owsiany.pl/
GnuPG: 1024D/60F41216  FE67 DA2D 0ACA FC5E 3F75  D6F6 3A0D 8AA0 60F4 1216





Re: Documentation of big mail systems?

2004-10-16 Thread Henrique de Moraes Holschuh
On Sat, 16 Oct 2004, Russell Coker wrote:
> Getting servers that each have 200G or 300G of storage is easy.  Local

Make it a few TBs...

> though).  Having multiple back-end servers with local disks reduces the risks
> (IMHO).  There's less cables for idiots to trip over or otherwise break

It depends on the total number of disks you end up with in your server
room, I think.  If it grows too large, very good SAN hardware will
decrease the chances of service downtime, because the SAN hardware and disks
are better protected against catastrophic failures, and do predictive
failure analysis right.  And a big SAN is much easier to manage than
hundreds of servers, each with many disks of different types.

OTOH, any SAN hardware that does not have at least full double redundancy
is NOT a good idea at all.

> one back-end server go down and take out 1/7 of the mail boxes would be
> annoying, but a lot less annoying than a SAN problem taking it all out.

And that can happen, too.  I think it did happen to an ISP around here, to
their big, expensive as all heck EMC hardware.  It is rather rare, and the
service contract usually states that the SAN manufacturer is responsible for
damages on such long downtimes...

-- 
  One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie. -- The Silicon Valley Tarot
  Henrique Holschuh





Re: Can we build a proper email cluster? (was: Re: Why is debian.org email so unreliable?)

2004-10-16 Thread Arnt Karlsen
On Sat, 16 Oct 2004 21:29:32 +1000, Russell wrote in message 
[EMAIL PROTECTED]:

> On Fri, 15 Oct 2004 23:33, Arnt Karlsen [EMAIL PROTECTED] wrote:
> > > On Fri, 15 Oct 2004 03:19, Arnt Karlsen [EMAIL PROTECTED] wrote:
> > > > > Increasing the number of machines increases the probability of
> > > > > one machine failing for any given time period.  Also it makes
> > > > > it more difficult to debug problems as you can't always be
> > > > > certain of which machine was involved.
> > > >
> > > > ..very true, even for aero engines.  The reason the airlines
> > > > like 2, 3 or even 4 rather than one jet.
> > >
> > > You seem to have entirely misunderstood what I wrote.
> >
> > ..really?   Compare with your average automobile accident and
> > see who has the more adequate safety philosophy.
>
> If one machine has a probability of failure of 0.1 over a particular
> time period then the probability of at least one machine failing if
> there are two servers in the cluster over that same time period is
> 1-0.9*0.9 == 0.19.
>
> > [EMAIL PROTECTED], 2 boxes watching each other or some such, will give
> > that "Ok, I'll have a look some time next week" peace of mind,
> > and we don't need symmetric power here, one big and one or
> > more small ones will do fine
>
> Have you ever actually run an ISP?

..no, I'm an aeronautical engineer and like Zeppeliners.  ;-)

-- 
..med vennlig hilsen = with Kind Regards from Arnt... ;-)
...with a number of polar bear hunters in his ancestry...
  Scenarios always come in sets of three: 
  best case, worst case, and just in case.





Re: Can we build a proper email cluster? (was: Re: Why is debian.org email so unreliable?)

2004-10-16 Thread Arnt Karlsen
On Sat, 16 Oct 2004 14:00:57 +0200, Marcin wrote in message 
[EMAIL PROTECTED]:

> On Sat, Oct 16, 2004 at 09:29:32PM +1000, Russell Coker wrote:
> > On Fri, 15 Oct 2004 23:33, Arnt Karlsen [EMAIL PROTECTED] wrote:
> > > > On Fri, 15 Oct 2004 03:19, Arnt Karlsen [EMAIL PROTECTED] wrote:
> > > > > > Increasing the number of machines increases the probability
> > > > > > of one machine failing for any given time period.  Also it
> > > > > > makes it more difficult to debug problems as you can't
> > > > > > always be certain of which machine was involved.
> > > > >
> > > > > ..very true, even for aero engines.  The reason the airlines
> > > > > like 2, 3 or even 4 rather than one jet.
> > > >
> > > > You seem to have entirely misunderstood what I wrote.
> > >
> > > ..really?   Compare with your average automobile accident and
> > > see who has the more adequate safety philosophy.
> >
> > If one machine has a probability of failure of 0.1 over a particular
> > time period then the probability of at least one machine failing if
> > there are two servers in the cluster over that same time period is
> > 1-0.9*0.9 == 0.19.
>
> But do we really care about whether a machine fails? I'd rather say
> that what we want to minimize is the _service_ downtime.
>
> With one machine, the probability of the service being unavailable is
> 0.1. With two machines it's equal to the probability of both machines
> failing at the same time, so it's 0.1*0.1 == 0.01, as long as the
> failures are independent (not sure if that's the right translation
> of the term).
>
> Or am I wrong in the first sentence?
>
> Otherwise, I'd say that the increase of availability is worth the
> additional debugging effort :-)

..email is a lot like Zeppeliner transportation: even if these services
stop, there is no loss other than propulsion, unlike common jet
airliners promptly dropping outta the sky to ditch in the drink or on the
rocks, unless the aircrew manages to do another Gimli glide.

-- 
..med vennlig hilsen = with Kind Regards from Arnt... ;-)
...with a number of polar bear hunters in his ancestry...
  Scenarios always come in sets of three: 
  best case, worst case, and just in case.

