Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-14 Thread Ed W

On 14/04/2012 04:48, Stan Hoeppner wrote:

On 4/13/2012 10:31 AM, Ed W wrote:


You mean those answers like:
 you need to read 'those' articles again

Referring to some unknown and hard to find previous emails is not the
same as answering?

No, referring to this:

On 4/12/2012 5:58 AM, Ed W wrote:


The claim by ZFS/BTRFS authors and others is that data silently bit
rots on its own.

Is it not a correct assumption that you read this in articles?  If you
read this in books, scrolls, or chiseled tablets, my apologies for
assuming it was articles.



WHAT?!!  The original context was that you wanted me to learn some very 
specific thing that you accused me of misunderstanding, and then it 
turns out that the thing I'm supposed to learn comes from re-reading 
every email, every blog post, every video, every slashdot post, every 
wiki, every ... that mentions ZFS's reason for including end to end 
checksumming?!!


Please stop wasting our time and get specific

You have taken my email, which contained a specific question that has been 
asked of you multiple times now, and yet you insist on only answering 
irrelevant details with a pointed and personal dig in each answer.  The 
rudeness is unnecessary, and your evasiveness in answering does not fill 
me with confidence that you actually know the answer...


For the benefit of anyone reading this via email archives or whatever, I 
think the conclusion we have reached is that modern systems are now a) 
a complex sum of pieces, any of which can cause an error to be injected, 
and b) the error rates that the originally specified levels of error 
correction were designed to handle are now starting to be reached in real 
systems, possibly even consumer systems.  There is no complete solution; 
however, the first step is to enhance detection.  Various solutions have 
been proposed, all of which increase cost or computation or have some other 
disadvantage - however, one of the more promising detection mechanisms is 
an end-to-end checksum, which has the effect of covering ALL the steps in 
the chain, not just one specific step.  As of today, only a few filesystems 
offer this; roll on more adopting it.
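
To make the "end to end" point concrete, here is a minimal sketch (my own 
illustration in Python, not anything Dovecot or ZFS actually ships) of a 
checksum computed where the application writes the data and verified where 
it reads it back, so a flip introduced anywhere in between is detected:

    import hashlib

    def write_with_checksum(path, data):
        # Digest computed *before* the data enters the storage stack
        # (filesystem, block layer, controller, drive, ...).
        digest = hashlib.sha256(data).hexdigest()
        with open(path, "wb") as f:
            f.write(data)
        with open(path + ".sha256", "w") as f:
            f.write(digest)

    def read_with_checksum(path):
        # Digest re-computed after the data has travelled back up the
        # stack; a silent flip at any step in between changes it.
        with open(path, "rb") as f:
            data = f.read()
        with open(path + ".sha256") as f:
            expected = f.read().strip()
        if hashlib.sha256(data).hexdigest() != expected:
            raise IOError("silent corruption detected in %s" % path)
        return data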


Regards

Ed W


Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-14 Thread Jan-Frode Myklebust
On Fri, Apr 13, 2012 at 07:33:19AM -0500, Stan Hoeppner wrote:
  
  What I meant wasn't the drive throwing uncorrectable read errors but
  the drives returning different data that each thinks is correct, or
  both may have sent the correct data but one of the set got corrupted
  on the fly. After reading the articles posted, maybe the correct term
  would be the controller receiving silently corrupted data, say due to
  a bad cable on one.
 
 This simply can't happen.  What articles are you referring to?  If the
 author is stating what you say above, he simply doesn't know what he's
 talking about.

It has happened to me, with RAID5 not RAID1. It was a firmware bug
in the raid controller that caused the RAID array to become silently
corrupted. The HW reported everything green -- but the filesystem was
reporting lots of strange errors.  This LUN was part of a larger
filesystem striped over multiple LUNs, so parts of the fs were OK, while
other parts were corrupt.

It was this bug:

   
http://delivery04.dhe.ibm.com/sar/CMA/SDA/02igj/7/ibm_fw1_ds4kfc_07605200_anyos_anycpu.chg
   - Fix 432525 - CR139339  Data corruption found on drive after
 reconstruct from GHSP (Global Hot Spare)


snip

 In closing, I'll simply say this:  If hardware, whether a mobo-down SATA
 chip, or a $100K SGI SAN RAID controller, allowed silent data corruption
 or transmission to occur, there would be no storage industry, and we'll
 all still be using pen and paper.  The questions you're asking were
 solved by hardware and software engineers decades ago.  You're fretting
 and asking about things that were solved decades ago.

Look at what the plans are for your favorite fs:

http://www.youtube.com/watch?v=FegjLbCnoBw

They're planning on doing metadata checksumming to be sure they don't
receive corrupted metadata from the backend storage, and say that data
validation is a storage subsystem *or* application problem. 

Hardly a solved problem..


  -jf


Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-14 Thread Ed W

On 14/04/2012 04:31, Stan Hoeppner wrote:

On 4/13/2012 10:31 AM, Ed W wrote:

On 13/04/2012 13:33, Stan Hoeppner wrote:

In closing, I'll simply say this:  If hardware, whether a mobo-down SATA
chip, or a $100K SGI SAN RAID controller, allowed silent data corruption
or transmission to occur, there would be no storage industry, and we'll
all still be using pen and paper.  The questions you're asking were
solved by hardware and software engineers decades ago.  You're fretting
and asking about things that were solved decades ago.

So why are so many people getting excited about it now?

So many?  I know of one person getting excited about it.


You love being vague don't you?  Go on, I'll bite again, do you mean 
yourself?


:-)


Data densities and overall storage sizes and complexity at the top end
of the spectrum are increasing at a faster rate than the
consistency/validation mechanisms.  That's the entire point of the
various academic studies on the issue.


Again, you love being vague.  By your dismissive "academic studies" 
phrase, do you mean studies done on a major industrial player, i.e. NetApp 
in this case?  Or do you mean that it's rubbish because they asked 
someone with some background in statistics to do the work, rather than 
asking someone sitting nearby in the office to do it?


I don't think the researcher broke into NetApp to do this research, so 
we have to conclude that the industrial partner was on board.  NetApp 
seem to do a bunch of engineering of their own (they have enough patents...) 
so I think we can safely assume they very much do their own research 
on this and it's not just academic...  I doubt they publish all their 
own internal research; be thankful you got to see some of the results 
this way...



   Note that the one study required
a sample set of 1.5 million disk drives.  If the phenomenon were a
regular occurrence as you would have everyone here believe, they could
have used a much smaller sample set.


Sigh... You could criticise the study as under-representative if it had a 
small number of drives, and yet now you criticise a large study for 
having too many observations...


You cannot have too many observations when measuring a rare and 
unpredictable phenomenon...


Where does it say that they could NOT have reproduced this study with 
just 10 drives?  If you have 1.5 million available, why not use all the 
results??




Ed, this is an academic exercise.  Academia leads industry.  Almost
always has.  Academia blows the whistle and waves hands, prompting
industry to take action.


Sigh... We are back to the start of the email thread again... Gosh, you 
seem to love arguing and muddying the water for no reason other than to 
have the last word?


It's *trivial* to do a Google search and hit *lots* of reports of 
corruption in various parts of the system, from corrupting drivers, to 
hardware which writes incorrectly, to operating system flaws.  I just 
found a bunch more in the Red Hat database today while looking for 
something else.  You yourself are very vocal on avoiding certain brands 
of HD controller which have been rumoured to cause corrupted data... 
(and thank you for revealing that kind of thing - it's very helpful)


Don't veer off at a tangent now: the *original* email that spawned this 
is about a VERY specific point.  RAID1 appears to offer less protection 
against a class of error conditions than does RAID6.  Nothing more, 
nothing less.  Don't veer off and talk about the minutiae of testing 
studies at universities; this is a straightforward claim that you have 
been jumping around and avoiding, answering instead with claims of needing 
to educate me on SCSI protocols and other fatuous responses.  Nor should 
you deviate into discussing that RAID6 is inappropriate for many 
situations - we all get that...





There is nothing normal users need to do to address this problem.


...except sit tight and hope they don't lose anything important!

:-)



Having the prestigious degree that you do, you should already understand
the relationship between academic research and industry, and the
considerable lead times involved.


I'm guessing you haven't attended higher education then?  You are 
confusing graduate and post-graduate systems...


Byee

Ed W



Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-14 Thread Stan Hoeppner
On 4/14/2012 5:04 AM, Jan-Frode Myklebust wrote:
 On Fri, Apr 13, 2012 at 07:33:19AM -0500, Stan Hoeppner wrote:

 What I meant wasn't the drive throwing uncorrectable read errors but
 the drives returning different data that each thinks is correct, or
 both may have sent the correct data but one of the set got corrupted
 on the fly. After reading the articles posted, maybe the correct term
 would be the controller receiving silently corrupted data, say due to
 a bad cable on one.

 This simply can't happen.  What articles are you referring to?  If the
 author is stating what you say above, he simply doesn't know what he's
 talking about.
 
 It has happened to me, with RAID5 not RAID1. It was a firmware bug
 in the raid controller that caused the RAID array to become silently
 corrupted. The HW reported everything green -- but the filesystem was
 reporting lots of strange errors.  This LUN was part of a larger
 filesystem striped over multiple LUNs, so parts of the fs were OK, while
 other parts were corrupt.
 
 It was this bug:
 

 http://delivery04.dhe.ibm.com/sar/CMA/SDA/02igj/7/ibm_fw1_ds4kfc_07605200_anyos_anycpu.chg
- Fix 432525 - CR139339  Data corruption found on drive after
  reconstruct from GHSP (Global Hot Spare)

Note my comments were specific to the RAID1 case, or a concatenated set
of RAID1 devices.  And note the discussion was framed around silent
corruption in the absence of bugs and hardware failure, or should I say,
where no bugs or hardware failures can be identified.

 snip
 
 In closing, I'll simply say this:  If hardware, whether a mobo-down SATA
 chip, or a $100K SGI SAN RAID controller, allowed silent data corruption
 or transmission to occur, there would be no storage industry, and we'll
 all still be using pen and paper.  The questions you're asking were
 solved by hardware and software engineers decades ago.  You're fretting
 and asking about things that were solved decades ago.
 
 Look at what the plans are for your favorite fs:
 
   http://www.youtube.com/watch?v=FegjLbCnoBw
 
 They're planning on doing metadata checksumming to be sure they don't
 receive corrupted metadata from the backend storage, and say that data
 validation is a storage subsystem *or* application problem. 

You can't make sure you don't receive corrupted data.  You take steps to
mitigate the negative effects of it if and when it happens.  The XFS
devs are planning this for the future.  If the problem were here now,
this work would have already been done.

 Hardly a solved problem..

It has been up to this point.  The issue going forward is that current
devices don't employ sufficient consistency checking to meet future
needs.  And the disk drive makers apparently don't want to consume the
additional bits required to properly do this in the drives.

If they'd dedicate far more bits to ECC we might not have this issue.  But
since it appears this isn't going to change, kernel, filesystem and
application developers are taking steps to mitigate it.  Again, this
silent corruption issue as described in the various academic papers is
a future problem for most, not a current problem.  It's only a current
problem for those at the bleeding edge of large-scale storage.  Note
that firmware bugs in individual products aren't part of this issue.
Those will be with us forever in various products because humans make
mistakes.  No amount of filesystem or application code can mitigate
those.  The solution to that is standard best practices: snapshots,
backups, or even mirroring all your storage across different vendor
hardware.

-- 
Stan


Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-14 Thread Stan Hoeppner
On 4/14/2012 5:00 AM, Ed W wrote:
 On 14/04/2012 04:48, Stan Hoeppner wrote:
 On 4/13/2012 10:31 AM, Ed W wrote:

 You mean those answers like:
  you need to read 'those' articles again

 Referring to some unknown and hard to find previous emails is not the
 same as answering?
 No, referring to this:

 On 4/12/2012 5:58 AM, Ed W wrote:

 The claim by ZFS/BTRFS authors and others is that data silently bit
 rots on its own.
 Is it not a correct assumption that you read this in articles?  If you
 read this in books, scrolls, or chiseled tablets, my apologies for
 assuming it was articles.

 
 WHAT?!!  The original context was that you wanted me to learn some very
 specific thing that you accused me of misunderstanding, and then it
 turns out that the thing I'm supposed to learn comes from re-reading
 every email, every blog post, every video, every slashdot post, every
 wiki, every ... that mentions ZFS's reason for including end to end
 checksumming?!!

No, the original context was your town crier statement that the sky is
falling due to silent data corruption.  I pointed out that this is not
currently the case, and that most wouldn't see this until quite a few years
down the road.  I provided facts to back my statement, which you didn't
seem to grasp or comprehend.  I pointed this out and your top popped
with a cloud of steam.

 Please stop wasting our time and get specific

Whose time am I wasting, Ed?  You're the primary person on this list
who wastes everyone's time with these drawn out threads, usually
unrelated to Dovecot.  I have been plenty specific.  The problem is you
lack the knowledge and understanding of hardware communication.  You're
upset because I'm not pointing out the knowledge you seem to lack?  Is
that not a waste of everyone's time?  Would that not be even more
insulting, causing even more excited/heated emails from you?

 You have taken my email, which contained a specific question that has been
 asked of you multiple times now, and yet you insist on only answering
 irrelevant details with a pointed and personal dig in each answer.  The
 rudeness is unnecessary, and your evasiveness in answering does not fill
 me with confidence that you actually know the answer...

Ed, I have not been rude.  I've been attempting to prevent you from dragging
us into the mud, which you've done, as you often do.  How specific would
you like me to get?  This is what you seem to be missing:

Drives perform a per-sector CRC before transmitting data to the HBA.  ATA,
SATA, SCSI, SAS and Fibre Channel devices and HBAs all perform CRC on wire
data.  The PCI/PCI-X/PCIe buses/channels and Southbridge all perform CRC
on wire data.  HyperTransport and Intel's proprietary links also
perform CRC on wire transmissions.  Server memory is protected by ECC,
some by ChipKill, which can tolerate double-bit errors.

With today's systems and storage densities, with error correcting codes
on all data paths within the system, and on the drives themselves,
silent data corruption is not an issue--in the absence of defective
hardware or a bug, which are not relevant to the discussion.
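
As a rough sketch of what each of those hops is doing (illustrative only --
real links use hardware CRC engines with their own polynomials and framing,
and retransmit bad frames automatically):

    import zlib

    def send_frame(payload):
        # Sender appends a CRC over the frame, conceptually what the
        # SATA/SAS/PCIe link layers do in hardware.
        return payload, zlib.crc32(payload)

    def receive_frame(payload, crc):
        # Receiver re-computes the CRC; a bit flipped on the wire is
        # caught here and the frame is retransmitted rather than passed
        # up as good data.
        if zlib.crc32(payload) != crc:
            raise IOError("link CRC mismatch - retransmit")
        return payload

    frame, crc = send_frame(b"one sector of mail data")
    assert receive_frame(frame, crc) == b"one sector of mail data"

Note each check is per hop: each receiver verifies the inbound CRC and the
next sender computes a fresh one for the next link.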

 For the benefit of anyone reading this via email archives or whatever, I
 think the conclusion we have reached is that: modern systems are now a)
 a complex sum of pieces, any of which can cause an error to be injected,

Errors occur all the time.  And they're corrected nearly all of the
time, on modern complex systems.  Silent errors do not occur frequently,
usually not at all, on most modern systems.

 b) the level of error correction which was originally specified as being
 sufficient is now starting to be reached in real systems, 

FSVO 'real systems'.  The few occurrences of silent data corruption
I'm aware of have been documented in academic papers published by
researchers working at taxpayer-funded institutions.  In the case of
CERN, the problem was a firmware bug in the Western Digital drives that
caused an issue with the 3Ware controllers.  This kind of thing happens
when using COTS DIY hardware in the absence of proper load validation
testing.  So this case doesn't really fit the Henny-penny silent data
corruption scenario, as a firmware bug caused it--one that should have
been caught and corrected during testing.

The other cases I'm aware of were all HPC systems which generated
SDC under extended high loads, and these SDCs nearly all occurred
somewhere other than the storage systems--CPUs, RAM, interconnect, etc.
HPC apps tend to run the CPUs, interconnects, storage, etc. at full
bandwidth for hours at a time, across tens of thousands of nodes, so the
probability of SDC is much higher simply due to scale.
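
To put rough numbers on the scale argument (back-of-envelope arithmetic
only, using the 1e-15 per-bit figure commonly quoted for unrecoverable
read errors, not any particular vendor's spec):

    BITS_PER_TB = 8 * 10**12

    def expected_error_events(tb_moved, per_bit_rate=1e-15):
        # Expected events grow linearly with the number of bits moved.
        return tb_moved * BITS_PER_TB * per_bit_rate

    print(expected_error_events(2))        # ~0.016 for a 2 TB mirror read once
    print(expected_error_events(10000))    # ~80 for 10 PB pushed through an HPC job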

 possibly even
 consumer systems.  

Possibly?  If you're going to post pure conjecture why not say possibly
even iPhones or Androids?  There's no data to back either claim.  Stick
to the facts.

 There is no solution, however, the first step is to
 enhance detection.  Various solutions have been proposed, all increase
 cost, computation or have some 

Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-13 Thread Emmanuel Noobadmin
On 4/12/12, Stan Hoeppner s...@hardwarefreak.com wrote:
 On 4/11/2012 9:23 PM, Emmanuel Noobadmin wrote:
 I suppose the controller could throw an error if
 the two drives returned data that didn't agree with each other but it
 wouldn't know which is the accurate copy but that wouldn't protect the
 integrity of the data, at least not directly without additional human
 intervention I would think.

 When a drive starts throwing uncorrectable read errors, the controller
 faults the drive and tells you to replace it.  Good hardware RAID
 controllers are notorious for their penchant to kick drives that would
 continue to work just fine in mdraid or as a single drive for many more
 years.

What I meant wasn't the drive throwing uncorrectable read errors but
the drives returning different data that each thinks is correct, or
both may have sent the correct data but one of the set got corrupted
on the fly. After reading the articles posted, maybe the correct term
would be the controller receiving silently corrupted data, say due to
a bad cable on one.

If the controller simply returns the fastest result, it could be the
bad sector, and that doesn't protect the integrity of the data, right?

If the controller gets the 1st half from one drive and the 2nd half from
the other drive to speed up performance, we could still get the corrupted
half, and the controller itself still can't tell whether the sector it got
was corrupted, can it?

If the controller compares the two sectors from the drives, it may be
able to tell us something is wrong, but there isn't any way for it to
know which one of the sectors was a good read and which wasn't, or is
there?


Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-13 Thread Timo Sirainen
On 12.4.2012, at 15.10, Ed W wrote:

 On 12/04/2012 12:09, Timo Sirainen wrote:
 On 12.4.2012, at 13.58, Ed W wrote:
 
 The claim by ZFS/BTRFS authors and others is that data silently bit rots 
 on its own. The claim is therefore that you can have a raid1 pair where 
 neither drive reports a hardware failure, but each gives you different data?
 That's one reason why I planned on adding a checksum to each message in 
 dbox. But I forgot to actually do that. I guess I could add it for new 
 messages in some upcoming version. Then Dovecot could optionally verify the 
 checksum before returning the message to client, and if it detects 
 corruption perhaps automatically read it from some alternative location 
 (e.g. if dsync replication is enabled ask from another replica). And Dovecot 
 index files really should have had some small (8/16/32bit) checksums of 
 stuff as well..
 
 
 I have to say - I haven't actually seen this happen... Do any of your big 
 mailstore contacts observe this, eg rackspace, etc?

I haven't heard. But then again people don't necessarily notice if it has.

 Things I might like to do *if* there were some suitable checksums available:
 - Use the checksum as some kind of guid either for the whole message, the 
 message minus the headers, or individual mime sections

Messages already have a GUID. And the rest of that is kind of done with the 
single instance storage stuff.. I was thinking of using SHA1 of the entire 
message with headers as the checksum, and save it into dbox metadata field. I 
also thought about checksumming the metadata fields as well, but that would 
need another checksum as the first one can have other uses as well besides 
verifying the message integrity.
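
Roughly this kind of thing, as a sketch (not actual Dovecot code):

    import hashlib

    def message_checksum(raw_message):
        # SHA-1 over the full message, headers included; the digest would
        # be stored in a dbox metadata field when the message is saved.
        return hashlib.sha1(raw_message).hexdigest()

    def verify_before_fetch(raw_message, stored_checksum):
        # Optional check before returning the message to the client; on a
        # mismatch another replica (e.g. via dsync) could be tried instead.
        if message_checksum(raw_message) != stored_checksum:
            raise IOError("dbox message failed checksum - try another replica")
        return raw_message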

 - Use the checksums to assist with replication speed/efficiency (dsync or 
 custom imap commands)

It would be of some use with dbox index rebuilding. I don't think it would help 
with dsync.

 - File RFCs for new imap features along the LEMONADE lines which allow 
 clients to have faster recovery from corrupted offline states...

Too much trouble, no one would implement it :)

 - Storage backends where emails are redundantly stored and might not ALL be 
 on a single server (find me the closest copy of email X) - derivations of 
 this might be interesting for compliance archiving of messages?
 - Fancy key-value storage backends might use checksums as part of the key 
 value (either for the whole or parts of the message)

GUID would work for these as well, without the possibility of a hash collision.

Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-13 Thread Ed W

On 13/04/2012 12:51, Timo Sirainen wrote:

- Use the checksums to assist with replication speed/efficiency (dsync or 
custom imap commands)

It would be of some use with dbox index rebuilding. I don't think it would help 
with dsync.

..

- File RFCs for new imap features along the LEMONADE lines which allow clients 
to have faster recovery from corrupted offline states...

Too much trouble, no one would implement it :)


I presume you have seen that cyrus is working on various distributed 
options?  Standardising this through imap might work if they also buy 
into it?




- Storage backends where emails are redundantly stored and might not ALL be on 
a single server (find me the closest copy of email X) - derivations of this 
might be interesting for compliance archiving of messages?
- Fancy key-value storage backends might use checksums as part of the key value 
(either for the whole or parts of the message)

GUID would work for these as well, without the possibility of a hash collision.


I was thinking that the win for key-value store as a backend is if you 
can reduce the storage requirements or do better placement of the data 
(mail text replicated widely, attachments stored on higher latency 
storage?).  Hence whilst I don't see this being a win with current 
options, if it were done then it would almost certainly be per mime 
part, eg storing all large attachments in one place and the rest of the 
message somewhere else, perhaps with different redundancy levels per type


OK, this is all completely pie in the sky.  Please don't build it!  All 
I meant was that these are the kind of things that someone might one day 
desire to do and hence they would have competing requirements for what 
to checksum...


Cheers

Ed W


Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-13 Thread Timo Sirainen
On 13.4.2012, at 15.17, Ed W wrote:

 On 13/04/2012 12:51, Timo Sirainen wrote:
 - Use the checksums to assist with replication speed/efficiency (dsync or 
 custom imap commands)
 It would be of some use with dbox index rebuilding. I don't think it would 
 help with dsync.
 ..
 - File RFCs for new imap features along the LEMONADE lines which allow 
 clients to have faster recovery from corrupted offline states...
 Too much trouble, no one would implement it :)
 
 I presume you have seen that cyrus is working on various distributed options? 
  Standardising this through imap might work if they also buy into it?

Probably more trouble than worth. I doubt anyone would want to run a 
cross-Dovecot/Cyrus cluster.

 - Storage backends where emails are redundantly stored and might not ALL be 
 on a single server (find me the closest copy of email X) - derivations of 
 this might be interesting for compliance archiving of messages?
 - Fancy key-value storage backends might use checksums as part of the key 
 value (either for the whole or parts of the message)
 GUID would work for these as well, without the possibility of a hash 
 collision.
 
 I was thinking that the win for key-value store as a backend is if you can 
 reduce the storage requirements or do better placement of the data (mail text 
 replicated widely, attachments stored on higher latency storage?).  Hence 
 whilst I don't see this being a win with current options, if it were done 
 then it would almost certainly be per mime part, eg storing all large 
 attachments in one place and the rest of the message somewhere else, perhaps 
 with different redundancy levels per type
 
 OK, this is all completely pie in the sky.  Please don't build it!  All I 
 meant was that these are the kind of things that someone might one day desire 
 to do and hence they would have competing requirements for what to checksum...

That can almost be done already .. the attachments are saved and accessed via a 
lib-fs API. It wouldn't be difficult to write a backend for some key-value 
databases. So with about one day's coding you could already have Dovecot save 
all message attachments to a key-value db, and you can configure redundancy in 
the db's configs.
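
As a toy illustration of the shape of such a backend (hypothetical code,
not the lib-fs API): a content-addressed key-value store, where identical
attachments collapse to one key and redundancy is left to the database:

    import hashlib

    class ToyAttachmentStore:
        def __init__(self, kv):
            self.kv = kv                    # any dict-like store (Redis, Cassandra, ...)

        def put(self, blob):
            key = hashlib.sha256(blob).hexdigest()
            self.kv.setdefault(key, blob)   # duplicate attachments share one key
            return key                      # the mailbox only records the key

        def get(self, key):
            return self.kv[key]

    store = ToyAttachmentStore({})
    key = store.put(b"%PDF-1.4 example attachment body")
    assert store.get(key) == b"%PDF-1.4 example attachment body"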

Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-13 Thread Stan Hoeppner
On 4/13/2012 1:12 AM, Emmanuel Noobadmin wrote:
 On 4/12/12, Stan Hoeppner s...@hardwarefreak.com wrote:
 On 4/11/2012 9:23 PM, Emmanuel Noobadmin wrote:
 I suppose the controller could throw an error if
 the two drives returned data that didn't agree with each other but it
 wouldn't know which is the accurate copy but that wouldn't protect the
 integrity of the data, at least not directly without additional human
 intervention I would think.

 When a drive starts throwing uncorrectable read errors, the controller
 faults the drive and tells you to replace it.  Good hardware RAID
 controllers are notorious for their penchant to kick drives that would
 continue to work just fine in mdraid or as a single drive for many more
 years.
 
 What I meant wasn't the drive throwing uncorrectable read errors but
 the drives returning different data that each thinks is correct, or
 both may have sent the correct data but one of the set got corrupted
 on the fly. After reading the articles posted, maybe the correct term
 would be the controller receiving silently corrupted data, say due to
 a bad cable on one.

This simply can't happen.  What articles are you referring to?  If the
author is stating what you say above, he simply doesn't know what he's
talking about.

 If the controller simply returns the fastest result, it could be the
 bad sector and that doesn't protect the integrity of the data right?

I already answered this in a previous post.

 if the controller gets 1st half from one drive and 2nd half from the
 other drive to speed up performance, we could still get the corrupted
 half and the controller itself still can't tell if the sector it got
 was corrupted isn't it?

No, this is not correct.

 If the controller compares the two sectors from the drives, it may be
 able to tell us something is wrong but there isn't anyway for it to
 know which one of the sector was a good read and which isn't, or is
 there?

Yes it can, and it does.

Emmanuel, Ed, we're at a point where I simply don't have the time or the
inclination to continue answering these basic questions about the base
level functions of storage hardware.  You both have serious
misconceptions about how many things work.  To answer the questions
you're asking would require me to teach you the basics of hardware
signaling protocols; SCSI, SATA, Fibre Channel, and Ethernet
transmission error detection protocols; disk drive firmware error
recovery routines; etc, etc, etc.

I don't mind, and actually enjoy, passing on knowledge.  But the amount
that seems to be required here to bring you up to speed is about 2^15
times above and beyond the scope of a mailing list conversation.

In closing, I'll simply say this:  If hardware, whether a mobo-down SATA
chip, or a $100K SGI SAN RAID controller, allowed silent data corruption
or transmission to occur, there would be no storage industry, and we'll
all still be using pen and paper.  The questions you're asking were
solved by hardware and software engineers decades ago.  You're fretting
and asking about things that were solved decades ago.

-- 
Stan


Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-13 Thread Jim Lawson
On 04/13/2012 08:33 AM, Stan Hoeppner wrote:
 What I meant wasn't the drive throwing uncorrectable read errors but
 the drives returning different data that each thinks is correct, or
 both may have sent the correct data but one of the set got corrupted
 on the fly. After reading the articles posted, maybe the correct term
 would be the controller receiving silently corrupted data, say due to
 a bad cable on one.
 This simply can't happen.  What articles are you referring to?  If the
 author is stating what you say above, he simply doesn't know what he's
 talking about.


?!  Stan, are you really saying that silent data corruption simply
can't happen?  People who have been studying this have been talking
about it for years now.  It can happen in the same way that Emmanuel
describes.

USENIX FAST08:

http://static.usenix.org/event/fast08/tech/bairavasundaram.html

CERN:

http://storagemojo.com/2007/09/19/cerns-data-corruption-research/

http://fuji.web.cern.ch/fuji/talk/2007/kelemen-2007-C5-Silent_Corruptions.pdf

LANL:

http://institute.lanl.gov/resilience/conferences/2009/HPCResilience09_Michalak.pdf

There are others if you search for it.  This problem has been well-known
in large (petabyte+) data storage systems for some time.

Jim


Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-13 Thread Ed W

On 13/04/2012 13:21, Timo Sirainen wrote:

On 13.4.2012, at 15.17, Ed W wrote:


On 13/04/2012 12:51, Timo Sirainen wrote:

- Use the checksums to assist with replication speed/efficiency (dsync or 
custom imap commands)

It would be of some use with dbox index rebuilding. I don't think it would help 
with dsync.

..

- File RFCs for new imap features along the LEMONADE lines which allow clients 
to have faster recovery from corrupted offline states...

Too much trouble, no one would implement it :)

I presume you have seen that cyrus is working on various distributed options?  
Standardising this through imap might work if they also buy into it?

Probably more trouble than worth. I doubt anyone would want to run a 
cross-Dovecot/Cyrus cluster.


No definitely not.  Sorry I just meant that you are both working on 
similar things.  Standardising the basics that each use might be useful 
in the future



That can almost be done already .. the attachments are saved and accessed via a 
lib-fs API. It wouldn't be difficult to write a backend for some key-value 
databases. So with about one day's coding you could already have Dovecot save 
all message attachments to a key-value db, and you can configure redundancy in 
the db's configs.


Hmm, super.

Ed W


Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-13 Thread Stan Hoeppner
On 4/13/2012 8:12 AM, Jim Lawson wrote:
 On 04/13/2012 08:33 AM, Stan Hoeppner wrote:
 What I meant wasn't the drive throwing uncorrectable read errors but
 the drives returning different data that each thinks is correct, or
 both may have sent the correct data but one of the set got corrupted
 on the fly. After reading the articles posted, maybe the correct term
 would be the controller receiving silently corrupted data, say due to
 a bad cable on one.
 This simply can't happen.  What articles are you referring to?  If the
 author is stating what you say above, he simply doesn't know what he's
 talking about.
 
 
 ?!  Stan, are you really saying that silent data corruption simply
 can't happen?  

Yes, I did.  Did you read the context in which I made that statement?

 People who have been studying this have been talking
 about it for years now.  

Yes, they have.  Did you miss the paragraph where I stated exactly that?
 Did you also miss the part about the probability of such being dictated by
total storage system size and access rate?

 It can happen in the same way that Emmanuel
 describes.

No, it can't.  Not in the way Emmanuel described.  I already stated the
reason, and all of this research backs my statement.  You won't see this
with a 2-drive mirror, or a 20-drive RAID10.  Not until each drive has a
capacity in the 15TB+ range, if not more, and again, depending on the
total system size.  This doesn't address the RAID5 (parity RAID) write
hole, which is a separate issue--and which is also one of the reasons I
don't use it.

In the absence of an actual controller firmware bug, or an mdraid or LVM
bug, you'll never see this on small-scale systems.

 USENIX FAST08:
 
 http://static.usenix.org/event/fast08/tech/bairavasundaram.html
 
 CERN:
 
 http://storagemojo.com/2007/09/19/cerns-data-corruption-research/
 
 http://fuji.web.cern.ch/fuji/talk/2007/kelemen-2007-C5-Silent_Corruptions.pdf
 
 LANL:
 
 http://institute.lanl.gov/resilience/conferences/2009/HPCResilience09_Michalak.pdf
 
 There are others if you search for it.  This problem has been well-known
 in large (petabyte+) data storage systems for some time.

And again, this is the crux of it.  One doesn't see this problem until
one hits extreme scale, which I spent at least a paragraph or two
explaining, referencing the same research.  Please re-read my post at
least twice, critically.  Then tell me if I've stated anything
substantively different from what any of these researchers have.

The statements "shouldn't", "wouldn't" and "can't" are based on
probabilities.  "Can't" or "won't" does not need to mean a probability
of exactly 0.  The probability of this type of silent data corruption
occurring on a 2-disk or 20-disk array of today's drives is not zero over
10 years, but it is so low that the effective statement is you "can't" or
"won't" see this corruption.  As I said, when we reach 15-30TB+ disk
drives, this may change for small-count arrays.

-- 
Stan



Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-13 Thread Ed W

On 13/04/2012 06:29, Stan Hoeppner wrote:

On 4/12/2012 5:58 AM, Ed W wrote:


The claim by ZFS/BTRFS authors and others is that data silently bit
rots on its own. The claim is therefore that you can have a raid1 pair
where neither drive reports a hardware failure, but each gives you
different data?

You need to read those articles again very carefully.  If you don't
understand what they mean by 1 in 10^15 bits non-recoverable read error
rate and combined probability, let me know.


OK, I'll bite.  I only have an honours degree in mathematics from a 
well-known university, so I'd be grateful if you could dumb it down 
appropriately?


Let's start with: what are those articles you are referring to?  I don't 
see any articles if I go literally up the chain from this email, but you 
might be talking about any one of the many other emails in this thread, 
or even some other email thread?


Wikipedia has its faults, but it dumbs the silent corruption claim 
down to:

http://en.wikipedia.org/wiki/ZFS
"an undetected error for every 67TB"

And a CERN study apparently claims a rate far higher than one in every 
10^16 bits.
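
To put those rates into perspective (rough arithmetic only, treating the
per-bit rates as the only failure mode):

    BITS_PER_TB = 8 * 10**12

    for per_bit_rate in (1e-14, 1e-15, 1e-16):
        tb_per_error = 1 / (per_bit_rate * BITS_PER_TB)
        print("%g per bit  ->  roughly one error per %.0f TB read"
              % (per_bit_rate, tb_per_error))

    # 1e-14 -> ~12.5 TB, 1e-15 -> ~125 TB, 1e-16 -> ~1250 TB; the 67TB
    # figure quoted above sits between the 1e-14 and 1e-15 per-bit rates
    # that appear on typical drive datasheets.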

Now, I'm NOT professing any experience or axe to grind here.  I'm simply 
asking by what feature you believe either software or hardware RAID1 
is capable of detecting which copy is correct when both members of a 
RAID1 pair return different results and there is no hardware failure to 
clue us in that one member suffered a read error.  Please don't respond 
with a maths pissing competition; it's an innocent question about what 
levels of data checking are done on each piece of the hardware chain.  My 
(probably flawed) understanding is that popular RAID 1 implementations 
don't add any additional sector checksums over and above what the 
drives/filesystem/etc already offer - is this the case?





And this has zero bearing on RAID1.  And RAID1 reads don't work the way
you describe above.  I explained this in some detail recently.


Where?



Been working that way for more than 2 decades Ed. :)  Note that RAID1
has that 1 for a reason.  It was the first RAID level.


What should I make of RAID0 then?

Incidentally do you disagree with the history of RAID evolution on 
Wikipedia?

http://en.wikipedia.org/wiki/RAID


Regards

Ed W


Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-13 Thread Ed W

On 13/04/2012 13:33, Stan Hoeppner wrote:

What I meant wasn't the drive throwing uncorrectable read errors but
the drives returning different data that each thinks is correct, or
both may have sent the correct data but one of the set got corrupted
on the fly. After reading the articles posted, maybe the correct term
would be the controller receiving silently corrupted data, say due to
a bad cable on one.

This simply can't happen.  What articles are you referring to?  If the
author is stating what you say above, he simply doesn't know what he's
talking about.


It quite clearly can??!

Just grab your drive, lever the connector off a little bit until it's a 
bit flaky, and off you go.  *THIS* type of problem I have heard of, and 
you can find easy examples with a quick Google search of any hobbyist 
storage board.  Other very common examples are problems due to 
failing PSUs and other interference causing explicit 
disk errors (and once the error rate goes up, some will make it past the 
checksum).


Note this is NOT what I was originally asking about.  My interest is 
more about when the hardware is working reliably and as you agree, the 
error levels are vastly lower.  However, it would be incredibly foolish 
to claim that it's not trivial to construct a scenario where bad 
hardware causes plenty of silent corruption?



If the controller simply returns the fastest result, it could be the
bad sector and that doesn't protect the integrity of the data right?

I already answered this in a previous post.


Not obviously?!

I will also add my understanding that Linux software RAID 1/5/6 does 
*NOT* read all disks and hence will not be aware when the disks hold 
different data.  In fact, with software RAID you need to run a regular 
scrub job to check this consistency.


I also believe that most commodity hardware RAID implementations work 
exactly the same way and a background scrub is needed to detect 
inconsistent arrays.  However, feel free to correct that understanding?
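
For reference, the scrub itself is just a sysfs write - roughly what the
distro 'checkarray' cron job does under the hood (sketch only; assumes
root and an array called md0):

    MD = "/sys/block/md0/md"

    def start_check():
        with open(MD + "/sync_action", "w") as f:
            f.write("check")        # "repair" would rewrite mismatched copies

    def mismatch_count():
        with open(MD + "/mismatch_cnt") as f:
            return int(f.read())    # non-zero after a check => the mirrors diverged

    start_check()
    # ...wait for sync_action to go back to "idle", then:
    print("mismatch_cnt:", mismatch_count())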





if the controller gets 1st half from one drive and 2nd half from the
other drive to speed up performance, we could still get the corrupted
half and the controller itself still can't tell if the sector it got
was corrupted isn't it?

No, this is not correct.


I definitely think you are wrong and Emmanuel is right?

If the controller gets a good read from the disk then it will trust that 
read and will NOT check the result with the other disk (or parity in the 
case of RAID5/6).  If that read was incorrect for some reason then the 
data will be passed as good.




If the controller compares the two sectors from the drives, it may be
able to tell us something is wrong but there isn't anyway for it to
know which one of the sector was a good read and which isn't, or is
there?

Yes it can, and it does.


No, it definitely does not!!  At least not with Linux software RAID, and I 
don't believe on commodity hardware controllers either!  (You would be 
able to tell, because the disk IO would be doubled.)


Linux software RAID 1 isn't that smart: it reads only one disk and 
trusts the answer if the read did not trigger an error.  It does not 
check the other disk except during an explicit scrub.





Emmanuel, Ed, we're at a point where I simply don't have the time nor
inclination to continue answering these basic questions about the base
level functions of storage hardware.


You mean those answers like:
I answered that in another thread
or
you need to read 'those' articles again

Referring to some unknown and hard to find previous emails is not the 
same as answering?


Also, you are wandering off at extreme tangents.  The question is simple:

- Disk 1 Read good, checksum = A
- Disk 2 Read good, checksum = B

The disks are a RAID 1 pair.  How do we know which disk is correct?  Please 
specify the RAID 1 implementation and the mechanism used with any answer.
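
To spell out the two cases I can see (schematic Python, not any real
implementation):

    import hashlib

    def plain_raid1_read(copy_a, copy_b):
        # Plain two-way mirror: both reads "succeeded", the copies differ,
        # and there is no third piece of information to break the tie.
        if copy_a != copy_b:
            raise IOError("mirror mismatch - no way to tell which copy is good")
        return copy_a

    def checksummed_read(copy_a, copy_b, expected):
        # ZFS/BTRFS-style: a checksum stored with the block pointer at write
        # time lets you pick the copy that still matches it (and rewrite the
        # other one).  Schematic only, not the real on-disk format.
        for copy in (copy_a, copy_b):
            if hashlib.sha256(copy).hexdigest() == expected:
                return copy
        raise IOError("both copies fail the checksum")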




To answer the questions
you're asking will require me to teach you the basics of hardware
signaling protocols, SCSI, SATA, Fiber Channel, and Ethernet
transmission error detection protocols, disk drive firmware error
recovery routines, etc, etc, etc.


I really think not...  A simple statement of:

- Each sector on disk has a certain-sized checksum
- The controller checks the checksum on read
- The data is sent back over the SATA connection, with a certain-sized checksum
- After that you are on your own vs corruption

...should cover it, I think?




In closing, I'll simply say this:  If hardware, whether a mobo-down SATA
chip, or a $100K SGI SAN RAID controller, allowed silent data corruption
or transmission to occur, there would be no storage industry, and we'll
all still be using pen and paper.  The questions you're asking were
solved by hardware and software engineers decades ago.  You're fretting
and asking about things that were solved decades ago.


So why are so many people getting excited about it now?

Note, there have been plenty of shoddy disk controller implementations 
before today - ie 

Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-13 Thread Maarten Bezemer


On Fri, 13 Apr 2012, Ed W wrote:

On 13/04/2012 13:33, Stan Hoeppner wrote:

What I meant wasn't the drive throwing uncorrectable read errors but
the drives returning different data that each thinks is correct, or
both may have sent the correct data but one of the set got corrupted
on the fly. After reading the articles posted, maybe the correct term
would be the controller receiving silently corrupted data, say due to
a bad cable on one.

This simply can't happen.  What articles are you referring to?  If the
author is stating what you say above, he simply doesn't know what he's
talking about.

It quite clearly can??!


I totally agree with Ed here. Drives sure can and sometimes really do 
return different data, without reporting errors. Also, data can get 
corrupted on any of the busses or chips it passes through.


The math about 10^15 or 10^16 and all that stuff is not only about array 
sizes. It's also about data transfer.


I've seen silent corruption on a few systems myself. (Luckily, only 3 
times in a couple years.) Those systems were only in the 2TB-5TB size 
category, which is substantially lower than the 67TB claimed elsewhere. 
Yet, statistically, it's well within normal probability levels.


Linux mdraid only reads one mirror as long as the drives don't return an 
error. Easy to check: the read speeds are way beyond a single drive's read 
speed. If the kernel had to read all (possibly more than two) 
mirrors, compare them, and make a decision based on this comparison, 
things would be horribly slow. Hardware raid typically uses this exact 
same approach. This goes for Areca, 3ware and LSI, which cover most of the 
regular (i.e. non-SAN) professional hardware raid setups.


If you don't believe it, don't just take my word for it but test it for 
yourself. Cleanly power down a raid1 array, take the individual drives, 
put them into a simple desktop machine, and write different data to both, 
using some raw disk writing tool like dd. Then put the drives back into 
the raid1 array, power it up, and re-read the information. You'll see data 
from both drives intermixed, as parts of the reads come from one 
disk and parts come from the other. Only when you order the raid array to 
do a verification pass will it start screaming and yelling. At least, I 
hope it will...
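
A less destructive variant of the same check is to read the same region
directly from both members and compare (device names are only examples;
needs root, and it is only meaningful while the array is quiescent and
both members use the same layout):

    import os

    def read_block(dev, offset, length=4096):
        fd = os.open(dev, os.O_RDONLY)
        try:
            os.lseek(fd, offset, os.SEEK_SET)
            return os.read(fd, length)
        finally:
            os.close(fd)

    a = read_block("/dev/sda2", 1024**3)    # member 1 of the raid1 array
    b = read_block("/dev/sdb2", 1024**3)    # member 2
    print("members agree" if a == b else "members DIVERGE at this offset")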



But as explained elsewhere, silent corruption can occur at numerous 
places. If you don't have an explicit checksumming/checking mechanism, 
there are indeed cases that will haunt you if you don't do regular 
scrubbing or at least do regular verification runs. Heck, that's why Linux 
mdadm comes with cron jobs to do just that, and hardware raid controllers 
have similar scheduling capabilities.


Of course, scrubbing/verification is not going to magically protect you 
from all problems. But you would at least get notifications if it detects 
problems.




If the controller compares the two sectors from the drives, it may be
able to tell us something is wrong but there isn't anyway for it to
know which one of the sector was a good read and which isn't, or is
there?

Yes it can, and it does.


No it definitely does not!! At least not with linux software raid and I don't 
believe on commodity hardware controllers either!  (You would be able to tell 
because the disk IO would be doubled)


Obviously there is no way to tell which version of a story is correct if 
you are not biased to believe one of the storytellers and distrust the 
other. You would have to add a checksum layer for that. (And hope the 
checksum isn't the part that got corrupted!)




To answer the questions
you're asking will require me to teach you the basics of hardware
signaling protocols, SCSI, SATA, Fiber Channel, and Ethernet
transmission error detection protocols, disk drive firmware error
recovery routines, etc, etc, etc.


I'm quite familiar with the basics of these protocols. I'm also quite 
familiar with the flaws in several implementations of seemingly 
straightforward protocols. More often than not, there's a pressing need 
to get new devices onto the market before the competition has something 
similar and you lose your advantage. More often than not, this results in 
suboptimal implementations of all those fine protocols and algorithms. And 
let's face it: flaws in error recovery routines often don't surface until 
someone actually needs those routines. As long as drives (or any other 
device) are functioning as expected, everything is all right. But as soon 
as something starts to get flaky, error recovery has to kick in but may 
just as well fail to do the right thing.


Just consider the real-world analogy of politicians. They do or say 
something stupid every once in a while, and error recovery (a.k.a. damage 
control) has to kick in. But even those well-trained professionals, 
with decades of experience in the political arena, sometimes simply fail 
to do the right thing. They may have overlooked some pesky details, or 

Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-13 Thread Stan Hoeppner
On 4/13/2012 10:31 AM, Ed W wrote:
 On 13/04/2012 13:33, Stan Hoeppner wrote:

 In closing, I'll simply say this:  If hardware, whether a mobo-down SATA
 chip, or a $100K SGI SAN RAID controller, allowed silent data corruption
 or transmission to occur, there would be no storage industry, and we'll
 all still be using pen and paper.  The questions you're asking were
 solved by hardware and software engineers decades ago.  You're fretting
 and asking about things that were solved decades ago.
 
 So why are so many people getting excited about it now?

So many?  I know of one person getting excited about it.

Data densities and overall storage sizes and complexity at the top end
of the spectrum are increasing at a faster rate than the
consistency/validation mechanisms.  That's the entire point of the
various academic studies on the issue.  Note that the one study required
a sample set of 1.5 million disk drives.  If the phenomenon were a
regular occurrence as you would have everyone here believe, they could
have used a much smaller sample set.

Ed, this is an academic exercise.  Academia leads industry.  Almost
always has.  Academia blows the whistle and waves hands, prompting
industry to take action.

There is nothing normal users need to do to address this problem.  The
hardware and software communities will make the necessary adjustments to
address this issue before it filters down to the general user community
in a half decade or more--when normal users have a 10-20 drive array of
500TB to 1PB or more.

Having the prestigious degree that you do, you should already understand
the relationship between academic research and industry, and the
considerable lead times involved.

-- 
Stan


Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-13 Thread Stan Hoeppner
On 4/13/2012 10:31 AM, Ed W wrote:

 You mean those answers like:

 you need to read 'those' articles again
 
 Referring to some unknown and hard to find previous emails is not the
 same as answering?

No, referring to this:

On 4/12/2012 5:58 AM, Ed W wrote:

 The claim by ZFS/BTRFS authors and others is that data silently bit
 rots on its own.

Is it not a correct assumption that you read this in articles?  If you
read this in books, scrolls, or chiseled tablets, my apologies for
assuming it was articles.

-- 
Stan


Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-12 Thread Stan Hoeppner
On 4/11/2012 9:23 PM, Emmanuel Noobadmin wrote:
 On 4/12/12, Stan Hoeppner s...@hardwarefreak.com wrote:
 On 4/11/2012 11:50 AM, Ed W wrote:
 One of the snags of md RAID1 vs RAID6 is the lack of checksumming in the
 event of bad blocks.  (I'm not sure what actually happens when md
 scrubbing finds a bad sector with raid1..?).  For low performance
 requirements I have become paranoid and been using RAID6 vs RAID10,
 filesystems with sector checksums seem attractive...

 Except we're using hardware RAID1 here and mdraid linear.  Thus the
 controller takes care of sector integrity.  RAID6 yields nothing over
 RAID10, except lower performance, and more usable space if more than 4
 drives are used.
 
  How would the controller ensure sector integrity unless it is writing
  additional checksum information to disk? I thought only a few
  filesystems like ZFS do the sector checksum to detect if any data
 corruption occurred. I suppose the controller could throw an error if
 the two drives returned data that didn't agree with each other but it
 wouldn't know which is the accurate copy but that wouldn't protect the
 integrity of the data, at least not directly without additional human
 intervention I would think.

When a drive starts throwing uncorrectable read errors, the controller
faults the drive and tells you to replace it.  Good hardware RAID
controllers are notorious for their penchant to kick drives that would
continue to work just fine in mdraid or as a single drive for many more
years.  The mindset here is that anyone would rather spent $150-$2500
dollars on a replacement drive than take a chance with his/her valuable
data.

Yes I typed $2500.  EMC charges over $2000 for a single Seagate disk
drive with an EMC label and serial# on it.  The serial number is what
prevents one from taking the same off the shelf Seagate drive at $300
and mounting it in a $250,000 EMC array chassis.  The controller
firmware reads the S/N from each connected drive and will not allow
foreign drives to be used.  HP, IBM, Oracle/Sun, etc do this as well.
Which is why they make lots of profit, and is why I prefer open storage
systems.

-- 
Stan


Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-12 Thread Ed W

On 12/04/2012 11:20, Stan Hoeppner wrote:

On 4/11/2012 9:23 PM, Emmanuel Noobadmin wrote:

On 4/12/12, Stan Hoeppners...@hardwarefreak.com  wrote:

On 4/11/2012 11:50 AM, Ed W wrote:

One of the snags of md RAID1 vs RAID6 is the lack of checksumming in the
event of bad blocks.  (I'm not sure what actually happens when md
scrubbing finds a bad sector with raid1..?).  For low performance
requirements I have become paranoid and been using RAID6 vs RAID10,
filesystems with sector checksums seem attractive...

Except we're using hardware RAID1 here and mdraid linear.  Thus the
controller takes care of sector integrity.  RAID6 yields nothing over
RAID10, except lower performance, and more usable space if more than 4
drives are used.

How would the controller ensure sector integrity unless it is writing
additional checksum information to disk? I thought only a few
filesystems like ZFS do the sector checksum to detect if any data
corruption occurred. I suppose the controller could throw an error if
the two drives returned data that didn't agree with each other but it
wouldn't know which is the accurate copy but that wouldn't protect the
integrity of the data, at least not directly without additional human
intervention I would think.

When a drive starts throwing uncorrectable read errors, the controller
faults the drive and tells you to replace it.  Good hardware RAID
controllers are notorious for their penchant to kick drives that would
continue to work just fine in mdraid or as a single drive for many more
years.  The mindset here is that anyone would rather spend $150-$2500 
on a replacement drive than take a chance with his/her valuable 
data.



I'm asking a subtly different question.

The claim by ZFS/BTRFS authors and others is that data silently bit 
rots on its own. The claim is therefore that you can have a RAID1 pair 
where neither drive reports a hardware failure, but each gives you 
different data?  I can't personally claim to have observed this, so it 
remains someone else's theory...  (For background, my experience is 
simply: RAID10 for high-performance arrays and RAID6 for all my personal 
data - I intend to investigate your linear RAID idea in the future though.)


I do agree that if one drive reports a read error, then it's quite easy 
to guess which member of the array is wrong...


Just as an aside, I don't have a lot of failure experience.  However, 
the few failures I have had (perhaps 6-8 events now) suggest that there is 
a massive correlation in failure time with RAID1, e.g. one pair I had 
lasted perhaps 2 years and then both drives failed within 6 hours of each 
other.  I also had a bad experience with RAID 5 that wasn't being scrubbed 
regularly, and when one drive started reporting errors (i.e. lack of 
monitoring meant it had been bad for a while), the rest of the array 
turned out to be a patchwork of read errors - Linux RAID then turns out to 
be quite fragile in the presence of a small number of read failures, and 
it's extremely difficult to salvage the 99% of the array which is OK due 
to the disks getting kicked out... (Of course regular scrubs would have 
prevented getting so deep into that situation - it was a small cheap NAS 
box without such features.)


Ed W



Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-12 Thread Timo Sirainen
On 12.4.2012, at 13.58, Ed W wrote:

 The claim by ZFS/BTRFS authors and others is that data silently bit rots on 
 its own. The claim is therefore that you can have a raid1 pair where neither 
 drive reports a hardware failure, but each gives you different data? 

That's one reason why I planned on adding a checksum to each message in dbox. 
But I forgot to actually do that. I guess I could add it for new messages in 
some upcoming version. Then Dovecot could optionally verify the checksum before 
returning the message to client, and if it detects corruption perhaps 
automatically read it from some alternative location (e.g. if dsync replication 
is enabled ask from another replica). And Dovecot index files really should 
have had some small (8/16/32bit) checksums of stuff as well..



Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-12 Thread Ed W

On 12/04/2012 02:18, Stan Hoeppner wrote:

On 4/11/2012 11:50 AM, Ed W wrote:

Re XFS.  Have you been watching BTRFS recently?

I will concede that despite the authors considering it production ready
I won't be using it for my servers just yet.  However, it benchmarks
fairly similarly to XFS on single-disk tests and in certain cases
(multi-threaded performance) can be somewhat better.  I haven't yet seen
any benchmarks on larger disk arrays, eg 6+ disks, so no idea how it
scales up.  Basically what I have seen seems competitive

Links?


http://btrfs.ipv5.de/index.php?title=Main_Page#Benchmarking

See the regular Phoronix benchmarks in particular.  However, I believe 
these are all single disk?




I don't have such hardware spare to benchmark, but I would be interested
to hear from someone who benchmarks your RAID1+linear+XFS suggestion,
especially if they have compared a cutting edge btrfs kernel on the same
array?

http://btrfs.boxacle.net/repository/raid/history/History_Mail_server_simulation._num_threads=128.html

This is with an 8-wide LVM stripe over eight 17-drive hardware RAID0 arrays.
  If the disks had been set up as a concat of 68 RAID1 pairs, XFS would
have turned in numbers significantly higher, anywhere from a 100%
increase to 500%.


My instinct is that this is an irrelevant benchmark for BTRFS because
its performance characteristics for these workloads have changed so
significantly?  I would be far more interested in a 3.2 and then a
3.6/3.7 benchmark in a year's time


In particular recent benchmarks on Phoronix show btrfs exceeding XFS 
performance on heavily threaded benchmarks - however, I doubt this is 
representative of performance on a multi-disk benchmark?



It would be nice to see these folks update these
results with a 3.2.6 kernel, as both BTRFS and XFS have improved
significantly since 2.6.35.  EXT4 and JFS have seen little performance
work since.


My understanding is that there was a significant multi-thread
performance boost for EXT4 in roughly the last year?  I don't
have a link to hand, but someone did some work to reduce lock contention
(??) which I seem to recall made a very large difference on multi-user
or multi-CPU workloads?  The summary, as I remember it, was that it
allowed Ext4 to scale up to a good fraction of XFS performance on
medium-sized systems? (I believe that XFS still continues to scale far
better than anything else on large systems)


The point is that I think it's a bit unfair to say that little has changed
in Ext4 - it still seems to be developing at more than a maintenance-only pace


However, we're well OT...  The original question was: has anyone tried very
recent BTRFS on a multi-disk system?  Seems like the answer is no.  My proposal
is that it may be worth watching in the future


Cheers

Ed W

P.S.  I have always been intrigued by the idea that a COW-based
filesystem could potentially implement much faster RAID parity,
because it can avoid reading the whole stripe. The idea is that you
treat unallocated space as zero, which means you can compute the
incremental parity with only a read/write of the parity value (and
with a COW filesystem you only ever update by rewriting into freshly
zeroed space). I had in mind something like a fixed parity disk (RAID4?)
and allowing the parity disk to be write-behind cached in RAM (ie exposed
to the risk of: power fails AND a data disk fails at the same time).  My
code may not be following along for a while though...
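
To make the arithmetic concrete: because the target blocks are known to be
zero before a COW write, the new parity is just the old parity XORed with
the new data, with no need to read the rest of the stripe.  A throwaway
illustration with made-up byte values:

$ P_old=0x5a; D_new=0x3c                        # illustrative values only
$ printf 'P_new = 0x%x\n' $(( P_old ^ D_new ))  # P_new = P_old XOR D_old XOR D_new, D_old = 0
P_new = 0x66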




Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-12 Thread Ed W

On 12/04/2012 12:09, Timo Sirainen wrote:

On 12.4.2012, at 13.58, Ed W wrote:


The claim by ZFS/BTRFS authors and others is that data silently bit rots on
its own. The claim is therefore that you can have a raid1 pair where neither drive
reports a hardware failure, but each gives you different data?

That's one reason why I planned on adding a checksum to each message in dbox.
But I forgot to actually do that. I guess I could add it for new messages in
some upcoming version. Then Dovecot could optionally verify the checksum before
returning the message to the client, and if it detects corruption perhaps
automatically read it from some alternative location (e.g. if dsync replication
is enabled, ask another replica). And Dovecot index files really should
have had some small (8/16/32-bit) checksums of stuff as well...



I have to say - I haven't actually seen this happen... Do any of your
big mailstore contacts observe this, eg Rackspace, etc?


To be honest, I think it's worth thinking about the failure cases before
implementing something.  Just sticking in a checksum possibly doesn't
help anyone unless it covers the right data and lives in the right place?


Off the top of my head:
- Someone butchers the file on disk (disk error or someone edits it with vi)
- Restore of some files goes subtly wrong, eg tool tries to be clever 
and fails, snapshot taken mid-write, etc?

- Filesystem crash (sudden power loss), how to deal with partial writes?


Things I might like to do *if* there were some suitable checksums 
available:
- Use the checksum as some kind of guid either for the whole message, 
the message minus the headers, or individual mime sections
- Use the checksums to assist with replication speed/efficiency (dsync 
or custom imap commands)
- File RFCs for new imap features along the LEMONADE lines, which allow
clients to recover faster from corrupted offline states...
- Single instance storage (presumably already done, and of course this 
has some subtleties in the face of deliberate attack)
- Possibly duplicate email suppression (but really this is an LDA 
problem...)
- Storage backends where emails are redundantly stored and might not ALL 
be on a single server (find me the closest copy of email X) - 
derivations of this might be interesting for compliance archiving of 
messages?
- Fancy key-value storage backends might use checksums as part of the 
key value (either for the whole or parts of the message)


The mail server has always looked like a kind of key-value store to my
eye.  However, a traditional key-value store isn't usually optimised for
streaming reads, hence Dovecot seems like a key-value store that is
optimised for sequential, high-speed streaming access to the values...
Whilst it seems increasingly unlikely that a traditional key-value store
will work well to replace, say, mdbox, I wonder if it's not worth looking
at the replication strategies of key-value stores to see if those ideas
couldn't lead to new features for mdbox?


Cheers

Ed W



Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-12 Thread Dirk Jahnke-Zumbusch

Hi there,

 I have to say - I haven't actually seen this happen... Do any of your
 big mailstore contacts observe this, eg rackspace, etc?

Just to throw in to the discussion that with (silent) data corruption
not only the disk is involved but many other parts of your systems.
So perhaps you would like to have a look at

https://indico.desy.de/getFile.py/access?contribId=65&sessionId=42&resId=0&materialId=slides&confId=257

http://indico.cern.ch/getFile.py/access?contribId=3&sessionId=0&resId=1&materialId=paper&confId=13797

The documents are from 2007 but the principles are still the same.

Kind regards
Dirk



Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-12 Thread Stan Hoeppner
On 4/12/2012 5:58 AM, Ed W wrote:

 The claim by ZFS/BTRFS authors and others is that data silently bit
 rots on its own. The claim is therefore that you can have a raid1 pair
 where neither drive reports a hardware failure, but each gives you
 different data?

You need to read those articles again very carefully.  If you don't
understand what they mean by 1 in 10^15 bits non-recoverable read error
rate and combined probability, let me know.

And this has zero bearing on RAID1.  And RAID1 reads don't work the way
you describe above.  I explained this in some detail recently.

 I do agree that if one drive reports a read error, then it's quite easy
 to guess which half of the pair is wrong...

Been working that way for more than 2 decades, Ed. :)  Note that RAID1
has that '1' for a reason.  It was the first RAID level.  It was in
production for many, many years before parity RAID hit the market.  It is
the most well understood of all RAID levels, and the simplest.

-- 
Stan


Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-11 Thread Stan Hoeppner
On 4/10/2012 1:09 AM, Emmanuel Noobadmin wrote:
 On 4/10/12, Stan Hoeppner s...@hardwarefreak.com wrote:

 SuperMicro H8SGL G34 mobo w/dual Intel GbE, 2GHz 8-core Opteron
 32GB Kingston REG ECC DDR3, LSI 9280-4i4e, Intel 24 port SAS expander
 20 x 1TB WD RE4 Enterprise 7.2K SATA2 drives
 NORCO RPC-4220 4U 20 Hot-Swap Bays, SuperMicro 865W PSU
 All other required parts are in the Wish List.  I've not written
 assembly instructions.  I figure anyone who would build this knows what
 s/he is doing.

 Price today:  $5,376.62
 
 This price looks like something I might be able to push through

It's pretty phenomenally low considering what all you get, especially 20
enterprise class drives.

 although I'll probably have to go SATA instead of SAS due to cost of
 keeping spares.

The 10K drives I mentioned are SATA not SAS.  WD's 7.2k RE and 10k
Raptor series drives are both SATA but have RAID specific firmware,
better reliability, longer warranties, etc.  The RAID specific firmware
is why both are tested and certified by LSI with their RAID cards.

 Configuring all 20 drives as a RAID10 LUN in the MegaRAID HBA would give
 you a 10TB net Linux device and 10 stripe spindles of IOPS and
 bandwidth.  Using RAID6 would yield 18TB net and 18 spindles of read
 throughput, however parallel write throughput will be at least 3-6x
 slower than RAID10, which is why nobody uses RAID6 for transactional
 workloads.
 
 Not likely to go with RAID 5 or 6 due to concerns about the
 uncorrectable read errors risks on rebuild with large arrays. Is the

Not to mention rebuild times for large width RAID5/6.

 MegaRAID being used as the actual RAID controller or just as a HBA?

It's a top shelf RAID controller, 512MB cache, up to 240 drives, SSD
support, the works.  It's an LSI Feature Line card:
http://www.lsi.com/products/storagecomponents/Pages/6GBSATA_SASRAIDCards.aspx

The specs:
http://www.lsi.com/products/storagecomponents/Pages/MegaRAIDSAS9280-4i4e.aspx

You'll need the cache battery module for safe write caching, which I
forgot in the wish list (now added), $160:
http://www.newegg.com/Product/Product.aspx?Item=N82E16816118163&Tpk=LSIiBBU08

With your workload and RAID10 you should run with all 512MB configured
as write cache.  Linux caches all reads so using any controller cache
for reads is a waste.  Using all 512MB for write cache will increase
random write IOPS.

Note the 9280 allows up to 64 LUNs, so you can do tiered storage within
this 20 bay chassis.   For spares management you'd probably not want to
bother with two different sized drives.

I didn't mention the 300GB 10K Raptors previously due to their limited
capacity.  Note they're only $15 more apiece than the 1TB RE4 drives in
the original parts list.  For a total of $300 more you get the same 40%
increase in IOPS as the 600GB model, but you'll only have 3TB net space
after RAID10.  If 3TB is sufficient space for your needs, that extra 40%
IOPS makes this config a no-brainer.  The decreased latency of the 10K
drives will give a nice boost to VM read performance, especially when
using NFS.  Write performance probably won't be much different due to
the generous 512MB write cache on the controller.  I also forgot to
mention that with BBWC enabled you can turn off XFS barriers, which will
dramatically speed up Exim queues and Dovecot writes--all writes actually.
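
For example, something along these lines would apply that (device and mount
point are placeholders; only drop barriers when the controller cache really
is battery/flash backed):

$ mount -o inode64,nobarrier /dev/sdb1 /srv/mail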

Again, you probably don't want the spares management overhead of two
different disk types on the shelf, but you could stick these 10K 300s in
the first 16 slots, and put 2TB RE4 drives in the last 4 slots,
RAID10 on the 10K drives, RAID5 on the 2TB drives.  This yields an
8-spindle high-IOPS RAID10 of 2.4TB and a lower-performance RAID5 of 6TB
for near-line storage such as your Dovecot alt storage, VM templates,
etc--8.4TB net, 1.6TB less than the original 10TB setup.  Total
additional cost is $920 for this setup.  You'd have two XFS filesystems
(with quite different mkfs parameters).

 I have been avoiding hardware RAID because of a really bad experience
 with RAID 5 on an obsolete controller that eventually died without
 replacement and couldn't be recovered. Since then, it's always been
 RAID 1 and, after I discovered mdraid, using them as purely HBA with
 mdraid for the flexibility of being able to just pull the drives into
 a new system if necessary without having to worry about the
 controller.

Assuming you have the right connector configuration for your
drive/enclosure on the replacement card, you can usually swap out one
LSI RAID card with any other LSI RAID card in the same, or newer,
generation.  It'll read the configuration metadata from the disks and be
up and running in minutes.  This feature has been around all the way back
to the AMI/Mylex cards of the late 1990s.  LSI acquired both companies,
who were #1 and #2 in RAID, which is why LSI is so successful today.
Back in those days LSI simply supplied the ASICs to AMI and Mylex.  I
have an AMI MegaRAID 428, top of the line in 1998, lying around

Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-11 Thread Ed W

Re XFS.  Have you been watching BTRFS recently?

I will concede that despite the authors considering it production ready
I won't be using it for my servers just yet.  However, it benchmarks
fairly similarly to XFS on single-disk tests and in certain cases
(multi-threaded performance) can be somewhat better.  I haven't yet seen
any benchmarks on larger disk arrays, eg 6+ disks, so no idea how it
scales up.  Basically what I have seen seems competitive


I don't have such hardware spare to benchmark, but I would be interested 
to hear from someone who benchmarks your RAID1+linear+XFS suggestion, 
especially if they have compared a cutting edge btrfs kernel on the same 
array?


One of the snags of md RAID1 vs RAID6 is the lack of checksumming in the 
event of bad blocks.  (I'm not sure what actually happens when md 
scrubbing finds a bad sector with raid1..?).  For low performance 
requirements I have become paranoid and been using RAID6 vs RAID10, 
filesystems with sector checksums seem attractive...
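
For what it's worth, md scrubs can be driven by hand through sysfs, which at
least answers the mechanics half of that question (md0 is a placeholder; with
raid1 a mismatch is detected and counted, but md cannot tell which copy is the
correct one):

$ echo check > /sys/block/md0/md/sync_action    # scrub: count mismatches, rewrite unreadable sectors
$ cat /sys/block/md0/md/mismatch_cnt
$ echo repair > /sys/block/md0/md/sync_action   # also overwrite mismatches, picking one copy arbitrarily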


Regards

Ed W


Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-11 Thread Adrian Minta

On 04/11/12 19:50, Ed W wrote:

...
One of the snags of md RAID1 vs RAID6 is the lack of checksumming in 
the event of bad blocks.  (I'm not sure what actually happens when md 
scrubbing finds a bad sector with raid1..?).  For low performance 
requirements I have become paranoid and been using RAID6 vs RAID10, 
filesystems with sector checksums seem attractive...


RAID6 is very slow for write operations. That's why it's the worst choice
for maildir.





Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-11 Thread Charles Marcus

On 2012-04-11 4:48 PM, Adrian Minta adrian.mi...@gmail.com wrote:

On 04/11/12 19:50, Ed W wrote:

One of the snags of md RAID1 vs RAID6 is the lack of checksumming in
the event of bad blocks. (I'm not sure what actually happens when md
scrubbing finds a bad sector with raid1..?). For low performance
requirements I have become paranoid and been using RAID6 vs RAID10,
filesystems with sector checksums seem attractive...



RAID6 is very slow for write operations. That's why it's the worst choice
for maildir.


He did say 'For *low* *performance* requirements...'  ;)

--

Best regards,

Charles


Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-11 Thread Stan Hoeppner
On 4/10/2012 5:22 AM, Adrian Minta wrote:
 On 04/10/12 08:00, Stan Hoeppner wrote:
 Interestingly, I designed a COTS server back in January to handle at
 least 5k concurrent IMAP users, using best of breed components. If you
 or someone there has the necessary hardware skills, you could assemble
 this system and simply use it for NFS instead of Dovecot. The parts
 list:
 secure.newegg.com/WishList/PublicWishDetail.aspx?WishListNumber=17069985 
 
 Don't forget the Battery Backup Unit for RAID card  !!!

Heh, thanks for the reminder Adrian. :)

I got to your email a little late--already corrected the omission.  Yes,
battery or flash backup for the RAID cache is always a necessity when
doing write-back caching.

-- 
Stan




Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-11 Thread Stan Hoeppner
On 4/11/2012 11:50 AM, Ed W wrote:
 Re XFS.  Have you been watching BTRFS recently?
 
 I will concede that despite the authors considering it production ready
 I won't be using it for my servers just yet.  However, it benchmarks
 fairly similarly to XFS on single-disk tests and in certain cases
 (multi-threaded performance) can be somewhat better.  I haven't yet seen
 any benchmarks on larger disk arrays, eg 6+ disks, so no idea how it
 scales up.  Basically what I have seen seems competitive

Links?

 I don't have such hardware spare to benchmark, but I would be interested
 to hear from someone who benchmarks your RAID1+linear+XFS suggestion,
 especially if they have compared a cutting edge btrfs kernel on the same
 array?

http://btrfs.boxacle.net/repository/raid/history/History_Mail_server_simulation._num_threads=128.html

This is with an 8-wide LVM stripe over eight 17-drive hardware RAID0 arrays.
 If the disks had been set up as a concat of 68 RAID1 pairs, XFS would
have turned in numbers significantly higher, anywhere from a 100%
increase to 500%.  It's hard to say because the Boxacle folks didn't
show the XFS AG config they used.  The concat+RAID1 setup can decrease
disk seeks by many orders of magnitude vs striping.  Everyone knows as
seeks go down IOPS go up.  Even with this very suboptimal disk setup,
XFS still trounces everything but JFS which is a close 2nd.  BTRFS is
way down in the pack.  It would be nice to see these folks update these
results with a 3.2.6 kernel, as both BTRFS and XFS have improved
significantly since 2.6.35.  EXT4 and JFS have seen little performance
work since.  In fact JFS has seen no commits but bug fixes and changes
to allow compiling with recent kernels.

 One of the snags of md RAID1 vs RAID6 is the lack of checksumming in the
 event of bad blocks.  (I'm not sure what actually happens when md
 scrubbing finds a bad sector with raid1..?).  For low performance
 requirements I have become paranoid and been using RAID6 vs RAID10,
 filesystems with sector checksums seem attractive...

Except we're using hardware RAID1 here and mdraid linear.  Thus the
controller takes care of sector integrity.  RAID6 yields nothing over
RAID10, except lower performance, and more usable space if more than 4
drives are used.

-- 
Stan


Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-11 Thread Emmanuel Noobadmin
On 4/12/12, Stan Hoeppner s...@hardwarefreak.com wrote:
 On 4/11/2012 11:50 AM, Ed W wrote:
 One of the snags of md RAID1 vs RAID6 is the lack of checksumming in the
 event of bad blocks.  (I'm not sure what actually happens when md
 scrubbing finds a bad sector with raid1..?).  For low performance
 requirements I have become paranoid and been using RAID6 vs RAID10,
 filesystems with sector checksums seem attractive...

 Except we're using hardware RAID1 here and mdraid linear.  Thus the
 controller takes care of sector integrity.  RAID6 yields nothing over
 RAID10, except lower performance, and more usable space if more than 4
 drives are used.

How would the controller ensure sector integrity unless it is writing
additional checksum information to disk? I thought only a few
filesystems like ZFS do sector checksums to detect whether data
corruption has occurred. I suppose the controller could throw an error if
the two drives returned data that didn't agree with each other, but it
wouldn't know which copy is accurate, so that wouldn't protect the
integrity of the data, at least not directly, without additional human
intervention I would think.


Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-10 Thread Emmanuel Noobadmin
On 4/10/12, Stan Hoeppner s...@hardwarefreak.com wrote:
 So I have to make do with OTS commodity parts and free software for
 the most parts.

 OTS meaning you build your own systems from components?  Too few in the
 business realm do so today. :(

For the in-house stuff and budget customers, yes - in fact both the email
servers are on seconded hardware that started life as something else.
I spec HP servers for our app servers for customers who are willing to
pay for their own colocated or onsite servers, but there are still
customers who balk at the cost and so go OTS or virtualized.


 SuperMicro H8SGL G34 mobo w/dual Intel GbE, 2GHz 8-core Opteron
 32GB Kingston REG ECC DDR3, LSI 9280-4i4e, Intel 24 port SAS expander
 20 x 1TB WD RE4 Enterprise 7.2K SATA2 drives
 NORCO RPC-4220 4U 20 Hot-Swap Bays, SuperMicro 865W PSU
 All other required parts are in the Wish List.  I've not written
 assembly instructions.  I figure anyone who would build this knows what
 s/he is doing.

 Price today:  $5,376.62

This price looks like something I might be able to push through
although I'll probably have to go SATA instead of SAS due to cost of
keeping spares.

 Configuring all 20 drives as a RAID10 LUN in the MegaRAID HBA would give
 you a 10TB net Linux device and 10 stripe spindles of IOPS and
 bandwidth.  Using RAID6 would yield 18TB net and 18 spindles of read
 throughput, however parallel write throughput will be at least 3-6x
 slower than RAID10, which is why nobody uses RAID6 for transactional
 workloads.

Not likely to go with RAID 5 or 6 due to concerns about the
uncorrectable read errors risks on rebuild with large arrays. Is the
MegaRAID being used as the actual RAID controller or just as a HBA?

I have been avoiding hardware RAID because of a really bad experience
with RAID 5 on an obsolete controller that eventually died without
replacement and couldn't be recovered. Since then, it's always been
RAID 1 and, after I discovered mdraid, using them as purely HBA with
mdraid for the flexibility of being able to just pull the drives into
a new system if necessary without having to worry about the
controller.

 Both of the drives I've mentioned here are enterprise class drives,
 feature TLER, and are on the LSI MegaRAID SAS hardware compatibility
 list.  The price of the 600GB Raptor has come down considerably since I
 designed this system, or I'd have used them instead.

 Anyway, lots of option out there.  But $6,500 is pretty damn cheap for a
 quality box with 32GB RAM, enterprise RAID card, and 20x10K RPM 600GB
 drives.

 The MegaRAID 9280-4i4e has an external SFF8088 port.  For an additional
 $6,410 you could add an external Norco SAS expander JBOD chassis and 24
 more 600GB 10K RPM Raptors, for 13.2TB of total net RAID10 space, and 22
 10k spindles of IOPS performance from 44 total drives.  That's $13K for
 a 5K random IOPS, 13TB, 44 drive NFS RAID COTS server solution,
 $1000/TB, $2.60/IOPS.  Significantly cheaper than an HP, Dell, IBM
 solution of similar specs, each of which will set you back at least 20
 large.

Would this setup work well too for serving up VM images? I've been
trying to find a solution for the virtualized app servers' images as
well, but the distributed FSes currently all seem bad at random
reads/writes. XFS seems to be good with large files like DB
and VM images with random internal writes/reads, so given my time
constraints, it would be nice to have a single configuration that
works generally well for all the needs I have to oversee.

 Note the chassis I've spec'd have single PSUs, not the dual or triple
 redundant supplies you'll see on branded hardware.  With a relatively
 stable climate controlled environment and a good UPS with filtering,
 quality single supplies are fine.  In fact, in the 4U form factor single
 supplies are usually more reliable due to superior IC packaging and
 airflow through the heatsinks, not to mention much quieter.

Same reason I do my best to avoid 1U servers: the space/heat issues
worry me. Yes, I'm guilty of worrying too much, but that has saved me
on several occasions.


Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-10 Thread Adrian Minta

On 04/10/12 08:00, Stan Hoeppner wrote:
Interestingly, I designed a COTS server back in January to handle at 
least 5k concurrent IMAP users, using best of breed components. If you 
or someone there has the necessary hardware skills, you could assemble 
this system and simply use it for NFS instead of Dovecot. The parts 
list: 
secure.newegg.com/WishList/PublicWishDetail.aspx?WishListNumber=17069985 


Don't forget the Battery Backup Unit for RAID card  !!!



Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-09 Thread Emmanuel Noobadmin
On 4/9/12, Stan Hoeppner s...@hardwarefreak.com wrote:
 So it seems you have two courses of action:
 1.  Identify individual current choke points and add individual systems
 and storage to eliminate those choke points.

 2.  Analyze your entire workflow and all systems, identifying all choke
 points, then design a completely new well integrated storage
 architecture that solves all current problems and addresses future needs

I started to do this and realized I have a serious mess on hand that
makes delving into other people's uncommented source code seem like a
joy :D

Management added to this by deciding that if we're going to offload the
email storage to network storage, we might as well consolidate
everything into that shared storage system so we don't have TBs of
unutilized space. So I might not even be able to use your tested XFS
+ concat solution since it may not be optimal for VM images and
databases.

As the requirements have changed, I'll stop asking here as it's no longer
really relevant just for Dovecot purposes.

 You are a perfect candidate for VMware ESX.  The HA feature will do
 exactly what you want.  If one physical node in the cluster dies, HA
 automatically restarts the dead VMs on other nodes, transparently.
 Clients will have to reestablish connections, but everything else
 will pretty much be intact.  Worst-case scenario will possibly be a few
 corrupted mailboxes that were being written when the hardware crashed.

 A SAN is required for such a setup.

Thanks for the suggestion; I will need to find some time to look into
this, although we've mostly been using KVM for virtualization so far.
The SAN part will probably prevent this from being accepted due to
cost, though.

 My lame excuse is that I'm just the web
 dev who got caught holding the server admin potato.

 Baptism by fire.  Ouch.  What doesn't kill you makes you stronger. ;)

True, but I'd hate to be the customer who gets to pick up the pieces
when things explode due to unintended negligence by a dev trying to
level up by multi-classing as an admin.

 physical network interface.  You can do some of these things with free
 Linux hypervisors, but AFAIK the poor management interfaces for them
 make the price of ESX seem like a bargain.

Unfortunately, with the usual kind of customers we have here, spending that
kind of budget isn't justifiable. The only reason we're providing
email services is because customers wanted freebies and they felt
there was no reason why we couldn't give them email on our servers; they
are all servers, after all.

So I have to make do with OTS commodity parts and free software for
the most parts.


Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-09 Thread Stan Hoeppner
On 4/9/2012 2:15 PM, Emmanuel Noobadmin wrote:

 Unfortunately, with the usual kind of customers we have here, spending that
 kind of budget isn't justifiable. The only reason we're providing
 email services is because customers wanted freebies and they felt
 there was no reason why we couldn't give them email on our servers; they
 are all servers, after all.
 
 So I have to make do with OTS commodity parts and free software for
 the most parts.

OTS meaning you build your own systems from components?  Too few in the
business realm do so today. :(

It sounds like budget overrides redundancy then.  You can do an NFS
cluster with SAN and GFS2, or two servers with their own storage and
DRBD mirroring.  Here's how to do the latter:
http://www.howtoforge.com/high_availability_nfs_drbd_heartbeat

The total cost is about the same for each solution as an iSCSI SAN array
of drive count X is about the same cost as two JBOD disk arrays of count
X*2.  Redundancy in this case is expensive no matter the method.  Given
how infrequent host failures are, and the fact your storage is
redundant, it may make more sense to simply keep spare components on
hand and swap what fails--PSU, mobo, etc.

Interestingly, I designed a COTS server back in January to handle at
least 5k concurrent IMAP users, using best of breed components.  If you
or someone there has the necessary hardware skills, you could assemble
this system and simply use it for NFS instead of Dovecot.  The parts list:
secure.newegg.com/WishList/PublicWishDetail.aspx?WishListNumber=17069985

In case the link doesn't work, the core components are:

SuperMicro H8SGL G34 mobo w/dual Intel GbE, 2GHz 8-core Opteron
32GB Kingston REG ECC DDR3, LSI 9280-4i4e, Intel 24 port SAS expander
20 x 1TB WD RE4 Enterprise 7.2K SATA2 drives
NORCO RPC-4220 4U 20 Hot-Swap Bays, SuperMicro 865W PSU
All other required parts are in the Wish List.  I've not written
assembly instructions.  I figure anyone who would build this knows what
s/he is doing.

Price today:  $5,376.62

Configuring all 20 drives as a RAID10 LUN in the MegaRAID HBA would give
you a 10TB net Linux device and 10 stripe spindles of IOPS and
bandwidth.  Using RAID6 would yield 18TB net and 18 spindles of read
throughput, however parallel write throughput will be at least 3-6x
slower than RAID10, which is why nobody uses RAID6 for transactional
workloads.

If you need more transactional throughput you could use 20 WD6000HLHX
600GB 10K RPM WD Raptor drives.  You'll get 40% more throughput and 6TB
net space with RAID10.  They'll cost you $1200 more, or $6,576.62 total.
 Well worth the $1200 for 40% more throughput, if 6TB is enough.

Both of the drives I've mentioned here are enterprise class drives,
feature TLER, and are on the LSI MegaRAID SAS hardware compatibility
list.  The price of the 600GB Raptor has come down considerably since I
designed this system, or I'd have used them instead.

Anyway, lots of option out there.  But $6,500 is pretty damn cheap for a
quality box with 32GB RAM, enterprise RAID card, and 20x10K RPM 600GB
drives.

The MegaRAID 9280-4i4e has an external SFF8088 port.  For an additional
$6,410 you could add an external Norco SAS expander JBOD chassis and 24
more 600GB 10K RPM Raptors, for 13.2TB of total net RAID10 space, and 22
10k spindles of IOPS performance from 44 total drives.  That's $13K for
a 5K random IOPS, 13TB, 44 drive NFS RAID COTS server solution,
$1000/TB, $2.60/IOPS.  Significantly cheaper than an HP, Dell, IBM
solution of similar specs, each of which will set you back at least 20
large.

Note the chassis I've spec'd have single PSUs, not the dual or triple
redundant supplies you'll see on branded hardware.  With a relatively
stable climate controlled environment and a good UPS with filtering,
quality single supplies are fine.  In fact, in the 4U form factor single
supplies are usually more reliable due to superior IC packaging and
airflow through the heatsinks, not to mention much quieter.

-- 
Stan


Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-08 Thread Stan Hoeppner
On 4/7/2012 9:43 AM, Emmanuel Noobadmin wrote:
 On 4/7/12, Stan Hoeppner s...@hardwarefreak.com wrote:
 
 Firstly, thanks for the comprehensive reply. :)
 
 I'll assume networked storage nodes means NFS, not FC/iSCSI SAN, in
 which case you'd have said SAN.
 
 I haven't decided on that but it would either be NFS or iSCSI over
 Gigabit. I don't exactly get a big budget for this. iSCSI because I
 planned to do md/mpath over two separate switches so that if one
 switch explodes, the email service would still work.

So it seems you have two courses of action:

1.  Identify individual current choke points and add individual systems
and storage to eliminate those choke points.

2.  Analyze your entire workflow and all systems, identifying all choke
points, then design a completely new well integrated storage
architecture that solves all current problems and addresses future needs

Adding an NFS server and moving infrequently accessed (old) files to
alternate storage will alleviate your space problems.  But it will
probably not fix some of the other problems you mention, such as servers
bogging down and becoming unresponsive, as that's not a space issue.
The cause of that would likely be an IOPS issue, meaning you don't have
enough storage spindles to service requests in a timely manner.

 Less complexity and cost is always better.  CPU throughput isn't a
 factor in mail workloads--it's all about IO latency.  A 1U NFS server
 with 12 drive JBOD is faster, cheaper, easier to setup and manage, sucks
 less juice and dissipates less heat than 4 1U servers each w/ 4 drives.
 
 My worry is that if that one server dies, everything is dead. With at
 least a pair of servers, I could keep it running, or if necessary,
 restore the accounts on the dead servers from backup, make some config
 changes and have everything back running while waiting for replacement
 hardware.

You are a perfect candidate for VMware ESX.  The HA feature will do
exactly what you want.  If one physical node in the cluster dies, HA
automatically restarts the dead VMs on other nodes, transparently.
Clients will have to reestablish connections, but everything else
will pretty much be intact.  Worst-case scenario will possibly be a few
corrupted mailboxes that were being written when the hardware crashed.

A SAN is required for such a setup.  I had extensive experience with ESX
and HA about 5 years ago and it works as advertised.  After 5 years it
can only have improved.  It's not cheap but usually pays for itself
due to being able to consolidate the workload of dozens of physical
servers into just 2 or 3 boxes.

  I don't recall seeing your user load or IOPS requirements so I'm making
 some educated guesses WRT your required performance and total storage.
 
 I'm embarrassed to admit I don't have hard numbers on the user load
 except the rapidly dwindling disk space count and the fact that when the
 web-based mail application tries to list and check disk quota, it can
 bring the servers to a crawl.

Maybe a description of your current hardware setup and total number of
users/mailboxes would be a good starting point.  How many servers do you
have, what storage is connected to each, what percentage of MUA POP/IMAP
connections come from user PCs versus from webmail applications, etc, etc.

Probably the single most important piece of information would be the
hardware specs of your current Dovecot server, CPUs/RAM/disk array, etc,
and what version of Dovecot you're running.

The focus of your email is building a storage server strictly to offload
old mail and free up space on the Dovecot server.  From the sound of
things, this may not be sufficient to solve all your problems.

 My lame excuse is that I'm just the web
 dev who got caught holding the server admin potato.

Baptism by fire.  Ouch.  What doesn't kill you makes you stronger. ;)

 is nearly irrelevant for a mail workload, you can see it's much cheaper
 to scale capacity and IOPS with a single node w/fat storage than with
 skinny nodes w/thin storage.  Ok, so here's the baseline config I threw
 together:
 
 One of my concerns is that heavy IO on the same server slows the overall
 performance even though the theoretical IOPS of the total drives are
 the same on 1 and on X servers. Right now, the servers are usually
 screeching to a halt, to the point of even locking out SSH access due
 to IOWait sending the load in top to triple digits.

If multiple servers are screeching to a halt due to iowait, either all
of your servers' individual disks are overloaded, or you already have
shared storage.  We really need more info on your current architecture.
 Right now we don't know if we're talking about 4 servers or 40, 100
users or 10,000.

 Some host failure redundancy is about all you'd gain from the farm
 setup.  Dovecot shouldn't barf due to one NFS node being down, only
 hiccup.  I.e. only imap process accessing files on the downed node would
 have trouble.
 
 But if I only have one big storage 

Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-07 Thread Stan Hoeppner
On 4/5/2012 3:02 PM, Emmanuel Noobadmin wrote:

Hi Emmanuel,

 I'm trying to improve the setup of our Dovecot/Exim mail servers to
 handle the increasingly huge accounts (everybody thinks it's like
 infinitely growing storage like gmail and stores everything forever in
 their email accounts) by changing from Maildir to mdbox, and to take
 advantage of offloading older emails to alternative networked storage
 nodes.

I'll assume networked storage nodes means NFS, not FC/iSCSI SAN, in
which case you'd have said SAN.

 The question now is whether having a single large server or will a
 number of 1U servers with the same total capacity be better? 

Less complexity and cost is always better.  CPU throughput isn't a
factor in mail workloads--it's all about IO latency.  A 1U NFS server
with 12 drive JBOD is faster, cheaper, easier to setup and manage, sucks
less juice and dissipates less heat than 4 1U servers each w/ 4 drives.
 I don't recall seeing your user load or IOPS requirements so I'm making
some educated guesses WRT your required performance and total storage.
I came up with the following system that should be close to suitable,
for ~$10k USD.  The 4 node system runs ~$12k USD.  At $2k more, that isn't
substantially higher.  But when we double the storage of each
architecture we're at ~$19k, vs ~$26k for an 8 node cluster, a
difference of ~$7k.  That's $1k shy of another 12 disk JBOD.  Since CPU
is nearly irrelevant for a mail workload, you can see it's much cheaper
to scale capacity and IOPS with a single node w/fat storage than with
skinny nodes w/thin storage.  Ok, so here's the baseline config I threw
together:

http://h10010.www1.hp.com/wwpc/us/en/sm/WF06b/15351-15351-3328412-241644-3328421-4091396-4158470-4158440.html?dnr=1
8 cores is plenty, 2 boot drives mirrored on B110i, 16GB (4x4GB)
http://www.lsi.com/products/storagecomponents/Pages/LSISAS9205-8e.aspx
http://h10010.www1.hp.com/wwpc/us/en/sm/WF06b/12169-304616-3930445-3930445-3930445-3954787-4021626-4021628.html?dnr=1
w/ 12 2TB 7.2K SATA drives, configured as md concat+RAID1 pairs with 12
allocation groups, 12TB usable.  Format the md device with the defaults:

$ mkfs.xfs /dev/md0

Mount with inode64.  No XFS stripe alignment to monkey with.  No md
chunk size or anything else to worry about.  XFS' allocation group
design is pure elegance here.

If 12 TB isn't sufficient, or if you need more space later, you can
daisy chain up to 3 additional D2600 JBODs for ~$8500 USD each, just add
cables.  This quadruples IOPS, throughput, and capacity--96TB total,
48TB net.  Simply create 6 more mdraid1 devices and grow the linear
array with them.  Then do an xfs_growfs to bring the extra 12TB of free
space into the filesystem.
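
Roughly, once the new RAID1 pairs exist (say /dev/md7; device names and mount
point are placeholders):

$ mdadm --grow /dev/md0 --add /dev/md7   # append the new pair to the linear array
$ xfs_growfs /srv/mail                   # grow XFS into the new free space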

If you're budget conscious and/or simply prefer quality inexpensive
whitebox/DIY type gear, as I do, you can get 24 x 2TB drives in one JBOD
chassis for $7400 USD.  That's twice the drives, capacity, and IOPS, for
~$2500 less than the HP JBOD.  And unlike the HP 'enterprise SATA'
drives, the 2TB WD Black series have a 5 year warranty, and work great
with mdraid.  Chassis and drives at Newegg:

http://www.newegg.com/Product/Product.aspx?Item=N82E16816133047
http://www.newegg.com/Product/Product.aspx?Item=N82E16822136792

You can daisy chain 3 of these off one HBA SFF8088 port, 6 total on our
LSI 9205-8e above, for a total of 144 2TB drives, 72 effective spindles
in our concat+RAID1 setup, 144TB net space.

 Will be
 using RAID 1 pairs, likely XFS based on reading Hoeppner's
 recommendation on this and the mdadm list.

To be clear, the XFS configuration I recommend/promote for mailbox
storage is very specific and layered.  The layers must all be used
together to get the performance.  These layers consist of using multiple
hardware or software RAID1 pairs and concatenating them with an md
linear array.  You then format that md device with the XFS defaults, or
a specific agcount if you know how to precisely tune AG layout based on
disk size and your anticipated concurrency level of writers.
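
A minimal sketch of that layering with just 4 disks (device names are
placeholders; a real deployment would use many more pairs):

$ mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
$ mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdd /dev/sde
$ mdadm --create /dev/md0 --level=linear --raid-devices=2 /dev/md1 /dev/md2
$ mkfs.xfs /dev/md0     # defaults, or -d agcount=N for a hand-tuned AG layout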

Putting XFS on a single RAID1 pair, as you seem to be describing above
for the multiple thin node case, and hitting one node with parallel
writes to multiple user mail dirs, you'll get less performance than
EXT3/4 on that mirror pair--possibly less than half, depending on the
size of the disks and thus the number of AGs created.  The 'secret' to
XFS performance with this workload is concatenation of spindles.
Without it you can't spread the AGs--thus directories, thus parallel
file writes--horizontally across the spindles--and this is the key.  By
spreading AGs 'horizontally' across the disks in a concat, instead of
'vertically' down a striped array, you accomplish two important things:

1.  You dramatically reduce disk head seeking by using the concat array.
 With XFS on a RAID10 array of 24 2TB disks you end up with 24 AGs
evenly spaced vertically down each disk in the array, following the
stripe pattern.  Each user mailbox is stored in a different directory.

Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-07 Thread Emmanuel Noobadmin
On 4/7/12, Stan Hoeppner s...@hardwarefreak.com wrote:

Firstly, thanks for the comprehensive reply. :)

 I'll assume networked storage nodes means NFS, not FC/iSCSI SAN, in
 which case you'd have said SAN.

I haven't decided on that but it would either be NFS or iSCSI over
Gigabit. I don't exactly get a big budget for this. iSCSI because I
planned to do md/mpath over two separate switches so that if one
switch explodes, the email service would still work.

 Less complexity and cost is always better.  CPU throughput isn't a
 factor in mail workloads--it's all about IO latency.  A 1U NFS server
 with 12 drive JBOD is faster, cheaper, easier to setup and manage, sucks
 less juice and dissipates less heat than 4 1U servers each w/ 4 drives.

My worry is that if that one server dies, everything is dead. With at
least a pair of servers, I could keep it running, or if necessary,
restore the accounts on the dead servers from backup, make some config
changes and have everything back running while waiting for replacement
hardware.

  I don't recall seeing your user load or IOPS requirements so I'm making
 some educated guesses WRT your required performance and total storage.

I'm embarrassed to admit I don't have hard numbers on the user load
except the rapidly dwindling disk space count and the fact that when the
web-based mail application tries to list and check disk quota, it can
bring the servers to a crawl. My lame excuse is that I'm just the web
dev who got caught holding the server admin potato.

 is nearly irrelevant for a mail workload, you can see it's much cheaper
 to scale capacity and IOPS with a single node w/fat storage than with
 skinny nodes w/thin storage.  Ok, so here's the baseline config I threw
 together:

One of my concerns is that heavy IO on the same server slows the overall
performance even though the theoretical IOPS of the total drives are
the same on 1 and on X servers. Right now, the servers are usually
screeching to a halt, to the point of even locking out SSH access due
to IOWait sending the load in top to triple digits.


 Some host failure redundancy is about all you'd gain from the farm
 setup.  Dovecot shouldn't barf due to one NFS node being down, only
 hiccup.  I.e. only imap process accessing files on the downed node would
 have trouble.

But if I only have one big storage node and it went down, Dovecot
would barf, wouldn't it?
Or would the mdbox format mean Dovecot would still use the local
storage, just that users can't access the offloaded messages?

 Also, I could possibly arrange them in a sort
 of network raid 1 to gain redundancy over single machine failure.

 Now you're sounding like Charles Marcus, but worse. ;)  Stay where you
 are, and brush your hair away from your forehead.  I'm coming over with
 my branding iron that says K.I.S.S

Lol, I have no idea who Charles is, but I always feel safer if there
is some kind of backup. Especially since I don't have the time to
dedicate myself to server administration; by the time I notice
something is bad, it might be too late for anything but the backup.

Of course management and clients don't agree with me since
backup/redundancy costs money. :)


Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-07 Thread Robin



Putting XFS on a single RAID1 pair, as you seem to be describing above
for the multiple thin node case, and hitting one node with parallel
writes to multiple user mail dirs, you'll get less performance than
EXT3/4 on that mirror pair--possibly less than half, depending on the
size of the disks and thus the number of AGs created.  The 'secret' to
XFS performance with this workload is concatenation of spindles.
Without it you can't spread the AGs--thus directories, thus parallel
file writes--horizontally across the spindles--and this is the key.  By
spreading AGs 'horizontally' across the disks in a concat, instead of
'vertically' down a striped array, you accomplish two important things:

1.  You dramatically reduce disk head seeking by using the concat array.
  With XFS on a RAID10 array of 24 2TB disks you end up with 24 AGs
evenly spaced vertically down each disk in the array, following the
stripe pattern.  Each user mailbox is stored in a different directory.
Each directory was created in a different AG.  So if you have 96 users
writing their dovecot index concurrently, you have at worst case a
minimum 192 head movements occurring back and forth across the entire
platter of each disk, and likely not well optimized by TCQ/NCQ.  Why 192
instead of 96?  The modification time in the directory metadata must be
updated for each index file, among other things.


Does the XFS allocator automatically distribute AGs in this way even 
when disk usage is extremely light, i.e., a freshly formatted system with
user directories initially created, and then the actual mailbox contents 
copied into them?


If this is indeed the case, then what you describe is a wondrous 
revelation, since you're scaling out the number of simultaneous metadata 
reads+writes/second as you add RAID1 pairs, if my understanding of this 
is correct.  I'm assuming of course, but should look at the code, that 
metadata locks imposed by the filesystem distribute as the number of 
pairs increases - if it's all just one Big Lock, then that wouldn't be
the case.


Forgive my laziness, as I could just experiment and take a look at the 
on-disk structures myself, but I don't have four empty drives handy to 
experiment.


The bandwidth improvements due to striping (RAID0/5/6 style) are no help
for metadata-intensive IO loads, and probably of little value even for
mdbox loads, I suspect, unless the mdbox max size is set to
something pretty large, no?


Have you tried other filesystems and seen if they distribute metadata in 
a similarly efficient and scalable manner across concatenated drive sets?


Is there ANY point to using striping at all, a la RAID10 in this?  I'd 
have thought just making as many RAID1 pairs out of your drives as 
possible would be the ideal strategy - is this not the case?


=R=


Re: [Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

2012-04-07 Thread Stan Hoeppner
On 4/7/2012 3:45 PM, Robin wrote:
 
 Putting XFS on a single RAID1 pair, as you seem to be describing above
 for the multiple thin node case, and hitting one node with parallel
 writes to multiple user mail dirs, you'll get less performance than
 EXT3/4 on that mirror pair--possibly less than half, depending on the
 size of the disks and thus the number of AGs created.  The 'secret' to
 XFS performance with this workload is concatenation of spindles.
 Without it you can't spread the AGs--thus directories, thus parallel
 file writes--horizontally across the spindles--and this is the key.  By
 spreading AGs 'horizontally' across the disks in a concat, instead of
 'vertically' down a striped array, you accomplish two important things:

 1.  You dramatically reduce disk head seeking by using the concat array.
   With XFS on a RAID10 array of 24 2TB disks you end up with 24 AGs
 evenly spaced vertically down each disk in the array, following the
 stripe pattern.  Each user mailbox is stored in a different directory.
 Each directory was created in a different AG.  So if you have 96 users
 writing their dovecot index concurrently, you have at worst case a
 minimum 192 head movements occurring back and forth across the entire
 platter of each disk, and likely not well optimized by TCQ/NCQ.  Why 192
 instead of 96?  The modification time in the directory metadata must be
 updated for each index file, among other things.
 
 Does the XFS allocator automatically distribute AGs in this way even
 when disk usage is extremely light, i.e., a freshly formatted system with
 user directories initially created, and then the actual mailbox contents
 copied into them?

It doesn't distribute AGs.  There are a static number created during
mkfs.xfs.  The inode64 allocator round-robins new directory creation
across the AGs, and does the same with files created in those
directories.  Having the directory metadata and file extents in the same
AG decreases head movement and thus seek latency for mixed
metadata/extent high IOPS workloads.
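
The static AG count chosen at mkfs time can be checked afterwards with
xfs_info (the mount point is a placeholder):

$ xfs_info /srv/mail | grep agcount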

 If this is indeed the case, then what you describe is a wondrous
 revelation, since you're scaling out the number of simultaneous metadata
 reads+writes/second as you add RAID1 pairs, if my understanding of this
 is correct.  

Correct.  And adding more space and IOPS is uncomplicated.  No chunk
calculations, no restriping of the array.  You simply grow the md linear
array by adding the new disk device.  Then grow XFS to add the new free
space to the filesystem.  AFAIK this can be done infinitely,
theoretically.  I'm guessing md has a device count limit somewhere.  If
not, your bash line buffer might. ;)

 I'm assuming of course, but should look at the code, that
 metadata locks imposed by the filesystem distribute as the number of
 pairs increases - if it's all just one Big Lock, then that wouldn't be
 the case.

XFS locking is done as minimally as possible and is insanely fast.  I've
not come across any reported performance issues relating to it.  And
yes, any single metadata lock will occur in a single AG on one mirror
pair using the concat setup.

 Forgive my laziness, as I could just experiment and take a look at the
 on-disk structures myself, but I don't have four empty drives handy to
 experiment.

Don't sweat it.  All of this stuff is covered in the XFS Filesystem
Structure Guide, exciting reading if you enjoy a root canal while
watching snails race:
http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/index.html

 The bandwidth improvements due to striping (RAID0/5/6 style) are no help
 for metadata-intensive IO loads, and probably of little value even for
 mdbox loads, I suspect, unless the mdbox max size is set to
 something pretty large, no?

The problem with striped parity RAID is not allocation, which takes
place in free space and is pretty fast.  The problem is the extra read
seeks and bandwidth of the RMW cycle when you modify an existing stripe.
 Updating a single flag in a Dovecot index causes md or the hardware
RAID controller to read the entire stripe into buffer space or RAID
cache, modify the flag byte, recalculate parity, then write the whole
stripe and parity block back out across all the disks.

With a linear concat of RAID1 pairs we're simply rewriting a single 4KB
filesystem block, maybe only a single 512B sector.  I'm at the edge of
my knowledge here.  I don't know exactly how Timo does the index
updates.  Regardless of the method, the index update is light years
faster with the concat setup as there is no RMW and full stripe
writeback as with the RAID5/6 case.
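
Back-of-the-envelope numbers (chunk size and drive count are purely
illustrative) show the scale of the difference:

$ chunk_kib=64; data_disks=14            # e.g. a 16-drive RAID6 with 64KiB chunks
$ echo $(( chunk_kib * data_disks ))KiB  # data handled per full-stripe RMW
896KiB
# ...versus rewriting a single 4KiB block (or 512B sector) on one RAID1 pair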

 Have you tried other filesystems and seen if they distribute metadata in
 a similarly efficient and scalable manner across concatenated drive sets?

EXT, any version, does not.  ReiserFS does not.  Both require disk
striping to achieve any parallelism.  With concat they both simply start
writing at the beginning sectors of the first RAID1 pair and 4 years
later maybe reach the last pair as