Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-02 Thread Richard Elling
I hate to drag this thread on, but...

Erik Trimble wrote:
> OK, we cut off this thread now.
>
>
> Bottom line here is that when it comes to making statements about SATA
> vs SAS, there are ONLY two statements which are currently absolute:
>
> (1)  a SATA drive has better GB/$ than a SAS drive
>   

In general, yes.

> (2)  a SAS drive has better throughput and IOPs than a SATA drive
>   

Disagree.  We proved that the transport-layer protocol has no bearing
on throughput or IOPS.  Several vendors offer drives which are
identical in all respects except for the transport-layer protocol,
SAS or SATA: you can choose either and the performance remains the same.

> This is comparing apples to apples (that is, drives in the same
> generation, available at the same time):
>   

Add same size.  One common mistake in this thread is comparing
3.5" SATA disks against 2.5" SAS disks -- capacity, power consumption,
and seek time all change significantly with the disk diameter.
 -- richard



Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-02 Thread Marc Bevand
Erik Trimble <...@Sun.COM> writes:
> 
> Bottom line here is that when it comes to making statements about SATA
> vs SAS, there are ONLY two statements which are currently absolute:
> 
> (1)  a SATA drive has better GB/$ than a SAS drive
> (2)  a SAS drive has better throughput and IOPs than a SATA drive

Yes, and to represent statements (1) and (2) in a more exhaustive table:

  Best X per Y | Dollar    Watt    Rack Unit (or "per drive")
  -------------+---------------------------------------------
  Capacity     | SATA(1)   SATA    SATA
  Throughput   | SATA      SAS     SAS(2)
  IOPS         | SATA      SAS     SAS(2)

If (a) people understood that each of these nine performance numbers can be
measured independently of one another, and (b) knew which of them matter
for a given workload (very often several do, so a compromise has to be
made), then there would be no more circular SATA vs. SAS debates.
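
As a hedged illustration of how those nine numbers fall out of basic drive
specs, here is a tiny Python sketch; every figure in it -- prices, power,
capacity, IOPS, throughput, drives per rack unit -- is an assumption for
illustration, not vendor data.

# Hypothetical spec sheets; every number below is an assumption for illustration.
drives = {
    "SATA 7.2k": dict(price_usd=120, watts=8.0, capacity_tb=1.0,
                      iops=80, throughput_mbs=90, per_rack_unit=12),
    "SAS 15k":   dict(price_usd=300, watts=14.0, capacity_tb=0.3,
                      iops=180, throughput_mbs=140, per_rack_unit=12),
}

for name, d in drives.items():
    for metric, value in (("TB", d["capacity_tb"]),
                          ("MB/s", d["throughput_mbs"]),
                          ("IOPS", d["iops"])):
        # The three denominators from the table: dollar, watt, rack unit.
        print(f"{name:10s} {metric:5s}"
              f"  per $: {value / d['price_usd']:.3f}"
              f"  per W: {value / d['watts']:.3f}"
              f"  per RU: {value * d['per_rack_unit']:.1f}")

Run it with any numbers you trust; the point is only that each of the nine
ratios is a separate measurement and can favor a different drive.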

-marc




Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-02 Thread Erik Trimble
OK, we cut off this thread now.


Bottom line here is that when it comes to making statements about SATA
vs SAS, there are ONLY two statements which are currently absolute:

(1)  a SATA drive has better GB/$ than a SAS drive

(2)  a SAS drive has better throughput and IOPs than a SATA drive

This is comparing apples to apples (that is, drives in the same
generation, available at the same time):



ANY OTHER RANKING depends on your prioritization of cost, serviceability
(i.e. vendor support), throughput, IOPS, space, power, and redundancy.  


When we have this discussion, PLEASE, folks, ASK SPECIFICALLY ABOUT YOUR
CONFIGURATION - that is, to those asking the initial question, SPECIFY
your ranking of the above criteria. Otherwise, to all those on this
list, DON'T ANSWER - we constantly devolve into tit-for-tat otherwise.



That's it, folks.


-Erik




On Thu, 2008-10-02 at 13:39 -0500, Tim wrote:
> 
> 
> On Thu, Oct 2, 2008 at 11:43 AM, Joerg Schilling
> <[EMAIL PROTECTED]> wrote:
> 
> 
> 
> You seem to misunderstand drive physics.
> 
> With modern drives, seek time is not the dominating factor. It is the
> rotational latency that matters, and that is indeed proportional to
> 1/rotational speed.
> 
> On the other hand, you misunderstood another important fact of drive
> physics:
> 
> The sustained transfer speed of a drive is proportional to the linear
> data density on the medium.
> 
> The third mistake you make is that you confuse the effects of the drive
> interface type with the effects of different drive geometry. The only
> coincidence here is that the drive geometry is typically updated more
> frequently for SATA drives than it is for SAS drives. This way, you
> benefit from the higher data density of a recent SATA drive and get a
> higher sustained data rate.
> 
> BTW: I am not saying it makes no sense to buy SAS drives, but it makes
> sense to look at _all_ important parameters. Power consumption is a
> really important issue here, and the reduced MTBF from using more disks
> is another one.
> 
> Jörg
> 
> --
> 
> Please, give me a list of enterprises currently using SATA drives for
> their database workloads, vmware workloads... hell any workload
> besides email archiving, lightly used cifs shares, or streaming
> sequential transfers of large files.  I'm glad you can sit there with
> a spec sheet and tell me how you think things are going to work.  I
> can tell you from real-life experience you're not even remotely
> correct in your assumptions.
> 
> --Tim 
> 
> 
> 
-- 
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)



Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-02 Thread Nicolas Williams
On Thu, Oct 02, 2008 at 12:46:59PM -0500, Bob Friesenhahn wrote:
> On Thu, 2 Oct 2008, Ahmed Kamal wrote:
> > What is the "real/practical" possibility that I will face data loss during
> > the next 5 years for example ? As storage experts please help me
> > interpret whatever numbers you're going to throw, so is it a "really really
> > small chance", or would you be worried about it ?
> 
> No one can tell you the answer to this.  Certainly math can be 
> formulated which shows that the pyramids in Egypt will be completely 
> flattened before you experience data loss.  Obviously that math has 
> little value once it exceeds the lifetime of the computer by many 
> orders of magnitude.
> 
> Raidz2 is about as good as it gets on paper.  Bad things can still 
> happen which are unrelated to what raidz2 protects against or are a 
> calamity which impacts all the hardware in the system.  If you care 
> about your data, make sure that you have a working backup system.

Bad things like getting all your drives from a bad drive batch, so they
all fail near simultaneously and much sooner than you'd expect.

As always, do backup your data.


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-02 Thread Al Hopper
On Thu, Oct 2, 2008 at 12:46 PM, Bob Friesenhahn
<[EMAIL PROTECTED]> wrote:
> On Thu, 2 Oct 2008, Ahmed Kamal wrote:
>>
>> What is the "real/practical" possibility that I will face data loss during
>> the next 5 years for example ? As storage experts please help me
>> interpret whatever numbers you're going to throw, so is it a "really really
>> small chance", or would you be worried about it ?
>
> No one can tell you the answer to this.  Certainly math can be
> formulated which shows that the pyramids in Egypt will be completely
> flattened before you experience data loss.  Obviously that math has
> little value once it exceeds the lifetime of the computer by many
> orders of magnitude.
>
> Raidz2 is about as good as it gets on paper.  Bad things can still
> happen which are unrelated to what raidz2 protects against or are a
> calamity which impacts all the hardware in the system.  If you care
> about your data, make sure that you have a working backup system.
>

Agreed.  But make that an offsite backup (Amazon S3 comes to mind).

We are already at the point in the evolution of computer technology
where human error is a far more likely cause of data loss.
Or a water leak over the weekend, etc.

Regards,

-- 
Al Hopper  Logical Approach Inc,Plano,TX [EMAIL PROTECTED]
   Voice: 972.379.2133 Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-02 Thread Al Hopper
On Thu, Oct 2, 2008 at 3:09 AM, Marc Bevand <[EMAIL PROTECTED]> wrote:
> Erik Trimble <...@Sun.COM> writes:
>> Marc Bevand wrote:
>> > 7500rpm (SATA) drives clearly provide the best TB/$, throughput/$, IOPS/$.
>> > You can't argue against that. To paraphrase what was said earlier in this
>> > thread, to get the best IOPS out of $1000, spend your money on 10 7500rpm
>> > (SATA) drives instead of 3 or 4 15000rpm (SAS) drives. Similarly, for the
>> > best IOPS/RU, 15000rpm drives have the advantage. Etc.
>>
>> Be very careful about that. 73GB SAS drives aren't that expensive, so
>> you can get 6 x 73GB 15k SAS drives for the same amount as 11 x 250GB
>> SATA drives (per Sun list pricing for J4200 drives).  SATA doesn't
>> always win the IOPS/$.   Remember, a SAS drive can provide more than 2x
>> the number of IOPs a SATA drive can.
>
> Well let's look at a concrete example:
> - cheapest 15k SAS drive (73GB): $180 [1]
> - cheapest 7.2k SATA drive (160GB): $40 [2] (not counting a 80GB at $37)
> The SAS drive most likely offers 2x-3x the IOPS/$. Certainly not 180/40=4.5x
>
> [1] http://www.newegg.com/Product/Product.aspx?Item=N82E16822116057
> [2] http://www.newegg.com/Product/Product.aspx?Item=N82E16822136075
>

But Marc - that scenario is not realistic.  Your budget drives won't
come with a 5-year warranty and the MTBF to go along with that
warranty.  To level the playing field, let's narrow it down to:

a) drives with a 5 year warranty
b) drives that are designed for continuous 7x24 operation

In the case of SATA drives, the Western Digital "RE" series and the
Seagate "ES" series immediately spring to mind.  Another very interesting
SATA product is the WD VelociRaptor, which blurs the line between 7.2k
SATA and 15k RPM SAS drives.

For the application the OP talked about, I doubt that he is going to
run out and purchase low-cost drives that are more at home in a
low-end desktop than in a ZFS-based filesystem.

Regards,

-- 
Al Hopper  Logical Approach Inc,Plano,TX [EMAIL PROTECTED]
   Voice: 972.379.2133 Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-02 Thread Tim
On Thu, Oct 2, 2008 at 10:56 AM, Bob Friesenhahn <
[EMAIL PROTECTED]> wrote:

>
>
> I doubt that anyone will successfully argue that SAS drives offer the
> best IOPS/$ value as long as space, power, and reliability factors may
> be ignored.  However, these sort of enterprise devices exist in order
> to allow enterprises to meet critical business demands which are
> otherwise not possible to be met.
>
> There are situations where space, power, and cooling are limited and
> the cost of the equipment is less of an issue since the space, power,
> and cooling cost more than the equipment.
>
> Bob
>

It's called USABLE IOPS/$.  You can throw 500 drives at a workload; if
you're attempting to access lots of small files in random ways, it won't
make a lick of difference.

--Tim


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-02 Thread Tim
On Thu, Oct 2, 2008 at 11:43 AM, Joerg Schilling <
[EMAIL PROTECTED]> wrote:

>
>
> You seem to misunderstand drive physics.
>
> With modern drives, seek time is not the dominating factor. It is the
> rotational latency that matters, and that is indeed proportional to
> 1/rotational speed.
>
> On the other hand, you misunderstood another important fact of drive
> physics:
>
> The sustained transfer speed of a drive is proportional to the linear
> data density on the medium.
>
> The third mistake you make is that you confuse the effects of the drive
> interface type with the effects of different drive geometry. The only
> coincidence here is that the drive geometry is typically updated more
> frequently for SATA drives than it is for SAS drives. This way, you
> benefit from the higher data density of a recent SATA drive and get a
> higher sustained data rate.
>
> BTW: I am not saying it makes no sense to buy SAS drives, but it makes
> sense to look at _all_ important parameters. Power consumption is a
> really important issue here, and the reduced MTBF from using more disks
> is another one.
>
> Jörg
>
> --
>

Please, give me a list of enterprises currently using SATA drives for their
database workloads, vmware workloads... hell any workload besides email
archiving, lightly used cifs shares, or streaming sequential transfers of
large files.  I'm glad you can sit there with a spec sheet and tell me how
you think things are going to work.  I can tell you from real-life
experience you're not even remotely correct in your assumptions.

--Tim


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-02 Thread Bob Friesenhahn
On Thu, 2 Oct 2008, Ahmed Kamal wrote:
>
> What is the "real/practical" possibility that I will face data loss during
> the next 5 years for example ? As storage experts please help me
> interpret whatever numbers you're going to throw, so is it a "really really
> small chance", or would you be worried about it ?

No one can tell you the answer to this.  Certainly math can be 
formulated which shows that the pyramids in Egypt will be completely 
flattened before you experience data loss.  Obviously that math has 
little value once it exceeds the lifetime of the computer by many 
orders of magnitude.

Raidz2 is about as good as it gets on paper.  Bad things can still 
happen which are unrelated to what raidz2 protects against or are a 
calamity which impacts all the hardware in the system.  If you care 
about your data, make sure that you have a working backup system.
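
For what it's worth, here is a minimal Python sketch of the kind of on-paper
math Bob is describing, using the standard Markov-style approximation for a
double-parity group; the MTBF and resilver-time inputs are assumptions, and
the absurdly large answer is exactly the point, since the model ignores the
correlated failures, firmware bugs, operator errors, and calamities that
actually lose data.

# Classic MTTDL approximation for a double-parity (raidz2) vdev:
# data loss needs three overlapping drive failures, so roughly
#   MTTDL ~= MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)
# All inputs are illustrative assumptions, not measurements.
mtbf_hours = 1_000_000     # vendor-style MTBF for one drive (assumed)
mttr_hours = 24            # assumed time to replace and resilver a drive
n_drives = 8               # 8-disk raidz2, as in the original question

mttdl_hours = mtbf_hours**3 / (n_drives * (n_drives - 1) * (n_drives - 2) * mttr_hours**2)
print(f"MTTDL ~ {mttdl_hours / 8760:.2e} years")   # a pyramids-outlasting number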

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-02 Thread Joerg Schilling
Bob Friesenhahn <[EMAIL PROTECTED]> wrote:

> On Thu, 2 Oct 2008, Joerg Schilling wrote:
> >
> > I am not going to accept blanket numbers... If you claim that a 15k drive
> > offers more than 2x the IOPS/s of a 7.2k drive you would need to show your
> > computation. SAS and SATA use the same cable and in case you buy server 
> > grade
> > SATA disks, you also get tagged command queuing.
>
> I sense some confusion here on your part related to basic physics. 
> Even with identical seek times, if the drive spins 2X as fast then it 
> is able to return the data in 1/2 the time already.  But the best SAS 
> drives have seek times 2-3X faster than SATA drives.  In order to 
> return data, the drive has to seek the head, and then wait for the 
> data to start to arrive, which is entirely dependent on rotation rate. 
> These are reasons why enterprise SAS drives offer far more IOPS than 
> SATA drives.

You seem to misunderstand drive physics.

With modern drives, seek time is not the dominating factor. It is the
rotational latency that matters, and that is indeed proportional to
1/rotational speed.

On the other hand, you misunderstood another important fact of drive
physics:

The sustained transfer speed of a drive is proportional to the linear
data density on the medium.

The third mistake you make is that you confuse the effects of the drive
interface type with the effects of different drive geometry. The only
coincidence here is that the drive geometry is typically updated more
frequently for SATA drives than it is for SAS drives. This way, you
benefit from the higher data density of a recent SATA drive and get a
higher sustained data rate.

BTW: I am not saying it makes no sense to buy SAS drives, but it makes
sense to look at _all_ important parameters. Power consumption is a
really important issue here, and the reduced MTBF from using more disks
is another one.

Jörg

-- 
 EMail:[EMAIL PROTECTED] (home) Jörg Schilling D-13353 Berlin
   [EMAIL PROTECTED](uni)  
   [EMAIL PROTECTED] (work) Blog: http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-02 Thread Ahmed Kamal
Thanks for the info. I am not really after big performance; I am already on
SATA and it's good enough for me. What I really can't afford is data loss.
The CAD designs our engineers are working on can sometimes be worth a lot.
But we're still a small company and would rather save money and buy SATA
drives if that is "safe".
I now understand that MTBF is next to useless (at least directly), and the
RAID optimizer tables don't account for how failure rates go up with age,
so they're not really accurate. My question now: suppose I use high-quality
nearline Barracuda 1 TB SATA 7200 rpm disks, configured as 8 disks in a
raidz2.

What is the "real/practical" possibility that I will face data loss during
the next 5 years, for example? As storage experts, please help me interpret
whatever numbers you're going to throw out: is it a "really, really small
chance", or would you be worried about it?

Thanks

On Thu, Oct 2, 2008 at 12:24 PM, Marc Bevand <[EMAIL PROTECTED]> wrote:

> Marc Bevand <...@gmail.com> writes:
> >
> > Well let's look at a concrete example:
> > - cheapest 15k SAS drive (73GB): $180 [1]
> > - cheapest 7.2k SATA drive (160GB): $40 [2] (not counting a 80GB at $37)
> > The SAS drive most likely offers 2x-3x the IOPS/$. Certainly not
> 180/40=4.5x
>
> Doh! I said the opposite of what I meant. Let me rephrase: "The SAS drive
> offers at most 2x-3x the IOPS (optimistic), but at 180/40=4.5x the price.
> Therefore the SATA drive has better IOPS/$."
>
> (Joerg: I am on your side of the debate !)
>
> -marc
>


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-02 Thread Bob Friesenhahn
On Thu, 2 Oct 2008, Marc Bevand wrote:
>
> Doh! I said the opposite of what I meant. Let me rephrase: "The SAS drive
> offers at most 2x-3x the IOPS (optimistic), but at 180/40=4.5x the price.
> Therefore the SATA drive has better IOPS/$."

I doubt that anyone will successfully argue that SAS drives offer the 
best IOPS/$ value as long as space, power, and reliability factors may 
be ignored.  However, these sorts of enterprise devices exist to allow 
enterprises to meet critical business demands that otherwise could not 
be met.

There are situations where space, power, and cooling are limited and 
the cost of the equipment is less of an issue since the space, power, 
and cooling cost more than the equipment.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-02 Thread Bob Friesenhahn
On Thu, 2 Oct 2008, Joerg Schilling wrote:
>
> I am not going to accept blanket numbers... If you claim that a 15k drive
> offers more than 2x the IOPS/s of a 7.2k drive you would need to show your
> computation. SAS and SATA use the same cable and in case you buy server grade
> SATA disks, you also get tagged command queuing.

I sense some confusion here on your part related to basic physics. 
Even with identical seek times, if the drive spins 2X as fast then its 
rotational latency is already halved.  But the best SAS drives also 
have seek times 2-3X shorter than SATA drives.  To return data, the 
drive has to seek the head and then wait for the data to start 
arriving, and that wait depends entirely on the rotation rate. 
These are the reasons why enterprise SAS drives offer far more IOPS 
than SATA drives.
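
A rough Python sketch of that arithmetic: single-drive random IOPS is roughly
the reciprocal of average seek time plus average rotational latency (half a
revolution). The seek times below are illustrative assumptions, not measured
figures for any particular product.

def random_iops(seek_ms: float, rpm: int) -> float:
    # Average rotational latency is half a revolution: 30000 / rpm milliseconds.
    latency_ms = 30_000 / rpm
    return 1_000 / (seek_ms + latency_ms)

# Assumed, typical-looking numbers, for illustration only.
print(f"7.2k SATA (~8.5 ms seek): {random_iops(8.5, 7200):.0f} IOPS")
print(f"15k SAS   (~3.5 ms seek): {random_iops(3.5, 15000):.0f} IOPS")

With those assumptions the ratio comes out a bit over 2x, which is consistent
with the 2x-3x figures being thrown around in this thread.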

> For 10 TB you need ~ 165 x 73 GB SAS drives:
>     ~ 22000 Euro
>     2475 Watt standby
>     ~ 4 GBytes/s sustained read, if you are lucky and have a good controller
>     Max IOPS -> 34000 (not realistic)

I am not sure where you get the idea that 34000 is not realistic.  It 
is perfectly in line with the performance I have measured with my own 
12 drive array.

As I mentioned yesterday, proper design isolates storage which 
requires maximum IOPS from storage which requires maximum storage 
capacity or lowest cost.  Therefore, it is entirely pointless to 
discuss how many 73GB SAS drives are required to achieve 10TB of 
storage since this thinking almost always represents poor design.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-02 Thread Marc Bevand
Marc Bevand <...@gmail.com> writes:
> 
> Well let's look at a concrete example:
> - cheapest 15k SAS drive (73GB): $180 [1]
> - cheapest 7.2k SATA drive (160GB): $40 [2] (not counting a 80GB at $37)
> The SAS drive most likely offers 2x-3x the IOPS/$. Certainly not 180/40=4.5x

Doh! I said the opposite of what I meant. Let me rephrase: "The SAS drive 
offers at most 2x-3x the IOPS (optimistic), but at 180/40=4.5x the price. 
Therefore the SATA drive has better IOPS/$."

(Joerg: I am on your side of the debate !)

-marc



Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-02 Thread Joerg Schilling
Marc Bevand <[EMAIL PROTECTED]> wrote:

> > Be very careful about that. 73GB SAS drives aren't that expensive, so 
> > you can get 6 x 73GB 15k SAS drives for the same amount as 11 x 250GB 
> > SATA drives (per Sun list pricing for J4200 drives).  SATA doesn't 
> > always win the IOPS/$.   Remember, a SAS drive can provide more than 2x 
> > the number of IOPs a SATA drive can.
>
> Well let's look at a concrete example:
> - cheapest 15k SAS drive (73GB): $180 [1]
> - cheapest 7.2k SATA drive (160GB): $40 [2] (not counting a 80GB at $37)
> The SAS drive most likely offers 2x-3x the IOPS/$. Certainly not 180/40=4.5x

I am not going to accept blanket numbers... If you claim that a 15k drive 
offers more than 2x the IOPS of a 7.2k drive, you need to show your 
computation. SAS and SATA use the same cable, and if you buy server-grade 
SATA disks you also get tagged command queuing.

Let's use a concrete example:

For 10 TB you need ~ 12 x 1 TB SATA drives (server grade):
    ~ 2160 Euro
    96 Watt standby
    ~ 1 GByte/s sustained read
    Max IOPS -> 1200

For 10 TB you need ~ 165 x 73 GB SAS drives:
    ~ 22000 Euro
    2475 Watt standby
    ~ 4 GBytes/s sustained read, if you are lucky and have a good controller
    Max IOPS -> 34000 (not realistic)

More important in this case: the MTBF of the SATA-based system is much higher
than the MTBF of the SAS-based system.

In theory, you really do get the 2x-3x IOPS, but do you want to pay that many
dollars, and do you want to pay for the space and the energy?
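
To make that trade-off explicit, here is a small Python sketch that just turns
the two configurations above into per-Euro and per-Watt figures; it reuses
Joerg's own estimates (including the 34000 IOPS number he flags as not
realistic), so treat it as illustrative arithmetic rather than a benchmark.

# Joerg's two 10 TB configurations, using the figures quoted above.
configs = {
    "12 x 1 TB SATA":  dict(cost_eur=2160,  watts=96,   max_iops=1200,  mbs=1000),
    "165 x 73 GB SAS": dict(cost_eur=22000, watts=2475, max_iops=34000, mbs=4000),
}

for name, c in configs.items():
    print(f"{name:16s}"
          f"  IOPS/Euro: {c['max_iops'] / c['cost_eur']:.2f}"
          f"  IOPS/Watt: {c['max_iops'] / c['watts']:.1f}"
          f"  MB/s per Euro: {c['mbs'] / c['cost_eur']:.2f}")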

Jörg

-- 
 EMail:[EMAIL PROTECTED] (home) Jörg Schilling D-13353 Berlin
   [EMAIL PROTECTED](uni)  
   [EMAIL PROTECTED] (work) Blog: http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-02 Thread Marc Bevand
Erik Trimble <...@Sun.COM> writes:
> Marc Bevand wrote:
> > 7500rpm (SATA) drives clearly provide the best TB/$, throughput/$, IOPS/$. 
> > You can't argue against that. To paraphrase what was said earlier in this 
> > thread, to get the best IOPS out of $1000, spend your money on 10 7500rpm 
> > (SATA) drives instead of 3 or 4 15000rpm (SAS) drives. Similarly, for the
> > best IOPS/RU, 15000rpm drives have the advantage. Etc.
>
> Be very careful about that. 73GB SAS drives aren't that expensive, so 
> you can get 6 x 73GB 15k SAS drives for the same amount as 11 x 250GB 
> SATA drives (per Sun list pricing for J4200 drives).  SATA doesn't 
> always win the IOPS/$.   Remember, a SAS drive can provide more than 2x 
> the number of IOPs a SATA drive can.

Well let's look at a concrete example:
- cheapest 15k SAS drive (73GB): $180 [1]
- cheapest 7.2k SATA drive (160GB): $40 [2] (not counting a 80GB at $37)
The SAS drive most likely offers 2x-3x the IOPS/$. Certainly not 180/40=4.5x

[1] http://www.newegg.com/Product/Product.aspx?Item=N82E16822116057
[2] http://www.newegg.com/Product/Product.aspx?Item=N82E16822136075

-marc



Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Erik Trimble
Marc Bevand wrote:
Tim <...@tcsac.net> writes:
>   
>> That's because the faster SATA drives cost just as much money as
>> their SAS counterparts for less performance and none of the
>> advantages SAS brings such as dual ports.
>> 
>
> SAS drives are far from always being the best choice, because absolute IOPS
> or throughput numbers do not matter. What only matters in the end is (TB,
> throughput, or IOPS) per (dollar, Watt, or Rack Unit).
>
> 7500rpm (SATA) drives clearly provide the best TB/$, throughput/$, and
> IOPS/$. You can't argue against that. To paraphrase what was said earlier in
> this thread, to get the best IOPS out of $1000, spend your money on 10
> 7500rpm (SATA) drives instead of 3 or 4 15000rpm (SAS) drives. Similarly,
> for the best IOPS/RU, 15000rpm drives have the advantage. Etc.
>
> -marc
>   
Be very careful about that. 73GB SAS drives aren't that expensive, so 
you can get 6 x 73GB 15k SAS drives for the same amount as 11 x 250GB 
SATA drives (per Sun list pricing for J4200 drives).  SATA doesn't 
always win the IOPS/$.  Remember, a SAS drive can provide more than 2x 
the number of IOPS a SATA drive can.  Likewise, throughput on a 15k drive 
can be roughly 2x that of a 7.2k drive, depending on I/O load.


-- 
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)



Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Marc Bevand
Tim <...@tcsac.net> writes:
> 
> That's because the faster SATA drives cost just as much money as
> their SAS counterparts for less performance and none of the
> advantages SAS brings such as dual ports.

SAS drives are far from always being the best choice, because absolute IOPS or 
throughput numbers do not matter. What only matters in the end is (TB, 
throughput, or IOPS) per (dollar, Watt, or Rack Unit).

7500rpm (SATA) drives clearly provide the best TB/$, throughput/$, and IOPS/$. 
You can't argue against that. To paraphrase what was said earlier in this 
thread, to get the best IOPS out of $1000, spend your money on 10 7500rpm 
(SATA) drives instead of 3 or 4 15000rpm (SAS) drives. Similarly, for the best 
IOPS/RU, 15000rpm drives have the advantage. Etc.

-marc



Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Bill Sommerfeld
On Wed, 2008-10-01 at 11:54 -0600, Robert Thurlow wrote:
> > like they are not good enough though, because unless this broken
> > router that Robert and Darren saw was doing NAT, yeah, it should not
> > have touch the TCP/UDP checksum.

NAT was not involved.

> I believe we proved that the problem bit flips were such
> that the TCP checksum was the same, so the original checksum
> still appeared correct.

That's correct.   

The pattern we found in corrupted data was that there would be two
offsetting bit-flips.  

A 0->1 flip was followed 256, 512, or 1024 bytes later by a 1->0 flip,
or vice versa.  (It was always the same bit; in the cases I analyzed,
the corrupted files contained C source code and the bit-flips were
obvious.)  Under the 16-bit one's-complement checksum used by TCP, these
two changes cancel each other out and the resulting packet has the same
checksum.
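
For the curious, a minimal Python sketch of why such paired flips are
invisible to TCP: under the RFC 1071 one's-complement sum, flipping the same
bit 0->1 in one 16-bit word and 1->0 in another word any even byte offset
away (512 bytes here) leaves the sum, and therefore the checksum, unchanged.
The payload is a made-up illustration.

def internet_checksum(data: bytes) -> int:
    # RFC 1071 16-bit one's-complement checksum, as used by TCP/UDP.
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # end-around carry
    return ~total & 0xFFFF

good = bytearray(1024)
good[0] = 0x10                  # some bit set in the original data

bad = bytearray(good)
bad[0] ^= 0x10                  # 1 -> 0 flip here ...
bad[512] ^= 0x10                # ... 0 -> 1 flip of the same bit 512 bytes later

assert internet_checksum(bytes(good)) == internet_checksum(bytes(bad))
print("offsetting flips, identical checksum:", hex(internet_checksum(bytes(bad))))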

> > BTW which router was it, or you
> > can't say because you're in the US? :)
> 
> I can't remember; it was aging at the time.

to use excruciatingly precise terminology, I believe the switch in
question is marketed as a combo L2 bridge/L3 router but in this case may
have been acting as a bridge rather than a router. 

After we noticed the data corruption we looked at TCP counters on hosts
on that subnet and noticed a high rate of failed checksums, so clearly
the TCP checksum was catching *most* of the corrupted packets; we just
didn't look at the counters until after we saw data corruption.

- Bill


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Bob Friesenhahn
On Wed, 1 Oct 2008, Joerg Schilling wrote:
>
> Did you recently look at spec files from drive manufacturers?

Yes.

> If you look at drives in the same category, the difference between a SATA and a
> SAS disk is only the firmware and the way the drive mechanism has been selected.

The problem is that these drives (SAS / SATA) are generally not in the 
same category so your comparison does not make sense.  There is very 
little overlap between the "exotic sports car" class and the "family 
mini van" class.  In some very few cases we see some transition 
vehicles such as station wagons in a sport form factor.

Most drive vendors try to make sure that the drives are in truly 
distinct classes in order to preserve the profit margins on the more 
expensive drives.  In some cases we see SAS interfaces fitted to 
drives which are fundamentally SATA-class drives but such products are 
rare.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Nicolas Williams
On Wed, Oct 01, 2008 at 11:54:55AM -0600, Robert Thurlow wrote:
> Miles Nordin wrote:
> 
> > sounds
> > like they are not good enough though, because unless this broken
> > router that Robert and Darren saw was doing NAT, yeah, it should not
> > have touch the TCP/UDP checksum.
> 
> I believe we proved that the problem bit flips were such
> that the TCP checksum was the same, so the original checksum
> still appeared correct.

The bit flips came in pairs, IIRC.  I forget the details, but it's
probably buried somewhere in my (and many others') e-mail.

> > BTW which router was it, or you
> > can't say because you're in the US? :)
> 
> I can't remember; it was aging at the time.

I can't remember either -- it was a few years ago.


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Tim
On Wed, Oct 1, 2008 at 12:51 PM, Joerg Schilling <
[EMAIL PROTECTED]> wrote:

>
> Did you recently look at spec files from drive manufacturers?
>
> If you look at drives in the same category, the difference between a SATA
> and a
> SAS disk is only the firmware and the way the drive mechanism has been
> selected.
> Another difference is that SAS drives may have two SAS interfaces instead
> of the
> single SATA interface found in the SATA drives.
>
> IOPS/s depend on seek times, latency times and probably on disk cache size.
>
> If you have a drive with 1 ms seek time, the seek time is not really
> important.
> What's important is the latency time which is 4ms for a 7200 rpm drive and
> only
> 2 ms for 15000 rpm drive.
>
> People who talk about SAS usually forget that they try to compare 15000 rpm
> SAS drives with 7200 rpm SATA drives. There are faster SATA drives but
> these
> drives consume more power.
>

That's because the faster SATA drives cost just as much money as their SAS
counterparts, with less performance and none of the advantages SAS brings such
as dual ports.  Not to mention that none of them can be dual-sourced, making
them a non-starter in the enterprise.

--Tim


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Nicolas Williams
On Wed, Oct 01, 2008 at 12:22:56PM -0500, Tim wrote:
> > - This will mainly be used for NFS sharing. Everyone is saying it will have
> > "bad" performance. My question is, how "bad" is bad ? Is it worse than a
> > plain Linux server sharing NFS over 4 sata disks, using a crappy 3ware raid
> > card with caching disabled ? coz that's what I currently have. Is it say
> > worse that a Linux box sharing over soft raid ?
> 
> Whoever is saying that is being dishonest.  NFS is plenty fast for most
> workloads.  There are very, VERY few workloads in the enterprise that are
> I/O bound, they are almost all IOPS bound.

NFS is bad for workloads that involve lots of operations that NFS
requires to be synchronous and which the application doesn't
parallelize.  Things like open(2) and close(2), for example, which means
applications like tar(1).

The solution is to get a fast slog device.  (Or to use an NFS server
that violates the synchrony requirement.)
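
A back-of-the-envelope Python sketch of why those synchronous operations
dominate: each file's metadata operations must reach stable storage before
the client can proceed, so the file rate is bounded by commit latency, not
bandwidth. The per-file operation count and the latencies below are
assumptions for illustration only.

def files_per_second(sync_ops_per_file: int, commit_latency_ms: float) -> float:
    # Each file waits for its synchronous ops to commit, one after another.
    return 1000.0 / (sync_ops_per_file * commit_latency_ms)

ops = 3   # e.g. create + write commit + close; an assumed per-file count
print(f"ZIL on 7.2k disks (~8 ms/commit):    {files_per_second(ops, 8.0):.0f} files/s")
print(f"fast separate slog (~0.3 ms/commit): {files_per_second(ops, 0.3):.0f} files/s")

That gap of well over an order of magnitude is the "bad" NFS performance
people complain about with tar-style workloads, and it is what a fast slog
addresses.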

Nico


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Robert Thurlow
Miles Nordin wrote:

> sounds
> like they are not good enough though, because unless this broken
> router that Robert and Darren saw was doing NAT, yeah, it should not
> have touch the TCP/UDP checksum.

I believe we proved that the problem bit flips were such
that the TCP checksum was the same, so the original checksum
still appeared correct.

 > BTW which router was it, or you
> can't say because you're in the US? :)

I can't remember; it was aging at the time.

Rob T


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Joerg Schilling
Bob Friesenhahn <[EMAIL PROTECTED]> wrote:

> On Wed, 1 Oct 2008, Joerg Schilling wrote:
> >
> > SATA and SAS disks usually base on the same drive mechanism. The seek times
> > are most likely identical.
>
> This must be some sort of "urban legend".  While the media composition 
> and drive chassis is similar, the rest of the product clearly differs. 
> The seek times for typical SAS drives are clearly much better, and the 
> typical drive rotates much faster.

Did you recently look at spec files from drive manufacturers?

If you look at drives in the same category, the difference between a SATA and a 
SAS disk is only the firmware and the way the drive mechanism has been selected.
Another difference is that SAS drives may have two SAS interfaces instead of the
single SATA interface found in the SATA drives.

IOPS depends on seek time, rotational latency, and probably on disk cache size.

If you have a drive with a 1 ms seek time, the seek time is not really important.
What's important is the rotational latency, which is 4 ms for a 7200 rpm drive and
only 2 ms for a 15000 rpm drive.

People who talk about SAS usually forget that they try to compare 15000 rpm 
SAS drives with 7200 rpm SATA drives. There are faster SATA drives but these 
drives consume more power.

Jörg

-- 
 EMail:[EMAIL PROTECTED] (home) Jörg Schilling D-13353 Berlin
   [EMAIL PROTECTED](uni)  
   [EMAIL PROTECTED] (work) Blog: http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Richard Elling
Ahmed Kamal wrote:
> Thanks for all the opinions everyone, my current impression is:
>
> - I do need as much RAM as I can afford (16GB look good enough for me)
> - SAS disks offers better iops & better MTBF than SATA. But Sata 
> offers enough performance for me (to saturate a gig link), and its 
> MTBF is around 100 years, which is I guess good enough for me too. If 
> I wrap 5 or 6 SATA disks in a raidz2 that should give me "enough" 
> protection and performance. It seems I will go with sata then for now. 
> I hope for all practical purposes the raidz2 array of say 6 sata 
> drives are "very well protected" for say the next 10 years! (If not 
> please tell me)

OK, so what the specs don't tell you is how MTBF changes over time.
It is very common to see an MTBF quoted, but you will almost never
see it described as a function of age.  Rather, you will see something in
the specs about expected service lifetime, and how the environment can
decrease the service lifetime (read: decrease the MTBF over time more
rapidly).  I've not seen a consumer grade disk spec with 10 years of
expected service life -- some are 5 years.  In other words, as time goes
by, you should plan to replace them.  A more lengthy discussion of
this, and why we measure field reliability in other ways, see:
http://blogs.sun.com/relling/entry/using_mtbf_and_time_dependent
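
As a small, hedged illustration of the arithmetic the quoted MTBF implies,
here is the usual constant-failure-rate conversion to an annual failure
probability (Python); it is exactly the simplification Richard describes,
because it assumes the failure rate never rises as the drives age. The MTBF
value is an assumption.

import math

mtbf_hours = 750_000        # assumed vendor MTBF for a nearline SATA drive
hours_per_year = 8760

# Constant-rate (exponential) model: AFR = 1 - exp(-t / MTBF).
afr = 1 - math.exp(-hours_per_year / mtbf_hours)
print(f"per-drive annual failure probability: {afr:.3%}")

# Chance that at least one of 8 drives fails within 5 years, same model.
p_any = 1 - (1 - afr) ** (8 * 5)
print(f"P(any of 8 drives fails within 5 years): {p_any:.1%}")

Note that this only estimates how often drives will need replacing; with
raidz2, data loss requires three overlapping failures, and that is where the
aging, environment, and bad-batch effects the simple model ignores really
matter.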

> - This will mainly be used for NFS sharing. Everyone is saying it will 
> have "bad" performance. My question is, how "bad" is bad ? Is it worse 
> than a plain Linux server sharing NFS over 4 sata disks, using a 
> crappy 3ware raid card with caching disabled ? coz that's what I 
> currently have. Is it say worse that a Linux box sharing over soft raid ?
> - If I will be using 6 sata disks in raidz2, I understand to improve 
> performance I can add a 15k SAS drive as a Zil device, is this correct 
> ? Is the zil device per pool. Do I loose any flexibility by using it ? 
> Does it become a SPOF say ? Typically how much percentage improvement 
> should I expect to get from such a zil device ?

See the best practices guide:
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
 -- richard



Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Bob Friesenhahn
On Wed, 1 Oct 2008, Joerg Schilling wrote:

>> Ummm, no.  SATA and SAS seek times are not even in the same universe.  They
>> most definitely do not use the same mechanics inside.  Whoever told you that
>> rubbish is an outright liar.
>
> It is extremely unlikely that two drives from the same manufacturer and with 
> the
> same RPM differ in seek times if you compare a SAS variant with a SATA 
> variant.

I did find a manufacturer (Seagate) which does offer a SAS variant of 
what is normally a SATA drive.  Is this the specific product you are 
talking about?

The interface itself is perhaps not all that important, but drive 
vendors have traditionally arranged their product lines so that "SCSI"-based 
products are built on high-performance hardware with a focus on reliability, 
while "ATA"-based products are built on low- or medium-performance hardware 
with a focus on cost.  There is very little overlap between these 
distinct product lines.  It is rare to find similarity between the 
specification sheets, and quite rare to find similar rotation rates 
or seek times.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Miles Nordin
> "cd" == Casper Dik <[EMAIL PROTECTED]> writes:

cd> The whole packet lives in the memory of the switch/router and
cd> if that memory is broken the packet will be send damaged.

that's true, but by algorithmically modifying the checksum to match
your ttl decrementing and MAC address label-swapping rather than
recomputing it from scratch, it's possible for an L2 or even L3 switch
to avoid ``splitting the protection domain''.  It'll still send the
damaged packet, but with a wrong FCS, so it'll just get dropped by the
next input port and eventually retransmitted.  This is what 802.1d
suggests.

I suspect one reason the IP/UDP/TCP checksums were specified as simple
checksums rather than CRCs like the Ethernet L2 FCS is that it's
really easy and obvious how to algorithmically modify them.  Sounds
like they are not good enough though, because unless this broken
router that Robert and Darren saw was doing NAT, yeah, it should not
have touched the TCP/UDP checksum.  BTW which router was it, or you
can't say because you're in the US? :)

I would expect any cost-conscious router or switch manufacturer to use
the same Ethernet MAC ASIC's as desktops, so the checksums would
likely be computed right before transmission using the ``offload''
feature of the Ethernet chip, but of course we can't tell because
they're all proprietary.  Eventually I bet it will become commonplace
for Ethernet MAC's to do IPsec offload, so we'll have to remember the
``avoid splitting the protection domain'' idea when that starts
happening.




Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Joerg Schilling
Tim <[EMAIL PROTECTED]> wrote:

>
> Ummm, no.  SATA and SAS seek times are not even in the same universe.  They
> most definitely do not use the same mechanics inside.  Whoever told you that
> rubbish is an outright liar.

It is extremely unlikely that two drives from the same manufacturer and with the
same RPM differ in seek times if you compare a SAS variant with a SATA variant.

Jörg

-- 
 EMail:[EMAIL PROTECTED] (home) Jörg Schilling D-13353 Berlin
   [EMAIL PROTECTED](uni)  
   [EMAIL PROTECTED] (work) Blog: http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Bob Friesenhahn
On Wed, 1 Oct 2008, [EMAIL PROTECTED] wrote:
>
> To get the same storage between capacity with SAS drives and SATA drives,
> you'd probably have to put the SAS drives in a RAID-5/6/Z configuration to
> be more space efficient. However by doing this you'd be losing spindles,
> and therefore IOPS. With SATA drives, since you can buy more for the same
> budget, you could put them in a RAID-10 configuration. While the
> individual disk many be slower, you'd have more spindles in the zpool, so
> that should help with the IOPS.

I will agree with that except to point out that there are many 
applications which require performance but not a huge amount of 
storage.  For many critical applications, even 10s of gigabytes is a 
lot of storage.  Based on this, I would say that most applications 
where SAS is desirable are the ones which demand the most reliability 
and performance, whereas the applications where SATA is desirable are 
the ones which place a priority on bulk storage capacity.

If you are concerned about total storage capacity and you are also 
specifying SAS for performance/reliability for critical data then it 
is likely that there is something wrong with your plan for storage and 
how the data is distributed.

There is a reason why when you go to the store you see tack hammers, 
construction hammers, and sledge hammers.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Nicolas Williams
On Tue, Sep 30, 2008 at 09:54:04PM -0400, Miles Nordin wrote:
> ok, I get that S3 went down due to corruption, and that the network
> checksums I mentioned failed to prevent the corruption.  The missing
> piece is: belief that the corruption occurred on the network rather
> than somewhere else.
> 
> Their post-mortem sounds to me as though a bit flipped inside the
> memory of one server could be spread via this ``gossip'' protocol to
> infect the entire cluster.  The replication and spreadability of the
> data makes their cluster into a many-terabyte gamma ray detector.

A bit flipped inside an end of an end-to-end system will not be
detected by that system.  So the CPU, memory and memory bus of an end
have to be trusted and so require their own corruption detection
mechanisms (e.g., ECC memory).

In the S3 case it sounds like there's a lot of networking involved, and
that they weren't providing integrity protection for the gossip
protocol.  Given a two-bit-flip-that-passed-all-Ethernet-and-TCP-CRCs
event that we had within Sun a few years ago (much alluded to elsewhere
in this thread), and which happened in one faulty switch, I would
suspect the switch.  Also, years ago when 100Mbps Ethernet first came on
the market I saw lots of bad cat-5 wiring issues, where a wire would go
bad and start introducing errors just a few months into its useful life.
I don't trust the networking equipment -- I prefer end-to-end
protection.

Just because you have to trust that the ends behave correctly doesn't
mean that you should have to trust everything in the middle too.

Nico


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Tim
On Wed, Oct 1, 2008 at 11:53 AM, Ahmed Kamal <
[EMAIL PROTECTED]> wrote:

> Thanks for all the opinions everyone, my current impression is:
> - I do need as much RAM as I can afford (16GB look good enough for me)
>

Depends on both the workload, and the amount of storage behind it.  From
your descriptions though, I think you'll be ok.


> - SAS disks offers better iops & better MTBF than SATA. But Sata offers
> enough performance for me (to saturate a gig link), and its MTBF is around
> 100 years, which is I guess good enough for me too. If I wrap 5 or 6 SATA
> disks in a raidz2 that should give me "enough" protection and performance.
> It seems I will go with sata then for now. I hope for all practical purposes
> the raidz2 array of say 6 sata drives are "very well protected" for say the
> next 10 years! (If not please tell me)
>

***If you have a sequential workload.  It's not a blanket "SATA is fast
enough".



> - This will mainly be used for NFS sharing. Everyone is saying it will have
> "bad" performance. My question is, how "bad" is bad ? Is it worse than a
> plain Linux server sharing NFS over 4 sata disks, using a crappy 3ware raid
> card with caching disabled ? coz that's what I currently have. Is it say
> worse that a Linux box sharing over soft raid ?
>

Whoever is saying that is being dishonest.  NFS is plenty fast for most
workloads.  There are very, VERY few workloads in the enterprise that are
I/O bound, they are almost all IOPS bound.


> - If I will be using 6 sata disks in raidz2, I understand to improve
> performance I can add a 15k SAS drive as a Zil device, is this correct ? Is
> the zil device per pool. Do I loose any flexibility by using it ? Does it
> become a SPOF say ? Typically how much percentage improvement should I
> expect to get from such a zil device ?
>

ZILs come with their own fun.  Isn't there still the issue of losing the
entire pool if you lose the ZIL?  And you can't get it back without
extensive, ugly work?


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Miles Nordin
> "t" == Tim  <[EMAIL PROTECTED]> writes:

 t> So what would be that the application has to run on Solaris.
 t> And requires a LUN to function.

ITYM requires two LUNs, or else when your filesystem becomes corrupt
after a crash the sysadmin will get blamed for it.  Maybe you can
deduplicate the ZFS mirror LUNs on the storage back-end or something.




Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Ahmed Kamal
Thanks for all the opinions everyone, my current impression is:
- I do need as much RAM as I can afford (16GB looks good enough for me).
- SAS disks offer better IOPS and better MTBF than SATA. But SATA offers
enough performance for me (to saturate a gig link), and its MTBF is around
100 years, which I guess is good enough for me too. If I wrap 5 or 6 SATA
disks in a raidz2, that should give me "enough" protection and performance.
It seems I will go with SATA for now, then. I hope that for all practical
purposes the raidz2 array of, say, 6 SATA drives is "very well protected"
for the next 10 years! (If not, please tell me.)
- This will mainly be used for NFS sharing. Everyone is saying it will have
"bad" performance. My question is, how "bad" is bad? Is it worse than a
plain Linux server sharing NFS over 4 SATA disks, using a crappy 3ware RAID
card with caching disabled? Because that's what I currently have. Is it,
say, worse than a Linux box sharing over soft RAID?
- If I will be using 6 SATA disks in raidz2, I understand that to improve
performance I can add a 15k SAS drive as a ZIL device; is this correct? Is
the ZIL device per pool? Do I lose any flexibility by using it? Does it
become a SPOF, say? Typically, what percentage improvement should I expect
from such a ZIL device?

Thanks

On Wed, Oct 1, 2008 at 6:22 PM, Tim <[EMAIL PROTECTED]> wrote:

>
>
> On Wed, Oct 1, 2008 at 11:20 AM, <[EMAIL PROTECTED]> wrote:
>
>>
>>
>> >Ummm, no.  SATA and SAS seek times are not even in the same universe.  They
>> >most definitely do not use the same mechanics inside.  Whoever told you that
>> >rubbish is an outright liar.
>>
>>
>> Which particular disks are you guys talking about?
>>
>> I;m thinking you guys are talking about the same 3.5" w/ the same RPM,
>> right?  We're not comparing 10K/2.5 SAS drives agains 7.2K/3.5 SATA
>> devices, are we?
>>
>> Casper
>>
>>
> I'm talking about 10k and 15k SAS drives, which is what the OP was talking
> about from the get-go.  Apparently this is yet another case of subsequent
> posters completely ignoring the topic and taking us off on tangents that
> have nothing to do with the OP's problem.
>
> --Tm
>


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Bob Friesenhahn
On Wed, 1 Oct 2008, Joerg Schilling wrote:
>
> SATA and SAS disks usually base on the same drive mechanism. The seek times
> are most likely identical.

This must be some sort of "urban legend".  While the media composition 
and drive chassis are similar, the rest of the product clearly differs. 
The seek times for typical SAS drives are clearly much better, and the 
typical SAS drive rotates much faster.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread dmagda
On Wed, October 1, 2008 10:18, Joerg Schilling wrote:

> SATA and SAS disks usually base on the same drive mechanism. The seek
> times are most likely identical.
>
> Some SATA disks support tagged command queueing and others do not.
> I would asume that there is no speed difference between SATA with command
> queueing and SAS.

I guess the meaning in my e-mail wasn't clear: because SAS drives are
generally more expensive on a per unit basis, for a given budget, you can
buy fewer of them than SATA drives.

To get the same storage capacity with SAS drives as with SATA drives,
you'd probably have to put the SAS drives in a RAID-5/6/Z configuration to
be more space efficient. However, by doing this you'd be losing spindles,
and therefore IOPS. With SATA drives, since you can buy more for the same
budget, you could put them in a RAID-10 configuration. While the
individual disks may be slower, you'd have more spindles in the zpool, so
that should help with the IOPS.
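
A rough Python sketch of that spindle argument: fix a budget, see how many
drives of each kind it buys, and compare naive aggregate random-read IOPS.
The prices and per-drive IOPS are assumptions for illustration, and real
results depend on the RAID layout (mirrors vs. raidz) as described above.

budget = 2000.0   # assumed budget in dollars

# (price per drive, random IOPS per drive) -- both assumed for illustration.
drives = {"15k SAS": (300.0, 180.0), "7.2k SATA": (120.0, 80.0)}

for name, (price, iops) in drives.items():
    n = int(budget // price)
    # Naive spindle-count scaling; the actual number depends on the RAID layout.
    print(f"{name:10s}: {n:2d} spindles, ~{n * iops:5.0f} aggregate random-read IOPS")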




Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Tim
On Wed, Oct 1, 2008 at 11:20 AM, <[EMAIL PROTECTED]> wrote:

>
>
> >Ummm, no.  SATA and SAS seek times are not even in the same universe.  They
> >most definitely do not use the same mechanics inside.  Whoever told you that
> >rubbish is an outright liar.
>
>
> Which particular disks are you guys talking about?
>
> I;m thinking you guys are talking about the same 3.5" w/ the same RPM,
> right?  We're not comparing 10K/2.5 SAS drives agains 7.2K/3.5 SATA
> devices, are we?
>
> Casper
>
>
I'm talking about 10k and 15k SAS drives, which is what the OP was talking
about from the get-go.  Apparently this is yet another case of subsequent
posters completely ignoring the topic and taking us off on tangents that
have nothing to do with the OP's problem.

--Tim


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Casper . Dik


>Ummm, no.  SATA and SAS seek times are not even in the same universe.  They
>most definitely do not use the same mechanics inside.  Whoever told you that
>rubbish is an outright liar.


Which particular disks are you guys talking about?

I'm thinking you guys are talking about the same 3.5" drives with the same RPM,
right?  We're not comparing 10K/2.5" SAS drives against 7.2K/3.5" SATA
devices, are we?

Casper



Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Tim
On Wed, Oct 1, 2008 at 10:28 AM, Bob Friesenhahn <
[EMAIL PROTECTED]> wrote:

> On Wed, 1 Oct 2008, Tim wrote:
>
>>
>>> I think you'd be surprised how large an organisation can migrate most,
>>> if not all of their application servers to zones one or two Thumpers.
>>>
>>> Isn't that the reason for buying in "server appliances"?
>>>
>>>  I think you'd be surprised how quickly they'd be fired for putting that
>> much
>> risk into their enterprise.
>>
>
> There is the old saying that "No one gets fired for buying IBM".  If one
> buys an IBM system which runs 30 isolated instances of Linux, all of which
> are used for mission critical applications, is this a similar risk to
> consolidating storage on a Thumper since we are really talking about just
> one big system?
>
> In what way is consolidating on Sun/Thumper more or less risky to an
> enterprise than consolidating on a big IBM server with many subordinate OS
> instances?
>
> Bob
> ==
> Bob Friesenhahn
> [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
>
>
Are you honestly trying to compare a Thumper's reliability to an IBM
mainframe?  Please tell me that's a joke...  We can start at redundant,
hot-swappable components and go from there.  The Thumper can't even hold a
candle to Sun's own older SPARC platforms.  It's not even in the same game
as the IBM mainframes.

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Tim
On Wed, Oct 1, 2008 at 9:18 AM, Joerg Schilling <
[EMAIL PROTECTED]> wrote:

> David Magda <[EMAIL PROTECTED]> wrote:
>
> > On Sep 30, 2008, at 19:09, Tim wrote:
> >
> > > SAS has far greater performance, and if your workload is extremely
> > > random,
> > > will have a longer MTBF.  SATA drives suffer badly on random
> > > workloads.
> >
> > Well, if you can probably afford more SATA drives for the purchase
> > price, you can put them in a striped-mirror set up, and that may help
> > things. If your disks are cheap you can afford to buy more of them
> > (space, heat, and power not withstanding).
>
> SATA and SAS disks are usually based on the same drive mechanism. The seek times
> are most likely identical.
>
> Some SATA disks support tagged command queueing and others do not.
> I would assume that there is no speed difference between SATA with command
> queueing and SAS.
> Jörg
>
>

Ummm, no.  SATA and SAS seek times are not even in the same universe.  They
most definitely do not use the same mechanics inside.  Whoever told you that
rubbish is an outright liar.

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Bob Friesenhahn
On Wed, 1 Oct 2008, Tim wrote:
>>
>> I think you'd be surprised how large an organisation can migrate most,
>> if not all of their application servers to zones on one or two Thumpers.
>>
>> Isn't that the reason for buying in "server appliances"?
>>
> I think you'd be surprised how quickly they'd be fired for putting that much
> risk into their enterprise.

There is the old saying that "No one gets fired for buying IBM".  If 
one buys an IBM system which runs 30 isolated instances of Linux, all 
of which are used for mission critical applications, is this a similar 
risk to consolidating storage on a Thumper since we are really talking 
about just one big system?

In what way is consolidating on Sun/Thumper more or less risky to an 
enterprise than consolidating on a big IBM server with many 
subordinate OS instances?

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Matthew Sweeney

On 10/01/08 10:46, Al Hopper wrote:

On Wed, Oct 1, 2008 at 9:34 AM, Moore, Joe <[EMAIL PROTECTED]> wrote:
  

Ian Collins wrote:


I think you'd be surprised how large an organisation can migrate most,
if not all of their application servers to zones on one or two Thumpers.

Isn't that the reason for buying in "server appliances"?

  

Assuming that the application servers can coexist in the "only" 16GB available on a 
thumper, and the "only" 8GHz of CPU core speed, and the fact that the System controller 
is a massive single point of failure for both the applications and the storage.

You may have a difference of opinion as to what a large organization is, but 
the reality is that the thumper series is good for some things in a large 
enterprise, and not good for some things.




Agreed.  My biggest issue with the Thumper is that all the disks are
7,200RPM SATA and have limited IOPS.   I'd like to see the Thumper
configurations offered allowing a user chosen mixture
of SAS and SATA drives with 7,200 and 15K RPM spindle speeds.   And
yes - I agree - you need as much RAM in the box as you can afford; ZFS
loves lots and lots of RAM and your users will love the performance
that large memory ZFS boxes provide.

Didn't they just offer a thumper with more RAM recently???

  
The x4540  has twice the DIMM slots and # of  cores.  It also uses an 
LSI disk controller.  Still 48 sata disks @ 7200 rpm. 

You can "build a thumper" using any rack mount server you like and the 
J4200/J4400 JBOD arrays.  Then you can mix and match drives types (SATA 
and SAS).  The server portion could have as many as 16/32 cores and 
32/64 DIMM slots  (the x4450/X4640).  You'll use up a little more rack 
space but the drives will be serviceable without shutting down the system.


I think Thumper/Thor fills a specific role (maximum disk density in a 
minimum chassis).  I'd doubt that it will change much.



--

Matt Sweeney
Systems Engineer
Sun Microsystems
585-368-5930/x29097 desk
585-727-0573cell


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Moore, Joe
Darren J Moffat wrote:
> Moore, Joe wrote:
> > Given the fact that NFS, as implemented in his client
> systems, provides no end-to-end reliability, the only data
> protection that ZFS has any control over is after the write()
> is issued by the NFS server process.
>
> NFS can provide on-the-wire protection if you enable Kerberos support
> (there are usually 3 options for Kerberos: krb5 (or sometimes called
> krb5a) which is Auth only, krb5i which is Auth plus integrity provided
> by the RPCSEC_GSS layer, and krb5p which is Auth+Integrity+Encrypted data).
>
> I have personally seen krb5i NFS mounts catch problems when
> there was a
> router causing failures that the TCP checksum doesn't catch.

No doubt, additional layers of data protection are available.  I don't know the 
state of RPCSEC on Linux, so I can't comment on this; certainly your experience 
brings valuable insight into this discussion.

It is also recommended (when iSCSI is an appropriate transport) to run over 
IPSEC in ESP mode to also ensure data-packet-content consistency.  Certainly 
NFS over IPSEC/ESP would be more resistant to on-the-wire corruption.

Either of these would give better data reliability than pure NFS, just like ZFS 
on the backend gives better data reliability than for example, UFS or EXT3.

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Al Hopper
On Wed, Oct 1, 2008 at 9:34 AM, Moore, Joe <[EMAIL PROTECTED]> wrote:
>
>
> Ian Collins wrote:
>> I think you'd be surprised how large an organisation can migrate most,
>> if not all of their application servers to zones on one or two Thumpers.
>>
>> Isn't that the reason for buying in "server appliances"?
>>
>
> Assuming that the application servers can coexist in the "only" 16GB 
> available on a thumper, and the "only" 8GHz of CPU core speed, and the fact 
> that the System controller is a massive single point of failure for both the 
> applications and the storage.
>
> You may have a difference of opinion as to what a large organization is, but 
> the reality is that the thumper series is good for some things in a large 
> enterprise, and not good for some things.
>

Agreed.  My biggest issue with the Thumper is that all the disks are
7,200RPM SATA and have limited IOPS.   I'd like to see the Thumper
configurations offered allowing a user chosen mixture
of SAS and SATA drives with 7,200 and 15K RPM spindle speeds.   And
yes - I agree - you need as much RAM in the box as you can afford; ZFS
loves lots and lots of RAM and your users will love the performance
that large memory ZFS boxes provide.

Didn't they just offer a thumper with more RAM recently???

-- 
Al Hopper  Logical Approach Inc,Plano,TX [EMAIL PROTECTED]
   Voice: 972.379.2133 Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Darren J Moffat
Moore, Joe wrote:
> Toby Thain Wrote:
>> ZFS allows the architectural option of separate storage without losing end 
>> to end protection, so the distinction is still important. Of course this 
>> means ZFS itself runs on the application server, but so what?
> 
> The OP in question is not running his network clients on Solaris or 
> OpenSolaris or FreeBSD or MacOSX, but rather a collection of Linux 
> workstations.  Unless there's been a recent port of ZFS to Linux, that makes 
> a big What.
> 
> Given the fact that NFS, as implemented in his client systems, provides no 
> end-to-end reliability, the only data protection that ZFS has any control 
> over is after the write() is issued by the NFS server process.

NFS can provide on-the-wire protection if you enable Kerberos support 
(there are usually 3 options for Kerberos: krb5 (or sometimes called 
krb5a) which is Auth only, krb5i which is Auth plus integrity provided 
by the RPCSEC_GSS layer, and krb5p which is Auth+Integrity+Encrypted data).

I have personally seen krb5i NFS mounts catch problems when there was a 
router causing failures that the TCP checksum doesn't catch.

-- 
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Al Hopper
On Wed, Oct 1, 2008 at 8:52 AM, Brian Hechinger <[EMAIL PROTECTED]> wrote:
> On Wed, Oct 01, 2008 at 01:03:28AM +0200, Ahmed Kamal wrote:
>>
>> Hmm ... well, there is a considerable price difference, so unless someone
>> says I'm horribly mistaken, I now want to go back to Barracuda ES 1TB 7200
>> drives. By the way, how many of those would saturate a single (non trunked)
>> Gig ethernet link ? Workload NFS sharing of software and homes. I think 4
>> disks should be about enough to saturate it ?
>
> You keep mentioning that you plan on using NFS, and everyone seems to keep
> ignoring the fact that in order to make NFS performance reasonable you're
> really going to want a couple very fast slog devices.  Since I don't have
> the correct amount of money to afford a very fast slog device, I can't
> speak to which one is the best price/performance ratio, but there are tons
> of options out there.

+1 for the slog devices - make them 15k RPM SAS

Also the OP has not stated how his Linux clients intend to use this
fileserver.  In particular, we need to understand how many IOPS (I/O
Ops/Sec) are required and whether the typical workload is sequential
(large or small file) or random, and the percentage of read to write
operations.

Often a mix of different ZFS configs is required to provide a
complete and flexible solution.  Here is a rough generalization:

- for large file sequential I/O with high reliability go raidz2 with 6
disks minimum and use SATA disks.
- for workloads with random I/O patterns where you need lots of IOPS -
use a ZFS multi-way mirror and 15k RPM SAS disks.   For example, a
3-way mirror will distribute the reads across 3 drives - so you'll see
3 * (single disk) IOPS for reads and 1 * (single disk) IOPS for writes
(see the sketch below).  Consider 4-or-more-way mirrors for heavy
(random) read workloads.
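
A minimal Python sketch of that read/write scaling for an N-way mirror (the
per-disk IOPS figure is an assumption, not a measurement):

# random reads can be spread across all N sides of a ZFS mirror,
# while every write must be committed to all of them
def mirror_iops(n_way, per_disk_iops):
    return {"read": n_way * per_disk_iops, "write": per_disk_iops}

print(mirror_iops(3, 180))   # 3-way mirror of ~180-IOPS 15k SAS disks
# -> {'read': 540, 'write': 180}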

Usually it makes sense to configure more than one ZFS pool config and
then use the zpool that is appropriate for each specific workload.
Also this config (diversity) future-proofs your fileserver - because
it's very difficult to predict how your usage patterns will change a
year down the road[1].

Also, bear in mind that, in the future, you may wish to replace disks
with SSDs (or add SSDs) to this fileserver - when the pricing is more
reasonable.  So only spend what you absolutely need to spend to meet
todays requirements.  You can always "push in"
newer/bigger/better/faster *devices* down the road and this will
provide you with a more flexible fileserver as your needs evolve.
This is a huge strength for ZFS.

Feel free to email me off list if you want more specific recommendations.

[1] on a 10 disk system we have:
a) a 5 disk RAIDZ pool
b) a 3-way mirror (pool)
c) a 2-way mirror (pool)
If I was to do it again, I'd make a) a 6-disk RAIDZ2 config to take
advantage of the higher reliability provided by this config.

Regards,

-- 
Al Hopper  Logical Approach Inc,Plano,TX [EMAIL PROTECTED]
   Voice: 972.379.2133 Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Moore, Joe


Ian Collins wrote:
> I think you'd be surprised how large an organisation can migrate most,
> if not all of their application servers to zones on one or two Thumpers.
>
> Isn't that the reason for buying in "server appliances"?
>

Assuming that the application servers can coexist in the "only" 16GB available 
on a thumper, and the "only" 8GHz of CPU core speed, and the fact that the 
System controller is a massive single point of failure for both the 
applications and the storage.

You may have a difference of opinion as to what a large organization is, but 
the reality is that the thumper series is good for some things in a large 
enterprise, and not good for some things.

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Moore, Joe

Toby Thain Wrote:
> ZFS allows the architectural option of separate storage without losing end to 
> end protection, so the distinction is still important. Of course this means 
> ZFS itself runs on the application server, but so what?

The OP in question is not running his network clients on Solaris or OpenSolaris 
or FreeBSD or MacOSX, but rather a collection of Linux workstations.  Unless 
there's been a recent port of ZFS to Linux, that makes a big What.

Given the fact that NFS, as implemented in his client systems, provides no 
end-to-end reliability, the only data protection that ZFS has any control over 
is after the write() is issued by the NFS server process.

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Joerg Schilling
David Magda <[EMAIL PROTECTED]> wrote:

> On Sep 30, 2008, at 19:09, Tim wrote:
>
> > SAS has far greater performance, and if your workload is extremely  
> > random,
> > will have a longer MTBF.  SATA drives suffer badly on random  
> > workloads.
>
> Well, if you can probably afford more SATA drives for the purchase  
> price, you can put them in a striped-mirror set up, and that may help  
> things. If your disks are cheap you can afford to buy more of them  
> (space, heat, and power not withstanding).

SATA and SAS disks are usually based on the same drive mechanism. The seek times
are most likely identical.

Some SATA disks support tagged command queueing and others do not.
I would assume that there is no speed difference between SATA with command 
queueing and SAS.
Jörg

-- 
 EMail:[EMAIL PROTECTED] (home) Jörg Schilling D-13353 Berlin
   [EMAIL PROTECTED](uni)  
   [EMAIL PROTECTED] (work) Blog: http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Brian Hechinger
On Wed, Oct 01, 2008 at 01:03:28AM +0200, Ahmed Kamal wrote:
> 
> Hmm ... well, there is a considerable price difference, so unless someone
> says I'm horribly mistaken, I now want to go back to Barracuda ES 1TB 7200
> drives. By the way, how many of those would saturate a single (non trunked)
> Gig ethernet link ? Workload NFS sharing of software and homes. I think 4
> disks should be about enough to saturate it ?

You keep mentioning that you plan on using NFS, and everyone seems to keep
ignoring the fact that in order to make NFS performance reasonable you're
really going to want a couple very fast slog devices.  Since I don't have
the correct amount of money to afford a very fast slog device, I can't
speak to which one is the best price/performance ratio, but there are tons
of options out there.

-brian
-- 
"Coding in C is like sending a 3 year old to do groceries. You gotta
tell them exactly what you want or you'll end up with a cupboard full of
pop tarts and pancake mix." -- IRC User (http://www.bash.org/?841435)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Joerg Schilling
Tim <[EMAIL PROTECTED]> wrote:

> > Hmm ... well, there is a considerable price difference, so unless someone
> > says I'm horribly mistaken, I now want to go back to Barracuda ES 1TB 7200
> > drives. By the way, how many of those would saturate a single (non trunked)
> > Gig ethernet link ? Workload NFS sharing of software and homes. I think 4
> > disks should be about enough to saturate it ?
> >
>
> SAS has far greater performance, and if your workload is extremely random,
> will have a longer MTBF.  SATA drives suffer badly on random workloads.

The SATA Barracuda ST310003 I recently bought has an MTBF of 136 years.
If you believe that you can meaningfully compare MTBF values in the range
of > 100 years, you may be doing something wrong.
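
To put figures like that in perspective: MTBF is a population failure-rate
statistic, not a service life.  A quick conversion in Python, assuming the
usual constant-failure-rate model:

hours_per_year = 24 * 365
mtbf_hours = 136 * hours_per_year     # the ~136-year (~1.2M hour) figure above
afr = hours_per_year / mtbf_hours     # annualized failure rate
print(f"{afr:.2%}")                   # ~0.74% of a large population fails per year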

Jörg

-- 
 EMail:[EMAIL PROTECTED] (home) Jörg Schilling D-13353 Berlin
   [EMAIL PROTECTED](uni)  
   [EMAIL PROTECTED] (work) Blog: http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Darren J Moffat
[EMAIL PROTECTED] wrote:
>> On Tue, 30 Sep 2008, Robert Thurlow wrote:
>>
 Modern NFS runs over a TCP connection, which includes its own data 
 validation.  This surely helps.
>>> Less than we'd sometimes like :-)  The TCP checksum isn't
>>> very strong, and we've seen corruption tied to a broken
>>> router, where the Ethernet checksum was recomputed on
>>> bad data, and the TCP checksum didn't help.  It sucked.
>> TCP does not see the router.  The TCP and ethernet checksums are at 
>> completely different levels.  Routers do not pass ethernet packets. 
>> They pass IP packets. Your statement does not make technical sense.
> 
> I think he was referring to a broken VLAN switch.
> 
> But even then, any active component will take bits from the
> wire, check the MAC, change what is needed, and redo the MAC and
> other checksums which need changes.  The whole packet lives
> in the memory of the switch/router and if that memory is broken
> the packet will be sent damaged.

Which is why you need a strong end-to-end network checksum for iSCSI.  I 
recommend that IPsec AH (at least, but in many cases ESP) be deployed. 
If you care enough about your data to set checksum=sha256 for the ZFS 
datasets, then make sure you care enough to set up IPsec and use 
HMAC-SHA256 for on-the-wire integrity protection too.
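
Not IPsec itself, but a tiny Python illustration of what a keyed HMAC-SHA256
integrity tag buys you on the wire (the payload and key here are obviously
made up): a single flipped bit makes verification fail, so the receiver can
discard the damaged packet instead of writing bad data.

import hmac, hashlib, os

key = os.urandom(32)                       # shared between the two endpoints
pdu = bytearray(b"iSCSI or NFS payload")   # hypothetical wire payload
tag = hmac.new(key, pdu, hashlib.sha256).digest()

pdu[3] ^= 0x01                             # one bit flipped in transit
ok = hmac.compare_digest(tag, hmac.new(key, pdu, hashlib.sha256).digest())
print(ok)                                  # False -> drop the packet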

-- 
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-10-01 Thread Casper . Dik

>On Tue, 30 Sep 2008, Robert Thurlow wrote:
>
>>> Modern NFS runs over a TCP connection, which includes its own data 
>>> validation.  This surely helps.
>>
>> Less than we'd sometimes like :-)  The TCP checksum isn't
>> very strong, and we've seen corruption tied to a broken
>> router, where the Ethernet checksum was recomputed on
>> bad data, and the TCP checksum didn't help.  It sucked.
>
>TCP does not see the router.  The TCP and ethernet checksums are at 
>completely different levels.  Routers do not pass ethernet packets. 
>They pass IP packets. Your statement does not make technical sense.

I think he was referring to a broken VLAN switch.

But even then, any active component will take bits from the
wire, check the MAC, change what is needed, and redo the MAC and
other checksums which need changes.  The whole packet lives
in the memory of the switch/router and if that memory is broken
the packet will be sent damaged.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Tim
On Tue, Sep 30, 2008 at 11:58 PM, Nicolas Williams <[EMAIL PROTECTED]
> wrote:

> On Tue, Sep 30, 2008 at 08:54:50PM -0500, Tim wrote:
> > As it does in ANY fileserver scenario, INCLUDING zfs.  He is building a
> > FILESERVER.  This is not an APPLICATION server.  You seem to be stuck on
> > this idea that everyone is using ZFS on the server they're running the
> > application.  That does a GREAT job of creating disparate storage
> islands,
> > something EVERY enterprise is trying to get rid of.  Not create more of.
>
> First off there's an issue of design.  Wherever possible end-to-end
> protection is better (and easier to implement and deploy) than
> hop-by-hop protection.
>
> Hop-by-hop protection implies a lot of trust.  Yes, in a NAS you're
> going to have at least one hop: from the client to the server.  But how
> does the necessity of one hop mean that N hops is fine?  One hop is
> manageable.  N hops is a disaster waiting to happen.


Who's talking about N hops?  WAFL gives you the exact same amount of hops as
ZFS.


>
>
> Second, NAS is not the only way to access remote storage.  There's also
> SAN (e.g., iSCSI).  So you might host a DB on a ZFS pool backed by iSCSI
> targets.  If you do that with a random iSCSI target implementation then
> you get end-to-end integrity protection regardless of what else the
> vendor does for you in terms of hop-by-hop integrity protection.  And
> you can even host the target on a ZFS pool, in which case there's two
> layers of integrity protection, and so some waste of disk space, but you
> get the benefit of very flexible volume management on both, the
> initiator and the target.
>

I don't recall saying it was.  The original poster is talking about a
FILESERVER, not iSCSI targets.  As off topic as it is, the current iSCSI
target is hardly fully baked or production ready.


>
> Third, who's to say that end-to-end integrity protection can't possibly
> be had in a NAS environment?  Sure, with today's protocols you can't
> have it -- you can get hop-by-hop protection with at least one hop (see
> above) -- but having end-to-end integrity protection built-in to the
> filesystem may enable new NAS protocols that do provide end-to-end
> protection.  (This is a variant of the first point above: good design
> decisions pay off.)
>
>
Which would apply to WAFL as well as ZFS.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Tim
On Wed, Oct 1, 2008 at 12:24 AM, Ian Collins <[EMAIL PROTECTED]> wrote:

> Tim wrote:
> >
> > As it does in ANY fileserver scenario, INCLUDING zfs.  He is building
> > a FILESERVER.  This is not an APPLICATION server.  You seem to be
> > stuck on this idea that everyone is using ZFS on the server they're
> > running the application.  That does a GREAT job of creating disparate
> > storage islands, something EVERY enterprise is trying to get rid of.
> > Not create more of.
>
> I think you'd be surprised how large an organisation can migrate most,
> if not all of their application servers to zones on one or two Thumpers.
>
> Isn't that the reason for buying in "server appliances"?
>
> Ian
>
>
I think you'd be surprised how quickly they'd be fired for putting that much
risk into their enterprise.

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Tim
On Tue, Sep 30, 2008 at 10:44 PM, Toby Thain <[EMAIL PROTECTED]>wrote:

>
>
> ZFS allows the architectural option of separate storage without losing end
> to end protection, so the distinction is still important. Of course this
> means ZFS itself runs on the application server, but so what?
>
> --Toby
>
>
The "so what" would be that the application has to run on Solaris, and requires a
LUN to function.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Ian Collins
Tim wrote:
>
> As it does in ANY fileserver scenario, INCLUDING zfs.  He is building
> a FILESERVER.  This is not an APPLICATION server.  You seem to be
> stuck on this idea that everyone is using ZFS on the server they're
> running the application.  That does a GREAT job of creating disparate
> storage islands, something EVERY enterprise is trying to get rid of. 
> Not create more of.

I think you'd be surprised how large an organisation can migrate most,
if not all of their application servers to zones on one or two Thumpers. 

Isn't that the reason for buying in "server appliances"?

Ian

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Nicolas Williams
On Tue, Sep 30, 2008 at 08:54:50PM -0500, Tim wrote:
> As it does in ANY fileserver scenario, INCLUDING zfs.  He is building a
> FILESERVER.  This is not an APPLICATION server.  You seem to be stuck on
> this idea that everyone is using ZFS on the server they're running the
> application.  That does a GREAT job of creating disparate storage islands,
> something EVERY enterprise is trying to get rid of.  Not create more of.

First off there's an issue of design.  Wherever possible end-to-end
protection is better (and easier to implement and deploy) than
hop-by-hop protection.

Hop-by-hop protection implies a lot of trust.  Yes, in a NAS you're
going to have at least one hop: from the client to the server.  But how
does the necessity of one hop mean that N hops is fine?  One hop is
manageable.  N hops is a disaster waiting to happen.

Second, NAS is not the only way to access remote storage.  There's also
SAN (e.g., iSCSI).  So you might host a DB on a ZFS pool backed by iSCSI
targets.  If you do that with a random iSCSI target implementation then
you get end-to-end integrity protection regardless of what else the
vendor does for you in terms of hop-by-hop integrity protection.  And
you can even host the target on a ZFS pool, in which case there's two
layers of integrity protection, and so some waste of disk space, but you
get the benefit of very flexible volume management on both, the
initiator and the target.

Third, who's to say that end-to-end integrity protection can't possibly
be had in a NAS environment?  Sure, with today's protocols you can't
have it -- you can get hop-by-hop protection with at least one hop (see
above) -- but having end-to-end integrity protection built-in to the
filesystem may enable new NAS protocols that do provide end-to-end
protection.  (This is a variant of the first point above: good design
decisions pay off.)

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Toby Thain


On 30-Sep-08, at 9:54 PM, Tim wrote:




On Tue, Sep 30, 2008 at 8:50 PM, Toby Thain  
<[EMAIL PROTECTED]> wrote:




NetApp's block-appended checksum approach appears similar but is  
in fact much stronger. Like many arrays, NetApp formats its drives  
with 520-byte sectors. It then groups them into 8-sector blocks:  
4K of data (the WAFL filesystem blocksize) and 64 bytes of  
checksum. When WAFL reads a block it compares the checksum to the  
data just like an array would, but there's a key difference: it  
does this comparison after the data has made it through the I/O  
path, so it validates that the block made the journey from platter  
to memory without damage in transit.




This is not end to end protection; they are merely saying the data  
arrived in the storage subsystem's memory verifiably intact. The  
data still has a long way to go before it reaches the application.


--Toby


As it does in ANY fileserver scenario, INCLUDING zfs.  He is  
building a FILESERVER.  This is not an APPLICATION server.  You  
seem to be stuck on this idea that everyone is using ZFS on the  
server they're running the application.


ZFS allows the architectural option of separate storage without  
losing end to end protection, so the distinction is still important.  
Of course this means ZFS itself runs on the application server, but  
so what?


--Toby

That does a GREAT job of creating disparate storage islands,  
something EVERY enterprise is trying to get rid of.  Not create  
more of.


--Tim



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Richard Elling
Ahmed Kamal wrote:
>
>
> So, performance aside, does SAS have other benefits ? Data
> integrity ? How would a 8 raid1 sata compare vs another 8 smaller
> SAS disks in raidz(2) ?
> Like apples and pomegranates.  Both should be able to saturate a
> GbE link.
>
>  
> You're the expert, but isn't the 100M/s for streaming not random 
> read/write. For that, I suppose the disk drops to around 25M/s which 
> is why I was mentioning 4 sata disks.
>
> When I was asking for comparing the 2 raids, It's was aside from 
> performance, basically sata is obviously cheaper, it will saturate the 
> gig link, so performance yes too, so the question becomes which has 
> better data protection ( 8 sata raid1 or 8 sas raidz2)

Good question.  Since you are talking about different disks, the
vendor specs are different.  The  500 GByte Seagate Barracuda
7200.11 I described above is rated with an MTBF of 750,000 hours,
even  though it comes in either a SATA or SAS interface -- but that isn't
so  interesting.  A 450 GByte Seagate Cheetah 15k.6 (SAS) has a rated
MTBF of 1.6M hours.  Putting that into RAIDoptimizer we see:

Disk        RAID    MTTDL[1] (yrs)     MTTDL[2] (yrs)

Barracuda   1+0            284,966              5,351
            z2         180,663,117          6,784,904
Cheetah     1+0          1,316,385            126,839
            z2       1,807,134,968        348,249,968

For ZFS, 50% space used, logistical MTTR=24 hours, mirror
resync time = 60 GBytes/hr

In general, (2-way) mirrors are single parity, raidz2 is double parity.
If you use a triple mirror, then the numbers will be closer to the raidz2
numbers.

For explanations of these models, see my blog,
http://blogs.sun.com/relling/entry/a_story_of_two_mttdl
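
For anyone who wants to play with the first model, here is a rough Python
transcription of the classic MTTDL[1] approximations.  It roughly reproduces
the Barracuda rows above for the 8-disk configs being discussed; the
MTTDL[2] column, which also accounts for unrecoverable read errors during
reconstruction, is not modelled here.

HOURS_PER_YEAR = 24 * 365

def mttdl1_mirror_stripe(mtbf_hr, pairs, mttr_hr):
    # a 2-way mirror pair is lost when its second disk dies during resilver
    per_pair_hr = mtbf_hr ** 2 / (2 * mttr_hr)
    return per_pair_hr / pairs / HOURS_PER_YEAR

def mttdl1_raidz2(mtbf_hr, n_disks, mttr_hr):
    # double parity: three overlapping failures are needed to lose data
    hr = mtbf_hr ** 3 / (n_disks * (n_disks - 1) * (n_disks - 2) * mttr_hr ** 2)
    return hr / HOURS_PER_YEAR

mttr = 24 + 250 / 60    # logistical MTTR plus resilver of 250 GBytes at 60 GBytes/hr
print(mttdl1_mirror_stripe(750_000, 4, mttr))   # ~285,000 yrs: 8 Barracudas as 4 mirror pairs
print(mttdl1_raidz2(750_000, 8, mttr))          # ~181,000,000 yrs: the same 8 disks as one raidz2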

 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Tim
On Tue, Sep 30, 2008 at 8:50 PM, Toby Thain <[EMAIL PROTECTED]>wrote:

>
>
> * NetApp's block-appended checksum approach appears similar but is in fact
> much stronger. Like many arrays, NetApp formats its drives with 520-byte
> sectors. It then groups them into 8-sector blocks: 4K of data (the WAFL
> filesystem blocksize) and 64 bytes of checksum. When WAFL reads a block it
> compares the checksum to the data just like an array would, but there's a
> key difference: it does this comparison after the data has made it through
> the I/O path, so it validates that the block made the journey from platter
> to memory without damage in transit.*
>
>
> This is not end to end protection; they are merely saying the data arrived
> in the storage subsystem's memory verifiably intact. The data still has a
> long way to go before it reaches the application.
>
> --Toby
>
>
As it does in ANY fileserver scenario, INCLUDING zfs.  He is building a
FILESERVER.  This is not an APPLICATION server.  You seem to be stuck on
this idea that everyone is using ZFS on the server they're running the
application.  That does a GREAT job of creating disparate storage islands,
something EVERY enterprise is trying to get rid of.  Not create more of.

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Miles Nordin
> "rt" == Robert Thurlow <[EMAIL PROTECTED]> writes:
> "dm" == David Magda <[EMAIL PROTECTED]> writes:

dm> None of which helped Amazon when their S3 service went down due
dm> to a flipped bit:

ok, I get that S3 went down due to corruption, and that the network
checksums I mentioned failed to prevent the corruption.  The missing
piece is: belief that the corruption occurred on the network rather
than somewhere else.

Their post-mortem sounds to me as though a bit flipped inside the
memory of one server could be spread via this ``gossip'' protocol to
infect the entire cluster.  The replication and spreadability of the
data makes their cluster into a many-terabyte gamma ray detector.

I wonder if they even use a meaningful VPN.

  > Modern NFS runs over a TCP connection, which includes its own
  > data validation.  This surely helps.

Yeah fine, but IP and UDP and Ethernet also have checksums.  The one
in TCP isn't much fancier.
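
To make that concrete, here is a small Python illustration of the 16-bit
one's-complement checksum (RFC 1071 style) used by IP, UDP and TCP: because
it is just a commutative sum of 16-bit words, reordered words, or certain
pairs of offsetting bit flips, go completely undetected.

def inet_checksum(data: bytes) -> int:
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # end-around carry
    return ~total & 0xFFFF

m1 = b"\x12\x34\xab\xcd\x00\x01"
m2 = b"\xab\xcd\x12\x34\x00\x01"     # same 16-bit words, swapped
print(m1 != m2, inet_checksum(m1) == inet_checksum(m2))   # True True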

rt> The TCP checksum isn't very strong, and we've seen corruption
rt> tied to a broken router, where the Ethernet checksum was
rt> recomputed on bad data, and the TCP checksum didn't help.  It
rt> sucked.

That's more like what I was looking for.

The other concept from your first post of ``protection domains'' is
interesting, too (of one domain including ZFS and NFS).  Of course,
what do you do when you get an error on an NFS client, throw ``stale
NFS file handle?''  Even speaking hypothetically, it depends on good
exception handling for its value, which has been a big trouble spot
for ZFS so far.

This ``protection domain'' concept is already enshrined in IEEE
802.1d---bridges are not supposed to recalculate the FCS, and if they
need to mangle the packet they're supposed to update the FCS
algorithmically based on fancy math and only the bits they changed,
not just recalculate it over the whole packet.  They state this is to
protect against bad RAM inside the bridge.  I don't know if anyone
DOES that, but it's written into the spec.

But if the network is L3, then FCS and IP checksums (ttl decrement)
will have to be recalculated, so the ``protection domain'' is partly
split leaving only the UDP/TCP checksum contiguous.


pgpDuHpk4l3x2.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Toby Thain


On 30-Sep-08, at 6:31 PM, Tim wrote:




On Tue, Sep 30, 2008 at 5:19 PM, Erik Trimble  
<[EMAIL PROTECTED]> wrote:


To make Will's argument more succinct (), with a NetApp,  
undetectable (by the NetApp) errors can be introduced at the HBA  
and transport layer (FC Switch, slightly damaged cable) levels.
ZFS will detect such errors, and fix them (if properly configured).  
NetApp has no such ability.


Also, I'm not sure that a NetApp (or EMC) has the ability to find  
bit-rot.  ...




NetApp's block-appended checksum approach appears similar but is in  
fact much stronger. Like many arrays, NetApp formats its drives  
with 520-byte sectors. It then groups them into 8-sector blocks: 4K  
of data (the WAFL filesystem blocksize) and 64 bytes of checksum.  
When WAFL reads a block it compares the checksum to the data just  
like an array would, but there's a key difference: it does this  
comparison after the data has made it through the I/O path, so it  
validates that the block made the journey from platter to memory  
without damage in transit.




This is not end to end protection; they are merely saying the data  
arrived in the storage subsystem's memory verifiably intact. The data  
still has a long way to go before it reaches the application.


--Toby



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Ahmed Kamal
Hm, Richard's excellent graphs here http://blogs.sun.com/relling/tags/mttdl
as well as his words say he prefers mirroring over raidz/raidz2 almost
always. It's better for performance and MTTDL.

Since 8 sata raid1 is cheaper and probably more reliable than 8 raidz2 sas
(and I don't need extra SAS performance), and offers better performance and
MTTDL than 8 sata raidz2, I guess I will go with 8-sata-raid1 then!
Hope I'm not horribly mistaken :)

On Wed, Oct 1, 2008 at 3:18 AM, Tim <[EMAIL PROTECTED]> wrote:

>
>
> On Tue, Sep 30, 2008 at 8:13 PM, Ahmed Kamal <
> [EMAIL PROTECTED]> wrote:
>
>>
>>> So, performance aside, does SAS have other benefits ? Data integrity ?
>>> How would a 8 raid1 sata compare vs another 8 smaller SAS disks in raidz(2)
>>> ?
>>> Like apples and pomegranates.  Both should be able to saturate a GbE
>>> link.
>>>
>>
>> You're the expert, but isn't the 100M/s for streaming not random
>> read/write. For that, I suppose the disk drops to around 25M/s which is why
>> I was mentioning 4 sata disks.
>>
>> When I was asking for comparing the 2 raids, It's was aside from
>> performance, basically sata is obviously cheaper, it will saturate the gig
>> link, so performance yes too, so the question becomes which has better data
>> protection ( 8 sata raid1 or 8 sas raidz2)
>>
>
> SAS's main benefits are seek time and max IOPS.
>
> --Tim
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Bob Friesenhahn
On Tue, 30 Sep 2008, Robert Thurlow wrote:

>> Modern NFS runs over a TCP connection, which includes its own data 
>> validation.  This surely helps.
>
> Less than we'd sometimes like :-)  The TCP checksum isn't
> very strong, and we've seen corruption tied to a broken
> router, where the Ethernet checksum was recomputed on
> bad data, and the TCP checksum didn't help.  It sucked.

TCP does not see the router.  The TCP and ethernet checksums are at 
completely different levels.  Routers do not pass ethernet packets. 
They pass IP packets. Your statement does not make technical sense.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Tim
On Tue, Sep 30, 2008 at 8:13 PM, Ahmed Kamal <
[EMAIL PROTECTED]> wrote:

>
>> So, performance aside, does SAS have other benefits ? Data integrity ? How
>> would a 8 raid1 sata compare vs another 8 smaller SAS disks in raidz(2) ?
>> Like apples and pomegranates.  Both should be able to saturate a GbE link.
>>
>
> You're the expert, but isn't the 100M/s for streaming not random
> read/write. For that, I suppose the disk drops to around 25M/s which is why
> I was mentioning 4 sata disks.
>
> When I was asking for comparing the 2 raids, It's was aside from
> performance, basically sata is obviously cheaper, it will saturate the gig
> link, so performance yes too, so the question becomes which has better data
> protection ( 8 sata raid1 or 8 sas raidz2)
>

SAS's main benefits are seek time and max IOPS.
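
A back-of-the-envelope Python estimate of what that means for random IOPS
(the average seek times below are assumed typical figures, not specs for any
particular drive):

def random_iops(avg_seek_ms, rpm):
    # one random I/O costs roughly an average seek plus half a rotation
    rotational_latency_ms = 0.5 * 60_000 / rpm
    return 1000.0 / (avg_seek_ms + rotational_latency_ms)

print(round(random_iops(8.5, 7_200)))    # ~79  IOPS, typical 7,200 rpm SATA
print(round(random_iops(3.5, 15_000)))   # ~182 IOPS, typical 15k rpm SAS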

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Bob Friesenhahn
On Wed, 1 Oct 2008, Ahmed Kamal wrote:

> Always assuming 2 spare disks, and Using the sata disks, I would configure
> them in raid1 mirror (raid6 for the 400G), Besides being cheaper, I would
> get more useable space (4TB vs 2.4TB), Better performance of raid1 (right?),
> and better data reliability ?? (don't really know about that one) ?
>
> Is this a recommended setup ? It looks too good to be true ?

Using mirrors will surely make up quite a lot for disks with slow seek 
times.  Reliability is acceptable for most purposes.  Resilver should 
be pretty fast.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Ahmed Kamal
>
>
> So, performance aside, does SAS have other benefits ? Data integrity ? How
> would a 8 raid1 sata compare vs another 8 smaller SAS disks in raidz(2) ?
> Like apples and pomegranates.  Both should be able to saturate a GbE link.
>

You're the expert, but isn't the 100 MB/s for streaming, not random read/write?
For that, I suppose the disk drops to around 25 MB/s, which is why I was
mentioning 4 SATA disks.

When I was asking for comparing the 2 raids, It's was aside from
performance, basically sata is obviously cheaper, it will saturate the gig
link, so performance yes too, so the question becomes which has better data
protection ( 8 sata raid1 or 8 sas raidz2)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Richard Elling
Ahmed Kamal wrote:
>
> I observe that there are no disk vendors supplying SATA disks
> with speed > 7,200 rpm.  It is no wonder that a 10k rpm disk
> outperforms a 7,200 rpm disk for random workloads.  I'll attribute
> this to intentional market segmentation by the industry rather than
> a deficiency in the transfer protocol (SATA).
>
>
> I don't really need more performance that what's needed to saturate a 
> gig link (4 sata disks?)

It depends on the disk.  A Seagate Barracuda 500 GByte SATA disk is
rated at a media speed of 105 MBytes/s which is near the limit of a
GbE link.  In theory, one disk would be close, two should do it.
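
Rough arithmetic in Python (the ~115 MBytes/s of usable GbE payload and the
~25 MBytes/s random-workload figure are assumptions; 105 MBytes/s is the
media rate quoted above):

import math

gbe_payload_mb_s = 115.0     # usable NFS payload on GbE after protocol overhead
streaming_mb_s   = 105.0     # one Barracuda, sequential
random_mb_s      = 25.0      # the same disk on a random workload (rough guess)

print(math.ceil(gbe_payload_mb_s / streaming_mb_s))   # 2 disks streaming
print(math.ceil(gbe_payload_mb_s / random_mb_s))      # 5 disks for random I/O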

> So, performance aside, does SAS have other benefits ? Data integrity ? 
> How would a 8 raid1 sata compare vs another 8 smaller SAS disks in 
> raidz(2) ?

Like apples and pomegranates.  Both should be able to saturate a GbE link.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Ahmed Kamal
>
> I observe that there are no disk vendors supplying SATA disks
> with speed > 7,200 rpm.  It is no wonder that a 10k rpm disk
> outperforms a 7,200 rpm disk for random workloads.  I'll attribute
> this to intentional market segmentation by the industry rather than
> a deficiency in the transfer protocol (SATA).
>

I don't really need more performance than what's needed to saturate a gig
link (4 SATA disks?)
So, performance aside, does SAS have other benefits ? Data integrity ? How
would 8 SATA disks in raid1 compare vs another 8 smaller SAS disks in raidz(2) ?
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Richard Elling
Tim wrote:
>
>
> On Tue, Sep 30, 2008 at 7:15 PM, David Magda <[EMAIL PROTECTED] 
> > wrote:
>
> On Sep 30, 2008, at 19:09, Tim wrote:
>
> SAS has far greater performance, and if your workload is
> extremely random,
> will have a longer MTBF.  SATA drives suffer badly on random
> workloads.
>
>
> Well, if you can probably afford more SATA drives for the purchase
> price, you can put them in a striped-mirror set up, and that may
> help things. If your disks are cheap you can afford to buy more of
> them (space, heat, and power not withstanding).
>
>
> More disks will not solve SATA's problem.  I run into this on a daily 
> basis working on enterprise storage.  If it's for just 
> archive/storage, or even sequential streaming, it shouldn't be a big 
> deal.  If it's random workload, there's pretty much nothing you can do 
> to get around it short of more front-end cache and intelligence which 
> is simply a band-aid, not a fix.

I observe that there are no disk vendors supplying SATA disks
with speed > 7,200 rpm.  It is no wonder that a 10k rpm disk
outperforms a 7,200 rpm disk for random workloads.  I'll attribute
this to intentional market segmentation by the industry rather than
a deficiency in the transfer protocol (SATA).
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Robert Thurlow
Miles Nordin wrote:

> There are checksums in the ethernet FCS, checksums in IP headers,
> checksums in UDP headers (which are sometimes ignored), and checksums
> in TCP (which are not ignored).  There might be an RPC layer checksum,
> too, not sure.
> 
> Different arguments can be made against each, I suppose, but did you
> have a particular argument in mind?
> 
> Have you experienced corruption with NFS that you can blame on the
> network, not the CPU/memory/busses of the server and client?

Absolutely.  See my recent post in this thread.  The TCP
checksum is not that strong, and a router broken the
right way can regenerate a correct-looking Ethernet
checksum on bad data.  krb5i fixed it nicely.

Rob T
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Robert Thurlow
Bob Friesenhahn wrote:
> On Wed, 1 Oct 2008, Ahmed Kamal wrote:
>> So, I guess this makes them equal. How about a new "reliable NFS" protocol,
>> that computes the hashes on the client side, sends it over the wire to be
>> written remotely on the zfs storage node ?!
> 
> Modern NFS runs over a TCP connection, which includes its own data 
> validation.  This surely helps.

Less than we'd sometimes like :-)  The TCP checksum isn't
very strong, and we've seen corruption tied to a broken
router, where the Ethernet checksum was recomputed on
bad data, and the TCP checksum didn't help.  It sucked.

Rob T
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Ahmed Kamal
>
> Well, if you can probably afford more SATA drives for the purchase
> price, you can put them in a striped-mirror set up, and that may help
> things. If your disks are cheap you can afford to buy more of them
> (space, heat, and power not withstanding).
>

Hmm, that's actually cool !
If I configure the system with

10 x 400G 10k rpm disk == cost ==> 13k$
10 x 1TB SATA 7200 == cost ==> 9k$

Always assuming 2 spare disks: using the SATA disks, I would configure
them in a raid1 mirror (raid6 for the 400G ones). Besides being cheaper, I would
get more usable space (4TB vs 2.4TB), better performance from raid1 (right?),
and better data reliability ?? (don't really know about that one)

Is this a recommended setup ? It looks too good to be true ?
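
The usable-space arithmetic, as a quick Python sanity check (8 data disks
are left after the 2 spares in both cases):

def mirror_usable_tb(data_disks, disk_tb):
    return (data_disks // 2) * disk_tb      # striped 2-way mirrors

def raidz2_usable_tb(data_disks, disk_tb):
    return (data_disks - 2) * disk_tb       # two disks' worth of parity

print(round(mirror_usable_tb(8, 1.0), 1))   # 8 x 1 TB SATA as 4 mirror pairs -> 4.0 TB
print(round(raidz2_usable_tb(8, 0.4), 1))   # 8 x 400 GB as one raidz2/raid6  -> 2.4 TB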
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Tim
On Tue, Sep 30, 2008 at 7:15 PM, David Magda <[EMAIL PROTECTED]> wrote:

> On Sep 30, 2008, at 19:09, Tim wrote:
>
>  SAS has far greater performance, and if your workload is extremely random,
>> will have a longer MTBF.  SATA drives suffer badly on random workloads.
>>
>
> Well, if you can probably afford more SATA drives for the purchase price,
> you can put them in a striped-mirror set up, and that may help things. If
> your disks are cheap you can afford to buy more of them (space, heat, and
> power not withstanding).
>
>
More disks will not solve SATA's problem.  I run into this on a daily basis
working on enterprise storage.  If it's for just archive/storage, or even
sequential streaming, it shouldn't be a big deal.  If it's random workload,
there's pretty much nothing you can do to get around it short of more
front-end cache and intelligence which is simply a band-aid, not a fix.

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread David Magda
On Sep 30, 2008, at 19:09, Tim wrote:

> SAS has far greater performance, and if your workload is extremely  
> random,
> will have a longer MTBF.  SATA drives suffer badly on random  
> workloads.

Well, if you can probably afford more SATA drives for the purchase  
price, you can put them in a striped-mirror set up, and that may help  
things. If your disks are cheap you can afford to buy more of them  
(space, heat, and power not withstanding).

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread David Magda
On Sep 30, 2008, at 19:44, Miles Nordin wrote:

> There are checksums in the ethernet FCS, checksums in IP headers,
> checksums in UDP headers (which are sometimes ignored), and checksums
> in TCP (which are not ignored).  There might be an RPC layer checksum,
> too, not sure.

None of which helped Amazon when their S3 service went down due to a  
flipped bit:

> More specifically, we found that there were a handful of messages on  
> Sunday morning that had a single bit corrupted such that the message  
> was still intelligible, but the system state information was  
> incorrect. We use MD5 checksums throughout the system, for example,  
> to prevent, detect, and recover from corruption that can occur  
> during receipt, storage, and retrieval of customers' objects.  
> However, we didn't have the same protection in place to detect  
> whether this particular internal state information had been corrupted.


http://status.aws.amazon.com/s3-20080720.html

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Miles Nordin
> "rt" == Robert Thurlow <[EMAIL PROTECTED]> writes:

rt> introduces a separate protection domain for the NFS link.

There are checksums in the ethernet FCS, checksums in IP headers,
checksums in UDP headers (which are sometimes ignored), and checksums
in TCP (which are not ignored).  There might be an RPC layer checksum,
too, not sure.

Different arguments can be made against each, I suppose, but did you
have a particular argument in mind?

Have you experienced corruption with NFS that you can blame on the
network, not the CPU/memory/busses of the server and client?


I've experienced enough to make me buy stories of corruption in disks,
disk interfaces, and memory, but not yet with TCP, so I'd like to hear
the story as well as the hypothetical argument, if there is one.


pgpE7CaaWOORQ.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Nicolas Williams
On Tue, Sep 30, 2008 at 06:09:30PM -0500, Tim wrote:
> On Tue, Sep 30, 2008 at 6:03 PM, Ahmed Kamal <
> [EMAIL PROTECTED]> wrote:
> > BTW, for everyone saying zfs is more reliable because it's closer to the
> > application than a netapp, well at least in my case it isn't. The solaris
> > box will be NFS sharing and the apps will be running on remote Linux boxes.
> > So, I guess this makes them equal. How about a new "reliable NFS" protocol,
> > that computes the hashes on the client side, sends it over the wire to be
> > written remotely on the zfs storage node ?!
> 
> Won't be happening anytime soon.

If you use RPCSEC_GSS with integrity protection then you've got it
already.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Bob Friesenhahn
On Wed, 1 Oct 2008, Ahmed Kamal wrote:
> So, I guess this makes them equal. How about a new "reliable NFS" protocol,
> that computes the hashes on the client side, sends it over the wire to be
> written remotely on the zfs storage node ?!

Modern NFS runs over a TCP connection, which includes its own data 
validation.  This surely helps.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Bob Friesenhahn
On Tue, 30 Sep 2008, Miles Nordin wrote:
>
> I think you need two zpools, or zpool + LVM2/XFS, some kind of
> two-filesystem setup, because of the ZFS corruption and
> panic/freeze-on-import problems.  Having two zpools helps with other

If ZFS provides such a terrible experience for you, can I be brave 
enough to suggest that perhaps you are on the wrong mailing list and 
perhaps you should be watching the pinwheels with HFS+?  ;-)

While we surely do hear all the horror stories on this list, I don't 
think that ZFS is as wildly unstable as you make it out to be.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Robert Thurlow
Ahmed Kamal wrote:

> BTW, for everyone saying zfs is more reliable because it's closer to the 
> application than a netapp, well at least in my case it isn't. The 
> solaris box will be NFS sharing and the apps will be running on remote 
> Linux boxes. So, I guess this makes them equal. How about a new 
> "reliable NFS" protocol, that computes the hashes on the client side, 
> sends it over the wire to be written remotely on the zfs storage node ?!

We've actually prototyped an NFS protocol extension that does
this, but the challenges are integrating it with ZFS to form
a single protection domain, and getting the protocol to be a
standard.

For now, an option you have is Kerberos with data integrity;
the sender computes a CRC of the data and the receiver can
verify it to rule out OTW corruption.  This is, of course,
not end-to-end from platter to memory, but introduces a
separate protection domain for the NFS link.

Rob T
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Tim
On Tue, Sep 30, 2008 at 6:03 PM, Ahmed Kamal <
[EMAIL PROTECTED]> wrote:

>
>
>>
> Hmm ... well, there is a considerable price difference, so unless someone
> says I'm horribly mistaken, I now want to go back to Barracuda ES 1TB 7200
> drives. By the way, how many of those would saturate a single (non trunked)
> Gig ethernet link ? Workload NFS sharing of software and homes. I think 4
> disks should be about enough to saturate it ?
>

SAS has far greater performance, and if your workload is extremely random,
will have a longer MTBF.  SATA drives suffer badly on random workloads.


>
>
> BTW, for everyone saying zfs is more reliable because it's closer to the
> application than a netapp, well at least in my case it isn't. The solaris
> box will be NFS sharing and the apps will be running on remote Linux boxes.
> So, I guess this makes them equal. How about a new "reliable NFS" protocol,
> that computes the hashes on the client side, sends it over the wire to be
> written remotely on the zfs storage node ?!
>

Won't be happening anytime soon.


--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Ahmed Kamal
>
>  Intel mainstream (and indeed many tech companies') stuff is purposely
> stratified from the enterprise stuff by cutting out features like ECC and
> higher memory capacity and using different interface form factors.


Well I guess I am getting a Xeon anyway


> There is nothing magical about SAS drives. Hard drives are for the most
> part all built with the same technology.  The MTBF on that is 1.4M hours vs
> 1.2M hours for the enterprise 1TB SATA disk, which isn't a big difference.
>  And for comparison, the WD3000BLFS is a consumer drive with 1.4M hours
> MTBF.
>
>
Hmm ... well, there is a considerable price difference, so unless someone
says I'm horribly mistaken, I now want to go back to Barracuda ES 1TB 7200
drives. By the way, how many of those would saturate a single (non trunked)
Gig ethernet link ? Workload NFS sharing of software and homes. I think 4
disks should be about enough to saturate it ?
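
A rough back-of-the-envelope answer, assuming ~112 MB/s of usable NFS payload
on a non-trunked GigE link and ~80 MB/s of streaming throughput per 7200 rpm
SATA drive (both figures are assumptions for illustration, not measurements
from this thread): a couple of drives already saturate the link for sequential
work, while a small-random workload is limited by seeks long before the wire
matters. A minimal Python sketch of that arithmetic:

import math

GIGE_PAYLOAD_MBPS = 112     # ~1 Gb/s minus Ethernet/IP/TCP/NFS overhead (assumed)
SEQ_MBPS_PER_DISK = 80      # assumed streaming rate for a 7200 rpm SATA drive
RAND_IOPS_PER_DISK = 90     # assumed small random IOPS for the same drive
RAND_IO_SIZE_KB = 8         # assumed NFS random read size

def disks_to_saturate(per_disk_mbps):
    """Smallest whole number of disks whose aggregate rate exceeds the link."""
    return math.ceil(GIGE_PAYLOAD_MBPS / per_disk_mbps)

rand_mbps = RAND_IOPS_PER_DISK * RAND_IO_SIZE_KB / 1024.0
print("sequential streaming:", disks_to_saturate(SEQ_MBPS_PER_DISK), "disks")  # ~2
print("8K random I/O:       ", disks_to_saturate(rand_mbps), "disks")          # ~160

So for streaming, the link becomes the bottleneck after two or three spindles;
for random home-directory traffic the disks are the bottleneck, and four of
them will come nowhere near filling the wire.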

BTW, for everyone saying zfs is more reliable because it's closer to the
application than a netapp, well at least in my case it isn't. The solaris
box will be NFS sharing and the apps will be running on remote Linux boxes.
So, I guess this makes them equal. How about a new "reliable NFS" protocol
that computes the hashes on the client side and sends them over the wire to be
written remotely on the zfs storage node ?!
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Tim
On Tue, Sep 30, 2008 at 5:19 PM, Erik Trimble <[EMAIL PROTECTED]> wrote:

>
> To make Will's argument more succinct, with a NetApp, undetectable
> (by the NetApp) errors can be introduced at the HBA and transport layer (FC
> Switch, slightly damaged cable) levels.   ZFS will detect such errors, and
> fix them (if properly configured). NetApp has no such ability.
>
> Also, I'm not sure that a NetApp (or EMC) has the ability to find bit-rot.
>  That is, they can determine if a block is written correctly, but I don't
> know if they keep the block checksum around permanently, and, how redundant
> that stored block checksum is.  If they don't permanently write the block
> checksum somewhere, then the NetApp has no way to determine if a READ block
> is OK, and hasn't suffered from bit-rot (aka disk block failure).  And, if
> it's not either multiply stored, then they have the potential to lose the
> ability to do READ verification.  Neither are problems of ZFS.
>
>
> In many of my production environments, I've got at least 2 different FC
> switches between my hosts and disks.  And, with longer cables, comes more of
> the chance that something gets bent a bit too much. Finally, HBAs are not
> the most reliable things I've seen (sadly).
>
>
>
* NetApp's block-appended checksum approach appears similar but is in fact
much stronger. Like many arrays, NetApp formats its drives with 520-byte
sectors. It then groups them into 8-sector blocks: 4K of data (the WAFL
filesystem blocksize) and 64 bytes of checksum. When WAFL reads a block it
compares the checksum to the data just like an array would, but there's a
key difference: it does this comparison after the data has made it through
the I/O path, so it validates that the block made the journey from platter
to memory without damage in transit. *
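
The arithmetic in that description checks out: 8 x 520 = 4160 bytes, i.e. one
4096-byte WAFL block plus 64 bytes of checksum area. A rough Python sketch of
the idea (purely illustrative -- the CRC, the packing, and the verify-on-read
step are my assumptions about the scheme described above, not NetApp's actual
on-disk format):

import zlib

SECTOR_BYTES = 520        # "formats its drives with 520-byte sectors"
SECTORS_PER_BLOCK = 8     # "groups them into 8-sector blocks"
DATA_BYTES = 4096         # 4K WAFL filesystem blocksize
CHECKSUM_BYTES = 64       # block-appended checksum area

assert SECTOR_BYTES * SECTORS_PER_BLOCK == DATA_BYTES + CHECKSUM_BYTES   # 4160 == 4160

def pack_block(data):
    """Append a checksum to a 4K block before it heads down the I/O path."""
    assert len(data) == DATA_BYTES
    crc = zlib.crc32(data).to_bytes(4, "big")
    return data + crc.ljust(CHECKSUM_BYTES, b"\0")

def unpack_block(raw):
    """Verify the checksum after the data has come back up the I/O path."""
    data, stored = raw[:DATA_BYTES], raw[DATA_BYTES:DATA_BYTES + 4]
    if zlib.crc32(data).to_bytes(4, "big") != stored:
        raise IOError("checksum mismatch somewhere between platter and memory")
    return data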
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread A Darren Dunham
On Tue, Sep 30, 2008 at 03:19:40PM -0700, Erik Trimble wrote:
> To make Will's argument more succinct, with a NetApp, 
> undetectable (by the NetApp) errors can be introduced at the HBA and 
> transport layer (FC Switch, slightly damaged cable) levels.   ZFS will 
> detect such errors, and fix them (if properly configured). NetApp has no 
> such ability.

It sounds like you mean the Netapp can't detect silent errors in its
own storage.  It can (in a manner similar, but not identical to ZFS).

The difference is that the Netapp is always remote from the application,
and cannot detect corruption introduced before it arrives at the filer.

> Also, I'm not sure that a NetApp (or EMC) has the ability to find 
> bit-rot.  That is, they can determine if a block is written correctly, 
> but I don't know if they keep the block checksum around permanently, 
> and, how redundant that stored block checksum is.  If they don't 
> permanently write the block checksum somewhere, then the NetApp has no 
> way to determine if a READ block is OK, and hasn't suffered from bit-rot 
> (aka disk block failure).  And, if it's not either multiply stored, then 
> they have the potential to lose the ability to do READ verification.  
> Neither are problems of ZFS.

A netapp filer does have a permanent block checksum that can verify
reads.  To my knowledge, it is not redundant.  But then if it fails, you
can just declare that block bad and fall back on the RAID/mirror
redundancy to supply the data.

-- 
Darren
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Erik Trimble
Will Murnane wrote:
> On Tue, Sep 30, 2008 at 21:48, Tim <[EMAIL PROTECTED]> wrote:
>   
>>> why ZFS can do this and hardware solutions can't (being several
>>> unreliable subsystems away from the data).
>>>   
>> So how is a Server running Solaris with a QLogic HBA connected to an FC JBOD
>> any different than a NetApp filer, running ONTAP with a QLogic HBA directly
>> connected to an FC JBOD?  How is it "several unreliable subsystems away from
>> the data"?
>>
>> That's a great talking point but it's far from accurate.
>> 
> Do your applications run on the NetApp filer?  The idea of ZFS as I
> see it is to checksum the data from when the application puts the data
> into memory until it reads it out of memory again.  Separate filers
> can checksum from when data is written into their buffers until they
> receive the request for that data, but to get from the filer to the
> machine running the application the data must be sent across an
> unreliable medium.  If data is corrupted between the filer and the
> host, the corruption cannot be detected.  Perhaps the filer could use
> a special protocol and include the checksum for each block, but then
> the host must verify the checksum for it to be useful.
>
> Contrast this with ZFS.  It takes the application data, checksums it,
> and writes the data and the checksum out across the (unreliable) wire
> to the (unreliable) disk.  Then when a read request comes, it reads
> the data and checksum across the (unreliable) wire, and verifies the
> checksum on the *host* side of the wire.  If the data is corrupted any
> time between the checksum being calculated on the host and checked on
> the host, it can be detected.  This adds a couple more layers of
> verifiability than filer-based checksums.
>
> Will
>   

To make Will's argument more succinct, with a NetApp, 
undetectable (by the NetApp) errors can be introduced at the HBA and 
transport layer (FC Switch, slightly damaged cable) levels.   ZFS will 
detect such errors, and fix them (if properly configured). NetApp has no 
such ability.

Also, I'm not sure that a NetApp (or EMC) has the ability to find 
bit-rot.  That is, they can determine if a block is written correctly, 
but I don't know if they keep the block checksum around permanently, 
and, how redundant that stored block checksum is.  If they don't 
permanently write the block checksum somewhere, then the NetApp has no 
way to determine if a READ block is OK, and hasn't suffered from bit-rot 
(aka disk block failure).  And, if it's not either multiply stored, then 
they have the potential to lose the ability to do READ verification.  
Neither are problems of ZFS.


In many of my production environments, I've got at least 2 different FC 
switches between my hosts and disks.  And, with longer cables, comes 
more of the chance that something gets bent a bit too much. Finally, 
HBAs are not the most reliable things I've seen (sadly).

-- 
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Will Murnane
On Tue, Sep 30, 2008 at 21:48, Tim <[EMAIL PROTECTED]> wrote:
>> why ZFS can do this and hardware solutions can't (being several
>> unreliable subsystems away from the data).
> So how is a Server running Solaris with a QLogic HBA connected to an FC JBOD
> any different than a NetApp filer, running ONTAP with a QLogic HBA directly
> connected to an FC JBOD?  How is it "several unreliable subsystems away from
> the data"?
>
> That's a great talking point but it's far from accurate.
Do your applications run on the NetApp filer?  The idea of ZFS as I
see it is to checksum the data from when the application puts the data
into memory until it reads it out of memory again.  Separate filers
can checksum from when data is written into their buffers until they
receive the request for that data, but to get from the filer to the
machine running the application the data must be sent across an
unreliable medium.  If data is corrupted between the filer and the
host, the corruption cannot be detected.  Perhaps the filer could use
a special protocol and include the checksum for each block, but then
the host must verify the checksum for it to be useful.

Contrast this with ZFS.  It takes the application data, checksums it,
and writes the data and the checksum out across the (unreliable) wire
to the (unreliable) disk.  Then when a read request comes, it reads
the data and checksum across the (unreliable) wire, and verifies the
checksum on the *host* side of the wire.  If the data is corrupted any
time between the checksum being calculated on the host and checked on
the host, it can be detected.  This adds a couple more layers of
verifiability than filer-based checksums.
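
A toy Python sketch of the distinction Will is drawing, with a deliberately
flaky "wire" in the middle (illustrative only -- ZFS really uses fletcher or
SHA-256 checksums stored in the block-pointer tree, not this scheme):

import hashlib
import random

def wire(payload, flip_chance=0.2):
    """Stand-in for HBA + switch + cable: occasionally flips one bit in transit."""
    if random.random() < flip_chance:
        i = random.randrange(len(payload))
        payload = payload[:i] + bytes([payload[i] ^ 0x01]) + payload[i + 1:]
    return payload

disk = {}

# Host-side (ZFS-style): checksum computed on the host, verified on the host.
def host_write(key, data):
    disk[key] = (wire(data), hashlib.sha256(data).digest())

def host_read(key):
    data, stored = disk[key]
    data = wire(data)                       # the read path is just as unreliable
    if hashlib.sha256(data).digest() != stored:
        raise IOError("corruption caught on the host; go fetch a redundant copy")
    return data

# Filer-side: checksum computed only after the data has already crossed the wire,
# so a bit flipped in transit is faithfully "protected" in its damaged form.
def filer_write(key, data):
    arrived = wire(data)
    disk[key] = (arrived, hashlib.sha256(arrived).digest())

def filer_read(key):
    data, stored = disk[key]
    assert hashlib.sha256(data).digest() == stored   # the filer's own check still passes
    return wire(data)                                # and the return trip is unprotected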

Will
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Tim
On Tue, Sep 30, 2008 at 4:26 PM, Toby Thain <[EMAIL PROTECTED]>wrote:

>
> On 30-Sep-08, at 6:58 AM, Ahmed Kamal wrote:
>
> > Thanks for all the answers .. Please find more questions below :)
> >
> > - Good to know EMC filers do not have end2end checksums! What about
> > netapp ?
>
> Bluntly - no remote storage can have it by definition. The checksum
> needs to be computed as close as possible to the application. That's
> why ZFS can do this and hardware solutions can't (being several
> unreliable subsystems away from the data).
>
> --Toby
>
> > ...
>

So how is a Server running Solaris with a QLogic HBA connected to an FC JBOD
any different than a NetApp filer, running ONTAP with a QLogic HBA directly
connected to an FC JBOD?  How is it "several unreliable subsystems away from
the data"?

That's a great talking point but it's far from accurate.

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Erik Trimble
Toby Thain wrote:
> On 30-Sep-08, at 6:58 AM, Ahmed Kamal wrote:
>
>   
>> Thanks for all the answers .. Please find more questions below :)
>>
>> - Good to know EMC filers do not have end2end checksums! What about  
>> netapp ?
>> 
>
> Bluntly - no remote storage can have it by definition. The checksum  
> needs to be computed as close as possible to the application. That's  
> why ZFS can do this and hardware solutions can't (being several  
> unreliable subsystems away from the data).
>
> --Toby
>
>   

Well

That's not _strictly_ true. ZFS can still munge things up as a result of 
faulty memory.  And, it's entirely possible to build a hardware end2end 
system which is at least as reliable as ZFS (e.g. is only faultable due 
to host memory failures).  It's just neither easy, nor currently 
available from anyone I know. Doing such checking is far easier at the 
filesystem level than any other place, which is a big strength of ZFS 
over other hardware solutions.  Several of the storage vendors (EMC and 
NetApp included) I do believe support hardware checksumming over on the 
SAN/NAS device, but that still leaves them vulnerable to HBA and 
transport medium (e.g. FibreChannel/SCSI/Ethernet) errors, which they 
don't currently have a solution for.

I'd be interested in seeing if anyone has statistics about where errors 
occur in the data stream. My gut tells me that (from most common to least):

(1) hard drives
(2) transport medium (particularly if it's Ethernet)
(3) SAN/NAS controller cache
(4) Host HBA
(5) SAN/NAS controller
(6) Host RAM
(7) Host bus issues

-- 
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Toby Thain

On 30-Sep-08, at 6:58 AM, Ahmed Kamal wrote:

> Thanks for all the answers .. Please find more questions below :)
>
> - Good to know EMC filers do not have end2end checksums! What about  
> netapp ?

Bluntly - no remote storage can have it by definition. The checksum  
needs to be computed as close as possible to the application. That's  
why ZFS can do this and hardware solutions can't (being several  
unreliable subsystems away from the data).

--Toby

> ...
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread MC
Just to confuse you more, I mean, give you another point of view:

> - CPU: 1 Xeon Quad Core E5410 2.33GHz 12MB Cache 1333MHz

The reason the Xeon line is good is because it allows you to squeeze maximum 
performance out of a given processor technology from Intel, possibly getting 
the highest performance density.  The reason it is bad is because it isn't that 
much better for a lot more money.  A mainstream processor is 80% of the 
performance for 20% of the price, so unless you need the highest possible 
performance density, you can save money going mainstream.  Not that you should. 
 Intel mainstream (and indeed many tech companies') stuff is purposely 
stratified from the enterprise stuff by cutting out features like ECC and 
higher memory capacity and using different interface form factors.

> - 10  Seagate 400GB 10K 16MB SAS HDD

There is nothing magical about SAS drives. Hard drives are for the most part 
all built with the same technology.  The MTBF on that is 1.4M hours vs 1.2M 
hours for the enterprise 1TB SATA disk, which isn't a big difference.  And for 
comparison, the WD3000BLFS is a consumer drive with 1.4M hours MTBF.

And we know that enterprise SATA drives are the same as the consumer drives, 
just with different firmware optimized for server workloads and longer testing 
designed to detect infant mortality, which affects MTBF just as much as old-age 
failure.  The MTBF difference from this extra testing at the start is huge.  So 
you can tell right there that the perceived extra reliability scam they're 
running is bunk.  The SAS interface is a psychological tool to help disguise 
the fact that we're all using roughly the same stuff :)  Do your own 24 hour or 
7-day stress-testing before deployment to weed out bad drives.  
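
A bare-bones write/read/verify pass, as a starting point for that kind of
burn-in (illustrative Python only -- it is destructive to whatever scratch
file or device you point it at, and a real burn-in would run far longer and
watch SMART counters as well):

import hashlib
import os
import sys

CHUNK = 1 << 20          # write and verify in 1 MiB chunks
TOTAL = 1 << 30          # exercise 1 GiB per pass; raise this for a real burn-in

def burn_in_pass(path, seed):
    """One write-then-verify sweep over an existing scratch file or raw device."""
    digests = []
    with open(path, "r+b", buffering=0) as dev:
        for i in range(TOTAL // CHUNK):
            block = hashlib.sha256(seed + i.to_bytes(8, "big")).digest() * (CHUNK // 32)
            digests.append(hashlib.sha256(block).digest())
            dev.write(block)
        os.fsync(dev.fileno())
        dev.seek(0)
        for i in range(TOTAL // CHUNK):
            if hashlib.sha256(dev.read(CHUNK)).digest() != digests[i]:
                sys.exit("verify failed at chunk %d: suspect drive" % i)
    print("pass completed cleanly")

if __name__ == "__main__":
    burn_in_pass(sys.argv[1], os.urandom(16))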

Apparently old humans don't live that much longer than they did in years gone 
by, instead much fewer of our babies die, which makes the average lifespan of 
everyone go up :)  

You know that 1TB SATA works for you now.  Don't let some big greedy company 
convince you otherwise.  That extra money should be spent on your payroll, not 
on filling EMC's coffers.

ZFS provides a new landscape for storage.  It is entirely possible that a 
server built with mainstream hardware can be cheaper, faster, and at least as 
reliable as an EMC system.  Manageability and interoperability and all those 
things are another issue however.
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Tim
On Tue, Sep 30, 2008 at 2:10 PM, Ahmed Kamal <
[EMAIL PROTECTED]> wrote:

>
> * Now that I'm using ECC RAM, and enterprisey disks, Does this put this
> solution in par with low end netapp 2020 for example ?
>
>
*sort of*.  What are you going to be using it for?  Half the beauty of
NetApp are all the add-on applications you run server side.  The snapmanager
products.

If you're just using it for basic single head file serving, I'd say you're
pretty much on par.  IMO, NetApp's clustering is still far superior (yes
folks, from a fileserver perspecctive, not an application clustering
perspective) to anything Solaris has to offer right now, and also much,
much, MUCH easier to configure/manage.  Let me know when I can plug an
infiniband cable between two Solaris boxes and type "cf enable" and we'll
talk :)

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Ahmed Kamal
Thanks guys, it seems the problem is even more difficult than I thought, and
it seems there is no real measure for the software quality of the zfs stack
vs others when the hardware under both is held constant. I will be using ECC
RAM, since you mentioned it, and I will shift to using "enterprise" disks (I
had initially thought zfs will always recovers from cheapo sata disks,
making other disks only faster but not also safer), but now I am shifting to
10krpm SAS disks

So, I am changing my question into "Do you see any obvious problems with the
following setup I am considering"

- CPU: 1 Xeon Quad Core E5410 2.33GHz 12MB Cache 1333MHz
- 16GB ECC FB-DIMM 667MHz (8 x 2GB)
- 10  Seagate 400GB 10K 16MB SAS HDD

The 10 disks will be: 2 spare + 2 parity for raidz2 + 6 data => 2.4TB
usable space
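
A quick sanity check on that arithmetic (raw capacity only; it ignores ZFS
metadata overhead, reserved space, and decimal-versus-binary gigabytes):

DISKS_TOTAL = 10
SPARES = 2
PARITY = 2                 # raidz2
DISK_GB = 400

data_disks = DISKS_TOTAL - SPARES - PARITY          # 6
usable_gb = data_disks * DISK_GB                    # 2400 GB, i.e. ~2.4 TB
print("%d data disks -> ~%.1f TB usable (before overhead)" % (data_disks, usable_gb / 1000.0))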

* Do I need more CPU power ? How do I measure that ? What about RAM ?!
* Now that I'm using ECC RAM, and enterprisey disks, Does this put this
solution in par with low end netapp 2020 for example ?

I will be replicating the important data daily to a Linux box, just in case
I hit a wonderful zpool bug. Any final advice before I take the blue pill ;)

Thanks a lot


On Tue, Sep 30, 2008 at 8:40 PM, Miles Nordin <[EMAIL PROTECTED]> wrote:

> > "ak" == Ahmed Kamal <[EMAIL PROTECTED]> writes:
>
>ak> I need to answer and weigh against the cost.
>
> I suggest translating the reliability problems into a cost for
> mitigating them: price the ZFS alternative as two systems, and keep
> the second system offline except for nightly backup.  Since you care
> mostly about data loss, not availability, this should work okay.  You
> can lose 1 day of data, right?
>
> I think you need two zpools, or zpool + LVM2/XFS, some kind of
> two-filesystem setup, because of the ZFS corruption and
> panic/freeze-on-import problems.  Having two zpools helps with other
> things, too, like if you need to destroy and recreate the pool to
> remove a slog or a vdev, or change from mirroring to raidz2, or
> something like that.
>
> I don't think it's realistic to give a quantitative MTDL for loss
> caused by software bugs, from netapp or from ZFS.
>
>ak> The EMC guy insisted we use 10k Fibre/SAS drives at least.
>
> I'm still not experienced at dealing with these guys without wasting
> huge amounts of time.  I guess one strategy is to call a bunch of
> them, so they are all wasting your time in parallel.  Last time I
> tried, the EMC guy wanted to meet _in person_ in the financial
> district, and then he just stopped calling so I had to guesstimate his
> quote from some low-end iSCSI/FC box that Dell was reselling.  Have
> you called netapp, hitachi, storagetek?  The IBM NAS is netapp so you
> could call IBM if netapp ignores you, but you probably want the
> storevault which is sold differently.  The HP NAS looks weird because
> it runs your choice of Linux or Windows instead of
> WeirdNASplatform---maybe read some more about that one.
>
> Of course you don't get source, but it surprised me these guys are
> MUCH worse than ordinary proprietary software.  At least netapp stuff,
> you may as well consider it leased.  They leverage the ``appliance''
> aspect, and then have sneaky licenses, that attempt to obliterate any
> potential market for used filers.  When you're cut off from support
> you can't even download manuals.  If you're accustomed to the ``first
> sale doctrine'' then ZFS with source has a huge advantage over netapp,
> beyond even ZFS's advantage over proprietary software.  The idea of
> dumping all my data into some opaque DRM canister lorded over by
> asshole CEOs who threaten to sic their corporate lawyers on users on
> the mailing list offends me just a bit, but I guess we have to follow
> the ``market forces.''
>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Miles Nordin
> "ak" == Ahmed Kamal <[EMAIL PROTECTED]> writes:

ak> I need to answer and weigh against the cost.

I suggest translating the reliability problems into a cost for
mitigating them: price the ZFS alternative as two systems, and keep
the second system offline except for nightly backup.  Since you care
mostly about data loss, not availability, this should work okay.  You
can lose 1 day of data, right?

I think you need two zpools, or zpool + LVM2/XFS, some kind of
two-filesystem setup, because of the ZFS corruption and
panic/freeze-on-import problems.  Having two zpools helps with other
things, too, like if you need to destroy and recreate the pool to
remove a slog or a vdev, or change from mirroring to raidz2, or
something like that.

I don't think it's realistic to give a quantitative MTDL for loss
caused by software bugs, from netapp or from ZFS.

ak> The EMC guy insisted we use 10k Fibre/SAS drives at least.

I'm still not experienced at dealing with these guys without wasting
huge amounts of time.  I guess one strategy is to call a bunch of
them, so they are all wasting your time in parallel.  Last time I
tried, the EMC guy wanted to meet _in person_ in the financial
district, and then he just stopped calling so I had to guesstimate his
quote from some low-end iSCSI/FC box that Dell was reselling.  Have
you called netapp, hitachi, storagetek?  The IBM NAS is netapp so you
could call IBM if netapp ignores you, but you probably want the
storevault which is sold differently.  The HP NAS looks weird because
it runs your choice of Linux or Windows instead of
WeirdNASplatform---maybe read some more about that one.

Of course you don't get source, but it surprised me these guys are
MUCH worse than ordinary proprietary software.  At least netapp stuff,
you may as well consider it leased.  They leverage the ``appliance''
aspect, and then have sneaky licenses, that attempt to obliterate any
potential market for used filers.  When you're cut off from support
you can't even download manuals.  If you're accustomed to the ``first
sale doctrine'' then ZFS with source has a huge advantage over netapp,
beyond even ZFS's advantage over proprietary software.  The idea of
dumping all my data into some opaque DRM canister lorded over by
asshole CEOs who threaten to sic their corporate lawyers on users on
the mailing list offends me just a bit, but I guess we have to follow
the ``market forces.''


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Richard Elling
Ahmed Kamal wrote:
> I guess I am mostly interested in MTDL for a zfs system on whitebox 
> hardware (like pogo), vs dataonTap on netapp hardware. Any numbers ?

It depends to a large degree on the disks chosen.  NetApp uses enterprise
class disks and you can expect better reliability from such disks.  I've 
blogged
about a few different MTTDL models and posted some model results.
http://blogs.sun.com/relling/tags/mttdl
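
For a feel of what such models look like, here is a simplified Python sketch
using the common textbook approximations for single- and double-parity MTTDL
(the MTBF and MTTR inputs are assumptions, and the formulas may differ in
detail from the ones worked through in those posts):

HOURS_PER_YEAR = 24 * 365

def mttdl_raidz1(n_disks, mtbf_h, mttr_h):
    """Data is lost if a second disk dies while the first is being reconstructed."""
    return mtbf_h ** 2 / (n_disks * (n_disks - 1) * mttr_h)

def mttdl_raidz2(n_disks, mtbf_h, mttr_h):
    """Double parity: three overlapping failures are needed to lose data."""
    return mtbf_h ** 3 / (n_disks * (n_disks - 1) * (n_disks - 2) * mttr_h ** 2)

# Example: an 8-disk raidz2 vdev, 1.2M-hour MTBF disks, 24-hour resilver (all assumed).
years = mttdl_raidz2(8, 1.2e6, 24) / HOURS_PER_YEAR
print("MTTDL ~ %.2e years (ignores unrecoverable read errors and operator error)" % years)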

 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Bob Friesenhahn
On Tue, 30 Sep 2008, Ahmed Kamal wrote:

> I guess I am mostly interested in MTDL for a zfs system on whitebox hardware
> (like pogo), vs dataonTap on netapp hardware. Any numbers ?

Barring kernel bugs or memory errors, Richard Elling's blog entry 
seems to be the best place use as a guide:

http://blogs.sun.com/relling/entry/raid_recommendations_space_vs_mttdl

It is pretty easy to build a ZFS pool with data loss probabilities (on 
paper) which are about as low as you winning the state jumbo lottery 
jackpot based on a ticket you found on the ground.  However, if you 
want to compete with an EMC system, then you will want to purchase 
hardware of similar grade.  If you purchase a cheapo system from Dell 
without ECC memory then the actual data reliability will suffer.

ZFS protects you against corruption in the data storage path.  It does 
not protect you against main memory errors or random memory overwrites 
due to a horrific kernel bug.  ZFS also does not protect against data 
loss due to user error, which remains the primary factor in data loss.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Ahmed Kamal
I guess I am mostly interested in MTDL for a zfs system on whitebox hardware
(like pogo), vs dataonTap on netapp hardware. Any numbers ?

On Tue, Sep 30, 2008 at 4:36 PM, Bob Friesenhahn <
[EMAIL PROTECTED]> wrote:

> On Tue, 30 Sep 2008, Ahmed Kamal wrote:
>
>>
>> - I still don't have my original question answered, I want to somehow
>> assess
>> the reliability of that zfs storage stack. If there's no hard data on
>> that,
>> then if any storage expert who works with lots of systems can give his
>> "impression" of the reliability compared to the big two, that would be
>> great
>>
>
> The reliability of that zfs storage stack primarily depends on the
> reliability of the hardware it runs on.  Note that there is a huge
> difference between 'reliability' and 'mean time to data loss' (MTDL). There
> is also the concern about 'availability' which is a function of how often
> the system fails, and the time to correct a failure.
>
> Bob
> ==
> Bob Friesenhahn
> [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
>
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Bob Friesenhahn
On Tue, 30 Sep 2008, Ahmed Kamal wrote:
>
> - I still don't have my original question answered, I want to somehow assess
> the reliability of that zfs storage stack. If there's no hard data on that,
> then if any storage expert who works with lots of systems can give his
> "impression" of the reliability compared to the big two, that would be great

The reliability of that zfs storage stack primarily depends on the 
reliability of the hardware it runs on.  Note that there is a huge 
difference between 'reliability' and 'mean time to data loss' (MTDL). 
There is also the concern about 'availability' which is a function of 
how often the system fails, and the time to correct a failure.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quantifying ZFS reliability

2008-09-30 Thread Richard Elling
Ahmed Kamal wrote:
> Thanks for all the answers .. Please find more questions below :)
>
> - Good to know EMC filers do not have end2end checksums! What about 
> netapp ?

If they are not at the end, they can't do end-to-end data validation.
Ideally, application writers would do this, but it is a lot of work.  ZFS
does this on behalf of applications which use ZFS.  Hence my comment
about ZFS being complementary to your storage device decision.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


  1   2   >