Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-20 Thread Richard Elling
comment below…

On Apr 18, 2013, at 5:17 AM, Edward Ned Harvey (openindiana) wrote:

>> From: Timothy Coalson [mailto:tsc...@mst.edu]
>> 
>> Did you also compare the probability of bit errors causing data loss
>> without a complete pool failure?  2-way mirrors, when one device
>> completely
>> dies, have no redundancy on that data, and the copy that remains must be
>> perfect or some data will be lost.  
> 
> I had to think about this comment for a little while to understand what you 
> were saying, but I think I got it.  I'm going to rephrase your question:
> 
> If one device in a 2-way mirror becomes unavailable, then the remaining 
> device has no redundancy.  So if a bit error is encountered on the (now 
> non-redundant) device, then it's an uncorrectable error.  Question is, did I 
> calculate that probability?
> 
> Answer is, I think so.  Modelling the probability of drive failure (either 
> complete failure or data loss) is very complex and non-linear.  Also 
> dependent on the specific model of drive in question, and the graphs are 
> typically not available.  So what I did was to start with some MTBDL graphs 
> that I assumed to be typical, and then assume every data-loss event meant 
> complete drive failure.  Already I'm simplifying the model beyond reality, 
> but the simplification focuses on worst case, and treats every bit error as 
> complete drive failure.  This is why I say "I think so," to answer your 
> question.  
> 
> Then, I didn't want to embark on a mathematician's journey of derivatives and 
> integrals over some non-linear failure rate graphs, so I linearized...  I 
> forget now (it was like 4-6 years ago) but I would have likely seen that 
> drives were unlikely to fail in the first 2 years, and about 50% likely to 
> fail after 3 years, and nearly certain to fail after 5 years, so I would have 
> likely modeled that as a linearly increasing probability of failure up to 4 
> years, where it's assumed to reach a 100% failure rate at 4 years.

This technique shows a good appreciation of the expected lifetime of components.
Some of the more sophisticated models use a Weibull distribution, which works
particularly well for computing devices. The problem for designers is that the
Weibull model parameters are not publicly published by the vendors. You need
some time in the field to collect them, so it is impractical for systems
designers.
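
For illustration, the Weibull survival curve is easy to sketch in a few lines of
Python; the shape (beta) and characteristic-life (eta) values below are invented
for the example -- which is exactly the data the vendors don't publish:

import math

def weibull_survival(t_hours, beta=1.5, eta=5.0 * 365 * 24):
    """Probability a drive is still working at time t (illustrative parameters)."""
    return math.exp(-((t_hours / eta) ** beta))

for years in (1, 2, 3, 4, 5):
    t = years * 365.0 * 24
    print("year %d: survival probability %.3f" % (years, weibull_survival(t)))

With field data you would fit beta and eta rather than guessing them.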

At the end of the day, we have two practical choices:
1. Prepare for planned obsolescence and replacement of devices when the
   expected lifetime metric is reached. The best proxy for HDD expected
   lifetime is the warranty period, and you'll often notice that enterprise
   drives have a better spec than consumer drives -- you tend to get what
   you pay for.

2. Measure your environment very carefully and take proactive action when
   the system begins to display signs of age-related wear out. This is a
   good idea in all cases, but the techniques are not widely adopted… yet.

> Yes, this modeling introduces inaccuracy, but that inaccuracy is in the 
> noise.  Maybe in the first 2 years, I'm 25% off in my estimates to the 
> positive, and after 4 years I'm 25% off in the negative, or something like 
> that.  But when the results show 10^-17 probability for one configuration and 
> 10^-19 probability for a different configuration, then the 25% error is 
> irrelevant.  It's easy to see which configuration is more probable to fail, 
> and it's also easy to see they're both well within acceptable limits for most 
> purposes (especially if you have good backups.)

For reliability measurements, this is not a bad track record. There are lots of
other environmental and historical factors that impact real life. As an analogy,
for humans, early death tends to be dominated by accidents rather than chronic
health conditions. For example, children tend to die in automobile accidents,
while octogenarians tend to die from heart attacks, organ failure, or cancer --
different failure modes as a function of age.
 -- richard

>> Also, as for time to resilver, I'm guessing that depends largely on where
>> bottlenecks are (it has to read effectively all of the remaining disks in
>> the vdev either way, but can do so in parallel, so ideally it could be the
>> same speed), 
> 
> No.  The big factor for resilver time is (a) the number of operations that 
> need to be performed, and (b) the number of operations per second.
> 
> If you have one big vdev making up a pool, then the number of operations to 
> be performed is equal to the number of objects in the pool.  The number of 
> operations per second is limited by the worst case random seek time for any 
> device in the pool.  If you have an all-SSD pool, then it's equal to a single 
> disk performance.  If you have an all-HDD pool, then with increasing number 
> of devices in your vdev, you approach 50% of the IOPS of a single device.
> 
> If

Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-18 Thread Edward Ned Harvey (openindiana)
> From: Jay Heyl [mailto:j...@frelled.us]
> 
> I now realize you're talking about 8 separate 2-disk
> mirrors organized into a pool. "mirror x1 y1 mirror x2 y2 mirror x3 y3..."

Yup.  That's normal, and the only way.


> I also realize that almost every discussion I've seen online concerning
> mirrors proposes organizing the drives in the way I was thinking about it

Hmmm...   What alternative are you thinking of?  There is no alternative.


> This also starts to make a lot more sense. Confused the hell out of me the
> first three times I read it. I'm going to have to ponder this a bit more as
> my thinking has been heavily influenced by the more conventional mirror
> arrangement.

What are you talking about?


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-18 Thread Edward Ned Harvey (openindiana)
> From: Timothy Coalson [mailto:tsc...@mst.edu]
> 
> Did you also compare the probability of bit errors causing data loss
> without a complete pool failure?  2-way mirrors, when one device
> completely
> dies, have no redundancy on that data, and the copy that remains must be
> perfect or some data will be lost.  

I had to think about this comment for a little while to understand what you 
were saying, but I think I got it.  I'm going to rephrase your question:

If one device in a 2-way mirror becomes unavailable, then the remaining device 
has no redundancy.  So if a bit error is encountered on the (now non-redundant) 
device, then it's an uncorrectable error.  Question is, did I calculate that 
probability?

Answer is, I think so.  Modelling the probability of drive failure (either 
complete failure or data loss) is very complex and non-linear.  Also dependent 
on the specific model of drive in question, and the graphs are typically not 
available.  So what I did was to start with some MTBDL graphs that I assumed to 
be typical, and then assume every data-loss event meant complete drive failure. 
 Already I'm simplifying the model beyond reality, but the simplification 
focuses on worst case, and treats every bit error as complete drive failure.  
This is why I say "I think so," to answer your question.  

Then, I didn't want to embark on a mathematician's journey of derivatives and 
integrals over some non-linear failure rate graphs, so I linearized...  I 
forget now (it was like 4-6 years ago) but I would have likely seen that drives 
were unlikely to fail in the first 2 years, and about 50% likely to fail after 
3 years, and nearly certain to fail after 5 years, so I would have likely 
modeled that as a linearly increasing probability of failure up to 4 years, 
where it's assumed to reach a 100% failure rate at 4 years.
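
As a concrete sketch of that linearization (Python; the 4-year wear-out horizon
is the assumption described above, not a measured figure):

# Linearized drive-failure model: probability of failure ramps linearly
# from 0 at t=0 to 1 at the assumed 4-year horizon.
LIFETIME_HOURS = 4 * 365 * 24

def p_fail_by(t_hours):
    """Probability a drive has failed by time t (linear ramp)."""
    return min(1.0, t_hours / LIFETIME_HOURS)

def p_fail_during(t_start, window_hours):
    """Probability of a failure inside a window, e.g. a resilver."""
    return p_fail_by(t_start + window_hours) - p_fail_by(t_start)

# e.g. chance a 3-year-old drive fails during a 12-hour resilver:
print(p_fail_during(3 * 365 * 24, 12))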

Yes, this modeling introduces inaccuracy, but that inaccuracy is in the noise.  
Maybe in the first 2 years, I'm 25% off in my estimates to the positive, and 
after 4 years I'm 25% off in the negative, or something like that.  But when 
the results show 10^-17 probability for one configuration and 10^-19 
probability for a different configuration, then the 25% error is irrelevant.  
It's easy to see which configuration is more probable to fail, and it's also 
easy to see they're both well within acceptable limits for most purposes 
(especially if you have good backups.)


> Also, as for time to resilver, I'm guessing that depends largely on where
> bottlenecks are (it has to read effectively all of the remaining disks in
> the vdev either way, but can do so in parallel, so ideally it could be the
> same speed), 

No.  The big factor for resilver time is (a) the number of operations that need 
to be performed, and (b) the number of operations per second.

If you have one big vdev making up a pool, then the number of operations to be 
performed is equal to the number of objects in the pool.  The number of 
operations per second is limited by the worst case random seek time for any 
device in the pool.  If you have an all-SSD pool, then it's equal to a single 
disk performance.  If you have an all-HDD pool, then with increasing number of 
devices in your vdev, you approach 50% of the IOPS of a single device.

If your pool is broken down into a bunch of smaller vdevs -- let's say N mirrors 
that are all 2-way -- then the number of operations to resilver the degraded 
mirror is 1/N of the total objects in the pool.  And the number of operations 
per second is equal to the performance of a single disk.  So the resilver time 
in the big raidz vdev is up to 2N times longer than the resilver time for the mirror.
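
To put rough numbers on that argument, here is a back-of-envelope sketch
(Python; the object count and IOPS values are placeholders, not measurements):

pool_objects = 100e6        # blocks/objects in the pool (placeholder)
disk_iops    = 150          # random IOPS of one HDD (placeholder)
n_mirrors    = 8            # pool made of N 2-way mirrors

# One wide raidz vdev: every object, at roughly half of one disk's IOPS.
raidz_seconds = pool_objects / (disk_iops * 0.5)

# One degraded mirror out of N: 1/N of the objects, full single-disk IOPS.
mirror_seconds = (pool_objects / n_mirrors) / disk_iops

print("raidz resilver : %5.1f days" % (raidz_seconds / 86400))
print("mirror resilver: %5.1f days" % (mirror_seconds / 86400))
print("ratio          : %.0fx (= 2N)" % (raidz_seconds / mirror_seconds))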

As you mentioned, other activity in the pool can further reduce the number of 
operations per second.  If you have N mirrors, then the probability of the 
other activity affecting the degraded mirror is 1/N.  Whereas, with a single 
big vdev, you guessed it, all other activity is guaranteed to affect the 
resilvering vdev.




Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-18 Thread Edward Ned Harvey (openindiana)
> From: Jay Heyl [mailto:j...@frelled.us]
> 
> Ah, that makes much more sense. Thanks for the clarification. Now that you
> put it that way I have to wonder how I ever came under the impression it
> was any other way.

I've gotten lost in the numerous mis-communications of this thread, but just to 
make sure there is no confusion:

If you have a mirror (or any vdev with redundancy, raidzN) and you issue a read, 
then normally only one side of the mirror gets read (not the redundant copies.) 
 If the cksum fails, then redundant copies are read successively, until a 
successful cksum is found (still, some redundant copies might not have been 
read.)

If you perform a scrub, then all copies of all information are read and 
validated.

The advantage of reading only one side of the mirror is performance.  If one 
device is busy satisfying one read request, then the other sides of the mirror 
are available to satisfy other read requests.
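
The read policy described above can be sketched in a few lines (Python; this is
an illustration of the idea only, not the actual ZFS vdev_mirror code):

import hashlib

def cksum(data):
    return hashlib.sha256(data).digest()

def mirror_read(expected_cksum, copies):
    """copies: callables returning the block's bytes from each mirror child."""
    for read_child in copies:
        data = read_child()
        if cksum(data) == expected_cksum:
            return data              # normal case: only one copy is read
        # mismatch: fall through and try the next copy
    raise IOError("all mirror copies failed checksum verification")

good = b"block contents"
bad  = b"bit-rotted junk"
# first copy corrupted, second copy intact -> the read still succeeds
print(mirror_read(cksum(good), [lambda: bad, lambda: good]))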




Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-17 Thread Jim Klimov

On 2013-04-17 21:25, Jay Heyl wrote:

It (finally) occurs to me that not all mirrors are created equal. I've been
assuming, and probably ignoring hints to the contrary, that what was being
compared here was a raid-z2 configuration with a 2-way mirror composed of
two 8-disk vdevs. I now realize you're talking about 8 separate 2-disk
mirrors organized into a pool. "mirror x1 y1 mirror x2 y2 mirror x3 y3..."



Well, to help you clarify things, there are simple mirrors made over
two identically sized storage devices (for simplicity, we won't go
into three- or four-way mirrors which are essentially the same idea
spread over more devices for higher reliability and read performance).

If you involve more pairs of devices, you can build (in traditional
RAID terminology) either raid10 - one stripe over many separate
mirrors - or raid01 - one mirror over two individual stripes (or plain
concatenations sometimes). While the performance is similar and the
difference seems semantic, there is some. Most importantly, a
single broken disk in a stripe invalidates it, and thus a whole
half of the raid01 top-level mirror becomes broken. That's why these
setups are rarely used nowadays, and ZFS didn't even bother to
implement them (unlike raid10) :)

HTH,
//Jim



Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-17 Thread Jay Heyl
On Wed, Apr 17, 2013 at 5:38 AM, Edward Ned Harvey (openindiana) <
openindi...@nedharvey.com> wrote:

> > From: Sašo Kiselkov [mailto:skiselkov...@gmail.com]
> >
> > Raid-Z indeed does stripe data across all
> > leaf vdevs (minus parity) and does so by splitting the logical block up
> > into equally sized portions.
>
> Jay, there you have it.  You asked why use mirrors, and you said you would
> use raidz2 or raidz3 unless cpu overhead is too much.  I recommended using
> mirrors and avoiding raidzN, and here is the answer why.
>
> If you have 16 disks arranged in 8x mirrors, versus 10 disks in raidz2
> which stripes across 8 disks plus 2 parity disks, then the serial write of
> each configuration is about the same; that is, 8x the sustained write speed
> of a single device.  But if you have two or more parallel sequential read
> threads, then the sequential read speed of the mirrors will be 16x while
> the raidz2 is only 8x.  The mirror configuration can do 8x random write
> while the raidz2 is only 1x.  And the mirror can do 16x random read while
> the raidz2 is only 1x.
>

It (finally) occurs to me that not all mirrors are created equal. I've been
assuming, and probably ignoring hints to the contrary, that what was being
compared here was a raid-z2 configuration with a 2-way mirror composed of
two 8-disk vdevs. I now realize you're talking about 8 separate 2-disk
mirrors organized into a pool. "mirror x1 y1 mirror x2 y2 mirror x3 y3..."
I also realize that almost every discussion I've seen online concerning
mirrors proposes organizing the drives in the way I was thinking about it
(which is probably why I was thinking that way). I suppose this is
something different that zfs brings to the table when compared to more
conventional hardware raid.


>
> In the case you care about the least, they're equal.  In the case you care
> about most, the mirror configuration is 16x faster.
>
> You also said the raidz2 will offer more protection against failure,
> because you can survive any two disk failures (but no more.)  I would argue
> this is incorrect (I've done the probability analysis before).  Mostly
> because the resilver time in the mirror configuration is 8x to 16x faster
> (there's 1/8 as much data to resilver, and IOPS is limited by a single
> disk, not the "worst" of several disks, which introduces another factor up
> to 2x, increasing the 8x as high as 16x), so the smaller resilver window
> means lower probability of "concurrent" failures on the critical vdev.
>  We're talking about 12 hours versus 1 week, actual result of my machines
> in production.  Also, while it's possible to fault the pool with only 2
> failures in the mirror configuration, the probability is against that
> happening.  The first disk failure probability is 1/16 for each disk ...
> And then if you have a 2nd concurrent failure, there's a 14/15 probability
> that it occurs on a separately independent (safe) mirror.  The 3rd
> concurrent failure 12/14 chance of being safe.  The 4th concurrent failure
> 10/13 chance of being safe.  Etc.  The mirror configuration can probably
> withstand a higher number of failures, and also the resilver window for
> each failure is smaller.  When you look at the total probability of pool
> failure, they were both like 10^-17 or something like that.  In other
> words, we're splitting hairs but as long as we are, we might as well point
> out that they're both about the same.
>

This also starts to make a lot more sense. Confused the hell out of me the
first three times I read it. I'm going to have to ponder this a bit more as
my thinking has been heavily influenced by the more conventional mirror
arrangement.


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-17 Thread Jan Owoc
On Wed, Apr 17, 2013 at 12:57 PM, Timothy Coalson  wrote:
> On Wed, Apr 17, 2013 at 7:38 AM, Edward Ned Harvey (openindiana) <
> openindi...@nedharvey.com> wrote:
>
>> You also said the raidz2 will offer more protection against failure,
>> because you can survive any two disk failures (but no more.)  I would argue
>> this is incorrect (I've done the probability analysis before).  Mostly
>> because the resilver time in the mirror configuration is 8x to 16x faster
>> (there's 1/8 as much data to resilver, and IOPS is limited by a single
>> disk, not the "worst" of several disks, which introduces another factor up
>> to 2x, increasing the 8x as high as 16x), so the smaller resilver window
>> means lower probability of "concurrent" failures on the critical vdev.
>>  We're talking about 12 hours versus 1 week, actual result of my machines
>> in production.
>>
>
> Did you also compare the probability of bit errors causing data loss
> without a complete pool failure?  2-way mirrors, when one device completely
> dies, have no redundancy on that data, and the copy that remains must be
> perfect or some data will be lost.  On the other hand, raid-z2 will still
> have available redundancy, allowing every single block to have a bad read
> on any single component disk, without losing data.  I haven't done the math
> on this, but I seem to recall some papers claiming that this is the more
> likely route to lost data on modern disks, by comparing bit error rate and
> capacity.  Of course, a second outright failure puts raid-z2 in a much
> worse boat than 2-way mirrors, which is a reason for raid-z3, but this may
> already be a less likely case.

Richard Elling wrote a blog post about "mean time to data loss" [1]. A
few years later he graphed out a few cases for typical values of
resilver times [2].

[1] https://blogs.oracle.com/relling/entry/a_story_of_two_mttdl
[2] http://blog.richardelling.com/2010/02/zfs-data-protection-comparison.html

Cheers,
Jan



Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-17 Thread Jay Heyl
On Wed, Apr 17, 2013 at 11:21 AM, Jim Klimov  wrote:

> On 2013-04-17 20:09, Jay Heyl wrote:
>
>> reply. Unless the first device to answer returns garbage (something
>>> that doesn't match the expected checksum), other copies are not read
>>> as part of this request.
>>>
>>>
>> Ah, that makes much more sense. Thanks for the clarification. Now that you
>> put it that way I have to wonder how I ever came under the impression it
>> was any other way.
>>
>
>
> Well, there are different architectures, so some might do what you
> suggested. From what I read just yesterday, RAM mirroring on some
> high-end servers works indeed like you described - by reading both
> parts and comparing the results, testing ECC if needed, etc. to
> figure out the correct memory contents or return an error if both
> parts are faulty and can't be trusted (ECC mismatch on both).
>
> Military, nuclear and space systems often are built like 3 or 5
> computers (odd amount for easier quorum) doing the same calculations
> over same inputs, and comparing the results to be sure of them or to
> redo the task.
>
> So I guess it depends on your background - why you thought this of
> disk systems ;)


Actually, I'm pretty sure I read it somewhere on the internet. My fault for
thinking every guy with a blog actually knows what he's talking about. :-)


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-17 Thread Timothy Coalson
On Wed, Apr 17, 2013 at 7:38 AM, Edward Ned Harvey (openindiana) <
openindi...@nedharvey.com> wrote:

> You also said the raidz2 will offer more protection against failure,
> because you can survive any two disk failures (but no more.)  I would argue
> this is incorrect (I've done the probability analysis before).  Mostly
> because the resilver time in the mirror configuration is 8x to 16x faster
> (there's 1/8 as much data to resilver, and IOPS is limited by a single
> disk, not the "worst" of several disks, which introduces another factor up
> to 2x, increasing the 8x as high as 16x), so the smaller resilver window
> means lower probability of "concurrent" failures on the critical vdev.
>  We're talking about 12 hours versus 1 week, actual result of my machines
> in production.
>

Did you also compare the probability of bit errors causing data loss
without a complete pool failure?  2-way mirrors, when one device completely
dies, have no redundancy on that data, and the copy that remains must be
perfect or some data will be lost.  On the other hand, raid-z2 will still
have available redundancy, allowing every single block to have a bad read
on any single component disk, without losing data.  I haven't done the math
on this, but I seem to recall some papers claiming that this is the more
likely route to lost data on modern disks, by comparing bit error rate and
capacity.  Of course, a second outright failure puts raid-z2 in a much
worse boat than 2-way mirrors, which is a reason for raid-z3, but this may
already be a less likely case.
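
For a back-of-envelope version of that comparison (Python; the 1e-15
unrecoverable-error rate and 4 TB capacity are assumed values -- substitute
the figures from your drive's datasheet):

ber      = 1e-15          # unrecoverable errors per bit read (assumed)
capacity = 4e12 * 8       # 4 TB drive, in bits (assumed)

p_clean = (1.0 - ber) ** capacity    # whole surviving disk reads back clean
print("P(at least one unrecoverable read error) = %.3f" % (1.0 - p_clean))

With a 2-way mirror that error lands on data with no remaining redundancy;
with raid-z2 the remaining parity can still reconstruct it.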

Also, as for time to resilver, I'm guessing that depends largely on where
bottlenecks are (it has to read effectively all of the remaining disks in
the vdev either way, but can do so in parallel, so ideally it could be the
same speed), and how busy the pool is with other things - how does the
bandwidth sharing during resilver work?

Tim


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-17 Thread Jim Klimov

On 2013-04-17 20:09, Jay Heyl wrote:

reply. Unless the first device to answer returns garbage (something
that doesn't match the expected checksum), other copies are not read
as part of this request.



Ah, that makes much more sense. Thanks for the clarification. Now that you
put it that way I have to wonder how I ever came under the impression it
was any other way.



Well, there are different architectures, so some might do what you
suggested. From what I read just yesterday, RAM mirroring on some
high-end servers works indeed like you described - by reading both
parts and comparing the results, testing ECC if needed, etc. to
figure out the correct memory contents or return an error if both
parts are faulty and can't be trusted (ECC mismatch on both).

Military, nuclear and space systems often are built like 3 or 5
computers (odd amount for easier quorum) doing the same calculations
over same inputs, and comparing the results to be sure of them or to
redo the task.

So I guess it depends on your background - why you thought this of
disk systems ;)

//Jim




Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-17 Thread Jay Heyl
On Tue, Apr 16, 2013 at 5:49 PM, Jim Klimov  wrote:

> On 2013-04-17 02:10, Jay Heyl wrote:
>
>> Not to get into bickering about semantics, but I asked, "Or am I wrong
>> about reads being issued in parallel to all the mirrors in the array?", to
>> which you replied, "Yes, in normal case... this assumption is wrong... but
>> reads should be in parallel." (Ellipses intended for clarity, not argument
>> munging.) If reads are in parallel, then it seems as though my assumption
>> is correct. I realize the system will discard data from all but the first
>> reads and that using only the first response can improve performance, but
>> in terms of number of IOPs, which is where I intended to go with this, it
>> seems to me the mirrored system will have at least as many if not more
>> than
>> the raid-zn system.
>>
>> Or have I completely misunderstood what you intended to say?
>>
>
> Um, right... I got torn between several letters and forgot the details
> of one. So, here's what I replied to with poor wording - *I thought you
> meant* "A single read request from a program would be redirected as a
> series of parallel requests to mirror components asking for the same
> data, whichever one answers first" - this is no, the "wrong" in my
> reply. Unless the first device to answer returns garbage (something
> that doesn't match the expected checksum), other copies are not read
> as part of this request.
>

Ah, that makes much more sense. Thanks for the clarification. Now that you
put it that way I have to wonder how I ever came under the impression it
was any other way.


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-17 Thread Edward Ned Harvey (openindiana)
> From: Sašo Kiselkov [mailto:skiselkov...@gmail.com]
> 
> Raid-Z indeed does stripe data across all
> leaf vdevs (minus parity) and does so by splitting the logical block up
> into equally sized portions. 

Jay, there you have it.  You asked why use mirrors, and you said you would use 
raidz2 or raidz3 unless cpu overhead is too much.  I recommended using mirrors 
and avoiding raidzN, and here is the answer why.

If you have 16 disks arranged in 8x mirrors, versus 10 disks in raidz2 which 
stripes across 8 disks plus 2 parity disks, then the serial write of each 
configuration is about the same; that is, 8x the sustained write speed of a 
single device.  But if you have two or more parallel sequential read threads, 
then the sequential read speed of the mirrors will be 16x while the raidz2 is 
only 8x.  The mirror configuration can do 8x random write while the raidz2 is 
only 1x.  And the mirror can do 16x random read while the raidz2 is only 1x.

In the case you care about the least, they're equal.  In the case you care 
about most, the mirror configuration is 16x faster.

You also said the raidz2 will offer more protection against failure, because 
you can survive any two disk failures (but no more.)  I would argue this is 
incorrect (I've done the probability analysis before).  Mostly because the 
resilver time in the mirror configuration is 8x to 16x faster (there's 1/8 as 
much data to resilver, and IOPS is limited by a single disk, not the "worst" of 
several disks, which introduces another factor up to 2x, increasing the 8x as 
high as 16x), so the smaller resilver window means lower probability of 
"concurrent" failures on the critical vdev.  We're talking about 12 hours 
versus 1 week, actual result of my machines in production.  Also, while it's 
possible to fault the pool with only 2 failures in the mirror configuration, 
the probability is against that happening.  The first disk failure probability 
is 1/16 for each disk ... And then if you have a 2nd concurrent failure, 
there's a 14/15 probability that it occurs on a separate, independent (safe) 
mirror.  The 3rd concurrent failure has a 12/14 chance of being safe, the 4th 
a 10/13 chance, and so on.  The mirror configuration 
can probably withstand a higher number of failures, and also the resilver 
window for each failure is smaller.  When you look at the total probability of 
pool failure, they were both like 10^-17 or something like that.  In other 
words, we're splitting hairs but as long as we are, we might as well point out 
that they're both about the same.
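
That chain of fractions can be computed directly; a small sketch (Python,
purely combinatorial, no assumptions beyond the 8x 2-way-mirror layout):

def p_no_mirror_lost(failed, mirrors=8):
    """Chance that 'failed' concurrent disk failures never hit both halves
    of the same mirror -- the 14/15, 12/14, 10/13 ... chain above."""
    disks = 2 * mirrors
    p = 1.0
    for i in range(1, failed):
        safe      = 2 * (mirrors - i)    # disks whose partner still works
        remaining = disks - i            # disks left that could fail next
        p *= safe / remaining
    return p

for k in range(2, 6):
    print("%d concurrent failures: %.2f chance the pool survives"
          % (k, p_no_mirror_lost(k)))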




Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-17 Thread Sašo Kiselkov
On 04/17/2013 02:08 AM, Edward Ned Harvey (openindiana) wrote:
>> From: Sašo Kiselkov [mailto:skiselkov...@gmail.com]
>>
>> If you are IOPS constrained, then yes, raid-zn will be slower, simply
>> because any read needs to hit all data drives in the stripe. 
> 
> Saso, I would expect you to know the answer to this question, probably:
> I have heard that raidz is more similar to raid-1e than raid-5.
> Meaning, when you write data to raidz, it doesn't get striped across
> all devices in the raidz vdev...  Rather, two copies of the data get
> written to any of the available devices in the raidz. Can you confirm?

No, this is not what happens. Raid-Z indeed does stripe data across all
leaf vdevs (minus parity) and does so by splitting the logical block up
into equally sized portions. A block *can* take up less than the full
stripe width for very small blocks or for very wide stripes, both of
which should be a rare occurrence. 4k sectored devices change this
calculation quite dramatically, which is why I wouldn't recommend using
them in pools unless you understand how your workload and raidz geometry
will interact and take note of it. See vdev_raidz_map_alloc here:

https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/fs/zfs/vdev_raidz.c#L434-L554

for all the fine details (the above description is grossly oversimplified).
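
A heavily simplified sketch of that split (Python; it only shows the
divide-and-sector-align idea, not the parity columns or skip sectors that
vdev_raidz_map_alloc actually lays out):

def column_sectors(block_bytes, data_drives, sector=512):
    """Sectors each data column occupies for one logical block."""
    per_drive = -(-block_bytes // data_drives)   # ceiling division
    return -(-per_drive // sector)

for drives in (4, 8, 10):
    print("128K block over %2d data drives: %2d sectors per column"
          % (drives, column_sectors(128 * 1024, drives)))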

> If the behavior is to stripe across all the devices in the raidz,
> then the raidz iops really can't exceed that of a single device,
> because you have to wait for every device to respond before you
> have a complete block of data.  But if it's more like raid-1e and
> individual devices can read independently of each other, then at
> least theoretically, the raidz with n-devices in it could return
> iops performance on-par with n-times a single disk. 

As a general rule of thumb, raidz has the IOPS of a single drive. This
is not exactly news:
https://blogs.oracle.com/roch/entry/when_to_and_not_to

Cheers,
--
Saso



Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Jesus Cea

On 17/04/13 02:10, Jay Heyl wrote:
> Not to get into bickering about semantics, but I asked, "Or am I
> wrong about reads being issued in parallel to all the mirrors in
> the array?"

Each read is issued only to a (let's say, "random") disk in the mirror,
unless the read is faulty.

You can check this easily with "zpool iostat -v".

- -- 
Jesús Cea Avión _/_/  _/_/_/_/_/_/
j...@jcea.es - http://www.jcea.es/ _/_/_/_/  _/_/_/_/  _/_/
Twitter: @jcea_/_/_/_/  _/_/_/_/_/
jabber / xmpp:j...@jabber.org  _/_/  _/_/_/_/  _/_/  _/_/
"Things are not so easy"  _/_/  _/_/_/_/  _/_/_/_/  _/_/
"My name is Dump, Core Dump"   _/_/_/_/_/_/  _/_/  _/_/
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz



Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Jim Klimov

On 2013-04-17 02:10, Jay Heyl wrote:

Not to get into bickering about semantics, but I asked, "Or am I wrong
about reads being issued in parallel to all the mirrors in the array?", to
which you replied, "Yes, in normal case... this assumption is wrong... but
reads should be in parallel." (Ellipses intended for clarity, not argument
munging.) If reads are in parallel, then it seems as though my assumption
is correct. I realize the system will discard data from all but the first
reads and that using only the first response can improve performance, but
in terms of number of IOPs, which is where I intended to go with this, it
seems to me the mirrored system will have at least as many if not more than
the raid-zn system.

Or have I completely misunderstood what you intended to say?


Um, right... I got torn between several letters and forgot the details
of one. So, here's what I replied to with poor wording - *I thought you
meant* "A single read request from a program would be redirected as a
series of parallel requests to mirror components asking for the same
data, whichever one answers first" - this is no, the "wrong" in my
reply. Unless the first device to answer returns garbage (something
that doesn't match the expected checksum), other copies are not read
as part of this request.

Now, if there are many requests on the system issued simultaneously,
which is most often the case, then reads from different requests are
directed to different disks, but again - one read goes to one disk
except in pathological cases. It is likely that the system selects a
disk to read from based, in part, on its expectation of where the
disk head is (i.e. last requested LBA is nearest to the LBA we want
now) in order to minimize latency and unproductive time losses.
Thus "sequential reads" where requests for nearby sectors come in
a succession are likely to be satisfied by a single disk in the
mirror, leaving other disks available to satisfy other reads.

Copies of a write request however are sent to all disks and committed
(flushed) before the synchronous request is accepted as completed
(for example, a write-and-commit of a TXG transaction group).

Hope this makes my point clearer, it is late here ;)
//Jim




Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Edward Ned Harvey (openindiana)
> From: Mehmet Erol Sanliturk [mailto:m.e.sanlit...@gmail.com]
> 
> SSD units are very vulnerable to power cuts during work up to complete
> failure which they can not be used any more to complete loss of data .

If there are any junky drives out there that fail so dramatically, those are 
junky and the exception.  Just imagine how foolish the engineers would have to 
be, "Power loss?  I didn't think of that...  Complete drive failure in power 
loss is acceptable behavior."   Definitely an inaccurate generalization about 
SSD's.  There is nothing inherent about flash memory as compared to magnetic 
material, that would cause such a thing.

I repeat:  I'm not saying there's no such thing as a SSD that has such a 
problem.  I'm saying if there is, it's junk.  And you can safely assume any 
good drive doesn't have that problem.


> MLC ( Multi-Level Cell ) SSD units have a short life time if they are
> continuously written ( they are more suitable to write once ( in a limited
> number of writes sense ) - read many )  .

It's a fact that NAND has a finite number of write cycles, and it gets slower 
to write, the more times it's been re-written.  It is also a fact that when 
SSD's were first introduced to the commodity market about 11 years ago, that 
they failed quickly due to OSes (windows) continually writing the same sectors 
over and over.  But manufacturers have been long since aware of this problem, 
and solved it by overprovisioning and wear-leveling.

Similar to ZFS copy-on-write, which has the ability to logically address some 
blocks and secretly re-map them to different sectors behind the scenes...  
SSD's with wear-leveling secretly remap sectors during writes.


> SSD units may fail due to write wearing in an unexpected time , making them
> very unreliable for mission critical works .

Every page has a write counter, which is used to predict failure.  A very 
predictable, and very much *not* unexpected time.
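
As a rough illustration of how predictable wear-out is (Python; the endurance
rating, write rate, and write amplification below are invented example numbers):

# Crude SSD wear estimate from an endurance (TBW) rating and a measured
# write rate. All three inputs are made-up examples; substitute your own.
tbw_rating          = 300.0    # terabytes written, from the vendor spec
writes_per_day      = 0.2      # TB/day actually written by the workload
write_amplification = 2.0      # assumed controller write amplification

years = tbw_rating / (writes_per_day * write_amplification * 365)
print("estimated wear-out horizon: %.1f years" % years)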




Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Richard Elling
For the context of ZPL, easy answer below :-) ...

On Apr 16, 2013, at 4:12 PM, Timothy Coalson  wrote:

> On Tue, Apr 16, 2013 at 6:01 PM, Jim Klimov  wrote:
> 
>> On 2013-04-16 23:56, Jay Heyl wrote:
>> 
>>> result in more devices being hit for both read and write. Or am I wrong
>>> about reads being issued in parallel to all the mirrors in the array?
>>> 
>> 
>> Yes, in normal case (not scrubbing which makes a point of reading
>> everything) this assumption is wrong. Writes do hit all devices
>> (mirror halves or raid disks), but reads should be in parallel.
>> For mechanical HDDs this allows to double average read speeds
>> (or triple for 3-way mirrors, etc.) because different spindles
>> begin using their heads in shorter strokes around different areas,
>> if there are enough concurrent randomly placed reads.
>> 
> 
> There is another part to his question, specifically whether a single random
> read that falls within one block of the file hits more than one top level
> vdev -

No.

> to put it another way, whether a single block of a file is striped
> across top level vdevs.  I believe every block is allocated from one and
> only one vdev (blocks with ditto copies allocate multiple blocks, ideally
> from different vdevs, but this is not the same thing), such that every read
> that hits only one file block goes to only one top level vdev unless
> something goes wrong badly enough to need a ditto copy.

Correct.
 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422





Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Edward Ned Harvey (openindiana)
> From: Jay Heyl [mailto:j...@frelled.us]
> 
> > So I'm just assuming you're going to build a pool out of SSD's, mirrored,
> > perhaps even 3-way mirrors.  No cache/log devices.  All the ram you can fit
> > into the system.
> 
> What would be the logic behind mirrored SSD arrays? With spinning platters
> the mirrors improve performance by allowing the fastest of the mirrors to
> respond to a particular command to be the one that defines throughput.

When you read from a mirror, ZFS doesn't read the same data from both sides of 
the mirror simultaneously and let them race, wasting bus & memory bandwidth to 
attempt gaining smaller latency.  If you have a single thread doing serial 
reads, I also have no cause to believe that zfs reads stripes from multiple 
sides of the mirror to accelerate - rather, it relies on the striping across 
multiple mirrors or vdev's.

But if you have multiple threads requesting independent random read operations 
that are on the same mirror, I have measured that you get very nearly n times 
the random read performance of a single disk by using an n-way mirror and at 
least n or 2n independent random read threads.


> There is no
> latency due to head movement or waiting for the proper spot on the disc to
> rotate under the heads. 

Nothing, including ZFS, has such an in-depth knowledge of the inner drive 
geometry as to know how long is necessary for the rotational latency to come 
around.  Also, rotational latency is almost nothing compared to head seek.  For 
this reason, short-stroking makes a big difference, when you have a data usage 
pattern that can easily be confined to a small number of adjacent tracks.  I 
believe, if you use a HDD for log device, it's aware of itself and does 
short-stroking, but I don't actually know.  Also, this is really a completely 
separate subject.




Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Jay Heyl
On Tue, Apr 16, 2013 at 4:01 PM, Jim Klimov  wrote:

> On 2013-04-16 23:56, Jay Heyl wrote:
>
>> result in more devices being hit for both read and write. Or am I wrong
>> about reads being issued in parallel to all the mirrors in the array?
>>
>
> Yes, in normal case (not scrubbing which makes a point of reading
> everything) this assumption is wrong. Writes do hit all devices
> (mirror halves or raid disks), but reads should be in parallel.
> For mechanical HDDs this allows to double average read speeds
> (or triple for 3-way mirrors, etc.) because different spindles
> begin using their heads in shorter strokes around different areas,
> if there are enough concurrent randomly placed reads.


Not to get into bickering about semantics, but I asked, "Or am I wrong
about reads being issued in parallel to all the mirrors in the array?", to
which you replied, "Yes, in normal case... this assumption is wrong... but
reads should be in parallel." (Ellipses intended for clarity, not argument
munging.) If reads are in parallel, then it seems as though my assumption
is correct. I realize the system will discard data from all but the first
reads and that using only the first response can improve performance, but
in terms of number of IOPs, which is where I intended to go with this, it
seems to me the mirrored system will have at least as many if not more than
the raid-zn system.

Or have I completely misunderstood what you intended to say?


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Edward Ned Harvey (openindiana)
> From: Sašo Kiselkov [mailto:skiselkov...@gmail.com]
> 
> If you are IOPS constrained, then yes, raid-zn will be slower, simply
> because any read needs to hit all data drives in the stripe. 

Saso, I would expect you to know the answer to this question, probably:
I have heard that raidz is more similar to raid-1e than raid-5.  Meaning, when 
you write data to raidz, it doesn't get striped across all devices in the raidz 
vdev...  Rather, two copies of the data get written to any of the available 
devices in the raidz.  Can you confirm?

If the behavior is to stripe across all the devices in the raidz, then the 
raidz iops really can't exceed that of a single device, because you have to 
wait for every device to respond before you have a complete block of data.  But 
if it's more like raid-1e and individual devices can read independently of each 
other, then at least theoretically, the raidz with n-devices in it could return 
iops performance on-par with n-times a single disk. 




Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Sašo Kiselkov
On 04/17/2013 12:08 AM, Richard Elling wrote:
> clarification below...
> 
> On Apr 16, 2013, at 2:44 PM, Sašo Kiselkov  wrote:
> 
>> On 04/16/2013 11:37 PM, Timothy Coalson wrote:
>>> On Tue, Apr 16, 2013 at 4:29 PM, Sašo Kiselkov 
>>> wrote:
>>>
 If you are IOPS constrained, then yes, raid-zn will be slower, simply
 because any read needs to hit all data drives in the stripe. This is
 even worse on writes if the raidz has bad geometry (number of data
 drives isn't a power of 2).

>>>
>>> Off topic slightly, but I have always wondered at this - what exactly
>>> causes non-power of 2 plus number of parities geometries to be slower, and
>>> by how much?  I tested for this effect with some consumer drives, comparing
>>> 8+2 and 10+2, and didn't see much of a penalty (though the only random test
>>> I did was read, our workload is highly sequential so it wasn't important).
> 
> This makes sense, even for more random workloads.
> 
>>
>> Because a non-power-of-2 number of drives causes a read-modify-write
>> sequence on every (almost) write. HDDs are block devices and they can
>> only ever write in increments of their sector size (512 bytes or
>> nowadays often 4096 bytes). Using your example above, you divide a 128k
>> block by 8, you get 8x16k updates - all nicely aligned on 512 byte
>> boundaries, so your drives can write that in one go. If you divide by
>> 10, you get an ugly 12.8k, which means if your drives are of the
>> 512-byte sector variety, they write 24x 512 sectors and then for the
>> last partial sector write, they first need to fetch the sector from the
>> patter, modify if in memory and then write it out again.
> 
> This is true for RAID-5/6, but it is not true for ZFS or raidz. Though it has 
> been
> a few years, I did a bunch of tests and found no correlation between the 
> number
> of disks in the set (within boundaries as described in the man page) and 
> random
> performance for raidz. This is not the case for RAID-5/6 where pathologically
> bad performance is easy to create if you know the number of disks and stripe 
> width.
>  -- richard

You are right, and I think I already know where I went wrong, though
I'll need to check raidz_map_alloc to confirm. If memory serves me
right, raidz actually splits the I/O up so that each stripe component is
simply length-aligned and padded out to complete a full sector
(otherwise the zio_vdev_child_io would fail in a block-alignment
assertion in zio_create here:

zio_create(zio_t *pio, spa_t *spa,...
{
  ..
  ASSERT(P2PHASE(size, SPA_MINBLOCKSIZE) == 0);
  ..

I was probably misremembering the power-of-2 rule from a discussion
about 4k sector drives. There the amount of wasted space can be
significant, especially on small-block data, e.g. the default 8k
volblocksize not being able to scale beyond 2 data drives + parity.
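
To see that wasted-space effect on small blocks, here is a sketch loosely
following the raidz asize calculation (Python; simplified, and the 4K-sector
raidz2 geometries are just examples):

def raidz_alloc_sectors(psize, cols, nparity, sector=4096):
    """Sectors allocated for one block: data + parity + skip padding (simplified)."""
    data = -(-psize // sector)                         # data sectors
    parity = nparity * (-(-data // (cols - nparity)))  # parity per stripe row
    total = data + parity
    total += (-total) % (nparity + 1)                  # pad to nparity + 1
    return total

for cols in (4, 6, 10):
    s = raidz_alloc_sectors(8 * 1024, cols, nparity=2)
    print("8K block on %2d-wide raidz2 (ashift=12): %2d sectors = %d KiB"
          % (cols, s, s * 4))

The allocation never improves past 2 data + 2 parity sectors (plus padding) no
matter how wide the vdev, which is the scaling limit mentioned above.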

Cheers,
--
Saso



Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Jim Klimov

On 2013-04-17 01:12, Timothy Coalson wrote:

On Tue, Apr 16, 2013 at 6:01 PM, Jim Klimov  wrote:


On 2013-04-16 23:56, Jay Heyl wrote:


result in more devices being hit for both read and write. Or am I wrong
about reads being issued in parallel to all the mirrors in the array?



Yes, in normal case (not scrubbing which makes a point of reading
everything) this assumption is wrong. Writes do hit all devices
(mirror halves or raid disks), but reads should be in parallel.
For mechanical HDDs this allows to double average read speeds
(or triple for 3-way mirrors, etc.) because different spindles
begin using their heads in shorter strokes around different areas,
if there are enough concurrent randomly placed reads.



There is another part to his question, specifically whether a single random
read that falls within one block of the file hits more than one top level
vdev - to put it another way, whether a single block of a file is striped
across top level vdevs.  I believe every block is allocated from one and
only one vdev (blocks with ditto copies allocate multiple blocks, ideally
from different vdevs, but this is not the same thing), such that every read
that hits only one file block goes to only one top level vdev unless
something goes wrong badly enough to need a ditto copy.


I believe so too... I think, striping over top-level vdevs is subject
to many tuning and algorithmic influences, and is not as simple as
even-odd IOs. Also IIRC there is some size of data (several MBytes)
that is preferably sent as sequential IO to one TLVDEV, then it is
striped over to another, in order to better utilize the strengths of
faster sequential IO vs. lags of seeking.

But I may be way off-track here with this "belief" ;)

Jim




Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Jim Klimov

On 2013-04-16 23:37, Timothy Coalson wrote:

On Tue, Apr 16, 2013 at 4:29 PM, Sašo Kiselkov wrote:


If you are IOPS constrained, then yes, raid-zn will be slower, simply
because any read needs to hit all data drives in the stripe. This is
even worse on writes if the raidz has bad geometry (number of data
drives isn't a power of 2).



Off topic slightly, but I have always wondered at this - what exactly
causes non-power of 2 plus number of parities geometries to be slower, and
by how much?  I tested for this effect with some consumer drives, comparing
8+2 and 10+2, and didn't see much of a penalty (though the only random test
I did was read, our workload is highly sequential so it wasn't important).



My take on this is not that these geometries are slower, but that
they may be less efficient in terms of storage overhead.

Say, you write a 16-sector block of userdata to your arrays.
In case of 8+2 that would be two full stripes of parity and data.
In case of 9+2 that would be a 9+2 and a 7+2 stripe. Access to
this data is less balanced, placing more load on some disks which
have 2 sectors of this block, and less load on others which have
only one sector. It seems more "sad" when (i.e. due to compression)
you have 1 or 2 userdata sectors remaining on a second stripe, but
must still provide the 2 or 3 sectors of redundancy for this mini
stripe.

Also, as I found, ZFS raidzN takes precautions not to leave some
potentially unusable holes (i.e. 1 or 2 free sectors, where you
can't fit parity and data), so it would allocate full stripes
when you have sufficiently unlucky stripe lengths just a few
sectors shorter than full (i.e. 7+2 above would likely be allocated
as 9+2 with zeroed-out extra sectors)... These things do add up to
gigabytes, though they can happen on power-of-two sized arrays with
compression just as easily (I found this with a 6-disk raidz2, with
4 data disks).
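
For the 16-sector example, the sector accounting is easy to script (Python;
a simplified per-row model of the description above):

def layout(data_sectors, data_disks, nparity=2):
    """Stripe rows needed, parity sectors, and data sectors per data disk."""
    rows = -(-data_sectors // data_disks)          # ceiling division
    per_disk = [0] * data_disks
    for i in range(data_sectors):
        per_disk[i % data_disks] += 1
    return rows, rows * nparity, per_disk

for d in (8, 9):
    rows, parity, per_disk = layout(16, d)
    print("%d+2: %d rows, %d parity sectors, data per disk %s"
          % (d, rows, parity, per_disk))

The totals come out the same, but the 9+2 layout puts two sectors on seven of
the disks and only one on the other two, which is the balance point above.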

Jim




Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Timothy Coalson
On Tue, Apr 16, 2013 at 6:01 PM, Jim Klimov  wrote:

> On 2013-04-16 23:56, Jay Heyl wrote:
>
>> result in more devices being hit for both read and write. Or am I wrong
>> about reads being issued in parallel to all the mirrors in the array?
>>
>
> Yes, in normal case (not scrubbing which makes a point of reading
> everything) this assumption is wrong. Writes do hit all devices
> (mirror halves or raid disks), but reads should be in parallel.
> For mechanical HDDs this allows to double average read speeds
> (or triple for 3-way mirrors, etc.) because different spindles
> begin using their heads in shorter strokes around different areas,
> if there are enough concurrent randomly placed reads.
>

There is another part to his question, specifically whether a single random
read that falls within one block of the file hits more than one top level
vdev - to put it another way, whether a single block of a file is striped
across top level vdevs.  I believe every block is allocated from one and
only one vdev (blocks with ditto copies allocate multiple blocks, ideally
from different vdevs, but this is not the same thing), such that every read
that hits only one file block goes to only one top level vdev unless
something goes wrong badly enough to need a ditto copy.

Tim


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Jim Klimov

On 2013-04-16 23:56, Jay Heyl wrote:

result in more devices being hit for both read and write. Or am I wrong
about reads being issued in parallel to all the mirrors in the array?


Yes, in normal case (not scrubbing which makes a point of reading
everything) this assumption is wrong. Writes do hit all devices
(mirror halves or raid disks), but reads should be in parallel.
For mechanical HDDs this allows to double average read speeds
(or triple for 3-way mirrors, etc.) because different spindles
begin using their heads in shorter strokes around different areas,
if there are enough concurrent randomly placed reads.

Due to ZFS data structure, you know your target block's expected
checksum before you read its data (from any mirror half, or from
data disks in a raidzn); then ZFS calculates the checksum of the
data it has read and combined, and only if there is a mismatch, it
has to read from other disks for redundancy (and fix the detected
broken part).

HTH,
//Jim




Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Bob Friesenhahn

On Tue, 16 Apr 2013, Sašo Kiselkov wrote:


SATA and SAS are dedicated point-to-point interfaces so there is no
additive bottleneck with more drives as long as the devices are directly
connected.


Not true. Modern flash storage is quite capable of saturating a 6 Gbps
SATA link. SAS has an advantage here, being dual-port natively with
active-active load balancing deployed as standard practice. Also please
note that SATA is half-duplex, whereas SAS is full-duplex.


You did not describe how my statement about not being "additive" is 
wrong.  This is different than per-drive bandwidth being insufficient 
for latest SSDs.  Please expound on "Not true".


SAS/SATA are not like old parallel SCSI.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Richard Elling
clarification below...

On Apr 16, 2013, at 2:44 PM, Sašo Kiselkov  wrote:

> On 04/16/2013 11:37 PM, Timothy Coalson wrote:
>> On Tue, Apr 16, 2013 at 4:29 PM, Sašo Kiselkov wrote:
>> 
>>> If you are IOPS constrained, then yes, raid-zn will be slower, simply
>>> because any read needs to hit all data drives in the stripe. This is
>>> even worse on writes if the raidz has bad geometry (number of data
>>> drives isn't a power of 2).
>>> 
>> 
>> Off topic slightly, but I have always wondered at this - what exactly
>> causes non-power of 2 plus number of parities geometries to be slower, and
>> by how much?  I tested for this effect with some consumer drives, comparing
>> 8+2 and 10+2, and didn't see much of a penalty (though the only random test
>> I did was read, our workload is highly sequential so it wasn't important).

This makes sense, even for more random workloads.

> 
> Because a non-power-of-2 number of drives causes a read-modify-write
> sequence on every (almost) write. HDDs are block devices and they can
> only ever write in increments of their sector size (512 bytes or
> nowadays often 4096 bytes). Using your example above, you divide a 128k
> block by 8, you get 8x16k updates - all nicely aligned on 512 byte
> boundaries, so your drives can write that in one go. If you divide by
> 10, you get an ugly 12.8k, which means if your drives are of the
> 512-byte sector variety, they write 24x 512 sectors and then for the
> last partial sector write, they first need to fetch the sector from the
> patter, modify if in memory and then write it out again.

This is true for RAID-5/6, but it is not true for ZFS or raidz. Though it has 
been
a few years, I did a bunch of tests and found no correlation between the number
of disks in the set (within boundaries as described in the man page) and random
performance for raidz. This is not the case for RAID-5/6 where pathologically
bad performance is easy to create if you know the number of disks and stripe 
width.
 -- richard

> 
> I said "almost" every write is affected, but this largely depends on
> your workload. If your writes are large async writes, then this RMW
> cycle only happens at the end of the transaction commit (simplifying a
> bit, but you get the idea), which is pretty small. However, if you are
> doing many small updates in different locations (e.g. writing the ZIL),
> this can significantly amplify the load.
> 
> Cheers,
> --
> Saso
> 
> ___
> OpenIndiana-discuss mailing list
> OpenIndiana-discuss@openindiana.org
> http://openindiana.org/mailman/listinfo/openindiana-discuss

--

richard.ell...@richardelling.com
+1-760-896-4422



___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread alka
ZFS data blocks are also a power of two, which means that if you have
1, 2, 4, 8, 16, 32, ... data disks, every write is spread evenly over all disks.

If you add one disk, for example going from 8 to 9 data disks, one disk is
left out of any given read/write.
Does that mean 9 data disks are slower than 8 disks?

No, 9 disks are faster, maybe not 1/9 faster, but faster.
So think of this more as a myth.

Add the RAID redundancy disks to that count: for example with 8 data disks,
add one disk for Z1 (9), 2 disks for Z2 (10) and 3 disks for Z3 (11) disks
per vdev.




Am 16.04.2013 um 23:37 schrieb Timothy Coalson:

> On Tue, Apr 16, 2013 at 4:29 PM, Sašo Kiselkov wrote:
> 
>> If you are IOPS constrained, then yes, raid-zn will be slower, simply
>> because any read needs to hit all data drives in the stripe. This is
>> even worse on writes if the raidz has bad geometry (number of data
>> drives isn't a power of 2).
>> 
> 
> Off topic slightly, but I have always wondered at this - what exactly
> causes non-power of 2 plus number of parities geometries to be slower, and
> by how much?  I tested for this effect with some consumer drives, comparing
> 8+2 and 10+2, and didn't see much of a penalty (though the only random test
> I did was read, our workload is highly sequential so it wasn't important).
> 
> Tim
> ___
> OpenIndiana-discuss mailing list
> OpenIndiana-discuss@openindiana.org
> http://openindiana.org/mailman/listinfo/openindiana-discuss

--


___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Timothy Coalson
On Tue, Apr 16, 2013 at 4:44 PM, Sašo Kiselkov wrote:

> On 04/16/2013 11:37 PM, Timothy Coalson wrote:
> > On Tue, Apr 16, 2013 at 4:29 PM, Sašo Kiselkov  >wrote:
> >
> >> If you are IOPS constrained, then yes, raid-zn will be slower, simply
> >> because any read needs to hit all data drives in the stripe. This is
> >> even worse on writes if the raidz has bad geometry (number of data
> >> drives isn't a power of 2).
> >>
> >
> > Off topic slightly, but I have always wondered at this - what exactly
> > causes non-power of 2 plus number of parities geometries to be slower,
> and
> > by how much?  I tested for this effect with some consumer drives,
> comparing
> > 8+2 and 10+2, and didn't see much of a penalty (though the only random
> test
> > I did was read, our workload is highly sequential so it wasn't
> important).
>
> Because a non-power-of-2 number of drives causes a read-modify-write
> sequence on every (almost) write. HDDs are block devices and they can
> only ever write in increments of their sector size (512 bytes or
> nowadays often 4096 bytes). Using your example above, you divide a 128k
> block by 8, you get 8x16k updates - all nicely aligned on 512 byte
> boundaries, so your drives can write that in one go. If you divide by
> 10, you get an ugly 12.8k, which means if your drives are of the
> 512-byte sector variety, they write 25x 512 sectors and then for the
> last partial sector write, they first need to fetch the sector from the
> platter, modify it in memory and then write it out again.
>
> I said "almost" every write is affected, but this largely depends on
> your workload. If your writes are large async writes, then this RMW
> cycle only happens at the end of the transaction commit (simplifying a
> bit, but you get the idea), which is pretty small. However, if you are
> doing many small updates in different locations (e.g. writing the ZIL),
> this can significantly amplify the load.


Okay, I get the carryover of partial stripe causing problems, that makes
sense, and at least has implications on space efficiency given that ZFS
mainly uses power of 2 block sizes.  However, I was under the
impression that ZFS never uses partial sectors, and that instead it uses fewer
devices in the final stripe, i.e., it would be split 10+2, 10+2...6+2.  If
what you say is true, I'm not sure how ZFS both manages to address halfway
through a sector (if it must keep that old partial sector, it must be used
somewhere, yes?), and yet has problems with changing sector sizes (the
infamous ashift).  Are you perhaps thinking of block device style software
raid, where you need to ensure that even non-useful bits have correct
parity computed?
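
For what it's worth, here is a toy model of that allocation in Python (whole
sectors only, a shorter final row, parity per row; a simplification for
illustration, not the real allocator, and it ignores raidz's extra padding
rules):

import math

def raidz_sectors(block_bytes, ndisks, nparity, ashift=12):
    # Toy model: whole data sectors round-robined across the data columns,
    # plus nparity parity sectors per row of columns actually used.
    sector = 1 << ashift
    data_sectors = math.ceil(block_bytes / float(sector))
    data_cols = ndisks - nparity
    rows = math.ceil(data_sectors / float(data_cols))
    return int(data_sectors + rows * nparity)   # total sectors allocated

# 32K blocks (the size the OP mentioned) on a 12-disk raidz2 with 4K sectors:
print(raidz_sectors(32 * 1024, 12, 2))    # 8 data + 2 parity = 10 sectors
# 128K blocks on the same vdev need 4 rows of data, so 4 sets of parity:
print(raidz_sectors(128 * 1024, 12, 2))   # 32 data + 8 parity = 40 sectors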

Tim
___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Jay Heyl
On Tue, Apr 16, 2013 at 2:25 PM, Timothy Coalson  wrote:

> On Tue, Apr 16, 2013 at 3:48 PM, Jay Heyl  wrote:
>
> > My question about the rationale behind the suggestion of mirrored SSD
> > arrays was really meant to be more in relation to the question from the
> OP.
> > I don't see how mirrored arrays of SSDs would be effective in his
> > situation.
> >
>
> There is another detail here to keep in mind: ZFS checks checksums on every
> read from storage, and with raid-zn used with block sizes that give it more
> capacity than mirroring (that is, data blocks are large enough that they
> get split across multiple data sectors and therefore devices, instead of
> degenerate single data sector plus parity sector(s) - OP mentioned 32K
> blocks, so they should get split), this means each random filesystem read
> that isn't cached hits a large number of devices in a raid-zn vdev, but
> only one device in a mirror vdev (unless ZFS splits these reads across
> mirrors, but even then it is still fewer devices hit).  If you are limited
> by IOPS of the devices, then this could make raid-zn slower.
>

I'm getting a sense of comparing apples to oranges here, but I do see your
point about the raid-zn always requiring reads from more devices due to the
parity. OTOH, it was my impression that read operations on n-way mirrors
are always issued to each of the 'n' mirrors. Just for the sake of
argument, let's say we need room for 1TB of storage. For raid-z2 we use
4x500GB devices. For the mirrored setup we have two mirrors each with
2x500GB devices. Reads to the raid-z2 system will hit four devices. If my
assumption is correct, reads to the mirrored system will also hit four
devices. If we go to a 3-way mirror, reads would hit six devices.

In all but degenerate cases, mirrored arrangements are going to include
more drives for the same amount of usable storage, so it seems they should
result in more devices being hit for both read and write. Or am I wrong
about reads being issued in parallel to all the mirrors in the array?
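
To make the comparison concrete, here is the plain device count per random
read under both assumptions about mirror reads (simple counting in Python, no
ZFS internals; whether reads really fan out to every mirror side is exactly
the open question above):

# ~1 TB usable built from 500 GB devices; devices touched by one random read
layouts = {
    "raidz2 of 4x500GB": {"data_cols": 2},    # data columns only; parity is
                                              # normally skipped on a healthy read
    "2x 2-way mirror":   {"mirror_width": 2},
    "2x 3-way mirror":   {"mirror_width": 3},
}

for reads_hit_all_sides in (False, True):
    print("assuming reads hit every mirror side: %s" % reads_hit_all_sides)
    for name, geom in layouts.items():
        if "data_cols" in geom:
            touched = geom["data_cols"]
        else:
            touched = geom["mirror_width"] if reads_hit_all_sides else 1
        print("  %-18s -> %d device(s) per read" % (name, touched))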
___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Sašo Kiselkov
On 04/16/2013 11:37 PM, Timothy Coalson wrote:
> On Tue, Apr 16, 2013 at 4:29 PM, Sašo Kiselkov wrote:
> 
>> If you are IOPS constrained, then yes, raid-zn will be slower, simply
>> because any read needs to hit all data drives in the stripe. This is
>> even worse on writes if the raidz has bad geometry (number of data
>> drives isn't a power of 2).
>>
> 
> Off topic slightly, but I have always wondered at this - what exactly
> causes non-power of 2 plus number of parities geometries to be slower, and
> by how much?  I tested for this effect with some consumer drives, comparing
> 8+2 and 10+2, and didn't see much of a penalty (though the only random test
> I did was read, our workload is highly sequential so it wasn't important).

Because a non-power-of-2 number of drives causes a read-modify-write
sequence on every (almost) write. HDDs are block devices and they can
only ever write in increments of their sector size (512 bytes or
nowadays often 4096 bytes). Using your example above, you divide a 128k
block by 8, you get 8x16k updates - all nicely aligned on 512 byte
boundaries, so your drives can write that in one go. If you divide by
10, you get an ugly 12.8k, which means if your drives are of the
512-byte sector variety, they write 25x 512 sectors and then for the
last partial sector write, they first need to fetch the sector from the
platter, modify it in memory and then write it out again.
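
The arithmetic, as a quick sanity check in Python (same 128k record and
512-byte sectors as above):

RECORD = 128 * 1024      # 128 KiB record, as above
SECTOR = 512             # classic 512-byte sectors

for data_disks in (8, 10):
    chunk = RECORD / float(data_disks)           # bytes handed to each data disk
    full, leftover = divmod(int(chunk), SECTOR)
    print("%2d data disks: %7.1f bytes/disk = %d full sectors + %d bytes"
          % (data_disks, chunk, full, leftover))
# 8 disks  -> 16384.0 bytes/disk = 32 full sectors + 0 bytes
# 10 disks -> 13107.2 bytes/disk = 25 full sectors + 307 bytes (a partial sector)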

I said "almost" every write is affected, but this largely depends on
your workload. If your writes are large async writes, then this RMW
cycle only happens at the end of the transaction commit (simplifying a
bit, but you get the idea), which is pretty small. However, if you are
doing many small updates in different locations (e.g. writing the ZIL),
this can significantly amplify the load.

Cheers,
--
Saso

___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Timothy Coalson
On Tue, Apr 16, 2013 at 4:29 PM, Sašo Kiselkov wrote:

> If you are IOPS constrained, then yes, raid-zn will be slower, simply
> because any read needs to hit all data drives in the stripe. This is
> even worse on writes if the raidz has bad geometry (number of data
> drives isn't a power of 2).
>

Off topic slightly, but I have always wondered at this - what exactly
causes non-power of 2 plus number of parities geometries to be slower, and
by how much?  I tested for this effect with some consumer drives, comparing
8+2 and 10+2, and didn't see much of a penalty (though the only random test
I did was read, our workload is highly sequential so it wasn't important).

Tim
___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Sašo Kiselkov
On 04/16/2013 11:25 PM, Timothy Coalson wrote:
> On Tue, Apr 16, 2013 at 3:48 PM, Jay Heyl  wrote:
> 
>> My question about the rationale behind the suggestion of mirrored SSD
>> arrays was really meant to be more in relation to the question from the OP.
>> I don't see how mirrored arrays of SSDs would be effective in his
>> situation.
>>
> 
> There is another detail here to keep in mind: ZFS checks checksums on every
> read from storage, and with raid-zn used with block sizes that give it more
> capacity than mirroring (that is, data blocks are large enough that they
> get split across multiple data sectors and therefore devices, instead of
> degenerate single data sector plus parity sector(s) - OP mentioned 32K
> blocks, so they should get split), this means each random filesystem read
> that isn't cached hits a large number of devices in a raid-zn vdev, but
> only one device in a mirror vdev (unless ZFS splits these reads across
> mirrors, but even then it is still fewer devices hit).  If you are limited
> by IOPS of the devices, then this could make raid-zn slower.

If you are IOPS constrained, then yes, raid-zn will be slower, simply
because any read needs to hit all data drives in the stripe. This is
even worse on writes if the raidz has bad geometry (number of data
drives isn't a power of 2).

Cheers,
--
Saso

___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Timothy Coalson
On Tue, Apr 16, 2013 at 3:48 PM, Jay Heyl  wrote:

> My question about the rationale behind the suggestion of mirrored SSD
> arrays was really meant to be more in relation to the question from the OP.
> I don't see how mirrored arrays of SSDs would be effective in his
> situation.
>

There is another detail here to keep in mind: ZFS checks checksums on every
read from storage, and with raid-zn used with block sizes that give it more
capacity than mirroring (that is, data blocks are large enough that they
get split across multiple data sectors and therefore devices, instead of
degenerate single data sector plus parity sector(s) - OP mentioned 32K
blocks, so they should get split), this means each random filesystem read
that isn't cached hits a large number of devices in a raid-zn vdev, but
only one device in a mirror vdev (unless ZFS splits these reads across
mirrors, but even then it is still fewer devices hit).  If you are limited
by IOPS of the devices, then this could make raid-zn slower.

Disclaimer: this is theory, I haven't tested this in practice, nor have I
done any math to see if it should matter to SSDs.  However, since it is a
configuration question rather than a hardware question, it may be possible
to acquire (some of) the hardware first and test both setups before
deciding.
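
A rough sketch of the math I have in mind (the per-device IOPS figure is a
made-up placeholder, not a measurement):

def pool_read_iops(n_devices, per_device_iops, devices_per_read):
    # Total device IOPS divided by the devices each read has to touch.
    return n_devices * per_device_iops // devices_per_read

PER_DEV = 50000   # hypothetical SSD random-read IOPS, not a measured figure

# 12 SSDs as one 10+2 raidz2: each random read touches all 10 data columns
print(pool_read_iops(12, PER_DEV, 10))   # 60000
# 12 SSDs as six 2-way mirrors: each random read touches a single device
print(pool_read_iops(12, PER_DEV, 1))    # 600000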

Tim
___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Sašo Kiselkov
On 04/16/2013 10:57 PM, Bob Friesenhahn wrote:
> On Tue, 16 Apr 2013, Jay Heyl wrote:
>>
>> It's actually not all that difficult to saturate a 6Gb/s pathway with ZFS
>> when there are multiple storage devices on the other end of that path. No
>> single HDD today is going to come close to needing that full 6Gb/s,
>> but put
>> four or five of them hanging off that same path and that ultra-super
>> highway starts looking pretty congested. Put SSDs on the other end and
>> the
>> 6Gb/s pathway is going to quickly become your bottleneck.
> 
> SATA and SAS are dedicated point-to-point interfaces so there is no
> additive bottleneck with more drives as long as the devices are directly
> connected.

Not true. Modern flash storage is quite capable of saturating a 6 Gbps
SATA link. SAS has an advantage here, being dual-port natively with
active-active load balancing deployed as standard practice. Also please
note that SATA is half-duplex, whereas SAS is full-duplex.

The problem with SATA vs SAS for flash storage is that there are, as
yet, no flash devices of the "NL-SAS" kind. By this I mean drives that
are only about 10-20% more expensive than their SATA counterparts,
offering native SAS connectivity, but not top-notch "enterprise"
features and/or performance. This situation existed in HDDs not long
ago: you had 7k2 SATA and 10k/15k SAS, but no 7k2 SAS. That's why we had
to do all that nonsense with SAS to SATA interposers (I have an old Sun
J4200 with 1TB SATA drives that had an interposer on each of the 12
drives). Since then, NL-SAS largely made this route obsolete, so now I
just buy 7k2 NL-SAS drives and skip the whole interposer thing.

Now if any of the big storage drive vendors got their shit together and started
offering flash storage with native SAS at slightly above SATA prices,
I'd be delighted. Trouble is, the manufacturers seem to be trying to
position SAS SSDs as "even more expensive/performing than SAS HDDs"
types of products. When I can buy a 512GB SATA SSD for the price of a
600GB SAS drive, that seems a strange proposition indeed...

--
Saso

___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Bob Friesenhahn

On Tue, 16 Apr 2013, Jay Heyl wrote:


It's actually not all that difficult to saturate a 6Gb/s pathway with ZFS
when there are multiple storage devices on the other end of that path. No
single HDD today is going to come close to needing that full 6Gb/s, but put
four or five of them hanging off that same path and that ultra-super
highway starts looking pretty congested. Put SSDs on the other end and the
6Gb/s pathway is going to quickly become your bottleneck.


SATA and SAS are dedicated point-to-point interfaces so there is no 
additive bottleneck with more drives as long as the devices are 
directly connected.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/

___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Jay Heyl
On Tue, Apr 16, 2013 at 11:54 AM, Jim Klimov  wrote:

> On 2013-04-16 20:30, Jay Heyl wrote:
>
>> What would be the logic behind mirrored SSD arrays? With spinning platters
>> the mirrors improve performance by allowing the fastest of the mirrors to
>> respond to a particular command to be the one that defines throughput.
>> With
>>
>
> Well, to think up a rationale: it is quite possible to saturate a bus
> or an HBA with SSDs, leading to increased latency in case of intense
> IO just because some tasks (data packets) are waiting in the queue
> for the bottleneck to dissolve. If another side of the mirror has a
> different connection (another HBA, another PCI bus) then IOs can go
> there - increasing overall performance.
>

This strikes me as a strong argument for carefully planning the arrangement
of storage devices of any sort in relation to HBAs and buses. It seems
significantly less strong as an argument for a mirror _maybe_ having a
different connection and responding faster.

My question about the rationale behind the suggestion of mirrored SSD
arrays was really meant to be more in relation to the question from the OP.
I don't see how mirrored arrays of SSDs would be effective in his
situation.

Personally, I'd go with RAID-Z2 or RAID-Z3 unless the computational load on
the CPU is especially high. This would give you as good as or better fault
protection than mirrors at significantly less cost. Indeed, given his
scenario of write early, read often later on, I might even be tempted to go
for the new TLC SSDs from Samsung. For this particular use the much reduced
"lifetime" of the devices would probably not be a factor at all. OTOH,
given the almost-no-limits budget, shaving $100 here or there is probably
not a big consideration. (And just to be clear, I would NOT recommend the
TLC SSDs for a more general solution. It was specifically the write-few,
read-many scenario that made me think of them.)

> Basically, this answer stems from logic which applies to "why would we
> need 6Gbit/s on HDDs?" Indeed, HDDs won't likely saturate their buses
> with even sequential reads. The link speed really applies to the bursts
> of IO between the system and HDD's caches. Double bus speed roughly
> halves the time a HDD needs to keep the bus busy for its portion of IO.
> And when there are hundreds of disks sharing a resource (an expander
> for example), this begins to matter.


It's actually not all that difficult to saturate a 6Gb/s pathway with ZFS
when there are multiple storage devices on the other end of that path. No
single HDD today is going to come close to needing that full 6Gb/s, but put
four or five of them hanging off that same path and that ultra-super
highway starts looking pretty congested. Put SSDs on the other end and the
6Gb/s pathway is going to quickly become your bottleneck.
___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Andrew Gabriel

Mehmet Erol Sanliturk wrote:

I am not an expert on this subject, but based on my reading of e-mails on
different mailing lists and of some relevant pages on Wikipedia about SSD
drives, the following points are mentioned as SSD disadvantages (even for
"Enterprise" labeled drives):


SSD units are very vulnerable to power cuts during operation, with consequences
ranging up to complete failure (so they can not be used any more) or complete
loss of data.


That's why some of them include their own momentary power store, or in
some systems, the system has a momentary power store to keep them powered
for a period after the last write operation.


MLC (Multi-Level Cell) SSD units have a short lifetime if they are
continuously written (they are more suitable for write-once (in the sense of a
limited number of writes), read-many use).

SLC (Single-Level Cell) SSD units have a much longer life span, but
they are expensive compared to MLC SSD units.

SSD units may fail due to write wear at an unexpected time, making them
very unreliable for mission-critical work.


All the Enterprise grade SSDs I've used can tell you how far through their
life they are (in terms of write wearing). Some of the monitoring tools pick
this up and warn you when you're down to some threshold, such as 20% left.

Secondly, when they wear out, they fail to write (effectively become write
protected). So you find out before they confirm committing your data, and
you can still read all the data back.

This is generally the complete opposite of the failure modes of hard drives,
although like any device, the SSD might fail for other reasons.

I have not played with consumer grade drives.


Due to the above points (which may perhaps be wrong), I would personally
select spinning-platter SAS disks, and up to now I have not bought any SSDs for
these reasons.

The above points are a possible set of disadvantages for consideration.


The extra cost of using loads of short stroked 15k drives to get anywhere
near SSD performance is generally prohibitive.

___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Jim Klimov

On 2013-04-16 19:17, Mehmet Erol Sanliturk wrote:

I am not an expert on this subject, but based on my reading of e-mails on
different mailing lists and of some relevant pages on Wikipedia about SSD
drives, the following points are mentioned as SSD disadvantages (even for
"Enterprise" labeled drives):



My awareness in the subject is of similar nature, but with different
results someplace... Here goes:



SSD units are very vulnerable to power cuts during operation, with consequences
ranging up to complete failure (so they can not be used any more) or complete
loss of data.


Yes, maybe, for some vendors. Information is scarce about which
ones are better in practical reliability, leading to requests for
info like this thread.

Still, some vendors make a living by selling expensive gear into
critical workloads, and are thought to perform well. One factor,
though not always a guarantee, of proper end-of-work in case of
a power-cut, is presence of either batteries/accumulators, or
capacitors, which power the device long enough for it to save its
caches, metadata, etc. Then the mileage varies how well who does it.



MLC (Multi-Level Cell) SSD units have a short lifetime if they are
continuously written (they are more suitable for write-once (in the sense of a
limited number of writes), read-many use).

SLC (Single-Level Cell) SSD units have a much longer life span, but
they are expensive compared to MLC SSD units.


I hear SLC is also faster due to its simpler design. The price stems
from the requirement to have more cells than MLC to implement the same
number of storage bits. Also there are now some new designs like
eMLC which are young and "untested", but are said to have MLC price
and SLC reliability.

As feature sizes in the manufacturing process decrease, diffusion and
Brownian motion of atoms play an increasingly greater role. Indeed, while
early SSDs boasted tens and hundreds of thousands of rewrite cycles,
now 5-10k is considered good. But faster.



SSD units may fail due to write wear at an unexpected time, making them
very unreliable for mission-critical work.


For this reason there is over-provisioning. The SSD firmware detects
unreliable chips and excludes them from use, relocating data onto
spare chips. Also there is wear-leveling, which is when the firmware
tries to make sure that all chips are utilized more or less equally
and on average the device lives longer. Basically, an SSD (unlike
a normal USB Flash key) implements a RAID over tens of chips with
intimate knowledge and diagnostic mechanisms over the storage pieces.

Overall, vendors now often rate their devices in gbytes of writes
in their lifetime, or in full rewrites of the device. Last year we
had a similar discussion on-list, regarding then-new Intel DC S3700
http://comments.gmane.org/gmane.os.solaris.opensolaris.zfs/50424
and it struck me that in practical terms they boasted "Endurance
Rating - 10 drive writes/day over 5 years". That is a lot for many
use-cases. They are also relatively pricey, at $2.5/gb linearly
from 100G to 800G devices (in a local webshop here).
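
In byte terms, that rating works out roughly as follows (simple arithmetic on
the low and high ends of that capacity range):

DWPD, YEARS = 10, 5       # "10 drive writes/day over 5 years"
for capacity_gb in (100, 800):
    total_tb = capacity_gb * DWPD * 365 * YEARS / 1000.0
    print("%3d GB drive: ~%.0f TB (~%.1f PB) of rated writes"
          % (capacity_gb, total_tb, total_tb / 1000.0))
# 100 GB -> ~1825 TB (~1.8 PB);  800 GB -> ~14600 TB (~14.6 PB)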



Due to the above points (which may perhaps be wrong), I would personally
select spinning-platter SAS disks, and up to now I have not bought any SSDs for
these reasons.

The above points are a possible set of disadvantages for consideration.


They are not wrong in general, and there are any number of examples
where bad things do happen. But there are devices which are said to
successfully work around the fundamental drawbacks with some other
technology, such as firmware and capacitors and so on.

It is indeed not yet a subject and market to be careless with, by
taking just any device off the shelf and expecting it to perform
well and live long. Also it is beneficial to do some homework during
system configuration and reduce unnecessary writes to the SSDs - by
moving logs out of the rpool, disabling atime updates and so on.
There are things an SSD is good for, and some things HDDs are better
at (or are commonly thought to be) - i.e. price and longevity past
infant mortality. The choice of components depends on expected system
utilization as well as on performance requirements, and on how much
you're ready to pay for that.

All that said, I haven't yet touched an SSD so far, but mostly due
to financial reasons with both dayjob and home rigs...

//Jim

___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Jim Klimov

On 2013-04-16 20:30, Jay Heyl wrote:

What would be the logic behind mirrored SSD arrays? With spinning platters
the mirrors improve performance by allowing the fastest of the mirrors to
respond to a particular command to be the one that defines throughput. With


Well, to think up a rationale: it is quite possible to saturate a bus
or an HBA with SSDs, leading to increased latency in case of intense
IO just because some tasks (data packets) are waiting in the queue
for the bottleneck to dissolve. If another side of the mirror has a
different connection (another HBA, another PCI bus) then IOs can go
there - increasing overall performance.

Basically, this answer stems from logic which applies to "why would we
need 6Gbit/s on HDDs?" Indeed, HDDs won't likely saturate their buses
with even sequential reads. The link speed really applies to the bursts
of IO between the system and HDD's caches. Double bus speed roughly
halves the time a HDD needs to keep the bus busy for its portion of IO.
And when there are hundreds of disks sharing a resource (an expander
for example), this begins to matter.
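
A rough illustration of that halving, using raw link rates only (8b/10b
coding and protocol overhead are ignored, so real times are somewhat longer):

BURST_BITS = 128 * 1024 * 8     # one 128 KiB burst
for gbps in (1.5, 3.0, 6.0):
    usec = BURST_BITS / (gbps * 1e9) * 1e6
    print("%.1f Gbit/s link: ~%.0f microseconds per burst" % (gbps, usec))
# 1.5 -> ~699 us, 3.0 -> ~350 us, 6.0 -> ~175 us: each doubling of link
# rate halves the time one drive occupies the shared path.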

HTH,
//Jim Klimov


___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Jay Heyl
On Mon, Apr 15, 2013 at 5:00 AM, Edward Ned Harvey (openindiana) <
openindi...@nedharvey.com> wrote:

>
> So I'm just assuming you're going to build a pool out of SSD's, mirrored,
> perhaps even 3-way mirrors.  No cache/log devices.  All the ram you can fit
> into the system.


What would be the logic behind mirrored SSD arrays? With spinning platters
the mirrors improve performance by allowing the fastest of the mirrors to
respond to a particular command to be the one that defines throughput. With
SSDs, they all should respond in basically the same time. There is no
latency due to head movement or waiting for the proper spot on the disc to
rotate under the heads. The improvement in read performance seen in
mirrored spinning platters should not be present with SSDs. Admittedly,
this is from a purely theoretical perspective. I've never assembled an SSD
array to compare mirrored vs RAID-Zx performance. I'm curious if you're
aware of something I'm overlooking.
___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-16 Thread Mehmet Erol Sanliturk
I am not an expert on this subject, but based on my reading of e-mails on
different mailing lists and of some relevant pages on Wikipedia about SSD
drives, the following points are mentioned as SSD disadvantages (even for
"Enterprise" labeled drives):


SSD units are very vulnerable to power cuts during operation, with consequences
ranging up to complete failure (so they can not be used any more) or complete
loss of data.

MLC (Multi-Level Cell) SSD units have a short lifetime if they are
continuously written (they are more suitable for write-once (in the sense of a
limited number of writes), read-many use).

SLC (Single-Level Cell) SSD units have a much longer life span, but
they are expensive compared to MLC SSD units.

SSD units may fail due to write wear at an unexpected time, making them
very unreliable for mission-critical work.

Due to the above points (which may perhaps be wrong), I would personally
select spinning-platter SAS disks, and up to now I have not bought any SSDs for
these reasons.

The above points are a possible set of disadvantages for consideration.


Thank you very much .

Mehmet Erol Sanliturk
___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage (OpenIndiana-discuss Digest, Vol 33, Issue 20)

2013-04-16 Thread Doug Hughes
some of these points are a bit dated. Allow me to make some updates. I'm sure 
that you are aware that most 10gig switches these days are cut through and not 
store and forward. That's Arista, HP, Dell Force10, Mellanox, and IBM/Blade. 
Cisco has a mix of things, but they aren't really in the low latency space. The 
10g and 40g port to port forwarding is in nanoseconds. Buffering is mostly 
reserved for carrier operations these days, and even there it is becoming less 
common because of the toll it takes on things like IP video and VOIP. Buffers 
are good for web farms, still, and to a certain extent storage servers or WAN 
links where there is a high degree of contention from disparate traffic.
  
At a physical level, the signalling of IB compared to Ethernet (10g+) is very 
similar, which is why Mellanox can make a single chip that does 10gbit 40gbit, 
and QDR and FDR infiniband on any port.
There are also a fair number of vendors that support RDMA in ethernet NICs now, 
like SolarFlare with Onboot technology.

The main reason for the lowest achievable latency is higher speed. Latency is 
roughly the inverse of bandwidth.  But the higher levels of 
protocols that you stack on top contribute much more than the hardware 
theoretical minimums or maximums. TCP/IP is a killer in terms of adding 
overhead. That's why there are protocols like ISER, SRP, and friends. RDMA is 
much faster than the kernel overhead induced by TCP session setups and other 
host side user/kernel boundaries and buffering. PCI latency is also higher than 
the port to port latency on a good 10g switch, nevermind 40 or FDR infiniband.
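
As a concrete example of the bandwidth term alone, the serialization delay of
a single frame, before any protocol or switch overhead is added on top:

FRAME_BITS = 1500 * 8    # one standard Ethernet frame
for gbps in (1, 10, 40):
    usec = FRAME_BITS / (gbps * 1e9) * 1e6
    print("%2d Gbit/s: %.2f microseconds on the wire" % (gbps, usec))
# 1G -> 12.00 us, 10G -> 1.20 us, 40G -> 0.30 us; small numbers next to the
# per-message overhead a full TCP/IP stack adds on top.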

There is even a special layer on Infiniband, called Verbs, that you can write 
custom protocols against to lower latency further.

Infiniband is inherently a layer 1 and 2 protocol, and the subnet manager 
(software) is responsible for setting up all virtual circuits (routes between 
hosts on the fabric) and rerouting when a path goes bad. Also, the link 
aggregation, as you mention, is rock solid and amazingly good. Auto rerouting 
is fabulous and super fast. But, you don't get layer3. TCP over IB works out of 
the box, but adds large overhead. Still, it does make it possible that you can 
have IB native and IP over IB with gateways to a TCP network with a single 
cable. That's pretty cool.


Sent from my android device.

-Original Message-
From: "Edward Ned Harvey (openindiana)" 
To: Discussion list for OpenIndiana 
Sent: Tue, 16 Apr 2013 10:49 AM
Subject: Re: [OpenIndiana-discuss] Recommendations for fast storage 
(OpenIndiana-discuss Digest, Vol 33, Issue 20)

> From: Bob Friesenhahn [mailto:bfrie...@simple.dallas.tx.us]
> 
> It would be difficult to believe that 10Gbit Ethernet offers better
> bandwidth than 56Gbit Infiniband (the current offering).  The switching
> model is quite similar.  The main reason why IB offers better latency
> is a better HBA hardware interface and a specialized stack.  5X is 5X.

Put another way, the reason infiniband is so much higher throughput and lower 
latency than ethernet is because the switching (at the physical layer) is 
completely different from ethernet, and messages are passed directly from 
user-level to user-level on remote system ram via RDMA, bypassing the OSI layer 
model and other kernel overhead.  I read a paper from vmware, where they 
implemented RDMA over ethernet and doubled the speed of vmotion (but still not 
as fast as infiniband, by like 4x.)

Beside the bypassing of OSI layers and kernel latency, IB latency is lower 
because Ethernet switches use store-and-forward buffering managed by the 
backplane in the switch, in which a sender sends a packet to a buffer on the 
switch, which then pushes it through the backplane, and finally to another 
buffer on the destination.  IB uses cross-bar, or cut-through switching, in 
which the sending host channel adapter signals the destination address to the 
switch, then waits for the channel to be opened.  Once the channel is opened, 
it stays open, and the switch in between is nothing but signal amplification 
(as well as additional virtual lanes for congestion management, and other 
functions).  The sender writes directly to RAM on the destination via RDMA, no 
buffering in between.  Bypassing the OSI layer model.  Hence much lower latency.

IB also has native link aggregation into data-striped lanes, hence the 1x, 4x, 
8x, and 12x designations, and the 40Gbit specifications.  Something which is 
quasi-possible in ethernet via LACP, but not as good and not the same.  IB 
guarantees packets delivered in the right order, with native congestion control 
as compared to ethernet which may drop packets and TCP must detect and 
retransmit...  

Ethernet includes a lot of support for IP addressing, and variable link speeds 
(some 10Gbit, 10/100, 1G etc) and all of this asynchronous.  For these reasons, 
IB is not a suitable replacement for IP commun

Re: [OpenIndiana-discuss] Recommendations for fast storage (OpenIndiana-discuss Digest, Vol 33, Issue 20)

2013-04-16 Thread Edward Ned Harvey (openindiana)
> From: Bob Friesenhahn [mailto:bfrie...@simple.dallas.tx.us]
> 
> It would be difficult to believe that 10Gbit Ethernet offers better
> bandwidth than 56Gbit Infiniband (the current offering).  The switching
> model is quite similar.  The main reason why IB offers better latency
> is a better HBA hardware interface and a specialized stack.  5X is 5X.

Put another way, the reason infiniband is so much higher throughput and lower 
latency than ethernet is because the switching (at the physical layer) is 
completely different from ethernet, and messages are passed directly from 
user-level to user-level on remote system ram via RDMA, bypassing the OSI layer 
model and other kernel overhead.  I read a paper from vmware, where they 
implemented RDMA over ethernet and doubled the speed of vmotion (but still not 
as fast as infiniband, by like 4x.)

Beside the bypassing of OSI layers and kernel latency, IB latency is lower 
because Ethernet switches use store-and-forward buffering managed by the 
backplane in the switch, in which a sender sends a packet to a buffer on the 
switch, which then pushes it through the backplane, and finally to another 
buffer on the destination.  IB uses cross-bar, or cut-through switching, in 
which the sending host channel adapter signals the destination address to the 
switch, then waits for the channel to be opened.  Once the channel is opened, 
it stays open, and the switch in between is nothing but signal amplification 
(as well as additional virtual lanes for congestion management, and other 
functions).  The sender writes directly to RAM on the destination via RDMA, no 
buffering in between.  Bypassing the OSI layer model.  Hence much lower latency.

IB also has native link aggregation into data-striped lanes, hence the 1x, 4x, 
8x, and 12x designations, and the 40Gbit specifications.  Something which is 
quasi-possible in ethernet via LACP, but not as good and not the same.  IB 
guarantees packets delivered in the right order, with native congestion control 
as compared to ethernet which may drop packets and TCP must detect and 
retransmit...  

Ethernet includes a lot of support for IP addressing, and variable link speeds 
(some 10Gbit, 10/100, 1G etc) and all of this asynchronous.  For these reasons, 
IB is not a suitable replacement for IP communications done on ethernet, with a 
lot of variable peer-to-peer and broadcast traffic.  IB is designed for 
networks where systems want to establish connections to other systems, and 
those connections remain mostly statically connected.  Primarily clustering & 
storage networks.  Not primarily TCP/IP.


___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage (OpenIndiana-discuss Digest, Vol 33, Issue 20)

2013-04-15 Thread Bob Friesenhahn

On Mon, 15 Apr 2013, Ong Yu-Phing wrote:
Working set of ~50% is quite large; when you say data analysis I'd assume 
some sort of OLTP or real-time BI situation, but do you know the nature of 
your processing, i.e. is it latency dependent or bandwidth dependent?  Reason 
I ask, is because I think 10GB delivers better overall B/W, but 4GB 
infiniband delivers better latency.


It would be difficult to believe that 10Gbit Ethernet offers better 
bandwidth than 56Gbit Infiniband (the current offering).  The switching 
model is quite similar.  The main reason why IB offers better latency 
is a better HBA hardware interface and a specialized stack.  5X is 5X.


If 3xdisk raidz1 is too expensive, then put more SSDs in each raidz1.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/

___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-15 Thread Sašo Kiselkov
On 04/15/2013 03:30 PM, John Doe wrote:
> From: Günther Alka 
> 
>> I would think about the following
>> - yes, i would build that from SSD
>> - build the pool from multiple 10 disk Raid-Z2 vdevs,
> 
> Slightly out of topic but, what is the status of the TRIM command and zfs...?

ATM: unsupported. I'm working on that in Illumos. The ZFS bits are
there, but there is no driver support for issuing the commands to the
underlying devices.

--
Saso

___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-15 Thread John Doe
From: Günther Alka 

> I would think about the following
> - yes, i would build that from SSD
> - build the pool from multiple 10 disk Raid-Z2 vdevs,

Slightly out of topic but, what is the status of the TRIM command and zfs...?

JD


___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-15 Thread Edward Ned Harvey (openindiana)
> From: Wim van den Berge [mailto:w...@vandenberge.us]
> 
> multiple 10Gb uplinks
> 
> However the next system is  going to be a little different. It needs to be
> the absolute fastest iSCSI target we can create/afford. 

So I'm just assuming you're going to build a pool out of SSD's, mirrored, 
perhaps even 3-way mirrors.  No cache/log devices.  All the ram you can fit 
into the system.

You've been using 10G ether so far.  Expensive, not too bad.  I'm going to 
recommend looking into infiniband instead.


___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage (OpenIndiana-discuss Digest, Vol 33, Issue 20)

2013-04-14 Thread Ong Yu-Phing
A heads up that 10-12TB means you'd need 11.5-13TB useable, assuming 
you'd need to keep used storage < 90% of total storage useable (or is 
that old news now?).


So, using Saso's RAID5 config of Intel DC S3700s in 3xdisk raidz1, that 
means you'd need 21x Intel DC S3700's at 800GB (21x800/3*2*.9=10.08TB) to 
get 10TB, or 27x to get 12.9TB useable, excluding root/cache etc.  Which 
means 50+K for SSDs, leaving you only 10K for the server platform, which 
might not be enough to get 0.5TB of RAM etc (unless you can get a bulk 
discount on the Intel DC S3700s!).
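
The same arithmetic as a quick script (3-disk raidz1 triplets, 800GB drives,
keeping used space under 90%):

DRIVE_GB, PER_VDEV, FILL = 800, 3, 0.9
for drives in (21, 27):
    vdevs = drives // PER_VDEV
    usable_tb = vdevs * (PER_VDEV - 1) * DRIVE_GB * FILL / 1000.0
    print("%d drives -> %d vdevs -> %.2f TB useable, ~$%dK in SSDs at $2K each"
          % (drives, vdevs, usable_tb, drives * 2))
# 21 drives -> 10.08 TB and ~$42K;  27 drives -> 12.96 TB and ~$54K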


Working set of ~50% is quite large; when you say data analysis I'd 
assume some sort of OLTP or real-time BI situation, but do you know the 
nature of your processing, i.e. is it latency dependent or bandwidth 
dependent?  Reason I ask, is because I think 10GB delivers better 
overall B/W, but 4GB infiniband delivers better latency.


10 years ago I've worked with 30+TB data sets which were preloaded into 
an Oracle database, with data structures highly optimized for the types 
of reads which the applications required (2-3 day window for complex 
analysis of monthly data).  No SSDs and fancy stuff in those days.  But 
if your data is live/realtime and constantly streaming in, then the work 
profile can be dramatically different.


On 15/04/2013 07:17, Sašo Kiselkov wrote:

On 04/14/2013 05:15 PM, Wim van den Berge wrote:

Hello,

We have been running OpenIndiana (and its various predecessors) as storage
servers in production for the last couple of years. Over that time the
majority of our storage infrastructure has been moved to Open Indiana to the
point where we currently serve (iSCSI, NFS and CIFS) about 1.2PB from 10+
servers in three datacenters . All of these systems are pretty much the
same, large pool of disks, SSD for root, ZIL and L2ARC, 64-128GB RAM,
multiple 10Gb uplinks. All of these work like a charm.

However the next system is  going to be a little different. It needs to be
the absolute fastest iSCSI target we can create/afford. We'll need about
10-12TB of capacity and the working set will be 5-6TB and IO over time is
90% reads and 10% writes using 32K blocks but this is a data analysis
scenario so all the writes are upfront. Contrary to previous installs, money
is a secondary (but not unimportant) issue for this one. I'd like to stick
with a SuperMicro platform and we've been thinking of trying the new Intel
S3700 800GB SSD's which seem to run about $2K. Ideally I'd like to keep
system cost below $60K.

This is new ground for us. Before this one, the game has always been
primarily about capacity/data integrity and anything we designed based on
ZFS/Open Solaris has always more than delivered in the performance arena.
This time we're looking to fill up the dedicated 10Gbe connections to each
of the four to eight processing nodes as much as possible. The processing
nodes have been designed that they will consume whatever storage bandwidth
they can get.

Any ideas/thoughts/recommendations/caveats would be much appreciated.

Hi Wim,

Interesting project. You should definitely look at all-SSD pools here.
With the 800GB DC S3700 running in 3-drive raidz1's you're looking at
approximately $34k CAPEX (for the 10TB capacity point) just for the
SSDs. That leaves you ~$25k you can spend on the rest of the box, which
is *a lot*. Be sure to put lots of RAM (512GB+) into the box.

Also consider ditching 10GE and go straight to IB. A dual-port QDR card
can be had nowadays for about $1k (SuperMicro even makes motherboards
with QDR-IB on-board) and a 36-port Mellanox QDR switch can be had for
about $8k (this integrates the IB subnet manager, so this is all you
need to set up an IB network):
http://www.colfaxdirect.com/store/pc/viewPrd.asp?idcategory=7&idproduct=158

Cheers,
--
Saso






___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-14 Thread Richard Elling
On Apr 14, 2013, at 8:15 AM, Wim van den Berge  wrote:
> Hello,
> 
> We have been running OpenIndiana (and its various predecessors) as storage
> servers in production for the last couple of years. Over that time the
> majority of our storage infrastructure has been moved to Open Indiana to the
> point where we currently serve (iSCSI, NFS and CIFS) about 1.2PB from 10+
> servers in three datacenters . All of these systems are pretty much the
> same, large pool of disks, SSD for root, ZIL and L2ARC, 64-128GB RAM,
> multiple 10Gb uplinks. All of these work like a charm. 
> 
> However the next system is  going to be a little different. It needs to be
> the absolute fastest iSCSI target we can create/afford. We'll need about
> 10-12TB of capacity and the working set will be 5-6TB and IO over time is
> 90% reads and 10% writes using 32K blocks but this is a data analysis
> scenario so all the writes are upfront. Contrary to previous installs, money
> is a secondary (but not unimportant) issue for this one. I'd like to stick
> with a SuperMicro platform and we've been thinking of trying the new Intel
> S3700 800GB SSD's which seem to run about $2K. Ideally I'd like to keep
> system cost below $60K.

Does "fast" mean "low-latency"? If so, the general rules are:
+ mirror
+ go direct, no expanders
+ iSCSI tends to not use ZIL very much, but  you can verify on your 
workload.

There are a number of vendors who have been selling SSD-only ZFS systems
for a few years. You might ask around for experiences and specs.
 -- richard

> This is new ground for us. Before this one, the game has always been
> primarily about capacity/data integrity and anything we designed based on
> ZFS/Open Solaris has always more than delivered in the performance arena.
> This time we're looking to fill up the dedicated 10Gbe connections to each
> of the four to eight processing nodes as much as possible. The processing
> nodes have been designed that they will consume whatever storage bandwidth
> they can get.
> 
> 
> 
> Any ideas/thoughts/recommendations/caveats would be much appreciated.
> 
> 
> 
> Thanks
> 
> 
> 
> W
> 
> ___
> OpenIndiana-discuss mailing list
> OpenIndiana-discuss@openindiana.org
> http://openindiana.org/mailman/listinfo/openindiana-discuss

--

richard.ell...@richardelling.com
+1-760-896-4422



___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-14 Thread Sašo Kiselkov
On 04/14/2013 05:15 PM, Wim van den Berge wrote:
> Hello,
> 
> We have been running OpenIndiana (and its various predecessors) as storage
> servers in production for the last couple of years. Over that time the
> majority of our storage infrastructure has been moved to Open Indiana to the
> point where we currently serve (iSCSI, NFS and CIFS) about 1.2PB from 10+
> servers in three datacenters . All of these systems are pretty much the
> same, large pool of disks, SSD for root, ZIL and L2ARC, 64-128GB RAM,
> multiple 10Gb uplinks. All of these work like a charm. 
> 
> However the next system is  going to be a little different. It needs to be
> the absolute fastest iSCSI target we can create/afford. We'll need about
> 10-12TB of capacity and the working set will be 5-6TB and IO over time is
> 90% reads and 10% writes using 32K blocks but this is a data analysis
> scenario so all the writes are upfront. Contrary to previous installs, money
> is a secondary (but not unimportant) issue for this one. I'd like to stick
> with a SuperMicro platform and we've been thinking of trying the new Intel
> S3700 800GB SSD's which seem to run about $2K. Ideally I'd like to keep
> system cost below $60K.
> 
> This is new ground for us. Before this one, the game has always been
> primarily about capacity/data integrity and anything we designed based on
> ZFS/Open Solaris has always more than delivered in the performance arena.
> This time we're looking to fill up the dedicated 10Gbe connections to each
> of the four to eight processing nodes as much as possible. The processing
> nodes have been designed that they will consume whatever storage bandwidth
> they can get.
> 
> Any ideas/thoughts/recommendations/caveats would be much appreciated.

Hi Wim,

Interesting project. You should definitely look at all-SSD pools here.
With the 800GB DC S3700 running in 3-drive raidz1's you're looking at
approximately $34k CAPEX (for the 10TB capacity point) just for the
SSDs. That leaves you ~$25k you can spend on the rest of the box, which
is *a lot*. Be sure to put lots of RAM (512GB+) into the box.

Also consider ditching 10GE and go straight to IB. A dual-port QDR card
can be had nowadays for about $1k (SuperMicro even makes motherboards
with QDR-IB on-board) and a 36-port Mellanox QDR switch can be had for
about $8k (this integrates the IB subnet manager, so this is all you
need to set up an IB network):
http://www.colfaxdirect.com/store/pc/viewPrd.asp?idcategory=7&idproduct=158

Cheers,
--
Saso

___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-14 Thread Günther Alka

I would think about the following

- yes, i would build that from SSD
- build the pool from multiple 10 disk Raid-Z2 vdevs,

- use as much RAM as possible to serve most reads from RAM,
for example a dual socket 2011 system with 256 GB RAM

- if you need sync writes / a disabled LU write-back cache, use dedicated 
DRAM-based log devices (ZEUSRAM).
For multiple 10 GbE links you may need several of them, or disable sync / 
enable the LU cache when possible.
I would calculate one ZEUSRAM per 10 GbE adapter (about 2000$ each); 
analyze ZIL usage first.


- if possible, avoid expander with Sata disks

- do not fill a pool above 50% if you need max performance
read about fillrate vs throughput: 
http://blog.delphix.com/uday/2013/02/19/78/


- tune ip (Jumboframes, MPIO, Trunking) and iSCSI blocksize

- think about using OmniOS (a little more up to date than OI)


The rest is some math; you need:

a case (like a 50 x 3,5" bay Chenbro or an up to 72 bay SuperMicro)
with a 7 x pci-e mainboard, CPU, RAM, 3 x SAS2 HBA controllers,
4 x dual 10 Gbe adapters:

ex: Chenbro 50 x 3,5" case without expander:
4 x dual 10 Gbe + 3 x LSI 16 channel HBA

or Supermicro cases with expander, up to 72 x 2,5" bays
with up to 3 x 8-16 channel HBA

say 10.000 $


The rest is for SSD and ZIL.
If you would like to use 10 TB and want to have 20 TB capacity for performance 
reasons:

with your 800GB Intel, you have about 6,5 TB usable per 10 disks (Z2).
You need 30 of them at 2000$ per SSD: 60.000 $ (without ZIL and spare),

which gives a total of 70.000 $ without ZIL and spare.
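
The same math spelled out, following the figures above (10-disk Z2 vdevs,
800GB drives, 2000$ per SSD, 10.000 $ for the chassis and controllers):

DRIVE_TB, WIDTH, PARITY = 0.8, 10, 2
SSD_PRICE, CHASSIS = 2000, 10000
drives = 30
vdevs = drives // WIDTH
usable_tb = vdevs * (WIDTH - PARITY) * DRIVE_TB
total = drives * SSD_PRICE + CHASSIS
print("%d drives -> %d vdevs -> %.1f TB usable, %d $ in SSDs, %d $ in total"
      % (drives, vdevs, usable_tb, drives * SSD_PRICE, total))
# 30 drives -> 3 vdevs -> 19.2 TB usable, 60000 $ in SSDs, 70000 $ in total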

Other option:
use 500-600 GB SSDs like the Intel 320 or 520.
You need more of them but they are cheaper per TB.

Allow 80% SSD usage; check ARC usage to possibly reduce the amount of SSD
(RAM is cheaper than using only 50% of SSD capacity).

keep enough slots free to optionally add more SSD for better performance 
or higher capacity

Take into account the capacity needed for snaps and
add 10% spare disks.




On 14.04.2013 17:15, Wim van den Berge wrote:

Hello,

  


We have been running OpenIndiana (and its various predecessors) as storage
servers in production for the last couple of years. Over that time the
majority of our storage infrastructure has been moved to Open Indiana to the
point where we currently serve (iSCSI, NFS and CIFS) about 1.2PB from 10+
servers in three datacenters . All of these systems are pretty much the
same, large pool of disks, SSD for root, ZIL and L2ARC, 64-128GB RAM,
multiple 10Gb uplinks. All of these work like a charm.

  


However the next system is  going to be a little different. It needs to be
the absolute fastest iSCSI target we can create/afford. We'll need about
10-12TB of capacity and the working set will be 5-6TB and IO over time is
90% reads and 10% writes using 32K blocks but this is a data analysis
scenario so all the writes are upfront. Contrary to previous installs, money
is a secondary (but not unimportant) issue for this one. I'd like to stick
with a SuperMicro platform and we've been thinking of trying the new Intel
S3700 800GB SSD's which seem to run about $2K. Ideally I'd like to keep
system cost below $60K.

  


This is new ground for us. Before this one, the game has always been
primarily about capacity/data integrity and anything we designed based on
ZFS/Open Solaris has always more than delivered in the performance arena.
This time we're looking to fill up the dedicated 10Gbe connections to each
of the four to eight processing nodes as much as possible. The processing
nodes have been designed that they will consume whatever storage bandwidth
they can get.

  


Any ideas/thoughts/recommendations/caveats would be much appreciated.

  


Thanks

  


W

___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss




___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


[OpenIndiana-discuss] Recommendations for fast storage

2013-04-14 Thread Wim van den Berge
Hello,

 

We have been running OpenIndiana (and its various predecessors) as storage
servers in production for the last couple of years. Over that time the
majority of our storage infrastructure has been moved to Open Indiana to the
point where we currently serve (iSCSI, NFS and CIFS) about 1.2PB from 10+
servers in three datacenters . All of these systems are pretty much the
same, large pool of disks, SSD for root, ZIL and L2ARC, 64-128GB RAM,
multiple 10Gb uplinks. All of these work like a charm. 

 

However the next system is  going to be a little different. It needs to be
the absolute fastest iSCSI target we can create/afford. We'll need about
10-12TB of capacity and the working set will be 5-6TB and IO over time is
90% reads and 10% writes using 32K blocks but this is a data analysis
scenario so all the writes are upfront. Contrary to previous installs, money
is a secondary (but not unimportant) issue for this one. I'd like to stick
with a SuperMicro platform and we've been thinking of trying the new Intel
S3700 800GB SSD's which seem to run about $2K. Ideally I'd like to keep
system cost below $60K.

 

This is new ground for us. Before this one, the game has always been
primarily about capacity/data integrity and anything we designed based on
ZFS/Open Solaris has always more than delivered in the performance arena.
This time we're looking to fill up the dedicated 10Gbe connections to each
of the four to eight processing nodes as much as possible. The processing
nodes have been designed that they will consume whatever storage bandwidth
they can get.

 

Any ideas/thoughts/recommendations/caveats would be much appreciated.

 

Thanks

 

W

___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss