Re: [OpenIndiana-discuss] Recommendations for fast storage
comment below… On Apr 18, 2013, at 5:17 AM, Edward Ned Harvey (openindiana) wrote: >> From: Timothy Coalson [mailto:tsc...@mst.edu] >> >> Did you also compare the probability of bit errors causing data loss >> without a complete pool failure? 2-way mirrors, when one device >> completely >> dies, have no redundancy on that data, and the copy that remains must be >> perfect or some data will be lost. > > I had to think about this comment for a little while to understand what you > were saying, but I think I got it. I'm going to rephrase your question: > > If one device in a 2-way mirror becomes unavailable, then the remaining > device has no redundancy. So if a bit error is encountered on the (now > non-redundant) device, then it's an uncorrectable error. Question is, did I > calculate that probability? > > Answer is, I think so. Modelling the probability of drive failure (either > complete failure or data loss) is very complex and non-linear. Also > dependent on the specific model of drive in question, and the graphs are > typically not available. So what I did was to start with some MTBDL graphs > that I assumed to be typical, and then assume every data-loss event meant > complete drive failure. Already I'm simplifying the model beyond reality, > but the simplification focuses on worst case, and treats every bit error as > complete drive failure. This is why I say "I think so," to answer your > question. > > Then, I didn't want to embark on a mathematician's journey of derivatives and > integrals over some non-linear failure rate graphs, so I linearized... I > forget now (it was like 4-6 years ago) but I would have likely seen that > drives were unlikely to fail in the first 2 years, and about 50% likely to > fail after 3 years, and nearly certain to fail after 5 years, so I would have > likely modeled that as a linearly increasing probability of failure rate up > to 4 years, where it's assumed 100% failure rate at 4 years. This technique shows a good appreciation of the expected lifetime of components. Some of the more sophisticated models use a Weibull distribution, and this works particularly well for computing devices. The problem for designers is that the Weibull model parameters are not publically published by the vendors. You need some time in the field to collect these, so it is impractical for the systems designers. At the end of the day, we have two practical choices: 1. Prepare for planned obsolescence and replacement of devices when the expected lifetime metric is reached. The best proxy for HDD expected lifetime is the warranty period, and you'll often notice that enterprise drives have a better spec than consumer drives -- you tend to get what you pay for. 2. Measure your environment very carefully and take proactive action when the system begins to display signs of age-related wear out. This is a good idea in all cases, but the techniques are not widely adopted… yet. > Yes, this modeling introduces inaccuracy, but that inaccuracy is in the > noise. Maybe in the first 2 years, I'm 25% off in my estimates to the > positive, and after 4 years I'm 25% off in the negative, or something like > that. But when the results show 10^-17 probability for one configuration and > 10^-19 probability for a different configuration, then the 25% error is > irrelevant. It's easy to see which configuration is more probable to fail, > and it's also easy to see they're both well within acceptable limits for most > purposes (especially if you have good backups.) 
For reliability measurements, this is not a bad track record. There are lots of other, environmental and historical factors that impact real life. As an analogy, for humans, early death tends to be dominated by accidents rather than chronic health conditions. For example, children tend to die in automobile accidents, while octogenarians tend to die from heart attacks, organ failure, or cancer -- different failure modes as a function of age. -- richard >> Also, as for time to resilver, I'm guessing that depends largely on where >> bottlenecks are (it has to read effectively all of the remaining disks in >> the vdev either way, but can do so in parallel, so ideally it could be the >> same speed), > > No. The big factor for resilver time is (a) the number of operations that > need to be performed, and (b) the number of operations per second. > > If you have one big vdev making up a pool, then the number of operations to > be performed is equal to the number of objects in the pool. The number of > operations per second is limited by the worst case random seek time for any > device in the pool. If you have an all-SSD pool, then it's equal to a single > disk performance. If you have an all-HDD pool, then with increasing number > of devices in your vdev, you approach 50% of the IOPS of a single device. > > If
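To put the Weibull comment in concrete terms, here is a rough sketch (Python) comparing a Weibull failure model against the linearized ramp described above; the shape and scale parameters are illustrative assumptions, since, as noted, vendors rarely publish them:

import math

def weibull_failure_prob(t_years, shape=1.5, scale=4.0):
    # Weibull CDF: probability the drive has failed by time t (parameters assumed).
    return 1.0 - math.exp(-((t_years / scale) ** shape))

def linear_failure_prob(t_years, wearout_years=4.0):
    # Linearized ramp: failure probability rises linearly to 1.0 at the wear-out point.
    return min(t_years / wearout_years, 1.0)

for t in (1, 2, 3, 4, 5):
    print(f"year {t}: weibull={weibull_failure_prob(t):.3f}  linear={linear_failure_prob(t):.3f}")

With these assumed parameters the two models disagree most in the early years, which is exactly the "noise" argued above to be irrelevant when one configuration comes out around 10^-17 and the other around 10^-19.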
Re: [OpenIndiana-discuss] Recommendations for fast storage
> From: Jay Heyl [mailto:j...@frelled.us] > > I now realize you're talking about 8 separate 2-disk > mirrors organized into a pool. "mirror x1 y1 mirror x2 y2 mirror x3 y3..." Yup. That's normal, and the only way. > I also realize that almost every discussion I've seen online concerning > mirrors proposes organizing the drives in the way I was thinking about it Hmmm... What alternative are you thinking of? There is no alternative. > This also starts to make a lot more sense. Confused the hell out of me the > first three times I read it. I'm going to have to ponder this a bit more as > my thinking has been heavily influenced by the more conventional mirror > arrangement. What are you talking about?
Re: [OpenIndiana-discuss] Recommendations for fast storage
> From: Timothy Coalson [mailto:tsc...@mst.edu] > > Did you also compare the probability of bit errors causing data loss > without a complete pool failure? 2-way mirrors, when one device > completely > dies, have no redundancy on that data, and the copy that remains must be > perfect or some data will be lost. I had to think about this comment for a little while to understand what you were saying, but I think I got it. I'm going to rephrase your question: If one device in a 2-way mirror becomes unavailable, then the remaining device has no redundancy. So if a bit error is encountered on the (now non-redundant) device, then it's an uncorrectable error. Question is, did I calculate that probability? Answer is, I think so. Modelling the probability of drive failure (either complete failure or data loss) is very complex and non-linear. Also dependent on the specific model of drive in question, and the graphs are typically not available. So what I did was to start with some MTBDL graphs that I assumed to be typical, and then assume every data-loss event meant complete drive failure. Already I'm simplifying the model beyond reality, but the simplification focuses on worst case, and treats every bit error as complete drive failure. This is why I say "I think so," to answer your question. Then, I didn't want to embark on a mathematician's journey of derivatives and integrals over some non-linear failure rate graphs, so I linearized... I forget now (it was like 4-6 years ago) but I would have likely seen that drives were unlikely to fail in the first 2 years, and about 50% likely to fail after 3 years, and nearly certain to fail after 5 years, so I would have likely modeled that as a linearly increasing probability of failure rate up to 4 years, where it's assumed 100% failure rate at 4 years. Yes, this modeling introduces inaccuracy, but that inaccuracy is in the noise. Maybe in the first 2 years, I'm 25% off in my estimates to the positive, and after 4 years I'm 25% off in the negative, or something like that. But when the results show 10^-17 probability for one configuration and 10^-19 probability for a different configuration, then the 25% error is irrelevant. It's easy to see which configuration is more probable to fail, and it's also easy to see they're both well within acceptable limits for most purposes (especially if you have good backups.) > Also, as for time to resilver, I'm guessing that depends largely on where > bottlenecks are (it has to read effectively all of the remaining disks in > the vdev either way, but can do so in parallel, so ideally it could be the > same speed), No. The big factor for resilver time is (a) the number of operations that need to be performed, and (b) the number of operations per second. If you have one big vdev making up a pool, then the number of operations to be performed is equal to the number of objects in the pool. The number of operations per second is limited by the worst case random seek time for any device in the pool. If you have an all-SSD pool, then it's equal to a single disk performance. If you have an all-HDD pool, then with increasing number of devices in your vdev, you approach 50% of the IOPS of a single device. If your pool is broken down into a bunch of smaller vdev's, Let's say N mirrors that are all 2-way. Then the number of operations to resilver the degraded mirror is 1/N of the total objects in the pool. And the number of operations per second is equal to the performance of a single disk. 
So the resilver time in the big vdev raidz is 2N times longer than the resilver time for the mirror. As you mentioned, other activity in the pool can further reduce the number of operations per second. If you have N mirrors, then the probability of the other activity affecting the degraded mirror is 1/N. Whereas, with a single big vdev, you guessed it, all other activity is guaranteed to affect the resilvering vdev.
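A back-of-the-envelope sketch (Python) of the resilver arithmetic above; the object count and per-disk IOPS are illustrative assumptions, not measurements:

pool_objects = 50_000_000      # total blocks/objects in the pool (assumed)
disk_iops = 150                # random IOPS of a single HDD (assumed)
n_mirrors = 8                  # pool made of N 2-way mirrors

# One wide raidz vdev: walk every object, throttled to ~50% of one disk's IOPS.
raidz_hours = pool_objects / (0.5 * disk_iops) / 3600

# N mirrors: only 1/N of the objects live on the degraded mirror,
# and the resilver runs at full single-disk IOPS.
mirror_hours = (pool_objects / n_mirrors) / disk_iops / 3600

print(f"wide raidz: {raidz_hours:.0f} h, one of {n_mirrors} mirrors: {mirror_hours:.0f} h, "
      f"ratio = {raidz_hours / mirror_hours:.0f}x")   # ratio works out to 2N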
Re: [OpenIndiana-discuss] Recommendations for fast storage
> From: Jay Heyl [mailto:j...@frelled.us] > > Ah, that makes much more sense. Thanks for the clarification. Now that you > put it that way I have to wonder how I ever came under the impression it > was any other way. I've gotten lost in the numerous miscommunications of this thread, but just to make sure there is no confusion: If you have a mirror (or any vdev with redundancy, raidzN) and you issue a read, then normally only one side of the mirror gets read (not the redundant copies.) If the cksum fails, then redundant copies are read successively, until a successful cksum is found (still, some redundant copies might not have been read.) If you perform a scrub, then all copies of all information are read and validated. The advantage of reading only one side of the mirror is performance. If one device is busy satisfying one read request, then the other sides of the mirror are available to satisfy other read requests.
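A minimal sketch (Python, not the actual ZFS code path) of the read behaviour described above; the Side class and the CRC stand-in for ZFS's checksums are invented for the example:

import zlib

def checksum(data: bytes) -> int:
    return zlib.crc32(data)           # stand-in for fletcher/sha256

class Side:
    def __init__(self, blocks):
        self.blocks = blocks          # offset -> bytes
    def read(self, offset):
        return self.blocks[offset]

def mirror_read(sides, offset, expected_cksum):
    for side in sides:                # normally only the first side is touched
        data = side.read(offset)
        if checksum(data) == expected_cksum:
            return data               # good copy found; remaining sides never read
    raise IOError("uncorrectable: every copy failed its checksum")

good, corrupt = b"hello", b"hellp"
print(mirror_read([Side({0: corrupt}), Side({0: good})], 0, checksum(good)))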
Re: [OpenIndiana-discuss] Recommendations for fast storage
On 2013-04-17 21:25, Jay Heyl wrote: It (finally) occurs to me that not all mirrors are created equal. I've been assuming, and probably ignoring hints to the contrary, that what was being compared here was a raid-z2 configuration with a 2-way mirror composed of two 8-disk vdevs. I now realize you're talking about 8 separate 2-disk mirrors organized into a pool. "mirror x1 y1 mirror x2 y2 mirror x3 y3..." Well, to help you clarify things, there are simple mirrors made over two identically sized storage devices (for simplicity, we won't go into three- or four-way mirrors, which are essentially the same idea spread over more devices for higher reliability and read performance). If you involve more pairs of devices, you can build (in traditional RAID terminology) either raid10 - one stripe over many separate mirrors - or raid01 - one mirror over two individual stripes (or plain concatenations sometimes). While the performance is similar and the difference seems merely semantic, there is some. Most importantly, a single broken disk in a stripe invalidates it, and thus a whole half of the raid01 top-level mirror becomes broken. That's why these setups are rarely used nowadays, and ZFS didn't even bother to implement them (unlike raid10) :) HTH, //Jim
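The raid01-versus-raid10 point above can be put in rough numbers. The sketch below (Python) is purely combinatorial, assumes a 16-disk layout, and asks: given one disk has already failed, what is the chance that a second random failure loses the pool?

n_pairs = 8                      # raid10: 8 mirror pairs; raid01: two 8-disk stripe halves
total_disks = 2 * n_pairs

# raid10: after one disk dies, only its single mirror partner is critical.
p_raid10 = 1 / (total_disks - 1)

# raid01: the first failure already invalidated a whole stripe half, so any
# second failure among the disks of the surviving half loses the pool.
p_raid01 = n_pairs / (total_disks - 1)

print(f"raid10: {p_raid10:.3f}   raid01: {p_raid01:.3f}")   # roughly 0.067 vs 0.533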
Re: [OpenIndiana-discuss] Recommendations for fast storage
On Wed, Apr 17, 2013 at 5:38 AM, Edward Ned Harvey (openindiana) < openindi...@nedharvey.com> wrote: > > From: Sašo Kiselkov [mailto:skiselkov...@gmail.com] > > > > Raid-Z indeed does stripe data across all > > leaf vdevs (minus parity) and does so by splitting the logical block up > > into equally sized portions. > > Jay, there you have it. You asked why use mirrors, and you said you would > use raidz2 or raidz3 unless cpu overhead is too much. I recommended using > mirrors and avoiding raidzN, and here is the answer why. > > If you have 16 disks arranged in 8x mirrors, versus 10 disks in raidz2 > which stripes across 8 disks plus 2 parity disks, then the serial write of > each configuration is about the same; that is, 8x the sustained write speed > of a single device. But if you have two or more parallel sequential read > threads, then the sequential read speed of the mirrors will be 16x while > the raidz2 is only 8x. The mirror configuration can do 8x random write > while the raidz2 is only 1x. And the mirror can do 16x random read while > the raidz2 is only 1x. > It (finally) occurs to me that not all mirrors are created equal. I've been assuming, and probably ignoring hints to the contrary, that what was being compared here was a raid-z2 configuraton with a 2-way mirror composed of two 8-disk vdevs. I now realize you're talking about 8 separate 2-disk mirrors organized into a pool. "mirror x1 y1 mirror x2 y2 mirror x3 y3..." I also realize that almost every discussion I've seen online concerning mirrors proposes organizing the drives in the way I was thinking about it (which is probably why I was thinking that way). I suppose this is something different that zfs brings to the table when compared to more conventional hardware raid. > > In the case you care about the least, they're equal. In the case you care > about most, the mirror configuration is 16x faster. > > You also said the raidz2 will offer more protection against failure, > because you can survive any two disk failures (but no more.) I would argue > this is incorrect (I've done the probability analysis before). Mostly > because the resilver time in the mirror configuration is 8x to 16x faster > (there's 1/8 as much data to resilver, and IOPS is limited by a single > disk, not the "worst" of several disks, which introduces another factor up > to 2x, increasing the 8x as high as 16x), so the smaller resilver window > means lower probability of "concurrent" failures on the critical vdev. > We're talking about 12 hours versus 1 week, actual result of my machines > in production. Also, while it's possible to fault the pool with only 2 > failures in the mirror configuration, the probability is against that > happening. The first disk failure probability is 1/16 for each disk ... > And then if you have a 2nd concurrent failure, there's a 14/15 probability > that it occurs on a separately independent (safe) mirror. The 3rd > concurrent failure 12/14 chance of being safe. The 4th concurrent failure > 10/13 chance of being safe. Etc. The mirror configuration can probably > withstand a higher number of failures, and also the resilver window for > each failure is smaller. When you look at the total probability of pool > failure, they were both like 10^-17 or something like that. In other > words, we're splitting hairs but as long as we are, we might as well point > out that they're both about the same. > This also starts to make a lot more sense. Confused the hell out of me the first three times I read it. 
I'm going to have to ponder this a bit more as my thinking has been heavily influenced by the more conventional mirror arrangement. ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
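The "14/15 ... 12/14 ... 10/13" argument quoted above generalizes as follows; this sketch (Python) computes the chance that k simultaneous disk failures in a pool of 8 two-way mirrors all land on different mirrors, assuming failures strike disks uniformly at random:

from fractions import Fraction

def p_all_distinct_mirrors(k, n_mirrors=8):
    disks = 2 * n_mirrors
    p = Fraction(1)
    for i in range(1, k):                 # the first failure is always "safe"
        safe = disks - 2 * i              # disks not sharing a mirror with an earlier failure
        if safe <= 0:
            return Fraction(0)
        p *= Fraction(safe, disks - i)    # e.g. 14/15, then 12/14, then 10/13...
    return p

for k in range(2, 6):
    print(f"{k} concurrent failures: pool survives with p = {float(p_all_distinct_mirrors(k)):.3f}")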
Re: [OpenIndiana-discuss] Recommendations for fast storage
On Wed, Apr 17, 2013 at 12:57 PM, Timothy Coalson wrote: > On Wed, Apr 17, 2013 at 7:38 AM, Edward Ned Harvey (openindiana) < > openindi...@nedharvey.com> wrote: > >> You also said the raidz2 will offer more protection against failure, >> because you can survive any two disk failures (but no more.) I would argue >> this is incorrect (I've done the probability analysis before). Mostly >> because the resilver time in the mirror configuration is 8x to 16x faster >> (there's 1/8 as much data to resilver, and IOPS is limited by a single >> disk, not the "worst" of several disks, which introduces another factor up >> to 2x, increasing the 8x as high as 16x), so the smaller resilver window >> means lower probability of "concurrent" failures on the critical vdev. >> We're talking about 12 hours versus 1 week, actual result of my machines >> in production. >> > > Did you also compare the probability of bit errors causing data loss > without a complete pool failure? 2-way mirrors, when one device completely > dies, have no redundancy on that data, and the copy that remains must be > perfect or some data will be lost. On the other hand, raid-z2 will still > have available redundancy, allowing every single block to have a bad read > on any single component disk, without losing data. I haven't done the math > on this, but I seem to recall some papers claiming that this is the more > likely route to lost data on modern disks, by comparing bit error rate and > capacity. Of course, a second outright failure puts raid-z2 in a much > worse boat than 2-way mirrors, which is a reason for raid-z3, but this may > already be a less likely case. Richard Elling wrote a blog post about "mean time to data loss" [1]. A few years later he graphed out a few cases for typical values of resilver times [2]. [1] https://blogs.oracle.com/relling/entry/a_story_of_two_mttdl [2] http://blog.richardelling.com/2010/02/zfs-data-protection-comparison.html Cheers, Jan ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] Recommendations for fast storage
On Wed, Apr 17, 2013 at 11:21 AM, Jim Klimov wrote: > On 2013-04-17 20:09, Jay Heyl wrote: > >> reply. Unless the first device to answer returns garbage (something >>> that doesn't match the expected checksum), other copies are not read >>> as part of this request. >>> >>> >> Ah, that makes much more sense. Thanks for the clarification. Now that you >> put it that way I have to wonder how I ever came under the impression it >> was any other way. >> > > > Well, there are different architectures, so some might do what you > suggested. From what I read just yesterday, RAM mirroring on some > high-end servers works indeed like you described - by reading both > parts and comparing the results, testing ECC if needed, etc. to > figure out the correct memory contents or return an error if both > parts are faulty and can't be trusted (ECC mismatch on both). > > Military, nuclear and space systems often are built like 3 or 5 > computers (odd amount for easier quorum) doing the same calculations > over same inputs, and comparing the results to be sure of them or to > redo the task. > > So I guess it depends on your background - why you thought this of > disk systems ;) Actually, I'm pretty sure I read it somewhere on the internet. My fault for thinking every guy with a blog actually knows what he's talking about. :-) ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] Recommendations for fast storage
On Wed, Apr 17, 2013 at 7:38 AM, Edward Ned Harvey (openindiana) < openindi...@nedharvey.com> wrote: > You also said the raidz2 will offer more protection against failure, > because you can survive any two disk failures (but no more.) I would argue > this is incorrect (I've done the probability analysis before). Mostly > because the resilver time in the mirror configuration is 8x to 16x faster > (there's 1/8 as much data to resilver, and IOPS is limited by a single > disk, not the "worst" of several disks, which introduces another factor up > to 2x, increasing the 8x as high as 16x), so the smaller resilver window > means lower probability of "concurrent" failures on the critical vdev. > We're talking about 12 hours versus 1 week, actual result of my machines > in production. > Did you also compare the probability of bit errors causing data loss without a complete pool failure? 2-way mirrors, when one device completely dies, have no redundancy on that data, and the copy that remains must be perfect or some data will be lost. On the other hand, raid-z2 will still have available redundancy, allowing every single block to have a bad read on any single component disk, without losing data. I haven't done the math on this, but I seem to recall some papers claiming that this is the more likely route to lost data on modern disks, by comparing bit error rate and capacity. Of course, a second outright failure puts raid-z2 in a much worse boat than 2-way mirrors, which is a reason for raid-z3, but this may already be a less likely case. Also, as for time to resilver, I'm guessing that depends largely on where bottlenecks are (it has to read effectively all of the remaining disks in the vdev either way, but can do so in parallel, so ideally it could be the same speed), and how busy the pool is with other things - how does the bandwidth sharing during resilver work? Tim ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
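The bit-error scenario Tim describes is usually estimated from the drive's unrecoverable-read-error (URE) spec: the chance of at least one bad sector while reading the entire surviving mirror half during resilver. In the sketch below (Python) the 10^-14 errors/bit rate and the capacities are illustrative vendor-style figures, not measurements:

def p_any_ure(capacity_tb, ber=1e-14):     # ~1e-14 errors/bit: a common consumer-drive spec
    bits_read = capacity_tb * 1e12 * 8     # resilver must read the whole surviving copy
    return 1 - (1 - ber) ** bits_read

for tb in (1, 2, 4):
    print(f"{tb} TB mirror half: {p_any_ure(tb):.1%} chance of at least one URE during resilver")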
Re: [OpenIndiana-discuss] Recommendations for fast storage
On 2013-04-17 20:09, Jay Heyl wrote: reply. Unless the first device to answer returns garbage (something that doesn't match the expected checksum), other copies are not read as part of this request. Ah, that makes much more sense. Thanks for the clarification. Now that you put it that way I have to wonder how I ever came under the impression it was any other way. Well, there are different architectures, so some might do what you suggested. From what I read just yesterday, RAM mirroring on some high-end servers works indeed like you described - by reading both parts and comparing the results, testing ECC if needed, etc. to figure out the correct memory contents or return an error if both parts are faulty and can't be trusted (ECC mismatch on both). Military, nuclear and space systems often are built like 3 or 5 computers (odd amount for easier quorum) doing the same calculations over same inputs, and comparing the results to be sure of them or to redo the task. So I guess it depends on your background - why you thought this of disk systems ;) //Jim ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] Recommendations for fast storage
On Tue, Apr 16, 2013 at 5:49 PM, Jim Klimov wrote: > On 2013-04-17 02:10, Jay Heyl wrote: > >> Not to get into bickering about semantics, but I asked, "Or am I wrong >> about reads being issued in parallel to all the mirrors in the array?", to >> which you replied, "Yes, in normal case... this assumption is wrong... but >> reads should be in parallel." (Ellipses intended for clarity, not argument >> munging.) If reads are in parallel, then it seems as though my assumption >> is correct. I realize the system will discard data from all but the first >> reads and that using only the first response can improve performance, but >> in terms of number of IOPs, which is where I intended to go with this, it >> seems to me the mirrored system will have at least as many if not more >> than >> the raid-zn system. >> >> Or have I completely misunderstood what you intended to say? >> > > Um, right... I got torn between several letters and forgot the details > of one. So, here's what I replied to with poor wording - *I thought you > meant* "A single read request from a program would be redirected as a > series of parallel requests to mirror components asking for the same > data, whichever one answers first" - this is no, the "wrong" in my > reply. Unless the first device to answer returns garbage (something > that doesn't match the expected checksum), other copies are not read > as part of this request. > Ah, that makes much more sense. Thanks for the clarification. Now that you put it that way I have to wonder how I ever came under the impression it was any other way. ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] Recommendations for fast storage
> From: Sašo Kiselkov [mailto:skiselkov...@gmail.com] > > Raid-Z indeed does stripe data across all > leaf vdevs (minus parity) and does so by splitting the logical block up > into equally sized portions. Jay, there you have it. You asked why use mirrors, and you said you would use raidz2 or raidz3 unless cpu overhead is too much. I recommended using mirrors and avoiding raidzN, and here is the answer why. If you have 16 disks arranged in 8x mirrors, versus 10 disks in raidz2 which stripes across 8 disks plus 2 parity disks, then the serial write of each configuration is about the same; that is, 8x the sustained write speed of a single device. But if you have two or more parallel sequential read threads, then the sequential read speed of the mirrors will be 16x while the raidz2 is only 8x. The mirror configuration can do 8x random write while the raidz2 is only 1x. And the mirror can do 16x random read while the raidz2 is only 1x. In the case you care about the least, they're equal. In the case you care about most, the mirror configuration is 16x faster. You also said the raidz2 will offer more protection against failure, because you can survive any two disk failures (but no more.) I would argue this is incorrect (I've done the probability analysis before). Mostly because the resilver time in the mirror configuration is 8x to 16x faster (there's 1/8 as much data to resilver, and IOPS is limited by a single disk, not the "worst" of several disks, which introduces another factor up to 2x, increasing the 8x as high as 16x), so the smaller resilver window means lower probability of "concurrent" failures on the critical vdev. We're talking about 12 hours versus 1 week, actual result of my machines in production. Also, while it's possible to fault the pool with only 2 failures in the mirror configuration, the probability is against that happening. The first disk failure probability is 1/16 for each disk ... And then if you have a 2nd concurrent failure, there's a 14/15 probability that it occurs on a separately independent (safe) mirror. The 3rd concurrent failure 12/14 chance of being safe. The 4th concurrent failure 10/13 chance of being safe. Etc. The mirror configuration can probably withstand a higher number of failures, and also the resilver window for each failure is smaller. When you look at the total probability of pool failure, they were both like 10^-17 or something like that. In other words, we're splitting hairs but as long as we are, we might as well point out that they're both about the same. ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] Recommendations for fast storage
On 04/17/2013 02:08 AM, Edward Ned Harvey (openindiana) wrote: >> From: Sašo Kiselkov [mailto:skiselkov...@gmail.com] >> >> If you are IOPS constrained, then yes, raid-zn will be slower, simply >> because any read needs to hit all data drives in the stripe. > > Saso, I would expect you to know the answer to this question, probably: > I have heard that raidz is more similar to raid-1e than raid-5. > Meaning, when you write data to raidz, it doesn't get striped across > all devices in the raidz vdev... Rather, two copies of the data get > written to any of the available devices in the raidz. Can you confirm? No, this is not what happens. Raid-Z indeed does stripe data across all leaf vdevs (minus parity) and does so by splitting the logical block up into equally sized portions. A block *can* take up less than the full stripe width for very small blocks or for very wide stripes, both of which should be a rare occurrence. 4k sectored devices change this calculation quite dramatically, which is why I wouldn't recommend using them in pools unless you understand how your workload and raidz geometry will interact and take note of it. See vdev_raidz_map_alloc here: https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/fs/zfs/vdev_raidz.c#L434-L554 for all the fine details (the above description is grossly oversimplified). > If the behavior is to stripe across all the devices in the raidz, > then the raidz iops really can't exceed that of a single device, > because you have to wait for every device to respond before you > have a complete block of data. But if it's more like raid-1e and > individual devices can read independently of each other, then at > least theoretically, the raidz with n-devices in it could return > iops performance on-par with n-times a single disk. As a general rule of thumb, raidz has the IOPS of a single drive. This is not exactly news: https://blogs.oracle.com/roch/entry/when_to_and_not_to Cheers, -- Saso ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
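A grossly simplified sketch (Python) of the split described above; see vdev_raidz_map_alloc for the real logic, which also handles skip sectors and uneven remainders. The disk counts and sector sizes are assumptions for illustration:

def raidz_columns(block_bytes, n_disks, n_parity, ashift=9):
    sector = 1 << ashift
    data_disks = n_disks - n_parity
    data_sectors = -(-block_bytes // sector)      # ceil: whole sectors of block data
    cols_used = min(data_disks, data_sectors)     # small blocks may not span all drives
    per_column = -(-data_sectors // cols_used)    # ceil: sectors per used data column
    return {"data_columns_used": cols_used, "sectors_per_column": per_column,
            "parity_columns": n_parity}

print(raidz_columns(128 * 1024, n_disks=10, n_parity=2))            # 8 columns x 32 sectors each
print(raidz_columns(8 * 1024, n_disks=10, n_parity=2, ashift=12))   # 4K sectors: only 2 columns carry data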
Re: [OpenIndiana-discuss] Recommendations for fast storage
On 17/04/13 02:10, Jay Heyl wrote: > Not to get into bickering about semantics, but I asked, "Or am I > wrong about reads being issued in parallel to all the mirrors in > the array?" Each read is issued only to a (let's say, "random") disk in the mirror, unless the read is faulty. You can check this easily with "zpool iostat -v". -- Jesús Cea Avión, j...@jcea.es - http://www.jcea.es/
Re: [OpenIndiana-discuss] Recommendations for fast storage
On 2013-04-17 02:10, Jay Heyl wrote: Not to get into bickering about semantics, but I asked, "Or am I wrong about reads being issued in parallel to all the mirrors in the array?", to which you replied, "Yes, in normal case... this assumption is wrong... but reads should be in parallel." (Ellipses intended for clarity, not argument munging.) If reads are in parallel, then it seems as though my assumption is correct. I realize the system will discard data from all but the first reads and that using only the first response can improve performance, but in terms of number of IOPs, which is where I intended to go with this, it seems to me the mirrored system will have at least as many if not more than the raid-zn system. Or have I completely misunderstood what you intended to say? Um, right... I got torn between several letters and forgot the details of one. So, here's what I replied to with poor wording - *I thought you meant* "A single read request from a program would be redirected as a series of parallel requests to mirror components asking for the same data, whichever one answers first" - this is no, the "wrong" in my reply. Unless the first device to answer returns garbage (something that doesn't match the expected checksum), other copies are not read as part of this request. Now, if there are many requests on the system issued simultaneously, which is most often the case, then reads from different requests are directed to different disks, but again - one read goes to one disk except pathological cases. It is likely that the system selects a disk to read from based, in part, on its expectation of where the disk head is (i.e. last requested LBA is nearest to the LBA we want now) in order to minimize latency and unproductive time losses. Thus "sequential reads" where requests for nearby sectors come in a succession are likely to be satisfied by a single disk in the mirror, leaving other disks available to satisfy other reads. Copies of a write request however are sent to all disks and committed (flushed) before the synchronous request is accepted as completed (for example, a write-and-commit of a TXG transaction group). Hope this makes my point clearer, it is late here ;) //Jim ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] Recommendations for fast storage
> From: Mehmet Erol Sanliturk [mailto:m.e.sanlit...@gmail.com] > > SSD units are very vulnerable to power cuts during work up to complete > failure which they can not be used any more to complete loss of data . If there are any junky drives out there that fail so dramatically, those are junky and the exception. Just imagine how foolish the engineers would have to be, "Power loss? I didn't think of that... Complete drive failure in power loss is acceptable behavior." Definitely an inaccurate generalization about SSD's. There is nothing inherent about flash memory as compared to magnetic material, that would cause such a thing. I repeat: I'm not saying there's no such thing as a SSD that has such a problem. I'm saying if there is, it's junk. And you can safely assume any good drive doesn't have that problem. > MLC ( Multi-Level Cell ) SSD units have a short life time if they are > continuously written ( they are more suitable to write once ( in a limited > number of writes sense ) - read many ) . It's a fact that NAND has a finite number of write cycles, and it gets slower to write, the more times it's been re-written. It is also a fact that when SSD's were first introduced to the commodity market about 11 years ago, that they failed quickly due to OSes (windows) continually writing the same sectors over and over. But manufacturers have been long since aware of this problem, and solved it by overprovisioning and wear-leveling. Similar to ZFS copy-on-write, which has the ability to logically address some blocks and secretly re-map them to different sectors behind the scenes... SSD's with wear-leveling secretly remap sectors during writes. > SSD units may fail due to write wearing in an unexpected time , making them > very unreliable for mission critical works . Every page has a write counter, which is used to predict failure. A very predictable, and very much *not* unexpected time. ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
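The wear-leveling point above reduces to simple endurance arithmetic; in this sketch (Python) the P/E cycle count, write-amplification factor and daily write volume are all illustrative assumptions:

def years_of_endurance(capacity_gb, pe_cycles, writes_gb_per_day, write_amplification=1.5):
    # Total host data the NAND can absorb before reaching its rated P/E limit.
    total_writable_gb = capacity_gb * pe_cycles / write_amplification
    return total_writable_gb / writes_gb_per_day / 365

# e.g. a 480 GB MLC drive rated for 3000 P/E cycles, absorbing 50 GB of writes per day
print(f"~{years_of_endurance(480, 3000, 50):.0f} years before rated wear-out")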
Re: [OpenIndiana-discuss] Recommendations for fast storage
For the context of ZPL, easy answer below :-) ... On Apr 16, 2013, at 4:12 PM, Timothy Coalson wrote: > On Tue, Apr 16, 2013 at 6:01 PM, Jim Klimov wrote: > >> On 2013-04-16 23:56, Jay Heyl wrote: >> >>> result in more devices being hit for both read and write. Or am I wrong >>> about reads being issued in parallel to all the mirrors in the array? >>> >> >> Yes, in normal case (not scrubbing which makes a point of reading >> everything) this assumption is wrong. Writes do hit all devices >> (mirror halves or raid disks), but reads should be in parallel. >> For mechanical HDDs this allows to double average read speeds >> (or triple for 3-way mirrors, etc.) because different spindles >> begin using their heads in shorter strokes around different areas, >> if there are enough concurrent randomly placed reads. >> > > There is another part to his question, specifically whether a single random > read that falls within one block of the file hits more than one top level > vdev - No. > to put it another way, whether a single block of a file is striped > across top level vdevs. I believe every block is allocated from one and > only one vdev (blocks with ditto copies allocate multiple blocks, ideally > from different vdevs, but this is not the same thing), such that every read > that hits only one file block goes to only one top level vdev unless > something goes wrong badly enough to need a ditto copy. Correct. -- richard -- richard.ell...@richardelling.com +1-760-896-4422 ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] Recommendations for fast storage
> From: Jay Heyl [mailto:j...@frelled.us] > > > So I'm just assuming you're going to build a pool out of SSD's, mirrored, > > perhaps even 3-way mirrors. No cache/log devices. All the ram you can fit > > into the system. > > What would be the logic behind mirrored SSD arrays? With spinning platters > the mirrors improve performance by allowing the fastest of the mirrors to > respond to a particular command to be the one that defines throughput. When you read from a mirror, ZFS doesn't read the same data from both sides of the mirror simultaneously and let them race, wasting bus & memory bandwidth to attempt gaining smaller latency. If you have a single thread doing serial reads, I also have no cause to believe that zfs reads stripes from multiple sides of the mirror to accelerate - rather, it relies on the striping across multiple mirrors or vdev's. But if you have multiple threads requesting independent random read operations that are on the same mirror, I have measured the results that you get very nearly n-times a single disk random read performance by using a n-way mirror and at least n or 2n independent random read threads. > There is no > latency due to head movement or waiting for the proper spot on the disc to > rotate under the heads. Nothing, including ZFS, has such an in-depth knowledge of the inner drive geometry as to know how long is necessary for the rotational latency to come around. Also, rotational latency is almost nothing compared to head seek. For this reason, short-stroking makes a big difference, when you have a data usage pattern that can easily be confined to a small number of adjacent tracks. I believe, if you use a HDD for log device, it's aware of itself and does short-stroking, but I don't actually know. Also, this is really a completely separate subject. ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] Recommendations for fast storage
On Tue, Apr 16, 2013 at 4:01 PM, Jim Klimov wrote: > On 2013-04-16 23:56, Jay Heyl wrote: > >> result in more devices being hit for both read and write. Or am I wrong >> about reads being issued in parallel to all the mirrors in the array? >> > > Yes, in normal case (not scrubbing which makes a point of reading > everything) this assumption is wrong. Writes do hit all devices > (mirror halves or raid disks), but reads should be in parallel. > For mechanical HDDs this allows to double average read speeds > (or triple for 3-way mirrors, etc.) because different spindles > begin using their heads in shorter strokes around different areas, > if there are enough concurrent randomly placed reads. Not to get into bickering about semantics, but I asked, "Or am I wrong about reads being issued in parallel to all the mirrors in the array?", to which you replied, "Yes, in normal case... this assumption is wrong... but reads should be in parallel." (Ellipses intended for clarity, not argument munging.) If reads are in parallel, then it seems as though my assumption is correct. I realize the system will discard data from all but the first reads and that using only the first response can improve performance, but in terms of number of IOPs, which is where I intended to go with this, it seems to me the mirrored system will have at least as many if not more than the raid-zn system. Or have I completely misunderstood what you intended to say? ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] Recommendations for fast storage
> From: Sašo Kiselkov [mailto:skiselkov...@gmail.com] > > If you are IOPS constrained, then yes, raid-zn will be slower, simply > because any read needs to hit all data drives in the stripe. Saso, I would expect you to know the answer to this question, probably: I have heard that raidz is more similar to raid-1e than raid-5. Meaning, when you write data to raidz, it doesn't get striped across all devices in the raidz vdev... Rather, two copies of the data get written to any of the available devices in the raidz. Can you confirm? If the behavior is to stripe across all the devices in the raidz, then the raidz iops really can't exceed that of a single device, because you have to wait for every device to respond before you have a complete block of data. But if it's more like raid-1e and individual devices can read independently of each other, then at least theoretically, the raidz with n-devices in it could return iops performance on-par with n-times a single disk. ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] Recommendations for fast storage
On 04/17/2013 12:08 AM, Richard Elling wrote: > clarification below... > > On Apr 16, 2013, at 2:44 PM, Sašo Kiselkov wrote: > >> On 04/16/2013 11:37 PM, Timothy Coalson wrote: >>> On Tue, Apr 16, 2013 at 4:29 PM, Sašo Kiselkov >>> wrote: >>> If you are IOPS constrained, then yes, raid-zn will be slower, simply because any read needs to hit all data drives in the stripe. This is even worse on writes if the raidz has bad geometry (number of data drives isn't a power of 2). >>> >>> Off topic slightly, but I have always wondered at this - what exactly >>> causes non-power of 2 plus number of parities geometries to be slower, and >>> by how much? I tested for this effect with some consumer drives, comparing >>> 8+2 and 10+2, and didn't see much of a penalty (though the only random test >>> I did was read, our workload is highly sequential so it wasn't important). > > This makes sense, even for more random workloads. > >> >> Because a non-power-of-2 number of drives causes a read-modify-write >> sequence on every (almost) write. HDDs are block devices and they can >> only ever write in increments of their sector size (512 bytes or >> nowadays often 4096 bytes). Using your example above, you divide a 128k >> block by 8, you get 8x16k updates - all nicely aligned on 512 byte >> boundaries, so your drives can write that in one go. If you divide by >> 10, you get an ugly 12.8k, which means if your drives are of the >> 512-byte sector variety, they write 24x 512 sectors and then for the >> last partial sector write, they first need to fetch the sector from the >> patter, modify if in memory and then write it out again. > > This is true for RAID-5/6, but it is not true for ZFS or raidz. Though it has > been > a few years, I did a bunch of tests and found no correlation between the > number > of disks in the set (within boundaries as described in the man page) and > random > performance for raidz. This is not the case for RAID-5/6 where pathologically > bad performance is easy to create if you know the number of disks and stripe > width. > -- richard You are right, and I think I already know where I went wrong, though I'll need to check raidz_map_alloc to confirm. If memory serves me right, raidz actually splits the I/O up so that each stripe component is simply length-aligned and padded out to complete a full sector (otherwise the zio_vdev_child_io would fail in a block-alignment assertion in zio_create here: zio_create(zio_t *pio, spa_t *spa,... { .. ASSERT(P2PHASE(size, SPA_MINBLOCKSIZE) == 0); .. I was probably misremembering the power-of-2 rule from a discussion about 4k sector drives. There the amount of wasted space can be significant, especially on small-block data, e.g. the default 8k volblocksize not being able to scale beyond 2 data drives + parity. Cheers, -- Saso ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] Recommendations for fast storage
On 2013-04-17 01:12, Timothy Coalson wrote: On Tue, Apr 16, 2013 at 6:01 PM, Jim Klimov wrote: On 2013-04-16 23:56, Jay Heyl wrote: result in more devices being hit for both read and write. Or am I wrong about reads being issued in parallel to all the mirrors in the array? Yes, in normal case (not scrubbing which makes a point of reading everything) this assumption is wrong. Writes do hit all devices (mirror halves or raid disks), but reads should be in parallel. For mechanical HDDs this allows to double average read speeds (or triple for 3-way mirrors, etc.) because different spindles begin using their heads in shorter strokes around different areas, if there are enough concurrent randomly placed reads. There is another part to his question, specifically whether a single random read that falls within one block of the file hits more than one top level vdev - to put it another way, whether a single block of a file is striped across top level vdevs. I believe every block is allocated from one and only one vdev (blocks with ditto copies allocate multiple blocks, ideally from different vdevs, but this is not the same thing), such that every read that hits only one file block goes to only one top level vdev unless something goes wrong badly enough to need a ditto copy. I believe so too... I think, striping over top-level vdevs is subject to many tuning and algorithmical influences, and is not as simple as even-odd IOs. Also IIRC there is some size of data (several MBytes) that is preferably sent as sequential IO to one TLVDEV, then it is striped over to another, in order to better utilize the strengths of faster sequential IO vs. lags of seeking. But I may be way off-track here with this "belief" ;) Jim ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] Recommendations for fast storage
On 2013-04-16 23:37, Timothy Coalson wrote: On Tue, Apr 16, 2013 at 4:29 PM, Sašo Kiselkov wrote: If you are IOPS constrained, then yes, raid-zn will be slower, simply because any read needs to hit all data drives in the stripe. This is even worse on writes if the raidz has bad geometry (number of data drives isn't a power of 2). Off topic slightly, but I have always wondered at this - what exactly causes non-power of 2 plus number of parities geometries to be slower, and by how much? I tested for this effect with some consumer drives, comparing 8+2 and 10+2, and didn't see much of a penalty (though the only random test I did was read, our workload is highly sequential so it wasn't important). My take on this is not that these geometries are slower, but that they may be less efficient in terms of overheads at data storage. Say, you write a 16-sector block of userdata to your arrays. In case of 8+2 that would be two full stripes of parity and data. In case of 9+2 that would be a 9+2 and a 7+2 stripe. Access to this data is less balanced, placing more load on some disks which have 2 sectors of this block, and less load on others which have only one sector. It seems more "sad" when (i.e. due to compression) you have 1 or 2 userdata sectors remaining on a second stripe, but must still provide the 2 or 3 sectors of redundancy for this mini stripe. Also, as I found, ZFS raidzN makes precautions to not leave some potentially unusable holes (i.e. 1 or 2 free sectors, where you can't fit parity and data), so it would allocate full stripes when you have sufficiently unlucky stripe lengths just a few sectors shorter than full (i.e. 7+2 above would likely be allocated as 9+2 with zeroed-out extra sectors)... These things do add up to gigabytes, though they can happen on power-of-two sized arrays with compression just as easily (I found this with a 6-disk raidz2, with 4 data disks). Jim ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
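Jim's overhead observation can be sketched with the raidz allocation-size rule: data sectors plus nparity parity sectors per stripe row, rounded up to a multiple of nparity+1 so that no unusable 1-2 sector holes remain. This is a simplified model (Python), and the 6-disk raidz2 with 4K sectors is an assumed example:

def raidz_allocated_sectors(psize_bytes, n_disks, nparity, ashift=9):
    sector = 1 << ashift
    data = -(-psize_bytes // sector)                     # data sectors (ceil)
    rows = -(-data // (n_disks - nparity))               # stripe rows needed (ceil)
    total = data + nparity * rows                        # add parity sectors per row
    return -(-total // (nparity + 1)) * (nparity + 1)    # round up so no 1-2 sector hole is left

# A 12 KB (compressed) block on a 6-disk raidz2 with 4 KB sectors:
# 3 data + 2 parity = 5 sectors, rounded up to 6 -> 24 KB allocated for 12 KB of data.
print(raidz_allocated_sectors(12 * 1024, n_disks=6, nparity=2, ashift=12))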
Re: [OpenIndiana-discuss] Recommendations for fast storage
On Tue, Apr 16, 2013 at 6:01 PM, Jim Klimov wrote: > On 2013-04-16 23:56, Jay Heyl wrote: > >> result in more devices being hit for both read and write. Or am I wrong >> about reads being issued in parallel to all the mirrors in the array? >> > > Yes, in normal case (not scrubbing which makes a point of reading > everything) this assumption is wrong. Writes do hit all devices > (mirror halves or raid disks), but reads should be in parallel. > For mechanical HDDs this allows to double average read speeds > (or triple for 3-way mirrors, etc.) because different spindles > begin using their heads in shorter strokes around different areas, > if there are enough concurrent randomly placed reads. > There is another part to his question, specifically whether a single random read that falls within one block of the file hits more than one top level vdev - to put it another way, whether a single block of a file is striped across top level vdevs. I believe every block is allocated from one and only one vdev (blocks with ditto copies allocate multiple blocks, ideally from different vdevs, but this is not the same thing), such that every read that hits only one file block goes to only one top level vdev unless something goes wrong badly enough to need a ditto copy. Tim ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] Recommendations for fast storage
On 2013-04-16 23:56, Jay Heyl wrote: result in more devices being hit for both read and write. Or am I wrong about reads being issued in parallel to all the mirrors in the array? Yes, in normal case (not scrubbing which makes a point of reading everything) this assumption is wrong. Writes do hit all devices (mirror halves or raid disks), but reads should be in parallel. For mechanical HDDs this allows to double average read speeds (or triple for 3-way mirrors, etc.) because different spindles begin using their heads in shorter strokes around different areas, if there are enough concurrent randomly placed reads. Due to ZFS data structure, you know your target block's expected checksum before you read its data (from any mirror half, or from data disks in a raidzn); then ZFS calculates the checksum of the data it has read and combined, and only if there is a mismatch, it has to read from other disks for redundancy (and fix the detected broken part). HTH, //Jim ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] Recommendations for fast storage
On Tue, 16 Apr 2013, Sašo Kiselkov wrote: SATA and SAS are dedicated point-to-point interfaces so there is no additive bottleneck with more drives as long as the devices are directly connected. Not true. Modern flash storage is quite capable of saturating a 6 Gbps SATA link. SAS has an advantage here, being dual-port natively with active-active load balancing deployed as standard practice. Also please note that SATA is half-duplex, whereas SAS is full-duplex. You did not describe how my statement about not being "additive" is wrong. This is different than per-drive bandwidth being insufficient for latest SSDs. Please expound on "Not true". SAS/SATA are not like old parallel SCSI. Bob
Re: [OpenIndiana-discuss] Recommendations for fast storage
clarification below... On Apr 16, 2013, at 2:44 PM, Sašo Kiselkov wrote: > On 04/16/2013 11:37 PM, Timothy Coalson wrote: >> On Tue, Apr 16, 2013 at 4:29 PM, Sašo Kiselkov wrote: >> >>> If you are IOPS constrained, then yes, raid-zn will be slower, simply >>> because any read needs to hit all data drives in the stripe. This is >>> even worse on writes if the raidz has bad geometry (number of data >>> drives isn't a power of 2). >>> >> >> Off topic slightly, but I have always wondered at this - what exactly >> causes non-power of 2 plus number of parities geometries to be slower, and >> by how much? I tested for this effect with some consumer drives, comparing >> 8+2 and 10+2, and didn't see much of a penalty (though the only random test >> I did was read, our workload is highly sequential so it wasn't important). This makes sense, even for more random workloads. > > Because a non-power-of-2 number of drives causes a read-modify-write > sequence on every (almost) write. HDDs are block devices and they can > only ever write in increments of their sector size (512 bytes or > nowadays often 4096 bytes). Using your example above, you divide a 128k > block by 8, you get 8x16k updates - all nicely aligned on 512 byte > boundaries, so your drives can write that in one go. If you divide by > 10, you get an ugly 12.8k, which means if your drives are of the > 512-byte sector variety, they write 24x 512 sectors and then for the > last partial sector write, they first need to fetch the sector from the > patter, modify if in memory and then write it out again. This is true for RAID-5/6, but it is not true for ZFS or raidz. Though it has been a few years, I did a bunch of tests and found no correlation between the number of disks in the set (within boundaries as described in the man page) and random performance for raidz. This is not the case for RAID-5/6 where pathologically bad performance is easy to create if you know the number of disks and stripe width. -- richard > > I said "almost" every write is affected, but this largely depends on > your workload. If your writes are large async writes, then this RMW > cycle only happens at the end of the transaction commit (simplifying a > bit, but you get the idea), which is pretty small. However, if you are > doing many small updates in different locations (e.g. writing the ZIL), > this can significantly amplify the load. > > Cheers, > -- > Saso > > ___ > OpenIndiana-discuss mailing list > OpenIndiana-discuss@openindiana.org > http://openindiana.org/mailman/listinfo/openindiana-discuss -- richard.ell...@richardelling.com +1-760-896-4422 ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] Recommendations for fast storage
ZFS data blocks are also a power of two, which means that if you have 1, 2, 4, 8, 16, 32, ... data disks, every write is evenly spread over all disks. If you add one disk, e.g. going from 8 to 9 data disks, one disk may be left out of a given read/write. Does that mean 9 data disks are slower than 8 disks? No, 9 disks are faster - maybe not 1/9 faster, but faster. So think of this rule more as a myth. Add RAID redundancy disks to that count: for example, with 8 data disks, add one disk for Z1 (9), 2 disks for Z2 (10) and 3 disks for Z3 (11) disks per vdev. On 16.04.2013 at 23:37, Timothy Coalson wrote: > On Tue, Apr 16, 2013 at 4:29 PM, Sašo Kiselkov wrote: > >> If you are IOPS constrained, then yes, raid-zn will be slower, simply >> because any read needs to hit all data drives in the stripe. This is >> even worse on writes if the raidz has bad geometry (number of data >> drives isn't a power of 2). >> > > Off topic slightly, but I have always wondered at this - what exactly > causes non-power of 2 plus number of parities geometries to be slower, and > by how much? I tested for this effect with some consumer drives, comparing > 8+2 and 10+2, and didn't see much of a penalty (though the only random test > I did was read, our workload is highly sequential so it wasn't important). > > Tim
Re: [OpenIndiana-discuss] Recommendations for fast storage
On Tue, Apr 16, 2013 at 4:44 PM, Sašo Kiselkov wrote: > On 04/16/2013 11:37 PM, Timothy Coalson wrote: > > On Tue, Apr 16, 2013 at 4:29 PM, Sašo Kiselkov >wrote: > > > >> If you are IOPS constrained, then yes, raid-zn will be slower, simply > >> because any read needs to hit all data drives in the stripe. This is > >> even worse on writes if the raidz has bad geometry (number of data > >> drives isn't a power of 2). > >> > > > > Off topic slightly, but I have always wondered at this - what exactly > > causes non-power of 2 plus number of parities geometries to be slower, > and > > by how much? I tested for this effect with some consumer drives, > comparing > > 8+2 and 10+2, and didn't see much of a penalty (though the only random > test > > I did was read, our workload is highly sequential so it wasn't > important). > > Because a non-power-of-2 number of drives causes a read-modify-write > sequence on every (almost) write. HDDs are block devices and they can > only ever write in increments of their sector size (512 bytes or > nowadays often 4096 bytes). Using your example above, you divide a 128k > block by 8, you get 8x16k updates - all nicely aligned on 512 byte > boundaries, so your drives can write that in one go. If you divide by > 10, you get an ugly 12.8k, which means if your drives are of the > 512-byte sector variety, they write 24x 512 sectors and then for the > last partial sector write, they first need to fetch the sector from the > patter, modify if in memory and then write it out again. > > I said "almost" every write is affected, but this largely depends on > your workload. If your writes are large async writes, then this RMW > cycle only happens at the end of the transaction commit (simplifying a > bit, but you get the idea), which is pretty small. However, if you are > doing many small updates in different locations (e.g. writing the ZIL), > this can significantly amplify the load. Okay, I get the carryover of partial stripe causing problems, that makes sense, and at least has implications on space efficiency given that ZFS mainly uses power of 2 block sizes. However, I was not under the impression that ZFS ever uses partial sectors, that instead it uses fewer devices in the final stripe, ie, it would be split 10+2, 10+2...6+2. If what you say is true, I'm not sure how ZFS both manages to address halfway through a sector (if it must keep that old partial sector, it must be used somewhere, yes?), and yet has problems with changing sector sizes (the infamous ashift). Are you perhaps thinking of block device style software raid, where you need to ensure that even non-useful bits have correct parity computed? Tim ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
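Tim's reading matches the way raidz allocation is usually described: a block is carved into ashift-sized columns, and the final stripe simply spans fewer data devices, so no device ever receives a partial sector. The following is a simplified model of that layout in Python, not actual ZFS code; the 32 KiB block, ashift=9 and 10+2 geometry are assumptions chosen to match the thread:

def raidz_stripes(block_size, data_disks, parity, ashift=9):
    """Split a block into stripes of ashift-sized columns (simplified model;
    real raidz also rounds the allocation up to a multiple of (parity + 1)
    sectors, which this sketch ignores)."""
    sector = 1 << ashift
    columns = -(-block_size // sector)      # ceil division: number of data sectors
    stripes = []
    while columns > 0:
        width = min(columns, data_disks)
        stripes.append((width, parity))     # (data columns, parity columns)
        columns -= width
    return stripes

# Example from the thread: a 32 KiB block on a 10+2 raidz2 with 512-byte sectors.
for data_cols, parity_cols in raidz_stripes(32 * 1024, data_disks=10, parity=2):
    print(f"stripe: {data_cols} data sectors + {parity_cols} parity sectors")

# 32 KiB = 64 sectors -> six 10-wide stripes and one 4-wide stripe,
# i.e. 10+2 six times and then 4+2 -- no partial sectors on any device.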
Re: [OpenIndiana-discuss] Recommendations for fast storage
On Tue, Apr 16, 2013 at 2:25 PM, Timothy Coalson wrote: > On Tue, Apr 16, 2013 at 3:48 PM, Jay Heyl wrote: > > > My question about the rationale behind the suggestion of mirrored SSD > > arrays was really meant to be more in relation to the question from the > OP. > > I don't see how mirrored arrays of SSDs would be effective in his > > situation. > > > > There is another detail here to keep in mind: ZFS checks checksums on every > read from storage, and with raid-zn used with block sizes that give it more > capacity than mirroring (that is, data blocks are large enough that they > get split across multiple data sectors and therefore devices, instead of > degenerate single data sector plus parity sector(s) - OP mentioned 32K > blocks, so they should get split), this means each random filesystem read > that isn't cached hits a large number of devices in a raid-zn vdev, but > only one device in a mirror vdev (unless ZFS splits these reads across > mirrors, but even then it is still fewer devices hit). If you are limited > by IOPS of the devices, then this could make raid-zn slower. > I'm getting a sense of comparing apples to oranges here, but I do see your point about the raid-zn always requiring reads from more devices due to the parity. OTOH, it was my impression that read operations on n-way mirrors are always issued to each of the 'n' mirrors. Just for the sake of argument, let's say we need room for 1TB of storage. For raid-z2 we use 4x500GB devices. For the mirrored setup we have two mirrors each with 2x500GB devices. Reads to the raid-z2 system will hit four devices. If my assumption is correct, reads to the mirrored system will also hit four devices. If we go to a 3-way mirror, reads would hit six devices. In all but degenerate cases, mirrored arrangements are going to include more drives for the same amount of usable storage, so it seems they should result in more devices being hit for both read and write. Or am I wrong about reads being issued in parallel to all the mirrors in the array? ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
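For a single uncached random read, the device counts in this 1 TB example work out roughly as follows. The sketch assumes a raidz read touches all data disks in its vdev (parity is only read on error) and that a mirror read is serviced by a single side of one mirror rather than every copy; that is the usual description of ZFS behaviour, but it is worth verifying on a real pool:

configs = {
    "raidz2 of 4x500GB (2 data + 2 parity)": (4, 2),
    "2x 2-way mirror, 4x500GB":              (4, 1),
    "2x 3-way mirror, 6x500GB":              (6, 1),
}

for name, (total_drives, drives_per_read) in configs.items():
    print(f"{name}: {total_drives} drives in the pool, "
          f"~{drives_per_read} drive(s) busy per random read")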
Re: [OpenIndiana-discuss] Recommendations for fast storage
On 04/16/2013 11:37 PM, Timothy Coalson wrote: > On Tue, Apr 16, 2013 at 4:29 PM, Sašo Kiselkov wrote: > >> If you are IOPS constrained, then yes, raid-zn will be slower, simply >> because any read needs to hit all data drives in the stripe. This is >> even worse on writes if the raidz has bad geometry (number of data >> drives isn't a power of 2). >> > > Off topic slightly, but I have always wondered at this - what exactly > causes non-power of 2 plus number of parities geometries to be slower, and > by how much? I tested for this effect with some consumer drives, comparing > 8+2 and 10+2, and didn't see much of a penalty (though the only random test > I did was read, our workload is highly sequential so it wasn't important). Because a non-power-of-2 number of drives causes a read-modify-write sequence on (almost) every write. HDDs are block devices and they can only ever write in increments of their sector size (512 bytes or nowadays often 4096 bytes). Using your example above, if you divide a 128k block by 8, you get 8x16k updates - all nicely aligned on 512 byte boundaries, so your drives can write that in one go. If you divide by 10, you get an ugly 12.8k, which means if your drives are of the 512-byte sector variety, they write 25x 512-byte sectors and then, for the last partial sector write, they first need to fetch the sector from the platter, modify it in memory and then write it out again. I said "almost" every write is affected, but this largely depends on your workload. If your writes are large async writes, then this RMW cycle only happens at the end of the transaction commit (simplifying a bit, but you get the idea), which is pretty small. However, if you are doing many small updates in different locations (e.g. writing the ZIL), this can significantly amplify the load. Cheers, -- Saso ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] Recommendations for fast storage
On Tue, Apr 16, 2013 at 4:29 PM, Sašo Kiselkov wrote: > If you are IOPS constrained, then yes, raid-zn will be slower, simply > because any read needs to hit all data drives in the stripe. This is > even worse on writes if the raidz has bad geometry (number of data > drives isn't a power of 2). > Off topic slightly, but I have always wondered at this - what exactly causes non-power of 2 plus number of parities geometries to be slower, and by how much? I tested for this effect with some consumer drives, comparing 8+2 and 10+2, and didn't see much of a penalty (though the only random test I did was read, our workload is highly sequential so it wasn't important). Tim ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] Recommendations for fast storage
On 04/16/2013 11:25 PM, Timothy Coalson wrote: > On Tue, Apr 16, 2013 at 3:48 PM, Jay Heyl wrote: > >> My question about the rationale behind the suggestion of mirrored SSD >> arrays was really meant to be more in relation to the question from the OP. >> I don't see how mirrored arrays of SSDs would be effective in his >> situation. >> > > There is another detail here to keep in mind: ZFS checks checksums on every > read from storage, and with raid-zn used with block sizes that give it more > capacity than mirroring (that is, data blocks are large enough that they > get split across multiple data sectors and therefore devices, instead of > degenerate single data sector plus parity sector(s) - OP mentioned 32K > blocks, so they should get split), this means each random filesystem read > that isn't cached hits a large number of devices in a raid-zn vdev, but > only one device in a mirror vdev (unless ZFS splits these reads across > mirrors, but even then it is still fewer devices hit). If you are limited > by IOPS of the devices, then this could make raid-zn slower. If you are IOPS constrained, then yes, raid-zn will be slower, simply because any read needs to hit all data drives in the stripe. This is even worse on writes if the raidz has bad geometry (number of data drives isn't a power of 2). Cheers, -- Saso ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] Recommendations for fast storage
On Tue, Apr 16, 2013 at 3:48 PM, Jay Heyl wrote: > My question about the rationale behind the suggestion of mirrored SSD > arrays was really meant to be more in relation to the question from the OP. > I don't see how mirrored arrays of SSDs would be effective in his > situation. > There is another detail here to keep in mind: ZFS checks checksums on every read from storage, and with raid-zn used with block sizes that give it more capacity than mirroring (that is, data blocks are large enough that they get split across multiple data sectors and therefore devices, instead of degenerate single data sector plus parity sector(s) - OP mentioned 32K blocks, so they should get split), this means each random filesystem read that isn't cached hits a large number of devices in a raid-zn vdev, but only one device in a mirror vdev (unless ZFS splits these reads across mirrors, but even then it is still fewer devices hit). If you are limited by IOPS of the devices, then this could make raid-zn slower. Disclaimer: this is theory, I haven't tested this in practice, nor have I done any math to see if it should matter to SSDs. However, since it is a configuration question rather than a hardware question, it may be possible to acquire (some of) the hardware first and test both setups before deciding. Tim ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
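A back-of-the-envelope way to express this IOPS argument, with assumed per-device numbers rather than measurements:

# Rough random-read IOPS for the same 12 drives in two layouts, assuming each
# drive sustains a fixed number of small random reads per second, every read
# misses the cache, a raidz read keeps all data drives in its vdev busy, and a
# mirror read keeps one drive busy. Modelling assumptions, not measurements.

IOPS_PER_DEVICE = 200          # assumed per-drive random read IOPS

def pool_read_iops(vdevs, drives_per_vdev, drives_busy_per_read):
    return vdevs * drives_per_vdev * IOPS_PER_DEVICE / drives_busy_per_read

print("1x raidz2 (10 data + 2 parity):", pool_read_iops(1, 12, 10))   # ~240 reads/s
print("6x 2-way mirrors              :", pool_read_iops(6, 2, 1))     # ~2400 reads/s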
Re: [OpenIndiana-discuss] Recommendations for fast storage
On 04/16/2013 10:57 PM, Bob Friesenhahn wrote: > On Tue, 16 Apr 2013, Jay Heyl wrote: >> >> It's actually not all that difficult to saturate a 6Gb/s pathway with ZFS >> when there are multiple storage devices on the other end of that path. No >> single HDD today is going to come close to needing that full 6Gb/s, >> but put >> four or five of them hanging off that same path and that ultra-super >> highway starts looking pretty congested. Put SSDs on the other end and >> the >> 6Gb/s pathway is going to quickly become your bottleneck. > > SATA and SAS are dedicated point-to-point interfaces so there is no > additive bottleneck with more drives as long as the devices are directly > connected. Not true. Modern flash storage is quite capable of saturating a 6 Gbps SATA link. SAS has an advantage here, being dual-port natively with active-active load balancing deployed as standard practice. Also please note that SATA is half-duplex, whereas SAS is full-duplex. The problem with SATA vs SAS for flash storage is that there are, as yet, no flash devices of the "NL-SAS" kind. By this I mean drives that are only about 10-20% more expensive than their SATA counterparts, offering native SAS connectivity, but not top-notch "enterprise" features and/or performance. This situation existed in HDDs not long ago: you had 7k2 SATA and 10k/15k SAS, but no 7k2 SAS. That's why we had to do all that nonsense with SAS to SATA interposers (I have an old Sun J4200 with 1TB SATA drives that had an interposer on each of the 12 drives). Since then, NL-SAS largely made this route obsolete, so now I just buy 7k2 NL-SAS drives and skip the whole interposer thing. Now if any of the big storage drive makers got their shit together and started offering flash storage with native SAS at slightly above SATA prices, I'd be delighted. Trouble is, the manufacturers seem to be trying to position SAS SSDs as "even more expensive/performing than SAS HDDs" types of products. When I can buy a 512GB SATA SSD for the price of a 600GB SAS drive, that seems a strange proposition indeed... -- Saso ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] Recommendations for fast storage
On Tue, 16 Apr 2013, Jay Heyl wrote: It's actually not all that difficult to saturate a 6Gb/s pathway with ZFS when there are multiple storage devices on the other end of that path. No single HDD today is going to come close to needing that full 6Gb/s, but put four or five of them hanging off that same path and that ultra-super highway starts looking pretty congested. Put SSDs on the other end and the 6Gb/s pathway is going to quickly become your bottleneck. SATA and SAS are dedicated point-to-point interfaces so there is no additive bottleneck with more drives as long as the devices are directly connected. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] Recommendations for fast storage
On Tue, Apr 16, 2013 at 11:54 AM, Jim Klimov wrote: > On 2013-04-16 20:30, Jay Heyl wrote: > >> What would be the logic behind mirrored SSD arrays? With spinning platters >> the mirrors improve performance by allowing the fastest of the mirrors to >> respond to a particular command to be the one that defines throughput. >> With >> > > Well, to think up a rationale: it is quite possible to saturate a bus > or an HBA with SSDs, leading to increased latency in case of intense > IO just because some tasks (data packets) are waiting in queue waiting > for the bottleneck to dissolve. If another side of the mirror has a > different connection (another HBA, another PCI bus) then IOs can go > there - increasing overall performance. > This strikes me as a strong argument for carefully planning the arrangement of storage devices of any sort in relation to HBAs and buses. It seems significantly less strong as an argument for a mirror _maybe_ having a different connection and responding faster. My question about the rationale behind the suggestion of mirrored SSD arrays was really meant to be more in relation to the question from the OP. I don't see how mirrored arrays of SSDs would be effective in his situation. Personally, I'd go with RAID-Z2 or RAID-Z3 unless the computational load on the CPU is especially high. This would give you as good as or better fault protection than mirrors at significantly less cost. Indeed, given his scenario of write early, read often later on, I might even be tempted to go for the new TLC SSDs from Samsung. For this particular use the much reduced "lifetime" of the devices would probably not be a factor at all. OTOH, given the almost-no-limits budget, shaving $100 here or there is probably not a big consideration. (And just to be clear, I would NOT recommend the TLC SSDs for a more general solution. It was specifically the write-few, read-many scenario that made me think of them.) Basically, this answer stems from logic which applies to "why would we > need 6Gbit/s on HDDs?" Indeed, HDDs won't likely saturate their buses > with even sequential reads. The link speed really applies to the bursts > of IO between the system and HDD's caches. Double bus speed roughly > halves the time a HDD needs to keep the bus busy for its portion of IO. > And when there are hundreds of disks sharing a resource (an expander > for example), this begins to matter. It's actually not all that difficult to saturate a 6Gb/s pathway with ZFS when there are multiple storage devices on the other end of that path. No single HDD today is going to come close to needing that full 6Gb/s, but put four or five of them hanging off that same path and that ultra-super highway starts looking pretty congested. Put SSDs on the other end and the 6Gb/s pathway is going to quickly become your bottleneck. ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
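The saturation point is simple arithmetic once several devices share one 6 Gb/s path, for example behind an expander or a single HBA uplink; for directly attached drives, each device gets its own lane as Bob notes. The per-device streaming rates below are assumptions for illustration:

# Aggregate demand vs. one shared 6 Gb/s lane. SATA/SAS at 6 Gb/s uses 8b/10b
# encoding, so usable payload is roughly 600 MB/s per lane. Per-device rates
# are assumed, not measured.

LANE_MB_S = 600

for device, per_dev_mb_s, count in [("7200rpm HDD", 150, 5), ("SATA SSD", 500, 5)]:
    aggregate = per_dev_mb_s * count
    verdict = "bottlenecked" if aggregate > LANE_MB_S else "fits"
    print(f"{count}x {device}: {aggregate} MB/s offered vs {LANE_MB_S} MB/s available -> {verdict}")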
Re: [OpenIndiana-discuss] Recommendations for fast storage
Mehmet Erol Sanliturk wrote: I am not an expert of this subject , but with respect to my readings in some e-mails in different mailing lists and from some relevant pages in Wikipedia about SSD drives , the following points are mentioned about SSD disadvantages ( even for "Enterprise" labeled drives ) : SSD units are very vulnerable to power cuts during work up to complete failure which they can not be used any more to complete loss of data . That's why some of them include their own momentary power store, or in some systems, the system has a momentary power store to keep them powered for a period after the last write operation. MLC ( Multi-Level Cell ) SSD units have a short life time if they are continuously written ( they are more suitable to write once ( in a limited number of writes sense ) - read many ) . SLC ( Single-Level Cell ) SSD units have much more long life span , but they are expensive with respect to MLC SSD units . SSD units may fail due to write wearing in an unexpected time , making them very unreliable for mission critical works . All the Enterprise grade SSDs I've used can tell you how far through their life they are (in terms of write wearing). Some of the monitoring tools pick this up and warn you when you're down to some threshold, such as 20% left. Secondly, when they wear out, they fail to write (effectively become write protected). So you find out before they confirm committing your data, and you can still read all the data back. This is generally the complete opposite of the failure modes of hard drives, although like any device, the SSD might fail for other reasons. I have not played with consumer grade drives. Due to the above points ( they may be wrong perhaps ) personally I would select revolving plate SAS disks and up to now I did not buy any SSD for these reasons . The above points are a possible disadvantages set for consideration . The extra cost of using loads of short stroked 15k drives to get anywhere near SSD performance is generally prohibitive. ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] Recommendations for fast storage
On 2013-04-16 19:17, Mehmet Erol Sanliturk wrote: I am not an expert of this subject , but with respect to my readings in some e-mails in different mailing lists and from some relevant pages in Wikipedia about SSD drives , the following points are mentioned about SSD disadvantages ( even for "Enterprise" labeled drives ) : My awareness in the subject is of similar nature, but with different results someplace... Here goes: SSD units are very vulnerable to power cuts during work up to complete failure which they can not be used any more to complete loss of data . Yes, maybe, for some vendors. Information is scarce about which ones are better in practical reliability, leading to requests for info like this thread. Still, some vendors make a living by selling expensive gear into critical workloads, and are thought to perform well. One factor, though not always a guarantee, of proper end-of-work in case of a power-cut, is presence of either batteries/accumulators, or capacitors, which power the device long enough for it to save its caches, metadata, etc. Then the mileage varies in how well each vendor does it. MLC ( Multi-Level Cell ) SSD units have a short life time if they are continuously written ( they are more suitable to write once ( in a limited number of writes sense ) - read many ) . SLC ( Single-Level Cell ) SSD units have much more long life span , but they are expensive with respect to MLC SSD units . I hear SLC is also faster due to its simpler design. The price stems from the requirement to have more cells than MLC to implement the same amount of storage bits. Also there are now some new designs like eMLC which are young and "untested", but are said to have MLC price and SLC reliability. With the decrease of feature sizes in the manufacturing process, diffusion and Brownian motion of atoms play an increasingly greater role. Indeed, while early SSDs boasted tens and hundreds of thousands of rewrite cycles, now 5-10k is good. But faster. SSD units may fail due to write wearing in an unexpected time , making them very unreliable for mission critical works . For this reason there is over-provisioning. The SSD firmware detects unreliable chips and excludes them from use, relocating data onto spare chips. Also there is wear-leveling, where the firmware tries to make sure that all chips are utilized more or less equally so that on average the device lives longer. Basically, an SSD (unlike a normal USB Flash key) implements a RAID over tens of chips with intimate knowledge and diagnostic mechanisms over the storage pieces. Overall, vendors now often rate their devices in gigabytes of writes over their lifetime, or in full rewrites of the device. Last year we had a similar discussion on-list, regarding then-new Intel DC S3700 http://comments.gmane.org/gmane.os.solaris.opensolaris.zfs/50424 and it struck me that in practical terms they boasted "Endurance Rating - 10 drive writes/day over 5 years". That is a lot for many use-cases. They are also relatively pricey, at $2.5/gb linearly from 100G to 800G devices (in a local webshop here). Due to the above points ( they may be wrong perhaps ) personally I would select revolving plate SAS disks and up to now I did not buy any SSD for these reasons . The above points are a possible disadvantages set for consideration . They are not wrong in general, and there are any number of examples where bad things do happen. But there are devices which are said to successfully work around the fundamental drawbacks with some other technology, such as firmware and capacitors and so on. 
It is indeed not yet a subject and market to be careless with, by taking just any device off the shelf and expecting it to perform well and live long. Also it is beneficial to do some homework during system configuration and reduce unnecessary writes to the SSDs - by moving logs out of the rpool, disabling atime updates and so on. There are things an SSD is good for, and some things HDDs are better at (or are commonly thought to be) - i.e. price and longevity past infant death toll, and the choice of components does depend on expected system utilization as well as performance requirements as well as how much you're ready to cash up for that. All that said, I haven't yet touched an SSD so far, but mostly due to financial reasons with both dayjob and home rigs... //Jim ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
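The endurance rating quoted above translates into a large absolute write budget. A small worked example for the 800 GB model mentioned, ignoring write amplification:

# Converting a rating of "10 drive writes per day over 5 years" into total
# host writes. Simple arithmetic on the quoted rating; real endurance also
# depends on write amplification and the workload, which this ignores.

capacity_gb = 800
dwpd = 10                  # drive writes per day (quoted rating)
years = 5

total_pb = capacity_gb * dwpd * 365 * years / 1_000_000
print(f"~{total_pb:.1f} PB of host writes over the rated life")   # ~14.6 PB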
Re: [OpenIndiana-discuss] Recommendations for fast storage
On 2013-04-16 20:30, Jay Heyl wrote: What would be the logic behind mirrored SSD arrays? With spinning platters the mirrors improve performance by allowing the fastest of the mirrors to respond to a particular command to be the one that defines throughput. With Well, to think up a rationale: it is quite possible to saturate a bus or an HBA with SSDs, leading to increased latency in case of intense IO just because some tasks (data packets) are waiting in a queue for the bottleneck to dissolve. If another side of the mirror has a different connection (another HBA, another PCI bus) then IOs can go there - increasing overall performance. Basically, this answer stems from logic which applies to "why would we need 6Gbit/s on HDDs?" Indeed, HDDs won't likely saturate their buses with even sequential reads. The link speed really applies to the bursts of IO between the system and the HDD's caches. Doubling the bus speed roughly halves the time a HDD needs to keep the bus busy for its portion of IO. And when there are hundreds of disks sharing a resource (an expander for example), this begins to matter. HTH, //Jim Klimov ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] Recommendations for fast storage
On Mon, Apr 15, 2013 at 5:00 AM, Edward Ned Harvey (openindiana) < openindi...@nedharvey.com> wrote: > > So I'm just assuming you're going to build a pool out of SSD's, mirrored, > perhaps even 3-way mirrors. No cache/log devices. All the ram you can fit > into the system. What would be the logic behind mirrored SSD arrays? With spinning platters the mirrors improve performance by allowing the fastest of the mirrors to respond to a particular command to be the one that defines throughput. With SSDs, they all should respond in basically the same time. There is no latency due to head movement or waiting for the proper spot on the disc to rotate under the heads. The improvement in read performance seen in mirrored spinning platters should not be present with SSDs. Admittedly, this is from a purely theoretical perspective. I've never assembled an SSD array to compare mirrored vs RAID-Zx performance. I'm curious if you're aware of something I'm overlooking. ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] Recommendations for fast storage
I am not an expert on this subject, but based on my reading of e-mails on various mailing lists and some relevant Wikipedia pages about SSD drives, the following disadvantages are mentioned for SSDs (even for "Enterprise" labeled drives): SSD units are very vulnerable to power cuts during operation, ranging up to complete failure (after which they cannot be used any more) or complete loss of data. MLC (Multi-Level Cell) SSD units have a short lifetime if they are continuously written (they are more suited to write-once, in a limited-number-of-writes sense, read-many use). SLC (Single-Level Cell) SSD units have a much longer life span, but they are expensive compared to MLC SSD units. SSD units may fail from write wear at an unexpected time, making them very unreliable for mission-critical work. Due to the above points (they may be wrong, perhaps), I would personally select spinning-platter SAS disks, and up to now I have not bought any SSDs for these reasons. The above points are a possible set of disadvantages for consideration. Thank you very much. Mehmet Erol Sanliturk ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] Recommendations for fast storage (OpenIndiana-discuss Digest, Vol 33, Issue 20)
some of these points are a bit dated. Allow me to make some updates. I'm sure that you are aware that most 10gig switches these days are cut through and not store and forward. That's Arista, HP, Dell Force10, Mellanox, and IBM/Blade. Cisco has a mix of things, but they aren't really in the low latency space. The 10g and 40g port to port forwarding is in nanoseconds. buffering is mostly reserved to carrier operations anymore, and even there it is becoming less common because of the toll it causes to things like IPVideo and VOIP. Buffers are good for web farms, still, and to a certain extent storage servers or WAN links where there is a high degree of contention from disparate traffic. At a physical level, the signalling of IB compared to Ethernet (10g+) is very similar, which is why Mellanox can make a single chip that does 10gbit 40gbit, and QDR and FDR infiniband on any port. there are also a fair number of vendors that support RDMA in ethernet NIC now, like SolarFlare with Onboot technology. The main reason for lowest achievable latency is higher speed. Latency is roughly equivalent to the inversion of bandwidth. But, the higher levels of protocols that you stack on top contribute much more than the hardware theoretical minimums or maximums. TCP/IP is a killer in terms of adding overhead. That's why there are protocols like ISER, SRP, and friends. RDMA is much faster than the kernel overhead induced by TCP session setups and other host side user/kernel boundaries and buffering. PCI latency is also higher than the port to port latency on a good 10g switch, nevermind 40 or FDR infiniband. There is even a special layer that you can write custom protocols to on Infiniband called Verbs for lowering latency further. Infiniband is inherently a layer1 and 2 protocol, and the subnet manager (software) is resposible for setting up all virtual circuits (routes between hosts on the fabric) and rerouting when a path goes bad. Also, the link aggregation, as you mention, is rock solid and amazingly good. Auto rerouting is fabulous and super fast. But, you don't get layer3. TCP over IB works out of the box, but adds large overhead. Still, it does make it possible that you can have IB native and IP over IB with gateways to a TCP network with a single cable. That's pretty cool. Sent from my android device. -Original Message- From: "Edward Ned Harvey (openindiana)" To: Discussion list for OpenIndiana Sent: Tue, 16 Apr 2013 10:49 AM Subject: Re: [OpenIndiana-discuss] Recommendations for fast storage (OpenIndiana-discuss Digest, Vol 33, Issue 20) > From: Bob Friesenhahn [mailto:bfrie...@simple.dallas.tx.us] > > It would be difficult to believe that 10Gbit Ethernet offers better > bandwidth than 56Gbit Infiniband (the current offering). The swiching > model is quite similar. The main reason why IB offers better latency > is a better HBA hardware interface and a specialized stack. 5X is 5X. Put another way, the reason infiniband is so much higher throughput and lower latency than ethernet is because the switching (at the physical layer) is completely different from ethernet, and messages are passed directly from user-level to user-level on remote system ram via RDMA, bypassing the OSI layer model and other kernel overhead. I read a paper from vmware, where they implemented RDMA over ethernet and doubled the speed of vmotion (but still not as fast as infiniband, by like 4x.) 
Beside the bypassing of OSI layers and kernel latency, IB latency is lower because Ethernet switches use store-and-forward buffering managed by the backplane in the switch, in which a sender sends a packet to a buffer on the switch, which then pushes it through the backplane, and finally to another buffer on the destination. IB uses cross-bar, or cut-through switching, in which the sending host channel adapter signals the destination address to the switch, then waits for the channel to be opened. Once the channel is opened, it stays open, and the switch in between is nothing but signal amplification (as well as additional virtual lanes for congestion management, and other functions). The sender writes directly to RAM on the destination via RDMA, no buffering in between. Bypassing the OSI layer model. Hence much lower latency. IB also has native link aggregation into data-striped lanes, hence the 1x, 2x, 4x, 16x designations, and the 40Gbit specifications. Something which is quasi-possible in ethernet via LACP, but not as good and not the same. IB guarantees packets delivered in the right order, with native congestion control as compared to ethernet which may drop packets and TCP must detect and retransmit... Ethernet includes a lot of support for IP addressing, and variable link speeds (some 10Gbit, 10/100, 1G etc) and all of this asynchronous. For these reasons, IB is not a suitable replacement for IP commun
Re: [OpenIndiana-discuss] Recommendations for fast storage (OpenIndiana-discuss Digest, Vol 33, Issue 20)
> From: Bob Friesenhahn [mailto:bfrie...@simple.dallas.tx.us] > > It would be difficult to believe that 10Gbit Ethernet offers better > bandwidth than 56Gbit Infiniband (the current offering). The swiching > model is quite similar. The main reason why IB offers better latency > is a better HBA hardware interface and a specialized stack. 5X is 5X. Put another way, the reason infiniband is so much higher throughput and lower latency than ethernet is because the switching (at the physical layer) is completely different from ethernet, and messages are passed directly from user-level to user-level on remote system ram via RDMA, bypassing the OSI layer model and other kernel overhead. I read a paper from vmware, where they implemented RDMA over ethernet and doubled the speed of vmotion (but still not as fast as infiniband, by like 4x.) Beside the bypassing of OSI layers and kernel latency, IB latency is lower because Ethernet switches use store-and-forward buffering managed by the backplane in the switch, in which a sender sends a packet to a buffer on the switch, which then pushes it through the backplane, and finally to another buffer on the destination. IB uses cross-bar, or cut-through switching, in which the sending host channel adapter signals the destination address to the switch, then waits for the channel to be opened. Once the channel is opened, it stays open, and the switch in between is nothing but signal amplification (as well as additional virtual lanes for congestion management, and other functions). The sender writes directly to RAM on the destination via RDMA, no buffering in between. Bypassing the OSI layer model. Hence much lower latency. IB also has native link aggregation into data-striped lanes, hence the 1x, 2x, 4x, 16x designations, and the 40Gbit specifications. Something which is quasi-possible in ethernet via LACP, but not as good and not the same. IB guarantees packets delivered in the right order, with native congestion control as compared to ethernet which may drop packets and TCP must detect and retransmit... Ethernet includes a lot of support for IP addressing, and variable link speeds (some 10Gbit, 10/100, 1G etc) and all of this asynchronous. For these reasons, IB is not a suitable replacement for IP communications done on ethernet, with a lot of variable peer-to-peer and broadcast traffic. IB is designed for networks where systems want to establish connections to other systems, and those connections remain mostly statically connected. Primarily clustering & storage networks. Not primarily TCP/IP. ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
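The store-and-forward penalty described here is just the per-hop serialization delay, which a cut-through switch largely avoids (as the earlier reply in this thread notes, most current 10G switches are cut-through). A rough calculation with assumed frame sizes and link rates:

# Per-hop serialization delay of a store-and-forward switch: the whole frame
# must arrive before it can be forwarded, so each hop adds frame_bits / rate.
# A cut-through switch starts forwarding after the header, so its added
# latency is roughly constant regardless of frame size. Frame sizes and link
# rates below are assumptions for illustration.

def store_and_forward_us(frame_bytes, gbit_per_s):
    return frame_bytes * 8 / (gbit_per_s * 1000)    # microseconds per hop

for frame in (1500, 9000):                          # standard and jumbo frames
    print(f"{frame:>4} B frame @ 10 Gb/s: {store_and_forward_us(frame, 10):.2f} us/hop")
print(f"9000 B frame @ 40 Gb/s: {store_and_forward_us(9000, 40):.2f} us/hop")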
Re: [OpenIndiana-discuss] Recommendations for fast storage (OpenIndiana-discuss Digest, Vol 33, Issue 20)
On Mon, 15 Apr 2013, Ong Yu-Phing wrote: Working set of ~50% is quite large; when you say data analysis I'd assume some sort of OLTP or real-time BI situation, but do you know the nature of your processing, i.e. is it latency dependent or bandwidth dependent? Reason I ask, is because I think 10GB delivers better overall B/W, but 4GB infiniband delivers better latency. It would be difficult to believe that 10Gbit Ethernet offers better bandwidth than 56Gbit Infiniband (the current offering). The switching model is quite similar. The main reason why IB offers better latency is a better HBA hardware interface and a specialized stack. 5X is 5X. If 3xdisk raidz1 is too expensive, then put more SSDs in each raidz1. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] Recommendations for fast storage
On 04/15/2013 03:30 PM, John Doe wrote: > From: Günther Alka > >> I would think about the following >> - yes, i would build that from SSD >> - build the pool from multiple 10 disk Raid-Z2 vdevs, > > Slightly out of topic but, what is the status of the TRIM command and zfs...? ATM: unsupported. I'm working on that in Illumos. The ZFS bits are there, but there is no driver support for issuing the commands to the underlying devices. -- Saso ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] Recommendations for fast storage
From: Günther Alka > I would think about the following > - yes, i would build that from SSD > - build the pool from multiple 10 disk Raid-Z2 vdevs, Slightly out of topic but, what is the status of the TRIM command and zfs...? JD ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] Recommendations for fast storage
> From: Wim van den Berge [mailto:w...@vandenberge.us] > > multiple 10Gb uplinks > > However the next system is going to be a little different. It needs to be > the absolute fastest iSCSI target we can create/afford. So I'm just assuming you're going to build a pool out of SSD's, mirrored, perhaps even 3-way mirrors. No cache/log devices. All the ram you can fit into the system. You've been using 10G ether so far. Expensive, not too bad. I'm going to recommend looking into infiniband instead. ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] Recommendations for fast storage (OpenIndiana-discuss Digest, Vol 33, Issue 20)
A heads up that 10-12TB means you'd need 11.5-13TB useable, assuming you'd need to keep used storage < 90% of total storage useable (or is that old news now?). So, using Saso's RAID5 config of Intel DC3700s in 3xdisk raidz1, that means you'd need 21x Intel DC3700's at 800GB (21x800/3*2*.9=10.008) to get 10TB, or 27x to get 12.9TB useable, excluding root/cache etc. Which means 50+K for SSDs, leaving you only 10K for the server platform, which might not be enough to get 0.5TB of RAM etc (unless you can get a bulk discount on the Intel DC3700s!). Working set of ~50% is quite large; when you say data analysis I'd assume some sort of OLTP or real-time BI situation, but do you know the nature of your processing, i.e. is it latency dependent or bandwidth dependent? Reason I ask, is because I think 10GB delivers better overall B/W, but 4GB infiniband delivers better latency. 10 years ago I've worked with 30+TB data sets which were preloaded into an Oracle database, with data structures highly optimized for the types of reads which the applications required (2-3 day window for complex analysis of monthly data). No SSDs and fancy stuff in those days. But if your data is live/realtime and constantly streaming in, then the work profile can be dramatically different. On 15/04/2013 07:17, Sa?o Kiselkov wrote: On 04/14/2013 05:15 PM, Wim van den Berge wrote: Hello, We have been running OpenIndiana (and its various predecessors) as storage servers in production for the last couple of years. Over that time the majority of our storage infrastructure has been moved to Open Indiana to the point where we currently serve (iSCSI, NFS and CIFS) about 1.2PB from 10+ servers in three datacenters . All of these systems are pretty much the same, large pool of disks, SSD for root, ZIL and L2ARC, 64-128GB RAM, multiple 10Gb uplinks. All of these work like a charm. However the next system is going to be a little different. It needs to be the absolute fastest iSCSI target we can create/afford. We'll need about 10-12TB of capacity and the working set will be 5-6TB and IO over time is 90% reads and 10% writes using 32K blocks but this is a data analysis scenario so all the writes are upfront. Contrary to previous installs, money is a secondary (but not unimportant) issue for this one. I'd like to stick with a SuperMicro platform and we've been thinking of trying the new Intel S3700 800GB SSD's which seem to run about $2K. Ideally I'd like to keep system cost below $60K. This is new ground for us. Before this one, the game has always been primarily about capacity/data integrity and anything we designed based on ZFS/Open Solaris has always more than delivered in the performance arena. This time we're looking to fill up the dedicated 10Gbe connections to each of the four to eight processing nodes as much as possible. The processing nodes have been designed that they will consume whatever storage bandwidth they can get. Any ideas/thoughts/recommendations/caveats would be much appreciated. Hi Wim, Interesting project. You should definitely look at all-SSD pools here. With the 800GB DC S3700 running in 3-drive raidz1's you're looking at approximately $34k CAPEX (for the 10TB capacity point) just for the SSDs. That leaves you ~$25k you can spend on the rest of the box, which is *a lot*. Be sure to put lots of RAM (512GB+) into the box. Also consider ditching 10GE and go straight to IB. 
A dual-port QDR card can be had nowadays for about $1k (SuperMicro even makes motherboards with QDR-IB on-board) and a 36-port Mellanox QDR switch can be had for about $8k (this integrates the IB subnet manager, so this is all you need to set up an IB network): http://www.colfaxdirect.com/store/pc/viewPrd.asp?idcategory=7&idproduct=158 Cheers, -- Saso ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
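The capacity figures quoted in the post above can be reproduced with a few lines; the drive size, 3-disk raidz1 grouping and the 90% fill limit are taken from the post:

# 800 GB drives in 3-disk raidz1 groups (2/3 of raw space holds data),
# keeping used space below 90% of the pool.

def usable_tb(drives, drive_gb=800, data_fraction=2 / 3, fill_limit=0.9):
    return drives * drive_gb * data_fraction * fill_limit / 1000

print(f"21 drives -> {usable_tb(21):.2f} TB usable")   # ~10.08 TB
print(f"27 drives -> {usable_tb(27):.2f} TB usable")   # ~12.96 TB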
Re: [OpenIndiana-discuss] Recommendations for fast storage
On Apr 14, 2013, at 8:15 AM, Wim van den Berge wrote: > Hello, > > We have been running OpenIndiana (and its various predecessors) as storage > servers in production for the last couple of years. Over that time the > majority of our storage infrastructure has been moved to Open Indiana to the > point where we currently serve (iSCSI, NFS and CIFS) about 1.2PB from 10+ > servers in three datacenters . All of these systems are pretty much the > same, large pool of disks, SSD for root, ZIL and L2ARC, 64-128GB RAM, > multiple 10Gb uplinks. All of these work like a charm. > > However the next system is going to be a little different. It needs to be > the absolute fastest iSCSI target we can create/afford. We'll need about > 10-12TB of capacity and the working set will be 5-6TB and IO over time is > 90% reads and 10% writes using 32K blocks but this is a data analysis > scenario so all the writes are upfront. Contrary to previous installs, money > is a secondary (but not unimportant) issue for this one. I'd like to stick > with a SuperMicro platform and we've been thinking of trying the new Intel > S3700 800GB SSD's which seem to run about $2K. Ideally I'd like to keep > system cost below $60K. Does "fast" mean "low-latency"? If so, the general rules are: + mirror + go direct, no expanders + iSCSI tends to not use ZIL very much, but you can verify on your workload. There are a number of vendors who have been selling SSD-only ZFS systems for a few years. You might ask around for experiences and specs. -- richard > This is new ground for us. Before this one, the game has always been > primarily about capacity/data integrity and anything we designed based on > ZFS/Open Solaris has always more than delivered in the performance arena. > This time we're looking to fill up the dedicated 10Gbe connections to each > of the four to eight processing nodes as much as possible. The processing > nodes have been designed that they will consume whatever storage bandwidth > they can get. > > > > Any ideas/thoughts/recommendations/caveats would be much appreciated. > > > > Thanks > > > > W > > ___ > OpenIndiana-discuss mailing list > OpenIndiana-discuss@openindiana.org > http://openindiana.org/mailman/listinfo/openindiana-discuss -- richard.ell...@richardelling.com +1-760-896-4422 ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] Recommendations for fast storage
On 04/14/2013 05:15 PM, Wim van den Berge wrote: > Hello, > > We have been running OpenIndiana (and its various predecessors) as storage > servers in production for the last couple of years. Over that time the > majority of our storage infrastructure has been moved to Open Indiana to the > point where we currently serve (iSCSI, NFS and CIFS) about 1.2PB from 10+ > servers in three datacenters . All of these systems are pretty much the > same, large pool of disks, SSD for root, ZIL and L2ARC, 64-128GB RAM, > multiple 10Gb uplinks. All of these work like a charm. > > However the next system is going to be a little different. It needs to be > the absolute fastest iSCSI target we can create/afford. We'll need about > 10-12TB of capacity and the working set will be 5-6TB and IO over time is > 90% reads and 10% writes using 32K blocks but this is a data analysis > scenario so all the writes are upfront. Contrary to previous installs, money > is a secondary (but not unimportant) issue for this one. I'd like to stick > with a SuperMicro platform and we've been thinking of trying the new Intel > S3700 800GB SSD's which seem to run about $2K. Ideally I'd like to keep > system cost below $60K. > > This is new ground for us. Before this one, the game has always been > primarily about capacity/data integrity and anything we designed based on > ZFS/Open Solaris has always more than delivered in the performance arena. > This time we're looking to fill up the dedicated 10Gbe connections to each > of the four to eight processing nodes as much as possible. The processing > nodes have been designed that they will consume whatever storage bandwidth > they can get. > > Any ideas/thoughts/recommendations/caveats would be much appreciated. Hi Wim, Interesting project. You should definitely look at all-SSD pools here. With the 800GB DC S3700 running in 3-drive raidz1's you're looking at approximately $34k CAPEX (for the 10TB capacity point) just for the SSDs. That leaves you ~$25k you can spend on the rest of the box, which is *a lot*. Be sure to put lots of RAM (512GB+) into the box. Also consider ditching 10GE and go straight to IB. A dual-port QDR card can be had nowadays for about $1k (SuperMicro even makes motherboards with QDR-IB on-board) and a 36-port Mellanox QDR switch can be had for about $8k (this integrates the IB subnet manager, so this is all you need to set up an IB network): http://www.colfaxdirect.com/store/pc/viewPrd.asp?idcategory=7&idproduct=158 Cheers, -- Saso ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] Recommendations for fast storage
I would think about the following:
- yes, I would build that from SSD
- build the pool from multiple 10 disk Raid-Z2 vdevs
- use as much RAM as possible to serve most reads from RAM, for example a dual socket 2011 system with 256 GB RAM
- if you need sync writes / disabled LU write back cache, use dedicated DRAM based log devices (ZEUSRAM); for multiple 10 GbE you may need several of them, or disable sync / enable LU cache when possible. I would calculate one ZEUSRAM per 10 GbE adapter (about 2000$ each); analyze ZIL usage first.
- if possible, avoid expanders with SATA disks
- do not fill a pool above 50% if you need max performance; read about fillrate vs throughput: http://blog.delphix.com/uday/2013/02/19/78/
- tune IP (jumbo frames, MPIO, trunking) and iSCSI blocksize
- think about using OmniOS (a little more up to date than OI)

The rest is some math. You need: a case (like a 50 x 3,5" bay Chenbro or an up to 72 bay SuperMicro) with a 7 x PCI-e mainboard, CPU, RAM, 3 x SAS2 HBA controllers, 4 x dual 10 GbE adapters. Ex: Chenbro 50 x 3,5" case without expander: 4 x dual 10 GbE + 3 x LSI 16 channel HBA, or SuperMicro cases with expander, up to 72 x 2,5" bays with up to 3 x 8-16 channel HBA. Say 10 000 $. The rest is for SSD and ZIL.

If you like to use 10 TB and want to have 20 TB capacity for performance reasons: with your 800GB Intel, you have about 6,5 TB usable for 10 disks (Z2). You need 30 of them at 2000$ per SSD: 60 000 $ (without ZIL and spare), which gives a total of about 70 000 $ without ZIL and spare. Other option: use 500-600 GB SSDs like Intel 320 or 520; you need more of them but they are cheaper regarding TB/$. Allow 80% SSD usage, check ARC usage to eventually reduce the amount of SSD (RAM is cheaper than using only 50% of SSD capacity), keep enough slots free to optionally add more SSDs for better performance or higher capacity, care about needed capacity for snaps, and add 10% spare disks.

On 14.04.2013 17:15, Wim van den Berge wrote: Hello, We have been running OpenIndiana (and its various predecessors) as storage servers in production for the last couple of years. Over that time the majority of our storage infrastructure has been moved to Open Indiana to the point where we currently serve (iSCSI, NFS and CIFS) about 1.2PB from 10+ servers in three datacenters. All of these systems are pretty much the same, large pool of disks, SSD for root, ZIL and L2ARC, 64-128GB RAM, multiple 10Gb uplinks. All of these work like a charm. However the next system is going to be a little different. It needs to be the absolute fastest iSCSI target we can create/afford. We'll need about 10-12TB of capacity and the working set will be 5-6TB and IO over time is 90% reads and 10% writes using 32K blocks but this is a data analysis scenario so all the writes are upfront. Contrary to previous installs, money is a secondary (but not unimportant) issue for this one. I'd like to stick with a SuperMicro platform and we've been thinking of trying the new Intel S3700 800GB SSD's which seem to run about $2K. Ideally I'd like to keep system cost below $60K. This is new ground for us. Before this one, the game has always been primarily about capacity/data integrity and anything we designed based on ZFS/Open Solaris has always more than delivered in the performance arena. This time we're looking to fill up the dedicated 10Gbe connections to each of the four to eight processing nodes as much as possible. The processing nodes have been designed that they will consume whatever storage bandwidth they can get. 
Any ideas/thoughts/recommendations/caveats would be much appreciated. Thanks W ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
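The sizing in the post above works out roughly as follows; the drive size, price and vdev width are taken from the post, and the 80% usage figure is the author's own suggestion:

# 10-disk raidz2 vdevs of 800 GB drives (8 data + 2 parity), 30 drives total.

drive_gb, drive_price = 800, 2000
data_disks_per_vdev, vdev_width = 8, 10

per_vdev_tb = data_disks_per_vdev * drive_gb / 1000     # ~6.4 TB usable per vdev
vdevs = 3                                               # 30 drives, as in the post
pool_tb = vdevs * per_vdev_tb

print(f"per vdev: {per_vdev_tb:.1f} TB usable")
print(f"{vdevs} vdevs ({vdevs * vdev_width} drives): {pool_tb:.1f} TB usable, "
      f"{pool_tb * 0.8:.1f} TB at 80% usage, ~${vdevs * vdev_width * drive_price:,} in SSDs")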
[OpenIndiana-discuss] Recommendations for fast storage
Hello, We have been running OpenIndiana (and its various predecessors) as storage servers in production for the last couple of years. Over that time the majority of our storage infrastructure has been moved to Open Indiana to the point where we currently serve (iSCSI, NFS and CIFS) about 1.2PB from 10+ servers in three datacenters . All of these systems are pretty much the same, large pool of disks, SSD for root, ZIL and L2ARC, 64-128GB RAM, multiple 10Gb uplinks. All of these work like a charm. However the next system is going to be a little different. It needs to be the absolute fastest iSCSI target we can create/afford. We'll need about 10-12TB of capacity and the working set will be 5-6TB and IO over time is 90% reads and 10% writes using 32K blocks but this is a data analysis scenario so all the writes are upfront. Contrary to previous installs, money is a secondary (but not unimportant) issue for this one. I'd like to stick with a SuperMicro platform and we've been thinking of trying the new Intel S3700 800GB SSD's which seem to run about $2K. Ideally I'd like to keep system cost below $60K. This is new ground for us. Before this one, the game has always been primarily about capacity/data integrity and anything we designed based on ZFS/Open Solaris has always more than delivered in the performance arena. This time we're looking to fill up the dedicated 10Gbe connections to each of the four to eight processing nodes as much as possible. The processing nodes have been designed that they will consume whatever storage bandwidth they can get. Any ideas/thoughts/recommendations/caveats would be much appreciated. Thanks W ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss