Re: [zfs-discuss] 350TB+ storage solution

2011-05-18 Thread Rich Teer
On Wed, 18 May 2011, Chris Mosetick wrote:

> to go in the packing dept. I still love their prices!

There's a reason for that: you don't get what you don't pay for!

-- 
Rich Teer, Publisher
Vinylphile Magazine

www.vinylphilemag.com


Re: [zfs-discuss] 350TB+ storage solution

2011-05-18 Thread Chris Mosetick
> The drives I just bought were half packed in white foam then wrapped
> in bubble wrap.  Not all edges were protected with more than bubble
> wrap.



Same here. I purchased 10 x 2TB Hitachi 7200rpm SATA disks from
Newegg.com in March. The majority of each drive was protected in white
foam, but roughly half an inch at each end was covered only by bubble
wrap. A small batch of three disks I ordered in February (testing for the
larger order) was packed similarly, and I've already had to RMA one of
those drives. Newegg is moving in the right direction, but they still have
a ways to go in the packing dept. I still love their prices!

-Chris


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Eric D. Mudama

On Mon, May 16 at 21:55, Edward Ned Harvey wrote:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Paul Kraus
>>
>> All drives have a very high DOA rate according to Newegg. The
>> way they package drives for shipping is exactly how Seagate
>> specifically says NOT to pack them here
>
> 8 months ago, newegg says they've changed this practice.
> http://www.facebook.com/media/set/?set=a.438146824167.223805.5585759167


The drives I just bought were half packed in white foam then wrapped
in bubble wrap.  Not all edges were protected with more than bubble
wrap.

--eric

--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Paul Kraus
> 
> All drives have a very high DOA rate according to Newegg. The
> way they package drives for shipping is exactly how Seagate
> specifically says NOT to pack them here

8 months ago, newegg says they've changed this practice.
http://www.facebook.com/media/set/?set=a.438146824167.223805.5585759167





Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Jim Klimov

On 2011-05-16 9:14, Richard Elling wrote:

> On May 15, 2011, at 10:18 AM, Jim Klimov  wrote:
>
>> Hi, Very interesting suggestions as I'm contemplating a Supermicro-based server
>> for my work as well, but probably in a lower budget as a backup store for an
>> aging Thumper (not as its superior replacement).
>>
>> Still, I have a couple of questions regarding your raidz layout recommendation.
>>
>> On one hand, I've read that as current drives get larger (while their random
>> IOPS/MBPS don't grow nearly as fast with new generations), it is becoming more
>> and more reasonable to use RAIDZ3 with 3 redundancy drives, at least for vdevs
>> made of many disks - a dozen or so. When a drive fails, you still have two
>> redundant parities, and with a resilver window expected to be in hours if not
>> days range, I would want that airbag, to say the least. You know, failures
>> rarely come one by one ;)
>
> Not to worry. If you add another level of redundancy, the data protection
> is improved by orders of magnitude. If the resilver time increases, the effect
> on data protection is reduced by a relatively small divisor. To get some sense
> of this, the MTBF is often 1,000,000,000 hours and there are only 24 hours in
> a day.



If MTBFs were real, we'd never see disks failing within a year ;)

Problem is, these values seem to be determined in an ivory-tower
lab. An expensive-vendor edition of a drive running in a cooled
data center with shock absorbers and other nice features does
often live a lot longer than a similar OEM enterprise or consumer
drive running in an apartment with varying weather around and
often overheating and randomly vibrating with a dozen other
disks rotating in the same box.

The ramble about expensive-vendor drive editions comes from
my memory of some forum or blog discussion which I can't point
to now either, which suggested that vendors like Sun do not
charge 5x-10x the price of the same label of OEM drive just
for a nice corporate logo stamped onto the disk. Vendors were
said to burn-in the drives in their labs for like half a year or a
year before putting the survivors to the market. This implies
that some of the drives did not survive a burn-in period, and
indeed the MTBF for the remaining ones is higher because
"infancy death" due to manufacturing problems soon after
arrival to the end customer is unlikely for these particular
tested devices. The long burn-in times were also said to
be the partial reason why vendors never sell the biggest
disks available on the market (does any vendor sell 3TB
with their own brand already? Sun-Oracle? IBM? HP?)
This may be disguised as a "certification process" that
occasionally takes about as long - to see if the newest
and greatest disks die within a year or so.

Another implied idea in that discussion was that the vendors
can influence OEMs in choice of components, an example
in the thread being about different marks of steel for the
ball bearings. Such choices can drive the price up for
a reason - disks like that are more expensive to produce -
but they also increase reliability.

In fact, I've had very few Sun disks break in the boxes
I've managed over 10 years; all I can remember now were
two or three 2.5" 72GB Fujitsus with a Sun brand. Still, we
have another dozen of those that have been running for several years.

So yes, I can believe that Big Vendor Brand disks can boast
huge MTBFs and prove that with a track record, and such
drives are often replaced not because of a break-down,
but rather as a precaution, or because of obsolescence -
low speed and small capacity.

But for the rest of us (like Home-ZFS users) such numbers
of MTBF are as fantastic as the Big Vendor prices, and
unachievable for any number of reasons, starting with the use
of cheaper and potentially worse hardware from the
beginning, and the far-from-ideal conditions in which the
machines run...

I do have some 5-year-old disks running in computers
daily and still alive, but I have about as many which died
young, sometimes even within the warranty period ;)



>> On another hand, I've recently seen many recommendations that in a RAIDZ* drive set, the
>> number of data disks should be a power of two - so that ZFS blocks/stripes and those of
>> its users (like databases) which are inclined to use 2^N-sized blocks can be often
>> accessed in a single IO burst across all drives, and not in "one and one-quarter
>> IO" on the average, which might delay IOs to other stripes while some of the disks
>> in a vdev are busy processing leftovers of a previous request, and others are waiting for
>> their peers.
>
> I've never heard of this and it doesn't pass the sniff test. Can you cite a
> source?

I was trying to find an "authoritative" link today but failed.
I know I've read this many times over the past couple
of months, but this may still be an "urban legend" or even
FUD, retold many times...

In fact, today I came across old posts from Jeff Bonwick,
where he explains the disk usage and "ZFS striping" which
is not like usual RAID striping. If th

Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Eric D. Mudama

On Mon, May 16 at 14:29, Paul Kraus wrote:

   I have stopped buying drives (and everything else) from Newegg
as they cannot be bothered to properly pack items. It is worth the
extra $5 per drive to buy them from CDW (who uses factory approved
packaging). Note that I made this change 5 or so years ago and Newegg
may have changed their packaging since then.


Newegg packaging is exactly what you describe, unchanged in the last
few years.  My most recent Newegg drive purchase was last week.

--eric

--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Paul Kraus
On Mon, May 16, 2011 at 2:35 PM, Krunal Desai  wrote:

> An order of 6 of the 5K3000 drives for work-related purposes shipped in a
> Styrofoam holder of sorts that was cut in half for my small number of
> drives (is this what 20-packs come in?). No idea what other packaging
> was around them (shipping and receiving opened the packages).

Yes, the 20 packs I have seen are a big box with a foam insert with 2
columns of 10 'slots' that hold a drive in anti-static plastic.

P.S. I buy from CDW (and previously from Newegg) for home not work.
Work tends to buy from Sun/Oracle via a reseller. I can't afford new
Sun/Oracle for home use.

-- 
{1-2-3-4-5-6-7-}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Garrett D'Amore
Actually it is 100 or less, i.e. a 10 msec delay.

  -- Garrett D'Amore

On May 16, 2011, at 11:13 AM, "Richard Elling"  wrote:

> On May 16, 2011, at 10:31 AM, Brandon High wrote:
>> On Mon, May 16, 2011 at 8:33 AM, Richard Elling
>>  wrote:
>>> As a rule of thumb, the resilvering disk is expected to max out at around
>>> 80 IOPS for 7,200 rpm disks. If you see less than 80 IOPS, then suspect
>>> the throttles or broken data path.
>> 
>> My system was doing far less than 80 IOPS during resilver when I
>> recently upgraded the drives. The older and newer drives were both 5k
>> RPM drives (WD10EADS and Hitachi 5K3000 3TB) so I don't expect it to
>> be super fast.
>> 
>> The worst resilver was 50 hours, the best was about 20 hours. This was
>> just my home server, which is lightly used. The clients (2-3 CIFS
>> clients, 3 mostly idle VBox instances using raw zvols, and 2-3 NFS
>> clients) are mostly idle and don't do a lot of writes.
>> 
>> Adjusting zfs_resilver_delay and zfs_resilver_min_time_ms sped things
>> up a bit, which suggests that the default values may be too
>> conservative for some environments.
> 
> I am more inclined to change the hires_tick value. The "delays" are in 
> units of clock ticks. For Solaris, the default clock tick is 10ms, which I will
> argue is too large for modern disk systems. What this means is that when 
> the resilver, scrub, or memory throttle causes delays, the effective IOPS is
> driven to 10 or less. Unfortunately, these values are guesses and are 
> probably suboptimal for various use cases. OTOH, the prior behaviour of
> no resilver or scrub throttle was also considered a bad thing.
> -- richard
> 


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Krunal Desai
On Mon, May 16, 2011 at 2:29 PM, Paul Kraus  wrote:
> What Newegg was doing is buying drives in the 20-pack from the
> manufacturer and packing them individually WRAPPED IN BUBBLE WRAP and
> then stuffed in a box. No clamshell. I realized *something* was up
> when _every_ drive I looked at had a much higher report of DOA (or
> early failure) at the Newegg reviews than made any sense (and compared
> to other site's reviews).

I picked up a single 5K3000 last week, have not powered it on yet, but
it came in a pseudo-OEM box with clamshells. I remember getting
bubble-wrapped single drives from Newegg, and more than a fair share
of those drives suffered early deaths or never powered on in the first
place. No complaints about Amazon: Seagate drives came in Seagate OEM
boxes with free shipping via Prime (probably not practical for you
enterprise/professional guys, but nice for home users).

An order of 6 of the 5K3000 drives for work-related purposes shipped in a
Styrofoam holder of sorts that was cut in half for my small number of
drives (is this what 20-packs come in?). No idea what other packaging
was around them (shipping and receiving opened the packages).


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Paul Kraus
On Mon, May 16, 2011 at 1:20 PM, Brandon High  wrote:

> The 1TB and 2TB are manufactured in China, and have a very high
> failure and DOA rate according to Newegg.

All drives have a very high DOA rate according to Newegg. The
way they package drives for shipping is exactly how Seagate
specifically says NOT to pack them here
http://www.seagate.com/ww/v/index.jsp?locale=en-US&name=what-to-pack&vgnextoid=5c3a8bc90bf03210VgnVCM101a48090aRCRD

I have stopped buying drives (and everything else) from Newegg
as they cannot be bothered to properly pack items. It is worth the
extra $5 per drive to buy them from CDW (who uses factory approved
packaging). Note that I made this change 5 or so years ago and Newegg
may have changed their packaging since then.

 What Newegg was doing is buying drives in the 20-pack from the
manufacturer and packing them individually WRAPPED IN BUBBLE WRAP and
then stuffing them in a box. No clamshell. I realized *something* was up
when _every_ drive I looked at had a much higher report of DOA (or
early failure) at the Newegg reviews than made any sense (and compared
to other site's reviews).

 This is NOT to say that the drives in question really don't have
a QC issue, just that the reports via Newegg are biased by Newegg's
packing / shipping practices.

-- 
{1-2-3-4-5-6-7-}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Krunal Desai
On Mon, May 16, 2011 at 1:20 PM, Brandon High  wrote:
> The 1TB and 2TB are manufactured in China, and have a very high
> failure and DOA rate according to Newegg.
>
> The 3TB drives come off the same production line as the Ultrastar
> 5K3000 in Thailand and may be more reliable.

Thanks for the heads up, I was thinking about 5K3000s to finish out my
build (currently have Barracuda LPs). I do wonder how much of that DOA
is due to newegg HDD packaging/shipping, however.

--khd


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Richard Elling
On May 16, 2011, at 10:31 AM, Brandon High wrote:
> On Mon, May 16, 2011 at 8:33 AM, Richard Elling
>  wrote:
>> As a rule of thumb, the resilvering disk is expected to max out at around
>> 80 IOPS for 7,200 rpm disks. If you see less than 80 IOPS, then suspect
>> the throttles or broken data path.
> 
> My system was doing far less than 80 IOPS during resilver when I
> recently upgraded the drives. The older and newer drives were both 5k
> RPM drives (WD10EADS and Hitachi 5K3000 3TB) so I don't expect it to
> be super fast.
> 
> The worst resilver was 50 hours, the best was about 20 hours. This was
> just my home server, which is lightly used. The clients (2-3 CIFS
> clients, 3 mostly idle VBox instances using raw zvols, and 2-3 NFS
> clients) are mostly idle and don't do a lot of writes.
> 
> Adjusting zfs_resilver_delay and zfs_resilver_min_time_ms sped things
> up a bit, which suggests that the default values may be too
> conservative for some environments.

I am more inclined to change the hires_tick value. The "delays" are in 
units of clock ticks. For Solaris, the default clock tick is 10ms, which I will
argue is too large for modern disk systems. What this means is that when 
the resilver, scrub, or memory throttle causes delays, the effective IOPS is
driven to 10 or less. Unfortunately, these values are guesses and are 
probably suboptimal for various use cases. OTOH, the prior behaviour of
no resilver or scrub throttle was also considered a bad thing.
 -- richard
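
For reference: with the default 10 ms clock tick, a throttle of one tick per
I/O works out to at most 1 s / 10 ms = 100 delayed I/Os per second, and larger
delays divide that further, which is the arithmetic behind Garrett D'Amore's
correction above. A minimal, illustrative /etc/system sketch for the tunables
discussed here - the values are examples rather than recommendations, and a
reboot is required:

   * /etc/system additions - illustrative values only
   * raise the kernel clock tick rate from the default 100 Hz
   set hires_tick=1
   * number of ticks to delay each resilver/scrub I/O (0 = no throttle)
   set zfs:zfs_resilver_delay=0
   set zfs:zfs_scrub_delay=0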



Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Brandon High
On Mon, May 16, 2011 at 8:33 AM, Richard Elling
 wrote:
> As a rule of thumb, the resilvering disk is expected to max out at around
> 80 IOPS for 7,200 rpm disks. If you see less than 80 IOPS, then suspect
> the throttles or broken data path.

My system was doing far less than 80 IOPS during resilver when I
recently upgraded the drives. The older and newer drives were both 5k
RPM drives (WD10EADS and Hitachi 5K3000 3TB) so I don't expect it to
be super fast.

The worst resilver was 50 hours, the best was about 20 hours. This was
just my home server, which is lightly used. The clients (2-3 CIFS
clients, 3 mostly idle VBox instances using raw zvols, and 2-3 NFS
clients) are mostly idle and don't do a lot of writes.

Adjusting zfs_resilver_delay and zfs_resilver_min_time_ms sped things
up a bit, which suggests that the default values may be too
conservative for some environments.

-B

-- 
Brandon High : bh...@freaks.com
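
For anyone who wants to try the same adjustment, here is a sketch of how these
tunables can be read and changed on a live OpenSolaris-era kernel with mdb; the
values below are only examples, and live changes revert at reboot (use
/etc/system to make them stick):

   # read the current values (decimal)
   echo "zfs_resilver_delay/D"             | mdb -k
   echo "zfs_resilver_min_time_ms/D"       | mdb -k

   # example: drop the per-I/O delay and give resilver more time per txg
   echo "zfs_resilver_delay/W0t0"          | mdb -kw
   echo "zfs_resilver_min_time_ms/W0t5000" | mdb -kw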


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Brandon High
On Sat, May 14, 2011 at 11:20 PM, John Doe  wrote:
>> 171   Hitachi 7K3000 3TB
> I'd go for the more environmentally friendly Ultrastar 5K3000 version - with 
> that many drives you won't mind the slower rotation but WILL notice a 
> difference in power and cooling cost

A word of caution - The Hitachi Deskstar 5K3000 drives in 1TB and 2TB
are different than the 3TB.

The 1TB and 2TB are manufactured in China, and have a very high
failure and DOA rate according to Newegg.

The 3TB drives come off the same production line as the Ultrastar
5K3000 in Thailand and may be more reliable.

-B

-- 
Brandon High : bh...@freaks.com


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread John Doe
following are some thoughts if it's not too late:

> 1 SuperMicro  847E1-R1400LPB
I guess you meant the 847E16-R1400LPB, the SAS1 version makes no sense

> 1 SuperMicro  H8DG6-F
not the best choice, see below why

> 171   Hitachi 7K3000 3TB
I'd go for the more environmentally friendly Ultrastar 5K3000 version - with 
that many drives you won't mind the slower rotation but WILL notice a difference 
in power and cooling cost

> 1 LSI SAS 9202-16e
this is really only a very expensive gadget to be honest, there's really no 
point to it - especially true when you start looking for the necessary cables 
that use a connector that's still in "draft" specification...

stick to the excellent LSI SAS9200-8e, of which you will need at least 3 in 
your setup, one to connect each of the 3 JBODs - with them filled with fast 
drives like you chose, you will need two links (one for the front and one for 
the back backplane) as daisy-chaining the backplanes together would oversaturate 
a single link. 

if you'd want to take advantage of the dual expanders on your JBOD backplanes 
for additional redundancy in case of expander or controller failure, you will 
need 6 of those  LSI SAS9200-8e - this is where your board isn't ideal as it 
has a 3/1/2 PCIe x16/x8/x4 configuration while you'd need 6 PCIe x8 - something 
the X8DTH-6F will provide, as well as the onboard LSI SAS2008 based HBA for the 
two backplanes in the server case.

> 1 LSI SAS 9211-4i
> 2 OCZ 64GB SSD Vertex 3
> 2 OCZ 256GB SSD Vertex 3
if these are meant to be connected together and used as ZIL+L2ARC, then I'd 
STRONGLY urge you to get the following instead:
1x LSI MegaRAID SAS 9265-8i 
1x LSI FastPath licence
4-8x 120GB or 240GB Vertex 3 Max IOPS Edition, whatever suits the budget

this solution allows you to push around 400k IOPS to the cache, more than 
likely way more than the stated application of the system will need
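
Should the SSDs instead end up directly under ZFS as separate log and cache
devices, the commands look roughly like the sketch below; the pool and device
names are placeholders, and the slog is mirrored because losing an unmirrored
log device is painful on older pool versions:

   # pool and device names below are placeholders
   zpool add tank log mirror c2t0d0 c2t1d0    # mirrored ZIL (slog)
   zpool add tank cache c2t2d0 c2t3d0         # L2ARC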

> 1 NeterionX3120SR0001
I don't know this card personally but since it's not listed as supported 
(http://www.sun.com/io_technologies/nic/NIC1.html) I'd be careful

> My question is what is the optimum way of dividing
> these drives across vdevs?
I would do 14 x 12-drive raidz2 + 3 spares = 140 x 3TB = ~382TiB usable.
this would allow for a logical mapping of drives to vdevs, giving you in each 
case 2 vdevs in the front and 1 in the back, with the 9-drive blocks in the back 
of the JBODs used as 3 x 4/4/1, which gives the remaining 2 x 12-drive vdevs plus 
one spare per case
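
For illustration, the zpool creation for such a layout would look roughly like
the sketch below - the pool name and c#t#d# device names are made up and depend
entirely on your HBAs and enclosures, and the remaining vdevs follow the same
pattern:

   zpool create tank \
     raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
            c1t6d0 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0 \
     raidz2 c1t12d0 c1t13d0 c1t14d0 c1t15d0 c1t16d0 c1t17d0 \
            c1t18d0 c1t19d0 c1t20d0 c1t21d0 c1t22d0 c1t23d0
   # ... twelve more 12-disk raidz2 vdevs added the same way ...
   zpool add tank spare c9t42d0 c9t43d0 c9t44d0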

> I could also go with 2TB drives and add an extra 45
> JBOD chassis. This would significantly decrease cost,
> but I'm running a gauntlet by getting very close to
> minimum useable space.
> 
> 12 x 18 drive raidz2
I would never do vdevs that large, it's just an accident waiting to happen!


hopefully these recommendations help you with your project. in any case, it's 
huge - the biggest system I worked on (which I actually have at home, go 
figure) only has a bit over 100TB in the following configuration:
6 x 12 drive raidz2 of Hitachi 5K3000 2TB
3 Norco 4224 with a HP SAS Expander in each
Supermicro X8DTi-LN4F with 3x LSI SAS9200-8e

so yeah, I based my thoughts on my own system but considering that it's been 
running smoothly for a while now (and that I had a very similar setup with 
smaller drives and older controllers before), I'm confident in my suggestions

Regards from Switzerland,
voyman


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Richard Elling
On May 16, 2011, at 5:02 AM, Sandon Van Ness  wrote:

> On 05/15/2011 09:58 PM, Richard Elling wrote:
>>>  In one of my systems, I have 1TB mirrors, 70% full, which can be
>>> sequentially completely read/written in 2 hrs.  But the resilver took 12
>>> hours of idle time.  Supposing you had a 70% full pool of raidz3, 2TB disks,
>>> using 10 disks + 3 parity, and a usage pattern similar to mine, your
>>> resilver time would have been minimum 10 days,
>> bollix
>> 
>>> likely approaching 20 or 30
>>> days.  (Because you wouldn't get 2-3 weeks of consecutive idle time, and the
>>> random access time for a raidz approaches 2x the random access time of a
>>> mirror.)
>> totally untrue
>> 
>>> BTW, the reason I chose 10+3 disks above was just because it makes
>>> calculation easy.  It's easy to multiply by 10.  I'm not suggesting using
>>> that configuration.  You may notice that I don't recommend raidz for most
>>> situations.  I endorse mirrors because they minimize resilver time (and
>>> maximize performance in general).  Resilver time is a problem for ZFS, which
>>> they may fix someday.
>> Resilver time is not a significant problem with ZFS. Resilver time is a much
>> bigger problem with traditional RAID systems. In any case, it is bad systems
>> engineering to optimize a system for best resilver time.
>>  -- richard
> 
> Actually I have seen resilvers take a very long time (weeks) on 
> solaris/raidz2 when I almost never see a hardware raid controller take more 
> than a day or two. In one case i thrashed the disks absolutely as hard as I 
> could (hardware controller) and finally was able to get the rebuild to take 
> almost 1 week.. Here is an example of one right now:
> 
>   pool: raid3060
>   state: ONLINE
>   status: One or more devices is currently being resilvered. The pool will
>   continue to function, possibly in a degraded state.
>   action: Wait for the resilver to complete.
>   scrub: resilver in progress for 224h54m, 52.38% done, 204h30m to go
>   config:

I have seen worse cases, but the root cause was hardware failures
that are not reported by zpool status. Have you checked the health
of the disk transports? Hint: fmdump -e

Also, what zpool version is this? There were improvements made in the
prefetch and the introduction of throttles last year. One makes it faster,
the other intentionally slows it down.

As a rule of thumb, the resilvering disk is expected to max out at around
80 IOPS for 7,200 rpm disks. If you see less than 80 IOPS, then suspect
the throttles or broken data path.
 -- richard
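
A quick way to check both of the things mentioned here - transport errors and
per-disk resilver IOPS - is sketched below, using the pool name from the status
output quoted above:

   # any error telemetry from the disk transports? (-V for full detail)
   fmdump -e
   fmdump -eV | less

   # per-disk IOPS and service times in 5-second samples; the resilvering
   # disk should show on the order of 80 ops/sec for a 7,200 rpm drive
   iostat -xn 5

   # answer to the zpool version question
   zpool get version raid3060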



Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Edward Ned Harvey
> From: Sandon Van Ness [mailto:san...@van-ness.com]
> 
> ZFS resilver can take a very long time depending on your usage pattern.
> I do disagree with some things he said though... like a 1TB drive being
> able to be read/written in 2 hours? I seriously doubt this. Just reading
> 1 TB in 2 hours means an average speed of over 130 megabytes/sec.

1Gbit/sec sustainable sequential disk speed is not uncommon these days, and
it is in fact the performance of the disks in the system in question.  SATA
7.2krpm disks...  Not even special disks.  Just typical boring normal disks.


> Definitely no way to be that fast with reading *and* writing 1TB of data
> to the drive. I guess if you count reading from one and writing to the
> other. 3 hours is a much more likely figure and best case.

No need to read & write from the same drive.  You can read from one drive
and write to the other simultaneously at full speed.  If there is any
performance difference between read & write on these drives, it's not
measurable.



Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Edward Ned Harvey
> From: Richard Elling [mailto:richard.ell...@gmail.com]
> 
> >  In one of my systems, I have 1TB mirrors, 70% full, which can be
> > sequentially completely read/written in 2 hrs.  But the resilver took 12
> > hours of idle time.  Supposing you had a 70% full pool of raidz3, 2TB disks,
> > using 10 disks + 3 parity, and a usage pattern similar to mine, your
> > resilver time would have been minimum 10 days,
> 
> bollix
> 
> Resilver time is not a significant problem with ZFS. Resilver time is a much
> bigger problem with traditional RAID systems. In any case, it is bad systems
> engineering to optimize a system for best resilver time.

Because RE seems to be emotionally involved with ZFS resilver times, I don't
believe it's going to be productive for me to try addressing his off-hand
comments.  Instead, I'm only going to say this much:

In my system mentioned above, a complete disk can be copied to another
complete disk, sequentially, in 131 minutes.  But during idle time it took
12 hours because ZFS resilver only does the used parts of disk, in
essentially random order.  So ZFS resilver often takes many times longer
than a complete hardware-based complete disk resilver.
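
That figure is consistent with roughly 1 TB / 131 min, or about 127 MB/s
sustained. For anyone who wants to compare their own disks against their
resilver times, a simple raw sequential read check looks like this - the
device path is a placeholder, and s2 assumes an SMI label where slice 2
spans the whole disk:

   # read the entire raw disk sequentially and time it
   time dd if=/dev/rdsk/c0t2d0s2 of=/dev/null bs=1024k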



Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Karl Wagner
I have to agree. ZFS needs a more intelligent scrub/resilver algorithm, which 
can 'sequentialise' the process. 
-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

Giovanni Tirloni  wrote:

> On Mon, May 16, 2011 at 9:02 AM, Sandon Van Ness  wrote:
>
>> Actually I have seen resilvers take a very long time (weeks) on solaris/raidz2
>> when I almost never see a hardware raid controller take more than a day or two.
>> In one case i thrashed the disks absolutely as hard as I could (hardware
>> controller) and finally was able to get the rebuild to take almost 1 week..
>> Here is an example of one right now:
>>
>>   pool: raid3060
>>   state: ONLINE
>>   status: One or more devices is currently being resilvered. The pool will
>>   continue to function, possibly in a degraded state.
>>   action: Wait for the resilver to complete.
>>   scrub: resilver in progress for 224h54m, 52.38% done, 204h30m to go
>>   config:
>
> Resilver has been a problem with RAIDZ volumes for a while. I've routinely seen
> it take >300 hours and sometimes >600 hours with 13TB pools at 80%. All disks
> are maxed out on IOPS while still reading 1-2MB/s and there rarely is any
> writes. I've written about it before here (and provided data).
>
> My only guess is that fragmentation is a real problem in a scrub/resilver
> situation but whenever the conversation changes to point weaknesses in ZFS we
> start seeing "that is not a problem" comments. With the 7000s appliance I've
> heard that the 900hr estimated resilver time was "normal" and "everything is
> working as expected". Can't help but think there is some walled garden syndrome
> floating around.
>
> --
> Giovanni Tirloni



Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Giovanni Tirloni
On Mon, May 16, 2011 at 9:02 AM, Sandon Van Ness wrote:

>
> Actually I have seen resilvers take a very long time (weeks) on
> solaris/raidz2 when I almost never see a hardware raid controller take more
> than a day or two. In one case i thrashed the disks absolutely as hard as I
> could (hardware controller) and finally was able to get the rebuild to take
> almost 1 week.. Here is an example of one right now:
>
>   pool: raid3060
>   state: ONLINE
>   status: One or more devices is currently being resilvered. The pool will
>   continue to function, possibly in a degraded state.
>   action: Wait for the resilver to complete.
>   scrub: resilver in progress for 224h54m, 52.38% done, 204h30m to go
>   config:
>
>
Resilver has been a problem with RAIDZ volumes for a while. I've routinely
seen it take >300 hours and sometimes >600 hours with 13TB pools at 80%. All
disks are maxed out on IOPS while still reading 1-2MB/s and there rarely is
any writes. I've written about it before here (and provided data).

My only guess is that fragmentation is a real problem in a scrub/resilver
situation but whenever the conversation changes to point weaknesses in ZFS
we start seeing "that is not a problem" comments. With the 7000s appliance
I've heard that the 900hr estimated resilver time was "normal" and
"everything is working as expected". Can't help but think there is some
walled garden syndrome floating around.

-- 
Giovanni Tirloni


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Sandon Van Ness

On 05/15/2011 09:58 PM, Richard Elling wrote:

>>  In one of my systems, I have 1TB mirrors, 70% full, which can be
>> sequentially completely read/written in 2 hrs.  But the resilver took 12
>> hours of idle time.  Supposing you had a 70% full pool of raidz3, 2TB disks,
>> using 10 disks + 3 parity, and a usage pattern similar to mine, your
>> resilver time would have been minimum 10 days,
>
> bollix
>
>> likely approaching 20 or 30
>> days.  (Because you wouldn't get 2-3 weeks of consecutive idle time, and the
>> random access time for a raidz approaches 2x the random access time of a
>> mirror.)
>
> totally untrue
>
>> BTW, the reason I chose 10+3 disks above was just because it makes
>> calculation easy.  It's easy to multiply by 10.  I'm not suggesting using
>> that configuration.  You may notice that I don't recommend raidz for most
>> situations.  I endorse mirrors because they minimize resilver time (and
>> maximize performance in general).  Resilver time is a problem for ZFS, which
>> they may fix someday.
>
> Resilver time is not a significant problem with ZFS. Resilver time is a much
> bigger problem with traditional RAID systems. In any case, it is bad systems
> engineering to optimize a system for best resilver time.
>  -- richard


Actually I have seen resilvers take a very long time (weeks) on 
solaris/raidz2 when I almost never see a hardware raid controller take 
more than a day or two. In one case I thrashed the disks absolutely as 
hard as I could (hardware controller) and finally was able to get the 
rebuild to take almost 1 week. Here is an example of one right now:


   pool: raid3060
   state: ONLINE
   status: One or more devices is currently being resilvered. The pool will
   continue to function, possibly in a degraded state.
   action: Wait for the resilver to complete.
   scrub: resilver in progress for 224h54m, 52.38% done, 204h30m to go
   config:

ZFS resilver can take a very long time depending on your usage pattern. 
I do disagree with some things he said though... like a 1TB drive being 
able to be read/written in 2 hours? I seriously doubt this. Just reading 
1 TB in 2 hours means an average speed of over 130 megabytes/sec.


Only really new 1TB drives will even hit that type of speed at the 
beginning of the drive, and the average would be much closer to around 100 
MB/sec at the end of the drive. Also, that is the best-case scenario. I know 
1TB drives (when they first came out) took around 4-5 hours to do a 
complete read of all data on the disk at full speed.


Definitely no way to be that fast with reading *and* writing 1TB of data 
to the drive. I guess if you count reading from one and writing to the 
other. 3 hours is a much more likely figure and best case.



Re: [zfs-discuss] 350TB+ storage solution

2011-05-15 Thread Brandon High
On Sun, May 15, 2011 at 10:14 PM, Richard Elling
 wrote:
> On May 15, 2011, at 10:18 AM, Jim Klimov  wrote:
>> In case of RAIDZ2 this recommendation leads to vdevs sized 6 (4+2), 10 (8+2) 
>> or 18 (16+2) disks - the latter being mentioned in the original post.
>
> A similar theory was disproved back in 2006 or 2007. I'd be very surprised if
> there was a reliable way to predict the actual use patterns in advance. 
> Features
> like compression and I/O coalescing improve performance, but make the old
> "rules of thumb" even more obsolete.

I thought that having data disks that were a power of two was still
recommended, due to the way that ZFS splits records/blocks in a raidz
vdev. Or are you responding to some other point?

-B

-- 
Brandon High : bh...@freaks.com


Re: [zfs-discuss] 350TB+ storage solution

2011-05-15 Thread Richard Elling
On May 15, 2011, at 10:18 AM, Jim Klimov  wrote:

> Hi, Very interesting suggestions as I'm contemplating a Supermicro-based 
> server for my work as well, but probably in a lower budget as a backup store 
> for an aging Thumper (not as its superior replacement).
> 
> Still, I have a couple of questions regarding your raidz layout 
> recommendation.
> 
> On one hand, I've read that as current drives get larger (while their random 
> IOPS/MBPS don't grow nearly as fast with new generations), it is becoming 
> more and more reasonable to use RAIDZ3 with 3 redundancy drives, at least for 
> vdevs made of many disks - a dozen or so. When a drive fails, you still have 
> two redundant parities, and with a resilver window expected to be in hours if 
> not days range, I would want that airbag, to say the least. You know, 
> failures rarely come one by one ;)

Not to worry. If you add another level of redundancy, the data protection
is improved by orders of magnitude. If the resilver time increases, the effect
on data protection is reduced by a relatively small divisor. To get some sense 
of this, the MTBF is often 1,000,000,000 hours and there are only 24 hours in
a day.

> On another hand, I've recently seen many recommendations that in a RAIDZ* 
> drive set, the number of data disks should be a power of two - so that ZFS 
>> blocks/stripes and those of its users (like databases) which are inclined 
> to use 2^N-sized blocks can be often accessed in a single IO burst across all 
> drives, and not in "one and one-quarter IO" on the average, which might delay 
> IOs to other stripes while some of the disks in a vdev are busy processing 
> leftovers of a previous request, and others are waiting for their peers.

I've never heard of this and it doesn't pass the sniff test. Can you cite a 
source? 

> In case of RAIDZ2 this recommendation leads to vdevs sized 6 (4+2), 10 (8+2) 
> or 18 (16+2) disks - the latter being mentioned in the original post.

A similar theory was disproved back in 2006 or 2007. I'd be very surprised if
there was a reliable way to predict the actual use patterns in advance. Features
like compression and I/O coalescing improve performance, but make the old
"rules of thumb" even more obsolete.

So, protect your data and if the performance doesn't meet your expectation, then
you can make adjustments.
 -- richard



Re: [zfs-discuss] 350TB+ storage solution

2011-05-15 Thread Richard Elling
On May 15, 2011, at 8:01 PM, Edward Ned Harvey 
 wrote:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Jim Klimov
>> 
>> On one hand, I've read that as current drives get larger (while their random
>> IOPS/MBPS don't grow nearly as fast with new generations), it is becoming
>> more and more reasonable to use RAIDZ3 with 3 redundancy drives, at least
>> for vdevs made of many disks - a dozen or so. When a drive fails, you still
>> have two redundant parities, and with a resilver window expected to be in
>> hours if not days range, I would want that airbag, to say the least. 
> 
> This is both an underestimation of the time required, and a sort of backward
> logic...
> 
> In all of the following, I'm assuming you're creating a pool whose primary
> storage is hard drives, not SSDs or similar.
> 
> The resilver time scales linearly with the number of slabs (blocks) in the
> degraded vdev, and depends on your usage patterns, which determine how
> randomly your data got scattered throughout the vdev upon writes.  

In all of my studies of resilvering, I have never seen a linear correlation
nor a correlation to the number of blocks. Can you share your data, or is
this another guess?

> I assume
> your choice of raid type will not determine your usage patterns.  So if you
> create a big vdev (raidz3) as opposed to a bunch of smaller ones (mirrors)
> the resilver time is longer for the large vdev. 
> 
> Also, even in the best case scenario (mirrors) assuming you have a pool
> that's reasonably full (say, 50% to 70%) the resilver time is likely to take
> several times longer than a complete sequential read/write of the entire
> disk.

Several times isn't significant wrt data protection. 10x to 100x is significant.

>  In one of my systems, I have 1TB mirrors, 70% full, which can be
> sequentially completely read/written in 2 hrs.  But the resilver took 12
> hours of idle time.  Supposing you had a 70% full pool of raidz3, 2TB disks,
> using 10 disks + 3 parity, and a usage pattern similar to mine, your
> resilver time would have been minimum 10 days,

bollix

> likely approaching 20 or 30
> days.  (Because you wouldn't get 2-3 weeks of consecutive idle time, and the
> random access time for a raidz approaches 2x the random access time of a
> mirror.)

totally untrue

> BTW, the reason I chose 10+3 disks above was just because it makes
> calculation easy.  It's easy to multiply by 10.  I'm not suggesting using
> that configuration.  You may notice that I don't recommend raidz for most
> situations.  I endorse mirrors because they minimize resilver time (and
> maximize performance in general).  Resilver time is a problem for ZFS, which
> they may fix someday.

Resilver time is not a significant problem with ZFS. Resilver time is a much
bigger problem with traditional RAID systems. In any case, it is bad systems
engineering to optimize a system for best resilver time.
 -- richard




Re: [zfs-discuss] 350TB+ storage solution

2011-05-15 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Jim Klimov
> 
> On one hand, I've read that as current drives get larger (while their random
> IOPS/MBPS don't grow nearly as fast with new generations), it is becoming
> more and more reasonable to use RAIDZ3 with 3 redundancy drives, at least
> for vdevs made of many disks - a dozen or so. When a drive fails, you still
> have two redundant parities, and with a resilver window expected to be in
> hours if not days range, I would want that airbag, to say the least. 

This is both an underestimation of the time required, and a sort of backward
logic...

In all of the following, I'm assuming you're creating a pool whose primary
storage is hard drives, not SSDs or similar.

The resilver time scales linearly with the number of slabs (blocks) in the
degraded vdev, and depends on your usage patterns, which determine how
randomly your data got scattered throughout the vdev upon writes.  I assume
your choice of raid type will not determine your usage patterns.  So if you
create a big vdev (raidz3) as opposed to a bunch of smaller ones (mirrors)
the resilver time is longer for the large vdev. 

Also, even in the best case scenario (mirrors) assuming you have a pool
that's reasonably full (say, 50% to 70%) the resilver time is likely to take
several times longer than a complete sequential read/write of the entire
disk.  In one of my systems, I have 1TB mirrors, 70% full, which can be
sequentially completely read/written in 2 hrs.  But the resilver took 12
hours of idle time.  Supposing you had a 70% full pool of raidz3, 2TB disks,
using 10 disks + 3 parity, and a usage pattern similar to mine, your
resilver time would have been minimum 10 days, likely approaching 20 or 30
days.  (Because you wouldn't get 2-3 weeks of consecutive idle time, and the
random access time for a raidz approaches 2x the random access time of a
mirror.)

BTW, the reason I chose 10+3 disks above was just because it makes
calculation easy.  It's easy to multiply by 10.  I'm not suggesting using
that configuration.  You may notice that I don't recommend raidz for most
situations.  I endorse mirrors because they minimize resilver time (and
maximize performance in general).  Resilver time is a problem for ZFS, which
they may fix someday.
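
For anyone curious how many blocks their own pool holds - the population the
resilver has to walk - zdb can traverse the pool and print block statistics.
The traversal itself is slow and I/O-heavy on a big pool, the pool name below
is a placeholder, and whether resilver time really scales linearly with that
count is disputed elsewhere in this thread:

   # walk the pool and print block counts and sizes (read-only, but slow)
   zdb -b tank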



Re: [zfs-discuss] 350TB+ storage solution

2011-05-15 Thread Jim Klimov
Hi, Very interesting suggestions as I'm contemplating a Supermicro-based server 
for my work as well, but probably in a lower budget as a backup store for an 
aging Thumper (not as its superior replacement).

Still, I have a couple of questions regarding your raidz layout recommendation.

On one hand, I've read that as current drives get larger (while their random 
IOPS/MBPS don't grow nearly as fast with new generations), it is becoming more 
and more reasonable to use RAIDZ3 with 3 redundancy drives, at least for vdevs 
made of many disks - a dozen or so. When a drive fails, you still have two 
redundant parities, and with a resilver window expected to be in hours if not 
days range, I would want that airbag, to say the least. You know, failures 
rarely come one by one ;)

On another hand, I've recently seen many recommendations that in a RAIDZ* drive 
set, the number of data disks should be a power of two - so that ZFS 
blocks/stripes and those of its users (like databases) which are inclined to 
use 2^N-sized blocks can be often accessed in a single IO burst across all 
drives, and not in "one and one-quarter IO" on the average, which might delay 
IOs to other stripes while some of the disks in a vdev are busy processing 
leftovers of a previous request, and others are waiting for their peers.

In case of RAIDZ2 this recommendation leads to vdevs sized 6 (4+2), 10 (8+2) or 
18 (16+2) disks - the latter being mentioned in the original post.
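
For what it's worth, the arithmetic usually given for that recommendation is
just how the default 128KiB record divides across the data disks - it says
nothing about whether the effect matters in practice:

   # 128KiB record over 8 data disks (10-wide raidz2): an even, sector-aligned split
   echo $((128 * 1024 / 8))      # 16384 bytes per disk
   # 128KiB record over 10 data disks (12-wide raidz2): 13107.2 bytes per disk,
   # which is not a whole number of 512-byte sectors
   echo $((128 * 1024 / 10))     # prints 13107 (integer division)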

Did you consider this aspect or test if the theoretical warnings are valid?

Thanks,
//Jim Klimov