Re: [zfs-discuss] How to manage scrub priority or defer scrub?

2010-05-01 Thread Lutz Schumann
I was going through this posting and it seems that there is some personal
tension :).

However, going back to the technical problem of scrubbing a 200 TB pool, I
think this issue needs to be addressed.

One warning up front: this post is rather long; if you would like to jump to
the part dealing with scrub, skip ahead to the scrub implementation part below.

From my perspective: 

  - ZFS is great for huge amounts of data 

That's what it was made for, with 128-bit addressing and JBOD design in mind.
So ZFS is perfect for internet multimedia in terms of scalability.

  - ZFS is great for commodity hardware

OK, you should use 24x7-rated drives, but 2 TB 7200 rpm disks are fine for
internet media mass storage. We want huge amounts of data stored, and in the
internet age nobody pays for this, so you must use low-cost hardware (well, it
must be compatible) - but you should not need enterprise components - that's
what we have ZFS as clever software for. For mass storage internet services,
the alternative is NOT EMC or NetApp (remember, nobody pays a lot for the
service because you can get it for free at Google) - the alternative is
Linux-based HW RAID (with its well-known limitations) and home-grown
solutions. Those do not have the nice ZFS features mentioned below.

  - ZFS guarantees data integrity by self-healing silent data corruption
(that's what the checksums are for) - but only if you have redundancy.

There are a lot of posts on the net describing when people notice bad blocks
- it happens when a disk in a RAID5 set fails and they have to resilver
everything. That is when you discover the missing redundancy. So people use
RAID6 and hope that everything works, or they run scrubs on their advanced
RAID controllers (if those provide internal checksumming).

The same problem exists for huge, passive, raidz1 data sets in ZFS. If you do
not scrub the pool regularly, the chances are higher that you will hit a bad
block during resilvering, and then ZFS cannot help. For active data sets the
problem is not as critical, because the checksum is verified on every read -
but the problem still exists, because once data is in the ARC cache nobody
checks it again. So we need scrub!
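To put a rough number on that risk, here is a back-of-the-envelope sketch
(Python, purely illustrative; the unrecoverable-read-error rates are assumed
spec-sheet figures, not measurements):

    # Chance of hitting at least one unreadable sector while reading a
    # given amount of data, assuming independent errors at the quoted
    # URE rate (1e-15 per bit is a typical enterprise SATA spec,
    # 1e-14 a typical consumer spec -- both assumptions here).
    def p_read_error(tb_to_read, ure_per_bit=1e-15):
        bits = tb_to_read * 8e12              # decimal TB -> bits
        return 1 - (1 - ure_per_bit) ** bits

    for tb in (2, 20, 200):
        print(tb, "TB read:", round(p_read_error(tb) * 100, 1), "%")

The exact figures matter less than the trend: the more unverified data you
have to read back during a resilver, the more likely a latent error bites -
which is exactly what regular scrubs are meant to catch early.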

  - ZFS can do many other nice things 

There is compression, dedup, etc.; however, I look at them as nice-to-haves.

  - ZFS needs proper pool design 

Using ZFS right is not easy, and sizing the system is even more complicated.
There are a lot of threads regarding pool design - the easiest answer is to
use a lot of mirrors, because then read performance really scales. However,
for internet mass media services you can't - it is too expensive, because
mirrored ZFS costs more than HW RAID6 on Linux. How many members per vdev?
Multiple pools or a single pool?

  - ZFS is open and community based 

... well, let's see how this goes with Oracle financing the whole thing :)

And some of those points make ZFS a hit for internet service providers and
mass media requirements (VOD etc.)!

So what's my point, you may ask?

My experience with ZFS is that some points are simply not addressed well
enough yet - BUT - ZFS is a living piece of software, and thanks to the many
great people developing it, it evolves faster than all the other storage
solutions. So for the longer term I believe ZFS will (hopefully) gain all the
enterprise-ness it needs, and it will revolutionize the storage industry (like
Cisco did for networking). I really believe that.

From my perspective some of the points not addressed well in ZFS are:

  - pool defragmentation - you need this for a COW filesystem 

I think the ZFS developers are working on this with the background rewriter,
so I hope it will arrive in 2010. With the rewriter, the on-disk layout can be
optimized for read performance of sequential workloads - also for raidz1 and
raidz2 - meaning ZFS can compete with RAID5 and RAID6, even with wider vdevs.
And wider vdevs mean more effective capacity. If the vdev read-ahead cache
works nicely with a sequentially aligned on-disk layout, then (from-disk) read
performance will be great.

  - I/O prioritization for zvols / ZFS filesystems (aka Storage QoS)

Unfortunately you cannot prioritize I/O to ZFS filesystems and zvols right
now. I think this is one of the features whose absence makes ZFS unsuitable
for 1st tier storage (like EMC Symmetrix or NetApp FAS6000 series). You need
prioritization here - because your SAP system really is more important than my
MP3 web server :)

  - Deduplication not ready for production

Currently dedup is nice, but the DDT (dedup table) handling and memory sizing
is tricky and hardly usable for larger pools (my perspective). The DDT is
handled like any other component - meaning user I/O can push the DDT out of
the ARC (and the L2ARC) - even with primarycache=secondarycache=metadata. For
typical mass media storage applications, the working set is much larger than
the memory (and L2ARC), meaning your DDT will come from disk - causing real
performance degradation.

This is especially true for COMSTAR 
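To get a feel for why DDT sizing is so tricky at this scale, a
back-of-the-envelope estimate (a sketch only; the ~320 bytes per in-core DDT
entry is a commonly quoted ballpark, and the block sizes are assumptions):

    # Rough in-core DDT footprint: one entry per unique block in the pool.
    def ddt_ram_gb(pool_tb, avg_block_kb=128, bytes_per_entry=320):
        unique_blocks = pool_tb * 1e12 / (avg_block_kb * 1024)
        return unique_blocks * bytes_per_entry / 2**30

    for tb, blk in ((10, 128), (200, 128), (200, 8)):
        print(tb, "TB at", blk, "KB blocks: ~", round(ddt_ram_gb(tb, blk)), "GB of DDT")

With a working set far bigger than RAM plus L2ARC, those entries end up being
read from the pool itself, which is the performance degradation described
above.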

Re: [zfs-discuss] How to manage scrub priority or defer scrub?

2010-03-19 Thread Tonmaus
 
   sata disks don't understand the prioritisation, so

 Er, the point was exactly that there is no discrimination, once the
 request is handed to the disk.

So, are you saying that SCSI drives do understand prioritisation (i.e. TCQ 
supports the schedule from ZFS) while SATA/NCQ drives don't, or does it just 
boil down to what Richard told us, that SATA disks are too slow?

 If the internal-to-disk queue is enough to keep the heads saturated /
 seek bound, then a new high-priority-in-the-kernel request will get to
 the disk sooner, but may languish once there.

Thanks. That makes sense to me.


 
 You can shorten the number of outstanding IO's per vdev for the pool
 overall, or preferably the number scrub will generate (to avoid
 penalising all IO).

That sounds like a meaningful approach to addressing bottlenecks caused by 
zpool scrub to me.

 The tunables for each of these should be found readily, probably in
 the Evil Tuning Guide.

I think I should try to digest the Evil Tuning Guide at some point with respect 
to this topic. Thanks for pointing me in a direction. Maybe what you have 
suggested above (shortening the number of I/Os issued by scrub) is already 
possible? If not, I think it would be a meaningful improvement to request.

 Disks with write cache effectively do this [command queuing] for
 writes, by pretending they complete immediately, but reads would block
 the channel until satisfied.  (This is all for ATA which lacked this,
 before NCQ. SCSI has had these capabilities for a long time).

As scrub is about reads, are you saying that this is still a problem with 
SATA/NCQ drives, or not? I am unsure what you mean at this point.

   limiting the number of concurrent IO's handed to the disk to try
   and avoid saturating the heads.
  
  Indeed, that was what I had in mind. With the addition that I think
  it is as well necessary to avoid saturating other components, such
  as CPU.
 
 Less important, since prioritisation can be applied there too, but
 potentially also an issue.  Perhaps you want to keep the cpu fan
 speed/noise down for a home server, even if the scrub runs longer.

Well, the only thing that was really remarkable while scrubbing was CPU load 
constantly near 100%. I still think that is at least contributing to the 
collapse of concurrent payload. I.e., it's all about services that take place 
in the kernel: CIFS, ZFS, iSCSI. Mostly it is about concurrent load within ZFS 
itself. That means an implicit trade-off while a file is being served over 
CIFS, for example.

 
 AHCI should be fine.  In practice if you see actv > 1 (with a small
 margin for sampling error) then ncq is working.

OK, and how does that look with respect to mpt? My assumption that mpt 
supports NCQ is mainly based on the marketing information provided by LSI that 
these controllers offer NCQ support with SATA drives. How (with which tool) do 
I get at this actv parameter?

Regards,

Tonmaus
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to manage scrub priority or defer scrub?

2010-03-18 Thread Richard Elling
On Mar 16, 2010, at 4:41 PM, Tonmaus wrote:
 Are you sure that you didn't also enable
 something which 
 does consume lots of CPU such as enabling some sort
 of compression, 
 sha256 checksums, or deduplication?
 
 None of them is active on that pool or in any existing file system. Maybe the 
 issue is particular to RAIDZ2, which is comparably recent. On that occasion: 
 does anybody know if ZFS reads all parities during a scrub?

Yes

 Wouldn't it be sufficient for stale corruption detection to read only one 
 parity set unless an error occurs there?

No, because the parity itself is not verified.

 The main concern that one should have is I/O
 bandwidth rather than CPU 
 consumption since software based RAID must handle
 the work using the 
 system's CPU rather than expecting it to be done by
 some other CPU. 
 There are more I/Os and (in the case of mirroring)
 more data 
 transferred.
 
 What I am trying to say is that CPU may become the bottleneck for I/O in case 
 of parity-secured stripe sets. Mirrors and simple stripe sets have almost 0 
 impact on CPU. So far at least my observations. Moreover, x86 processors not 
 optimized for that kind of work as much as i.e. an Areca controller with a 
 dedicated XOR chip is, in its targeted field.

All x86 processors you care about do XOR at memory bandwidth speed.
XOR is one of the simplest instructions to implement on a microprocessor.
The need for a dedicated XOR chip for older hardware RAID systems is
because they use very slow processors with low memory bandwidth. Cheap
is as cheap does :-)

However, the issue for raidz2 and above (including RAID-6) is that the 
second parity is a more computationally complex Reed-Solomon code, 
not a simple XOR. So there is more computing required and that would be 
reflected in the CPU usage.
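For illustration, a minimal sketch of that difference (Python, RAID-6-style
P/Q parity over GF(2^8); this shows the general technique, not the raidz
source):

    # Single parity (P) is a plain XOR; the second parity (Q) uses
    # Galois-field GF(2^8) arithmetic, RAID-6 style.

    def gf_mul(a, b, poly=0x11d):             # multiply in GF(2^8)
        r = 0
        while b:
            if b & 1:
                r ^= a
            a <<= 1
            if a & 0x100:
                a ^= poly
            b >>= 1
        return r

    def parities(stripe):                     # stripe: equal-size byte blocks
        p = bytearray(len(stripe[0]))
        q = bytearray(len(stripe[0]))
        for i, block in enumerate(stripe):
            g = 1
            for _ in range(i):                # g = 2**i in GF(2^8)
                g = gf_mul(g, 2)
            for j, byte in enumerate(block):
                p[j] ^= byte                  # P: one XOR per byte
                q[j] ^= gf_mul(g, byte)       # Q: a GF multiply plus an XOR
        return bytes(p), bytes(q)

    print(parities([b"\x01\x02", b"\x10\x20", b"\x0f\x0e"]))

The P column costs one XOR per byte; the Q column adds a Galois-field multiply
per byte, which is where the extra CPU work for double parity comes from (real
implementations use lookup tables or SIMD to keep it cheap).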
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Atlanta, March 16-18, 2010 http://nexenta-atlanta.eventbrite.com 
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to manage scrub priority or defer scrub?

2010-03-18 Thread Tonmaus
  On that occasion: does anybody know if ZFS reads all parities during
  a scrub?
 
  Yes
 
   Wouldn't it be sufficient for stale corruption detection to read
   only one parity set unless an error occurs there?
 
  No, because the parity itself is not verified.

Aha. Well, my understanding was that a scrub basically means reading all the 
data and comparing it with the parities, which means that these have to be 
re-computed. Is that correct?

Regards,

Tonmaus
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to manage scrub priority or defer scrub?

2010-03-18 Thread Daniel Carosone
On Thu, Mar 18, 2010 at 05:21:17AM -0700, Tonmaus wrote:
  No, because the parity itself is not verified.
 
 Aha. Well, my understanding was that a scrub basically means reading
 all data, and compare with the parities, which means that these have
 to be re-computed. Is that correct? 

A scrub does, yes. It reads all data and metadata and checksums and
verifies they're correct.

A read of the pool might not - for example, it might:
 - read only one side of a mirror
 - read only one instance of a ditto block (metadata or copies>1)
 - use cached copies of data or metadata; for a long-running system it
   might be a long time since some metadata blocks were ever read, if
   they're frequently used.

Roughly speaking, reading through the filesystem does the least work
possible to return the data. A scrub does the most work possible to
check the disks (and returns none of the data). 

For the OP:  scrub issues low-priority IO (and the details of how much
and how low have changed a few times along the version trail).
However, that prioritisation applies only within the kernel; sata disks
don't understand the prioritisation, so once the requests are with the
disk they can still saturate out other IOs that made it to the front
of the kernel's queue faster.  If you're looking for something to
tune, you may want to look at limiting the number of concurrent IO's
handed to the disk to try and avoid saturating the heads.  

You also want to confirm that your disks are on an NCQ-capable
controller (eg sata rather than cmdk) otherwise they will be severely
limited to processing one request at a time, at least for reads if you
have write-cache on (they will be saturated at the stop-and-wait
channel, long before the heads). 

--
Dan.

pgpoGKGntteaH.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to manage scrub priority or defer scrub?

2010-03-18 Thread Tonmaus
Hello Dan,

Thank you very much for this interesting reply.

 Roughly speaking, reading through the filesystem does the least work
 possible to return the data. A scrub does the most work possible to
 check the disks (and returns none of the data).

Thanks for the clarification. That's what I had thought.

 
 For the OP:  scrub issues low-priority IO (and the details of how
 much and how low have changed a few times along the version trail).

Is there any documentation about this, besides source code?

 However, that prioritisation applies only within the kernel; sata
 disks don't understand the prioritisation, so once the requests are
 with the disk they can still saturate out other IOs that made it to
 the front of the kernel's queue faster.

I am not sure what you are hinting at. I initially thought about TCQ vs. NCQ 
when I read this. But I am not sure which detail of TCQ would allow for I/O 
discrimination that NCQ doesn't have. All I know about command queuing is that 
it is about optimising DMA strategies and optimising the handling of the I/O 
requests currently issued, with respect to what to do first to return all the 
data in the least possible time. (??)

 If you're looking for something to tune, you may want to look at
 limiting the number of concurrent IO's handed to the disk to try and
 avoid saturating the heads.

Indeed, that was what I had in mind. With the addition that I think it is as 
well necessary to avoid saturating other components, such as CPU.
 
 
 You also want to confirm that your disks are on an NCQ-capable
 controller (eg sata rather than cmdk) otherwise they will be severely
 limited to processing one request at a time, at least for reads if you
 have write-cache on (they will be saturated at the stop-and-wait
 channel, long before the heads).

I have two systems here: a production system that is on LSI SAS (mpt) 
controllers, and another one that is on ICH-9 (ahci). Disks are SATA-2. The 
plan was that this combination would have NCQ support. On the other hand, do 
you know of a method to verify that it is functioning?

Best regards,

Tonmaus
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to manage scrub priority or defer scrub?

2010-03-18 Thread Daniel Carosone
On Thu, Mar 18, 2010 at 09:54:28PM -0700, Tonmaus wrote:
  (and the details of how much and how low have changed a few times
  along the version trail).  
 
 Is there any documentation about this, besides source code?

There are change logs and release notes, and random blog postings
along the way - they're less structured but often more informative.
There were some good descriptions about the scrub improvements 6-12
months ago.  The bugid's listed in change logs that mention scrub
should be pretty simple to find and sequence with versions.

  However, that prioritisation applies only within the kernel; sata
  disks don't understand the prioritisation, so once the requests
  are with the disk they can still saturate out other IOs that made
  it to the front of the kernel's queue faster.  
 
 I am not sure what you are hinting at. I initially thought about TCQ
 vs. NCQ when I read this. But I am not sure which detail of TCQ
 would allow for I/O discrimination that NCQ doesn't have. 

Er, the point was exactly that there is no discrimination, once the
request is handed to the disk.  If the internal-to-disk queue is
enough to keep the heads saturated / seek bound, then a new
high-priority-in-the-kernel request will get to the disk sooner, but
may languish once there.   

You'll get best overall disk throughput by letting the disk firmware
optimise seeks, but your priority request won't get any further
preference. 

Shortening the list of requests handed to the disk in parallel may
help, and still keep the channel mostly busy, perhaps at the expense
of some extra seek length and lower overall throughput.

You can shorten the number of outstanding IO's per vdev for the pool
overall, or preferably the number scrub will generate (to avoid
penalising all IO).  The tunables for each of these should be found
readily, probably in the Evil Tuning Guide.
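For what it's worth, a toy model of the effect being described (Python, purely
illustrative - not ZFS code; the tunable names in the comment,
zfs_vdev_max_pending and zfs_scrub_limit, are from memory of the Evil Tuning
Guide of this era, so verify them against your build before touching
/etc/system):

    # The kernel dispatches the highest-priority waiting request whenever
    # a device-queue slot frees, but the disk services its internal queue
    # in seek-optimised (here: LBA-sorted) order and ignores priority.
    # The only lever is the queue depth, which is what tunables such as
    # zfs_vdev_max_pending / zfs_scrub_limit constrain (assumed names).
    import random

    def scrub_ios_before_app_read(max_pending, scrub_ios=200, seed=1):
        random.seed(seed)
        kernel = [("scrub", random.randrange(10**6)) for _ in range(scrub_ios)]
        kernel.append(("app", random.randrange(10**6)))    # one priority read
        kernel.sort(key=lambda r: r[0] != "app")           # kernel: app first
        disk, done = [], 0
        while True:
            while kernel and len(disk) < max_pending:      # refill device queue
                disk.append(kernel.pop(0))
            disk.sort(key=lambda r: r[1])                  # disk picks nearest LBA
            if disk.pop(0)[0] == "app":
                return done
            done += 1

    for depth in (32, 10, 4, 1):
        print("queue depth", depth, "->", scrub_ios_before_app_read(depth),
              "scrub I/Os serviced before the priority read")

With a deep device queue the priority read waits behind far more scrub I/O
even though the kernel dispatched it first; with a shallow queue it gets
through quickly, at the cost of giving the firmware less to optimise seeks
with.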

  All I know about command queuing is that it is about optimising DMA
  strategies and optimising the handling of the I/O requests currently
  issued, with respect to what to do first to return all the data in
  the least possible time. (??)

Mostly, as above it's about giving the disk controller more than one
thing to work on at a time, and having the issuance of a request and
its completion overlap with others, so the head movement can be
optimised and the controller channel can be busy with data transfer
for another while seeking.

Disks with write cache effectively do this for writes, by pretending
they complete immediately, but reads would block the channel until
satisfied.  (This is all for ATA which lacked this, before NCQ. SCSI
has had these capabilities for a long time).

  If you're looking for something to tune, you may want to look at
  limiting the number of concurrent IO's handed to the disk to try
  and avoid saturating the heads.
 
 Indeed, that was what I had in mind. With the addition that I think
 it is as well necessary to avoid saturating other components, such
 as CPU.  

Less important, since prioritisation can be applied there too, but
potentially also an issue.  Perhaps you want to keep the cpu fan
speed/noise down for a home server, even if the scrub runs longer.

 I have two systems here, a production system that is on LSI SAS
 (mpt) controllers, and another one that is on ICH-9 (ahci). Disks
 are SATA-2. The plan was that this combo will have NCQ support. On
 the other hand, do you know if there a method to verify if its
 functioning? 

AHCI should be fine.  In practice if you see actv > 1 (with a small
margin for sampling error) then ncq is working.
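If it helps, a quick way to watch that column (a sketch; it assumes the usual
Solaris iostat -xn layout with actv as the sixth field and the device name
last - adjust if your build prints differently):

    # Run iostat -xn twice (first sample is since boot, second covers 5 s)
    # and report devices whose average active queue depth exceeds 1.
    import subprocess

    out = subprocess.run(["iostat", "-xn", "5", "2"],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        cols = line.split()
        if len(cols) == 11 and cols[0][0].isdigit():   # data rows only
            actv, device = float(cols[5]), cols[-1]
            if actv > 1:
                print(device, "actv =", actv)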

--
Dan.



pgpIQ2VrNVyJl.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to manage scrub priority or defer scrub?

2010-03-17 Thread Bob Friesenhahn

On Tue, 16 Mar 2010, Tonmaus wrote:


None of them is active on that pool or in any existing file system. 
Maybe the issue is particular to RAIDZ2, which is comparably recent. 
On that occasion: does anybody know if ZFS reads all parities during 
a scrub? Wouldn't it be sufficient for stale corruption detection to 
read only one parity set unless an error occurs there?


Zfs scrub reads and verifies everything.  That is its purpose.

What I am trying to say is that CPU may become the bottleneck for 
I/O in case of parity-secured stripe sets. Mirrors and simple stripe 
sets have almost 0 impact on CPU. So far at least my observations. 
Moreover, x86 processors not optimized for that kind of work as much 
as i.e. an Areca controller with a dedicated XOR chip is, in its 
targeted field.


It would be astonishing if the XOR algorithm consumed very much CPU 
with modern CPUs.  Zfs's own checksum is more brutal than XOR.  The 
scrub re-assembles full (usually 128K) data blocks and verifies the 
zfs checksum.
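For a rough sense of the relative cost, a simplified sketch of the well-known
fletcher4 algorithm that ZFS uses for checksumming (not the ZFS source;
checksum defaults vary by build) - four cascaded 64-bit additions per 32-bit
word, versus the single XOR per byte that first parity needs:

    import struct

    def fletcher4(buf):
        # Four running 64-bit accumulators, updated once per 32-bit word.
        a = b = c = d = 0
        for (w,) in struct.iter_unpack("<I", buf):   # little-endian words
            a = (a + w) & 0xFFFFFFFFFFFFFFFF
            b = (b + a) & 0xFFFFFFFFFFFFFFFF
            c = (c + b) & 0xFFFFFFFFFFFFFFFF
            d = (d + c) & 0xFFFFFFFFFFFFFFFF
        return a, b, c, d

    print(fletcher4(bytes(range(256)) * 512))        # one 128 KB block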


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to manage scrub priority or defer scrub?

2010-03-17 Thread Tonmaus
Hi,

I got a message from you off-list that doesn't show up in the thread even 
after hours. As you also touched on the aspect I'd like to respond to here, 
I'll do it from here:

 Third, as for ZFS scrub prioritization, Richard
 answered your question about that.  He said it is
 low priority and can be tuned lower.  However, he was
 answering within the context of an 11 disk RAIDZ2
 with slow disks.  His exact words were:
 
 
 This could be tuned lower, but your storage
 is slow and *any* I/O activity will be
 noticed.

Richard told us twice that scrub is already as low in priority as it can be. 
From another message:

Scrub is already the lowest priority. Would you like it to be lower?

=

As for the comparison between slow and fast storage: I have understood 
Richard's message to be that with storage providing better random I/O, ZFS 
priority scheduling will perform significantly better, causing less 
degradation of concurrent load. While I am even inclined to buy that, nobody 
will be able to tell me how a certain system will behave until it is tested, 
and to what degree concurrent scrubbing will still be possible.
Another thing: people are talking a lot about narrow vdevs and mirrors. 
However, when you need to build a 200 TB pool you end up with a lot of disks 
in the first place. You will need at least double failure resilience for such 
a pool. If one were to do that with mirrors, ending up with approx. 600 TB 
gross to provide 200 TB net capacity is definitely NOT an option.

Regards,

Tonmaus
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to manage scrub priority or defer scrub?

2010-03-17 Thread Khyron
Ugh!  I meant that to go to the list, so I'll probably re-send it for the
benefit of everyone involved in the discussion.  There were parts of that that
I wanted others to read.

From a re-read of Richard's e-mail, maybe he meant that the number of I/Os
queued to a device can be tuned lower and not the priority of the scrub (as I
took him to mean).  Hopefully Richard can clear that up.  I personally stand
corrected for mis-reading Richard there.

Of course the performance of a given system cannot be described until it is
built.  Again, my interpretation of your e-mail was that you were looking for
a model for the performance of concurrent scrub and I/O load of a RAIDZ2 VDEV
that you could scale up from your test environment of 11 disks to a 200+ TB
behemoth.  As I mentioned several times, I doubt such a model exists, and I
have not seen anything published to that effect.  I don't know how useful it
would be if it did exist because the performance of your disks would be a
critical factor.  (Although *any* model beats no model any day.)
Let's just face it.  You're using a new storage system that has not been
modeled.  To get the model you seek, you will probably have to create it
yourself.

(It's notable that most of the ZFS models that I have seen have been done
by Richard.  Of course, they were MTTDL models, not scrub vs. I/O
performance models for different VDEV types.)

As for your point about building large pools from lots of mirror VDEVs, my
response is meh.  I've said several times, and maybe you've missed it
several times, that there may be pathologies for which YOU should open
bugs.  RAIDZ3 may exhibit the same kind of pathologies you observed with
RAIDZ2.  Apparently RAIDZ does not.  I've also noticed (and I'm sure I'll
be corrected if I'm mistaken) that there is not a limit on the number of
VDEVs in a pool but single digit RAID VDEVs are recommended.  So there
is nothing preventing you from building (for example) VDEVs from 1 TB
disks.  If you take 9 x 1 TB disks per VDEV, and use RAIDZ2, you get 7 TB
usable.  That means about 29 VDEVs to get 200 TB.  Double the disk
capacity and you can probably get to 15 top level VDEVs.  (And you'll want
that RAIDZ2 as well since I don't know if you could trust that many disks,
whether enterprise or consumer.)  However, that number of top level VDEVs
sounds reasonable based on what others have reported.  What's been
proven to be A Bad Idea(TM) is putting lots of disks in a single VDEV.
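The arithmetic behind those vdev counts, for anyone who wants to play with the
numbers (a sketch in decimal TB, ignoring ZFS overhead and hot spares):

    import math

    # Usable capacity per raidz2 vdev is (width - 2 parity disks) * disk
    # size; round the vdev count up to reach the target capacity.
    def layout(target_tb, disks_per_vdev, disk_tb, parity=2):
        usable_per_vdev = (disks_per_vdev - parity) * disk_tb
        vdevs = math.ceil(target_tb / usable_per_vdev)
        return vdevs, vdevs * disks_per_vdev

    print(layout(200, 9, 1))   # -> (29, 261): 29 vdevs, 261 x 1 TB disks
    print(layout(200, 9, 2))   # -> (15, 135): 15 vdevs, 135 x 2 TB disks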

Remember that ZFS is a *new* software system.  It is complex.  It will have
bugs.  You have chosen ZFS; it didn't choose you.  So I'd say you can
contribute to the community by reporting back your experiences, opening
bugs on things which make sense to open bugs on, testing configurations,
modeling, documenting and sharing.  So far, you just seem to be interested
in taking w/o so much as an offer of helping the community or developers to
understand what works and what doesn't.  All take and no give is not cool.
And if you don't like ZFS, then choose something else.  I'm sure EMC or
NetApp will willingly sell you all the spindles you want.  However, I think
it is
still early to write off ZFS as a losing proposition, but that's my opinion.

So far, you seem to be spending a lot of time complaining about a *new*
software system that you're not paying for.  That's pretty tasteless, IMO.

And now I'll re-send that e-mail...

P.S.: Did you remember to re-read this e-mail?  Read it 2 or 3 times and be
clear about what I said and what I did _not_ say.

On Wed, Mar 17, 2010 at 16:12, Tonmaus sequoiamo...@gmx.net wrote:

 Hi,

 I got a message from you off-list that doesn't show up in the thread even
 after hours. As you mentioned the aspect here as well I'd like to respond
 to, I'll do it from here:

  Third, as for ZFS scrub prioritization, Richard
  answered your question about that.  He said it is
  low priority and can be tuned lower.  However, he was
  answering within the brcontext of an 11 disk RAIDZ2
  with slow disks  His exact words were:
 
 
  This could be tuned lower, but your storage
  is slow and *any* I/O activity will be
  noticed.

 Richard told us two times that scrub already is as low in priority as can
 be. From another message:

 Scrub is already the lowest priority. Would you like it to be lower?


 =

 As much as the comparison goes between slow and fast storage. I have
 understood that Richard's message was that with storage providing better
 random I/O zfs priority scheduling will perform significantly better,
 providing less degradation of concurrent load. While I am even inclined to
 buy that, nobody will be able to tell me how a certain system will behave
 until it was tested, and to what degree concurrent scrubbing still will be
 possible.
 Another thing: people are talking a lot about narrow vdevs and mirrors.
 However, when you need to build a 200 TB pool you end up with a lot 

Re: [zfs-discuss] How to manage scrub priority or defer scrub?

2010-03-17 Thread Khyron
For those following along, this is the e-mail I meant to send to the list but
instead sent directly to Tonmaus.  My mistake, and I apologize for having to
re-send.

=== Start ===

My understanding, limited though it may be, is that a scrub touches ALL data
that has been written, including the parity data.  It confirms the validity
of every bit that has been written to the array.  Now, there may be an
implementation detail that is responsible for the pathology that you
observed.  More than likely, I'd imagine.  Filing a bug may be in order.
Since triple parity RAIDZ exists now, you may want to test with that by
grabbing a LiveCD or LiveUSB image from genunix.org.  Maybe RAIDZ3 has the
same (or worse) problems?

As for scrub management, I pointed out the specific responses from Richard
where he noted that scrub I/O priority *can* be tuned.  How you do that, I'm
not sure.  Richard, how does one tune scrub I/O priority?  Other than that, as
I said, I don't think there is a model (publicly available anyway) describing
scrub behavior and how it scales with pool size (< 5 TB, 5 TB - 50 TB, > 50
TB, etc.) or data layout (mirror vs. RAIDZ vs. RAIDZ2).  ZFS is really that
new, that all of this needs to be reconsidered and modeled.  Maybe this is
something you can contribute to the community?  ZFS is a new storage system,
not the same old file systems whose behaviors and quirks are well known
because of 20+ years of history.  We're all writing a new chapter in data
storage here, so it is incumbent upon us to share knowledge in order to answer
these types of questions.

I think the questions I raised in my longer response are also valid and need
to be re-considered.  There are large pools in production today.  So how are
people scrubbing these pools?  Please post your experiences with scrubbing
100+ TB pools.

Tonmaus, maybe you should repost my other questions in a new, separate
thread?

=== End ===

On Tue, Mar 16, 2010 at 19:41, Tonmaus sequoiamo...@gmx.net wrote:

  Are you sure that you didn't also enable
  something which
  does consume lots of CPU such as enabling some sort
  of compression,
  sha256 checksums, or deduplication?

 None of them is active on that pool or in any existing file system. Maybe
 the issue is particular to RAIDZ2, which is comparably recent. On that
 occasion: does anybody know if ZFS reads all parities during a scrub?
 Wouldn't it be sufficient for stale corruption detection to read only one
 parity set unless an error occurs there?

  The main concern that one should have is I/O
  bandwidth rather than CPU
  consumption since software based RAID must handle
  the work using the
  system's CPU rather than expecting it to be done by
  some other CPU.
  There are more I/Os and (in the case of mirroring)
  more data
  transferred.

 What I am trying to say is that CPU may become the bottleneck for I/O in
 case of parity-secured stripe sets. Mirrors and simple stripe sets have
 almost 0 impact on CPU. So far at least my observations. Moreover, x86
 processors not optimized for that kind of work as much as i.e. an Areca
 controller with a dedicated XOR chip is, in its targeted field.

 Regards,

 Tonmaus
 --
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




-- 
You can choose your friends, you can choose the deals. - Equity Private

If Linux is faster, it's a Solaris bug. - Phil Harman

Blog - http://whatderass.blogspot.com/
Twitter - @khyron4eva
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to manage scrub priority or defer scrub?

2010-03-16 Thread Tonmaus
Hi Richard,

  - scrubbing the same pool, configured as raidz1 didn't max out CPU
 which is no surprise (haha, slow storage...) the notable part is that
 it didn't slow down payload that much either.
 
 raidz creates more, smaller writes than a mirror or simple stripe. If
 the disks are slow, then the IOPS will be lower and the scrub takes
 longer, but the I/O scheduler can manage the queue better (disks are
 slower).

This wasn't mirror vs. raidz but raidz1 vs. raidz2, where the latter maxes out 
CPU and the former maxes out physical disk I/O. Concurrent payload degradation 
doesn't seem as extreme on raidz1 pools. Hence the CPU theory that you still 
seem reluctant to follow.


 There are several bugs/RFEs along these lines, something like:
 http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6743992

Thanks for pointing this out. As it seems, it has been a problem for a couple 
of years already. Obviously, the opinion is shared that this is a management 
problem, not a HW issue.

As a project manager I will soon have to make a purchase decision for an 
archival storage system (A/V media), and one of the options we are looking 
into is a SAM-FS/QFS solution including disk tiers with ZFS. I will have to 
make up my mind whether the pool sizes we are looking at (typically we will 
need 150-200 TB) are really manageable under the current circumstances once 
zfs scrub enters the picture. From what I have learned here, it rather looks 
as if there will be an extra challenge, if not a problem, for the system 
integrator. That's unfortunate.

Regards,

Tonmaus
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to manage scrub priority or defer scrub?

2010-03-16 Thread Khyron
In following this discussion, I get the feeling that you and Richard are
somewhat talking past each other.  He asked you about the hardware you are
currently running on, whereas you seem to be interested in a model for the
impact of scrubbing on I/O throughput that you can apply to some
not-yet-acquired hardware.

It should be clear by now that the model you are looking for does not exist
given how new ZFS is, and Richard has been focusing his comments on your
existing (home) configuration since that is what you provided specs for.

Since you haven't provided specs for this larger system you may be purchasing
in the future, I don't think anyone can give you specific guidance on what the
I/O impact of scrubs on your configuration will be.  Richard seems to be
giving more design guidelines and hints, and just generally good to know
information to keep in mind while designing your solution.  Of course, he's
been giving it in the context of your 11 disk wide RAIDZ2 and not the 200 TB
monster you only described in the last e-mail.

Stepping back, it may be worthwhile to examine the advice Richard has given,
in the context of the larger configuration.

First, you won't be using commodity hardware for your enterprise-class
storage system, will you?

Second, I would imagine that as a matter of practice, most people schedule
their pools to scrub as far away from prime hours as possible.  Maybe it's
possible, and maybe it's not.  The question to the larger community should be
who is running a 100+ TB pool and how have you configured your scrubs?  Or
even for those running 100+ TB pools, do your scrubs interfere with your
production traffic/throughput?  If so, how do you compensate for this?

Third, as for ZFS scrub prioritization, Richard answered your question about
that.  He said it is low priority and can be tuned lower.  However, he was
answering within the context of an 11 disk RAIDZ2 with slow disks.  His exact
words were:

This could be tuned lower, but your storage is slow and *any* I/O activity
will be noticed.

If you had asked about a 200 TB enterprise-class pool, he may have had a
different response.  I don't know if ZFS will make different decisions
regarding I/O priority on commodity hardware as opposed to enterprise
hardware, but I imagine it does *not*.  If I'm mistaken, someone should
correct me.  Richard also said "In b133, the priority scheduler will work
better than on older releases."  That may not be an issue since you haven't
acquired your hardware YET, but again, Richard didn't know that you were
talking about a 200 TB behemoth because you never said that.

Fourth, Richard mentioned a wide RAIDZ2 set.  Hopefully, if nothing else,
we've seen that designing larger ZFS storage systems with pools composed of
smaller top level VDEVs works better, and preferably mirrored top level VDEVs
in the case of lots of small, random reads.  You didn't indicate the profile
of the data to be stored on your system, so no one can realistically speak to
that.  I think the general guidance is sound.  Multiple top level VDEVs,
preferably mirrors.  If you're creating RAIDZ2 top level VDEVs, then they
should be smaller (narrower) in terms of the number of disks in the set.  11
would be too many, based on what I have seen and heard on this list cross
referenced with the (little) information you have provided.

RAIDZ2 would appear to require more CPU power than RAIDZ, based on the report
you gave, and thus may have less negative impact on the performance of your
storage system.  I'll cop to that.  However, you never mentioned how your 200
TB behemoth system will be used, besides an off-hand remark about CIFS.  Will
it be serving CIFS?  NFS?  Raw ZVOLs over iSCSI?  You never mentioned any of
that.  Asking about CIFS if you're not going to serve CIFS doesn't make much
sense.  That would appear to be another question for the ZFS gurus here -- WHY
does RAIDZ2 cause so much negative performance impact on your CIFS service
while RAIDZ does not?  Your experience is that a scrub of a RAIDZ2 maxed CPU
while a RAIDZ scrub did not, right?

Fifth, the pool scrub should probably be as far away from peak usage times as
possible.  That may or may not be feasible, but I don't think anyone would
disagree with that advice.  Again, I know there are people running large pools
who perform scrubs.  It might be worthwhile to directly ask what these people
have experienced in terms of scrub performance on RAIDZ vs. RAIDZ2, or in
general.

Finally, I would also note that Richard has been very responsive to your
questions (in his own way) but you increasingly seem to be hostile and even
disrespectful toward him.  (I've noticed this in more than one of your
e-mails; they sound progressively more self-centered and selfish.  That's just
my opinion.)  If this is a community, that's not a helpful way to treat a
senior member of the community, even if he's not answering the question you
want answered.

Keep in mind that asking the wrong questions is the 

Re: [zfs-discuss] How to manage scrub priority or defer scrub?

2010-03-16 Thread Bruno Sousa
Well... I can only say: well said.

BTW, I have a raidz2 pool with 9 vdevs of 4 disks each (enterprise SATA
disks), and a scrub of the pool takes between 12 and 39 hours, depending on
the workload of the server.
So far that's acceptable, but every case is different, I think...


Bruno

On 16-3-2010 14:04, Khyron wrote:
 In following this discussion, I get the feeling that you and Richard
 are somewhat
 talking past each other.  He asked you about the hardware you are
 currently running
 on, whereas you seem to be interested in a model for the impact of
 scrubbing on
 I/O throughput that you can apply to some not-yet-acquired hardware. 

 It should be clear by now that the model you are looking for does not
 exist given
 how new ZFS is, and Richard has been focusing his comments on your
 existing (home)
 configuration since that is what you provided specs for.

 Since you haven't provided specs for this larger system you may be
 purchasing in the
 future, I don't think anyone can give you specific guidance on what
 the I/O impact of
 scrubs on your configuration will be.  Richard seems to be giving more
 design guidelines
 and hints, and just generally good to know information to keep in mind
 while designing
 your solution.  Of course, he's been giving it in the context of your
 11 disk wide
 RAIDZ2 and not the 200 TB monster you only described in the last e-mail.

 Stepping back, it may be worthwhile to examine the advice Richard has
 given, in the
 context of the larger configuration. 

 First, you won't be using commodity hardware for your enterprise-class
 storage system,
 will you?

 Second, I would imagine that as a matter of practice, most people
 schedule their pools
 to scrub as far away from prime hours as possible.  Maybe it's
 possible, and maybe it's
 not.  The question to the larger community should be who is running a
 100+ TB pool
 and how have you configured your scrubs?  Or even for those running
 100+ TB pools,
 do your scrubs interfere with your production traffic/throughput?  If
 so, how do you
 compensate for this?

 Third, as for ZFS scrub prioritization, Richard answered your question
 about that.  He
 said it is low priority and can be tuned lower.  However, he was
 answering within the
 context of an 11 disk RAIDZ2 with slow disks  His exact words were:

 This could be tuned lower, but your storage is slow and *any* I/O
 activity will be
 noticed.

 If you had asked about a 200 TB enterprise-class pool, he may have had
 a different
 response.  I don't know if ZFS will make different decisisons
 regarding I/O priority on
 commodity hardware as opposed to enterprise hardware, but I imagine it
 does *not*. 
 If I'm mistaken, someone should correct me.  Richard also said In
 b133, the priority
 scheduler will work better than on older releases.  That may not be
 an issue since
 you haven't acquired your hardware YET, but again, Richard didn't know
 that you
 were talking about a 200 TB behemoth because you never said that.

 Fourth, Richard mentioned a wide RAIDZ2 set.  Hopefully, if nothing
 else, we've
 seen that designing larger ZFS storage systems with pools composed of
 smaller top
 level VDEVs works better, and preferably mirrored top level VDEVs in
 the case of lots
 of small, random reads.  You didn't indicate the profile of the data
 to be stored on
 your system, so no one can realistically speak to that.  I think the
 general guidance
 is sound.  Multiple top level VDEVs, preferably mirrors.  If you're
 creating RAIDZ2
 top level VDEVs, then they should be smaller (narrower) in terms of
 the number of
 disks in the set.  11 would be too many, based on what I have seen and
 heard on
 this list cross referenced with the (little) information you have
 provided.

 RAIDZ2 would appear to require more CPU power that RAIDZ, based on the
 report
 you gave and thus may have less negative impact on the performance of
 your storage
 system.  I'll cop to that.  However, you never mentioned how your 200
 TB behemoth
 system will be used, besides an off-hand remark about CIFS.  Will it
 be serving CIFS?
 NFS?  Raw ZVOLs over iSCSI?  You never mentioned any of that.  Asking
 about CIFS
 if you're not going to serve CIFS doesn't make much sense.  That would
 appear to
 be another question for the ZFS gurus here -- WHY does RAIDZ2 cause so
 much
 negative performance impact on your CIFS service while RAIDZ does
 not?  Your
 experience is that a scrub of a RAIDZ2 maxed CPU while a RAIDZ scrub
 did not, right?

 Fifth, the pool scrub should probably be as far away from peak usage
 times as possible.
 That may or may not be feasible, but I don't think anyone would
 disagree with that
 advice.  Again, I know there are people running large pools who
 perform scrubs.  It
 might be worthwhile to directly ask what these people have experienced
 in terms of
 scrub performance on RAIDZ vs. RAIDZ2, or in general.

 Finally, I would also note that Richard has been very responsive to
 your questions (in
 his own way) 

Re: [zfs-discuss] How to manage scrub priority or defer scrub?

2010-03-16 Thread Bob Friesenhahn

On Tue, 16 Mar 2010, Tonmaus wrote:


This wasn't mirror vs. raidz but raidz1 vs. raidz2, whereas the 
latter maxes out CPU and the former maxes out physical disc I/O. 
Concurrent payload degradation isn't that extreme on raidz1 pools, 
as it seems. Hence, the CPU theory that you still seem to be 
reluctant to follow.


If CPU is maxed out then that usually indicates some severe problem 
with choice of hardware or a misbehaving device driver.  Modern 
systems have an abundance of CPU.


I don't think that the size of the pool is particularly significant 
since zfs scrubs in a particular order and scrub throughput is dictated 
by access times and bandwidth.  In fact there should be less impact 
from scrub in a larger pool (even though scrub may take much longer) 
since the larger pool will have more vdevs.  The vdev design is most 
important.  Too many drives per vdev leads to poor performance, 
particularly if the drives are huge sluggish ones.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to manage scrub priority or defer scrub?

2010-03-16 Thread thomas
Even if it might not be the best technical solution, I think what a lot of 
people are looking for when this comes up is a knob they can use to say "I 
only want X IOPS per vdev (in addition to low prioritization) to be used while 
scrubbing." Doing so probably helps them feel more at ease that they have some 
excess capacity on CPU and vdev if production traffic should come along.

That's probably a false sense of moderating resource usage, when the current 
"full speed, but lowest prioritization" is just as good and would finish 
quicker... but it gives them peace of mind?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to manage scrub priority or defer scrub?

2010-03-16 Thread David Dyer-Bennet

On Tue, March 16, 2010 11:53, thomas wrote:
 Even if it might not be the best technical solution, I think what a lot of
 people are looking for when this comes up is a knob they can use to say I
 only want X IOPS per vdev (in addition to low prioritization) to be used
 while scrubbing. Doing so probably helps them feel more at ease that they
 have some excess capacity on cpu and vdev if production traffic should
 come along.

 That's probably a false sense of moderating resource usage when the
 current full speed, but lowest prioritization is just as good and would
 finish quicker.. but, it gives them peace of mind?

I may have been reading too quickly, but I have the impression that at
least some of the people not happy with the current prioritization were
reporting severe impacts to non-scrub performance when a scrub was in
progress.  If that's the case, then they have a real problem, they're not
just looking for more peace of mind in a hypothetical situation.
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to manage scrub priority or defer scrub?

2010-03-16 Thread Khyron
The issue as presented by Tonmaus was that a scrub was negatively impacting
his RAIDZ2 CIFS performance, but he didn't see the same impact with RAIDZ.
I'm not going to say whether that is a problem one way or the other; it may
be expected behavior under the circumstances.  That's for ZFS developers to
speak on.  (This was one of many issues Tonmaus mentioned.)

However, what was lost was the context.  Tonmaus reported this behavior on
a commodity server using slow disks in an 11 disk RAIDZ2 set.  However, he
*really* wants to know if this will be an issue on a 100+ TB pool.  So his
examples were given on a pool that was possibly 5% of the size of the pool
that he actually wants to deploy.  He never said any of this in the original
e-mail, so Richard assumed the context to be the smaller system.  That's why I
pointed out all of the discrepancies and questions he could/should have asked
which might have yielded more useful answers.

There's quite a difference between the 11 disk RAIDZ2 set and a 100+ TB ZFS
pool, especially when the use case, VDEV layout and other design aspects of
the 100+ TB pool have not been described.

On Tue, Mar 16, 2010 at 13:41, David Dyer-Bennet d...@dd-b.net wrote:


 On Tue, March 16, 2010 11:53, thomas wrote:
  Even if it might not be the best technical solution, I think what a lot
 of
  people are looking for when this comes up is a knob they can use to say
 I
  only want X IOPS per vdev (in addition to low prioritization) to be used
  while scrubbing. Doing so probably helps them feel more at ease that they
  have some excess capacity on cpu and vdev if production traffic should
  come along.
 
  That's probably a false sense of moderating resource usage when the
  current full speed, but lowest prioritization is just as good and would
  finish quicker.. but, it gives them peace of mind?

 I may have been reading too quickly, but I have the impression that at
 least some of the people not happy with the current prioritization were
 reporting severe impacts to non-scrub performance when a scrub was in
 progress.  If that's the case, then they have a real problem, they're not
 just looking for more peace of mind in a hypothetical situation.
 --
 David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
 Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
 Photos: http://dd-b.net/photography/gallery/
 Dragaera: http://dragaera.info

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




-- 
You can choose your friends, you can choose the deals. - Equity Private

If Linux is faster, it's a Solaris bug. - Phil Harman

Blog - http://whatderass.blogspot.com/
Twitter - @khyron4eva
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to manage scrub priority or defer scrub?

2010-03-16 Thread Tonmaus
Hello,

 In following this discussion, I get the feeling that
 you and Richard are somewhat talking past each
 other.

Talking past each other is a problem I have noted and remarked on earlier. I 
have to admit I got frustrated with the discussion narrowing down to a certain 
perspective that was quite the opposite of my own observations and of what I 
had initially described. It may be that I have been harsher than I should have 
been. Please accept my apology.
I was trying from the outset to obtain a perspective on the matter that is 
independent of an actual configuration. I firmly believe that the scrub 
function is more meaningful if it can be applied in a variety of 
implementations.
I think, however, that the insight that there seem to be no specific scrub 
management functions is transferable from a commodity implementation to an 
enterprise configuration.

Regards,

Tonmaus
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to manage scrub priority or defer scrub?

2010-03-16 Thread Tonmaus
 If CPU is maxed out then that usually indicates some
 severe problem 
 with choice of hardware or a misbehaving device
 driver.  Modern 
 systems have an abundance of CPU.

AFAICS the CPU loads are only high while scrubbing a double parity pool. I have 
no indication of a technical misbehaviour with the exception of dismal 
concurrent performance.

What I cannot get past is the notion that even if I *had* a storage 
configuration with 20 times more I/O capacity, it would still max out any CPU 
I could buy, even one better than the single L5410 I am currently running. I 
have seen CPU performance being a pain point on every software-based array I 
have used so far: from SOHO NAS boxes (the usual Thecus stuff) to NetApp 3200 
filers, all showed a nominal performance drop once parity configurations were 
employed.
Performance of the L5410 is abundant for the typical operation of my system, 
by the way. It can easily saturate the dual 1000 Mbit NICs for iSCSI and CIFS 
services. I am slightly reluctant to buy a second L5410 just to provide more 
headroom during maintenance operations, as the device would otherwise be idle, 
consuming power.

Regards,

Tonmaus
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to manage scrub priority or defer scrub?

2010-03-16 Thread Bob Friesenhahn

On Tue, 16 Mar 2010, Tonmaus wrote:

AFAICS the CPU loads are only high while scrubbing a double parity 
pool. I have no indication of a technical misbehaviour with the 
exception of dismal concurrent performance.


This seems pretty weird to me.  I have not heard anyone else complain 
about this sort of problem before in the several years I have been on 
this list.  Are you sure that you didn't also enable something which 
does consume lots of CPU such as enabling some sort of compression, 
sha256 checksums, or deduplication?


running from currently. I am seeing CPU performance being a pain 
point on any software based array I have used so far. From SOHO 
NAS boxes (the usual Thecus stuff) to NetApp 3200 filers, all 
exposed a nominal performance drop once parity configurations were 
employed.


The main concern that one should have is I/O bandwidth rather than CPU 
consumption since software based RAID must handle the work using the 
system's CPU rather than expecting it to be done by some other CPU. 
There are more I/Os and (in the case of mirroring) more data 
transferred.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to manage scrub priority or defer scrub?

2010-03-16 Thread Tonmaus
 Are you sure that you didn't also enable
 something which 
 does consume lots of CPU such as enabling some sort
 of compression, 
 sha256 checksums, or deduplication?

None of them is active on that pool or in any existing file system. Maybe the 
issue is particular to RAIDZ2, which is comparatively recent. On that 
occasion: does anybody know whether ZFS reads all parities during a scrub? 
Wouldn't it be sufficient for stale-corruption detection to read only one 
parity set unless an error occurs there?

 The main concern that one should have is I/O
 bandwidth rather than CPU 
 consumption since software based RAID must handle
 the work using the 
 system's CPU rather than expecting it to be done by
 some other CPU. 
 There are more I/Os and (in the case of mirroring)
 more data 
 transferred.

What I am trying to say is that the CPU may become the bottleneck for I/O in 
the case of parity-secured stripe sets. Mirrors and simple stripe sets have 
almost zero impact on CPU - at least those are my observations so far. 
Moreover, x86 processors are not optimized for that kind of work as much as, 
e.g., an Areca controller with a dedicated XOR chip is in its targeted field.

Regards,

Tonmaus
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to manage scrub priority or defer scrub?

2010-03-15 Thread Tonmaus
Hello again,

I am still concerned whether my points are being well taken.

 If you are concerned that a
 single 200TB pool would take a long
 time to scrub, then use more pools and scrub in
 parallel.

The main concern is not scrub time. Scrub time could be weeks if scrub just 
would behave. You may imagine that there are applications where segmentation is 
a pain point, too.

  The scrub will queue no more than 10 I/Os at one time to a device, so
  devices which can handle concurrent I/O are not consumed entirely by
  scrub I/O. This could be tuned lower, but your storage is slow and
  *any* I/O activity will be noticed.

There are a couple of things I maybe don't understand, then.

- zpool iostat is reporting more than 1k operations while scrubbing
- throughput is as high as can be, until the CPU maxes out
- the nominal I/O capacity of a single device is still around 90 IOPS; how can 
10 outstanding I/Os already bring down payload?
- scrubbing the same pool, configured as raidz1, didn't max out the CPU, which 
is no surprise (haha, slow storage...); the notable part is that it didn't 
slow down payload that much either.
- scrub is obviously fine with data being added or deleted during a pass. So 
it could be possible to pause and resume a pass, couldn't it?

My conclusion from these observations is that not only disk speed counts here; 
other bottlenecks may strike as well. Solving the issue with the wallet is one 
way; solving it by configuring parameters is another. So, is there a lever for 
scrub I/O priority, or not? Is there a possibility to pause a scrub pass and 
resume it later?

Regards,

Tonmaus
-- 
This message posted from opensolaris.org


Re: [zfs-discuss] How to manage scrub priority or defer scrub?

2010-03-15 Thread Richard Elling
On Mar 14, 2010, at 11:25 PM, Tonmaus wrote:
 Hello again,
 
 I am still not sure that my points are being well taken.
 
 If you are concerned that a single 200TB pool would take a long time to
 scrub, then use more pools and scrub in parallel.
 
 The main concern is not scrub time. Scrub time could be weeks, if only scrub
 would behave itself. You may imagine that there are applications where
 segmentation is a pain point, too.

I agree.

 The scrub will queue no more than 10 I/Os at one time to a device, so devices
 which can handle concurrent I/O are not consumed entirely by scrub I/O. This
 could be tuned lower, but your storage is slow and *any* I/O activity will be
 noticed.
 
 There are a couple of things I maybe don't understand, then.
 
 - zpool iostat is reporting more than 1k operations while scrubbing

ok

 - throughput is as high as it can be, until the CPU maxes out

Would you rather your CPU be idle? What use is an idle CPU, besides wasting
energy? :-)

 - the nominal I/O capacity of a single device is still around 90 IOPS, so how
 can 10 queued I/Os already bring down payload?

90 IOPS is approximately the worst-case rate for a 7,200 rpm disk under a
small, random workload. ZFS tends to write sequentially, so random writes tend
to become sequential writes on ZFS. So it is quite common to see scrub
workloads with more than 90 IOPS.
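
For reference, the ~90 IOPS ballpark falls straight out of typical 7,200 rpm
drive mechanics (rounded, illustrative figures):

  average rotational latency = 0.5 * 60 s / 7200 rpm ~= 4.2 ms
  average seek time          ~= 8 ms
  per-I/O service time       ~= 12 ms  ->  1000 / 12 ~= 80-90 random IOPS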

 - scrubbing the same pool configured as raidz1 didn't max out the CPU, which
 is no surprise (haha, slow storage...); the notable part is that it didn't
 slow down payload that much either

raidz creates more, smaller writes than a mirror or simple stripe. If the
disks are slow, then the IOPS will be lower and the scrub takes longer, but the
I/O scheduler can manage the queue better (because the disks are slower).

 - scrub obviously copes fine with data being added or deleted during a pass,
 so it should be possible to pause and resume a pass, shouldn't it?

You can start or stop scrubs; there is no resume directive. There are several
bugs/RFEs along these lines, something like:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6743992
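
The controls that do exist today boil down to the following ('tank' is a
placeholder pool name):

  # zpool scrub tank      # start; the scan runs in the background
  # zpool status tank     # progress: percent done and estimated time to go
  # zpool scrub -s tank   # stop (cancel); the next scrub starts over from the beginning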

 My conclusion from these observations is that not only disk speed counts
 here; other bottlenecks may strike as well. Solving the issue with the wallet
 is one way, solving it by configuring parameters is another. So, is there a
 lever for scrub I/O priority or not? And is there a way to pause a scrub pass
 and resume it later?

Scrub is already the lowest priority.  Would you like it to be lower?
I think the issue is more related to which queue is being managed by
the ZFS priority scheduler rather than the lack of scheduling priority.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Atlanta, March 16-18, 2010 http://nexenta-atlanta.eventbrite.com 
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 






Re: [zfs-discuss] How to manage scrub priority or defer scrub?

2010-03-14 Thread Tonmaus
Hi Richard,

these are 
- 11x WD1002fbys (7200rpm SATA drives) in 1 raidz2 group
- 4 GB RAM
- 1 CPU L5410
- snv_133 (where the current array was created as well)

Regards,

Tonmaus
-- 
This message posted from opensolaris.org


Re: [zfs-discuss] How to manage scrub priority or defer scrub?

2010-03-14 Thread Richard Elling
On Mar 14, 2010, at 12:16 AM, Tonmaus wrote:

 Hi Richard,
 
 these are 
 - 11x WD1002fbys (7200rpm SATA drives) in 1 raidz2 group
 - 4 GB RAM
 - 1 CPU L5410
 - snv_133 (where the current array was created as well)

These are slow drives and the configuration will have poor random
read performance. Do not expect blazing fast scrubs.

In b133, the priority scheduler will work better than on older 
releases. But it may not be enough to overcome a very wide raidz2
set.
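
If you can spare one extra disk, a pool built from two narrower raidz2 vdevs
will scrub and handle random reads noticeably better than one 11-wide group.
As a sketch only, with placeholder device names:

  # zpool create tank \
      raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
      raidz2 c1t6d0 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0

You give up more capacity to parity (four disks instead of two), but each vdev
then has its own IOPS budget.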
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Atlanta, March 16-18, 2010 http://nexenta-atlanta.eventbrite.com 
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 






Re: [zfs-discuss] How to manage scrub priority or defer scrub?

2010-03-14 Thread Tonmaus
Hi Richard,

thanks for the answer. I think I am aware of the properties of my
configuration and how it will scale. Let me stress that this is not the point
of the discussion. The question should rather be whether scrubbing can co-exist
with payload I/O, or whether we are thrown back to scrubbing in the
after-hours.
So, do I have to conclude that ZFS is not able to make good decisions about
load prioritisation on commodity hardware, and that there are no further
options available to tweak the scrub's load impact, or are there other options?

I am thinking about managing pools with a hundred times the capacity of mine
(currently there are 3.7 TB on disk, and it takes 2.5 h to scrub them on the
double-parity pool) that would be practically un-scrub-able. (Yes, enterprise
HW is faster, but enterprise service windows are much narrower as well... you
can't move around or take offline 200 TB of live data for days just because you
need to scrub the disks, can you?)

The only idea I could come up with myself is to exchange individual drives in
a round-robin fashion all the time and rely on resilvering instead of full
scrubs. But on second thought I don't like that idea anymore.

Regards,

Tonmaus
-- 
This message posted from opensolaris.org


Re: [zfs-discuss] How to manage scrub priority or defer scrub?

2010-03-14 Thread Richard Elling
On Mar 14, 2010, at 11:45 AM, Tonmaus wrote:
 Hi Richard,
 
 thanks for the answer. I think I am aware of the properties of my
 configuration and how it will scale. Let me stress that this is not the point
 of the discussion. The question should rather be whether scrubbing can
 co-exist with payload I/O, or whether we are thrown back to scrubbing in the
 after-hours.
 So, do I have to conclude that ZFS is not able to make good decisions about
 load prioritisation on commodity hardware, and that there are no further
 options available to tweak the scrub's load impact, or are there other options?

ZFS prioritizes I/O, and scrub has the lowest priority. The scrub will queue
no more than 10 I/Os at one time to a device, so devices which can handle
concurrent I/O are not consumed entirely by scrub I/O. This could be tuned
lower, but your storage is slow and *any* I/O activity will be noticed.
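
For completeness: that kind of tuning means poking unsupported kernel
variables whose names differ between builds (zfs_scrub_limit in older bits,
zfs_scrub_delay and friends after the scan-code rework), so treat the
following only as a sketch of the mechanism, assuming the variable exists in
your build, and not as a recommendation:

  # echo zfs_scrub_limit/W0t5 | mdb -kw    # live change on a running kernel
  (or persistently, in /etc/system:  set zfs:zfs_scrub_limit = 5  and reboot)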

 I am thinking about managing pools with a hundred times the capacity of mine
 (currently there are 3.7 TB on disk, and it takes 2.5 h to scrub them on the
 double-parity pool) that would be practically un-scrub-able. (Yes, enterprise
 HW is faster, but enterprise service windows are much narrower as well... you
 can't move around or take offline 200 TB of live data for days just because
 you need to scrub the disks, can you?)

I can't follow your logic here. Scrub is a low-priority process and should be
done at infrequent intervals. If you are concerned that a single 200TB pool
would take a long time to scrub, then use more pools and scrub in parallel.
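
Note that zpool scrub only initiates the scan and returns immediately, so
running scrubs on several pools in parallel is a one-liner (placeholder pool
names):

  # for p in tank1 tank2 tank3; do zpool scrub $p; done
  # zpool status    # each pool reports its own scrub progress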

 The only idea I could come up with myself is to exchange individual drives in
 a round-robin fashion all the time and rely on resilvering instead of full
 scrubs. But on second thought I don't like that idea anymore.

You are right, this would not be a good idea.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Atlanta, March 16-18, 2010 http://nexenta-atlanta.eventbrite.com 
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 



[zfs-discuss] How to manage scrub priority or defer scrub?

2010-03-13 Thread Tonmaus
Dear zfs fellows,

during a specific test I got the impression that scrub may have quite an
impact on other I/O. CIFS throughput drops from 100 MB/s to 7 MB/s while
scrubbing my main NAS. That is no surprise, as a scrub of my raidz2 pool maxes
out the CPU on that machine (1 Xeon L5410).
I am running scrubs during week-ends, so this is not a problem. I am asking
myself, however, what will happen on larger pools where a scrub pass will take
days to weeks. Obviously, ZFS file systems are much more scalable than CPU
power ever will be.
Hence, I see a requirement to manage scrub activity so that trade-offs can be
made to maintain availability and performance of the pool. Does anybody know
how this is done?
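
(For the record, my week-end scheduling is nothing fancier than cron, with
'tank' standing in for the pool name; since a cancelled pass starts over from
zero next time, this only helps as long as a full pass still fits into the
window:)

  0 22 * * 5  /usr/sbin/zpool scrub tank      # Friday 22:00 - start the scrub
  0  6 * * 1  /usr/sbin/zpool scrub -s tank   # Monday 06:00 - cancel whatever is still running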

Thanks in advance for any hints,

Regards,

Tonmaus
-- 
This message posted from opensolaris.org