[zfs-discuss] Zpool with data errors

2011-06-20 Thread Todd Urie
I have a zpool that shows the following from a zpool status -v 

brsnnfs0104 [/var/spool/cron/scripts]# zpool status -v ABC0101
  pool: ABC0101
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

NAME                              STATE     READ WRITE CKSUM
ABC0101                           ONLINE       0     0    10
  /dev/vx/dsk/ABC01dg/ABC0101_01  ONLINE       0     0     2
  /dev/vx/dsk/ABC01dg/ABC0101_02  ONLINE       0     0     8
  /dev/vx/dsk/ABC01dg/ABC0101_03  ONLINE       0     0    10

errors: Permanent errors have been detected in the following files:

/clients/ABC0101/rep/local/bfm/web/htdocs/tmp/rscache/717b52282ea059452621587173561360
/clients/ABC0101/rep/local/bfm/web/htdocs/tmp/rscache/6e6a9f37c4d13fdb3dcb8649272a2a49
/clients/ABC0101/rep/d0/prod1/reports/ReutersCMOLoad/ReutersCMOLoad.ABCntss001.20110620.141330.26496.ROLLBACK_FOR_UPDATE_COUPONS.html
/clients/ABC0101/rep/local/bfm/web/htdocs/tmp/G2_0.related_detail_loader.1308593666.54643.n5cpoli3355.data
/clients/ABC0101/rep/d0/prod1/reports/gp_reports/ALLMNG/20110429/F_OLPO82_A.gp.ABCIM_GA.nlaf.xml.gz
/clients/ABC0101/rep/d0/prod1/reports/gp_reports/ALLMNG/20110429/UNVLXCIAFI.gp.ABCIM_GA.nlaf.xml.gz
/clients/ABC0101/rep/d0/prod1/reports/gp_reports/ALLMNG/20110429/UNIVLEXCIA.gp.BARCRATING_ABC.nlaf.xml.gz

I think that a scrub at least has the possibility to clear this up.  A quick
search suggests that others have had some good experience with using scrub
in similar circumstances.  I was wondering if anyone could share some of
their experiences, good and bad, so that I can assess the risk and
probability of success with this approach.  Also, any other ideas would
certainly be appreciated.
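
For reference, a minimal command sequence for that approach, using the pool
name from the status output above.  Note that the pool as shown appears to
have no ZFS-level redundancy (three plain vdevs), so a scrub can detect and
flag bad blocks but cannot repair them; the error list only goes away once
the affected files are restored or removed and the counters are cleared:

  zpool scrub ABC0101        # re-read and checksum every allocated block
  zpool status -v ABC0101    # watch progress, then re-check the damaged-file list
  # after restoring or deleting the files listed above:
  zpool clear ABC0101        # reset the pool's error counters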


-RTU
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)

2011-06-20 Thread Daniel Carosone
On Sun, Jun 19, 2011 at 08:03:25AM -0700, Richard Elling wrote:
> Yes. I've been looking at what the value of zfs_vdev_max_pending should be.
> The old value was 35 (a guess, but a really bad guess) and the new value is
> 10 (another guess, but a better guess).  I observe that data from a fast,
> modern HDD, for 1-10 threads (outstanding I/Os) the IOPS ranges from 309 to
> 333 IOPS.  But as we add threads, the average response time increases from
> 2.3ms to 137ms.

Interesting.  What happens to total throughput, since that's the
expected tradeoff against latency here?  I might guess that in your
tests with a constant io size, it's linear with IOPS - but I wonder if
that remains so for larger IO or with mixed sizes?

> Since the whole idea is to get lower response time, and we know disks are not 
> simple queues so there is no direct IOPS to response time relationship,
> maybe it is simply better to limit the number of outstanding I/Os.

I also wonder if we're seeing a form of "bufferbloat" here in these
latencies.

As I wrote in another post yesterday, remember that you're not
counting actual outstanding IO's here, because the write IO's are
being acknowledged immediately and tracked internally. The disk may
therefore be getting itself into a state where either the buffer/queue
is effectively full, or the number of requests it is tracking
internally becomes inefficient (as well as the head-thrashing). 

Even before you get to that state and writes start slowing down too,
your averages are skewed by write cache. All the writes are fast,
while a longer queue exposes reads to contention with each other, as
well as to a much wider window of writes.  Can you look at the average
response time for just the reads, even amongst a mixed r/w workload?
Perhaps some statistic other than the average, too.
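
One way to get at the read-only numbers (a sketch, assuming the DTrace io
provider as on Solaris/illumos) is to quantize just the reads; quantize()
also gives a full distribution rather than a single average:

  # per-read service time, ignoring writes entirely
  dtrace -n '
    io:::start /args[0]->b_flags & B_READ/ { ts[arg0] = timestamp; }
    io:::done  /ts[arg0]/ {
        @["read latency (ns)"] = quantize(timestamp - ts[arg0]);
        ts[arg0] = 0;
    }'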

Can you repeat the tests with write-cache disabled, so you're more
accurately exposing the controller's actual workload and backlog?
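
For the write-cache part, the toggle lives in format's expert mode; this is
from memory, so treat the exact menu names as an assumption and check them
against your release:

  format -e          # select the disk from the menu, then:
  #   cache
  #   write_cache
  #   display        # show the current setting
  #   disable        # turn the volatile write cache off for the test run
  #   enable         # restore it afterwards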

I hypothesise that this will avoid those latencies getting so
ridiculously out of control, and potentially also show better
(relative) results for higher concurrency counts.  Alternately, it
will show that your disk firmware really is horrible at managing
concurrency even for small values :)

Whether it shows better absolute results than a shorter queue + write
cache is an entirely different question.  The write cache will
certainly make things faster in the common case, which is another way
of saying that your lower-bound average latencies are artificially low
and making the degradation look worse.

> > This comment seems to indicate that the drive queues up a whole bunch of
> > requests, and since the queue is large, each individual response time has
> > become large.  It's not that physical actual performance has degraded with
> > the cache enabled, it's that the queue has become long.  For async writes,
> > you don't really care how long the queue is, but if you have a mixture of
> > async writes and occasional sync writes...  Then the queue gets long, and
> > when you sync, the sync operation will take a long time to complete.  You
> > might actually benefit by disabling the disk cache.
> > 
> > Richard, have I gotten the gist of what you're saying?
> 
> I haven't formed an opinion yet, but I'm inclined towards wanting overall
> better latency.

And, in particular, better latency for specific (read) requests that zfs
prioritises; these are often the ones that contribute most to a system
feeling unresponsive.  If this prioritisation is lost once passed to
the disk, both because the disk doesn't have a priority mechanism and
because it's contending with the deferred cost of previous writes,
then you'll get better latency for the requests you care most about
with a shorter queue.

--
Dan.






Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)

2011-06-20 Thread Garrett D'Amore
For SSD we have code in illumos that disables disksort.  Ultimately, we believe 
that the cost of disksort is in the noise for performance.
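
For anyone who wants to experiment on other device types as well, my
understanding (worth verifying against sd(7D) on your build) is that the same
behaviour can be set per device through sd-config-list in /kernel/drv/sd.conf;
the vendor/product string below is only an example:

  # /kernel/drv/sd.conf
  sd-config-list = "ATA     INTEL SSDSA2M160", "disksort:false, cache-nonvolatile:true";
  # reload with 'update_drv -vf sd' or reboot for the change to take effect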

  -- Garrett D'Amore

On Jun 20, 2011, at 8:38 AM, "Andrew Gabriel"  wrote:

> Richard Elling wrote:
>> On Jun 19, 2011, at 6:04 AM, Andrew Gabriel wrote:
>>  
>>> Richard Elling wrote:
>>>
>>>> Actually, all of the data I've gathered recently shows that the number of
>>>> IOPS does not significantly increase for HDDs running random workloads.
>>>> However the response time does :-( My data is leading me to want to
>>>> restrict the queue depth to 1 or 2 for HDDs.
>>> Thinking out loud here, but if you can queue up enough random I/Os, the 
>>> embedded disk controller can probably do a good job reordering them into 
>>> less random elevator sweep pattern, and increase IOPs through reducing the 
>>> total seek time, which may be why IOPs does not drop as much as one might 
>>> imagine if you think of the heads doing random seeks (they aren't random 
>>> anymore). However, this requires that there's a reasonable queue of I/Os 
>>> for the controller to optimise, and processing that queue will necessarily 
>>> increase the average response time. If you run with a queue depth of 1 or 
>>> 2, the controller can't do this.
>>>
>> 
>> I agree. And disksort is in the mix, too.
>>  
> 
> Oh, I'd never looked at that.
> 
>>> This is something I played with ~30 years ago, when the OS disk driver was 
>>> responsible for the queuing and reordering disc transfers to reduce total 
>>> seek time, and disk controllers were dumb.
>>>
>> 
>> ...and disksort still survives... maybe we should kill it?
>>  
> 
> It looks like it's possibly slightly worse than the pathologically worst 
> response time case I described below...
> 
>>> There are lots of options and compromises, generally weighing reduction in 
>>> total seek time against longest response time. Best reduction in total seek 
>>> time comes from planning out your elevator sweep, and inserting newly 
>>> queued requests into the right position in the sweep ahead. That also gives 
>>> the potentially worse response time, as you may have one transfer queued 
>>> for the far end of the disk, whilst you keep getting new transfers queued 
>>> for the track just in front of you, and you might end up reading or writing 
>>> the whole disk before you get to do that transfer which is queued for the 
>>> far end. If you can get a big enough queue, you can modify the insertion 
>>> algorithm to never insert into the current sweep, so you are effectively 
>>> planning two sweeps ahead. Then the worse response time becomes the time to 
>>> process one queue full, rather than the time to read or write the whole 
>>> disk. Lots of other tricks too (e.g. insertion into sweeps taking into 
>>> account priority, such as if the I/O is a synchronous or asynchronous, and
>>> age of existing queue entries). I had much fun playing with this at the time.
>>>
>> 
>> The other wrinkle for ZFS is that the priority scheduler can't re-order I/Os 
>> sent to the disk.
>>  
> 
> Does that also go through disksort? Disksort doesn't seem to have any concept 
> of priorities (but I haven't looked in detail where it plugs in to the whole 
> framework).
> 
>> So it might make better sense for ZFS to keep the disk queue depth small for 
>> HDDs.
>> -- richard
>>  
> 
> -- 
> Andrew Gabriel
> 


Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)

2011-06-20 Thread Andrew Gabriel

Richard Elling wrote:
> On Jun 19, 2011, at 6:04 AM, Andrew Gabriel wrote:
>> Richard Elling wrote:
>>> Actually, all of the data I've gathered recently shows that the number of
>>> IOPS does not significantly increase for HDDs running random workloads.
>>> However the response time does :-( My data is leading me to want to
>>> restrict the queue depth to 1 or 2 for HDDs.
>> Thinking out loud here, but if you can queue up enough random I/Os, the
>> embedded disk controller can probably do a good job reordering them into less
>> random elevator sweep pattern, and increase IOPs through reducing the total
>> seek time, which may be why IOPs does not drop as much as one might imagine if
>> you think of the heads doing random seeks (they aren't random anymore).
>> However, this requires that there's a reasonable queue of I/Os for the
>> controller to optimise, and processing that queue will necessarily increase
>> the average response time. If you run with a queue depth of 1 or 2, the
>> controller can't do this.
> I agree. And disksort is in the mix, too.

Oh, I'd never looked at that.

>> This is something I played with ~30 years ago, when the OS disk driver was
>> responsible for the queuing and reordering disc transfers to reduce total seek
>> time, and disk controllers were dumb.
> ...and disksort still survives... maybe we should kill it?

It looks like it's possibly slightly worse than the pathologically worst
response time case I described below...

>> There are lots of options and compromises, generally weighing reduction in
>> total seek time against longest response time. Best reduction in total seek
>> time comes from planning out your elevator sweep, and inserting newly queued
>> requests into the right position in the sweep ahead. That also gives the
>> potentially worse response time, as you may have one transfer queued for the
>> far end of the disk, whilst you keep getting new transfers queued for the
>> track just in front of you, and you might end up reading or writing the whole
>> disk before you get to do that transfer which is queued for the far end. If
>> you can get a big enough queue, you can modify the insertion algorithm to
>> never insert into the current sweep, so you are effectively planning two
>> sweeps ahead. Then the worse response time becomes the time to process one
>> queue full, rather than the time to read or write the whole disk. Lots of
>> other tricks too (e.g. insertion into sweeps taking into account priority,
>> such as if the I/O is a synchronous or asynchronous, and age of existing
>> queue entries). I had much fun playing with this at the time.
> The other wrinkle for ZFS is that the priority scheduler can't re-order I/Os
> sent to the disk.

Does that also go through disksort? Disksort doesn't seem to have any
concept of priorities (but I haven't looked in detail where it plugs in
to the whole framework).

> So it might make better sense for ZFS to keep the disk queue depth small for
> HDDs.
>  -- richard

--
Andrew Gabriel



Re: [zfs-discuss] Server with 4 drives, how to configure ZFS?

2011-06-20 Thread Richard Elling
On Jun 15, 2011, at 1:33 PM, Nomen Nescio wrote:

> Has there been any change to the server hardware with respect to number of
> drives since ZFS has come out? Many of the servers around still have an even
> number of drives (2, 4) etc. and it seems far from optimal from a ZFS
> standpoint. All you can do is make one or two mirrors, or a 3 way mirror and
> a spare, right? Wouldn't it make sense to ship with an odd number of drives
> so you could at least RAIDZ? Or stop making provision for anything except 1
> or two drives or no drives at all and require CD or netbooting and just
> expect everybody to be using NAS boxes? I am just a home server user, what
> do you guys who work on commercial accounts think? How are people using
> these servers?

I see 2 disks for boot and usually one or more 24-disk JBODs.  A few 12-disk
JBODs are still being sold, but I rarely see a single 12-disk JBOD. I'm also
seeing a few SBBs that have 16 disks and boot from SATA DOMs. Anyone else?

 -- richard



Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)

2011-06-20 Thread Richard Elling
On Jun 20, 2011, at 6:31 AM, Gary Mills wrote:

> On Sun, Jun 19, 2011 at 08:03:25AM -0700, Richard Elling wrote:
>> On Jun 19, 2011, at 6:28 AM, Edward Ned Harvey wrote:
>>>> From: Richard Elling [mailto:richard.ell...@gmail.com]
>>>> Sent: Saturday, June 18, 2011 7:47 PM
>>>>
>>>> Actually, all of the data I've gathered recently shows that the number of
>>>> IOPS does not significantly increase for HDDs running random workloads.
>>>> However the response time does :-(
>>> 
>>> Could you clarify what you mean by that?  
>> 
>> Yes. I've been looking at what the value of zfs_vdev_max_pending should be.
>> The old value was 35 (a guess, but a really bad guess) and the new value is
>> 10 (another guess, but a better guess).  I observe that data from a fast,
>> modern HDD, for 1-10 threads (outstanding I/Os) the IOPS ranges from 309 to
>> 333 IOPS.  But as we add threads, the average response time increases from
>> 2.3ms to 137ms.  Since the whole idea is to get lower response time, and we
>> know disks are not simple queues so there is no direct IOPS to response time
>> relationship, maybe it is simply better to limit the number of outstanding
>> I/Os.
> 
> How would this work for a storage device with an intelligent
> controller that provides only a few LUNs to the host, even though it
> contains a much larger number of disks?  I would expect the controller
> to be more efficient with a large number of outstanding IOs because it
> could distribute those IOs across the disks.  It would, of course,
> require a non-volatile cache to provide fast turnaround for writes.

Yes, I've set it as high as 4,000 for a fast storage array. One size does not
fit all.

For normal operations, with a separate log and HDDs in the pool, I'm leaning
towards 16.  Except when resilvering or scrubbing, in which case 1 is better
for HDDs.
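
For anyone experimenting along these lines, a sketch of the usual way to
change the tunable on Solaris/illumos builds of this vintage (assuming your
build still exposes zfs_vdev_max_pending):

  echo zfs_vdev_max_pending/W0t10 | mdb -kw    # set the live kernel value to 10
  # to make it persistent, add to /etc/system and reboot:
  #   set zfs:zfs_vdev_max_pending = 10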
 -- richard



Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)

2011-06-20 Thread Gary Mills
On Sun, Jun 19, 2011 at 08:03:25AM -0700, Richard Elling wrote:
> On Jun 19, 2011, at 6:28 AM, Edward Ned Harvey wrote:
> >> From: Richard Elling [mailto:richard.ell...@gmail.com]
> >> Sent: Saturday, June 18, 2011 7:47 PM
> >> 
> >> Actually, all of the data I've gathered recently shows that the number of
> >> IOPS does not significantly increase for HDDs running random workloads.
> >> However the response time does :-( 
> > 
> > Could you clarify what you mean by that?  
> 
> Yes. I've been looking at what the value of zfs_vdev_max_pending should be.
> The old value was 35 (a guess, but a really bad guess) and the new value is
> 10 (another guess, but a better guess).  I observe that data from a fast,
> modern HDD, for 1-10 threads (outstanding I/Os) the IOPS ranges from 309 to
> 333 IOPS.  But as we add threads, the average response time increases from
> 2.3ms to 137ms.  Since the whole idea is to get lower response time, and we
> know disks are not simple queues so there is no direct IOPS to response time
> relationship, maybe it is simply better to limit the number of outstanding
> I/Os.

How would this work for a storage device with an intelligent
controller that provides only a few LUNs to the host, even though it
contains a much larger number of disks?  I would expect the controller
to be more efficient with a large number of outstanding IOs because it
could distribute those IOs across the disks.  It would, of course,
require a non-volatile cache to provide fast turnaround for writes.

-- 
-Gary Mills--Unix Group--Computer and Network Services-


Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)

2011-06-20 Thread Edward Ned Harvey
> From: Richard Elling [mailto:richard.ell...@gmail.com]
> Sent: Sunday, June 19, 2011 11:03 AM
> 
> > I was planning, in the near
> > future, to go run iozone on some system with, and without the disk cache
> > enabled according to format -e.  If my hypothesis is right, it shouldn't
> > significantly affect the IOPS, which seems to be corroborated by your
> > message.
> 
> iozone is a file system benchmark, won't tell you much about IOPS at the
> disk level.
> Be aware of all of the caching that goes on there.

Yeah, that's the whole point.  The basis of my argument was:  Due to the
caching & buffering the system does in RAM, the disks' cache & buffer are
not relevant.  The conversation spawns from the premise of whole-disk versus
partition-based pools, possibly toggling the disk cache to off.  See the
subject of this email.   ;-)

Hopefully I'll have time to (dis)prove that conjecture this week.
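
One possible iozone invocation for that test, trying to keep the filesystem
cache out of the way (the dataset name and sizes are placeholders, -s should
exceed RAM, and the flag meanings are from iozone's own help output):

  zfs set primarycache=metadata tank/bench    # limit ARC data caching on the test dataset
  iozone -e -I -i 0 -i 1 -i 2 -r 8k -s 4g -f /tank/bench/iozone.tmp
  #  -e  include fsync/flush time in the results
  #  -I  request O_DIRECT where the filesystem honours it
  #  -i  tests: 0=write/rewrite, 1=read/reread, 2=random read/write
  #  -r  record size, -s file size, -f target file
  zfs inherit primarycache tank/bench         # restore the default afterwards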



Re: [zfs-discuss] zpool import crashs SX11 trying to recovering a corrupted zpool

2011-06-20 Thread Stefano Lassi
Thank you very much Jim for your suggestions.


Trying any kind of import (including importing read-only) on SX11 leads,
every time, to a system panic with the following error:

panic[cpu12]/thread=ff02de5e20c0: assertion failed: zap_count(os, object, 
&count) == 0, file: ../../common/fs/zfs/ddt_zap.c, line: 142

BTW, I saw that ddt_zap.c is part of the ZFS dedup implementation.



Trying with the latest OpenIndiana, I got:

cannot import 'rpool': pool is formatted using a newer ZFS version



You are right, there is no ZFS implementation newer than version 28 (apart
from v31 on SX11).
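
For the record, the mismatch can be confirmed directly; the device path below
is only an example:

  zpool upgrade -v                          # list the pool versions this build supports
  zdb -l /dev/dsk/c0t0d0s0 | grep version   # read the version stamped in the pool labels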


So now I have two possibilities:
- a complete scratch
- waiting for SX12!!!
(Probably I'm going to scratch.)


Anyway I learnt some lessons from this experience:
- trust more in ZFS than in HW RAID
- OI is better than SX11


"Long life to ZFS!!!", not very impressed by SX11 


Stefano
-- 
This message posted from opensolaris.org


[zfs-discuss] ZFS raid1 crash kernel panic

2011-06-20 Thread Aleksey
Hello,

I have a ZFS raid1 (mirror) made of two 1TB drives.
Recently my system (OS: FreeBSD 8.2-RELEASE) crashed with a kernel panic:


panic: solaris assert: ss->ss_end >= end (0x6a80753600 >= 0x6a80753800),
file:
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/space_map.c,
line: 174

GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:
panic: solaris assert: ss->ss_end >= end (0x6a80753600 >= 0x6a80753800),
file:
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/space_map.c,
line: 174
cpuid = 0
KDB: stack backtrace:
#0 0x805f4e0e at kdb_backtrace+0x5e
#1 0x805c2d07 at panic+0x187
#2 0x80ee36f6 at space_map_remove+0x296
#3 0x80ee3d9b at space_map_load+0x1bb
#4 0x80ed4c19 at metaslab_activate+0x89
#5 0x80ed586e at metaslab_alloc+0x6ae
#6 0x80f00299 at zio_dva_allocate+0x69
#7 0x80efe287 at zio_execute+0x77
#8 0x80e9e303 at taskq_run_safe+0x13
#9 0x805ffeb5 at taskqueue_run_locked+0x85
#10 0x8060004e at taskqueue_thread_loop+0x4e
#11 0x805994f8 at fork_exit+0x118
#12 0x8089547e at fork_trampoline+0xe
---

Reinstalling the OS and re-importing the zfs pool did not change anything.
smartctl -a says that everything is OK.
But with the Solaris 11 Express LiveCD, I was able to import the pool,
using:
zpool import -o readonly=on -f 
In r/w mode, it panics too.
Can anybody tell me what it is?
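
One conservative path while waiting for a better answer, since the read-only
import works under Solaris 11 Express: copy the data off before rebuilding the
pool (pool name and paths below are placeholders):

  zpool import -o readonly=on -R /mnt/rescue -f tank   # alternate root keeps mountpoints contained
  rsync -aH /mnt/rescue/tank/ /backup/tank/            # or tar/cpio to known-good storage
  zpool export tank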