The best place to start looking at disk-related performance problems is iostat.
Slow disks will show high service times.  There are many options, but I
usually use something like:
        iostat -zxcnPT d 1

Ignore the first report; it shows averages since boot. Look at the service
times (asvc_t): for good performance they should stay below 10 ms.
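
With -xn, the columns to watch are wsvc_t (time spent waiting in the
queue) and asvc_t (time spent being serviced by the device). A single
sample looks roughly like this (numbers made up for illustration):

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  120.0    5.0 7680.0  320.0  0.0  0.8    0.0    6.5   0  45 c2t2d0

A disk whose asvc_t sits at 100 ms or more while its neighbours stay
under 10 ms is your suspect.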
 -- richard


On Dec 27, 2009, at 4:52 PM, Morten-Christian Bernson wrote:

Lately the ZFS pool in my home server has degraded to the point where it barely works at all. Read speed is slower than what I can pull from the internet over my slow DSL line... This is compared to just a short while ago, when I could read from it at over 50 MB/sec over the network.

My setup:
Running latest Solaris 10: # uname -a
SunOS solssd01 5.10 Generic_142901-02 i86pc i386 i86pc

# zpool status DATA
 pool: DATA
state: ONLINE
config:
       NAME        STATE     READ WRITE CKSUM
       DATA        ONLINE       0     0     0
         raidz1    ONLINE       0     0     0
           c2t5d0  ONLINE       0     0     0
           c2t4d0  ONLINE       0     0     0
           c2t3d0  ONLINE       0     0     0
           c2t2d0  ONLINE       0     0     0
       spares
         c0t2d0    AVAIL
errors: No known data errors

# zfs list -r DATA
NAME                               USED  AVAIL  REFER  MOUNTPOINT
DATA                              3,78T   229G  3,78T  /DATA

All of the drives in this pool are 1.5 TB Western Digital Green drives. I am not seeing any error messages in /var/adm/messages, and "fmdump -eV" shows no errors... However, I am seeing some soft errors in "iostat -eEn":
---- errors ---
 s/w h/w trn tot device
 2   0   0   2 c0t0d0
 1   0   0   1 c1t0d0
 2   0   0   2 c2t1d0
151   0   0 151 c2t2d0
151   0   0 151 c2t3d0
153   0   0 153 c2t4d0
153   0   0 153 c2t5d0
 2   0   0   2 c0t1d0
 3   0   0   3 c0t2d0
 0   0   0   0 solssd01:vold(pid531)
c0t0d0           Soft Errors: 2 Hard Errors: 0 Transport Errors: 0
Vendor: Sun      Product: STK RAID INT     Revision: V1.0 Serial No:
Size: 31.87GB <31866224128 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 2 Predictive Failure Analysis: 0
c1t0d0           Soft Errors: 1 Hard Errors: 0 Transport Errors: 0
Vendor: _NEC     Product: DVD_RW ND-3500AG Revision: 2.16 Serial No:
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 1 Predictive Failure Analysis: 0
c2t1d0           Soft Errors: 2 Hard Errors: 0 Transport Errors: 0
Vendor: ATA      Product: SAMSUNG HD753LJ  Revision: 1113 Serial No:
Size: 750.16GB <750156373504 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 2 Predictive Failure Analysis: 0
c2t2d0           Soft Errors: 151 Hard Errors: 0 Transport Errors: 0
Vendor: ATA      Product: WDC WD15EADS-00R Revision: 0A01 Serial No:
Size: 1500.30GB <1500301909504 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 151 Predictive Failure Analysis: 0
c2t3d0           Soft Errors: 151 Hard Errors: 0 Transport Errors: 0
Vendor: ATA      Product: WDC WD15EADS-00R Revision: 0A01 Serial No:
Size: 1500.30GB <1500301909504 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 151 Predictive Failure Analysis: 0
c2t4d0           Soft Errors: 153 Hard Errors: 0 Transport Errors: 0
Vendor: ATA      Product: WDC WD15EADS-00R Revision: 0A01 Serial No:
Size: 1500.30GB <1500301909504 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 153 Predictive Failure Analysis: 0
c2t5d0           Soft Errors: 153 Hard Errors: 0 Transport Errors: 0
Vendor: ATA      Product: WDC WD15EADS-00R Revision: 0A01 Serial No:
Size: 1500.30GB <1500301909504 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 153 Predictive Failure Analysis: 0
c0t1d0           Soft Errors: 2 Hard Errors: 0 Transport Errors: 0
Vendor: Sun      Product: STK RAID INT     Revision: V1.0 Serial No:
Size: 31.87GB <31866224128 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 2 Predictive Failure Analysis: 0
c0t2d0           Soft Errors: 3 Hard Errors: 0 Transport Errors: 0
Vendor: Sun      Product: STK RAID INT     Revision: V1.0 Serial No:
Size: 1497.86GB <1497859358208 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 3 Predictive Failure Analysis: 0

I am curious as to why the "Illegal Request" counters keep going up. The machine was rebooted ~11 hours ago, and they climb whenever I try to use the pool...
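
For what it's worth, I can watch the counters climb via the sd error
kstats (this assumes the disks sit behind the sd driver; kstat -p
prints one counter per line):

        # print the Illegal Request counters once per second
        while true; do
                kstat -p -m sderr | grep 'Illegal Request'
                sleep 1
        done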

The machine is quite powerful, and top shows no CPU load, no iowait, and plenty of free memory. It is basically idle at the moment, yet it can take several minutes to copy a 300 MB file from somewhere in the pool to /tmp/... (a per-disk read test is sketched after the top output below)
# top
last pid: 1383; load avg: 0.01, 0.00, 0.00; up 0+10:47:57  01:39:17
55 processes: 54 sleeping, 1 on cpu
CPU states: 99.0% idle, 0.0% user, 1.0% kernel, 0.0% iowait, 0.0% swap
Kernel: 193 ctxsw, 3 trap, 439 intr, 298 syscall, 3 flt
Memory: 8186M phys mem, 4699M free mem, 2048M total swap, 2048M free swap
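
To see whether one drive is dragging the whole pool down, I guess I
could test the raw sequential read speed of each disk directly,
something like the sketch below (the s0 slice is an assumption for
whole-disk EFI labels; the device names come from zpool status):

        # read 512 MB straight off each raidz member, one at a time
        for d in c2t2d0 c2t3d0 c2t4d0 c2t5d0; do
                echo "=== $d ==="
                time dd if=/dev/rdsk/${d}s0 of=/dev/null bs=1024k count=512
        done

A healthy drive should manage tens of MB/s sequentially; one that takes
minutes for 512 MB would be the obvious suspect.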

I thought I might have run into the ARC fragmentation problems described here on the forums, but that doesn't seem to be the case:
# echo "::arc"|mdb -k
hits                      =    490044
misses                    =     37004
demand_data_hits          =    282392
demand_data_misses        =      2113
demand_metadata_hits      =    191757
demand_metadata_misses    =     21034
prefetch_data_hits        =       851
prefetch_data_misses      =     10265
prefetch_metadata_hits    =     15044
prefetch_metadata_misses  =      3592
mru_hits                  =     73416
mru_ghost_hits            =        16
mfu_hits                  =    401500
mfu_ghost_hits            =        24
deleted                   =      1555
recycle_miss              =         0
mutex_miss                =         0
evict_skip                =      1487
hash_elements             =     37032
hash_elements_max         =     37045
hash_collisions           =     10094
hash_chains               =      4365
hash_chain_max            =         4
p                         =      3576 MB
c                         =      7154 MB
c_min                     =       894 MB
c_max                     =      7154 MB
size                      =      1797 MB
hdr_size                  =   8002680
data_size                 = 1866272256
other_size                =  10519712
l2_hits                   =         0
l2_misses                 =         0
l2_feeds                  =         0
l2_rw_clash               =         0
l2_read_bytes             =         0
l2_write_bytes            =         0
l2_writes_sent            =         0
l2_writes_done            =         0
l2_writes_error           =         0
l2_writes_hdr_miss        =         0
l2_evict_lock_retry       =         0
l2_evict_reading          =         0
l2_free_on_write          =         0
l2_abort_lowmem           =         0
l2_cksum_bad              =         0
l2_io_error               =         0
l2_size                   =         0
l2_hdr_size               =         0
memory_throttle_count     =         0
arc_no_grow               =         0
arc_tempreserve           =         0 MB
arc_meta_used             =       372 MB
arc_meta_limit            =      1788 MB
arc_meta_max              =       372 MB
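
If I read these numbers right, the overall hit rate is
hits / (hits + misses) = 490044 / 527048, roughly 93%, and size
(1797 MB) is well below c_max (7154 MB), so the ARC does not look
starved or fragmented to me.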

I then started a scrub, and it looks like it will take forever... It used to finish in a few hours; now it says it will be done in almost 700 hours:
scrub: scrub in progress for 4h43m, 0,68% done, 685h2m to go
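
If I do the math, 0.68% of the ~3.78 TB used is about 26 GB read in
4h43m, i.e. roughly 1.5 MB/s across four spindles, which matches how
slow everything else feels.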

Does anyone have any clue as to what is happening, and what I can do about it? If a disk is failing without the OS noticing, it would be nice to find out which drive it is and have it replaced before it's too late.

All help is appreciated...

Yours sincerely,
Morten-Christian Bernson
--
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
