> On Jan 26, 2017, at 12:20 AM, Stephan Budach <[email protected]> wrote:
>
> Hi Richard,
>
> gotcha… read on, below…

"thin provisioning" bit you. For "thick provisioning" you'll have a refreservation and/or reservation.
 — richard
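For reference, the distinction Richard draws maps onto zvol creation roughly like this; a minimal sketch with a hypothetical pool name, not taken from the thread:

    # thick provisioning: a plain -V volume gets a refreservation equal to
    # its volsize, so the backing pool can never be overcommitted
    zfs create -V 10T tank/lun0

    # thin provisioning: -s ("sparse") skips that refreservation, so the
    # volume can be created bigger than the pool can actually back
    zfs create -s -V 10T tank/lun1

    # a thin volume can be converted to thick later, provided the pool
    # still has enough free space to cover it
    zfs set refreservation=10T tank/lun1

On older pool versions the same effect came from the reservation property rather than refreservation, which is presumably why Richard says "and/or".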
"thin provisioning" bit you. For "thick provisioning" you’ll have a refreservation and/or reservation. — richard > > Am 26.01.17 um 00:43 schrieb Richard Elling: >> more below… >> >>> On Jan 25, 2017, at 3:01 PM, Stephan Budach <[email protected] >>> <mailto:[email protected]>> wrote: >>> >>> Ooops… should have waited with sending that message after I rebootet the >>> S11.1 host… >>> >>> >>> Am 25.01.17 um 23:41 schrieb Stephan Budach: >>>> Hi Richard, >>>> >>>> Am 25.01.17 um 20:27 schrieb Richard Elling: >>>>> Hi Stephan, >>>>> >>>>>> On Jan 25, 2017, at 5:54 AM, Stephan Budach <[email protected] >>>>>> <mailto:[email protected]>> wrote: >>>>>> >>>>>> Hi guys, >>>>>> >>>>>> I have been trying to import a zpool, based on a 3way-mirror provided by >>>>>> three omniOS boxes via iSCSI. This zpool had been working flawlessly >>>>>> until some random reboot of the S11.1 host. Since then, S11.1 has been >>>>>> importing this zpool without success. >>>>>> >>>>>> This zpool consists of three 108TB LUNs, based on a raidz-2 zvols… yeah >>>>>> I know, we shouldn't have done that in the first place, but performance >>>>>> was not the primary goal for that, as this one is a backup/archive pool. >>>>>> >>>>>> When issueing a zpool import, it says this: >>>>>> >>>>>> root@solaris11atest2:~# zpool import >>>>>> pool: vsmPool10 >>>>>> id: 12653649504720395171 >>>>>> state: DEGRADED >>>>>> status: The pool was last accessed by another system. >>>>>> action: The pool can be imported despite missing or damaged devices. The >>>>>> fault tolerance of the pool may be compromised if imported. >>>>>> see: http://support.oracle.com/msg/ZFS-8000-EY >>>>>> <http://support.oracle.com/msg/ZFS-8000-EY> >>>>>> config: >>>>>> >>>>>> vsmPool10 DEGRADED >>>>>> mirror-0 DEGRADED >>>>>> c0t600144F07A3506580000569398F60001d0 DEGRADED corrupted >>>>>> data >>>>>> c0t600144F07A35066C00005693A0D90001d0 DEGRADED corrupted >>>>>> data >>>>>> c0t600144F07A35001A00005693A2810001d0 DEGRADED corrupted >>>>>> data >>>>>> >>>>>> device details: >>>>>> >>>>>> c0t600144F07A3506580000569398F60001d0 DEGRADED >>>>>> scrub/resilver needed >>>>>> status: ZFS detected errors on this device. >>>>>> The device is missing some data that is recoverable. >>>>>> >>>>>> c0t600144F07A35066C00005693A0D90001d0 DEGRADED >>>>>> scrub/resilver needed >>>>>> status: ZFS detected errors on this device. >>>>>> The device is missing some data that is recoverable. >>>>>> >>>>>> c0t600144F07A35001A00005693A2810001d0 DEGRADED >>>>>> scrub/resilver needed >>>>>> status: ZFS detected errors on this device. >>>>>> The device is missing some data that is recoverable. 
>>>>>>
>>>>>> However, when actually running zpool import -f vsmPool10, the system starts to perform a lot of writes on the LUNs, and iostat reports an alarming increase in h/w errors:
>>>>>>
>>>>>> root@solaris11atest2:~# iostat -xeM 5
>>>>>>                  extended device statistics       ---- errors ---
>>>>>> device    r/s    w/s   Mr/s   Mw/s wait actv  svc_t  %w  %b s/w h/w trn tot
>>>>>> sd0       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0   0   0   0   0
>>>>>> sd1       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0   0   0   0   0
>>>>>> sd2       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0   0  71   0  71
>>>>>> sd3       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0   0   0   0   0
>>>>>> sd4       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0   0   0   0   0
>>>>>> sd5       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0   0   0   0   0
>>>>>>                  extended device statistics       ---- errors ---
>>>>>> device    r/s    w/s   Mr/s   Mw/s wait actv  svc_t  %w  %b s/w h/w trn tot
>>>>>> sd0      14.2  147.3    0.7    0.4  0.2  0.1    2.0   6   9   0   0   0   0
>>>>>> sd1      14.2    8.4    0.4    0.0  0.0  0.0    0.3   0   0   0   0   0   0
>>>>>> sd2       0.0    4.2    0.0    0.0  0.0  0.0    0.0   0   0   0  92   0  92
>>>>>> sd3     157.3   46.2    2.1    0.2  0.0  0.7    3.7   0  14   0  30   0  30
>>>>>> sd4     123.9   29.4    1.6    0.1  0.0  1.7   10.9   0  36   0  40   0  40
>>>>>> sd5     142.5   43.0    2.0    0.1  0.0  1.9   10.2   0  45   0  88   0  88
>>>>>>                  extended device statistics       ---- errors ---
>>>>>> device    r/s    w/s   Mr/s   Mw/s wait actv  svc_t  %w  %b s/w h/w trn tot
>>>>>> sd0       0.0  234.5    0.0    0.6  0.2  0.1    1.4   6  10   0   0   0   0
>>>>>> sd1       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0   0   0   0   0
>>>>>> sd2       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0   0  92   0  92
>>>>>> sd3       3.6   64.0    0.0    0.5  0.0  4.3   63.2   0  63   0 235   0 235
>>>>>> sd4       3.0   67.0    0.0    0.6  0.0  4.2   60.5   0  68   0 298   0 298
>>>>>> sd5       4.2   59.6    0.0    0.4  0.0  5.2   81.0   0  72   0 406   0 406
>>>>>>                  extended device statistics       ---- errors ---
>>>>>> device    r/s    w/s   Mr/s   Mw/s wait actv  svc_t  %w  %b s/w h/w trn tot
>>>>>> sd0       0.0  234.8    0.0    0.7  0.4  0.1    2.2  11  10   0   0   0   0
>>>>>> sd1       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0   0   0   0   0
>>>>>> sd2       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0   0  92   0  92
>>>>>> sd3       5.4   54.4    0.0    0.3  0.0  2.9   48.5   0  67   0 384   0 384
>>>>>> sd4       6.0   53.4    0.0    0.3  0.0  4.6   77.7   0  87   0 519   0 519
>>>>>> sd5       6.0   60.8    0.0    0.3  0.0  4.8   72.5   0  87   0 727   0 727
>>>>>
>>>>> h/w errors are a classification of other errors. The full error list is available from "iostat -E" and will be important to tracking this down.
>>>>>
>>>>> A better, more detailed analysis can be gleaned from the "fmdump -e" ereports that should be associated with each h/w error. However, there are dozens of causes of these, so we don't have enough info here to fully understand.
>>>>> — richard
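For reference, the triage Richard suggests amounts to something like this on the S11.1 host; a sketch, with only the pool name taken from the thread:

    iostat -En       # full per-device error counters, vendor/model, last sense data
    fmdump -e        # one-line summary of the error reports (ereports)
    fmdump -eV       # full ereport detail: SCSI asc/ascq codes, LBAs, device paths

    # if the pool must be examined without triggering any further writes,
    # Solaris 11 also supports importing it read-only:
    zpool import -o readonly=on -f vsmPool10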
>>>>
>>>> Well… I can't provide you with the output of fmdump -e right now, since I am currently unable to get the '-' typed into the console, due to some fancy keyboard layout issues, and I am not able to log in via ssh either (I can authenticate, but I don't get to a shell, which may be due to the running zpool import). I can, however, confirm that fmdump on its own shows nothing at all.
>>>>
>>>> I could just reset the S11.1 host after removing the zpool.cache file, so that the system will not try to import the zpool right away upon restart…
>>>>
>>>> …plus I might get the option to set the keyboard right after the reboot, but that's another issue…
>>>>
>>> After resetting the S11.1 host and getting the keyboard layout right, I issued a fmdump -e and there they are… lots of:
>>>
>>> Jan 25 23:25:13.5643 ereport.io.scsi.cmd.disk.dev.rqs.merr.write
>>> Jan 25 23:25:13.8944 ereport.io.scsi.cmd.disk.dev.rqs.merr.write
>>> Jan 25 23:25:13.8945 ereport.io.scsi.cmd.disk.dev.rqs.merr.write
>>> Jan 25 23:25:13.8946 ereport.io.scsi.cmd.disk.dev.rqs.merr.write
>>> Jan 25 23:25:13.9274 ereport.io.scsi.cmd.disk.dev.rqs.merr.write
>>> Jan 25 23:25:13.9275 ereport.io.scsi.cmd.disk.dev.rqs.merr.write
>>> Jan 25 23:25:13.9276 ereport.io.scsi.cmd.disk.dev.rqs.merr.write
>>> Jan 25 23:25:13.9277 ereport.io.scsi.cmd.disk.dev.rqs.merr.write
>>> Jan 25 23:25:13.9282 ereport.io.scsi.cmd.disk.dev.rqs.merr.write
>>> Jan 25 23:25:13.9284 ereport.io.scsi.cmd.disk.dev.rqs.merr.write
>>> Jan 25 23:25:13.9285 ereport.io.scsi.cmd.disk.dev.rqs.merr.write
>>> Jan 25 23:25:13.9286 ereport.io.scsi.cmd.disk.dev.rqs.merr.write
>>> Jan 25 23:25:13.9287 ereport.io.scsi.cmd.disk.dev.rqs.merr.write
>>> Jan 25 23:25:13.9288 ereport.fs.zfs.dev.merr.write
>>> Jan 25 23:25:13.9290 ereport.io.scsi.cmd.disk.dev.rqs.merr.write
>>> Jan 25 23:25:13.9294 ereport.io.scsi.cmd.disk.dev.rqs.merr.write
>>> Jan 25 23:25:13.9301 ereport.io.scsi.cmd.disk.dev.rqs.merr.write
>>> Jan 25 23:25:13.9306 ereport.io.scsi.cmd.disk.dev.rqs.merr.write
>>> Jan 25 23:50:44.7195 ereport.io.scsi.cmd.disk.dev.rqs.derr
>>> Jan 25 23:50:44.7306 ereport.io.scsi.cmd.disk.dev.rqs.derr
>>> Jan 25 23:50:44.7434 ereport.io.scsi.cmd.disk.dev.rqs.derr
>>> Jan 25 23:53:31.4386 ereport.io.scsi.cmd.disk.dev.rqs.derr
>>> Jan 25 23:53:31.4579 ereport.io.scsi.cmd.disk.dev.rqs.derr
>>> Jan 25 23:53:31.4710 ereport.io.scsi.cmd.disk.dev.rqs.derr
>>>
>>> These seem to be media errors and disk errors on the zpools/zvols that make up the LUNs for this zpool… I am wondering why this happens.
>>
>> yes, good question
>> That we get media errors ("merr") on write is one clue. To find out more details, "fmdump -eV" will show in gory detail the exact SCSI asc/ascq codes, LBAs, etc.
>>
>> ZFS is COW, so if the LUs are backed by ZFS and there isn't enough free space, then this is the sort of error we expect. But there could be other reasons.
>> — richard
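Richard's point, compressed: because ZFS is copy-on-write, even an overwrite of a block the zvol already owns must allocate a fresh block on the backing pool first, so once that pool is out of free space, every incoming write to the LU fails and surfaces to the initiator as a medium error. A quick check on each target for this condition, using the dataset names from the thread, might be:

    zfs list -o name,used,avail,refer -r tr1206900data
    zfs get available tr1206900data/vsmPool10   # 0 means the LU can no longer accept writes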
>
> Oh Lord… I really think that this is it… this is what the zpool/zvol looks like on one of the three targets:
>
> root@tr1206900:/root# zpool list
> NAME            SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
> rpool          29,8G  21,5G  8,27G         -    45%    72%  1.00x  ONLINE  -
> tr1206900data   109T   106T  3,41T         -    51%    96%  1.00x  ONLINE  -
>
> root@tr1206900:/root# zfs list -r tr1206900data
> NAME                      USED  AVAIL  REFER  MOUNTPOINT
> tr1206900data            86,6T      0   236K  /tr1206900data
> tr1206900data/vsmPool10  86,6T      0  86,6T  -
>
> root@tr1206900:/root# zfs get all tr1206900data/vsmPool10
> NAME                     PROPERTY              VALUE                  SOURCE
> tr1206900data/vsmPool10  type                  volume                 -
> tr1206900data/vsmPool10  creation              Mon Jan 11 12:57 2016  -
> tr1206900data/vsmPool10  used                  86,6T                  -
> tr1206900data/vsmPool10  available             0                      -
> tr1206900data/vsmPool10  referenced            86,6T                  -
> tr1206900data/vsmPool10  compressratio         1.00x                  -
> tr1206900data/vsmPool10  reservation           none                   default
> tr1206900data/vsmPool10  volsize               109T                   local
> tr1206900data/vsmPool10  volblocksize          128K                   -
> tr1206900data/vsmPool10  checksum              on                     default
> tr1206900data/vsmPool10  compression           off                    default
> tr1206900data/vsmPool10  readonly              off                    default
> tr1206900data/vsmPool10  copies                1                      default
> tr1206900data/vsmPool10  refreservation        none                   default
> tr1206900data/vsmPool10  primarycache          all                    local
> tr1206900data/vsmPool10  secondarycache        all                    default
> tr1206900data/vsmPool10  usedbysnapshots       0                      -
> tr1206900data/vsmPool10  usedbydataset         86,6T                  -
> tr1206900data/vsmPool10  usedbychildren        0                      -
> tr1206900data/vsmPool10  usedbyrefreservation  0                      -
> tr1206900data/vsmPool10  logbias               latency                default
> tr1206900data/vsmPool10  dedup                 off                    default
> tr1206900data/vsmPool10  mlslabel              none                   default
> tr1206900data/vsmPool10  sync                  standard               default
> tr1206900data/vsmPool10  refcompressratio      1.00x                  -
> tr1206900data/vsmPool10  written               86,6T                  -
> tr1206900data/vsmPool10  logicalused           86,5T                  -
> tr1206900data/vsmPool10  logicalreferenced     86,5T                  -
> tr1206900data/vsmPool10  snapshot_limit        none                   default
> tr1206900data/vsmPool10  snapshot_count        none                   default
> tr1206900data/vsmPool10  redundant_metadata    all                    default
>
> This must be the dumbest failure one can possibly have when setting up a zvol iSCSI target. Someone - no, it wasn't me, actually, but that doesn't do me any good anyway - created a zvol equal in size to the zpool, and now it is as you suspected: the zvol has run out of space.
>
> So the only option would be to add additional space to these zpools, so that the zvols can actually occupy the space they claim to have? Should be manageable… I could provide some iSCSI LUNs to the targets themselves and add another vdev. There will be some serious cleanup needed afterwards…
>
> What about the "free" 3,41T in the zpool itself? Could that somehow be utilised?
>
>
> Thanks,
> Stephan
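For the record, the recovery Stephan outlines would look roughly like this on each target; the device name is made up, and note that on ZFS of this vintage an added vdev cannot be removed again, so this is a one-way step:

    # grow the backing pool with another vdev built from an extra iSCSI LUN
    zpool add tr1206900data c0tXXXXXXXXXXXXXXXXd0

    # after the pool has been grown and cleaned up enough to back the full
    # volsize, pin that space down so the zvol can never overcommit again
    zfs set refreservation=109T tr1206900data/vsmPool10

As for the "free" 3,41T: zpool list reports raw capacity, which on a raidz2 pool still includes parity and the pool's internal slop reservation, while zfs list reports usable space, so that gap is not space the zvol can draw on.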
_______________________________________________
OmniOS-discuss mailing list
[email protected]
http://lists.omniti.com/mailman/listinfo/omnios-discuss
