Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
On Jun 16, 2011, at 8:05 PM, Daniel Carosone wrote:

> On Thu, Jun 16, 2011 at 10:40:25PM -0400, Edward Ned Harvey wrote:
>>> From: Daniel Carosone [mailto:d...@geek.com.au]
>>> Sent: Thursday, June 16, 2011 10:27 PM
>>>
>>> Is it still the case, as it once was, that allocating anything other
>>> than whole disks as vdevs forces NCQ / write cache off on the drive
>>> (either or both, forget which, guess write cache)?
>>
>> I will only say that, regardless of whether or not that is or ever was true,
>> I believe it's entirely irrelevant. Because your system performs read and
>> write caching and buffering in RAM, the tiny little RAM on the disk can't
>> possibly contribute anything.
>
> I disagree. It can vastly help improve the IOPS of the disk and keep
> the channel open for more transactions while one is in progress.
> Otherwise, the channel is idle, blocked on command completion, while
> the heads seek.

Actually, all of the data I've gathered recently shows that the number of
IOPS does not significantly increase for HDDs running random workloads.
However, the response time does :-( My data is leading me to want to
restrict the queue depth to 1 or 2 for HDDs.

SSDs are another story; they scale much better in the response time and
IOPS vs. queue depth analysis.

Has anyone else studied this?
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
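[Editor's note: Richard's observation — IOPS roughly flat while response time grows with queue depth — is what Little's Law predicts for a seek-bound disk. The toy model below is an illustrative sketch, not data from this thread; the 8 ms service time is an assumed figure for a 7200 RPM random workload, and the model deliberately ignores any seek reordering benefit from NCQ, which is exactly the effect under debate upthread.]

```python
# Little's Law: N = X * R  (outstanding I/Os = throughput * response time).
# For a seek-bound HDD, throughput X saturates near 1/service_time, so
# raising the queue depth N mostly inflates response time R = N / X.

def hdd_response_ms(queue_depth, service_ms=8.0):
    """Illustrative model: IOPS capped by seek+rotate service time."""
    iops = 1000.0 / service_ms              # ~125 IOPS at any queue depth
    response_ms = queue_depth * service_ms  # R = N / X = N * service_ms
    return iops, response_ms

for qd in (1, 2, 8, 32):
    iops, r = hdd_response_ms(qd)
    print(f"qd={qd:2d}  IOPS={iops:.0f}  response={r:.0f} ms")
```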
Re: [zfs-discuss] Is ZFS internal reservation excessive?
On Jun 17, 2011, at 4:07 PM, MasterCATZ wrote:

> OK, what is the point of the reserve when we can't even delete a file
> when there is no space left?!
>
> If they are going to have a reserve, they should make it a little smarter,
> and maybe have the FS use some of that free space so that when we do hit
> 0 bytes, data can still be deleted, because there is over 50 GB free in
> the reserve.

Is there a quota?
 -- richard

> # zfs list
> NAME   USED  AVAIL  REFER  MOUNTPOINT
> tank  2.68T      0  2.68T  /tank
> # zpool list
> NAME  SIZE   ALLOC  FREE   CAP  DEDUP  HEALTH  ALTROOT
> tank  3.64T  3.58T  58.2G  98%  1.00x  ONLINE  -
>
> rm -f -r downloads
> rm: downloads: No space left on device
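[Editor's note: a workaround often cited on this list when rm fails with ENOSPC on a full copy-on-write pool is to truncate the file in place first (which releases its data blocks without needing much new metadata), then remove it. It may not help if a snapshot still references the blocks. A minimal sketch, demonstrated on a scratch file rather than a real full pool:]

```shell
# Often-cited workaround for "rm: No space left on device" on a full ZFS
# pool: truncate the file first, then remove it. May not free space if a
# snapshot still references the file's blocks.
f=$(mktemp)              # stand-in for the file you cannot delete
printf 'payload' > "$f"
: > "$f"                 # truncate in place; on ZFS this frees data blocks
rm -f "$f"
```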
Re: [zfs-discuss] OpenIndiana | ZFS | scrub | network | awful slow
On Jun 16, 2011, at 3:36 PM, Sven C. Merckens wrote:

> Hi Roy, hi Dan,
>
> many thanks for your responses.
>
> I am using napp-it to control the OpenSolaris systems.
> The napp-it interface shows a dedup factor of 1.18x on System 1 and 1.16x
> on System 2.

You're better off disabling dedup for this workload. If the dedup ratio were
more like 10, 342, or some number > 2, dedup can be worthwhile. IMHO, dedup
ratios < 2 are not good candidates for dedup. You can look at existing,
non-deduped data to get an estimate of the potential dedup savings using:
	zdb -S poolname

> Dedup is on "always" (not only at the start), and compression is also
> activated:
> System 1 = compression on (lzjb?)
> System 2 = compression on (gzip-6)
> Compression rates:
> System 1 = 1.10x
> System 2 = 1.48x
>
> Compression and dedup were some of the primary reasons to choose ZFS in
> this situation.

Compression is a win for a large number of cases. Dedup, not so much.

> We tried more RAM (48GB) at the beginning to check if this would do
> anything for performance. But it did not; however, we had only about 3-4TB
> of data on the storage at that time (performance was good). So I will
> order some RAM modules and double the RAM to 48GB.
>
> The RAM usage is about 21GB, 3GB free (on both systems, after a while). At
> the start (after a few hours of usage and only 3-4TB of data) the usage
> was identical. I read in some places that ZFS will use all memory, leaving
> only 1GB free (so I thought the RAM wasn't being used completely). The
> swap isn't used by the system.
>
> Now the systems are idle and the RAM usage is very little. top:
>
> System 1
> Memory: 24G phys mem, 21G free mem, 12G total swap, 12G free swap
>
> System 2
> Memory: 24G phys mem, 21G free mem, 12G total swap, 12G free swap
>
> Starting to read about 12GB from System 2, and RAM usage goes up;
> performance is at about 65-70MB/s via GigaBit (iSCSI).
> Memory: 24G phys mem, 3096M free mem, 12G total swap, 12G free swap
>
> OK, I understand, more RAM will be no fault.. ;)
>
> On System 1 there is no such massive change in RAM usage while copying
> files to and from the volume.
> But the performance is only about 20MB/s via GigaBit (iSCSI).

This is likely not a dedup issue. More likely, Nagle is biting you, or there
is a serialization that is not immediately obvious. I have also seen
questionable network configurations cause strange slowdowns for iSCSI.
 -- richard

> So RAM can't be the issue on System 1 (which has more data stored).
> This system is also equipped with a 240GB SSD used for L2ARC on the second
> LSI controller inside the server enclosure.
>
> Roy:
> But is the L2ARC also important while writing to the device? Because the
> storages are used most of the time only for writing data to them, the read
> cache (as I thought) isn't a performance factor... Please correct me if my
> thoughts are wrong.
>
> But it is only a "small cost" addition to add a 120GB/240GB OCZ Vertex 2
> SSD to System 2 (≈ 150/260 Euro). I will give it a try.
>
> Would it be better to add the SSD to the LSI controller (put it in the
> JBOD storage), or put it in the server enclosure itself and connect it to
> the internal SATA controller?
>
> Do you have any tips for the settings of the dataset?
> These are the settings:
>
> PROPERTY        System 1      System 2
> used            34.4T         19.4T
> available       10.7T         40.0T
> referenced      34.4T         19.4T
> compressratio   1.10x         1.43x
> mounted         yes           yes
> quota           none          none
> reservation     none          none
> recordsize      128K          128K
> mountpoint      /             /
> sharenfs        off           off
> checksum        on            on
> compression     on            gzip
> atime           off           off
> devices         on            on
> exec            on            on
> setuid          on            on
> readonly        off           off
> zoned           off           off
> snapdir         hidden        hidden
> aclinherit      passthrough   passthrough
> canmount        on            on
> xattr           on            on
> copie
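[Editor's note: if Nagle is the suspect, the usual experiment is to disable it on the connection with TCP_NODELAY. Real iSCSI stacks expose this as a driver or target tunable rather than an application socket option, so the sketch below only demonstrates the socket-level mechanism itself:]

```python
import socket

# Nagle's algorithm coalesces small writes into fewer segments, which adds
# latency for chatty request/response protocols such as iSCSI.
# Setting TCP_NODELAY = 1 disables it on a given TCP socket.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print("Nagle enabled?", s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) == 0)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
assert s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
s.close()
```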
Re: [zfs-discuss] zfs global hot spares?
More below...

On Jun 16, 2011, at 2:27 AM, Fred Liu wrote:

> Fixing a typo in my last thread...
>
>> -----Original Message-----
>> From: Fred Liu
>> Sent: Thursday, June 16, 2011 17:22
>> To: 'Richard Elling'
>> Cc: Jim Klimov; zfs-discuss@opensolaris.org
>> Subject: RE: [zfs-discuss] zfs global hot spares?
>>
>>> This message is from the disk saying that it aborted a command. These
>>> are usually preceded by a reset, as shown here. What caused the reset
>>> condition? Was it actually target 11, or did target 11 get caught up
>>> in the reset storm?
>
> It happened in the middle of the night and nobody touched the file box.
> I assume it is the transition state before the disk is *thoroughly*
> damaged:
>
> Jun 10 09:34:11 cn03 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
> Jun 10 09:34:11 cn03 EVENT-TIME: Fri Jun 10 09:34:11 CST 2011
> Jun 10 09:34:11 cn03 PLATFORM: X8DTH-i-6-iF-6F, CSN: 1234567890, HOSTNAME: cn03
> Jun 10 09:34:11 cn03 SOURCE: zfs-diagnosis, REV: 1.0
> Jun 10 09:34:11 cn03 EVENT-ID: 4f4bfc2c-f653-ed20-ab13-eef72224af5e
> Jun 10 09:34:11 cn03 DESC: The number of I/O errors associated with a ZFS device exceeded
> Jun 10 09:34:11 cn03 acceptable levels. Refer to http://sun.com/msg/ZFS-8000-FD for more information.
> Jun 10 09:34:11 cn03 AUTO-RESPONSE: The device has been offlined and marked as faulted. An attempt
> Jun 10 09:34:11 cn03 will be made to activate a hot spare if available.
> Jun 10 09:34:11 cn03 IMPACT: Fault tolerance of the pool may be compromised.
> Jun 10 09:34:11 cn03 REC-ACTION: Run 'zpool status -x' and replace the bad device.

'zpool status -x' output would be useful. These error reports do not include
a pointer to the faulty device. fmadm can also give more info.
> After I rebooted it, I got:
>
> Jun 10 11:38:49 cn03 genunix: [ID 540533 kern.notice] ^MSunOS Release 5.11 Version snv_134 64-bit
> Jun 10 11:38:49 cn03 genunix: [ID 683174 kern.notice] Copyright 1983-2010 Sun Microsystems, Inc. All rights reserved.
> Jun 10 11:38:49 cn03 Use is subject to license terms.
> Jun 10 11:38:49 cn03 unix: [ID 126719 kern.info] features:
> 7f7f<...t,sse2,sse,sep,pat,cx8,pae,mca,mmx,cmov,de,pge,mtrr,msr,tsc,lgpg>
> Jun 10 11:39:06 cn03 scsi: [ID 365881 kern.info] /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
> Jun 10 11:39:06 cn03    mptsas0 unrecognized capability 0x3
> Jun 10 11:39:42 cn03 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c50009723937 (sd3):
> Jun 10 11:39:42 cn03    drive offline
> Jun 10 11:39:47 cn03 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c50009723937 (sd3):
> Jun 10 11:39:47 cn03    drive offline
> Jun 10 11:39:52 cn03 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c50009723937 (sd3):
> Jun 10 11:39:52 cn03    drive offline
> Jun 10 11:39:57 cn03 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c50009723937 (sd3):
> Jun 10 11:39:57 cn03    drive offline

mpathadm can be used to determine the device paths for this disk.

Notice how the disk is offline at multiple times. There is some sort of
recovery going on here that continues to fail later. I call these "wounded
soldiers" because they take a lot more care than a dead soldier. You would
be better off if the drive completely died.

>>> Hot spare will not help you here. The problem is not constrained to
>>> one disk. In fact, a hot spare may be the worst thing here, because it
>>> can kick in for the disk complaining about a clogged expander or
>>> spurious resets. This causes a resilver that reads from the actual
>>> broken disk, that causes more resets, that kicks out another disk that
>>> causes a resilver, and so on.
>>> -- richard

> So the warm spares could be a "better" choice in this situation?
> BTW, in what conditions will the SCSI reset storm happen?

In my experience they start randomly and in some cases are not reproducible.

> How can we be immune to this, so as NOT to interrupt the file service?

Are you asking for fault tolerance? If so, then you need a fault-tolerant
system like a Tandem. If you are asking for a way to build a cost-effective
solution using commercial, off-the-shelf (COTS) components, then that is far
beyond what can be easily said in a forum posting.
 -- richard
[zfs-discuss] Finding disks [was: # disks per vdev]
On Jun 17, 2011, at 12:55 AM, Lanky Doodle wrote:

> Thanks Richard.
>
> How does ZFS enumerate the disks? In terms of listing them, does it do so
> logically, i.e.:
>
> controller #1 (motherboard)
>  |--- disk1
>  |--- disk2
> controller #3
>  |--- disk3
>  |--- disk4
>  |--- disk5
>  |--- disk6
>  |--- disk7
>  |--- disk8
>  |--- disk9
>  |--- disk10
> controller #4
>  |--- disk11
>  |--- disk12
>  |--- disk13
>  |--- disk14
>  |--- disk15
>  |--- disk16
>  |--- disk17
>  |--- disk18
>
> or is it completely random, leaving me with some trial and error to work
> out which disk is on which port?

For all intents and purposes, it is random. Slot locations are the
responsibility of the enclosure, not the disk.

Until we get a better framework integrated into illumos, you can get the bay
location from a SES-compliant enclosure from the fmtopo output, lsiutil, or
sg3_utils. For NexentaStor users I provide some automation for this in a KB
article on the customer portal. Also for NexentaStor users, DataON offers a
GUI plugin called DSM that shows the enclosure, blinky lights, and all of
the status information available -- power supplies, fans, etc. -- good stuff!

For the curious, fmtopo shows the bay for each disk and the serial number of
the disk therein. You can then cross-reference the c*t*d* number for the OS
instance to the serial number. Note that for dual-port disks, you can get
different c*t*d* numbers for each node connected to the disk (rare, but
possible).

Caveat: please verify, prior to rolling into production, that the bay number
matches the enclosure silkscreen. The numbers are programmable, and
different vendors deliver the same enclosure with different silkscreened
numbers. As always, the disk serial number is supposed to be unique, so you
can test this very easily.

For the later Nexenta, OpenSolaris, or Solaris 11 Express releases, the
mpt_sas driver will try to light the OK2RM (ok to remove) LED for a disk
when you use cfgadm to disconnect the paths.
Apparently this also works for SATA disks in an enclosure that manages SATA
disks. The process is documented very nicely by Cindy in the ZFS Admin
Guide. However, there are a number of enclosures that do not have an OK2RM
LED. YMMV.
 -- richard
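[Editor's note: the cross-reference Richard describes — bay-to-serial from fmtopo, c*t*d*-to-serial from the OS — is a simple join on serial number. A toy sketch; the device names and serial numbers below are invented for illustration, and in practice the two maps would be parsed from fmtopo and a tool such as iostat -En:]

```python
# Join enclosure bays to OS device names via the disk serial number.
# bay_to_serial would come from fmtopo output; dev_to_serial from the OS.
bay_to_serial = {0: "WD-AAA111", 1: "WD-BBB222", 2: "WD-CCC333"}
dev_to_serial = {"c0t2d0": "WD-BBB222", "c0t3d0": "WD-AAA111",
                 "c0t4d0": "WD-CCC333"}

# Invert one map (serials are supposed to be unique), then join on serial.
serial_to_dev = {sn: dev for dev, sn in dev_to_serial.items()}
bay_to_dev = {bay: serial_to_dev.get(sn) for bay, sn in bay_to_serial.items()}

for bay in sorted(bay_to_dev):
    print(f"bay {bay}: {bay_to_dev[bay]} (serial {bay_to_serial[bay]})")
```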
[zfs-discuss] read data block
Does anyone know how to walk back to the root if we are at a data block
(level 0) in ZFS? Is this possible, given that a data block can be picked at
random and we don't know which indirect block is the parent of a given
level-0 block?

Thanks,
Henry
[zfs-discuss] Question about drive LEDs
Hi all,

I have a few machines set up with OI 148, and I can't make the LEDs on the
drives work when something goes bad. The chassis are Supermicro ones and
normally work well. Any idea how to make drive LEDs work with this setup?

--
Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented
intelligibly. It is an elementary imperative for all pedagogues to avoid
excessive use of idioms of foreign origin. In most cases, adequate and
relevant synonyms exist in Norwegian.
Re: [zfs-discuss] # disks per vdev
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org]
> On Behalf Of Marty Scholes
>
> On a busy array it is hard even to use the LEDs as indicators.

Offline the disk: the light stays off. Use dd to read the disk: the light
stays on. That should make it easy enough.

Also, depending on your HBA, you can often blink an amber LED instead of the
standard green one.
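[Editor's note: the dd trick above as a tiny helper — stream reads from the suspect disk so its activity LED stays lit. The /dev/rdsk path in the usage comment is hypothetical; point it at the raw device you want to identify, and offline the disk first if you want the opposite test (LED stays dark).]

```shell
# Keep a drive's activity LED lit by streaming sequential reads from it.
# Usage: blink /dev/rdsk/c8t3d0p0 [block_count]   (device path is site-specific)
blink() {
    dd if="$1" of=/dev/null bs=1024 count="${2:-1048576}" 2>/dev/null
}
```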
Re: [zfs-discuss] question about COW and snapshots
On 18/06/11 12:44 AM, Michael Sullivan wrote:

> ...
> Way off-topic, but Smalltalk and its variants do this by maintaining the
> state of everything in an operating environment image.
> ...

Which is in memory, so things are rather different from the world of
filesystems.

--Toby

> But then again, I could be wrong.
>
> Mike
>
> ---
> Michael Sullivan
> m...@axsh.us
> http://www.axsh.us/
> Phone: +1-662-259-
> Mobile: +1-662-202-7716