Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)

2011-06-18 Thread Richard Elling
On Jun 16, 2011, at 8:05 PM, Daniel Carosone wrote:

> On Thu, Jun 16, 2011 at 10:40:25PM -0400, Edward Ned Harvey wrote:
>>> From: Daniel Carosone [mailto:d...@geek.com.au]
>>> Sent: Thursday, June 16, 2011 10:27 PM
>>> 
>>> Is it still the case, as it once was, that allocating anything other
>>> than whole disks as vdevs forces NCQ / write cache off on the drive
>>> (either or both, forget which, guess write cache)?
>> 
>> I will only say that, regardless of whether or not that is or ever was true,
>> I believe it's entirely irrelevant.  Because your system performs read and
>> write caching and buffering in RAM, the tiny little RAM on the disk can't
>> possibly contribute anything.
> 
> I disagree.  It can vastly help improve the IOPS of the disk and keep
> the channel open for more transactions while one is in progress.
> Otherwise, the channel is idle, blocked on command completion, while
> the heads seek. 

Actually, all of the data I've gathered recently shows that the number of
IOPS does not increase significantly with queue depth for HDDs running random
workloads. However, the response time does :-( My data is leading me to want
to restrict the queue depth to 1 or 2 for HDDs.
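For anyone who wants to poke at this themselves, a minimal sketch on an
OpenSolaris/illumos-era kernel, assuming the old zfs_vdev_max_pending tunable is
still present in your build:

    iostat -xn 1                                 # actv shows per-device queue depth, asvc_t the latency
    echo zfs_vdev_max_pending/W0t2 | mdb -kw     # cap the per-vdev queue at 2 on the running kernel

or, to make it persistent across reboots, add to /etc/system:

    set zfs:zfs_vdev_max_pending = 2

As always with mdb -kw, experiment on a non-production box first.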

SSDs are another story: they scale much better in both response time and
IOPS as the queue depth grows.

Has anyone else studied this?
 -- richard




Re: [zfs-discuss] Is ZFS internal reservation excessive?

2011-06-18 Thread Richard Elling
On Jun 17, 2011, at 4:07 PM, MasterCATZ wrote:
> 
>> 
> OK, what is the point of the RESERVE
> 
> when we cannot even delete a file once there is no space left !!!
> 
> If they are going to have a RESERVE they should make it a little smarter and
> maybe let the FS use some of that free space, so that when we do hit 0 bytes
> data can still be deleted, since there is over 50 GB free in the reserve ..

Is there a quota?
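For reference, the relevant properties can be checked in one shot (assuming a
zpool/zfs version recent enough to have the usedbysnapshots property):

    zfs get quota,refquota,reservation,refreservation,usedbysnapshots tank
    zfs list -t snapshot -r tank

A quota or refquota, or space pinned by snapshots, produces exactly this symptom;
on a completely full copy-on-write pool an rm needs a little free space to commit,
so destroying a snapshot is often the only way out.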
 -- richard

> 
> 
> # zfs list
> NAME   USED  AVAIL  REFER  MOUNTPOINT
> tank  2.68T  0  2.68T  /tank
> # zpool list
> NAME   SIZE  ALLOC   FREE   CAP  DEDUP  HEALTH  ALTROOT
> tank  3.64T  3.58T  58.2G   98%  1.00x  ONLINE  -
> 
> rm -f -r downloads
> rm: downloads: No space left on device
> 
> 
> 
> 
> 



Re: [zfs-discuss] OpenIndiana | ZFS | scrub | network | awful slow

2011-06-18 Thread Richard Elling

On Jun 16, 2011, at 3:36 PM, Sven C. Merckens wrote:

> Hi Roy, hi Dan,
> 
> many thanks for your responses.
> 
> I am using napp-it to control the OpenSolaris systems.
> The napp-it interface shows a dedup factor of 1.18x on System 1 and 1.16x on 
> System 2.

You're better off disabling dedup for this workload. If the dedup ratio were more
like 10, or 342, or at least some number > 2, then dedup can be worthwhile. IMHO,
data with a dedup ratio < 2 is not a good candidate for dedup.

You can look at existing, non-deduped data to get an estimate of the potential
dedup savings using:
    zdb -S poolname
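For example, using tank as a stand-in for your pool name (zdb -S only simulates the
dedup table, so it is safe to run against a live pool, though it can take a while
and use a fair amount of memory on a multi-TB pool):

    zdb -S tank                  # simulated DDT histogram plus an estimated dedup ratio
    zpool get dedupratio tank    # the live ratio for data already written with dedup=on

If the estimate comes back well under 2, the DDT's RAM and I/O cost is unlikely to
pay for itself.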

> Dedup is on "always" (not only at the start), and compression is also activated:
> System 1 = compression on (lzjb?)
> System 2 = compression on (gzip-6)
> compression rates:
> System 1 = 1.10x
> System 2 = 1.48x
> 
> Compression and dedup were among the primary reasons to choose ZFS in this
> situation.

Compression is a win for a large number of cases.  Dedup, not so much.
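A quick way to see whether compression is paying for itself on a given dataset
(the dataset name below is just a placeholder):

    zfs get compression,compressratio tank/data
    zfs set compression=on tank/data    # only affects blocks written after the change

Existing data is not rewritten when the property changes, so the ratio only reflects
writes made after compression was enabled.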

> 
> We tried more RAM (48GB) at the beginning to check whether it would do anything
> for performance. It did not, but at that time we had only about 3-4TB of data on
> the storage (and performance was good). So I will order some RAM modules and
> double the RAM to 48GB.
> 
> The RAM usage is about 21GB, with 3GB free (on both systems, after a while). At
> the start (after a few hours of usage and only 3-4TB of data) the usage was
> identical. I read in some places that ZFS will use all memory, leaving only about
> 1GB free (so I thought the RAM wasn't being used completely). Swap isn't used by
> the system.
> 
> Now the systems are idle and the RAM usage is very low:
> top:
> 
> System 1
> Memory: 24G phys mem, 21G free mem, 12G total swap, 12G free swap
> 
> System 2
> Memory: 24G phys mem, 21G free mem, 12G total swap, 12G free swap
> 
> 
> When I start to read about 12GB from System 2, the RAM usage goes up and the
> performance is at about 65-70MB/s via GigaBit (iSCSI).
> Memory: 24G phys mem, 3096M free mem, 12G total swap, 12G free swap
> 
> OK, I understand, more RAM will do no harm.. ;)
> 
> On System 1 there is no such massive change in RAM usage while copying files
> to and from the volume.
> But the performance is only about 20MB/s via GigaBit (iSCSI).

This is likely not a dedup issue. More likely, Nagle is biting you or there is a
serialization that is not immediately obvious. I have also seen questionable
network configurations cause strange slowdowns for iSCSI.
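If you want to rule Nagle out, the usual knob on the Solaris side is the
tcp_naglim_def ndd tunable (a sketch only -- it is system-wide, resets at reboot,
and whether it is the right fix depends on the iSCSI target/initiator in use):

    ndd -get /dev/tcp tcp_naglim_def    # check the current value
    ndd -set /dev/tcp tcp_naglim_def 1  # 1 effectively disables the Nagle algorithm

Then re-run the copy. Watching the session with snoop for ~200ms stalls is another
quick way to spot Nagle/delayed-ACK interaction.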
 -- richard

> 
> So RAM can't be the issue on System 1 (which has more data stored).
> This system is also equipped with a 240GB SSD used for L2ARC on the second
> LSI controller inside the server enclosure.
> 
> 
> Roy:
> But is the L2ARC also important while writing to the device? Because the
> storages are used most of the time only for writing data, the read cache (as I
> understood it) isn't a performance factor... Please correct me if my thinking
> is wrong...
> 
> But adding a 120GB/240GB OCZ Vertex 2 SSD to System 2 would be only a small
> additional cost (≈ 150/260 Euro), so I will give it a try.
> 
> Would it be better to attach the SSD to the LSI controller (putting it in the
> JBOD storage), or to put it in the server enclosure itself and connect it to
> the internal SATA controller?
> 
> 
> Do you have any tips for the dataset settings?
> These are the current settings:
> 
> 
> PROPERTY  System 1System 2
> used  34.4T   19.4T
> available 10.7T   40.0T
> referenced34.4T   19.4T
> compressratio 1.10x   1.43x
> mounted   yes yes
> quota nonenone
> reservation   nonenone
> recordsize128K128k
> mountpoint/   /
> sharenfs  off off
> checksum  on  on
> compression   on  gzip
> atime off off
> devices   on  on
> exec  on  on
> setuidon  on
> readonly  off off
> zoned off off
> snapdir   hidden  hidden
> aclinheritpassthrough passthrough
> canmount  on  on
> xattr on  on
> copie

Re: [zfs-discuss] zfs global hot spares?

2011-06-18 Thread Richard Elling
more below...

On Jun 16, 2011, at 2:27 AM, Fred Liu wrote:

> Fixing a typo in my last thread...
> 
>> -Original Message-
>> From: Fred Liu
>> Sent: Thursday, June 16, 2011 17:22
>> To: 'Richard Elling'
>> Cc: Jim Klimov; zfs-discuss@opensolaris.org
>> Subject: RE: [zfs-discuss] zfs global hot spares?
>> 
>>> This message is from the disk saying that it aborted a command. These are
>>> usually preceded by a reset, as shown here. What caused the reset condition?
>>> Was it actually target 11 or did target 11 get caught up in the reset storm?
>>> 
>> 
> It happened in the middle of the night and nobody touched the file box.
> I assume it is the transitional state before the disk is *thoroughly*
> damaged:
> 
> Jun 10 09:34:11 cn03 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD,
> TYPE: Fault, VER: 1, SEVERITY: Major
> Jun 10 09:34:11 cn03 EVENT-TIME: Fri Jun 10 09:34:11 CST 2011
> Jun 10 09:34:11 cn03 PLATFORM: X8DTH-i-6-iF-6F, CSN: 1234567890,
> HOSTNAME: cn03
> Jun 10 09:34:11 cn03 SOURCE: zfs-diagnosis, REV: 1.0
> Jun 10 09:34:11 cn03 EVENT-ID: 4f4bfc2c-f653-ed20-ab13-eef72224af5e
> Jun 10 09:34:11 cn03 DESC: The number of I/O errors associated with a
> ZFS device exceeded
> Jun 10 09:34:11 cn03 acceptable levels.  Refer to
> http://sun.com/msg/ZFS-8000-FD for more information.
> Jun 10 09:34:11 cn03 AUTO-RESPONSE: The device has been offlined and
> marked as faulted.  An attempt
> Jun 10 09:34:11 cn03 will be made to activate a hot spare if
> available.
> Jun 10 09:34:11 cn03 IMPACT: Fault tolerance of the pool may be
> compromised.
> Jun 10 09:34:11 cn03 REC-ACTION: Run 'zpool status -x' and replace the
> bad device.

zpool status -x output would be useful. These error reports do not include a
pointer to the faulty device. fmadm can also give more info.
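Roughly, the trio that gives the full picture:

    zpool status -xv        # which pool/vdev is degraded, plus per-device error counters
    fmadm faulty            # the FMA view: fault class, affected device, and the event ID above
    fmdump -eV | tail -200  # the underlying ereports that drove the diagnosis

fmdump -eV in particular shows whether the errors are spread across several targets
or pinned to a single disk.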

> 
> After I rebooted it, I got:
> Jun 10 11:38:49 cn03 genunix: [ID 540533 kern.notice] ^MSunOS Release
> 5.11 Version snv_134 64-bit
> Jun 10 11:38:49 cn03 genunix: [ID 683174 kern.notice] Copyright 1983-2010
> Sun Microsystems, Inc.  All rights reserved.
> Jun 10 11:38:49 cn03 Use is subject to license terms.
> Jun 10 11:38:49 cn03 unix: [ID 126719 kern.info] features:
> 7f7f t,sse2,sse,sep,pat,cx8,pae,mca,mmx,cmov,de,pge,mtrr,msr,tsc,lgpg>
> 
> Jun 10 11:39:06 cn03 scsi: [ID 365881 kern.info]
> /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
> Jun 10 11:39:06 cn03    mptsas0 unrecognized capability 0x3
> 
> Jun 10 11:39:42 cn03 scsi: [ID 107833 kern.warning] WARNING:
> /scsi_vhci/disk@g5000c50009723937 (sd3):
> Jun 10 11:39:42 cn03    drive offline
> Jun 10 11:39:47 cn03 scsi: [ID 107833 kern.warning] WARNING:
> /scsi_vhci/disk@g5000c50009723937 (sd3):
> Jun 10 11:39:47 cn03    drive offline
> Jun 10 11:39:52 cn03 scsi: [ID 107833 kern.warning] WARNING:
> /scsi_vhci/disk@g5000c50009723937 (sd3):
> Jun 10 11:39:52 cn03    drive offline
> Jun 10 11:39:57 cn03 scsi: [ID 107833 kern.warning] WARNING:
> /scsi_vhci/disk@g5000c50009723937 (sd3):
> Jun 10 11:39:57 cn03    drive offline

mpathadm can be used to determine the device paths for this disk.
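Roughly (the logical-unit name below is guessed from the WWN in the messages above,
so adjust it to whatever your box actually shows):

    mpathadm list lu
    mpathadm show lu /dev/rdsk/c0t5000C50009723937d0s2

show lu lists every initiator/target port pair for the LU along with its path state,
which tells you whether all paths are flapping or just one.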

Notice how the disk goes offline multiple times. There is some sort of
recovery going on here that continues to fail later. I call these "wounded
soldiers" because they take a lot more care than a dead soldier. You
would be better off if the drive completely died.

>> 
>> 
>>> 
>>> Hot spare will not help you here. The problem is not constrained to
>> one
>>> disk.
>>> In fact, a hot spare may be the worst thing here because it can kick
>> in
>>> for the disk
>>> complaining about a clogged expander or spurious resets.  This causes
>> a
>>> resilver
>>> that reads from the actual broken disk, that causes more resets, that
>>> kicks out another
>>> disk that causes a resilver, and so on.
>>> -- richard
>>> 
>> 
> So warm spares could be the "better" choice in this situation?
> BTW, under what conditions will the SCSI reset storm happen?

In my experience they start randomly and in some cases are not reproducible.

> How can we be immune to this so as NOT to interrupt the file
> service?

Are you asking for fault tolerance?  If so, then you need a fault-tolerant system
like a Tandem. If you are asking for a way to build a cost-effective solution using
commercial, off-the-shelf (COTS) components, then that is far beyond what can be
easily said in a forum posting.
 -- richard




[zfs-discuss] Finding disks [was: # disks per vdev]

2011-06-18 Thread Richard Elling
On Jun 17, 2011, at 12:55 AM, Lanky Doodle wrote:

> Thanks Richard.
> 
> How does ZFS enumerate the disks? In terms of listing them, does it do so
> logically, i.e.:
> 
> controller #1 (motherboard)
>|
>|--- disk1
>|--- disk2
> controller #3
>|--- disk3
>|--- disk4
>|--- disk5
>|--- disk6
>|--- disk7
>|--- disk8
>|--- disk9
>|--- disk10
> controller #4
>|--- disk11
>|--- disk12
>|--- disk13
>|--- disk14
>|--- disk15
>|--- disk16
>|--- disk17
>|--- disk18
> 
> or is it completely random, leaving me with some trial and error to work out
> which disk is on which port?

For all intents and purposes, it is random.

Slot locations are the responsibility of the enclosure, not the disk. Until we get
a better framework integrated into illumos, you can get the bay location from a
SES-compliant enclosure via the fmtopo output, lsiutil, or the sg3_utils tools. For
NexentaStor users I provide some automation for this in a KB article on the customer
portal. Also for NexentaStor users, DataON offers a GUI plugin called DSM that shows
the enclosure, blinky lights, and all of the status information available -- power
supplies, fans, etc. -- good stuff!

For the curious, fmtopo shows the bay for each disk and the serial number of the
disk therein. You can then cross-reference the c*t*d* number for the OS instance to
the serial number. Note that for dual-ported disks, you can get different c*t*d*
numbers for each node connected to the disk (rare, but possible).
Caveat: please verify, prior to rolling into production, that the bay number matches
the enclosure silkscreen. The numbers are programmable, and different vendors
deliver the same enclosure with different silkscreened numbers. As always, the disk
serial number is supposed to be unique, so you can test this very easily.
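A rough recipe for the cross-reference (fmtopo lives under /usr/lib/fm/fmd and is
not in the default PATH; the grep pattern is just illustrative):

    /usr/lib/fm/fmd/fmtopo -V | less          # look for the bay=N nodes and their serial-number properties
    iostat -En | egrep 'Errors:|Serial No'    # pairs each cXtYdZ with its serial number

Match the serial numbers from the two outputs and you have the bay-to-device
mapping, subject to the silkscreen caveat above.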

For the later Nexenta, OpenSolaris, or Solaris 11 Express releases, the mpt_sas
driver will try to light the OK2RM (ok to remove) LED for a disk when you use cfgadm
to disconnect the paths. Apparently this also works for SATA disks in an enclosure
that manages SATA disks. The process is documented very nicely by Cindy in the ZFS
Admin Guide. However, there are a number of enclosures that do not have an OK2RM
LED. YMMV.
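The flow looks roughly like this -- the Ap_Id below is made up, take the real one
from cfgadm -al on your own system:

    cfgadm -al | grep -i disk                # find the attachment point for the drive
    cfgadm -c unconfigure c4::dsk/c4t11d0    # paths go away; OK2RM lights if the enclosure supports it
    (physically swap the drive)
    cfgadm -c configure c4::dsk/c4t11d0

Whether the LED actually lights still depends on the enclosure's SES implementation,
per the caveat above.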
 -- richard



[zfs-discuss] read data block

2011-06-18 Thread Henry Lau
Does anyone know how to walk back to the root if we are at a data block (level 0)
in ZFS? Is this possible, given that a data block can be picked at random and we
don't know the parent indirect block at level 0?

Thanks
Henry


[zfs-discuss] Question about drive LEDs

2011-06-18 Thread Roy Sigurd Karlsbakk
Hi all

I have a few machines set up with OI 148, and I can't make the LEDs on the drives
work when something goes bad. The chassis are Supermicro ones and work well
normally. Any idea how to make the drive LEDs work with this setup?

-- 
Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It
is an elementary imperative for all pedagogues to avoid excessive use of idioms of
foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.


Re: [zfs-discuss] # disks per vdev

2011-06-18 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Marty Scholes
> 
> On a busy array it is hard even to use the leds as indicators.

Offline the disk.  Light stays off.
Use dd to read the disk.  Light stays on.
That should make it easy enough.
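In other words, something along these lines (device names are placeholders):

    zpool offline tank c5t3d0
    dd if=/dev/rdsk/c5t3d0s0 of=/dev/null bs=1024k    # activity LED on that one bay stays lit
    zpool online tank c5t3d0                          # don't forget, or the pool stays degraded

Reading the raw device is harmless while the disk is offlined; ZFS just won't issue
its own I/O to it in the meantime.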

Also, depending on your HBA, you can often blink an amber LED instead of the
standard green one.



Re: [zfs-discuss] question about COW and snapshots

2011-06-18 Thread Toby Thain
On 18/06/11 12:44 AM, Michael Sullivan wrote:
> ...
> Way off-topic, but Smalltalk and its variants do this by maintaining the 
> state of everything in an operating environment image.
> 

...Which is in memory, so things are rather different from the world of
filesystems.

--Toby

> But then again, I could be wrong.
> 
> Mike
> 
> ---
> Michael Sullivan   
> m...@axsh.us
> http://www.axsh.us/
> Phone: +1-662-259-
> Mobile: +1-662-202-7716
> 
> 
