Re: [zfs-discuss] ZFS upgrade.
Hi John,

On 08/01/2010, at 7:19 AM, john_dil...@blm.gov wrote:
> Is there a way to upgrade my current ZFS version. I show the version could be as high as 22.

The version of Solaris you are running only supports ZFS versions up to 15, as demonstrated by your zfs upgrade -v output. You probably need a newer version of Solaris, but I cannot tell you whether any newer release supports later ZFS versions. This forum is for OpenSolaris support. You should contact your Solaris support provider for further help on this matter.

cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
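For reference, a quick way to compare the on-disk versions against what the running OS supports (a sketch; the pool name tank is hypothetical and the output varies by release):

  zpool upgrade -v        # pool versions this kernel supports
  zfs upgrade -v          # filesystem versions this kernel supports
  zpool get version tank  # the pool's current on-disk version
  zfs get version tank    # a filesystem's current version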
Re: [zfs-discuss] How to destroy your system in funny way with ZFS
Hi Tomas,

On 27/12/2009, at 7:25 PM, Tomas Bodzar wrote:
> pfexec zpool set dedup=verify rpool
> pfexec zfs set compression=gzip-9 rpool
> pfexec zfs set devices=off rpool/export/home
> pfexec zfs set exec=off rpool/export/home
> pfexec zfs set setuid=off rpool/export/home

grub doesn't support gzip, so you will need to unset that and hope the system can still boot with what has already been written to disk. It is possible you will need to backup/reinstall.

I learnt this one the hard way: don't use gzip compression on the root of your rpool (you can use it on child filesystems that are not involved in the boot process, though).

HTH, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
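A minimal sketch of backing out the setting before the next boot, assuming the default lzjb (compression=on) is acceptable. Note that blocks already written with gzip stay gzip-compressed until rewritten:

  pfexec zfs set compression=on rpool                  # grub can read lzjb-compressed data
  pfexec zfs set compression=gzip-9 rpool/export/home  # gzip is fine off the boot path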
Re: [zfs-discuss] will deduplication know about old blocks?
On 10/12/2009, at 5:36 AM, Adam Leventhal wrote: > The dedup property applies to all writes so the settings for the pool of > origin don't matter, just those on the destination pool. Just a quick related question I’ve not seen answered anywhere else: Is it safe to have dedup running on your rpool? (at install time, or if you need to migrate your rpool to new media) cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ZIL/log on SSD weirdness
On 18/11/2009, at 7:33 AM, Dushyanth wrote:
> Now when i run dd and create a big file on /iftraid0/fs and watch `iostat -xnz 2` i dont see any stats for c8t4d0 nor does the write performance improve.
>
> I have not formatted either c9t9d0 or c8t4d0. What am i missing ?

Last I checked, iSCSI volumes go direct to the primary storage and not via the slog device. Can anybody confirm whether that is the case, whether there is a mechanism/tunable to force them via the slog, and whether there is any benefit/point in this for most cases?

cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
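One way to check where the writes are actually landing is to watch per-vdev activity, slog included, while the test runs (a sketch; the pool name tank is hypothetical):

  zpool iostat -v tank 2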
Re: [zfs-discuss] dedupe is in
On 03/11/2009, at 7:32 AM, Daniel Streicher wrote:
> But how can I "update" my current OpenSolaris (2009.06) or Solaris 10 (5/09) to use this. Or do I have to wait for a new stable release of Solaris 10 / OpenSolaris?

For OpenSolaris, you change your repository and switch to the development branches - dedup should be available to the public in about 3-3.5 weeks time. There are plenty of instructions on how to do this on the net and in this list. For Solaris, you need to wait for the next update release.

cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
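Roughly, the repository switch looks like this (a sketch as of 2009.06; the dev repository URL is from memory and may change):

  pfexec pkg set-publisher -O http://pkg.opensolaris.org/dev opensolaris.org
  pfexec pkg image-update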
Re: [zfs-discuss] Solaris 10 samba in AD mode broken when user in > 32 AD groups
On 14/10/2009, at 2:27 AM, casper@sun.com wrote:
> So why not the built-in CIFS support in OpenSolaris? Probably has a similar issue, but still.

In my case, there are at least two reasons:

* Crossing mountpoints requires separate shares - Samba can share an entire hierarchy regardless of the ZFS filesystems beneath the sharepoint.
* LDAP integration - the in-kernel CIFS only supports real AD (LDAP+krb5) for directory binding, otherwise all users must have separately managed local system accounts.

Until these features are available via the in-kernel CIFS implementation, I'm forced to stick with Samba for our CIFS needs.

cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
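To illustrate the first point, a sketch (paths and share names hypothetical): one Samba share can expose an entire tree of ZFS filesystems, while the in-kernel server shares each filesystem separately:

  # smb.conf - a single share covering everything under /tank/home
  [home]
      path = /tank/home
      writable = yes

  # in-kernel CIFS - the property is inherited by descendants,
  # but every filesystem still becomes its own share
  zfs set sharesmb=on tank/home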
Re: [zfs-discuss] periodic slow responsiveness
On 26/09/2009, at 1:14 AM, Ross Walker wrote:
> By any chance do you have copies=2 set?

No, only 1. So the double data going to the slog (as reported by iostat) is still confusing me, and clearly potentially causing significant harm to my performance.

> Also, try setting zfs_write_limit_override equal to the size of the NVRAM cache (or half, depending on how long it takes to flush):
>
> echo zfs_write_limit_override/W0t268435456 | mdb -kw

That's an interesting concept. All data still appears to go via the slog device; however, under heavy load my response time to a new write is typically below 2s (a few outliers at about 3.5s), and a read (directory listing of a non-cached entry) is about 2s.

What will this do once it hits the limit? Will streaming writes now be sent directly to a txg and streamed to the primary storage devices? (That is what I would like to see happen.)

> As a side note, an slog device will not be too beneficial for large sequential writes, because it will be throughput bound, not latency bound. slog devices really help when you have lots of small sync writes. A RAIDZ2 with the ZIL spread across it will provide much higher throughput than an SSD. An example of a workload that benefits from an slog device is ESX over NFS, which does a COMMIT for each block written, so it benefits from an slog, but a standard media server will not (though an L2ARC would be beneficial). Better workload analysis is really what it is about.

It seems that it doesn't matter what the workload is if the NFS pipe can sustain more continuous throughput than the slog chain can support.

I suppose some creative use of the logbias setting might assist this situation and force all potentially heavy writers directly to the primary storage. This would, however, negate any benefit of having a fast, low latency device for those filesystems at the times when it is desirable (any large batch of small writes, for example). Is there a way to have a dynamic, auto logbias type setting depending on the transaction currently presented to the server, such that if it is clearly a large streaming write it gets treated as logbias=throughput, and if it is a small transaction it gets treated as logbias=latency? (i.e. such that NFS transactions can effectively be treated as if they were local storage, at the cost of slightly reducing the benefits of the txg scheduling)

On 26/09/2009, at 3:39 AM, Richard Elling wrote:
> Back of the envelope math says:
> 10 GbE = ~1 GByte/sec of I/O capacity
> If the SSD can only sink 70 MByte/s, then you will need:
> int(1000/70) + 1 = 15 SSDs for the slog
> For capacity, you need:
> 1 GByte/sec * 30 sec = 30 GBytes
> Ross' idea has merit, if the size of the NVRAM in the array is 30 GBytes or so. At this point, enter the fusionIO cards or similar devices.

Unfortunately there does not seem to be anything on the market with infinitely fast write capacity (memory speeds) that is also supported under OpenSolaris as a slog device. I think this is precisely what I (and anybody running a general purpose NFS server) need for a general purpose slog device.

> Both of the above assume there is lots of memory in the server. This is increasingly becoming easier to do as memory costs come down and you can physically fit 512 GBytes in a 4u server. By default, the txg commit will occur when 1/8 of memory is used for writes. For 30 GBytes, that would mean a main memory of only 240 GBytes... feasible for modern servers. However, most folks won't stomach 15 SSDs for slog or 30 GBytes of NVRAM in their arrays. So Bob's recommendation of reducing the txg commit interval below 30 seconds also has merit. Or, to put it another way, the dynamic sizing of the txg commit interval isn't quite perfect yet. [Cue for Neil to chime in... :-)]

How does reducing the txg commit interval really help? Will data no longer go via the slog once it is streaming to disk? Or will all data still be pushed through the slog regardless?

For a predominantly NFS server purpose, it really looks like the primary criterion is that the slog has to outperform your main pool for continuous write speed as well as offering an instant response time. Which might as well be a fast (or group of fast) SSDs, or 15kRPM drives with some NVRAM in front of them.

Is there also a way to throttle synchronous writes to the slog device? Much like the ZFS write throttling that is already implemented, so that there is a gap for new writers to enter when writing to the slog device? (Or is this the norm, and does it already include slog writes?)

cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
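For reference, the static per-filesystem version of that idea looks like this (a sketch; the dataset names are hypothetical, and the logbias property requires a build recent enough to have it):

  zfs set logbias=throughput tank/builds   # bulk writers bypass the slog
  zfs set logbias=latency tank/home        # small sync writes still use the slog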
Re: [zfs-discuss] periodic slow responsiveness
I thought I would try the same test using dd bs=131072 if=source of=/path/to/nfs to see what the results looked like... It is very similar to before: about 2x slog usage and the same timing and write totals.

Friday, 25 September 2009 1:49:48 PM EST
                            extended device statistics              ---- errors ----
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0 1538.7    0.0 196834.0  0.0 23.1    0.0   15.0   2  67   0   0   0   0 c7t2d0
    0.0  562.0    0.0  71942.3  0.0 35.0    0.0   62.3   1 100   0   0   0   0 c7t2d0
    0.0  590.7    0.0  75614.4  0.0 35.0    0.0   59.2   1 100   0   0   0   0 c7t2d0
    0.0  600.9    0.0  76920.0  0.0 35.0    0.0   58.2   1 100   0   0   0   0 c7t2d0
    0.0  546.0    0.0  69887.9  0.0 35.0    0.0   64.1   1 100   0   0   0   0 c7t2d0
    0.0  554.0    0.0  70913.9  0.0 35.0    0.0   63.2   1 100   0   0   0   0 c7t2d0
    0.0  598.0    0.0  76549.2  0.0 35.0    0.0   58.5   1 100   0   0   0   0 c7t2d0
    0.0  563.0    0.0  72065.1  0.0 35.0    0.0   62.1   1 100   0   0   0   0 c7t2d0
    0.0  588.1    0.0  75282.6  0.0 31.5    0.0   53.5   1 100   0   0   0   0 c7t2d0
    0.0  564.0    0.0  72195.7  0.0 34.8    0.0   61.7   1 100   0   0   0   0 c7t2d0
    0.0  582.8    0.0  74599.8  0.0 35.0    0.0   60.0   1 100   0   0   0   0 c7t2d0
    0.0  544.0    0.0  69633.3  0.0 35.0    0.0   64.3   1 100   0   0   0   0 c7t2d0
    0.0  530.0    0.0  67191.5  0.0 30.6    0.0   57.7   0  90   0   0   0   0 c7t2d0

And then the write to primary storage a few seconds later:

Friday, 25 September 2009 1:50:14 PM EST
                            extended device statistics              ---- errors ----
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0  426.3    0.0  32196.3  0.0 12.7    0.0   29.8   1  45   0   0   0   0 c11t0d0
    0.0  410.4    0.0  31857.1  0.0 12.4    0.0   30.3   1  45   0   0   0   0 c11t1d0
    0.0  426.3    0.0  30698.1  0.0 13.0    0.0   30.5   1  45   0   0   0   0 c11t2d0
    0.0  429.3    0.0  31392.3  0.0 12.6    0.0   29.4   1  45   0   0   0   0 c11t3d0
    0.0  443.2    0.0  33280.8  0.0 12.9    0.0   29.1   1  45   0   0   0   0 c11t4d0
    0.0  424.3    0.0  33872.4  0.0 12.7    0.0   30.0   1  45   0   0   0   0 c11t5d0
    0.0  432.3    0.0  32903.2  0.0 12.6    0.0   29.2   1  45   0   0   0   0 c11t6d0
    0.0  418.3    0.0  32562.0  0.0 12.5    0.0   29.9   1  45   0   0   0   0 c11t7d0
    0.0  417.3    0.0  31746.2  0.0 12.4    0.0   29.8   1  44   0   0   0   0 c11t8d0
    0.0  424.3    0.0  31270.6  0.0 12.7    0.0   29.9   1  45   0   0   0   0 c11t9d0

Friday, 25 September 2009 1:50:15 PM EST
                            extended device statistics              ---- errors ----
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0  434.9    0.0  37028.5  0.0 17.3    0.0   39.7   1  52   0   0   0   0 c11t0d0
    1.0  436.9   64.3  37372.1  0.0 17.1    0.0   39.0   1  51   0   0   0   0 c11t1d0
    1.0  442.9   64.3  38543.2  0.0 17.2    0.0   38.7   1  52   0   0   0   0 c11t2d0
    1.0  436.9   64.3  37834.2  0.0 17.3    0.0   39.6   1  52   0   0   0   0 c11t3d0
    1.0  412.8   64.3  35935.0  0.0 16.8    0.0   40.7   0  52   0   0   0   0 c11t4d0
    1.0  413.8   64.3  35342.5  0.0 16.6    0.0   40.1   0  51   0   0   0   0 c11t5d0
    2.0  418.8  128.6  36321.3  0.0 16.5    0.0   39.3   0  52   0   0   0   0 c11t6d0
    1.0  425.8   64.3  36660.4  0.0 16.6    0.0   39.0   1  51   0   0   0   0 c11t7d0
    1.0  437.9   64.3  37484.0  0.0 17.2    0.0   39.2   1  52   0   0   0   0 c11t8d0
    0.0  437.9    0.0  37968.1  0.0 17.2    0.0   39.2   1  52   0   0   0   0 c11t9d0

So: 533MB source file, 13 seconds to write to the slog (14 before, no appreciable change), 1071.5MB written to the slog, 692.3MB written to primary storage. Just another data point.

cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] periodic slow responsiveness
On 25/09/2009, at 11:49 AM, Bob Friesenhahn wrote:
> The commentary says that normally the COMMIT operations occur during close(2) or fsync(2) system call, or when encountering memory pressure. If the problem is slow copying of many small files, this COMMIT approach does not help very much since very little data is sent per file and most time is spent creating directories and files.

The problem appears to be slog bandwidth exhaustion, due to all data being sent via the slog, creating contention for all following NFS or locally synchronous writes. The NFS writes do not appear to be synchronous in nature - there is only a COMMIT being issued at the very end - however, all of that data appears to be going via the slog, and it appears to be inflating to twice its original size.

For a test, I just copied a relatively small file (8.4MB in size). Looking at a tcpdump analysis using wireshark, there is a SETATTR which ends with a V3 COMMIT, and no COMMIT messages during the transfer. iostat output that matches looks like this: slog write of the data (17MB appears to hit the slog):

Friday, 25 September 2009 1:01:00 PM EST
                            extended device statistics              ---- errors ----
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0  135.0    0.0  17154.5  0.0  0.8    0.0    6.0   0   3   0   0   0   0 c7t2d0

then a few seconds later, the transaction group gets flushed to primary storage, writing nearly 11.4MB, which is in line with RAIDZ2 (expect around 10.5MB; 8.4/8*10):

Friday, 25 September 2009 1:01:13 PM EST
                            extended device statistics              ---- errors ----
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0   91.0    0.0   1170.4  0.0  0.1    0.0    1.3   0   2   0   0   0   0 c11t0d0
    0.0   84.0    0.0   1171.4  0.0  0.1    0.0    1.2   0   2   0   0   0   0 c11t1d0
    0.0   92.0    0.0   1172.4  0.0  0.1    0.0    1.2   0   2   0   0   0   0 c11t2d0
    0.0   84.0    0.0   1172.4  0.0  0.1    0.0    1.3   0   2   0   0   0   0 c11t3d0
    0.0   81.0    0.0   1176.4  0.0  0.1    0.0    1.4   0   2   0   0   0   0 c11t4d0
    0.0   86.0    0.0   1176.4  0.0  0.1    0.0    1.4   0   2   0   0   0   0 c11t5d0
    0.0   89.0    0.0   1175.4  0.0  0.1    0.0    1.4   0   2   0   0   0   0 c11t6d0
    0.0   84.0    0.0   1175.4  0.0  0.1    0.0    1.3   0   2   0   0   0   0 c11t7d0
    0.0   91.0    0.0   1168.9  0.0  0.1    0.0    1.3   0   2   0   0   0   0 c11t8d0
    0.0   89.0    0.0   1170.9  0.0  0.1    0.0    1.4   0   2   0   0   0   0 c11t9d0

So I performed the same test with a much larger file (533MB) to see what it would do, being larger than the NVRAM cache in front of the SSD. Note that after the second second of activity the NVRAM is full and only letting in about the sequential write speed of the SSD (~70MB/s).

Friday, 25 September 2009 1:13:14 PM EST
                            extended device statistics              ---- errors ----
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0  640.9    0.0  81782.9  0.0  4.2    0.0    6.5   1  14   0   0   0   0 c7t2d0
    0.0 1065.7    0.0 136408.1  0.0 18.6    0.0   17.5   1  78   0   0   0   0 c7t2d0
    0.0  579.0    0.0  74113.3  0.0 30.7    0.0   53.1   1 100   0   0   0   0 c7t2d0
    0.0  588.7    0.0  75357.0  0.0 33.2    0.0   56.3   1 100   0   0   0   0 c7t2d0
    0.0  532.0    0.0  68096.3  0.0 31.5    0.0   59.1   1 100   0   0   0   0 c7t2d0
    0.0  559.0    0.0  71428.0  0.0 32.5    0.0   58.1   1 100   0   0   0   0 c7t2d0
    0.0  542.0    0.0  68755.9  0.0 25.1    0.0   46.4   1 100   0   0   0   0 c7t2d0
    0.0  542.0    0.0  69376.4  0.0 35.0    0.0   64.6   1 100   0   0   0   0 c7t2d0
    0.0  581.0    0.0  74368.0  0.0 30.6    0.0   52.6   1 100   0   0   0   0 c7t2d0
    0.0  567.0    0.0  72574.1  0.0 33.2    0.0   58.6   1 100   0   0   0   0 c7t2d0
    0.0  564.0    0.0  72194.1  0.0 31.1    0.0   55.2   1 100   0   0   0   0 c7t2d0
    0.0  573.0    0.0  73343.5  0.0 33.2    0.0   57.9   1 100   0   0   0   0 c7t2d0
    0.0  536.3    0.0  68640.5  0.0 33.1    0.0   61.7   1 100   0   0   0   0 c7t2d0
    0.0  121.9    0.0  15608.9  0.0  2.7    0.0   22.1   0  22   0   0   0   0 c7t2d0

Again, the slog wrote about double the file size (1022.6MB), and a few seconds later the data was pushed to the primary storage (684.9MB, with an expectation of 666MB = 533MB/8*10), so again about the right amount hit the spinning platters.

Friday, 25 September 2009 1:13:43 PM EST
                            extended device statistics              ---- errors ----
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0  338.3    0.0  32794.4  0.0 13.7    0.0   40.6   1  47   0   0   0   0 c11t0d0
    0.0  325.3    0.0  31399.8  0.0 13.7    0.0   42.0   1
Re: [zfs-discuss] periodic slow responsiveness
On 25/09/2009, at 1:24 AM, Bob Friesenhahn wrote:
> On Thu, 24 Sep 2009, James Lever wrote:
>> Is there a way to tune this on the NFS server or clients such that when I perform a large synchronous write, the data does not go via the slog device?
>
> Synchronous writes are needed by NFS to support its atomic write requirement. It sounds like your SSD is write-bandwidth bottlenecked rather than IOPS bottlenecked. Replacing your SSD with a more performant one seems like the first step.
>
> NFS client tunings can make a big difference when it comes to performance. Check the nfs(5) manual page for your Linux systems to see what options are available. An obvious tunable is 'wsize' which should ideally match (or be a multiple of) the zfs filesystem block size. The /proc/mounts file for my Debian install shows that 1048576 is being used. This is quite large and perhaps a smaller value would help. If you are willing to accept the risk, using the Linux 'async' mount option may make things seem better. From the Linux NFS FAQ (http://nfs.sourceforge.net/): NFS Version 3 introduces the concept of "safe asynchronous writes." And it continues.

My rsize and wsize are negotiating to 1MB.

James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
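For the record, a client-side wsize override on Linux looks something like this (a sketch; server path and mount point are hypothetical - 32 kByte writes would stay at or under the slog cut-over threshold discussed elsewhere in this thread):

  mount -t nfs -o vers=3,rsize=32768,wsize=32768,hard,tcp server:/export/home /mnt/home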
Re: [zfs-discuss] periodic slow responsiveness
On 25/09/2009, at 2:58 AM, Richard Elling wrote:
> On Sep 23, 2009, at 10:00 PM, James Lever wrote:
>> So it turns out that the problem is that all writes coming via NFS are going through the slog. When that happens, the transfer speed to the device drops to ~70MB/s (the write speed of this SLC SSD) and until the load drops all new write requests are blocked, causing a noticeable delay (which has been observed to be up to 20s, but generally only 2-4s).
>
> Thank you sir, can I have another? If you add (not attach) more slogs, the workload will be spread across them. But...

My log configuration is:

logs
  c7t2d0s0  ONLINE   0 0 0
  c7t3d0s0  OFFLINE  0 0 0

I'm going to test the now removed SSD and see if I can get it to perform significantly worse than the first one, but my memory from pre-production testing was that they were both equally slow and not significantly different.

>> On a related note, I had 2 of these devices (both using just 10GB partitions) connected as log devices (so the pool had 2 separate log devices) and the second one was consistently running significantly slower than the first. Removing the second device made an improvement on performance, but did not remove the occasional observed pauses.
>
> ...this is not surprising, when you add a slow slog device. This is the weakest link rule.

So, in theory, even if one of the two SSDs was only slightly slower than the other, it would just appear to be more heavily affected?

Here is part of what I'm not understanding - unless one SSD is significantly worse than the other, how can the following scenario be true? Here is some iostat output from the two slog devices at 1s intervals when they get a large series of write requests. Idle at start:

    0.0 1462.0    0.0 187010.2  0.0 28.6    0.0   19.6   2  83   0   0   0   0 c7t2d0
    0.0  233.0    0.0  29823.7  0.0 28.7    0.0  123.3   0  83   0   0   0   0 c7t3d0

NVRAM cache close to full (256MB BBC):

    0.0   84.0    0.0  10622.0  0.0  3.5    0.0   41.2   0  12   0   0   0   0 c7t2d0
    0.0    0.0    0.0      0.0  0.0 35.0    0.0    0.0   0 100   0   0   0   0 c7t3d0

    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t2d0
    0.0  305.0    0.0  39039.3  0.0 35.0    0.0  114.7   0 100   0   0   0   0 c7t3d0

    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t2d0
    0.0  361.0    0.0  46208.1  0.0 35.0    0.0   96.8   0 100   0   0   0   0 c7t3d0

    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t2d0
    0.0  329.0    0.0  42114.0  0.0 35.0    0.0  106.3   0 100   0   0   0   0 c7t3d0

    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t2d0
    0.0  317.0    0.0  40449.6  0.0 27.4    0.0   86.5   0  85   0   0   0   0 c7t3d0

    0.0    4.0    0.0    263.8  0.0  0.0    0.0    0.2   0   0   0   0   0   0 c7t2d0
    0.0    4.0    0.0    367.8  0.0  0.0    0.0    0.3   0   0   0   0   0   0 c7t3d0

What determines the size of the writes, or the distribution between slog devices? It looks like ZFS decided to send a large chunk to one slog, which nearly filled the NVRAM, and then continued writing to the other one, which meant that it had to go at device speed (whatever that is for the data size/write size). Is there a way to tune the writes to multiple slogs to be (for argument's sake) 10MB slices?

>> I was of the (mis)understanding that only metadata and writes smaller than 64k went via the slog device in the event of an O_SYNC write request?
>
> The threshold is 32 kBytes, which is unfortunately the same as the default NFS write size. See CR 6686887 http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6686887 If you have a slog and logbias=latency (default) then the writes go to the slog. So there is some interaction here that can affect NFS workloads in particular.

Interesting CR.

nfsstat -m output on one of the linux hosts (ubuntu):

Flags: rw,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nointr,noacl,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.1.0.17,mountvers=3,mountproto=tcp,addr=10.1.0.17

rsize and wsize are auto-tuned to 1MB. How does this affect the sync request threshold?

>> The clients are (mostly) RHEL5. Is there a way to tune this on the NFS server or clients such that when I perform a large synchronous write, the data does not go via the slog device?
>
> You can change the IOP size on the client.

You're suggesting modifying rsize/wsize? Or something else?

cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] periodic slow responsiveness
On 08/09/2009, at 2:01 AM, Ross Walker wrote:
> On Sep 7, 2009, at 1:32 AM, James Lever wrote:
>
> Well, an MD1000 holds 15 drives, so a good compromise might be two 7-drive RAIDZ2s with a hotspare... That should provide 320 IOPS instead of 160 - a big difference.

The issue is interactive responsiveness and whether there is a way to tune the system to give that while still having good performance for builds when they are run.

> Look at the write IOPS of the pool with zpool iostat -v and look at how many are happening on the RAIDZ2 vdev.

I was suggesting that slog writes were possibly starving reads from the l2arc, as they were on the same device. This appears not to have been the issue, as the problem has persisted even with the l2arc devices removed from the pool.

> The SSD will handle a lot more IOPS than the pool, and L2ARC is a lazy reader; it mostly just holds on to read cache data. It just may be that the pool configuration can't handle the write IOPS needed and reads are starving.

Possible, but hard to tell. Have a look at the iostat results I've posted.

> The busy times of the disks while the issue is occurring should let you know.

So it turns out that the problem is that all writes coming via NFS are going through the slog. When that happens, the transfer speed to the device drops to ~70MB/s (the write speed of this SLC SSD) and until the load drops all new write requests are blocked, causing a noticeable delay (which has been observed to be up to 20s, but generally only 2-4s).

I can reproduce this behaviour by copying a large file (hundreds of MB in size) using 'cp src dst' on an NFS (still currently v3) client and observing that all data is pushed through the slog device (a 10GB partition of a Samsung 50GB SSD behind a PERC 6/i w/256MB BBC) rather than going direct to the primary storage disks.

On a related note, I had 2 of these devices (both using just 10GB partitions) connected as log devices (so the pool had 2 separate log devices) and the second one was consistently running significantly slower than the first. Removing the second device made an improvement on performance, but did not remove the occasional observed pauses.

I was of the (mis)understanding that only metadata and writes smaller than 64k went via the slog device in the event of an O_SYNC write request?

The clients are (mostly) RHEL5. Is there a way to tune this on the NFS server or clients such that when I perform a large synchronous write, the data does not go via the slog device? I have investigated using the logbias setting, but that will also kill small-file performance on any filesystem using it, and defeat the purpose of having a slog device at all.

cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] periodic slow responsiveness
On 07/09/2009, at 10:46 AM, Ross Walker wrote:
>> zpool is RAIDZ2 comprised of 10 * 15kRPM SAS drives behind an LSI 1078 w/ 512MB BBWC exposed as RAID0 LUNs (Dell MD1000 behind PERC 6/E), with 2x SSDs each partitioned as 10GB slog and 36GB remainder as l2arc, behind another LSI 1078 w/ 256MB BBWC (Dell R710 server with PERC 6/i).
>
> This config might lead to heavy sync writes (NFS) starving reads due to the fact that the whole RAIDZ2 behaves as a single disk on writes. How about two 5-disk RAIDZ2s or three 4-disk RAIDZs? Just one or two other vdevs to spread the load can make a world of difference.

This was a management decision. I wanted to go down the striped mirrored pair path, but the amount of space lost was considered too great. RAIDZ2 was considered the best value option for our environment.

>> The system is configured as an NFS (currently serving NFSv3), iSCSI (COMSTAR) and CIFS (using the SUN SFW package running Samba 3.0.34) server, with authentication taking place from a remote openLDAP server.
>
> There are a lot of services here, all off one pool? You might be trying to bite off more than the config can chew.

That's not a lot of services, really. We have 6 users doing builds on multiple platforms and using the storage as their home directory (windows and unix). The issue is interactive responsiveness and whether there is a way to tune the system to give that while still having good performance for builds when they are run.

> Try taking a particularly bad problem station and configuring it static for a bit to see if it is.

That has been considered also, but the issue has also been observed locally on the fileserver.

> That doesn't make a lot of sense to me; the L2ARC is secondary read cache, so if writes are starving reads then the L2ARC would only help here.

I was suggesting that slog writes were possibly starving reads from the l2arc, as they were on the same device. This appears not to have been the issue, as the problem has persisted even with the l2arc devices removed from the pool.

> It just may be that the pool configuration just can't handle the write IOPS needed and reads are starving.

Possible, but hard to tell. Have a look at the iostat results I've posted.

cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] periodic slow responsiveness
On 07/09/2009, at 11:08 AM, Richard Elling wrote:
> Ok, just so I am clear, when you mean "local automount" you are on the server and using the loopback -- no NFS or network involved?

Correct. And the behaviour has been seen locally as well as remotely.

> You are looking for I/O that takes seconds to complete or is stuck in the device. This is in the actv column: stuck > 1 and the asvc_t >> 1000.

Just started having some slow responsiveness reported from a user using emacs (autosave, start of a build), so a small file write request. The second or so before they went to do this, it appears as if the raid cache in front of the slog devices was nearly filled and the SSDs were being utilised quite heavily, but then there was a break where I am seeing relatively light usage on the slog but 100% busy reported on the device. The iostat output is at the end of this message - I can't make any real sense out of why a user would have seen a ~4s delay at about 2:39:17-18. Only one of the two slog devices is being used at all. Is there some tunable for how multiple slogs are used?

c7t[01] are rpool
c7t[23] are slog devices in the data pool
c11t* are the primary storage devices for the data pool

cheers, James

Monday, 7 September 2009 2:39:17 PM EST
                            extended device statistics              ---- errors ----
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0  10   0  10 c9t0d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t0d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t1d0
    0.0 1475.0    0.0 188799.0  0.0 30.2    0.0   20.5   2  90   0   0   0   0 c7t2d0
    0.0  232.0    0.0  29571.8  0.0 33.8    0.0  145.9   0  98   0   0   0   0 c7t3d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t0d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t1d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t2d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t3d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t4d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t5d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t6d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t7d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t8d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t9d0

Monday, 7 September 2009 2:39:18 PM EST
                            extended device statistics              ---- errors ----
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0  10   0  10 c9t0d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t0d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t1d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t2d0
    0.0    0.0    0.0      0.0  0.0 35.0    0.0    0.0   0 100   0   0   0   0 c7t3d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t0d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t1d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t2d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t3d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t4d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t5d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t6d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t7d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t8d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t9d0

Monday, 7 September 2009 2:39:19 PM EST
                            extended device statistics              ---- errors ----
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0  10   0  10 c9t0d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t0d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t1d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c7t2d0
    0.0  341.0    0.0  43650.1  0.0 35.0    0.0  102.5   0 100   0   0   0   0 c7t3d0
    0.0    0.0    0.0      0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c11t0d0
Re: [zfs-discuss] periodic slow responsiveness
On 07/09/2009, at 6:24 AM, Richard Elling wrote:
> On Sep 6, 2009, at 7:53 AM, Ross Walker wrote:
>> On Sun, Sep 6, 2009 at 9:15 AM, James Lever wrote:
>>> I'm experiencing occasional slow responsiveness on an OpenSolaris b118 system, typically noticed when running an 'ls' (no extra flags, so no directory service lookups). There is a delay of between 2 and 30 seconds, but no correlation has been noticed between load on the server and the slow return. This problem has only been noticed via NFS (v3. We are migrating to NFSv4 once the O_EXCL/mtime bug fix has been integrated - anticipated for snv_124). The problem has been observed both locally on the primary filesystem, in a locally automounted reference (/home/foo) and remotely via NFS.
>
> I'm confused. If "This problem has only been noticed via NFS (v3" then how is it "observed locally?"

Sorry, I meant to say it had not been noticed using CIFS or iSCSI. It has been observed in client:/home/user (NFSv3 automount from server:/home/user, redirected to server:/zpool/home/user), and also in server:/home/user (local automount) and server:/zpool/home/user (origin).

> iostat(1m) is the program for troubleshooting performance issues related to latency. It will show the latency of nfs mounts as well as other devices.

What specifically should I be looking for here (using 'iostat -xen -T d')? I'm guessing I'll require a high level of granularity (1s intervals) to see the issue if it is a single disk or similar.

> stat(2) doesn't write, so you can stop worrying about the slog.

My concern here was that I may have been trying to write (via other concurrent processes) at the same time as there was a memory fault from the ARC to L2ARC.

> Rule out the network by looking at retransmissions and ioerrors with netstat(1m) on both the client and server.

No errors or collisions observed from either server or clients.

> That behavior sounds a lot like a process has a memory leak and is filling the VM. On Linux there is an OOM killer for these, but on OpenSolaris, you're the OOM killer. See rcapd(1m), rcapadm(1m), and rcapstat(1m) along with "Physical Memory Control Using the Resource Capping Daemon" in System Administration Guide: Solaris Containers-Resource Management, and Solaris Zones.

Thanks Richard, I'll have a look at that today and see where I get.

cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
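For anyone following along, the checks Richard suggests map to standard commands (a sketch; rcapstat only reports anything if rcapd is actually enabled):

  netstat -i 1                          # per-interval interface errors/collisions
  netstat -s -P tcp | grep -i retrans   # TCP retransmission counters
  rcapstat 5                            # resource-capping activity, per project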
Re: [zfs-discuss] periodic slow responsiveness
On 07/09/2009, at 12:53 AM, Ross Walker wrote:
> That behavior sounds a lot like a process has a memory leak and is filling the VM. On Linux there is an OOM killer for these, but on OpenSolaris, you're the OOM killer.

If it was this type of behaviour, where would it be logged when the process was killed/restarted? If it's not logged by default, can that be enabled? I have not seen any evidence of this in /var/adm/messages, /var/log/syslog, or my /var/log/debug (*.debug), but perhaps I'm not looking for the right clues.

> You have iSCSI, NFS, CIFS to choose from (most obvious); try restarting them one at a time during down time and see if performance improves after each restart to find the culprit.

The downtime is being reported by users, and I have only seen it once (while in their office), so this method of debugging isn't going to help, I'm afraid. (This is why I asked about alternate root cause analysis methods.)

cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] periodic slow responsiveness
I'm experiencing occasional slow responsiveness on an OpenSolaris b118 system, typically noticed when running an 'ls' (no extra flags, so no directory service lookups). There is a delay of between 2 and 30 seconds, but no correlation has been noticed between load on the server and the slow return. This problem has only been noticed via NFS (v3; we are migrating to NFSv4 once the O_EXCL/mtime bug fix has been integrated - anticipated for snv_124). The problem has been observed both locally on the primary filesystem, in a locally automounted reference (/home/foo) and remotely via NFS.

zpool is RAIDZ2 comprised of 10 * 15kRPM SAS drives behind an LSI 1078 w/ 512MB BBWC exposed as RAID0 LUNs (Dell MD1000 behind PERC 6/E), with 2x SSDs each partitioned as 10GB slog and the 36GB remainder as l2arc, behind another LSI 1078 w/ 256MB BBWC (Dell R710 server with PERC 6/i).

The system is configured as an NFS (currently serving NFSv3), iSCSI (COMSTAR) and CIFS (using the SUN SFW package running Samba 3.0.34) server, with authentication taking place from a remote openLDAP server. Automount is in use both locally and remotely (linux clients). Locally /home/* is remounted from the zpool; remotely /home and another filesystem (and children) are mounted using autofs. There was some suspicion that automount is the problem, but no definitive evidence as of yet.

The problem has definitely been observed with stats (of some form, typically '/usr/bin/ls' output) both remotely, locally in /home/* and locally in /zpool/home/* (the true source location). There is a clear correlation between recency of reads of the directories in question and reoccurrence of the fault, in that one user has scripted a regular (15m/30m/hourly tests so far) 'ls' of the filesystems of interest, and this has reduced the fault to minimal noted impact since starting down this path (just for themselves).

I have removed the l2arc(s) (cache devices) from the pool and the same behaviour has been observed. My suspicion here was that there was perhaps occasional high synchronous load causing heavy writes to the slog devices, and when a stat was requested it may have been faulting from ARC to L2ARC prior to going to the primary data store. The slowness has still been reported since removing the extra cache devices.

Another thought I had was along the lines of filesystem caching and heavy writes causing read blocking. I have no evidence that this is the case, but there have been some suggestions on list recently of limiting the ZFS memory usage for write caching. Can anybody comment on the effectiveness of this? (I have 256MB write cache in front of the slog SSDs and 512MB in front of the primary storage devices.)

My DTrace is very poor, but I'm suspicious that it is the best way to root cause this problem. If somebody has any code that may assist in debugging this problem and is able to share it, it would be much appreciated.

Any other suggestions for how to identify this fault and work around it would be greatly appreciated.

cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
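In the absence of anything better, a rough DTrace starting point might be to log any stat-family syscall that takes longer than a second (a sketch only; the threshold is in nanoseconds and the wildcard will also catch statvfs and friends):

  dtrace -n '
    syscall::*stat*:entry  { self->ts = timestamp; }
    syscall::*stat*:return /self->ts && timestamp - self->ts > 1000000000/
      { printf("%s took %d ms", execname, (timestamp - self->ts) / 1000000); }
    syscall::*stat*:return { self->ts = 0; }'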
Re: [zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
On 02/09/2009, at 9:54 AM, Adam Leventhal wrote:
> After investigating this problem a bit I'd suggest avoiding deploying RAID-Z until this issue is resolved. I anticipate having it fixed in build 124.

Thanks for the status update on this, Adam.

cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool
On 28/08/2009, at 3:23 AM, Adam Leventhal wrote:
> There appears to be a bug in the RAID-Z code that can generate spurious checksum errors. I'm looking into it now and hope to have it fixed in build 123 or 124. Apologies for the inconvenience.

Are the errors being generated likely to cause any significant problem running 121 with a RAID-Z volume, or should users of RAID-Z* wait until this issue is resolved?

cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zfs send/receive and compression
Is there a mechanism by which you can perform a zfs send | zfs receive and not have the data uncompressed and recompressed at the other end? I have a gzip-9 compressed filesystem that I want to back up to a remote system and would prefer not to have to recompress everything again at such great computational expense. If this doesn't exist, how would one go about creating an RFE for it?

cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
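As it stands, the stream carries uncompressed data and the receiving dataset applies its own compression property, so the workaround today is to set the property on the destination before receiving (a sketch; host and dataset names hypothetical):

  zfs set compression=gzip-9 backup/data    # children inherit this
  zfs send tank/data@snap | ssh backuphost zfs receive backup/data/copy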
Re: [zfs-discuss] Pool iscsi /zfs performance in opensolaris 0906
On 05/08/2009, at 11:41 AM, Ross Walker wrote: What is your recipe for these? There wasn't one! ;) The drive I'm using is a Dell badged Samsung MCCOE50G5MPQ-0VAD3. cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Pool iscsi /zfs performance in opensolaris 0906
On 05/08/2009, at 11:36 AM, Ross Walker wrote: Which model? PERC 6/E w/512MB BBWC. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Pool iscsi /zfs performance in opensolaris 0906
On 05/08/2009, at 10:36 AM, Carson Gaspar wrote:
> Isn't the PERC 6/e just a re-branded LSI? LSI added SSD support recently.

Yep, it's a MegaRAID device. I have been using one with a Samsung SSD in RAID0 mode (to avail myself of the cache) recently with great success.

cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Need tips on zfs pool setup..
On 04/08/2009, at 9:42 PM, Joseph L. Casale wrote:
> I noticed a huge improvement when I moved a virtualized pool off a series of 7200 RPM SATA discs to even 10k SAS drives. Night and day...

What I would really like to know is whether it makes a big difference comparing, say, 7200RPM drives in mirror+stripe mode vs 15kRPM drives in raidz2. And how much of a difference raidz2 makes compared to mirror+stripe in a contentious multi-client environment.

cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [storage-discuss] ZFS and deduplication
Nathan Hudson-Crim, On 04/08/2009, at 8:02 AM, Nathan Hudson-Crim wrote: Andre, I've seen this before. What you have to do is ask James each question 3 times and on the third time he will tell the truth. ;) I know this is probably meant to be seen as a joke, but it's clearly in very poor taste and extremely discourteous and rude to make public statements to the effect of: "James McPherson is a liar and we should public berate him until he tells us what we want to hear regardless of the real situation of which I have no information other than what I want to believe". Really! Please actually think before you post. (another) James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] feature proposal
On 30/07/2009, at 11:32 PM, Darren J Moffat wrote:
> On the host that has the ZFS datasets (ie the NFS/CIFS server) you need to give the user the delegation to create snapshots and to mount them:
>
> # zfs allow -u james snapshot,mount,destroy tank/home/james

Ahh, it was the lack of mount that caught me! Thanks Darren.

>> I've read through the manpage but have not managed to get the correct set of permissions for it to work as a normal user (so far).
>
> What did you try? What release of OpenSolaris are you running?

snv 118. I blame being tired and not trying enough options! I was trying to do it with just snapshot and destroy, expecting that for some reason a snapshot didn't need to be mounted. Thanks for the clarification. Next time I think I'll also consult the administration guide as well as the manpage, though I guess an explicit example for the snapshot delegation wouldn't go astray in the manpage.

cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
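Once that delegation is in place, the mkdir/rmdir interface works from an NFS client against the .zfs/snapshot directory (a sketch; paths and snapshot name hypothetical):

  mkdir /home/james/.zfs/snapshot/before-upgrade   # takes a snapshot
  rmdir /home/james/.zfs/snapshot/before-upgrade   # destroys it again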
Re: [zfs-discuss] feature proposal
Hi Darren,

On 30/07/2009, at 6:33 PM, Darren J Moffat wrote:
> That already works if you have the snapshot delegation as that user. It even works over NFS and CIFS.

Can you give us an example of how to correctly get this working? I've read through the manpage but have not managed to get the correct set of permissions for it to work as a normal user (so far). I'm sure others here would be keen to see a correct recipe to allow user-managed snapshots remotely via mkdir/rmdir.

cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [indiana-discuss] zfs issues?
On 29/07/2009, at 12:00 AM, James Lever wrote: CR 6865661 *HOT* Created, P1 opensolaris/triage-queue zfs scrub rpool causes zpool hang This bug I logged has been marked as related to CR 6843235 which is fixed in snv 119. cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [n/zfs-discuss] Strange speeds with x4500, Solaris 10 10/08
On 29/07/2009, at 5:47 PM, Ross wrote:
> Everyone else should be using the Intel X25-E. There's a massive difference between the M and E models, and for a slog it's IOPS and low latency that you need.

Do they have any capacitor-backed cache? Is this cache considered stable storage? If so, then they would be a fine solution. Are there any details of the cache size and capacitor support time? SSD manufacturers aren't releasing this type of information, and for use as a ZIL/slog for an NFS server it's a pretty critical piece of the puzzle.

> We're not running an x4500, but we were lucky enough to get our hands on some PCI 512MB nvram cards a while back, and I can confirm they make a huge difference to NFS speeds - for our purposes they're identical to ramdisk slog performance. At the moment, short of a STEC ZEUS, this is the only viable solution I've been able to come up with.

What is the NVRAM card you're using? For me, I'm putting the slog behind a raid controller with battery-backed DRAM write cache. That works really well.

cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [indiana-discuss] zfs issues?
Thanks for that Brian. I've logged a bug:

CR 6865661 *HOT* Created, P1 opensolaris/triage-queue zfs scrub rpool causes zpool hang

Just discovered, after trying to create a further crash dump, that it's failing and rebooting with the following error (just caught it prior to the reboot):

panic dump timeout

so I'm not sure how else to assist with debugging this issue.

cheers, James

On 28/07/2009, at 9:08 PM, Brian Ruthven - Solaris Network Sustaining - Sun UK wrote:
> Yes: make sure your dumpadm is set up beforehand to enable savecore, and that you have a dump device. In my case the output looks like this:
>
> $ pfexec dumpadm
>       Dump content: kernel pages
>        Dump device: /dev/zvol/dsk/rpool/dump (dedicated)
> Savecore directory: /var/crash/opensolaris
>   Savecore enabled: yes
>
> Then you should get a dump saved in /var/crash/ on next reboot.
>
> Brian

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [indiana-discuss] zfs issues?
On 28/07/2009, at 9:22 AM, Robert Thurlow wrote: I can't help with your ZFS issue, but to get a reasonable crash dump in circumstances like these, you should be able to do "savecore -L" on OpenSolaris. That would be well and good if I could get a login - due to the rpool being unresponsive, that was not possible. So the only recourse we had was via kmdb :/ Is there a way to explicitly invoke savecore via kmdb? James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [indiana-discuss] zfs issues?
On 28/07/2009, at 6:44 AM, dick hoogendijk wrote:
> Are there any known issues with zfs in OpenSolaris B118? I run my pools formatted like the original release 2009.06 (I want to be able to go back to it ;-). I'm a bit scared after reading about serious issues in B119 (will be skipped, I heard). But B118 is "safe"?

Well, actually, I have an issue with ZFS under b118 on osol.

Under b117, I attached a second disk to my root pool and confirmed everything worked fine. Rebooted with the disks in reverse order to prove the grub install worked, and everything was fine. Removed one of the spindles, did an upgrade to b118, rebooted and tested, and then rebooted and added the removed volume. This was an explicit test of automated resilvering, and it worked perfectly. Did one or two explicit scrubs along the way and they were fine too.

So then I upgraded my zpool from version 14 to version 16, and now zpool scrub rpool hangs the ZFS subsystem. The machine still runs - it's pingable etc. - but anything that goes to disk (at least rpool) hangs indefinitely. This happens whether I boot with the mirror intact or degraded with one spindle removed.

I had help trying to create a crash dump, but everything we tried didn't cause the system to panic:

0>eip;:c;:c

and other weird magic I don't fully grok.

Has anybody else seen this weirdness?

cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] deduplication
On 15/07/2009, at 1:51 PM, Jean Dion wrote:
> Do we know if this web article will be discussed at the conference in Brisbane, Australia this week? http://www.pcworld.com/article/168428/sun_tussles_with_deduplication_startup.html?tk=rss_news I do not expect details, but at least Sun's position on this, instead of leaving people to rumours like those published in this article. Any replay and materials from this conference?

There is a ustream feed that's live now at: http://www.ustream.tv/channel/kernel-conference-australia

The conference is being recorded as well, and the talks will likely be re-encoded and uploaded somewhere down the track.

cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] De-duplication: possible to identify duplicate files?
On 15/07/2009, at 7:18 AM, Orvar Korvar wrote:
> With dedup, will it be possible somehow to identify files that are identical but have different names? Then I can find and remove all duplicates. I know that with dedup, removal is not really needed because the duplicate will just be a reference to an existing file. But nevertheless I want to keep down the file count.

Based on Jeff and Bill's talk this morning, dedup (v1.0) is initially based on the block-level hashes used within zfs - so data in any zvol or zfs filesystem within a pool (for those zvols/filesystems that have dedup enabled) will keep a single copy of each block that has the same hash. I'm sure more detailed information will come as it is put back into ON and information is published on docs.sun.com and other places.

cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] surprisingly poor performance
On 07/07/2009, at 8:20 PM, James Andrewartha wrote:
> Have you tried putting the slog on this controller, either as an SSD or regular disk? It's supported by the mega_sas driver, x86 and amd64 only.

What exactly are you suggesting here? Configure one disk on this array as a dedicated ZIL? Would that improve performance any over using all disks with an internal ZIL?

I have now done some tests with the PERC 6/E in both RAID10 (all devices RAID0 LUNs, ZFS mirror/striped config) and also as a hardware RAID5, both with an internal ZIL.

RAID10 (10 disks, 5 mirror vdevs):
  create 2m14.448s
  unlink 0m54.503s

RAID5 (9 disks, 1 hot spare):
  create 1m58.819s
  unlink 0m48.509s

Unfortunately, Linux on the same RAID5 array using XFS still seems significantly faster:

Linux RAID5 (9 disks, 1 hot spare), XFS:
  create 1m30.911s
  unlink 0m38.953s

Is there a way to disable the write barrier in ZFS in the way you can with Linux filesystems (-o barrier=0)? Would this make any difference? After much consideration, the lack of barrier capability makes no difference to filesystem stability in the scenario where you have a battery-backed write cache.

Due to using identical hardware and configurations, I think this is a fair apples-to-apples test now. I'm now wondering if XFS is just the faster filesystem... (not the most practical management solution, just speed).

cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
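For what it's worth, the nearest ZFS equivalent is to stop ZFS issuing cache-flush requests, which is only reasonably safe when every device sits behind non-volatile (battery-backed) cache. A sketch, via /etc/system:

  set zfs:zfs_nocacheflush = 1

or live, for testing (takes effect immediately, not persistent across reboot):

  echo zfs_nocacheflush/W0t1 | mdb -kw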
Re: [zfs-discuss] surprisingly poor performance
On 06/07/2009, at 9:31 AM, Ross Walker wrote:
> There are two types of SSD drives on the market: the fast-write SLC (single level cell) and the slow-write MLC (multi level cell). MLC is usually used in laptops, as SLC drives over 16GB usually go for $1000+, which isn't cost effective in a laptop. MLC is good for read caching though, and most use it for L2ARC. I just ordered a bunch of 16GB Imation Pro 7500's (formerly Mtron) from CDW lately for $290 a pop. They are supposed to be fast sequential write SLC drives and so-so random write. We'll see.

That will be interesting to see. The Samsung drives we have are 50GB (64GB) SLC and apparently 2nd generation.

For a slog, is random write even an issue? Or is it just the mechanism used to measure the IOPS performance of a typical device? AFAIUI, the ZIL is used as a ring buffer. How does that work with an SSD?

All this pain really makes me think the only sane slog is one that is RAM based and has enough capacitance to either make itself permanent or move the data to something permanent before failing (FusionIO, DDRdrive, for example).

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [storage-discuss] surprisingly poor performance
On 05/07/2009, at 1:57 AM, Ross Walker wrote:
> Barriers are disabled by default on ext3 mounts... Google it and you'll see interesting threads on the LKML. It seems there was some serious performance degradation in using them. A lot of decisions in Linux are made in favor of performance over data consistency.

After doing a fair bit of reading about Linux and write barriers, I'm sure that it's an issue for traditional direct attach storage and for non-battery-backed write cache in raid cards when the cache is enabled.

Is it actually an issue if you have a hardware raid controller w/ BBWC enabled and the cache disabled on the HDDs (i.e. correctly configured for data safety)? Should a correctly performing raid card be ignoring barrier write requests because the data is already on stable storage?

cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] surprisingly poor performance
On 04/07/2009, at 3:08 AM, Bob Friesenhahn wrote: It seems like you may have selected the wrong SSD product to use. There seems to be a huge variation in performance (and cost) with so- called "enterprise" SSDs. SSDs with capacitor-backed write caches seem to be fastest. Do you have any methods to "correctly" measure the performance of an SSD for the purpose of a slog and any information on others (other than anecdotal evidence)? cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [storage-discuss] surprisingly poor performance
On 04/07/2009, at 1:49 PM, Ross Walker wrote:
> I ran some benchmarks back when verifying this, but didn't keep them unfortunately. You can google: XFS Barrier LVM OR EVMS and see the threads about this.

Interesting reading. Testing seems to show that either it's not relevant, or there is something interesting going on with ext3 as a separate case.

> When you do send me a copy, try both on a straight partition then on a LVM volume, and always use NFS sync; but when exporting, use the no_wdelay option if you don't already - that eliminates slowdowns with NFS sync on Linux.

The numbers below seem to indicate that either there are no barrier issues here, or the BBWC in the raid controller makes them more-or-less invisible, as the ext3 volume below is directly on the exposed LUN while the xfs partition is on top of LVM2. It does, however, show that xfs is much faster for deletes.

cheers, James

bash-3.2# cd /nfs/xfs_on_LVM
bash-3.2# ( date ; time tar xf zeroes-10k.tar ; date ; time rm -rf zeroes/ ; date ) 2>&1
Sat Jul 4 15:31:13 EST 2009

real    0m18.145s
user    0m0.055s
sys     0m0.500s

Sat Jul 4 15:31:31 EST 2009

real    0m4.585s
user    0m0.004s
sys     0m0.261s

Sat Jul 4 15:31:36 EST 2009

bash-3.2# cd /nfs/ext3
bash-3.2# ( date ; time tar xf zeroes-10k.tar ; date ; time rm -rf zeroes/ ; date )
Sat Jul 4 15:32:43 EST 2009

real    0m15.509s
user    0m0.048s
sys     0m0.508s

Sat Jul 4 15:32:59 EST 2009

real    0m37.793s
user    0m0.006s
sys     0m0.225s

Sat Jul 4 15:33:37 EST 2009

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] surprisingly poor performance
On 04/07/2009, at 2:08 PM, Miles Nordin wrote:
> iostat -xcnXTdz c3t31d0 1

On that device being used as a slog, a higher range of output looks like:

                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 1477.8    0.0  2955.4  0.0  0.0    0.0    0.0   0   5 c7t2d0
Saturday, July 4, 2009 2:18:48 PM EST
     cpu
 us sy wt id
  0  1  0 99

I started a second task from the first server while using only a single slog, and the performance of the SSD got up to 1900 w/s:

    0.0 1945.8    0.0  3891.7  0.0  0.1    0.0    0.0   0   6 c7t2d0
    0.0    0.0    0.0     0.0  0.0  0.0    0.0    0.0   0   0 c7t3d0
Saturday, July 4, 2009 2:23:11 PM EST
     cpu
 us sy wt id
  0  1  0 99

Interestingly, adding a second SSD into the mix and a 3rd writer (on a second client system) showed no further increases:

    0.0  942.3    0.0  1884.4  0.0  0.0    0.0    0.0   0   3 c7t2d0
    0.0  942.4    0.0  1884.4  0.0  0.0    0.0    0.0   0   3 c7t3d0

Adding the ramdisk as a 3rd slog with 3 writers gave only an increase in the speed of the slowest device:

    0.0  453.6    0.0  1814.4  0.0  0.0    0.0    0.0   0   1 ramdisk1
    0.0  907.2    0.0  1814.4  0.0  0.0    0.0    0.0   0   3 c7t2d0
    0.0  907.2    0.0  1814.4  0.0  0.0    0.0    0.0   0   3 c7t3d0
Saturday, July 4, 2009 2:29:08 PM EST
     cpu
 us sy wt id
  0  2  0 98

When only the ramdisk is used as a slog, it gives the following results:

    0.0 3999.4    0.0 15997.8  0.0  0.0    0.0    0.0   0   2 ramdisk1
Saturday, July 4, 2009 2:36:58 PM EST
     cpu
 us sy wt id
  0  3  0 96

Any insightful observations?

cheers, James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [storage-discuss] surprisingly poor performance
On 04/07/2009, at 10:42 AM, Ross Walker wrote:
> XFS on LVM or EVMS volumes can't do barrier writes due to the lack of
> barrier support in LVM and EVMS, so it doesn't do a hard cache sync
> like it would on a raw disk partition, which makes the numbers higher.
> BUT with battery backed write cache the risk is negligible - the
> numbers are just higher than those on file systems that do do a hard
> cache sync.

Do you have any references for this, and perhaps some published numbers that you may have seen?

> Try XFS on a raw partition and NFS with sync writes enabled and see
> how it performs then.

I cannot do this on the existing fileserver, and I do not have another system with a BBWC card to test against. The BBWC on the LSI MegaRaid is certainly the key factor here, I would expect. I can test this assumption on the new hardware next week when I do a number of other tests and compare Linux/XFS, and perhaps remove LVM (though I don't see why you would take LVM out of the equation).
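The barrier comparison I have in mind for next week is along these lines (a sketch - device names are illustrative; on kernels of this vintage XFS should log a "Disabling barriers" warning at mount time if the underlying stack cannot support them):

# raw partition vs LVM volume, barriers explicitly requested
mount -o barrier /dev/sdb1 /mnt/xfs_raw
mount -o barrier /dev/mapper/vg0-test /mnt/xfs_lvm
# check whether the kernel had to turn barriers off
dmesg | grep -i barrier

cheers, James
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss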
Re: [zfs-discuss] surprisingly poor performance
On 03/07/2009, at 10:37 PM, Victor Latushkin wrote:
> A slog in ramdisk is analogous to no slog at all and a disabled ZIL
> (well, it may actually be a bit worse). If your old system is 5 years
> old, the difference in the above numbers may be due to differences in
> CPU and memory speed, and so it suggests that your Linux NFS server
> appears to be working at memory speed - hence the question. Because if
> it does not honor sync semantics, you are really comparing apples with
> oranges here.

The slog in ramdisk is in no way similar to disabling the ZIL. This is an NFS test, so if I had disabled the ZIL, writes would have to go direct to disk (not the ZIL) before returning, which would potentially be even slower than the ZIL on the zpool.

The appearance of the Linux NFS server performing at memory speed may just be the BBWC in the LSI MegaRaid SCSI card. One of the developers here explicitly performed tests to check these assumptions and found no evidence that the Linux/XFS sync implementation was lacking, even though there were issues with it in one previous kernel revision.

cheers, James
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] surprisingly poor performance
Hi Mertol,

On 03/07/2009, at 6:49 PM, Mertol Ozyoney wrote:
> ZFS SSD usage behaviour heavily depends on access pattern, and for
> async ops ZFS will not use SSDs. I'd suggest you disable the SSDs,
> create a ram disk and use it as the slog device to compare the
> performance. If performance doesn't change, it means that the
> measurement method has some flaws or you haven't configured the slog
> correctly.

I did some tests with a ramdisk slog, and the write IOPS seemed to run at about the 4k/s mark, vs about 800/s when using the SSD as slog and 200/s without a slog.

# osol b117, RAID10 + ramdisk slog
bash-3.2# ( time tar xf zeroes.tar ; time rm -rf zeroes/ ) 2>&1 | tee /root/zeroes-test-scalzi-dell-ramdisk_slog.txt
# tar: real 1m32.343s
# rm:  real 0m44.418s

# linux + XFS on hardware RAID
bash-3.2# ( time tar xf zeroes.tar ; time rm -rf zeroes/ ) 2>&1 | tee /root/zeroes-test-linux-lsimegaraid_bbwc.txt
# tar: real 2m27.791s
# rm:  real 0m46.112s

> Please note that SSDs are way slower than DRAM based write caches.
> SSDs will show a performance increase when you create load from
> multiple clients at the same time, as ZFS will be flushing the dirty
> cache sequentially. So I'd suggest running the test from a lot of
> clients simultaneously.

I'm sure that it would be a more performant system in general; however, it is this explicit set of tests that I need to maintain or improve performance on.
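For anyone wanting to repeat the ramdisk slog comparison, the setup was along these lines (a sketch - the ramdisk name and size are illustrative, and obviously a ramdisk slog is volatile, so this is for benchmarking only):

# create a 2GB ramdisk and attach it to the pool as a slog
pfexec ramdiskadm -a rdslog 2g
pfexec zpool add fastdata log /dev/ramdisk/rdslog

cheers, James
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss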
Re: [zfs-discuss] surprisingly poor performance
Hej Henrik,

On 03/07/2009, at 8:57 PM, Henrik Johansen wrote:
> Have you tried running this locally on your OpenSolaris box - just to
> get an idea of what it could deliver in terms of speed? Which NFS
> version are you using?

Most of the tests shown in my original message are local, except for the explicitly NFS-based metadata test shown at the very end (100k 0-byte files). Run locally, the 100k/0b test completes almost atomically due to caching semantics and the lack of 100k explicit sync requests - the transactions can be bundled together and written out in one go.

I've just been using NFSv3 so far for these tests, as it is widely regarded as faster, even though less functional.
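Pinning the protocol version on the RHEL client is easy enough, so comparing v3 against v4 later should be straightforward (a sketch - the server name and mount points are illustrative):

# NFSv3 vs NFSv4 mounts of the same ZFS filesystem
mount -t nfs -o vers=3 server:/fastdata /mnt/v3
mount -t nfs4 server:/fastdata /mnt/v4

cheers, James
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss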
Re: [zfs-discuss] surprisingly poor performance
On 03/07/2009, at 5:03 PM, Brent Jones wrote:
> Are you sure the slog is working right? Try disabling the ZIL to see
> if that helps with your NFS performance. If your performance increases
> a hundred fold, I'm suspecting the slog isn't performing well, or even
> doing its job at all.

The slog appears to be working fine - at ~800 IOPS it wasn't lighting up its activity light significantly, and when a second SSD was added both activity lights were even dimmer. Without the slog, the pool was only providing ~200 IOPS for the NFS metadata test.

Speaking of which, can anybody point me at a good, valid test to measure the IOPS of these SSDs?

cheers, James
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] surprisingly poor performance
Hi All,

We have recently acquired hardware for a new fileserver, and my task, if I want to use OpenSolaris (osol or sxce) on it, is for it to perform at least as well as Linux (and our 5 year old fileserver) in our environment.

Our current file server is a whitebox Debian server with 8x 10,000 RPM SCSI drives behind an LSI MegaRaid controller with a BBU. The filesystem in use is XFS.

The raw performance tests that I have to use to compare them are as follows:

* Create 100,000 0-byte files over NFS
* Delete 100,000 0-byte files over NFS
* Repeat the previous 2 tasks with 1k files
* Untar a copy of our product with object files (quite a nasty test)
* Rebuild the product with "make -j"
* Delete the build directory

The reason for the 100k files tests is that this has proven to be a significant indicator of desktop performance on the developers' desktop systems.

Within the budget we had, we have purchased the following system to meet our goals - if the OpenSolaris tests do not meet our requirements, it is certain that the equivalent tests under Linux will. I'm the only person here who wants OpenSolaris specifically, so it is in my interest to try to get it working at least on par with, if not better than, our current system. So here I am, begging for further help.

Dell R710
2x 2.40 GHz Xeon 5330 CPU
16GB RAM (4x 4GB)
mpt0 SAS 6/i (LSI 1068E)
  2x 1TB SATA-II drives (rpool)
  2x 50GB Enterprise SSD (slog) - Samsung MCCOE50G5MPQ-0VAD3
mpt1 SAS 5/E (LSI 1068E)
  Dell MD1000 15-bay external storage chassis with 2 heads
  10x 450GB Seagate Cheetah 15,000 RPM SAS

We also have a PERC 6/E w/512MB BBWC to test with, or to fall back to if we go with a Linux solution.

I have installed OpenSolaris 2009.06 and updated to b117, and used mdb to modify the kernel to work around a current bug in b117 with the newer Dell systems:

http://bugs.opensolaris.org/bugdatabase/view_bug.do%3Bjsessionid=76a34f41df5bbbfc2578934eeff8?bug_id=6850943

Keep in mind that in these tests the external MD1000 chassis is connected with a single 4-lane SAS cable, which should give 12Gbps (1.2GBps) of throughput. Individually, each disk exhibits about 170MB/s raw write performance, e.g.

jam...@scalzi:~$ pfexec dd if=/dev/zero of=/dev/rdsk/c8t5d0 bs=65536 count=32768
2147483648 bytes (2.1 GB) copied, 12.4934 s, 172 MB/s

A single spindle zpool seems to perform OK.
jam...@scalzi:~$ pfexec zpool create single c8t20d0
jam...@scalzi:~$ pfexec dd if=/dev/zero of=/single/foo bs=65536 count=327680
21474836480 bytes (21 GB) copied, 127.201 s, 169 MB/s

RAID10 tests seem to be quite slow - about half the speed I would have expected (170 MB/s x 5 mirror pairs = 850, so I would have expected to see around 800MB/s):

jam...@scalzi:~$ pfexec zpool create fastdata mirror c8t9d0 c8t10d0 mirror c8t11d0 c8t15d0 mirror c8t16d0 c8t17d0 mirror c8t18d0 c8t19d0 mirror c8t20d0 c8t21d0
jam...@scalzi:~$ pfexec dd if=/dev/zero of=/fastdata/foo bs=131072 count=163840
21474836480 bytes (21 GB) copied, 50.3066 s, 427 MB/s

A 5 disk stripe performed as expected:

jam...@scalzi:~$ pfexec zpool create fastdata c8t10d0 c8t15d0 c8t17d0 c8t19d0 c8t21d0
jam...@scalzi:~$ pfexec dd if=/dev/zero of=/fastdata/foo bs=131072 count=163840
21474836480 bytes (21 GB) copied, 27.7972 s, 773 MB/s

But a 10 disk stripe did not increase throughput significantly:

jam...@scalzi:~$ pfexec zpool create fastdata c8t10d0 c8t15d0 c8t17d0 c8t19d0 c8t21d0 c8t20d0 c8t18d0 c8t16d0 c8t11d0 c8t9d0
jam...@scalzi:~$ pfexec dd if=/dev/zero of=/fastdata/foo bs=131072 count=163840
21474836480 bytes (21 GB) copied, 26.1189 s, 822 MB/s

The best sequential write result I could elicit with redundancy was a pool of two 5-disk RAIDZs striped:

jam...@scalzi:~$ pfexec zpool create fastdata raidz c8t10d0 c8t15d0 c8t16d0 c8t11d0 c8t9d0 raidz c8t17d0 c8t19d0 c8t21d0 c8t20d0 c8t18d0
jam...@scalzi:~$ pfexec dd if=/dev/zero of=/fastdata/foo bs=131072 count=163840
21474836480 bytes (21 GB) copied, 31.3934 s, 684 MB/s

Moving on to testing NFS and the create-100,000-0-byte-files test (aka the metadata and NFS sync test): without a slog, the test looked likely to take about half an hour, as I worked out when I killed it. Painfully slow. So I added one of the SSDs to the system as a slog, which improved matters.

The client is a Red Hat Enterprise Linux server on modern hardware and has been used for all tests against our old fileserver. The time to beat, RHEL5 client to Debian4+XFS server:

bash-3.2# time tar xf zeroes.tar

real    2m41.979s
user    0m0.420s
sys     0m5.255s

And on the currently configured system:

jam...@scalzi:~$ pfexec zpool create fastdata mirror c8t9d0 c8t10d0 mirror c8t11d0 c8t15d0 mirror c8t16d0 c8t17d0 mirror c8t18d0 c8t19d0 mirror c8t20d0 c8t21d0 log c7t2d0
jam...@scalzi:~$ pfexec zfs set sharenfs='rw,ro...@10.1.0/23' fastdata

bash-3.2# time tar xf zeroes.tar

real    8m7.176s
user    0m0.438s
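(In case anyone wants to reproduce the metadata test: zeroes.tar is nothing fancy - roughly the following, built on the client. The staging path is illustrative:)

# build a tarball of 100,000 empty files
mkdir zeroes
( cd zeroes && seq 1 100000 | xargs touch )
tar cf zeroes.tar zeroes/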
Re: [zfs-discuss] SPARC SATA, please.
On 25/06/2009, at 5:16 AM, Miles Nordin wrote:
> and mpt is the 1068 driver, proprietary, works on x86 and SPARC. then
> there is also itmpt, the third-party-downloadable closed-source driver
> from LSI Logic, dunno much about it but someone here used it.

I'm confused. Why do you say the mpt driver is proprietary and the LSI-provided tool closed source? I thought they were both closed source, and that the LSI chipset specifications were proprietary.

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] cutting up a SSD for read/log use...
Hi Erik,

On 22/06/2009, at 1:15 PM, Erik Trimble wrote:
> I just looked at pricing for the higher-end MLC devices, and it looks
> like I'm better off getting a single drive of 2X capacity than two
> with X capacity. Leaving aside the issue that by using 2 drives I get
> 2 x 3.0Gbps SATA performance instead of 1 x 3.0Gbps, are there
> problems with using two slices instead of whole drives? That is, one
> slice for read cache and the other for ZIL?

The benefit you will get from using 2 drives instead of 1 is a doubling of your IOPS, which will improve your overall performance, especially when using those drives for ZILs. Are you planning on using these drives as primary data storage and ZIL for the same volumes, or as primary storage for (say) your rpool and ZIL for a data pool on spinning metal?
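For what it's worth, the split-SSD layout you describe would look something like this (a sketch - the pool and slice names are illustrative; the slices would be carved up with format(1M) beforehand):

# one slice as L2ARC read cache, the other as the slog
pfexec zpool add datapool cache c1t2d0s0
pfexec zpool add datapool log c1t2d0s1

cheers, James
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss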
Re: [zfs-discuss] how to do backup
On 20/06/2009, at 9:55 PM, Charles Hedrick wrote:
> I have a USB disk, to which I want to do a backup. I've used send |
> receive. It works fine until I try to reboot. At that point the system
> fails to come up because the backup copy is set to be mounted at the
> original location, so the system tries to mount two different things
> in the same place. I guess I can have the script set mountpoint=none,
> but I'd think there would be a better approach.

Would a "zpool export $backup_pool" do the trick? (And consequently, you import the USB zpool before you start your backups?)
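A minimal sketch of what I mean - the pool and snapshot names here are illustrative:

# import the USB pool, send the snapshot across, then export the pool
# so none of its filesystems are mounted (or fought over) at boot
pfexec zpool import backup
pfexec zfs send tank/home@today | pfexec zfs receive -F backup/home
pfexec zpool export backup

cheers, James
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss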