Re: [zfs-discuss] Feature Request for zfs pool/filesystem protection?
On Wed, Feb 20, 2013 at 4:49 PM, Markus Grundmann wrote:
> Whenever I modify zfs pools or filesystems it's possible to destroy [on a
> bad day :-)] my data. A new property "protected=on|off" in the pool and/or
> filesystem can help the administrator against data loss (e.g. a "zpool
> destroy tank" or "zfs destroy" command will be rejected when the
> "protected=on" property is set).
>
> Is there anywhere on this list where this feature request can be
> discussed or forwarded? I hope you have understood my post ;-)

I like the idea and it is likely not very hard to implement. This is very
similar to how snapshot holds work.

# zpool upgrade -v | grep -i hold
 18  Snapshot user holds

So long as you aren't using a really ancient zpool version, you could use
this feature to protect your file systems.

# zfs create a/b
# zfs snapshot a/b@snap
# zfs hold protectme a/b@snap
# zfs destroy a/b
cannot destroy 'a/b': filesystem has children
use '-r' to destroy the following datasets:
a/b@snap
# zfs destroy -r a/b
cannot destroy 'a/b@snap': snapshot is busy

Of course, snapshots aren't free if you write to the file system. A way
around that is to create an empty file system within the one that you are
trying to protect.

# zfs create a/1
# zfs create a/1/hold
# zfs snapshot a/1/hold@hold
# zfs hold 'saveme!' a/1/hold@hold
# zfs holds a/1/hold@hold
NAME           TAG      TIMESTAMP
a/1/hold@hold  saveme!  Wed Feb 20 15:06:29 2013
# zfs destroy -r a/1
cannot destroy 'a/1/hold@hold': snapshot is busy

Extending the hold mechanism to filesystems and volumes would be quite nice.

Mike
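As an aside, a minimal sketch of how the "protection" in the example above is
lifted when a destroy really is intended (same dataset and hold tag as in the
example):

# show the user holds that are blocking the destroy
zfs holds a/1/hold@hold

# release the hold, after which the recursive destroy succeeds
zfs release 'saveme!' a/1/hold@hold
zfs destroy -r a/1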
Re: [zfs-discuss] Very poor small-block random write performance
iozone doesn't vary the blocksize during the test; it's a very artificial
test, but it's useful for gauging performance under different scenarios. So
for this test all of the writes would have been 64k blocks, 128k, etc. for
that particular step.

Just as another point of reference, I reran the test with a Crucial M4 SSD
and the result for 16G/64k was 35 MB/s (a 5x improvement). I'll rerun that
part of the test with zpool iostat and see what it says.

Mike

On Thu, Jul 19, 2012 at 7:27 PM, Jim Klimov wrote:
>> This is normal. The problem is that with zfs 128k block sizes, zfs
>> needs to re-read the original 128k block so that it can compose and
>> write the new 128k block. With sufficient RAM, this is normally avoided
>> because the original block is already cached in the ARC.
>>
>> If you were to reduce the zfs blocksize to 64k then the performance dive
>> at 64k would go away but there would still be write performance loss at
>> sizes other than a multiple of 64k.
>
> I am not sure if I misunderstood the question or Bob's answer,
> but I have a gut feeling it is not fully correct: ZFS block
> sizes for files (filesystem datasets) are, at least by default,
> dynamically-sized depending on the contiguous write size as
> queued by the time a ZFS transaction is closed and flushed to
> disk. In case of RAIDZ layouts, this logical block is further
> striped over several sectors on several disks in one of the
> top-level vdevs, starting with parity sectors for each "row".
>
> So, if the test logically overwrites full blocks of test data
> files, reads for recombination are not needed (but that can
> be checked for with "iostat 1" or "zpool iostat" - to see how
> many reads do happen during write-tests?) Note that some reads
> will show up anyway, i.e. to update ZFS metadata (the block
> pointer tree).
>
> However, if the test file was written in 128K blocks and then
> is rewritten with 64K blocks, then Bob's answer is probably
> valid - the block would have to be re-read once for the first
> rewrite of its half; it might be taken from cache for the
> second half's rewrite (if that comes soon enough), and may be
> spooled to disk as a couple of 64K blocks or one 128K block
> (if both changes come soon after each other - within one TXG).
>
> HTH,
> //Jim Klimov
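Since the follow-up above mentions rerunning the test while watching zpool
iostat, here is a hedged sketch of what that check could look like (the pool
name "tank" is assumed); a significant read column during a pure rewrite
phase is the read-modify-write signature Jim describes:

# one-second samples while the iozone rewrite pass runs
zpool iostat tank 1

# per-vdev breakdown, useful for spotting an imbalanced or slow device
zpool iostat -v tank 1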
Re: [zfs-discuss] Very poor small-block random write performance
vfs.zfs.txg.synctime_ms: 1000
vfs.zfs.txg.timeout: 5

On Thu, Jul 19, 2012 at 8:47 PM, John Martin wrote:
> On 07/19/12 19:27, Jim Klimov wrote:
>
>> However, if the test file was written in 128K blocks and then
>> is rewritten with 64K blocks, then Bob's answer is probably
>> valid - the block would have to be re-read once for the first
>> rewrite of its half; it might be taken from cache for the
>> second half's rewrite (if that comes soon enough), and may be
>> spooled to disk as a couple of 64K blocks or one 128K block
>> (if both changes come soon after each other - within one TXG).
>
> What are the values for zfs_txg_synctime_ms and zfs_txg_timeout
> on this system (FreeBSD, IIRC)?
Re: [zfs-discuss] Benefits of enabling compression in ZFS for the zones
On Tue, Jul 10, 2012 at 6:29 AM, Jordi Espasa Clofent wrote:
> Thanks for your explanation Fajar. However, take a look at the next lines:
>
> # available ZFS in the system
>
> root@sct-caszonesrv-07:~# zfs list
> NAME                         USED  AVAIL  REFER  MOUNTPOINT
> opt                          532M  34.7G   290M  /opt
> opt/zones                    243M  34.7G    32K  /opt/zones
> opt/zones/sct-scw02-shared   243M  34.7G   243M  /opt/zones/sct-scw02-shared
> static                       104K  58.6G    34K  /var/www/
>
> # creating a file in /root (UFS)
>
> root@sct-caszonesrv-07:~# dd if=/dev/zero of=file.bin count=1024 bs=1024
> 1024+0 records in
> 1024+0 records out
> 1048576 bytes (1.0 MB) copied, 0.0545957 s, 19.2 MB/s
> root@sct-caszonesrv-07:~# pwd
> /root
>
> # enable compression in some ZFS zone
>
> root@sct-caszonesrv-07:~# zfs set compression=on opt/zones/sct-scw02-shared
>
> # copying the previous file to this zone
>
> root@sct-caszonesrv-07:~# cp /root/file.bin /opt/zones/sct-scw02-shared/root/
>
> # checking the file size in the origin dir (UFS) and the destination one
> # (ZFS with compression enabled)
>
> root@sct-caszonesrv-07:~# ls -lh /root/file.bin
> -rw-r--r-- 1 root root 1.0M Jul 10 13:21 /root/file.bin
>
> root@sct-caszonesrv-07:~# ls -lh /opt/zones/sct-scw02-shared/root/file.bin
> -rw-r--r-- 1 root root 1.0M Jul 10 13:22 /opt/zones/sct-scw02-shared/root/file.bin
>
> # both files have exactly the same cksum!
>
> root@sct-caszonesrv-07:~# cksum /root/file.bin
> 3018728591 1048576 /root/file.bin
>
> root@sct-caszonesrv-07:~# cksum /opt/zones/sct-scw02-shared/root/file.bin
> 3018728591 1048576 /opt/zones/sct-scw02-shared/root/file.bin
>
> So... I don't see any size variation with this test.

ls(1) tells you how much data is in the file - that is, how many bytes of
data an application will see if it reads the whole file. du(1) tells you how
many disk blocks are used. If you look at the stat structure in stat(2), ls
reports st_size, du reports st_blocks.

Blocks full of zeros are special to zfs compression - it recognizes them and
stores no data. Thus, a file that contains only zeros will only require
enough space to hold the file metadata.

$ zfs list -o compression ./
COMPRESS
      on
$ dd if=/dev/zero of=1gig count=1024 bs=1024k
1024+0 records in
1024+0 records out
$ ls -l 1gig
-rw-r--r--   1 mgerdts  staff  1073741824 Jul 10 07:52 1gig
$ du -k 1gig
0       1gig

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
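If you want both numbers on a single line, a minimal sketch (using the 1gig
file from the example above):

# ls -s adds a leading column of allocated blocks (512-byte units on
# Solaris), so one line shows both st_blocks and st_size; a large
# st_size with a tiny block count indicates compression or sparseness
$ ls -ls 1gig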
Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free
On Tue, Jun 12, 2012 at 11:17 AM, Sašo Kiselkov wrote:
> On 06/12/2012 05:58 PM, Andy Bowers - Performance Engineering wrote:
>> find where your nics are bound too
>>
>> mdb -k
>> ::interrupts
>>
>> create a processor set including those cpus [ so just the nic code will
>> run there ]
>>
>> andy
>
> Tried and didn't help, unfortunately. I'm still seeing drops. What's
> even funnier is that I'm seeing drops when the machine is sync'ing the
> txg to the zpool. So looking at a little UDP receiver I can see the
> following input stream bandwidth (the stream is constant bitrate, so
> this shouldn't happen):

If processing in interrupt context (use intrstat) is dominating cpu usage,
you may be able to use pcitool to cause the device generating all of those
expensive interrupts to be moved to another CPU.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
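A minimal sketch of the first check suggested above (intrstat ships with
Solaris; the five-second interval is arbitrary):

# sample interrupt activity every 5 seconds; the output shows, per device
# instance, how much time each CPU spends in that device's interrupt
# handler - a single CPU pegged by one NIC is the pattern to look for
intrstat 5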
Re: [zfs-discuss] Strange hang during snapshot receive
On Thu, May 10, 2012 at 5:37 AM, Ian Collins wrote:
> I have an application I have been using to manage data replication for a
> number of years. Recently we started using a new machine as a staging
> server (not that new, an x4540) running Solaris 11 with a single pool built
> from 7x6 drive raidz. No dedup and no reported errors.
>
> On that box and nowhere else I see empty snapshots taking 17 or 18 seconds
> to write. Everywhere else they return in under a second.
>
> Using truss and the last published source code, it looks like the pause is
> between a printf and the call to zfs_ioctl and there aren't any other
> function calls between them:

For each snapshot in a stream, there is one zfs_ioctl() call. During that
time, the kernel will read the entire substream (that is, for one snapshot)
from the input file descriptor.

> 100.5124  0.0004  open("/dev/zfs", O_RDWR|O_EXCL)              = 10
> 100.7582  0.0001  read(7, "\0\0\0\0\0\0\0\0ACCBBAF5".., 312)    = 312
> 100.7586  0.      read(7, 0x080464F8, 0)                        = 0
> 100.7591  0.      time()                                        = 1336628656
> 100.7653  0.0035  ioctl(8, ZFS_IOC_OBJSET_STATS, 0x08040CF0)    = 0
> 100.7699  0.0022  ioctl(8, ZFS_IOC_OBJSET_STATS, 0x08040900)    = 0
> 100.7740  0.0016  ioctl(8, ZFS_IOC_OBJSET_STATS, 0x08040580)    = 0
> 100.7787  0.0026  ioctl(8, ZFS_IOC_OBJSET_STATS, 0x080405B0)    = 0
> 100.7794  0.0001  write(1, " r e c e i v i n g   i n".., 75)    = 75
> 118.3551  0.6927  ioctl(8, ZFS_IOC_RECV, 0x08042570)            = 0
> 118.3596  0.0010  ioctl(8, ZFS_IOC_OBJSET_STATS, 0x08040900)    = 0
> 118.3598  0.      time()                                        = 1336628673
> 118.3600  0.      write(1, " r e c e i v e d   3 1 2".., 45)    = 45
>
> zpool iostat (1 second interval) for the period is:
>
> tank        12.5T  6.58T    175      0   271K      0
> tank        12.5T  6.58T    176      0   299K      0
> tank        12.5T  6.58T    189      0   259K      0
> tank        12.5T  6.58T    156      0   231K      0
> tank        12.5T  6.58T    170      0   243K      0
> tank        12.5T  6.58T    252      0   295K      0
> tank        12.5T  6.58T    179      0   200K      0
> tank        12.5T  6.58T    214      0   258K      0
> tank        12.5T  6.58T    165      0   210K      0
> tank        12.5T  6.58T    154      0   178K      0
> tank        12.5T  6.58T    186      0   221K      0
> tank        12.5T  6.58T    184      0   215K      0
> tank        12.5T  6.58T    218      0   248K      0
> tank        12.5T  6.58T    175      0   228K      0
> tank        12.5T  6.58T    146      0   194K      0
> tank        12.5T  6.58T     99    258   209K  1.50M
> tank        12.5T  6.58T    196    296   294K  1.31M
> tank        12.5T  6.58T    188    130   229K   776K
>
> Can anyone offer any insight or further debugging tips?

I have yet to see a time when zpool iostat tells me something useful. I'd
take a look at "iostat -xzn 1" or similar output. It could point to
imbalanced I/O or a particular disk that has abnormally high service times.

Have you installed any SRUs? If not, you could be seeing:

7060894 zfs recv is excruciatingly slow

which is fixed in Solaris 11 SRU 5. If you are using zones and are using any
https pkg(5) origins (such as https://pkg.oracle.com/solaris/support), I
suggest reading

https://forums.oracle.com/forums/thread.jspa?threadID=2380689&tstart=15

before updating to SRU 6 (SRU 5 is fine, however). The fix for the problem
mentioned in that forums thread should show up in an upcoming SRU via CR
7157313.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] test for holes in a file?
On Mon, Mar 26, 2012 at 6:18 PM, Bob Friesenhahn wrote:
> On Mon, 26 Mar 2012, Andrew Gabriel wrote:
>
>> I just played and knocked this up (note the stunning lack of comments,
>> missing optarg processing, etc)...
>> Give it a list of files to check...
>
> This is a cool program, but programmers were asking (and answering) this
> same question 20+ years ago before there was anything like SEEK_HOLE.
>
> If file space usage is less than file directory size then it must contain
> a hole. Even for compressed files, I am pretty sure that Solaris reports
> the uncompressed space usage.

That's not the case.

# zfs create -o compression=on rpool/junk
# perl -e 'print "foo" x 10' > /rpool/junk/foo
# ls -ld /rpool/junk/foo
-rw-r--r--   1 root     root          30 Mar 26 18:25 /rpool/junk/foo
# du -h /rpool/junk/foo
 16K   /rpool/junk/foo
# truss -t stat -v stat du /rpool/junk/foo
...
lstat64("foo", 0x08047C40)                      = 0
    d=0x02B90028 i=8 m=0100644 l=1  u=0     g=0     sz=30
        at = Mar 26 18:25:25 CDT 2012  [ 1332804325.742827733 ]
        mt = Mar 26 18:25:25 CDT 2012  [ 1332804325.889143166 ]
        ct = Mar 26 18:25:25 CDT 2012  [ 1332804325.889143166 ]
    bsz=131072 blks=32    fs=zfs

Notice that it says it has 32 512 byte blocks. The mechanism you suggest
does work for every other file system that I've tried it on.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] test for holes in a file?
2012/3/26 ольга крыжановская:
> How can I test if a file on ZFS has holes, i.e. is a sparse file,
> using the C api?

See SEEK_HOLE in lseek(2).

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] Any rhyme or reason to disk dev names?
On Wed, Dec 21, 2011 at 1:58 AM, Matthew R. Wilson wrote:
> Hello,
>
> I am curious to know if there is an easy way to guess or identify the
> device names of disks. Previously the /dev/dsk/c0t0d0s0 system made sense
> to me... I had a SATA controller card with 8 ports, and they showed up
> with the numbers 1-8 in the "t" position of the device name.
>
> But I just built a new system with two LSI SAS HBAs in it, and my device
> names are along the lines of:
> /dev/dsk/c0t5000CCA228C0E488d0
>
> I could not find any correlation between that identifier and the a)
> controller the disk was plugged in to, or b) the port number on the
> controller. The only way I could make a mapping of device name to
> controller port was to add one drive at a time, reboot the system, and run
> "format" to see which new disk name shows up.
>
> I'm guessing there's a better way, but I can't find any obvious answer as
> to how to determine which port on my LSI controller card will correspond
> with which seemingly random device name. Can anyone offer any suggestions
> on a way to predict the device naming, or at least get the system to list
> the disks after I insert one without rebooting?

Depending on the hardware you are using, you may be able to benefit from
croinfo.

$ croinfo
D:devchassis-path                   t:occupant-type  c:occupant-compdev
----------------------------------  ---------------  ---------------------
/dev/chassis//SYS/SASBP/HDD0/disk   disk             c0t5000CCA012B66E90d0
/dev/chassis//SYS/SASBP/HDD1/disk   disk             c0t5000CCA012B68AC8d0

The text in the left column represents text that should be printed on the
corresponding disk slots.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] gaining access to var from a live cd
On Tue, Nov 29, 2011 at 4:40 PM, Francois Dion wrote:
> It is on openindiana 151a, no separate /var as far as I can tell. But I'll
> have to test this on solaris11 too when I get a chance.
>
> The problem is that if I
>
> zfs mount -o mountpoint=/tmp/rescue (or whatever) rpool/ROOT/openindiana
>
> I get a "cannot mount /mnt/rpool: directory is not empty".
>
> The reason for that is that I had to do a zpool import -R /mnt/rpool
> rpool (or wherever I mount it, it doesn't matter) before I could do a
> zfs mount, else I don't have access to the rpool zpool for zfs to do
> its thing.
>
> chicken / egg situation? I miss the old failsafe boot menu...

You can mount it pretty much anywhere:

mkdir /tmp/foo
zfs mount -o mountpoint=/tmp/foo ...

I'm not sure when the temporary mountpoint option (-o mountpoint=...) came
in. If it's not valid syntax then:

mount -F zfs rpool/ROOT/solaris /tmp/foo

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] gaining access to var from a live cd
On Tue, Nov 29, 2011 at 3:01 PM, Francois Dion wrote:
> I've hit an interesting (not) problem. I need to remove a problematic
> ld.config file (due to an improper crle...) to boot my laptop. This is
> OI 151a, but fundamentally this is zfs, so I'm asking here.
>
> What I did after booting the live cd and su:
> mkdir /tmp/disk
> zpool import -R /tmp/disk -f rpool
>
> export shows up in there and rpool also, but in rpool there is only
> boot and etc.
>
> zfs list shows rpool/ROOT/openindiana as mounted on /tmp/disk and I
> see dump and swap, but no var. rpool/ROOT shows as legacy, so I
> figured, maybe mount that.
>
> mount -F zfs rpool/ROOT /mnt/rpool

That dataset (rpool/ROOT) should never have any files in it. It is just a
"container" for boot environments. You can see which boot environments exist
with:

zfs list -r rpool/ROOT

If you are running Solaris 11, the boot environment's root dataset will show
a mountpoint property value of /. Assuming it is called "solaris" you can
mount it with:

zfs mount -o mountpoint=/mnt/rpool rpool/ROOT/solaris

If the system is running Solaris 11 (and was not updated from Solaris 11
Express), it will have a separate /var dataset.

zfs mount -o mountpoint=/mnt/rpool/var rpool/ROOT/solaris/var

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] FS Reliability WAS: about btrfs and zfs
On Fri, Oct 21, 2011 at 8:02 PM, Fred Liu wrote:
>> 3. Do NOT let a system see drives with more than one OS zpool at the
>> same time (I know you _can_ do this safely, but I have seen too many
>> horror stories on this list that I just avoid it).
>
> Can you elaborate on #3? In what situation will it happen?

Some people have trained their fingers to use the -f option on every command
that supports it to force the operation. For instance, how often do you do
rm -rf vs. rm -r and answer questions about every file? If various zpool
commands (import, create, replace, etc.) are used against the wrong disk
with a force option, you can clobber a zpool that is in active use by
another system.

In a previous job, my lab environment had a bunch of LUNs presented to
multiple boxes. This was done for convenience in an environment where there
would be little impact if an errant command were issued. I'd never do that
in production without some form of I/O fencing in place.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] Kernel panic on zpool import. 200G of data inaccessible!
On Thu, Aug 4, 2011 at 2:47 PM, Stuart James Whitefish wrote:
> # zpool import -f tank
>
> http://imageshack.us/photo/my-images/13/zfsimportfail.jpg/

I encourage you to open a support case and ask for an escalation on CR
7056738.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] zfs rename query
On Wed, Jul 27, 2011 at 6:37 AM, Nishchaya Bahuguna wrote:
> Hi,
>
> I have a query regarding the zfs rename command.
>
> There are 5 zones and my requirement is to change the zone paths using zfs
> rename.
>
> + zoneadm list -cv
>   ID NAME        STATUS   PATH                 BRAND    IP
>    0 global      running  /                    native   shared
>   34 public      running  /txzone/public       native   shared
>   35 internal    running  /txzone/internal     native   shared
>   36 restricted  running  /txzone/restricted   native   shared
>   37 needtoknow  running  /txzone/needtoknow   native   shared
>   38 sandbox     running  /txzone/sandbox      native   shared
>
> A whole root zone was configured and installed. The rest of the 4 zones
> were cloned from it.
>
> zoneadm -z clone public
>
> zfs get origin lists the origin as for all 4 zones.
>
> I run zfs rename on 4 of these cloned zones and it throws a device busy
> error because of the parent-child relationship.

I think you are getting the device busy error for a different reason. I
just did the following:

zfs create -o mountpoint=/zones rpool/zones
zonecfg -z z1 'create; set zonepath=/zones/z1'
zoneadm -z z1 install
zonecfg -z z1c1 'create -t z1; set zonepath=/zones/z1c1'
zonecfg -z z1c2 'create -t z1; set zonepath=/zones/z1c2'
zoneadm -z z1c1 clone z1
zoneadm -z z1c2 clone z1

At this point, I have the following:

bash-3.2# zfs list -r -o name,origin rpool/zones
NAME                       ORIGIN
rpool/zones                -
rpool/zones/z1             -
rpool/zones/z1@SUNWzone1   -
rpool/zones/z1@SUNWzone2   -
rpool/zones/z1c1           rpool/zones/z1@SUNWzone1
rpool/zones/z1c2           rpool/zones/z1@SUNWzone2

Next, I decide that I would like z1c1 to be rpool/new/z1c1 instead of its
current place. Note that this will also change the mountpoint, which breaks
the zone.

bash-3.2# zfs create -o mountpoint=/new rpool/new
bash-3.2# zfs rename rpool/zones/z1c1 rpool/new/z1c1
bash-3.2# zfs list -o name,origin -r /new
NAME            ORIGIN
rpool/new       -
rpool/new/z1c1  rpool/zones/z1@SUNWzone1

To get a "device busy" error, I need to cause a situation where the zonepath
cannot be unmounted. Having the zone running is a good way to do that:

bash-3.2# zoneadm -z z1c2 boot
WARNING: zone z1c1 is installed, but its zonepath /zones/z1c1 does not exist.
bash-3.2# zfs rename rpool/zones/z1c2 rpool/new/z1c2
cannot unmount '/zones/z1c2': Device busy

> I guess that can be handled with zfs promote because promote would swap
> the parent and child.

You would need to do this to rename a dataset that is the origin (one that
is cloned), not the clones. That is, if you wanted to rename the dataset for
your public zone or I wanted to rename the dataset for z1, then you would
need to promote the datasets for all of the clones. This is a known issue.

6472202 'zfs rollback' and 'zfs rename' require that clones be unmounted

> So, how do I make it work when there are multiple zones cloned from a
> single parent? Is there a way that zfs rename can work for ALL the zones
> rather than working with two zones at a time?

As I said above.

> Also, is there a command line option available for sorting the datasets in
> correct dependency order?

"zfs list -r -o name,origin" is a good starting point. I suspect that it
doesn't give you exactly the output you are looking for.
FWIW, the best way to achieve what you are after without breaking the zones
is going to be along the lines of:

zlogin z1c1 init 0
zoneadm -z z1c1 detach
zfs rename rpool/zones/z1c1 rpool/new/z1c1
zonecfg -z z1c1 'set zonepath=/new/z1c1'
zoneadm -z z1c1 attach
zoneadm -z z1c1 boot

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] What is ".$EXTEND/$QUOTA" ?
On Tue, Jul 19, 2011 at 2:39 PM, Orvar Korvar wrote:
> I am using S11E, and have created a zpool on a single disk as storage. In
> several directories, I can see a directory called ".$EXTEND/$QUOTA". What
> is it for? Can I delete it?

Perhaps this is of help.

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/smbsrv/smb_pathname.c#752

752 /*
753  * smb_pathname_preprocess_quota
754  *
755  * There is a special file required by windows so that the quota
756  * tab will be displayed by windows clients.  This is created in
757  * a special directory, $EXTEND, at the root of the shared file
758  * system.  To hide this directory prepend a '.' (dot).
759  */

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
> not only on hardware built for dedicated storage.
>
> Sparse-root vs. full-root zones, or disk images of VMs;
> are they stuffed in one rpool or spread between rpool and
> data pools - that detail is not actually the point of the thread.
>
> Actual useability of dedup for savings and gains on these
> tasks (preferably working also on low-mid-range boxes,
> where adding a good enterprise SSD would double the
> server cost - not only on those big good systems with
> tens of GB of RAM), and hopefully simplifying the system
> configuration and maintenance - that is indeed the point
> in question.
>
> //Jim

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] Non-Global zone recovery
On Thu, Jul 7, 2011 at 2:41 PM, Ram kumar wrote:
> Hi Cindy,
>
> Thanks for the email.
>
> We are using Solaris 10 without Live Upgrade.
>
> Tested the following in the sandbox environment:
>
> 1) We have one non-global zone (TestZone) which is running on the Test
>    zpool (SAN)
>
> 2) Don't see the zpool or the non-global zone after re-image of the global
>    zone.
>
> 3) Imported zpool Test
>
> Now I am trying to create the non-global zone and it is giving an error:
>
> bash-3.00# zonecfg -z Test
> Test: No such zone configured
> Use 'create' to begin configuring a new zone.
> zonecfg:Test> create -a /zones/Test
> invalid path to detached zone

If you use create -a, it requires that SUNWdetached.xml exist as a means for
configuring the various properties (e.g. zonepath, brand, etc.) and
resources (inherit-pkg-dir, net, fs, device, etc.) for the zone. Since you
don't have the SUNWdetached.xml, you can't use it.

Assuming you have a backup of the system, you could restore a copy of
/etc/zones/ to /etc/zones/restored-.xml, then run:

zonecfg -z create -t restored-

If that's not an option or is just too inconvenient, use zonecfg to
configure the zone just like you did initially. That is, do not use
"create -a"; use "create", "create -b", or "create -t " followed by
whatever property settings and added resources are appropriate. After you
get past zonecfg, you should be able to:

zoneadm -z attach

If the package and patch levels don't match up (the global zone perhaps was
installed from a newer update or has newer patches):

zoneadm -z attach -U
or
zoneadm -z attach -u

Since you seem to be doing this in a test environment to prepare for bad
things to happen, I'd suggest that you make it a standard practice when you
are done configuring a zone to do:

zonecfg -z export > /zonecfg.export

Then if you need to recover the zone using only the things that are on the
SAN, you can do:

zpool import ...
zonecfg -z -f /zonecfg.export
zoneadm -z attach [-u|-U]

Any follow-ups should probably go to Oracle Support or zones-discuss. Your
problems are not related to zfs.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] FW: Solaris panic
ippy genunix: [ID 540533 kern.notice] ^MSunOS Release 5.11
> Version snv_151a 64-bit
> Mar 17 15:28:51 zippy genunix: [ID 877030 kern.notice] Copyright (c) 1983,
> 2010, Oracle and/or its affiliates. All rights reserved.
>
> Can anyone help?
>
> Regards
> Karl

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
[zfs-discuss] zfs-nfs-sun 7000 series
Hello,

I have a Sun 7000 series NAS device. I am trying to back it up via an NFS
mount on a Solaris 10 server running Networker 7.6.1. It works, but it is
extremely slow; I have tested other mounts and they work much faster. The
only difference (that I can see) between the two mounts is the underlying
file system, zfs vs ufs. Any thoughts on how to speed up the backup of the
Sun 7000 nfs mount?

Thank you.

Mike MacNeil
Global IT Infrastructure
4281 Harvester Rd. Burlington, ON L7L 5M4 Canada
Phone: 905 632 2999 ext. 2920
Fax: 905 632 2055
Email: mike.macn...@gennum.com
www.gennum.com
Re: [zfs-discuss] External SATA drive enclosures + ZFS?
On 2/25/2011 7:34 PM, Rich Teer wrote:
>
> One product that seems to fit the bill is the StarTech.com S352U2RER,
> an external dual SATA disk enclosure with USB and eSATA connectivity
> (I'd be using the USB port). Here's a link to the specific product
> I'm considering:
>
> http://ca.startech.com/product/S352U2RER-35in-eSATA-USB-Dual-SATA-Hot-Swap-External-RAID-Hard-Drive-Enclosure

I have had mixed results with their 4 bay version. When they work, they are
great, but we have had a number of DOA/almost DOA units.

I have had good luck with products from http://www.addonics.com/ (they ship
to Canada as well without issue).

Why use USB? You will get much better performance/throughput on eSATA (if
you have good drivers, of course). I use their sil3124 eSATA controller on
FreeBSD as well as a number of PM units and they work great.

        ---Mike

-- 
---
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, m...@sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada   http://www.tancsa.com/
Re: [zfs-discuss] multiple disk failure (solved?)
On 1/31/2011 4:19 PM, Mike Tancsa wrote:
> On 1/31/2011 3:14 PM, Cindy Swearingen wrote:
>> Hi Mike,
>>
>> Yes, this is looking much better.
>>
>> Some combination of removing corrupted files indicated in the zpool
>> status -v output, running zpool scrub and then zpool clear should
>> resolve the corruption, but it depends on how bad the corruption is.
>>
>> First, I would try the least destructive method: Try to remove the
>> files listed below by using the rm command.
>>
>> This entry probably means that the metadata is corrupted or some
>> other file (like a temp file) no longer exists:
>>
>> tank1/argus-data:<0xc6>
>
> Hi Cindy,
>     I removed the files that were listed, and now I am left with
>
> errors: Permanent errors have been detected in the following files:
>
>         tank1/argus-data:<0xc5>
>         tank1/argus-data:<0xc6>
>         tank1/argus-data:<0xc7>
>
> I have started a scrub
>  scrub: scrub in progress for 0h48m, 10.90% done, 6h35m to go

Looks like that was it! The scrub finished in the time it estimated and that
was all I needed to do. I did not have to do zpool clear or any other
commands. Is there anything beyond scrub to check the integrity of the pool?

0(offsite)# zpool status -v
  pool: tank1
 state: ONLINE
 scrub: scrub completed after 7h32m with 0 errors on Mon Jan 31 23:00:46 2011
config:

        NAME        STATE     READ WRITE CKSUM
        tank1       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad0     ONLINE       0     0     0
            ad1     ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada0    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada8    ONLINE       0     0     0
            ada7    ONLINE       0     0     0
            ada6    ONLINE       0     0     0

errors: No known data errors
0(offsite)#

        ---Mike
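As a hedged aside on the closing question (whether anything beyond scrub can
verify pool integrity): zdb can do an independent, read-only checksum
traversal. This is only a sketch, not an official procedure - zdb is a
debugging tool, it is slow, and its results are most trustworthy on a quiet
or exported pool (pool name taken from the thread):

# -c verifies checksums of metadata while gathering block statistics;
# giving the flag twice (-cc) also verifies data block checksums
zdb -cc tank1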
Re: [zfs-discuss] multiple disk failure (solved?)
On 1/31/2011 3:14 PM, Cindy Swearingen wrote:
> Hi Mike,
>
> Yes, this is looking much better.
>
> Some combination of removing corrupted files indicated in the zpool
> status -v output, running zpool scrub and then zpool clear should
> resolve the corruption, but it depends on how bad the corruption is.
>
> First, I would try the least destructive method: Try to remove the
> files listed below by using the rm command.
>
> This entry probably means that the metadata is corrupted or some
> other file (like a temp file) no longer exists:
>
> tank1/argus-data:<0xc6>

Hi Cindy,
    I removed the files that were listed, and now I am left with

errors: Permanent errors have been detected in the following files:

        tank1/argus-data:<0xc5>
        tank1/argus-data:<0xc6>
        tank1/argus-data:<0xc7>

I have started a scrub

 scrub: scrub in progress for 0h48m, 10.90% done, 6h35m to go

I will report back once the scrub is done!

        ---Mike
Re: [zfs-discuss] multiple disk failure (solved?)
On 1/29/2011 6:18 PM, Richard Elling wrote:
>
> On Jan 29, 2011, at 12:58 PM, Mike Tancsa wrote:
>
>> On 1/29/2011 12:57 PM, Richard Elling wrote:
>>>> 0(offsite)# zpool status
>>>>   pool: tank1
>>>>  state: UNAVAIL
>>>> status: One or more devices could not be opened. There are insufficient
>>>>         replicas for the pool to continue functioning.
>>>> action: Attach the missing device and online it using 'zpool online'.
>>>>    see: http://www.sun.com/msg/ZFS-8000-3C
>>>>  scrub: none requested
>>>> config:
>>>>
>>>>         NAME        STATE     READ WRITE CKSUM
>>>>         tank1       UNAVAIL      0     0     0  insufficient replicas
>>>>           raidz1    ONLINE       0     0     0
>>>>             ad0     ONLINE       0     0     0
>>>>             ad1     ONLINE       0     0     0
>>>>             ad4     ONLINE       0     0     0
>>>>             ad6     ONLINE       0     0     0
>>>>           raidz1    ONLINE       0     0     0
>>>>             ada4    ONLINE       0     0     0
>>>>             ada5    ONLINE       0     0     0
>>>>             ada6    ONLINE       0     0     0
>>>>             ada7    ONLINE       0     0     0
>>>>           raidz1    UNAVAIL      0     0     0  insufficient replicas
>>>>             ada0    UNAVAIL      0     0     0  cannot open
>>>>             ada1    UNAVAIL      0     0     0  cannot open
>>>>             ada2    UNAVAIL      0     0     0  cannot open
>>>>             ada3    UNAVAIL      0     0     0  cannot open
>>>> 0(offsite)#
>>>
>>> This is usually easily solved without data loss by making the
>>> disks available again. Can you read anything from the disks using
>>> any program?
>>
>> That's the strange thing, the disks are readable. The drive cage just
>> reset a couple of times prior to the crash. But they seem OK now. Same
>> order as well.
>>
>> # camcontrol devlist
>> at scbus0 target 0 lun 0 (pass0,ada0)
>> at scbus0 target 1 lun 0 (pass1,ada1)
>> at scbus0 target 2 lun 0 (pass2,ada2)
>> at scbus0 target 3 lun 0 (pass3,ada3)
>>
>> # dd if=/dev/ada2 of=/dev/null count=20 bs=1024
>> 20+0 records in
>> 20+0 records out
>> 20480 bytes transferred in 0.001634 secs (12534561 bytes/sec)
>> 0(offsite)#
>
> The next step is to run "zdb -l" and look for all 4 labels. Something like:
>     zdb -l /dev/ada2
>
> If all 4 labels exist for each drive and appear intact, then look more
> closely at how the OS locates the vdevs. If you can't solve the "UNAVAIL"
> problem, you won't be able to import the pool.
>  -- richard

On 1/29/2011 10:13 PM, James R. Van Artsdalen wrote:
> On 1/28/2011 4:46 PM, Mike Tancsa wrote:
>>
>> I had just added another set of disks to my zfs array. It looks like the
>> drive cage with the new drives is faulty. I had added a couple of files
>> to the main pool, but not much. Is there any way to restore the pool
>> below? I have a lot of files on ad0,1,4,6 and ada4,5,6,7 and perhaps
>> one file on the new drives in the bad cage.
>
> Get another enclosure and verify it works OK. Then move the disks from
> the suspect enclosure to the tested enclosure and try to import the pool.
>
> The problem may be cabling or the controller instead - you didn't
> specify how the disks were attached or which version of FreeBSD you're
> using.

First off, thanks to all who responded on and off list! Good news (for me),
it seems. New cage and all seems to be recognized correctly. The history is...

2010-04-22.14:27:38 zpool add tank1 raidz /dev/ada4 /dev/ada5 /dev/ada6 /dev/ada7
2010-06-11.13:49:33 zfs create tank1/argus-data
2010-06-11.13:49:41 zfs create tank1/argus-data/previous
2010-06-11.13:50:38 zfs set compression=off tank1/argus-data
2010-08-06.12:20:59 zpool replace tank1 ad1 ad1
2010-09-16.10:17:51 zpool upgrade -a
2011-01-28.11:45:43 zpool add tank1 raidz /dev/ada0 /dev/ada1 /dev/ada2 /dev/ada3

FreeBSD RELENG_8 from last week, 8G of RAM, amd64.

zpool status -v
  pool: tank1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank1       ONLINE       0     0     0
Re: [zfs-discuss] multiple disk failure
On 1/30/2011 12:39 AM, Richard Elling wrote:
>> Hmmm, doesn't look good on any of the drives.
>
> I'm not sure of the way BSD enumerates devices. Some clever person thought
> that hiding the partition or slice would be useful. I don't find it
> useful. On a Solaris system, ZFS can show a disk something like c0t1d0,
> but that doesn't exist. The actual data is in slice 0, so you need to use
> c0t1d0s0 as the argument to zdb.

I think it's the right syntax. On the older drives,

0(offsite)# zdb -l /dev/ada0
LABEL 0
failed to unpack label 0
LABEL 1
failed to unpack label 1
LABEL 2
failed to unpack label 2
LABEL 3
failed to unpack label 3

0(offsite)# zdb -l /dev/ada4
LABEL 0
    version=15
    name='tank1'
    state=0
    txg=44593174
    pool_guid=7336939736750289319
    hostid=3221266864
    hostname='offsite.sentex.ca'
    top_guid=6980939370923808328
    guid=16144392433229115618
    vdev_tree
        type='raidz'
        id=1
        guid=6980939370923808328
        nparity=1
        metaslab_array=38
        metaslab_shift=35
        ashift=9
        asize=4000799784960
        is_log=0
        children[0]
                type='disk'
                id=0
                guid=16144392433229115618
                path='/dev/ada4'
                whole_disk=0
                DTL=341
        children[1]
                type='disk'
                id=1
                guid=1210677308003674848
                path='/dev/ada5'
                whole_disk=0
                DTL=340
        children[2]
                type='disk'
                id=2
                guid=2517076601231706249
                path='/dev/ada6'
                whole_disk=0
                DTL=339
        children[3]
                type='disk'
                id=3
                guid=16621760039941477713
                path='/dev/ada7'
                whole_disk=0
                DTL=338
LABEL 1
    version=15
    name='tank1'
    state=0
    txg=44592523
    pool_guid=7336939736750289319
    hostid=3221266864
    hostname='offsite.sentex.ca'
    top_guid=6980939370923808328
    guid=16144392433229115618
    vdev_tree
        type='raidz'
        id=1
        guid=6980939370923808328
        nparity=1
        metaslab_array=38
        metaslab_shift=35
        ashift=9
        asize=4000799784960
        is_log=0
        children[0]
                type='disk'
                id=0
                guid=16144392433229115618
                path='/dev/ada4'
                whole_disk=0
                DTL=341
        children[1]
                type='disk'
                id=1
                guid=1210677308003674848
                path='/dev/ada5'
                whole_disk=0
                DTL=340
        children[2]
                type='disk'
                id=2
                guid=2517076601231706249
                path='/dev/ada6'
                whole_disk=0
                DTL=339
        children[3]
                type='disk'
                id=3
                guid=16621760039941477713
                path='/dev/ada7'
                whole_disk=0
                DTL=338
LABEL 2
    version=15
    name='tank1'
    state=0
    txg=44593174
    pool_guid=7336939736750289319
    hostid=3221266864
    hostname='offsite.sentex.ca'
    top_guid=6980939370923808328
    guid=16144392433229115618
    vdev_tree
        type='raidz'
        id=1
        guid=6980939370923808328
        nparity=1
        metaslab_array=38
        metaslab_shift=35
        ashift=9
        asize=4000799784960
        is_log=0
        children[0]
                type='disk'
                id=0
                guid=16144392433229115618
                path='/dev/ada4'
                whole_disk=0
                DTL=341
        children[1]
                type='disk'
                id=1
                guid=1210677308003674848
                path='/dev/ada5'
                whole_disk=0
                DTL=340
        children[2]
                type='disk'
                id=2
                guid=2517076601231706249
                path='/dev/ada6'
                whole_disk=0
                DTL=339
        children[3]
                type='disk'
                id=3
                guid=16621760039941477713
                path='/dev/ada7'
                whole_disk=0
                DTL=338
Re: [zfs-discuss] multiple disk failure
On 1/29/2011 6:18 PM, Richard Elling wrote:
>> 0(offsite)#
>
> The next step is to run "zdb -l" and look for all 4 labels. Something like:
>     zdb -l /dev/ada2
>
> If all 4 labels exist for each drive and appear intact, then look more
> closely at how the OS locates the vdevs. If you can't solve the "UNAVAIL"
> problem, you won't be able to import the pool.

Hmmm, doesn't look good on any of the drives. Before I give up, I will try
the drives in a different cage Monday. Unfortunately, it's 150km away from
me at our DR site.

# zdb -l /dev/ada0
LABEL 0
failed to unpack label 0
LABEL 1
failed to unpack label 1
LABEL 2
failed to unpack label 2
LABEL 3
failed to unpack label 3
Re: [zfs-discuss] multiple disk failure
On 1/29/2011 11:38 AM, Edward Ned Harvey wrote:
> That is precisely the reason why you always want to spread your
> mirror/raidz devices across multiple controllers or chassis. If you lose a
> controller or a whole chassis, you lose one device from each vdev, and
> you're able to continue production in a degraded state...

Thanks. These are backups of backups. It would be nice to restore them, as
it will take a while to sync up once again. But if I need to start fresh, is
there a resource you can point me to with the current best practices for
laying out large storage like this? It's just for backups of backups in a DR
site.

        ---Mike
Re: [zfs-discuss] multiple disk failure
On 1/29/2011 12:57 PM, Richard Elling wrote:
>> 0(offsite)# zpool status
>>   pool: tank1
>>  state: UNAVAIL
>> status: One or more devices could not be opened. There are insufficient
>>         replicas for the pool to continue functioning.
>> action: Attach the missing device and online it using 'zpool online'.
>>    see: http://www.sun.com/msg/ZFS-8000-3C
>>  scrub: none requested
>> config:
>>
>>         NAME        STATE     READ WRITE CKSUM
>>         tank1       UNAVAIL      0     0     0  insufficient replicas
>>           raidz1    ONLINE       0     0     0
>>             ad0     ONLINE       0     0     0
>>             ad1     ONLINE       0     0     0
>>             ad4     ONLINE       0     0     0
>>             ad6     ONLINE       0     0     0
>>           raidz1    ONLINE       0     0     0
>>             ada4    ONLINE       0     0     0
>>             ada5    ONLINE       0     0     0
>>             ada6    ONLINE       0     0     0
>>             ada7    ONLINE       0     0     0
>>           raidz1    UNAVAIL      0     0     0  insufficient replicas
>>             ada0    UNAVAIL      0     0     0  cannot open
>>             ada1    UNAVAIL      0     0     0  cannot open
>>             ada2    UNAVAIL      0     0     0  cannot open
>>             ada3    UNAVAIL      0     0     0  cannot open
>> 0(offsite)#
>
> This is usually easily solved without data loss by making the
> disks available again. Can you read anything from the disks using
> any program?

That's the strange thing, the disks are readable. The drive cage just reset
a couple of times prior to the crash. But they seem OK now. Same order as
well.

# camcontrol devlist
at scbus0 target 0 lun 0 (pass0,ada0)
at scbus0 target 1 lun 0 (pass1,ada1)
at scbus0 target 2 lun 0 (pass2,ada2)
at scbus0 target 3 lun 0 (pass3,ada3)

# dd if=/dev/ada2 of=/dev/null count=20 bs=1024
20+0 records in
20+0 records out
20480 bytes transferred in 0.001634 secs (12534561 bytes/sec)
0(offsite)#

        ---Mike
[zfs-discuss] multiple disk failure
Hi,

I am using FreeBSD 8.2 and went to add 4 new disks today to expand my
offsite storage. All was working fine for about 20 min and then the new
drive cage started to fail. Silly me for assuming new hardware would be
fine :(

The new drive cage started to fail, it hung the server and the box rebooted.
After it rebooted, the entire pool is gone and in the state below. I had
only written a few files to the new larger pool and I am not concerned about
restoring that data. However, is there a way to get back the original pool
data? Going to http://www.sun.com/msg/ZFS-8000-3C gives a 503 error on the
web page listed, BTW.

0(offsite)# zpool status
  pool: tank1
 state: UNAVAIL
status: One or more devices could not be opened. There are insufficient
        replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-3C
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank1       UNAVAIL      0     0     0  insufficient replicas
          raidz1    ONLINE       0     0     0
            ad0     ONLINE       0     0     0
            ad1     ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada4    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada6    ONLINE       0     0     0
            ada7    ONLINE       0     0     0
          raidz1    UNAVAIL      0     0     0  insufficient replicas
            ada0    UNAVAIL      0     0     0  cannot open
            ada1    UNAVAIL      0     0     0  cannot open
            ada2    UNAVAIL      0     0     0  cannot open
            ada3    UNAVAIL      0     0     0  cannot open
0(offsite)#
[zfs-discuss] zpool import crashes system
I am trying to bring my zpool from build 121 into build 134, and every time
I do a zpool import the system crashes. I have read other posts about this
and have tried setting zfs_recover = 1 and aok = 1 in /etc/system. I have
used mdb to verify that they are in the kernel, but the system still crashes
as soon as import is called.

On this system I can rebuild the entire pool from scratch, but my next
system is 4 TB and I don't have space on any other system to store that much
data. Anyone have a way to import and upgrade an older pool to a newer OS?

TIA
mic
Re: [zfs-discuss] hardware going bad
On Wed, Oct 27, 2010 at 3:41 PM, Harry Putnam wrote:
> I'm guessing it was probably more like 60 to 62 C under load. The
> temperature I posted was after something like 5 minutes of being
> totally shut down and the case having been open for a long while
> (months if not years).

What happens if the case is closed (and all PCI slot, disk, etc. slots are
closed)? Having the case open likely changes the way that air flows across
the various components. Also, if there is tobacco smoke near the machine, it
will cause a sticky build-up that likely contributes to heat dissipation
problems.

Perhaps this belongs somewhere other than zfs-discuss - it has nothing to do
with zfs.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] Moving the 17 zones from one LUN to another LUN
On Wed, Oct 27, 2010 at 9:27 AM, bhanu prakash wrote:
> Hi Mike,
>
> Thanks for the information...
>
> Actually the requirement is like this. Please let me know whether it
> matches the below requirement or not.
>
> Question:
>
> The SAN team will assign the new LUNs on EMC DMX4 (currently IBM Hitachi
> is there). We need to move the 17 containers which exist on the server
> Host1 to the new LUNs.
>
> Please give me the steps to do this activity.

Without knowing the layout of the storage, it is impossible to give you
precise instructions. This sounds like it is a production Solaris 10 system
in an enterprise environment. In most places that I've worked, I would be
hesitant to provide the required level of detail on a public mailing list.
Perhaps you should open a service call to get the assistance you need.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] Moving the 17 zones from one LUN to another LUN
On Tue, Oct 26, 2010 at 9:40 AM, bhanu prakash wrote:
> Hi Team,
>
> There are 17 zones on the machine, a T5120. I want to move all the zones,
> which are on a ZFS filesystem, to another new LUN.
>
> Can you give me the steps to proceed with this.

If the only thing on the source LUN is the pool that contains the zones and
the new LUN is at least as big as the old LUN:

zpool replace

The above can be done while the zones are booted. Depending on the
characteristics of the server and workloads, the workloads may feel a bit
sluggish during this time due to increased I/O activity. If that works for
you, stop reading now. In the event that the scenario above doesn't apply,
read on.

Assuming all the zones are under oldpool/zones, oldpool/zones is mounted at
/zones, and you have done "zpool create newpool ..."

Be sure to test this procedure - I didn't!

zfs create newlun/zones
# optionally, shut down the zones
zfs snapshot -r oldpool/zones@phase1
zfs send -r oldpool/zones@phase1 | zfs receive newpool/zones@phase1
# If you did not shut down the zones above, shut them down now.
# If the zones were shut down, skip the next two commands
zfs snapshot -r oldpool/zones@phase2
zfs send -rI oldpool/zones@phase1 oldpool/zones@phase2 \
    | zfs receive newpool/zones@phase2
# Adjust mount points and restart the zones
zfs set mountpoint=none oldpool/zones
zfs set mountpoint=/zones newpool/zones
for zone in $zonelist; do zoneadm -z $zone boot; done

At such a time that you are comfortable that the zone data moved over ok...

zfs destroy -r oldpool/zones

Again, verify the procedure works on a test/lab/whatever box before trying
it for real.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] making sense of arcstat.pl output
przemol,

Thanks for the feedback. I had incorrectly assumed that any machine running
the script would have L2ARC implemented (which is not the case with Solaris
10). I've added a check for this that allows the script to work on non-L2ARC
machines as long as you don't specify L2ARC stats on the command line.

http://github.com/mharsch/arcstat
Re: [zfs-discuss] making sense of arcstat.pl output
Hello Christian,

Thanks for bringing this to my attention. I believe I've fixed the rounding
error in the latest version.

http://github.com/mharsch/arcstat
Re: [zfs-discuss] making sense of arcstat.pl output
For posterity, I'd like to point out the following:

neel's original arcstat.pl uses a crude scaling routine that results in a
large loss of precision as numbers cross from Kilobytes to Megabytes to
Gigabytes. The 1G reported arc size case described here could actually be
anywhere between 1,000,000KB and 1,999,999KB. Use 'kstat zfs::arcstats' to
read the arc size directly from the kstats (for comparison).

I've updated arcstat.pl with a better scaling routine that returns more
appropriate results (similar to df -h human-readable output). I've also
added support for L2ARC stats. The updated version can be found here:

http://github.com/mharsch/arcstat
Re: [zfs-discuss] file level clones
On Mon, Sep 27, 2010 at 6:23 AM, Robert Milkowski wrote:
> Also see http://www.symantec.com/connect/virtualstoreserver

And
http://blog.scottlowe.org/2008/12/03/2031-enhancements-to-netapp-cloning-technology/

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] non-ECC Systems and ZFS for home users (was: Please warn a home user against OpenSolaris under VirtualBox under WinXP ; ))
On 9/23/2010 at 12:38 PM Erik Trimble wrote:
| [snip]
| If you don't really care about ultra-low-power, then there's absolutely
| no excuse not to buy a USED server-class machine which is 1- or 2-
| generations back. They're dirt cheap, readily available,
| [snip]

Anyone have a link or two to a place where I can buy some dirt-cheap,
readily available last gen servers?
Re: [zfs-discuss] Mac OS X clients with ZFS server
On Thu, Sep 16, 2010 at 08:15:53AM -0700, Rich Teer wrote:
> On Thu, 16 Sep 2010, Erik Ableson wrote:
>
> > OpenSolaris snv129
>
> Hmm, SXCE snv_130 here. Did you have to do any server-side tuning
> (e.g., allowing remote connections), or did it just work out of the
> box? I know that Sendmail needs some gentle persuasion to accept
> remote connections out of the box; perhaps lockd is the same?

So, you've been having this problem since April. Did you ever try getting
packet traces to see where the problem is? As I previously stated, if you
want, you can forward the traces to me to look at. Let me know if you need
the directions on how to capture them.

--macko
[zfs-discuss] recordsize
What are the ramifications of changing the recordsize of a zfs filesystem
that already has data on it? I want to tune down the recordsize to speed up
very small reads, to a size that is more in line with the read size. Can I
do this on a filesystem that already has data on it, and how does it affect
that data? The zpool consists of 8 SAN LUNs.

Thanks
mike
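For reference, a minimal sketch of the operation being asked about (the
pool/dataset names below are made up). A key point is that recordsize only
governs blocks written after the change; files already on disk keep their
existing block size until they are rewritten:

# check the current value (128K is the default)
zfs get recordsize tank/db

# lower it; only newly written files/blocks use the new size
zfs set recordsize=8k tank/db

# existing data is unaffected until rewritten, e.g. by copying it in place
cp /tank/db/datafile /tank/db/datafile.new && \
    mv /tank/db/datafile.new /tank/db/datafile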
Re: [zfs-discuss] Mac OS X clients with ZFS server
On Wed, Sep 15, 2010 at 12:08:20PM -0700, Nabil wrote:
> any resolution to this issue? I'm experiencing the same annoying
> lockd thing with mac osx 10.6 clients. I am at pool ver 14, fs ver
> 3. Would somehow going back to the earlier 8/2 setup make things
> better?

As noted in the earlier thread, the "annoying lockd thing" is not a ZFS
issue, but rather a networking issue. FWIW, I never saw a resolution. But
the suggestions for how to debug situations like this still stand:

> So, it looks like you need to investigate why the client isn't
> getting responses from the server's "lockd".
>
> This is usually caused by a firewall or NAT getting in the way.
> I would also check /var/log/system.log and /var/log/kernel.log on the Mac
> to see if any other useful messages are getting logged.
>
> Then I'd grab packet traces with wireshark/tcpdump/snoop *simultaneously*
> on the client and the server, reproduce the problem, and then determine
> which packets are being sent and which packets are being received.

HTH
--macko
Re: [zfs-discuss] How to migrate to 4KB sector drives?
On Sun, Sep 12, 2010 at 5:42 PM, Richard Elling wrote:
> On Sep 12, 2010, at 10:11 AM, Brandon High wrote:
>
>> On Sun, Sep 12, 2010 at 10:07 AM, Orvar Korvar wrote:
>>> No replies. Does this mean that you should avoid large drives with 4KB
>>> sectors, that is, new drives? ZFS does not handle new drives?
>>
>> Solaris 10u9 handles 4k sectors, so it might be in a post-b134 release of
>> osol.
>
> OSol source yes, binaries no :-( You will need another distro besides
> OpenSolaris.

The needed support in sd was added around the b137 timeframe. OpenIndiana,
to be released on Tuesday, is based on b146 or later.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] VM's on ZFS - 7210
On Sat, Aug 28, 2010 at 8:19 AM, Ray Van Dolson wrote:
> On Sat, Aug 28, 2010 at 05:50:38AM -0700, Eff Norwood wrote:
>> I can't think of an easy way to measure pages that have not been consumed
>> since it's really an SSD controller function which is obfuscated from the
>> OS, and add the variable of over provisioning on top of that. If anyone
>> would like to really get into what's going on inside of an SSD that makes
>> it a bad choice for a ZIL, you can start here:
>>
>> http://en.wikipedia.org/wiki/TRIM_%28SSD_command%29
>>
>> and
>>
>> http://en.wikipedia.org/wiki/Write_amplification
>>
>> Which will be more than you might have ever wanted to know. :)
>
> So has anyone on this list actually run into this issue? Tons of
> people use SSD-backed slog devices...
>
> The theory sounds "sound", but if it's not really happening much in
> practice then I'm not too worried. Especially when I can replace a
> drive from my slog mirror for $400 or so if problems do arise... (the
> alternative being much more expensive DRAM backed devices)

Presumably this problem is being worked...

http://hg.genunix.org/onnv-gate.hg/rev/d560524b6bb6

Notice that it implements:

866610 Add SATA TRIM support

With this in place, I would imagine a next step is for zfs to issue TRIM
commands as zil entries have been committed to the data disks.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] Halcyon ZFS and system monitoring software for OpenSolaris (beta)
Update: version 3.2.5 out now, with changes to better support snv_134: http://forums.halcyoninc.com/showthread.php?t=368 If you've downloaded v3.2.4 and are on 09/06, there is no reason to upgrade. Regards, mike.k...@halcyoninc.com -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] shrink zpool
Is it currently possible, or likely in the near future, to shrink a zpool ("remove a disk")? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Halcyon ZFS and system monitoring software for OpenSolaris (beta)
Hi zfs user, > Is the beta free? for how long? if not how much for 5 machines? Everything on our web site (including the beta) runs for 30 days with the baked-in license. After 30 days it will stop collecting fresh numbers, unless you add a license key, or a demo extension file from the sales team (or reinstall it and start over again). > If you are going to post about your commercial products - please include > some price points, so people know whether to ignore the info based on > their budget. You're right, it would be nice if people could just go to our version of "shop.oracle.com", but we're not there yet, and I don't have the price sheets the sales guys do to put those numbers in the forum. If you're still interested, please email me and I can put you in touch with someone who can directly deal with your pricing questions, without going through our web page or the sales alias etc. Thanks! mike.k...@halcyoninc.com -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Halcyon ZFS and system monitoring software for OpenSolaris (beta)
Hi wonslung, Thanks for posting to our forum: I'll respond there and take things off-list. Sounds like it's the same bug that appeared with the Sol10 July EIS: (which snv_134 obviously got the changes for first, and that wasn't in 09/06). Fixing it now... mike.k...@halcyoninc.com -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Halcyon ZFS and system monitoring software for OpenSolaris (beta)
Hi all, Halcyon recently started to add ZFS pool stats to our Solaris Agent, and because many people were interested in the previous OpenSolaris beta* we've rolled it into our OpenSolaris build as well. I've already heard some great feedback about supporting ZIL and ARC stats, which we're hoping to add soon. If you'd like to see what we have now, and maybe try it on your OpenSolaris system, please see the download/screenshot page here: http://forums.halcyoninc.com/showthread.php?p=1018 I know this isn't the best time to be posting about legacy OpenSolaris: we're keeping our eyes on Solaris 11 Express / Illumos and aim to support the more advanced features of Solaris 11 the day it's pushed out the door. Thanks for your time! Regards, Mike dot Kirk at HalcyonInc dot com * previous build: http://opensolaris.org/jive/thread.jspa?threadID=130507 -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New Supermicro SAS/SATA controller: AOC-USAS2-L8e in SOHO NAS and HD HTPC
What I would really like to know is why PCIe RAID controller cards cost more than an entire motherboard with processor. Some cards cost over $1,000. For what? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] EMC migration and zfs
Bump this up. Anyone? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS development moving behind closed doors
On 8/13/2010 at 8:56 PM Eric D. Mudama wrote: |On Fri, Aug 13 at 19:06, Frank Cusack wrote: |>Interesting POV, and I agree. Most of the many "distributions" of |>OpenSolaris had very little value-add. Nexenta was the most interesting |>and why should Oracle enable them to build a business at their expense? | |These distributions are, in theory, the "gateway drug" where people |can experiment inexpensively to try out new technologies (ZFS, dtrace, |crossbow, comstar, etc.) and eventually step up to Oracle's "big iron" |as their business grows. = Think: strategic business advantage. Oracle are not stupid, they recognize a jewel when they see one. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Moving /export to another zpool
On Fri, Aug 13, 2010 at 1:07 PM, Handojo wrote: >> Are the old /opt and /export still listed in your >> vfstab(4) file? > > I can't access /etc/vfstab because I can't even log in as my username. I can't > even log in as root from the Login Screen. > > And when I boot using the LiveCD, how can I mount my first drive that has > opensolaris installed? To list the zpools it can see: zpool import To import one called rpool at an alternate root: zpool import -R /mnt rpool -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] EMC migration and zfs
We are going to be migrating to a new EMC frame using Open Replicator. ZFS is sitting on volumes that are running MPxIO, so the controller number/disk number is going to change when we reboot the server. I would like to know if anyone has done this, and whether the ZFS filesystems will "just work" and find the new disk IDs when we go to import the pools. Our process would be:
- zpool export any and all pools on the server
- shut down the server
- re-zone the storage to the new EMC frame; EMC on the backend will present the old drives through the new frame/drives using Open Replicator
- boot the server to single-user mode
- zpool import the pools
- reboot the server.
-- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
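For what it's worth, a minimal sketch of that sequence as commands (the pool name "tank" is only a placeholder):
  # zpool export tank          (repeat for every pool)
  ... shut down, re-zone the storage, boot -s ...
  # zpool import               (with no arguments: scans the device tree and lists importable pools under their new device names)
  # zpool import tank
ZFS identifies pool members by the labels written on the disks rather than by the c#t#d# path, so a changed controller/target number should not matter as long as all of the LUNs are visible to the host.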
Re: [zfs-discuss] zfs allow does not work for rpool
That looks like it will work. I won't be able to test until late tonight. Thanks, mike -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs allow does not work for rpool
Thanks. Adding 'mount' did allow me to create the filesystem, but it still does not allow me to create the mountpoint. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zfs allow does not work for rpool
I am trying to give a general user permission to create ZFS filesystems in the rpool. zpool set delegation=on rpool zfs allow create rpool Both run without any issues. zfs allow rpool reports the user does have create permission. zfs create rpool/test cannot create rpool/test: permission denied. Can you not 'zfs allow' on the rpool? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
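For comparison, a minimal sketch of a delegation setup that should work (the account name "someuser" is only a placeholder, and this is from memory rather than a tested transcript):
  # zpool set delegation=on rpool
  # zfs allow someuser create,mount rpool
  # zfs allow rpool            (confirms the permissions are recorded)
Two things commonly trip this up: the delegated user needs 'mount' as well as 'create', since a new filesystem is mounted as part of 'zfs create', and the user must also be able to create the mountpoint directory itself at the POSIX level, e.g. the parent directory (/rpool here) has to be writable by, or owned by, that user.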
Re: [zfs-discuss] NFS performance?
On Mon, Jul 26, 2010 at 2:56 PM, Miles Nordin wrote: >>>>>> "mg" == Mike Gerdts writes: > mg> it is rather common to have multiple 1 Gb links to > mg> servers going to disparate switches so as to provide > mg> resilience in the face of switch failures. This is not unlike > mg> (at a block diagram level) the architecture that you see in > mg> pretty much every SAN. In such a configuation, it is > mg> reasonable for people to expect that load balancing will > mg> occur. > > nope. spanning tree removes all loops, which means between any two > points there will be only one enabled path. An L2-switched network > will look into L4 headers for splitting traffic across an aggregated > link (as long as it's been deliberately configured to do that---by > default probably only looks to L2), but it won't do any multipath > within the mesh. I was speaking more of IPMP, which is at layer 3. > Even with an L3 routing protocol it usually won't do multipath unless > the costs of the paths match exactly, so you'd want to build the > topology to achieve this and then do all switching at layer 3 by > making sure no VLAN is larger than a switch. By default, IPMP does outbound load spreading. Inbound load spreading is not practical with a single (non-test) IP address. If you have multiple virtual IP's you can spread them across all of the NICs in the IPMP group and get some degree of inbound spreading as well. This is the default behavior of the OpenSolaris IPMP implementation, last I looked. I've not seen any examples (although I can't say I've looked real hard either) of the Solaris 10 IPMP configuration set up with multipe IP's to encourage inbound load spreading as well. > > There's actually a cisco feature to make no VLAN larger than a *port*, > which I use a little bit. It's meant for CATV networks I think, or > DSL networks aggregated by IP instead of ATM like maybe some European > ones? but the idea is not to put edge ports into vlans any more but > instead say 'ip unnumbered loopbackN', and then some black magic they > have built into their DHCP forwarder adds /32 routes by watching the > DHCP replies. If you don't use DHCP you can add static /32 routes > yourself, and it will work. It does not help with IPv6, and also you > can only use it on vlan-tagged edge ports (what? arbitrary!) but > neat that it's there at all. > > http://www.cisco.com/en/US/docs/ios/12_3t/12_3t4/feature/guide/gtunvlan.html Interesting... however this seems to limit you to < 4096 edge ports per VTP domain, as the VID field in the 802.1q header is only 12 bits. It is also unclear how this works when you have one physical host with many guests. And then there is the whole thing that I don't really see how this helps with resilience in the face of a switch failure. Cool technology, but I'm not certain that it addresses what I was talking about. > > The best thing IMHO would be to use this feature on the edge ports, > just as I said, but you will have to teach the servers to VLAN-tag > their packets. not such a bad idea, but weird. > > You could also use it one hop up from the edge switches, but I think > it might have problems in general removing the routes when you unplug > a server, and using it one hop up could make them worse. I only use > it with static routes so far, so no mobility for me: I have to keep > each server plugged into its assigned port, and reconfigure switches > if I move it. 
Once you have ``no vlan larger than 1 switch,'' if you > actually need a vlan-like thing that spans multiple switches, the new > word for it is 'vrf'. There was some other Cisco dark magic that our network guys were touting a while ago that would make each edge switch look like a blade in a 6500 series. This would then allow them to do link aggregation across edge switches. At least two of "organizational changes", "personnel changes", and "roadmap changes" happened so I've not seen this in action. > > so, yeah, it means the server people will have to take over the job of > the networking people. The good news is that networking people don't > like spanning tree very much because it's always going wrong, so > AFAICT most of them who are paying attention are already moving in > this direction. > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NFS performance?
On Mon, Jul 26, 2010 at 1:27 AM, Garrett D'Amore wrote: > On Sun, 2010-07-25 at 21:39 -0500, Mike Gerdts wrote: >> On Sun, Jul 25, 2010 at 8:50 PM, Garrett D'Amore wrote: >> > On Sun, 2010-07-25 at 17:53 -0400, Saxon, Will wrote: >> >> >> >> I think there may be very good reason to use iSCSI, if you're limited >> >> to gigabit but need to be able to handle higher throughput for a >> >> single client. I may be wrong, but I believe iSCSI to/from a single >> >> initiator can take advantage of multiple links in an active-active >> >> multipath scenario whereas NFS is only going to be able to take >> >> advantage of 1 link (at least until pNFS). >> > >> > There are other ways to get multiple paths. First off, there is IP >> > multipathing. which offers some of this at the IP layer. There is also >> > 802.3ad link aggregation (trunking). So you can still get high >> > performance beyond single link with NFS. (It works with iSCSI too, >> > btw.) >> >> With both IPMP and link aggregation, each TCP session will go over the >> same wire. There is no guarantee that load will be evenly balanced >> between links when there are multiple TCP sessions. As such, any >> scalability you get using these configurations will be dependent on >> having a complex enough workload, wise cconfiguration choices, and and >> a bit of luck. > > If you're really that concerned, you could use UDP instead of TCP. But > that may have other detrimental performance impacts, I'm not sure how > bad they would be in a data center with generally lossless ethernet > links. Heh. My horror story with reassembly was actually with connectionless transports (LLT, then UDP). Oracle RAC's cache fusion sends 8 KB blocks via UDP by default, or LLT when used in the Veritas + Oracle RAC certified configuration from 5+ years ago. The use of Sun trunking with round robin hashing and the lack of use of jumbo packets made every cache fusion block turn into 6 LLT or UDP packets that had to be reassembled on the other end. This was on a 15K domain with the NICs spread across IO boards. I assume that interrupts for a NIC are handled by a CPU on the closest system board (Solaris 8, FWIW). If that assumption is true then there would also be a flurry of inter-system board chatter to put the block back together. In any case, performance was horrible until we got rid of round robin and enabled jumbo frames. > Btw, I am not certain that the multiple initiator support (mpxio) is > necessarily any better as far as guaranteed performance/balancing. (It > may be; I've not looked closely enough at it.) I haven't paid close attention to how mpxio works. The Veritas analog, vxdmp, does a very good job of balancing traffic down multiple paths, even when only a single LUN is accessed. The exact mode that dmp will use is dependent on the capabilities of the array it is talking to - many arrays work in an active/passive mode. As such, I would expect that with vxdmp or mpxio the balancing with iSCSI would be at least partially dependent on what the array said to do. > I should look more closely at NFS as well -- if multiple applications on > the same client are access the same filesystem, do they use a single > common TCP session, or can they each have separate instances open? > Again, I'm not sure. It's worse than that. A quick experiment with two different automounted home directories from the same NFS server suggests that both home directories share one TCP session to the NFS server. The latest version of Oracle's RDBMS supports a userland NFS client option. 
It would be very interesting to see if this does a separate session per data file, possibly allowing for better load spreading. >> Note that with Sun Trunking there was an option to load balance using >> a round robin hashing algorithm. When pushing high network loads this >> may cause performance problems with reassembly. > > Yes. Reassembly is Evil for TCP performance. > > Btw, the iSCSI balancing act that was described does seem a bit > contrived -- a single initiator and a COMSTAR server, both client *and > server* with multiple ethernet links instead of a single 10GbE link. > > I'm not saying it doesn't happen, but I think it happens infrequently > enough that its reasonable that this scenario wasn't one that popped > immediately into my head. :-) It depends on whether the people that control the network gear are the same ones that control servers. My experience suggests that if there is a disconnect, it seems rather likely that each group's standardization efforts, procurement cycles, and capacity plans will work against any attempt t
Re: [zfs-discuss] NFS performance?
On Sun, Jul 25, 2010 at 8:50 PM, Garrett D'Amore wrote: > On Sun, 2010-07-25 at 17:53 -0400, Saxon, Will wrote: >> >> I think there may be very good reason to use iSCSI, if you're limited >> to gigabit but need to be able to handle higher throughput for a >> single client. I may be wrong, but I believe iSCSI to/from a single >> initiator can take advantage of multiple links in an active-active >> multipath scenario whereas NFS is only going to be able to take >> advantage of 1 link (at least until pNFS). > > There are other ways to get multiple paths. First off, there is IP > multipathing, which offers some of this at the IP layer. There is also > 802.3ad link aggregation (trunking). So you can still get high > performance beyond single link with NFS. (It works with iSCSI too, > btw.) With both IPMP and link aggregation, each TCP session will go over the same wire. There is no guarantee that load will be evenly balanced between links when there are multiple TCP sessions. As such, any scalability you get using these configurations will be dependent on having a complex enough workload, wise configuration choices, and a bit of luck. Note that with Sun Trunking there was an option to load balance using a round robin hashing algorithm. When pushing high network loads this may cause performance problems with reassembly. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
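As a concrete illustration of the "configuration choices" part: with 802.3ad aggregation on OpenSolaris the outbound hash policy is set on the aggregation itself, and including L3/L4 headers in the hash is what gives multiple TCP sessions a chance of landing on different links. A sketch (link names are placeholders, and the switch side has to be configured to match):
  # dladm create-aggr -P L3,L4 -l e1000g0 -l e1000g1 aggr1
  # dladm modify-aggr -P L3,L4 aggr1     (change the policy on an existing aggregation)
Even then, a single TCP (or iSCSI) session still hashes to one link, which is exactly the limitation described above.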
Re: [zfs-discuss] Hashing files rapidly on ZFS
On Tue, Jul 6, 2010 at 10:29 AM, Arne Jansen wrote: > Daniel Carosone wrote: >> Something similar would be useful, and much more readily achievable, >> from ZFS from such an application, and many others. Rather than a way >> to compare reliably between two files for identity, I'ld liek a way to >> compare identity of a single file between two points in time. If my >> application can tell quickly that the file content is unaltered since >> last time I saw the file, I can avoid rehashing the content and use a >> stored value. If I can achieve this result for a whole directory >> tree, even better. > > This would be great for any kind of archiving software. Aren't zfs checksums > already ready to solve this? If a file changes, it's dnodes' checksum changes, > the checksum of the directory it is in and so forth all the way up to the > uberblock. > There may be ways a checksum changes without a real change in the files > content, > but the other way round should hold. If the checksum didn't change, the file > didn't change. > So the only missing link is a way to determine zfs's checksum for a > file/directory/dataset. Am I missing something here? Of course atime update > should be turned off, otherwise the checksum will get changed by the archiving > agent. What is the likelihood that the same data is re-written to the file? If that is unlikely, it looks as though znode_t's z_seq may be useful. While it isn't a checksum, it seems to be incremented on every file change. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
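If someone wants to poke at what ZFS already records for a file, one way to look (a rough sketch; the dataset name and object number are examples, not a tested transcript) is to take the file's inode number, which is its ZFS object number, and dump that object with zdb:
  # ls -i /tank/fs/somefile
  8432 /tank/fs/somefile
  # zdb -dddddd tank/fs 8432
With enough -d's the dump includes the znode bookkeeping (gen, timestamps, size) and the object's block pointers with their checksums. The caveat is that these are checksums of the on-disk blocks, so settings such as compression affect them, and zdb is an unstable debugging interface rather than something an archiver should depend on.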
Re: [zfs-discuss] Expected throughput
On Sun, Jul 4, 2010 at 2:08 PM, Ian D wrote: > Mem: 74098512k total, 73910728k used, 187784k free, 96948k buffers > Swap: 2104488k total, 208k used, 2104280k free, 63210472k cached > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 17652 mysql 20 0 3553m 3.1g 5472 S 38 4.4 247:51.80 mysqld > 16301 mysql 20 0 4275m 3.3g 5980 S 4 4.7 5468:33 mysqld > 16006 mysql 20 0 4434m 3.3g 5888 S 3 4.6 5034:06 mysqld > 12822 root 15 -5 0 0 0 S 2 0.0 22:00.50 scsi_wq_39 Is that 38% of one CPU or 38% of all CPU's? How many CPU's does the Linux box have? I don't mean the number of sockets, I mean number of sockets * number of cores * number of threads per core. My recollection of top is that the CPU percentage is: (pcpu_t2 - pcpu_t1) / (interval * ncpus) Where pcpu_t* is the process CPU time at a particular time. If you have a two socket quad core box with hyperthreading enabled, that is 2 * 4 * 2 = 16 CPU's. 38% of 16 CPU's can be roughly 6 CPU's running as fast as they can (and 10 of them idle) or 16 CPU's each running at about 38%. In the "I don't have a CPU bottleneck" argument, there is a big difference. If PID 16301 has a single thread that is doing significant work, on the hypothetical 16 CPU box this means that it is spending about 2/3 of the time on CPU. If the workload does: while ( 1 ) { issue I/O request get response do cpu-intensive work work } It is only trying to do I/O 1/3 of the time. Further, it has put a single high latency operation between its bursts of CPU activity. One other area of investigation that I didn't mention before: Your stats imply that the Linux box is getting data 32 KB at a time. How does 32 KB compare to the database block size? How does 32 KB compare to the block size on the relevant zfs filesystem or zvol? Are blocks aligned at the various layers? http://blogs.sun.com/dlutz/entry/partition_alignment_guidelines_for_unified -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
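To take the guesswork out of it on the Linux side, it may be worth looking at per-thread CPU rather than the per-process rollup (the PID below is one of the mysqld processes from the top output above; pidstat assumes the sysstat package is installed):
  $ top -H -p 16301          (per-thread view of that mysqld)
  $ pidstat -t -p 16301 1    (per-thread %usr/%system, once a second)
If one thread sits at or near 100% of a single CPU while the others are idle, the workload is serialized on that thread even though the box as a whole looks mostly idle.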
Re: [zfs-discuss] Expected throughput
On Sun, Jul 4, 2010 at 10:08 AM, Ian D wrote: > What I don't understand is why, when I run a single query I get <100 IOPS > and <3MB/sec. The setup can obviously do better, so where is the > bottleneck? I don't see any CPU core on any side being maxed out so it > can't be it... In what way is CPU contention being monitored? "prstat" without options is nearly useless for a multithreaded app on a multi-CPU (or multi-core/multi-thread) system. mpstat is only useful if threads never migrate between CPU's. "prstat -mL" gives a nice picture of how busy each LWP (thread) is. When viewed with "prstat -mL", A thread that has usr+sys at 100% cannot go any faster, unless you can get the CPU to go faster, as I suggest below. From my understanding (perhaps not 100% correct on the rest of this paragraph): The time spent in TRP may be reclaimed by running the application in a processor set with interrupts disabled on all of its processors. If TFL or DFL are high, optimizing the use of cache may be beneficial. Examples of how you can optimize the use of cache include using the FX scheduler with a priority that gives relatively long time slices, using processor sets to keep other processes off of the same caches (which are often shared by multiple cores), or perhaps disabling CPU's (threads) to ensure that only a single core is using each cache. With current generation Intel CPU's, this can allow the CPU clock rate to increase, thereby allowing more work to get done. > The database is MySQL, it runs on a Linux box that connects to the Nexenta Oh, since the database runs on Linux I guess you need to dig up top's equivalent of "prstat -mL". Unfortunately, I don't think that Linux has microstate accounting and as such you may not have visibility into time spent on traps, text faults, and data faults on a per-process basis. > server through 10GbE using iSCSI. Have you done any TCP tuning? Based on the numbers you cite above, it looks like you are doing about 32 KB I/O's. I think you can perform a test that involves mainly the network if you use netperf with options like: netperf -H $host -t TCP_RR -r 32768 -l 30 That is speculation based on reading http://www.netperf.org/netperf/training/Netperf.html. Someone else (perhaps on networking or performance lists) may have better tests to run. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Expected throughput
On Sun, Jul 4, 2010 at 11:28 AM, Bob Friesenhahn wrote: >> >> Ok... so we've rebuilt the pool as 14 pairs of mirrors, each pair having >> one disk in each of the two JBODs. Now we're getting about 500-1000 IOPS >> (according to zpool iostat) and 20-30MB/sec in random read on a big >> database. Does that sounds right? > > I am not sure who wrote the above text since the attribution quoting is all > botched up (Gmail?) in this thread. Regardless, it is worth pointing out > that 'zpool iostat' only reports the I/O operations which were actually > performed. It will not report the operations which did not need to be > performed due to already being in cache. A quite busy system can still > report very little via 'zpool iostat' if it has enough RAM to cache the > requested data. > > Bob Very good point. You can use a combination of "zpool iostat" and fsstat to see the effect of reads that didn't turn into physical I/Os. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
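A minimal sketch of watching both layers at once (the pool name is a placeholder):
  # fsstat zfs 5             (logical read/write operations arriving at ZFS, whether or not they hit the ARC)
  # zpool iostat -v tank 5   (physical I/O actually issued to the vdevs)
If fsstat shows substantially more read activity than zpool iostat does, the difference is being satisfied from cache.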
Re: [zfs-discuss] Use of blocksize (-b) during zfs zvol create, poor performance
Hi Eff, There are a significant number of variables to work through with dedup and compression enabled, so the first suggestion I have is to disable those features for now so you're not working with too many elements. With those features set aside, an NTFS cluster operation does not equal a 64k raw I/O block, and the ZFS 64k blocksize does not equal one I/O operation. We may also need to consider the overall network performance behavior, iSCSI protocol characteristics, and the Windows network stack; iperf is a good tool to rule those out. What I primarily suspect is that write I/O operations are not aligned and are waiting for I/O completion over multiple vdevs. Alignment is important for write I/O optimization, and how the I/O maps onto the software RAID layout has a significant impact on the DMU and SPA operations for a specific vdev layout. You may also have an issue with write cache operations: by default, large I/O calls such as 64k will not use a ZIL cache vdev (if you have one defined) but will be written directly to your array vdevs, which also involves a transaction group write operation. To ensure ZIL log usage with 64k I/Os you can apply the following: edit the /etc/system file with set zfs:zfs_immediate_write_sz = 131071 (a reboot is required to activate the change). You have also not indicated what your zpool configuration looks like; that would be helpful in this discussion. It appears that you're applying the x4500 as a backup target, which means you should (if not already) enable write caching on the COMSTAR LU properties for this type of application, e.g. stmfadm modify-lu -p wcd=false 600144F02F2280004C1D62010001 To help triage the perf issue further you could post two 'kstat zfs' and two 'kstat stmf' outputs at a 5 minute interval and a 'zpool iostat -v 30 5', which would help visualize the I/O behavior. Regards, Mike http://blog.laspina.ca/ -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] COMSTAR ISCSI - configuration export/import
I haven't tried it yet, but supposedly this will back up/restore the COMSTAR config: $ svccfg export -a stmf > comstar.bak.${DATE} If you ever need to restore the configuration, you can attach the storage and run an import: $ svccfg import comstar.bak.${DATE} - Mike On 6/28/10, bso...@epinfante.com wrote: > Hi all, > > Having osol b134 exporting a couple of iscsi targets to some hosts, how can > the COMSTAR configuration be migrated to another host? > I can use ZFS send/receive to replicate the LUNs, but how can I > "replicate" the targets/views from serverA to serverB? > > Are there any best procedures to follow to accomplish this? > Thanks for all your time, > > Bruno > > Sent from my HTC > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > > -- Sent from my mobile device ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] VXFS to ZFS Quota
On Fri, Jun 18, 2010 at 8:09 AM, David Magda wrote: > You could always split things up into groups of (say) 50. A few jobs ago, > I was in an environment where we have a /home/students1/ and > /home/students2/, along with a separate faculty/ (using Solaris and UFS). > This had more to do with IOps than anything else. A decade or so ago when I managed similar environments and had (I think) 6 file systems handling about 5000 students. Each file system had about 1/6 of the students. Challenges I found in this were: - Students needed to work on projects together. The typical way to do this was for them to request a group, then create a group writable directory in one of their home directories. If all students in the group had home directories on the same file system, there was nothing special to consider. If they were on different file systems then at least one would need to have a non-zero quota (that is, not 0 blocks soft, 1 block hard) quota on the file system where the group directory resides. - Despite your best efforts things will get imbalanced. If you are tight on space, this means that you will need to migrate users. This will become apparent only at the times of the semester where even per-user outages are most inconvenient (i.e. at 6 and 13 weeks when big projects tend to be due). Its probably a good idea to consider these types of situations in the transition plan, or at least determine they don't apply. I was working in a college of engineering where group projects were common and CAD, EDA, and simulation tools could generate big files very quickly. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dedup... still in beta status
On Tue, Jun 15, 2010 at 7:28 PM, David Magda wrote: > On Jun 15, 2010, at 14:20, Fco Javier Garcia wrote: > >> I think dedup may have its greatest appeal in VDI environments (think >> about a environment with 85% if the data that the virtual machine needs is >> into ARC or L2ARC... is like a dream...almost instantaneous response... and >> you can boot a new machine in a few seconds)... > > This may also be accomplished by using snapshots and clones of data sets. At > least for OS images: user profiles and documents could be something else > entirely. It all depends on the nature of the VDI environment. If the VMs are regenerated on each login, the snapshot + clone mechanism is sufficient. Deduplication is not needed. However, if VMs have a long life and get periodic patches and other software updates, deduplication will be required if you want to remain at somewhat constant storage utilization. It probably makes a lot of sense to be sure that swap or page files are on a non-dedup dataset. Executables and shared libraries shouldn't be getting paged out to it and the likelihood that multiple VMs page the same thing to swap or a page file is very small. > Another situation that comes to mind is perhaps as the back-end to a mail > store: if you send out a message(s) with an attachment(s) to a lot of > people, the attachment blocks could be deduped (and perhaps compressed as > well, since base-64 adds 1/3 overhead). It all depends on how this is stored. If the attachments are stored like they were in 1990 as part of an mbox format, you will be very unlikely to get the proper block alignment. Even storing the message body (including headers) in the same file as the attachment may not align the attachments because the mail headers may be different (e.g. different recipients messages took different paths, some were forwarded, etc.). If the attachments are stored in separate files or a database format is used that stores attachments separate from the message (with matching database + zfs block size) things may work out favorably. However, a system that detaches messages and stores them separately may just as well store them in a file that matches the SHA256 hash, assuming that file doesn't already exist. If does exist, it can just increment a reference count. In other words, an intelligent mail system should already dedup. Or at least that is how I would have written it for the last decade or so... -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20
On Thu, Jun 10, 2010 at 9:39 AM, Andrey Kuzmin wrote: > On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski wrote: >> >> On 21/10/2009 03:54, Bob Friesenhahn wrote: >>> >>> I would be interested to know how many IOPS an OS like Solaris is able to >>> push through a single device interface. The normal driver stack is likely >>> limited as to how many IOPS it can sustain for a given LUN since the driver >>> stack is optimized for high latency devices like disk drives. If you are >>> creating a driver stack, the design decisions you make when requests will be >>> satisfied in about 12ms would be much different than if requests are >>> satisfied in 50us. Limitations of existing software stacks are likely >>> reasons why Sun is designing hardware with more device interfaces and more >>> independent devices. >> >> >> Open Solaris 2009.06, 1KB READ I/O: >> >> # dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0& > > /dev/null is usually a poor choice for a test like this. Just to be on the > safe side, I'd rerun it with /dev/random. > Regards, > Andrey (aside from other replies about read vs. write and /dev/random...) Testing the performance of a disk by reading from /dev/random and writing to the disk is misguided. From random(7d): Applications retrieve random bytes by reading /dev/random or /dev/urandom. The /dev/random interface returns random bytes only when sufficient amount of entropy has been collected. In other words, when the kernel doesn't think that it can give high quality random numbers, it stops providing them until it has gathered enough entropy. It will pause your reads. If instead you use /dev/urandom, the above problem doesn't exist, but the generation of random numbers is CPU-intensive. There is a reasonable chance (particularly with slow CPUs and fast disks) that you will be testing the speed of /dev/urandom rather than the speed of the disk or other I/O components. If your goal is to provide data that is not all 0's to prevent ZFS compression from making the file sparse, or you want to be sure that compression doesn't otherwise make the actual writes smaller, you could try something like: # create a file just over 100 MB dd if=/dev/random of=/tmp/randomdata bs=513 count=204401 # repeatedly feed that file to dd while true ; do cat /tmp/randomdata ; done | dd of=/my/test/file bs=... count=... The above should make it so that it will take a while before there are two blocks that are identical, thus confounding deduplication as well. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
Sorry, turned on html mode to avoid gmail's line wrapping. On Mon, May 31, 2010 at 4:58 PM, Sandon Van Ness wrote: > On 05/31/2010 02:52 PM, Mike Gerdts wrote: > > On Mon, May 31, 2010 at 4:32 PM, Sandon Van Ness > wrote: > > > >> On 05/31/2010 01:51 PM, Bob Friesenhahn wrote: > >> > >>> There are multiple factors at work. Your OpenSolaris should be new > >>> enough to have the fix in which the zfs I/O tasks are run in in a > >>> scheduling class at lower priority than normal user processes. > >>> However, there is also a throttling mechanism for processes which > >>> produce data faster than can be consumed by the disks. This > >>> throttling mechanism depends on the amount of RAM available to zfs and > >>> the write speed of the I/O channel. More available RAM results in > >>> more write buffering, which results in a larger chunk of data written > >>> at the next transaction group write interval. The maximum size of a > >>> transaction group may be configured in /etc/system similar to: > >>> > >>> * Set ZFS maximum TXG group size to 2684354560 > >>> set zfs:zfs_write_limit_override = 0xa000 > >>> > >>> If the transaction group is smaller, then zfs will need to write more > >>> often. Processes will still be throttled but the duration of the > >>> delay should be smaller due to less data to write in each burst. I > >>> think that (with multiple writers) the zfs pool will be "healthier" > >>> and less fragmented if you can offer zfs more RAM and accept some > >>> stalls during writing. There are always tradeoffs. > >>> > >>> Bob > >>> > >> well it seems like when messing with the txg sync times and stuff like > >> that it did make the transfer more smooth but didn't actually help with > >> speeds as it just meant the hangs happened for a shorter time but at a > >> smaller interval and actually lowering the time between writes just > >> seemed to make things worse (slightly). > >> > >> I think I have came to the conclusion that the problem here is CPU due > >> to the fact that its only doing this with parity raid. I would think if > >> it was I/O based then it would be the same as if anything its heavier on > >> I/O on non parity raid due to the fact that it is no longer CPU > >> bottlenecked (dd write test gives me near 700 megabytes/sec vs 450 with > >> parity raidz2). > >> > > To see if the CPU is pegged, take a look at the output of: > > > > mpstat 1 > > prstat -mLc 1 > > > > If mpstat shows that the idle time reaches 0 or the process' latency > > column is more then a few tenths of a percent, you are probably short > > on CPU. > > > > It could also be that interrupts are stealing cycles from rsync. > > Placing it in a processor set with interrupts disabled in that > > processor set may help. > > > > > > Unfortunately none of these utilies make it possible to ge values for <1 > second which is what the hang is (its happening for about 1/2 of a second). 
> > Here is with mpstat: > > > Here is what i get with prstat: > > Total: 57 processes, 260 lwps, load averages: 2.15, 2.16, 2.15 > PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG > PROCESS/LWPID > 604 root 0.0 33 0.0 0.0 0.0 0.0 42 25 18 13 0 0 > zpool-data/13 > 604 root 0.0 30 0.0 0.0 0.0 0.0 41 29 12 12 0 0 > zpool-data/15 > 1326 root 12 2.9 0.0 0.0 0.0 0.0 85 0.4 1K 12 11K 0 rsync/1 > 604 root 0.0 15 0.0 0.0 0.0 0.0 41 44 111 9 0 0 > zpool-data/27 > 604 root 0.0 14 0.0 0.0 0.0 0.0 43 42 72 3 0 0 > zpool-data/33 > 604 root 0.0 5.9 0.0 0.0 0.0 0.0 41 53 109 6 0 0 > zpool-data/19 > 604 root 0.0 5.4 0.0 0.0 0.0 0.0 42 53 106 8 0 0 > zpool-data/25 > 604 root 0.0 5.3 0.0 0.0 0.0 0.0 43 51 107 7 0 0 > zpool-data/21 > 604 root 0.0 4.5 0.0 0.0 0.0 0.0 41 54 110 4 0 0 > zpool-data/31 > 604 root 0.0 3.9 0.0 0.0 0.0 0.0 41 55 109 3 0 0 > zpool-data/23 > 604 root 0.0 3.7 0.0 0.0 0.0 0.0 44 52 111 2 0 0 > zpool-data/29 > 1322 root 0.0 0.4 0.0 0.0 0.0 0.0 98 2.0 1K 0 1 0 rsync/1 > 22644 root 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 16 13 255 0 prstat/1 > 14409 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 5 3 69 0 sshd/1 > 196 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 15 2 105 0 nscd/17 > In the interval abo
Re: [zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On Mon, May 31, 2010 at 4:32 PM, Sandon Van Ness wrote: > On 05/31/2010 01:51 PM, Bob Friesenhahn wrote: >> There are multiple factors at work. Your OpenSolaris should be new >> enough to have the fix in which the zfs I/O tasks are run in in a >> scheduling class at lower priority than normal user processes. >> However, there is also a throttling mechanism for processes which >> produce data faster than can be consumed by the disks. This >> throttling mechanism depends on the amount of RAM available to zfs and >> the write speed of the I/O channel. More available RAM results in >> more write buffering, which results in a larger chunk of data written >> at the next transaction group write interval. The maximum size of a >> transaction group may be configured in /etc/system similar to: >> >> * Set ZFS maximum TXG group size to 2684354560 >> set zfs:zfs_write_limit_override = 0xa000 >> >> If the transaction group is smaller, then zfs will need to write more >> often. Processes will still be throttled but the duration of the >> delay should be smaller due to less data to write in each burst. I >> think that (with multiple writers) the zfs pool will be "healthier" >> and less fragmented if you can offer zfs more RAM and accept some >> stalls during writing. There are always tradeoffs. >> >> Bob > well it seems like when messing with the txg sync times and stuff like > that it did make the transfer more smooth but didn't actually help with > speeds as it just meant the hangs happened for a shorter time but at a > smaller interval and actually lowering the time between writes just > seemed to make things worse (slightly). > > I think I have came to the conclusion that the problem here is CPU due > to the fact that its only doing this with parity raid. I would think if > it was I/O based then it would be the same as if anything its heavier on > I/O on non parity raid due to the fact that it is no longer CPU > bottlenecked (dd write test gives me near 700 megabytes/sec vs 450 with > parity raidz2). To see if the CPU is pegged, take a look at the output of: mpstat 1 prstat -mLc 1 If mpstat shows that the idle time reaches 0 or the process' latency column is more then a few tenths of a percent, you are probably short on CPU. It could also be that interrupts are stealing cycles from rsync. Placing it in a processor set with interrupts disabled in that processor set may help. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
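A rough sketch of the processor-set idea (the CPU ids and set id below are examples; pick CPUs that leave enough for the rest of the system):
  # psrset -c 2 3            (create a set from CPUs 2 and 3; the command prints the new set id, say 1)
  # psrset -f 1              (stop that set's CPUs from handling interrupts)
  # psrset -e 1 rsync ...    (run the rsync bound to the set)
  # psrset -d 1              (tear the set down afterwards)
That isolates rsync from interrupt load; whether it helps here depends on whether the stalls really are CPU starvation rather than the txg flushes.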
Re: [zfs-discuss] Is it safe to disable the swap partition?
On Sun, May 9, 2010 at 7:40 PM, Edward Ned Harvey wrote: > > > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- > > boun...@opensolaris.org] On Behalf Of Richard Elling > > > > For a storage server, swap is not needed. If you notice swap being used > > then your storage server is undersized. > > Indeed, I have two solaris 10 fileservers that have uptime in the range of a > few months. I just checked swap usage, and they're both zero. > > So, Bob, rub it in if you wish. ;-) I was wrong. I knew the behavior in > Linux, which Roy seconded as "most OSes," and apparently we both assumed the > same here, but that was wrong. I don't know if solaris and opensolaris both > have the same swap behavior. I don't know if there's *ever* a situation > where solaris/opensolaris would swap idle processes. But there's at least > evidence that my two servers have not, or do not. If Solaris is under memory pressure, pages may be paged to swap. Under severe memory pressure, entire processes may be swapped. This will happen after freeing up the memory used for file system buffers, ARC, etc. If the processes never page in the pages that have been paged out (or the processes that have been swapped out are never scheduled) then those pages will not consume RAM. The best thing to do with processes that can be swapped out forever is to not run them. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
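For anyone who wants to check their own boxes, a quick sketch of what "swap being used" looks like from the Solaris command line:
  # swap -s     (reservation/allocation summary)
  # swap -l     (per-device usage; 'free' close to 'blocks' means nothing has been pushed out)
  # vmstat 5    (a persistently non-zero 'sr' scan-rate column indicates genuine memory pressure)
Zero swap used plus a zero scan rate over time is a reasonable sign the server is not undersized in the sense Richard describes.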
Re: [zfs-discuss] Mac OS X clients with ZFS server
On Thu, Apr 22, 2010 at 02:53:37PM -0700, Rich Teer wrote: > On Thu, 22 Apr 2010, Mike Mackovitch wrote: > > > I would also check /var/log/system.log and /var/log/kernel.log on the Mac to > > see if any other useful messages are getting logged. > > Ah, we're getting closer. The latter shows nothing interesting, but > system.log > has this line appended the minute I try the copy: > > sandboxd[78312]: portmap(78311) deny network-outbound > /private/var/tmp/launchd/sock > > Then, when the attempt times out, these appear: > > KernelEventAgent[36]: tid received event(s) VQ_NOTRESP (1) > KernelEventAgent[36]: tid type 'nfs', mounted on > '/net/zen/export/home' from 'zen:/export/home', not responding > KernelEventAgent[36]: tid found 1 filesystem(s) with problem(s) > > Does that shed any morelight on this? Nope. The first message is a known annoyance that gets logged whenever portmap starts. It can be ignored. The second message is just the daemon responsible for inducing the "disconnect dialog" noticing that there is an unresponsive file system. Oh, and the kernel.log should at least have the "lockd not responding" messages in it. So, I presume you meant nothing *else* interesting. I think it's time to look at the packets... (...and perhaps time to move this off of zfs-discuss seeing as this is really an NFS/networking issue and not a ZFS issue.) HTH --macko ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Mac OS X clients with ZFS server
On Thu, Apr 22, 2010 at 01:54:26PM -0700, Rich Teer wrote: > On Thu, 22 Apr 2010, Mike Mackovitch wrote: > > Hi Mike, > > > So, it looks like you need to investigate why the client isn't > > getting responses from the server's "lockd". > > > > This is usually caused by a firewall or NAT getting in the way. > > Great idea--I was indeed connected to my network using the AirPort interface, > thorugh a Wifi router. So as an experiment, I tried using a hard-wired, > manually set up Ethernet connection. Same result: no dice. :-( > > I checked the firewall settings on my laptop, and the firewall is turned off. > > Do you have any other ideas? It'd be really nice to get this working! I would also check /var/log/system.log and /var/log/kernel.log on the Mac to see if any other useful messages are getting logged. Then I'd grab packet traces with wireshark/tcpdump/snoop *simultaneously* on the client and the server, reproduce the problem, and then determine which packets are being sent and which packets are being received. HTH --macko ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Mac OS X clients with ZFS server
On Thu, Apr 22, 2010 at 12:40:37PM -0700, Rich Teer wrote: > On Thu, 22 Apr 2010, Tomas Ögren wrote: > > > Copying via terminal (and cp) works. > > Interesting: if I copy a file *which has no extended attributes* using cp in > a terminal, it works fine. If I try to cp a file that has EA (to the same > destination), it hangs. But I get this error message after a few seconds: > > cp file_without_EA /net/zen/export/home/rich > cp file_with_EA /net/zen/export/home/rich > nfs server zen:/export/home: lockd not responding So, it looks like you need to investigate why the client isn't getting responses from the server's "lockd". This is usually caused by a firewall or NAT getting in the way. HTH --macko ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Why does ARC grow above hard limit?
I would appreciate if somebody can clarify a few points. I am doing some random WRITES (100% writes, 100% random) testing and observe that ARC grows way beyond the "hard" limit during the test. The hard limit is set 512 MB via /etc/system and I see the size going up to 1 GB - how come is it happening? mdb's ::memstat reports 1.5 GB used - does this include ARC as well or is it separate? I see on the backed only reads (205 MB/s) and almost no writes (1.1 MB/s) - any ides what is being read? --- BEFORE TEST # ~/bin/arc_summary.pl System Memory: Physical RAM: 12270 MB Free Memory : 7108 MB LotsFree: 191 MB ZFS Tunables (/etc/system): set zfs:zfs_prefetch_disable = 1 set zfs:zfs_arc_max = 0x2000 set zfs:zfs_arc_min = 0x1000 ARC Size: Current Size: 136 MB (arcsize) Target Size (Adaptive): 512 MB (c) Min Size (Hard Limit):256 MB (zfs_arc_min) Max Size (Hard Limit):512 MB (zfs_arc_max) ... > ::memstat Page SummaryPagesMB %Tot Kernel 800895 3128 25% ZFS File Data 394450 1540 13% Anon 106813 4173% Exec and libs4178160% Page cache 14333550% Free (cachelist)22996891% Free (freelist) 1797511 7021 57% Total 3141176 12270 Physical 3141175 12270 --- DURING THE TEST # ~/bin/arc_summary.pl System Memory: Physical RAM: 12270 MB Free Memory : 6687 MB LotsFree: 191 MB ZFS Tunables (/etc/system): set zfs:zfs_prefetch_disable = 1 set zfs:zfs_arc_max = 0x2000 set zfs:zfs_arc_min = 0x1000 ARC Size: Current Size: 1336 MB (arcsize) Target Size (Adaptive): 512 MB (c) Min Size (Hard Limit):256 MB (zfs_arc_min) Max Size (Hard Limit):512 MB (zfs_arc_max) ARC Size Breakdown: Most Recently Used Cache Size: 87%446 MB (p) Most Frequently Used Cache Size:12%65 MB (c-p) ARC Efficency: Cache Access Total: 51681761 Cache Hit Ratio: 52% 27056475 [Defined State for buffer] Cache Miss Ratio: 47% 24625286 [Undefined State for Buffer] REAL Hit Ratio: 52% 27056475 [MRU/MFU Hits Only] Data Demand Efficiency:35% Data Prefetch Efficiency:DISABLED (zfs_prefetch_disable) CACHE HITS BY CACHE LIST: Anon: --%Counter Rolled. Most Recently Used: 13%3627289 (mru) [ Return Customer ] Most Frequently Used: 86%23429186 (mfu) [ Frequent Customer ] Most Recently Used Ghost: 17%4657584 (mru_ghost)[ Return Customer Evicted, Now Back ] Most Frequently Used Ghost: 32%8712009 (mfu_ghost)[ Frequent Customer Evicted, Now Back ] CACHE HITS BY DATA TYPE: Demand Data:30%8308866 Prefetch Data: 0%0 Demand Metadata:69%18747609 Prefetch Metadata: 0%0 CACHE MISSES BY DATA TYPE: Demand Data:61%15113029 Prefetch Data: 0%0 Demand Metadata:38%9511898 Prefetch Metadata: 0%359 - -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS behavior under limited resources
I am trying to see how ZFS behaves under resource starvation - corner cases in embedded environments. I see some very strange behavior. Any help/explanation would really be appreciated. My current setup is : OpenSolaris 111b (iSCSI seems to be broken in 132 - unable to get multiple connections/mutlipathing) iSCSI Storage Array that is capable of 20 MB/s random writes @ 4k and 70 MB random reads @ 4k 150 MB/s random writes @ 128k and 180 MB/S random reads @ 128K 180+ MB/S for sequntial reads and write at both 4k and 128k. 8 Intel CPU and 12 GB of RAM (DELL poweredge 610) The ARC size is limited to 512MB (hard limit). No L2 Cache. In both test below the file system size is about 300 GB. This file system conatins a single directory with about 15'000 files totalling to 200 GB (so the file system is 2/3 full). The tests are run within the same directory. Test 1: Random writes @ 4k to 1000 1MB files (1000 threads, 1 per file). First I observe that ARC size grows (momentarily) above 512 MB limit (via kstat and arcstat.pl). Q: It seems that zfs:zfs_arc_max is not really a hard limit? I tried setting primarycache to none, metadata and all. The I/O reported is similar in the NONE and METADATA case (17 MB/S) while when set to ALL, I/O is 3 - 4 time less (4-5 MB/S). Q: Any explanation would be useful. In this test I observe for backend on average I/O is 132 MB/s for READs and 51 MB/s WRITES Q: Why is more read than wtritten? Test 2: Random writes @ 4k to 10'000 1MB files (10'000 threads, 1 per file). - ARC size now goes to 1 GB during the entire test (way above the hard limit) - ::memstat reports that zfs grew from the original 430 MB to about 1.5 GB Q: Does mdb memstat reporting include ARC? Q: On the backend I see 170 MB/s reads and 0.5 MB.s writes -- What is happening here? SOME sample output ... --- > ::memstat Page SummaryPagesMB %Tot Kernel 800933 3128 25% ZFS File Data 394450 1540 13% Anon 128909 5034% Exec and libs4172160% Page cache 14749570% Free (cachelist)21884851% Free (freelist) 1776079 6937 57% Total 3141176 12270 Physical 3141175 12270 -- System Memory: Physical RAM: 12270 MB Free Memory : 6966 MB LotsFree: 191 MB ZFS Tunables (/etc/system): set zfs:zfs_prefetch_disable = 1 set zfs:zfs_arc_max = 0x2000 set zfs:zfs_arc_min = 0x1000 ARC Size: Current Size: 669 MB (arcsize) Target Size (Adaptive): 512 MB (c) Min Size (Hard Limit):256 MB (zfs_arc_min) Max Size (Hard Limit):512 MB (zfs_arc_max) ARC Size Breakdown: Most Recently Used Cache Size: 6%32 MB (p) Most Frequently Used Cache Size:93%480 MB (c-p) ARC Efficency: Cache Access Total: 47002757 Cache Hit Ratio: 52% 24657634 [Defined State for buffer] Cache Miss Ratio: 47% 22345123 [Undefined State for Buffer] REAL Hit Ratio: 52% 24657634 [MRU/MFU Hits Only] Data Demand Efficiency:36% Data Prefetch Efficiency:DISABLED (zfs_prefetch_disable) CACHE HITS BY CACHE LIST: Anon: --%Counter Rolled. 
Most Recently Used: 13%3420349 (mru) [ Return Customer ] Most Frequently Used: 86%21237285 (mfu) [ Frequent Customer ] Most Recently Used Ghost: 16%4057965 (mru_ghost)[ Return Customer Evicted, Now Back ] Most Frequently Used Ghost: 31%7837353 (mfu_ghost)[ Frequent Customer Evicted, Now Back ] CACHE HITS BY DATA TYPE: Demand Data:31%7793822 Prefetch Data: 0%0 Demand Metadata:68%16863812 Prefetch Metadata: 0%0 CACHE MISSES BY DATA TYPE: Demand Data:60%13573358 Prefetch Data: 0%0 Demand Metadata:39%8771406 Prefetch Metadata: 0%359 - -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs diff
On Mon, Mar 29, 2010 at 5:39 PM, Nicolas Williams wrote: > One really good use for zfs diff would be: as a way to index zfs send > backups by contents. Or to generate the list of files for incremental backups via NetBackup or similar. This is especially important for file systems will millions of files with relatively few changes. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
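For reference, the interface as it was being discussed (and roughly as it later shipped) is along these lines; the dataset and snapshot names are only examples:
  # zfs diff tank/home@monday tank/home@tuesday
Each output line is a change record: M for modified, + for created, - for removed, R for renamed, followed by the path. That is exactly the kind of list an incremental backup tool wants, instead of walking millions of inodes to find the handful that changed.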
Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies
On Fri, Mar 19, 2010 at 11:57 PM, Edward Ned Harvey wrote: >> 1. NDMP for putting "zfs send" streams on tape over the network. So > > Tell me if I missed something here. I don't think I did. I think this > sounds like crazy talk. > > I used NDMP up till November, when we replaced our NetApp with a Solaris Sun > box. In NDMP, to choose the source files, we had the ability to browse the > fileserver, select files, and specify file matching patterns. My point is: > NDMP is file based. It doesn't allow you to spawn a process and backup a > data stream. > > Unless I missed something. Which I doubt. ;-) 5+ years ago the variety of NDMP that was available with the combination of NetApp's OnTap and Veritas NetBackup did backups at the volume level. When I needed to go to tape to recover a file that was no longer in snapshots, we had to find space on a NetApp to restore the volume. It could not restore the volume to a Sun box, presumably because the contents of the backup used a data stream format that was proprietary to NetApp. An expired Internet Draft for NDMPv4 says: butype_name Specifies the name of the backup method to be used for the transfer (dump, tar, cpio, etc). Backup types are NDMP Server implementation dependent and MUST match one of the Data Server implementation specific butype_name strings accessible via the NDMP_CONFIG_GET_BUTYPE_INFO request. http://www.ndmp.org/download/sdk_v4/draft-skardal-ndmp4-04.txt It seems pretty clear from this that an NDMP data stream can contain most anything and is dependent on the device being backed up. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies
On Wed, Mar 17, 2010 at 9:15 AM, Edward Ned Harvey wrote: >> I think what you're saying is: Why bother trying to backup with "zfs >> send" >> when the recommended practice, fully supportable, is to use other tools >> for >> backup, such as tar, star, Amanda, bacula, etc. Right? >> >> The answer to this is very simple. >> #1 ... >> #2 ... > > Oh, one more thing. "zfs send" is only discouraged if you plan to store the > data stream and do "zfs receive" at a later date. > > If instead, you are doing "zfs send | zfs receive" onto removable media, or > another server, where the data is immediately fed through "zfs receive" then > it's an entirely viable backup technique. Richard Elling made an interesting observation that suggests that storing a zfs send data stream on tape is a quite reasonable thing to do. Richard's background makes me trust his analysis of this much more than I trust the typical person that says that zfs send output is poison. http://opensolaris.org/jive/thread.jspa?messageID=465973&tstart=0#465861 I think that a similar argument could be made for storing the zfs send data streams on a zfs file system. However, it is not clear why you would do this instead of just zfs send | zfs receive. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [OT] excess zfs-discuss mailman digests
On Mon, Feb 8, 2010 at 9:04 PM, grarpamp wrote: > PS: Is there any way to get a copy of the list since inception > for local client perusal, not via some online web interface? You can get monthly .gz archives in mbox format from http://mail.opensolaris.org/pipermail/zfs-discuss/. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
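For example, a rough sketch of pulling one month down and reading it locally; the exact file name pattern is an assumption based on the usual pipermail layout:
$ wget http://mail.opensolaris.org/pipermail/zfs-discuss/2010-January.txt.gz
$ gunzip 2010-January.txt.gz
$ mutt -f 2010-January.txt
Any client that understands mbox files will do in place of mutt.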
Re: [zfs-discuss] zero out block / sectors
On Mon, Jan 25, 2010 at 2:32 AM, Kjetil Torgrim Homme wrote: > Mike Gerdts writes: > >> John Hoogerdijk wrote: >>> Is there a way to zero out unused blocks in a pool? I'm looking for >>> ways to shrink the size of an opensolaris virtualbox VM and using the >>> compact subcommand will remove zero'd sectors. >> >> I've long suspected that you should be able to just use mkfile or "dd >> if=/dev/zero ..." to create a file that consumes most of the free >> space then delete that file. Certainly it is not an ideal solution, >> but seems quite likely to be effective. > > you'll need to (temporarily) enable compression for this to have an > effect, AFAIK. > > (dedup will obviously work, too, if you dare try it.) You are missing the point. Compression and dedup will make it so that the blocks in the devices are not overwritten with zeroes. The goal is to overwrite the blocks so that a back-end storage device or back-end virtualization platform can recognize that the blocks are not in use and as such can reclaim the space. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zero out block / sectors
On Sat, Jan 23, 2010 at 11:55 AM, John Hoogerdijk wrote: > Mike Gerdts wrote: >> >> On Fri, Jan 22, 2010 at 1:00 PM, John Hoogerdijk >> wrote: >> >>> >>> Is there a way to zero out unused blocks in a pool? I'm looking for ways >>> to >>> shrink the size of an opensolaris virtualbox VM and >>> using the compact subcommand will remove zero'd sectors. >>> >> >> I've long suspected that you should be able to just use mkfile or "dd >> if=/dev/zero ..." to create a file that consumes most of the free >> space then delete that file. Certainly it is not an ideal solution, >> but seems quite likely to be effective. >> > > I tried this with mkfile - no joy. Let me ask a couple of the questions that come just after "are you sure your computer is plugged in?" Did you wait enough time for the data to be flushed to disk (or do sync and wait for it to complete) prior to removing the file? You did "mkfile $huge /var/tmp/junk" not "mkfile -n $huge /var/tmp/junk", right? If not, I suspect that "zpool replace" to a thin provisioned disk is going to be your best bet (as suggested in another message). -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
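To illustrate the difference the -n flag makes, a quick check along these lines (path, size, and the exact du output are illustrative only) shows whether the file really consumed space:
# mkfile 1g /var/tmp/full
# mkfile -n 1g /var/tmp/sparse
# du -h /var/tmp/full /var/tmp/sparse
 1.0G   /var/tmp/full
 1K     /var/tmp/sparse
Only the first form actually writes zeroes to disk; the -n variant creates a sparse file with no blocks allocated, which does nothing to help the back end reclaim space.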
Re: [zfs-discuss] zero out block / sectors
On Fri, Jan 22, 2010 at 1:00 PM, John Hoogerdijk wrote: > Is there a way to zero out unused blocks in a pool? I'm looking for ways to > shrink the size of an opensolaris virtualbox VM and > using the compact subcommand will remove zero'd sectors. I've long suspected that you should be able to just use mkfile or "dd if=/dev/zero ..." to create a file that consumes most of the free space then delete that file. Certainly it is not an ideal solution, but seems quite likely to be effective. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
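A minimal sketch of that approach, assuming the pool inside the VM is mounted at /tank (the path and block size are only examples):
# dd if=/dev/zero of=/tank/zerofill bs=1024k
# sync
# rm /tank/zerofill
# sync
dd exits once the pool runs out of space; after the file is removed the zeroed blocks are free again, and the VirtualBox compact step can then drop them from the disk image.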
Re: [zfs-discuss] zfs send/receive as backup - reliability?
On Thu, Jan 21, 2010 at 11:28 AM, Richard Elling wrote: > On Jan 21, 2010, at 3:55 AM, Julian Regel wrote: >> >> Until you try to pick one up and put it in a fire safe! >> >> >Then you backup to tape from x4540 whatever data you need. >> >In case of enterprise products you save on licensing here as you need a one >> >client license per x4540 but in fact can >backup data from many clients >> >which are there. >> >> Which brings us full circle... >> >> What do you then use to backup to tape bearing in mind that the Sun-provided >> tools all have significant limitations? > > Poor choice of words. Sun resells NetBackup and (IIRC) that which was > formerly called NetWorker. Thus, Sun does provide enterprise backup > solutions. (Symantec nee Veritas) NetBackup and (EMC nee Legato) NetWorker are different products that compete in the enterprise backup space. Under the covers NetBackup uses GNU tar to gather file data for the backup stream. At one point (maybe still the case), one of the claimed features of NetBackup was that if a tape is written without multiplexing, you can use GNU tar to extract the data. This seems to be most useful when you need to recover master and/or media servers and to be able to extract your data after you no longer use NetBackup. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
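Purely as an illustration of that claim, recovering files from a non-multiplexed image with nothing but GNU tar might look roughly like the following; the tape device path, the number of tape marks to skip past NetBackup's image header, and any blocking factor are all assumptions that depend on how the image was written:
# mt -f /dev/rmt/0cbn rewind
# mt -f /dev/rmt/0cbn fsf 1
# gtar -xvf /dev/rmt/0cbn
As noted above, this only applies when the tape was written without multiplexing.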
Re: [zfs-discuss] Dedup memory overhead
On Thu, Jan 21, 2010 at 2:51 PM, Andrey Kuzmin wrote: > Looking at dedupe code, I noticed that on-disk DDT entries are > compressed less efficiently than possible: key is not compressed at > all (I'd expect roughly 2:1 compression ratio with sha256 data), A cryptographic hash such as sha256 should not be compressible. A trivial example shows this to be the case: $ for i in {1..10000} ; do echo $i | openssl dgst -sha256 -binary ; done > /tmp/sha256 $ cd /tmp $ gzip -c sha256 > sha256.gz $ compress -c sha256 > sha256.Z $ bzip2 -c sha256 > sha256.bz2 $ ls -go sha256* -rw-r--r-- 1 320000 Jan 22 04:13 sha256 -rw-r--r-- 1 428411 Jan 22 04:14 sha256.Z -rw-r--r-- 1 321846 Jan 22 04:14 sha256.bz2 -rw-r--r-- 1 320068 Jan 22 04:14 sha256.gz -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs send/receive as backup - reliability?
I use zfs send/recv in the enterprise and in smaller environments all the time and it is excellent. Have a look at how awesome the functionality is in this example. http://blog.laspina.ca/ubiquitous/provisioning_disaster_recovery_with_zfs Regards, Mike -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs send/receive as backup - reliability?
On Sat, Jan 16, 2010 at 5:31 PM, Toby Thain wrote: > On 16-Jan-10, at 7:30 AM, Edward Ned Harvey wrote: > >>> I am considering building a modest sized storage system with zfs. Some >>> of the data on this is quite valuable, some small subset to be backed >>> up "forever", and I am evaluating back-up options with that in mind. >> >> You don't need to store the "zfs send" data stream on your backup media. >> This would be annoying for the reasons mentioned - some risk of not being able >> to restore in future (although that's a pretty small risk) and inability >> to >> restore with any granularity, i.e. you have to restore the whole FS if you >> restore anything at all. >> >> A better approach would be "zfs send" and pipe directly to "zfs receive" >> on >> the external media. This way, in the future, anything which can read ZFS >> can read the backup media, and you have granularity to restore either the >> whole FS, or individual things inside there. > > There have also been comments about the extreme fragility of the data stream > compared to other archive formats. In general it is strongly discouraged for > these purposes. > Yet it is used in ZFS flash archives on Solaris 10 and is slated for use in the successor to flash archives. This initial proposal seems to imply using the same mechanism for a system image backup (instead of just system provisioning). http://mail.opensolaris.org/pipermail/caiman-discuss/2010-January/015909.html -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [zones-discuss] Zones on shared storage - a warning
On Fri, Jan 8, 2010 at 12:28 PM, Torrey McMahon wrote: > On 1/8/2010 10:04 AM, James Carlson wrote: >> >> Mike Gerdts wrote: >> >>> >>> This unsupported feature is supported with the use of Sun Ops Center >>> 2.5 when a zone is put on a "NAS Storage Library". >>> >> >> Ah, ok. I didn't know that. >> >> > > Does anyone know how that works? I can't find it in the docs, no one inside > of Sun seemed to have a clue when I asked around, etc. RTFM gladly taken. Storage libraries are discussed very briefly at: http://wikis.sun.com/display/OC2dot5/Storage+Libraries Creation of zones is discussed at: http://wikis.sun.com/display/OC2dot5/Creating+Zones I've found no documentation that explains the implementation details. From looking at a test environment that I have running, it seems to go like: 1. The storage admin carves out some NFS space and exports it with the appropriate options to the various hosts (global zones). 2. In the Ops Center BUI, the Ops Center admin creates a new storage library. He selects type NFS and specifies the hostname and path that was allocated. 3. The Ops Center admin associates the storage library with various hosts. This causes it to be mounted at /var/mnt/virtlibs/ on those hosts. I'll call this $libmnt. 4. When the sysadmin provisions a zone through Ops Center, a UUID is allocated and associated with this zone. I'll call it $zuuid. A directory $libmnt/$zuuid is created with a set of directories under it. 5. As the sysadmin provisions the zone, Ops Center prompts for the virtual disk size. A file of that size is created at $libmnt/$zuuid/virtdisk/data. 6. Ops Center creates a zpool: zpool create -m /var/mnt/oc-zpools/$zuuid/ z$zuuid \ $libmnt/$zuuid/virtdisk/data 7. The zonepath is created using a uuid that is unique to the zonepath ($puuid) z$zuuid/$puuid. It has a quota and a reservation set (8G each in the zpool history I am looking at). 8. The zone is configured with zonepath=/var/mnt/oc-zpools/$zuuid/$puuid, then installed. Just in case anyone sees this as the right way to do things, I think it is generally OK with a couple of caveats. The key areas that I would suggest for improvement are: - Mount the NFS space with -o forcedirectio. There is no need to cache data twice. - Never use UUIDs in paths. This makes it nearly impossible for a sysadmin or a support person to look at the output of commands on the system and understand what it is doing. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
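As a sketch of the first suggestion, the forcedirectio option would simply be added to the NFS mount used for the storage library; the server, export, and mount point below are placeholders:
# mount -F nfs -o vers=4,forcedirectio filer:/export/virtlib /var/mnt/virtlibs/library1
With forcedirectio the NFS client stops caching the file's pages itself, so the zone's data is cached once (in the ARC of the zpool living inside the file) instead of twice.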
Re: [zfs-discuss] [zones-discuss] Zones on shared storage - a warning
On Fri, Jan 8, 2010 at 9:11 AM, Mike Gerdts wrote: > I've seen similar errors on Solaris 10 in the primary domain and on a > M4000. Unfortunately Solaris 10 doesn't show the checksums in the > ereport. There I noticed a mixture between read errors and checksum > errors - and lots more of them. This could be because the S10 zone > was a full root SUNWCXall compared to the much smaller default ipkg > branded zone. On the primary domain running Solaris 10... I've written a dtrace script to get the checksums on Solaris 10. Here's what I see with NFSv3 on Solaris 10. # zoneadm -z zone1 halt ; zpool export pool1 ; zpool import -d /mnt/pool1 pool1 ; zoneadm -z zone1 boot ; sleep 30 ; pkill dtrace # ./zfs_bad_cksum.d Tracing... dtrace: error on enabled probe ID 9 (ID 43443: fbt:zfs:zio_checksum_error:return): invalid address (0x301b363a000) in action #4 at DIF offset 20 dtrace: error on enabled probe ID 9 (ID 43443: fbt:zfs:zio_checksum_error:return): invalid address (0x3037f746000) in action #4 at DIF offset 20 cccdtrace: error on enabled probe ID 9 (ID 43443: fbt:zfs:zio_checksum_error:return): invalid address (0x3026e7b) in action #4 at DIF offset 20 cc Checksum errors: 3 : 0x130e01011103 0x20108 0x0 0x400 (fletcher_4_native) 3 : 0x220125cd8000 0x62425980c08 0x16630c08296c490c 0x82b320c082aef0c (fletcher_4_native) 3 : 0x2f2a0a202a20436f 0x7079726967687420 0x2863292032303031 0x2062792053756e20 (fletcher_4_native) 3 : 0x3c21444f43545950 0x452048544d4c2050 0x55424c494320222d 0x2f2f5733432f2f44 (fletcher_4_native) 3 : 0x6005a8389144 0xc2080e6405c200b6 0x960093d40800 0x9eea007b9800019c (fletcher_4_native) 3 : 0xac044a6903d00163 0xa138c8003446 0x3f2cd1e100b10009 0xa37af9b5ef166104 (fletcher_4_native) 3 : 0xbaddcafebaddcafe 0xc 0x0 0x0 (fletcher_4_native) 3 : 0xc4025608801500ff 0x1018500704528210 0x190103e50066 0xc34b90001238f900 (fletcher_4_native) 3 : 0xfe00fc01fc42fc42 0xfc42fc42fc42fc42 0xfffc42fc42fc42fc 0x42fc42fc42fc42fc (fletcher_4_native) 4 : 0x4b2a460a 0x0 0x4b2a460a 0x0 (fletcher_4_native) 4 : 0xc00589b159a00 0x543008a05b673 0x124b60078d5be 0xe3002b2a0b605fb3 (fletcher_4_native) 4 : 0x130e010111 0x32000b301080034 0x10166cb34125410 0xb30c19ca9e0c0860 (fletcher_4_native) 4 : 0x130e010111 0x3a201080038 0x104381285501102 0x418016996320408 (fletcher_4_native) 4 : 0x130e010111 0x3a201080038 0x1043812c5501102 0x81802325c080864 (fletcher_4_native) 4 : 0x130e010111 0x3a0001c01080038 0x1383812c550111c 0x818975698080864 (fletcher_4_native) 4 : 0x1f81442e9241000 0x2002560880154c00 0xff10185007528210 0x19010003e566 (fletcher_4_native) 5 : 0xbab10c 0xf 0x53ae 0xdd549ae39aa1ba20 (fletcher_4_native) 5 : 0x130e010111 0x3ab01080038 0x1163812c550110b 0x8180a7793080864 (fletcher_4_native) 5 : 0x61626300 0x0 0x0 0x0 (fletcher_4_native) 5 : 0x8003 0x3df0d6a1 0x0 0x0 (fletcher_4_native) 6 : 0xbab10c 0xf 0x5384 0xdd549ae39aa1ba20 (fletcher_4_native) 7 : 0xbab10c 0xf 0x0 0x9af5e5f61ca2e28e (fletcher_4_native) 7 : 0x130e010111 0x3a201080038 0x104381265501102 0xc18c7210c086006 (fletcher_4_native) 7 : 0x275c222074650a2e 0x5c222020436f7079 0x7269676874203139 0x38392041540a2e5c (fletcher_4_native) 8 : 0x130e010111 0x3a0003101080038 0x1623812c5501131 0x8187f66a4080864 (fletcher_4_native) 9 : 0x8a000801010c0682 0x2eed0809c1640513 0x70200ff00026424 0x18001d16101f0059 (fletcher_4_native) 12 : 0xbab10c 0xf 0x0 0x45a9e1fc57ca2aa8 (fletcher_4_native) 30 : 0xbaddcafebaddcafe 0xbaddcafebaddcafe 0xbaddcafebaddcafe 0xbaddcafebaddcafe (fletcher_4_native) 47 : 0x0 0x0 0x0 0x0 (fletcher_4_native) 92 : 0x130e01011103 0x10108 0x0 0x200 
(fletcher_4_native) Since I had to guess at what the Solaris 10 source looks like, some extra eyeballs on the dtrace script are in order. Mike -- Mike Gerdts http://mgerdts.blogspot.com/ zfs_bad_cksum.d Description: Binary data ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
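For anyone who just wants a rough count of checksum failures rather than the checksum values themselves, a one-liner built on the same return probe that shows up in the errors above may be enough; it only assumes that zio_checksum_error() returns non-zero on a mismatch:
# dtrace -n 'fbt:zfs:zio_checksum_error:return /arg1 != 0/ { @failures = count(); }'
Pulling out the actual expected and computed checksums requires walking the zio structures, which is exactly the part that needed guesswork against the Solaris 10 source.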
Re: [zfs-discuss] [zones-discuss] Zones on shared storage - a warning
On Fri, Jan 8, 2010 at 5:28 AM, Frank Batschulat (Home) wrote: [snip] > Hey Mike, you're not the only victim of these strange CHKSUM errors, I hit > the same during my slightely different testing, where I'm NFS mounting an > entire, pre-existing remote file living in the zpool on the NFS server and use > that to create a zpool and install zones into it. What does your overall setup look like? Mine is: T5220 + Sun System Firmware 7.2.4.f 2009/11/05 18:21 Primary LDom Solaris 10u8 Logical Domains Manager 1.2,REV=2009.06.25.09.48 + 142840-03 Guest Domain 4 vcpus + 15 GB memory OpenSolaris snv_130 (this is where the problem is observed) I've seen similar errors on Solaris 10 in the primary domain and on a M4000. Unfortunately Solaris 10 doesn't show the checksums in the ereport. There I noticed a mixture between read errors and checksum errors - and lots more of them. This could be because the S10 zone was a full root SUNWCXall compared to the much smaller default ipkg branded zone. On the primary domain running Solaris 10... (this command was run some time ago) primary-domain# zpool status myzone pool: myzone state: DEGRADED status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://www.sun.com/msg/ZFS-8000-9P scrub: none requested config: NAMESTATE READ WRITE CKSUM myzone DEGRADED 0 0 0 /foo/20g DEGRADED 4.53K 0 671 too many errors errors: No known data errors (this was run today, many days after previous command) primary-domain# fmdump -eV | egrep zio_err | uniq -c | head 1zio_err = 5 1zio_err = 50 1zio_err = 5 1zio_err = 50 1zio_err = 5 1zio_err = 50 2zio_err = 5 1zio_err = 50 3zio_err = 5 1zio_err = 50 Note that even though I had thousands of read errors the zone worked just fine. I would have never known (suspected?) there was a problem if I hadn't run "zpool status" or the various FMA commands. > I've filed today: > > 6915265 zpools on files (over NFS) accumulate CKSUM errors with no apparent > reason Thanks. I'll open a support call to help get some funding on it... > here's the relevant piece worth investigating out of it (leaving out the > actual setup etc..) > as in your case, creating the zpool and installing the zone into it still > gives > a healthy zpool, but immediately after booting the zone, the zpool served > over NFS > accumulated CHKSUM errors. > > of particular interest are the 'cksum_actual' values as reported by Mike for > his > test case here: > > http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg33041.html > > if compared to the 'chksum_actual' values I got in the fmdump error output on > my test case/system: > > note, the NFS servers zpool that is serving and sharing the file we use is > healthy. 
> > zone halted now on my test system, and checking fmdump: > > osoldev.batschul./export/home/batschul.=> fmdump -eV | grep cksum_actual | > sort | uniq -c | sort -n | tail > 2 cksum_actual = 0x4bea1a77300 0xf6decb1097980 0x217874c80a8d9100 > 0x7cd81ca72df5ccc0 > 2 cksum_actual = 0x5c1c805253 0x26fa7270d8d2 0xda52e2079fd74 > 0x3d2827dd7ee4f21 > 6 cksum_actual = 0x28e08467900 0x479d57f76fc80 0x53bca4db5209300 > 0x983ddbb8c4590e40 > *A 6 cksum_actual = 0x348e6117700 0x765aa1a547b80 0xb1d6d98e59c3d00 > 0x89715e34fbf9cdc0 > *B 7 cksum_actual = 0x0 0x0 0x0 0x0 > *C 11 cksum_actual = 0x1184cb07d00 0xd2c5aab5fe80 0x69ef5922233f00 > 0x280934efa6d20f40 > *D 14 cksum_actual = 0x175bb95fc00 0x1767673c6fe00 0xfa9df17c835400 > 0x7e0aef335f0c7f00 > *E 17 cksum_actual = 0x2eb772bf800 0x5d8641385fc00 0x7cf15b214fea800 > 0xd4f1025a8e66fe00 > *F 20 cksum_actual = 0xbaddcafe00 0x5dcc54647f00 0x1f82a459c2aa00 > 0x7f84b11b3fc7f80 > *G 25 cksum_actual = 0x5d6ee57f00 0x178a70d27f80 0x3fc19c3a19500 > 0x82804bc6ebcfc0 > > osoldev.root./export/home/batschul.=> zpool status -v > pool: nfszone > state: DEGRADED > status: One or more devices has experienced an unrecoverable error. An > attempt was made to correct the error. Applications are unaffected. > action: Determine if the device needs to be replaced, and clear the errors > using 'zpool clear' or replace the device with 'zpool replace'. > see: http://www.sun.com/msg/ZFS-8000-9P > scrub: none requested > config: > > NAME
Re: [zfs-discuss] [zones-discuss] Zones on shared storage - a warning
On Fri, Jan 8, 2010 at 6:51 AM, James Carlson wrote: > Frank Batschulat (Home) wrote: >> This just can't be an accident, there must be some coincidence and thus >> there's a good chance >> that these CHKSUM errors must have a common source, either in ZFS or in NFS ? > > One possible cause would be a lack of substantial exercise. The man > page says: > > A regular file. The use of files as a backing store is > strongly discouraged. It is designed primarily for > experimental purposes, as the fault tolerance of a file > is only as good as the file system of which it is a > part. A file must be specified by a full path. > > Could it be that "discouraged" and "experimental" mean "not tested as > thoroughly as you might like, and certainly not a good idea in any sort > of production environment?" > > It sounds like a bug, sure, but the fix might be to remove the option. This unsupported feature is supported with the use of Sun Ops Center 2.5 when a zone is put on a "NAS Storage Library". -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [zones-discuss] Zones on shared storage - a warning
On Fri, Jan 8, 2010 at 6:55 AM, Darren J Moffat wrote: > Frank Batschulat (Home) wrote: >> >> This just can't be an accident, there must be some coincidence and thus >> there's a good chance >> that these CHKSUM errors must have a common source, either in ZFS or in >> NFS ? > > What are you using for on the wire protection with NFS ? Is it shared using > krb5i or do you have IPsec configured ? If not I'd recommend trying one of > those and see if your symptoms change. Shouldn't a scrub pick that up? Why would there be no errors from "zoneadm install", which under the covers does a pkg image create followed by *multiple* pkg install invocations. No checksum errors pop up there. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Zones on shared storage - a warning
[removed zones-discuss after sending heads-up that the conversation will continue at zfs-discuss] On Mon, Jan 4, 2010 at 5:16 PM, Cindy Swearingen wrote: > Hi Mike, > > It is difficult to comment on the root cause of this failure since > the several interactions of these features are unknown. You might > consider seeing how Ed's proposal plays out and let him do some more > testing... Unfortunately Ed's proposal is not funded last I heard. Ops Center uses many of the same mechanisms for putting zones on ZFS. This is where I saw the problem initially. > If you are interested in testing this with NFSv4 and it still fails > the same way, then also consider testing this with a local file > instead of a NFS-mounted file and let us know the results. I'm also > unsure of using the same path for the pool and the zone root path, > rather than one path for pool and a pool/dataset path for zone > root path. I will test this myself if I get some time. I have been unable to reproduce with a local file. I have been able to reproduce with NFSv4 on build 130. Rather surprisingly the actual checksums found in the ereports are sometimes "0x0 0x0 0x0 0x0" or "0xbaddcafe00 ...". Here's what I did: - Install OpenSolaris build 130 (ldom on T5220) - Mount some NFS space at /nfszone: mount -F nfs -o vers=4 $file:/path /nfszone - Create a 10gig sparse file cd /nfszone mkfile -n 10g root - Create a zpool zpool create -m /zones/nfszone nfszone /nfszone/root - Configure and install a zone zonecfg -z nfszone set zonepath = /zones/nfszone set autoboot = false verify commit exit chmod 700 /zones/nfszone zoneadm -z nfszone install - Verify that the nfszone pool is clean. First, pkg history in the zone shows the timestamp of the last package operation 2010-01-07T20:27:07 install pkg Succeeded At 20:31 I ran: # zpool status nfszone pool: nfszone state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM nfszone ONLINE 0 0 0 /nfszone/root ONLINE 0 0 0 errors: No known data errors I booted the zone. By 20:32 it had accumulated 132 checksum errors: # zpool status nfszone pool: nfszone state: DEGRADED status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://www.sun.com/msg/ZFS-8000-9P scrub: none requested config: NAME STATE READ WRITE CKSUM nfszone DEGRADED 0 0 0 /nfszone/root DEGRADED 0 0 132 too many errors errors: No known data errors fmdump has some very interesting things to say about the actual checksums. 
The 0x0 and 0xbaddcafe00 seem to shout that these checksum errors are not due to a couple bits flipped # fmdump -eV | grep cksum_actual | sort | uniq -c | sort -n | tail 2cksum_actual = 0x14c538b06b6 0x2bb571a06ddb0 0x3e05a7c4ac90c62 0x290cbce13fc59dce 3cksum_actual = 0x175bb95fc00 0x1767673c6fe00 0xfa9df17c835400 0x7e0aef335f0c7f00 3cksum_actual = 0x2eb772bf800 0x5d8641385fc00 0x7cf15b214fea800 0xd4f1025a8e66fe00 4cksum_actual = 0x0 0x0 0x0 0x0 4cksum_actual = 0x1d32a7b7b00 0x248deaf977d80 0x1e8ea26c8a2e900 0x330107da7c4bcec0 5cksum_actual = 0x14b8f7afe6 0x915db8d7f87 0x205dc7979ad73 0x4e0b3a8747b8a8 6cksum_actual = 0x1184cb07d00 0xd2c5aab5fe80 0x69ef5922233f00 0x280934efa6d20f40 6cksum_actual = 0x348e6117700 0x765aa1a547b80 0xb1d6d98e59c3d00 0x89715e34fbf9cdc0 16cksum_actual = 0xbaddcafe00 0x5dcc54647f00 0x1f82a459c2aa00 0x7f84b11b3fc7f80 48cksum_actual = 0x5d6ee57f00 0x178a70d27f80 0x3fc19c3a19500 0x82804bc6ebcfc0 I halted the zone, exported the pool, imported the pool, then did a scrub. Everything seemed to be OK: # zpool export nfszone # zpool import -d /nfszone nfszone # zpool status nfszone pool: nfszone state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM nfszone ONLINE 0 0 0 /nfszone/root ONLINE 0 0 0 errors: No known data errors # zpool scrub nfszone # zpool status nfszone pool: nfszone state: ONLINE scrub: scrub completed after 0h0m with 0 errors on Thu Jan 7 21:56:47 2010 config: NAME STATE READ WRITE CKSUM nfszone ONLINE 0 0 0 /nfszone/root ONLINE 0 0 0 errors: No known data errors But then I booted the zone... # zoneadm -z nfszone boot # zpool status nfszone pool: nfszone state: ONLINE status: One or more devices has experienced an unrecoverabl
Re: [zfs-discuss] Clearing a directory with more than 60 million files
On Tue, Jan 5, 2010 at 4:34 AM, Mikko Lammi wrote: > Hello, > > As a result of one badly designed application running loose for some time, > we now seem to have over 60 million files in one directory. Good thing > about ZFS is that it allows it without any issues. Unfortunatelly now that > we need to get rid of them (because they eat 80% of disk space) it seems > to be quite challenging. > > Traditional approaches like "find ./ -exec rm {} \;" seem to take forever > - after running several days, the directory size still says the same. The > only way how I've been able to remove something has been by giving "rm > -rf" to problematic directory from parent level. Running this command > shows directory size decreasing by 10,000 files/hour, but this would still > mean close to ten months (over 250 days) to delete everything! > > I also tried to use "unlink" command to directory as a root, as a user who > created the directory, by changing directory's owner to root and so forth, > but all attempts gave "Not owner" error. > > Any commands like "ls -f" or "find" will run for hours (or days) without > actually listing anything from the directory, so I'm beginning to suspect > that maybe the directory's data structure is somewhat damaged. Is there > some diagnostics that I can run with e.g "zdb" to investigate and > hopefully fix for a single directory within zfs dataset? In situations like this, ls will be exceptionally slow partially because it will sort the output. Find is slow because it needs to call lstat() on every entry. In similar situations I have found the following to work. perl -e 'opendir(D, "."); while ( $d = readdir(D) ) { print "$d\n" }' Replace print with unlink if you wish... > > To make things even more difficult, this directory is located in rootfs, > so dropping the zfs filesystem would basically mean reinstalling the > entire system, which is something that we really wouldn't wish to go. > > > OS is Solaris 10, zpool version is 10 (rather old, I know, but is there > easy path for upgrade that might solve this problem?) and the zpool > consists two 146 GB SAS drivers in a mirror setup. > > > Any help would be appreciated. > > Thanks, > Mikko > > -- > Mikko Lammi | l...@lmmz.net | http://www.lmmz.net > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
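For the deletion itself, a variant of that one-liner which unlinks instead of printing (and skips the . and .. entries) might look like this, run from inside the problem directory:
# perl -e 'opendir(D, "."); while ( $d = readdir(D) ) { next if $d eq "." or $d eq ".."; unlink($d) or warn "$d: $!\n"; }'
Because readdir() streams entries without sorting them or calling lstat() on each one, it avoids both of the costs described above.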
Re: [zfs-discuss] Zpool creation best practices
Thanks for the response Marion. I'm glad that I'm not the only one. :) -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Zpool creation best practices
I'm just wondering what some of you might do with your systems. We have an EMC Clariion unit that I connect several Sun machines to. I allow the EMC to do its hardware raid5 for several luns and then I stripe them together. I considered using raidz and just configuring the EMC as a JBOD, but I thought it would defeat the purpose of paying so much for a system with the advanced redundancy system. I also like to add luns on the fly when a system needs more file space and I know you can't do that with raidz. I've never had a lun go bad but bad things do happen. Does anyone else use ZFS in this way? Is this an unrecommended setup? It's too late to change my setup, but in the future when I'm planning new systems, should I consider the effort to allow zfs to fully control all the disks? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
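For what it's worth, the layout described above boils down to something like this (device names invented for the example):
# zpool create tank c3t0d0 c3t1d0
# zpool add tank c3t2d0
Each device is a hardware raid5 LUN from the Clariion, and the zpool add grows the stripe later when more space is needed. The trade-off is that ZFS can detect corruption through its checksums but has no redundant copy of its own to repair from; copies=2 or a ZFS-level mirror or raidz would be needed for self-healing.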
Re: [zfs-discuss] Changing ZFS drive pathing
Just thought I would let you all know that I followed what Alex suggested along with what many of you pointed out and it worked! Here are the steps I followed: 1. Break root drive mirror 2. zpool export filesystem 3. run the command to start MPxIO and reboot the machine 4. zpool import filesystem 5. Check the system 6. Recreate the mirror. Thank you all for the help! I feel much better and it worked without a single problem! I'm very impressed with MPxIO and wish I had known about it before spending thousands of dollars on PowerPath. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
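For anyone following the same path, those steps map roughly onto the commands below (pool name assumed; stmsboot is the standard way to enable MPxIO and it offers to do the reboot for you):
# zpool export tank
# stmsboot -e
# init 6
# zpool import tank
# zpool status tank
The export before enabling MPxIO and the import afterwards are what let ZFS pick the vdevs back up under their new scsi_vhci device names.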
Re: [zfs-discuss] Thin device support in ZFS?
On Wed, Dec 30, 2009 at 3:12 PM, Richard Elling wrote: > If the allocator can change, what sorts of policies should be > implemented? Examples include: > + should the allocator stick with best-fit and encourage more > gangs when the vdev is virtual? > + should the allocator be aware of an SSD's page size? Is > said page size available to an OS? > + should the metaslab boundaries align with virtual storage > or SSD page boundaries? Wandering off topic a little bit... Should the block size be a tunable so that it can match the page size of an SSD (typically 4K, right?) and of upcoming hard disks that sport a sector size > 512 bytes? http://arc.opensolaris.org/caselog/PSARC/2008/769/final_spec.txt > And, perhaps most important, how can this be done automatically > so that system administrators don't have to be rocket scientists > to make a good choice? Didn't you read the marketing literature? ZFS is easy because you only need to know two commands: zpool and zfs. If you just ignore all the subcommands, options to those subcommands, evil tuning that is sometimes needed, and effects of redundancy choices, then there is no need for any rocket scientists. :) -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
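As a point of reference, the closest existing knob is the per-dataset recordsize property, which caps the block size used for new writes; the dataset name and value here are just an example:
# zfs set recordsize=4k tank/ssdtest
That only controls the logical block size, though. It says nothing about whether the allocator lines those blocks up with SSD page or metaslab boundaries, which is the harder part of the question above.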
Re: [zfs-discuss] Thin device support in ZFS?
On Wed, Dec 30, 2009 at 1:40 PM, Richard Elling wrote: > On Dec 30, 2009, at 10:53 AM, Andras Spitzer wrote: > >> Devzero, >> >> Unfortunately that was my assumption as well. I don't have source level >> knowledge of ZFS, though based on what I know it wouldn't be an easy way to >> do it. I'm not even sure it's only a technical question, but a design >> question, which would make it even less feasible. > > It is not hard, because ZFS knows the current free list, so walking that > list > and telling the storage about the freed blocks isn't very hard. > > What is hard is figuring out if this would actually improve life. The > reason > I say this is because people like to use snapshots and clones on ZFS. > If you keep snapshots, then you aren't freeing blocks, so the free list > doesn't grow. This is a very different use case than UFS, as an example. It seems as though the oft mentioned block rewrite capabilities needed for pool shrinking and changing things like compression, encryption, and deduplication would also show benefit here. That is, blocks would be re-written in such a way as to minimize the number of chunks of storage that are allocated. The current HDS chunk size is 42 MB. The most benefit would seem to be to have ZFS make a point of reusing old but freed blocks before doing an allocation that causes the back-end storage to allocate another chunk of disk to the thin-provisioned LUN. While it is important to be able to roll back a few transactions in the event of some widely discussed failure modes, it is probably reasonable to reuse a block freed by a txg that is 3,000 txgs old (about 1 day old if 1 txg per 30 seconds). Such a threshold could be used to determine whether to reuse a block or venture into previously untouched regions of the disk. This strategy would allow the SAN administrator (who is a different person than the sysadmin) to allocate extra space to servers and the sysadmin can control the amount of space really used by quotas. In the event that there is an emergency need for more space, the sysadmin can increase the quota and allow more of the allocated SAN space to be used. Assuming the block rewrite feature comes to fruition, this emergency growth could be shrunk back down to the original size once the surge in demand (or errant process) subsides. > > There are a few minor bumps in the road. The ATA PASSTHROUGH > command, which allows TRIM to pass through the SATA drivers, was > just integrated into b130. This will be more important to small servers > than SANs, but the point is that all parts of the software stack need to > support the effort. As such, it is not clear to me who, if anyone, inside > Sun is champion for the effort -- it crosses multiple organizational > boundaries. > >> >> Apart from the technical possibilities, this feature looks really >> inevitable to me in the long run especially for enterprise customers with >> high-end SAN as cost is always a major factor in a storage design and it's a >> huge difference if you have to pay based on the space used vs space >> allocated (for example). > > If the high cost of SAN storage is the problem, then I think there are > better ways to solve that :-) The "SAN" could be an OpenSolaris device serving LUNs through COMSTAR. If those LUNs are used to hold a zpool, the zpool could notify the LUN that blocks are no longer used and the "SAN" could reclaim those blocks.
This is just a variant of the same problem faced with expensive SAN devices that have thin provisioning allocation units measured in the tens of megabytes instead of hundreds to thousands of kilobytes. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
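The quota dance described above is plain property management; the dataset name and sizes here are invented for the example:
# zfs set quota=500g tank/apps
# zfs set quota=600g tank/apps
# zfs set quota=500g tank/apps
The first setting is the day-to-day cap, well inside what the SAN has allocated; the second gives emergency headroom while demand surges; the third brings it back down once the surge (or errant process) subsides. Shrinking the space the pool has actually touched, of course, still depends on the block rewrite feature discussed above.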
Re: [zfs-discuss] Zones on shared storage - a warning
On Tue, Dec 22, 2009 at 8:02 PM, Mike Gerdts wrote: > I've been playing around with zones on NFS a bit and have run into > what looks to be a pretty bad snag - ZFS keeps seeing read and/or > checksum errors. This exists with S10u8 and OpenSolaris dev build > snv_129. This is likely a blocker for anything thinking of > implementing parts of Ed's Zones on Shared Storage: > > http://hub.opensolaris.org/bin/view/Community+Group+zones/zoss > > The OpenSolaris example appears below. The order of events is: > > 1) Create a file on NFS, turn it into a zpool > 2) Configure a zone with the pool as zonepath > 3) Install the zone, verify that the pool is healthy > 4) Boot the zone, observe that the pool is sick [snip] An off list conversation and a bit of digging into other tests I have done shows that this is likely limited to NFSv3. I cannot say that this problem has been seen with NFSv4. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Zones on shared storage - a warning
I've been playing around with zones on NFS a bit and have run into what looks to be a pretty bad snag - ZFS keeps seeing read and/or checksum errors. This exists with S10u8 and OpenSolaris dev build snv_129. This is likely a blocker for anything thinking of implementing parts of Ed's Zones on Shared Storage: http://hub.opensolaris.org/bin/view/Community+Group+zones/zoss The OpenSolaris example appears below. The order of events is: 1) Create a file on NFS, turn it into a zpool 2) Configure a zone with the pool as zonepath 3) Install the zone, verify that the pool is healthy 4) Boot the zone, observe that the pool is sick r...@soltrain19# mount filer:/path /mnt r...@soltrain19# cd /mnt r...@soltrain19# mkdir osolzone r...@soltrain19# mkfile -n 8g root r...@soltrain19# zpool create -m /zones/osol osol /mnt/osolzone/root r...@soltrain19# zonecfg -z osol osol: No such zone configured Use 'create' to begin configuring a new zone. zonecfg:osol> create zonecfg:osol> info zonename: osol zonepath: brand: ipkg autoboot: false bootargs: pool: limitpriv: scheduling-class: ip-type: shared hostid: zonecfg:osol> set zonepath=/zones/osol zonecfg:osol> set autoboot=false zonecfg:osol> verify zonecfg:osol> commit zonecfg:osol> exit r...@soltrain19# chmod 700 /zones/osol r...@soltrain19# zoneadm -z osol install Publisher: Using opensolaris.org (http://pkg.opensolaris.org/dev/ http://pkg-na-2.opensolaris.org/dev/). Publisher: Using contrib (http://pkg.opensolaris.org/contrib/). Image: Preparing at /zones/osol/root. Cache: Using /var/pkg/download. Sanity Check: Looking for 'entire' incorporation. Installing: Core System (output follows) DOWNLOAD PKGS FILESXFER (MB) Completed46/46 12334/1233493.1/93.1 PHASEACTIONS Install Phase18277/18277 No updates necessary for this image. Installing: Additional Packages (output follows) DOWNLOAD PKGS FILESXFER (MB) Completed36/36 3339/333921.3/21.3 PHASEACTIONS Install Phase 4466/4466 Note: Man pages can be obtained by installing SUNWman Postinstall: Copying SMF seed repository ... done. Postinstall: Applying workarounds. Done: Installation completed in 2139.186 seconds. Next Steps: Boot the zone, then log into the zone console (zlogin -C) to complete the configuration process. 6.3 Boot the OpenSolaris zone r...@soltrain19# zpool status osol pool: osol state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM osol ONLINE 0 0 0 /mnt/osolzone/root ONLINE 0 0 0 errors: No known data errors r...@soltrain19# zoneadm -z osol boot r...@soltrain19# zpool status osol pool: osol state: DEGRADED status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://www.sun.com/msg/ZFS-8000-9P scrub: none requested config: NAME STATE READ WRITE CKSUM osol DEGRADED 0 0 0 /mnt/osolzone/root DEGRADED 0 0 117 too many errors errors: No known data errors r...@soltrain19# zlogin osol uptime 5:31pm up 1 min(s), 0 users, load average: 0.69, 0.38, 0.52 -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss