Re: [zfs-discuss] Directory is not accessible
unlink(1M)?

cheers,
--justin

From: Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
To: Sami Tuominen; zfs-discuss@opensolaris.org
Sent: Monday, 26 November 2012, 14:57
Subject: Re: [zfs-discuss] Directory is not accessible

> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Sami Tuominen
>
> How can one remove a directory containing corrupt files or a corrupt file
> itself? For me rm just gives input/output error.

I was hoping to see somebody come up with an answer for this... I would expect rm to work. Maybe you have to rm the parent of the thing you're trying to rm? But I kinda doubt it. Maybe you need to verify you're rm'ing the right thing? I believe, if you scrub the pool, it should tell you the names of the corrupt things.

Or maybe you're not experiencing a simple cksum mismatch; maybe you're experiencing a legitimate I/O error. The "rm" solution could only possibly work to clear up a cksum mismatch.
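As a rough sketch of the scrub approach (assuming a pool named tank; the exact wording of the output varies by release):

zpool scrub tank
# ...wait for the scrub to finish, then:
zpool status -v tank
# look for the "Permanent errors have been detected in the following files:" list;
# once the offending files are gone, the error counters can be reset with
zpool clear tank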
Re: [zfs-discuss] ZFS ok for single disk dev box?
> would be very annoying if ZFS barfed on a technicality and I had to reinstall
> the whole OS because of a kernel panic and an unbootable system.

Is this a known scenario with ZFS then? I can't recall hearing of this happening. I've seen plenty of UFS filesystems dying with "panic: freeing free", and then the ensuing fsck-athon convinces the user to just rebuild the fs in question.

cheers,
--justin
Re: [zfs-discuss] ZFS ok for single disk dev box?
> has only one drive. If ZFS detects something bad it might kernel panic and
> lose the whole system right?

What do you mean by "lose the whole system"? A panic is not a bad thing, and it does not mean the machine will fail to reboot successfully. It certainly doesn't guarantee your OS will be trashed.

> I realize UFS /might/ be ignorant of any corruption but it might be more
> usable and go happily on its way without noticing?

UFS has a mount option "onerror" which defines what the OS will do if a problem is detected with a given filesystem. I think the default is "panic" anyway; check the mount_ufs manpage for details.

Your answer is to take regular backups, rather than bury your head in the sand.

cheers,
--justin
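For reference, the onerror behaviour can be chosen per mount. A minimal sketch, with a made-up device and mountpoint (see mount_ufs(1M) for the authoritative list of values):

# onerror accepts panic (the default), lock or umount
mount -F ufs -o onerror=lock /dev/dsk/c0t0d0s7 /mnt
# or put onerror=lock in the options field of that filesystem's /etc/vfstab entry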
Re: [zfs-discuss] number of blocks changes
> I think for the cleanness of the experiment, you should also include "sync"
> after the dd's, to actually commit your file to the pool.

OK that 'fixes' it:

finsdb137@root> dd if=/dev/random of=ob bs=128k count=1 && sync && while true
> do
> ls -s ob
> sleep 1
> done
0+1 records in
0+1 records out
4 ob
4 ob
4 ob
.. etc.

I guess I knew this had something to do with stuff being flushed to disk; I don't know why I didn't think of it myself.

> What is the pool's redundancy setting?

copies=1. Full zfs get below, but in short, it's a basic mirrored root with default settings. Hmm, maybe I should mirror root with copies=2. ;)

> I am not sure what "ls -s" actually accounts for file's FS-block usage, but I
> wonder if it might include metadata (relevant pieces of the block pointer tree
> individual to the file). Also check if the disk usage reported by "du -k ob"
> varies similarly, for the fun of it?

Yes, it varies too.

finsdb137@root> dd if=/dev/random of=ob bs=128k count=1 && while true
> do
> ls -s ob
> du -k ob
> sleep 1
> done
0+1 records in
0+1 records out
1 ob
0 ob
1 ob
0 ob
1 ob
0 ob
1 ob
0 ob
4 ob
2 ob
4 ob
2 ob
4 ob
2 ob
4 ob
2 ob
4 ob
2 ob

finsdb137@root> zfs get all rpool/ROOT/s10s_u9wos_14a
NAME                       PROPERTY              VALUE                  SOURCE
rpool/ROOT/s10s_u9wos_14a  type                  filesystem             -
rpool/ROOT/s10s_u9wos_14a  creation              Tue Mar  1 15:09 2011  -
rpool/ROOT/s10s_u9wos_14a  used                  20.6G                  -
rpool/ROOT/s10s_u9wos_14a  available             37.0G                  -
rpool/ROOT/s10s_u9wos_14a  referenced            20.6G                  -
rpool/ROOT/s10s_u9wos_14a  compressratio         1.00x                  -
rpool/ROOT/s10s_u9wos_14a  mounted               yes                    -
rpool/ROOT/s10s_u9wos_14a  quota                 none                   default
rpool/ROOT/s10s_u9wos_14a  reservation           none                   default
rpool/ROOT/s10s_u9wos_14a  recordsize            128K                   default
rpool/ROOT/s10s_u9wos_14a  mountpoint            /                      local
rpool/ROOT/s10s_u9wos_14a  sharenfs              off                    default
rpool/ROOT/s10s_u9wos_14a  checksum              on                     default
rpool/ROOT/s10s_u9wos_14a  compression           off                    default
rpool/ROOT/s10s_u9wos_14a  atime                 on                     default
rpool/ROOT/s10s_u9wos_14a  devices               on                     default
rpool/ROOT/s10s_u9wos_14a  exec                  on                     default
rpool/ROOT/s10s_u9wos_14a  setuid                on                     default
rpool/ROOT/s10s_u9wos_14a  readonly              off                    default
rpool/ROOT/s10s_u9wos_14a  zoned                 off                    default
rpool/ROOT/s10s_u9wos_14a  snapdir               hidden                 default
rpool/ROOT/s10s_u9wos_14a  aclmode               groupmask              default
rpool/ROOT/s10s_u9wos_14a  aclinherit            restricted             default
rpool/ROOT/s10s_u9wos_14a  canmount              noauto                 local
rpool/ROOT/s10s_u9wos_14a  shareiscsi            off                    default
rpool/ROOT/s10s_u9wos_14a  xattr                 on                     default
rpool/ROOT/s10s_u9wos_14a  copies                1                      default
rpool/ROOT/s10s_u9wos_14a  version               3                      -
rpool/ROOT/s10s_u9wos_14a  utf8only              off                    -
rpool/ROOT/s10s_u9wos_14a  normalization         none                   -
rpool/ROOT/s10s_u9wos_14a  casesensitivity       sensitive              -
rpool/ROOT/s10s_u9wos_14a  vscan                 off                    default
rpool/ROOT/s10s_u9wos_14a  nbmand                off                    default
rpool/ROOT/s10s_u9wos_14a  sharesmb              off                    default
rpool/ROOT/s10s_u9wos_14a  refquota              none                   default
rpool/ROOT/s10s_u9wos_14a  refreservation        none                   default
rpool/ROOT/s10s_u9wos_14a  primarycache          all                    default
rpool/ROOT/s10s_u9wos_14a  secondarycache        all                    default
rpool/ROOT/s10s_u9wos_14a  usedbysnapshots       0                      -
rpool/ROOT/s10s_u9wos_14a  usedbydataset         20.6G                  -
rpool/ROOT/s10s_u9wos_14a  usedbychildren        0                      -
rpool/ROOT/s10s_u9wos_14a  usedbyrefreservation  0                      -
rpool/ROOT/s10s_u9wos_14a  logbias               latency                default
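If you want to see what actually got allocated for the file, one way is via zdb. A sketch only: the dataset name below matches the zfs get output above, the object number is whatever ls -i reports, and zdb's output format differs between releases:

# a ZFS file's inode number is its object number within the dataset
ls -i ob
# dump that object's dnode, block pointers and indirect blocks
zdb -ddddd rpool/ROOT/s10s_u9wos_14a <object-number>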
Re: [zfs-discuss] number of blocks changes
> Can you check whether this happens from /dev/urandom as well?

It does:

finsdb137@root> dd if=/dev/urandom of=oub bs=128k count=1 && while true
> do
> ls -s oub
> sleep 1
> done
0+1 records in
0+1 records out
1 oub
1 oub
1 oub
1 oub
1 oub
4 oub
4 oub
4 oub
4 oub
4 oub
[zfs-discuss] number of blocks changes
While this isn't causing me any problems, I'm curious as to why this is happening:

$ dd if=/dev/random of=ob bs=128k count=1 && while true
> do
> ls -s ob
> sleep 1
> done
0+1 records in
0+1 records out
1 ob
1 ob
1 ob
1 ob
1 ob
1 ob
1 ob
1 ob
1 ob
1 ob
1 ob
1 ob
1 ob
1 ob
1 ob
1 ob
1 ob
1 ob
1 ob
1 ob
1 ob
1 ob
1 ob
1 ob
1 ob
1 ob
1 ob
1 ob
4 ob   < changes here
4 ob
4 ob
^C
$ ls -l ob
-rw-r--r--   1 justin   staff   1040 Aug  3 09:28 ob

I was expecting the '1', since this is a zfs with recordsize=128k. Not sure I understand the '4', or why it happens ~30s later. Can anyone distribute clue in my direction?

s10u10, running 144488-06 KU. zfs is v4, pool is v22.

cheers,
--justin
Re: [zfs-discuss] New fast hash algorithm - is it needed?
> Since there is a finite number of bit patterns per block, have you tried to
> just calculate the SHA-256 or SHA-512 for every possible bit pattern to see
> if there is ever a collision? If you found an algorithm that produced no
> collisions for any possible block bit pattern, wouldn't that be the win?

Perhaps I've missed something, but if there were *never* a collision, you'd have stumbled across a rather impressive lossless compression algorithm. I'm pretty sure there are some Big Mathematical Rules (Shannon?) that mean this cannot be.

cheers,
--justin
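To put back-of-envelope numbers on that pigeonhole argument (my own arithmetic, not from the thread):

possible 128 KiB blocks:   2^(128 x 1024 x 8) = 2^1048576
possible SHA-256 digests:  2^256

Since 2^1048576 is vastly larger than 2^256, any function mapping blocks to 256-bit digests must map many different blocks to the same digest. A collision-free hash over all possible blocks would amount to reversibly shrinking any 128 KiB block down to 32 bytes, which no lossless scheme can do for arbitrary data.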
Re: [zfs-discuss] New fast hash algorithm - is it needed?
> This assumes you have low volumes of deduplicated data. As your dedup
> ratio grows, so does the performance hit from dedup=verify. At, say,
> dedupratio=10.0x, on average, every write results in 10 reads.

Well, you can't make an omelette without breaking eggs! Not a very nice one, anyway.

Yes, dedup is expensive, but, much like using O_SYNC, it's a conscious decision here to take a performance hit in order to be sure about our data. Moving the actual reads to an async thread, as I suggested, should improve things.

cheers,
--justin
Re: [zfs-discuss] New fast hash algorithm - is it needed?
> The point is that hash functions are many to one and I think the point
> was about that verify wasn't really needed if the hash function is good
> enough.

This is a circular argument really, isn't it? Hash algorithms are never perfect, but we're trying to build a perfect one?

It seems to me the obvious fix is to use the hash to identify candidates for dedup, and then do the actual verify and dedup asynchronously. Perhaps a worker thread doing this at low priority? Did anyone consider this?

cheers,
--justin
Re: [zfs-discuss] New fast hash algorithm - is it needed?
>> You do realize that the age of the universe is only on the order of
>> around 10^18 seconds, do you? Even if you had a trillion CPUs each
>> chugging along at 3.0 GHz for all this time, the number of processor
>> cycles you will have executed cumulatively is only on the order 10^40,
>> still 37 orders of magnitude lower than the chance for a random hash
>> collision.

Here we go, boiling the oceans again :)

> Suppose you find a weakness in a specific hash algorithm; you use this
> to create hash collisions and now imagine you store the hash collisions
> in a zfs dataset with dedup enabled using the same hash algorithm.

Sorry, but isn't this what dedup=verify solves? I don't see the problem here. Maybe all that's needed is a comment in the manpage saying hash algorithms aren't perfect.

cheers,
--justin
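For reference, verification is just a dataset property. A minimal sketch, with a made-up dataset name (the accepted values depend on your ZFS version):

# hash with SHA-256 and do a byte-for-byte compare before deduplicating
zfs set dedup=sha256,verify tank/data
# or, keeping the default dedup checksum:
zfs set dedup=verify tank/data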
Re: [zfs-discuss] SPARC SATA, please.
Richard Elling wrote:
> Miles Nordin wrote:
>> "ave" == Andre van Eyssen writes:
>> "et" == Erik Trimble writes:
>> "ea" == Erik Ableson writes:
>> "edm" == "Eric D. Mudama" writes:
>>
>> ave> The LSI SAS controllers with SATA ports work nicely with
>> ave> SPARC.
>>
>> I think what you mean is ``some LSI SAS controllers work nicely with
>> SPARC''. It would help if you tell exactly which one you're using. I
>> thought the LSI 1068 do not work with SPARC (mfi driver, x86 only).
>
> Sun has been using the LSI 1068[E] and its cousin, 1064[E] in SPARC
> machines for many years. In fact, I can't think of a SPARC machine in
> the current product line that does not use either 1068 or 1064 (I'm
> sure someone will correct me, though ;-)
> -- richard

Might be worth having a look at the T1000 to see what's in there. We used to ship those with SATA drives in.

cheers,
--justin
Re: [zfs-discuss] Concat'ed pool vs. striped pool
> But, if mypool was a concatenation, things would get written onto the c0t1d0
> first, and if any one of the subsequent disks were to fail, I should be able
> to recover everything off of mypool, as long as I have not filled up c0t1d0,
> since things were written sequentially, rather than across all disks like
> striping.

I think the circumstances where this would work are very unlikely, and I don't know that ZFS gives you any guarantee that it's going to write to the front of a given device and then work back from there, does it? Even if it did, what about if the pool filled up, and then emptied out again while you weren't looking? Some data might be left on the last device.

Neither a simple concat nor a stripe has any resilience to disk failure; you must use mirrors or raidz to achieve that.

> Is my understanding correct, or am I totally off the wall here? And, if I AM
> correct, how do you create a concatenated zpool?

You can't. ZFS dynamically stripes across top-level vdevs. Whichever order you add them into the pool, they will effectively be treated as a stripe.

regards,
--justin
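To illustrate (disk and pool names are made up): however the devices arrive, they become top-level vdevs that ZFS dynamically stripes across, and there is no concat mode.

# both of these end up as a two-way dynamic stripe, not a concatenation
zpool create mypool c0t1d0 c0t2d0
# ...or...
zpool create mypool c0t1d0
zpool add mypool c0t2d0

# for resilience you'd use mirrors (or raidz) instead, e.g.
zpool create mypool mirror c0t1d0 c0t2d0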
Re: [zfs-discuss] ZFS deduplication
> with other Word files. You will thus end up seeking all over the disk
> to read _most_ Word files. Which really sucks.

> very limited, constrained usage. Disk is just so cheap, that you
> _really_ have to have an enormous amount of dup before the performance
> penalties of dedup are countered.

Neither of these holds true for SSDs though, does it? Seeks are essentially free, and the devices are not cheap.

cheers,
--justin
Re: [zfs-discuss] ZFS deduplication
> Does anyone know a tool that can look over a dataset and give
> duplication statistics? I'm not looking for something incredibly
> efficient but I'd like to know how much it would actually benefit our

Check out the following blog:

http://blogs.sun.com/erickustarz/entry/how_dedupalicious_is_your_pool
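On builds that already have dedup support, zdb can also simulate it against an existing pool. A rough sketch, assuming a pool named tank (it walks the whole pool, so it can take a while):

# prints a simulated dedup table histogram and the ratio dedup would achieve
zdb -S tank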
Re: [zfs-discuss] ZFS deduplication
> Raw storage space is cheap. Managing the data is what is expensive.

Not for my customer. Internal accounting means that the storage team gets paid for each allocated GB on a monthly basis. They have stacks of IO bandwidth and CPU cycles to spare outside of their daily busy period. I can't think of a better use of their time than a scheduled dedup.

> Perhaps deduplication is a response to an issue which should be solved
> elsewhere?

I don't think you can make this generalisation. For most people, yes, but not everyone.

cheers,
--justin
Re: [zfs-discuss] Can ZFS be event-driven or not?
> UFS == Ultimate File System
> ZFS == Zettabyte File System

It's a nit, but:

UFS != Ultimate File System
ZFS != Zettabyte File System

cheers,
--justin
Re: [zfs-discuss] ZFS not utilizing all disks
> Simple test - mkfile 8gb now and see where the data goes... :)

Unless you've got compression=on, in which case you won't see anything!

cheers,
--justin
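mkfile writes blocks of zeros, which a compression-enabled filesystem stores as (almost) nothing. A quick illustration, with made-up paths and sizes:

# on a compression=on filesystem, the zeros from mkfile take essentially no space
mkfile 1g /tank/fs/zeros
# incompressible data shows up as expected
dd if=/dev/urandom of=/tank/fs/random bs=128k count=8192
du -h /tank/fs/zeros /tank/fs/random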
Re: [zfs-discuss] need some explanation
> zpool list doesn't reflect pool usage stats instantly. Why?

This is no different to how UFS behaves. If you rm a file, this uses the system call unlink(2) to do the work, which is asynchronous. In other words, unlink(2) almost immediately returns a successful return code to rm (which can then exit, and return the user to a shell prompt), while leaving a kernel thread running to actually finish off freeing up the used space.

Normally you don't see this because it happens very quickly, but once in a while you blow a 100GB file away which may well have a significant amount of metadata associated with it that needs clearing down.

I guess if you wanted to force this to be synchronous you could do something like this:

rm /tank/myfs/bigfile && lockfs /tank/myfs

which would not return until the whole filesystem was flushed back to disk. I don't think you can force a flush at a finer granularity than that. Anyone?

regards,
--justin
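A rough way to watch the deferred free actually happening (pool and file names are made up):

rm /tank/myfs/bigfile
# the allocated space reported for the pool drops over the next few seconds,
# as the frees are committed in subsequent transaction groups
while true; do zpool list tank; sleep 1; done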
Re: [zfs-discuss] ZFS vs. rmvolmgr
> Is there a more elegant approach that tells rmvolmgr to leave certain
> devices alone on a per disk basis?

I was expecting there to be something in rmmount.conf to allow a specific device or pattern to be excluded, but there appears to be nothing. Maybe this is an RFE?
Re: [zfs-discuss] ZFS mount fails at boot
Matt,

Can't see anything wrong with that procedure. However, could the problem be that you're trying to mount on /home, which is usually used by the automounter? e.g.

$ grep home /etc/auto_master
/home  auto_home  -nobrowse

Maybe you need to deconfigure this from your automounter or change your mountpoint (see the sketch at the end of this message).

cheers,
--justin


I have about a dozen two-disk systems that were all set up the same using a combination of SVM and ZFS.

s0 = / SVM mirror
s1 = swap
s3 = /tmp
s4 = metadb
s5 = zfs mirror

The system does boot, but once it gets to zfs, zfs fails and all subsequent services fail as well (including ssh). /home, /tmp, and /data are on the zfs mirror. /var is on its own UFS/SVM mirror, as are root and swap. I included the errors I am getting as well as the exact commands I used to build both the SVM and ZFS mirrors (all of which appeared to work flawlessly). I am guessing there is just something really simple that needs to be set. Any ideas?

--Errors--

vfcufs01# cat /var/svc/log/system-filesystem-local:default.log
[ Mar 16 11:02:58 Rereading configuration. ]
[ Mar 16 11:03:37 Executing start method ("/lib/svc/method/fs-local") ]
bootadm: no matching entry found: Solaris_reboot_transient
[ Mar 16 11:03:37 Method "start" exited with status 0 ]
[ Mar 16 13:25:58 Executing start method ("/lib/svc/method/fs-local") ]
bootadm: no matching entry found: Solaris_reboot_transient
[ Mar 16 13:25:58 Method "start" exited with status 0 ]
[ Mar 20 15:26:32 Executing start method ("/lib/svc/method/fs-local") ]
bootadm: no matching entry found: Solaris_reboot_transient
WARNING: /usr/sbin/zfs mount -a failed: exit status 1
[ Mar 20 15:26:32 Method "start" exited with status 95 ]
[ Mar 21 08:27:37 Leaving maintenance because disable requested. ]
[ Mar 21 08:27:37 Disabled. ]
[ Mar 21 08:32:22 Executing start method ("/lib/svc/method/fs-local") ]
bootadm: no matching entry found: Solaris_reboot_transient
WARNING: /usr/sbin/zfs mount -a failed: exit status 1
[ Mar 21 08:32:23 Method "start" exited with status 95 ]
[ Mar 21 08:50:20 Leaving maintenance because disable requested. ]
[ Mar 21 08:50:20 Disabled. ]
[ Mar 21 08:55:07 Executing start method ("/lib/svc/method/fs-local") ]
bootadm: no matching entry found: Solaris_reboot_transient
WARNING: /usr/sbin/zfs mount -a failed: exit status 1
[ Mar 21 08:55:07 Method "start" exited with status 95 ]

--Commands run to make SVM and ZFS mirror--

prtvtoc /dev/rdsk/c0t0d0s2 | fmthard -s - /dev/rdsk/c0t1d0s2
installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c0t1d0s0
metadb -a -f -c 2 c0t0d0s4 c0t1d0s4
metainit -f d10 1 1 c0t0d0s0
metainit -f d11 1 1 c0t0d0s1
metainit -f d13 1 1 c0t0d0s3
metainit -f d20 1 1 c0t1d0s0
metainit -f d21 1 1 c0t1d0s1
metainit -f d23 1 1 c0t1d0s3
metainit d0 -m d10
metainit d1 -m d11
metainit d3 -m d13
metaroot d0

Update /etc/vfstab so that the swap partition points to d1, just as root was modified by the last command to point to d0. The swap line in vfstab should look like this:

/dev/md/dsk/d1  -  -  swap  -  no  -

lockfs -fa

Reboot. After reboot:

metattach d0 d20
metattach d1 d21
metattach d3 d23

Then do this to check the status of the mirroring:

metastat | grep "%"

Wait until the syncs are complete.

zpool create zpool mirror c0t0d0s5 c0t1d0s5

Create the filesystems:

umount /home
umount /tmp
rm -rf /data
rm -rf /home
rm -rf /tmp
zfs create zpool/data
zfs create zpool/home
zfs create zpool/tmp
sleep 10

Make the directories for the mountpoints:

mkdir /data
mkdir /home
mkdir /tmp

Set the mountpoints:

zfs set mountpoint=/data zpool/data
zfs set mountpoint=/home zpool/home
zfs set mountpoint=/tmp zpool/tmp

Now you should have the regular roots for these. Turn ZFS compression on:

zfs set compression=on zpool/data
zfs set compression=on zpool/home
zfs set compression=on zpool/tmp

Set the quotas:

zfs set quota=4G zpool/home
zfs set quota=1G zpool/tmp

This message posted from opensolaris.org
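If the automounter does turn out to be the culprit, a rough sketch of the two workarounds mentioned above (treat this as illustrative rather than a tested recipe; the dataset names follow the commands above):

# either stop the automounter from claiming /home: comment out the
# "/home auto_home -nobrowse" line in /etc/auto_master, then
svcadm restart svc:/system/filesystem/autofs:default

# ...or move the ZFS filesystem somewhere the automounter doesn't manage
zfs set mountpoint=/export/home zpool/home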
Re: [zfs-discuss] ZFS needs a viable backup mechanism
> Why aren't you using amanda or something else that uses
> tar as the means by which you do a backup?

Using something like tar to take a backup forgoes the ability to do things like the clever incremental backups that ZFS can achieve, though; e.g. only backing up the few blocks that have changed in a very large file, rather than the whole file regardless. If 'zfs send' doesn't do something we need, we should fix it rather than avoid it, IMO.

cheers,
--justin
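For example, a snapshot-based incremental backup looks something like this (pool, dataset and host names are made up):

# full send of an initial snapshot
zfs snapshot tank/data@monday
zfs send tank/data@monday | ssh backuphost "zfs receive backup/data"

# later, send only the blocks that changed since the previous snapshot
zfs snapshot tank/data@tuesday
zfs send -i tank/data@monday tank/data@tuesday | ssh backuphost "zfs receive backup/data"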