Re: [zfs-discuss] oddity of slow zfs destroy
First, a disclaimer: I do not know how ZFS dataset destruction is implemented in reality, but I can guess at least a couple of legitimate reasons for a slow destruction.

2012-06-25 21:55, Philip Brown wrote:
> I ran into something odd today:
>
> zfs destroy -r random/filesystem
>
> is mindbogglingly slow. But it seems to me, it shouldn't be. It's slow because the filesystem has two snapshots on it. Presumably, it's busy "rolling back" the snapshots. But I've already declared by my command line that I DON'T CARE about the contents of the filesystem! Why doesn't zfs simply do:
>
> 1. unmount filesystem, if possible (it was possible)
> (1.5 possibly note "intent to delete" somewhere in the pool records)
> 2. zero out/free the in-kernel memory in one go
> 3. update the pool, "hey I deleted the filesystem, all these blocks are now clear"

Basically, your ideal fast destruction would be the pruning of the dataset tree (the node under which the snapshots' and the live dataset's blocks are rooted and accounted for). In this case "everything not allocated is free", or at least it might be made this way.

The slow part is, likely, a walk of the block-pointer tree (through all the random on-disk locations) and some sort of revision in order to release the blocks. So, what can be done at this step (speculation follows)?

* Blocks might have been written as deduped; in this case we have to decrease the reference counters in the DDT - but first we have to walk the dataset's branch of the block-pointer tree and see if any blocks have the "dedup" bit-flag set.
* A simpler case is the presence of cloned datasets based on snapshots of this dataset. Unless you're destroying the whole family of sibling datasets, the clones have to be promoted and the referenced blocks reassigned to them (including reassignment of the snapshot "ownership").
* Even for the "trivial" step (2) of yours, the freeing of memory, we need to know which ARC-cached blocks to free.
How can we know that without walking the BP tree first?

I listed just a few reasons, off the top of my head, why a walk of the whole BP-tree branch is required to free the blocks referenced by this tree. Any further operations needed, such as modifications to the DDT, may delay the result. In particular, this may be why recent versions of zfs/zpool worked towards asynchronous destruction and a "deferred free" capability: the destroyed branch can be quickly marked as deleted, and then the kernel works in the background to do its processing. In my (and not only my) problematic cases this could require prodigious amounts of RAM, especially with dedup processing in play, and cause the computer to freeze. However, sometime after ZFSv22, the deferred freeing in such cases just takes several hard resets to complete, instead of taking truly forever with no progress ;)

Basically, the steps you outlined should be there already, in some manner, at least for ZFSv28. So, the practical questions are:

* your version of zpool/zfs; OS version?
* presence of deduplication on this dataset (and dedup support in the OS version - lack of it may leave fewer code paths to follow and check, and be faster just due to that; i.e. Solaris 10 nominally has ZFSv29(?), but not all features are implemented as in Solaris 11 or OpenSolaris of similar ZFS versions);
* did you use clones?
* fragmentation (or how busy is the pool while processing the deletion, in terms of IOPS)?

HTH,
//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
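The reasoning above - walk the dataset's branch of the BP tree, check each block's "dedup" flag, adjust DDT refcounts, and only then free - can be sketched as a toy model. This is purely illustrative: names like `BlockPointer` and `destroy_branch` are invented for this sketch and are not the actual ZFS implementation.

```python
# Toy model of why destroying a dataset requires walking its block-pointer
# tree: a block can only be freed (or its DDT refcount decremented) once its
# pointer has been visited. All names are illustrative, not real ZFS code.

class BlockPointer:
    def __init__(self, addr, dedup=False, children=()):
        self.addr = addr          # on-disk location of the block
        self.dedup = dedup        # the "dedup" bit-flag mentioned above
        self.children = children  # indirect blocks point at further BPs

def destroy_branch(root, ddt, free_list):
    """Walk one dataset's branch of the BP tree, releasing each block."""
    stack = [root]
    while stack:
        bp = stack.pop()
        if bp.dedup:
            # Deduped block: decrement the DDT refcount; only free the
            # physical block once no other dataset references it.
            ddt[bp.addr] -= 1
            if ddt[bp.addr] == 0:
                free_list.append(bp.addr)
        else:
            free_list.append(bp.addr)
        stack.extend(bp.children)

# A 3-block dataset where block "B" is deduped (shared with another dataset):
ddt = {"B": 2}  # block "B" is referenced twice pool-wide
root = BlockPointer("A", children=(
    BlockPointer("B", dedup=True),
    BlockPointer("C"),
))
freed = []
destroy_branch(root, ddt, freed)
print(sorted(freed))  # block "B" stays allocated: the other reference survives
```

The point of the sketch is that even the "I don't care about the contents" case cannot skip the walk: without visiting every BP, there is no way to know which blocks are dedup-shared and must survive.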
Re: [zfs-discuss] oddity of slow zfs destroy
On Jun 25, 2012, at 10:55 AM, Philip Brown wrote:
> I ran into something odd today:
>
> zfs destroy -r random/filesystem
>
> is mindbogglingly slow. But it seems to me, it shouldn't be. It's slow because the filesystem has two snapshots on it. Presumably, it's busy "rolling back" the snapshots. But I've already declared by my command line that I DON'T CARE about the contents of the filesystem! Why doesn't zfs simply do:
>
> 1. unmount filesystem, if possible (it was possible)
> (1.5 possibly note "intent to delete" somewhere in the pool records)
> 2. zero out/free the in-kernel memory in one go
> 3. update the pool, "hey I deleted the filesystem, all these blocks are now clear"
>
> Having this kind of operation take more than even 10 seconds seems like a huge bug to me, yet it can take many minutes. An order of magnitude off. Yuck.

Agree. Asynchronous destroy has been integrated into illumos. Look for it soon in the distributions derived from illumos. For more information, see Chris Siden and Matt Ahrens' discussions on async destroy and ZFS feature flags at the ZFS Meetup in January 2012, here:

http://blog.delphix.com/ahl/2012/zfs10-illumos-meetup/

-- richard

--
ZFS and performance consulting
http://www.RichardElling.com
[zfs-discuss] oddity of slow zfs destroy
I ran into something odd today:

zfs destroy -r random/filesystem

is mindbogglingly slow. But it seems to me, it shouldn't be. It's slow because the filesystem has two snapshots on it. Presumably, it's busy "rolling back" the snapshots. But I've already declared by my command line that I DON'T CARE about the contents of the filesystem! Why doesn't zfs simply do:

1. unmount filesystem, if possible (it was possible)
(1.5 possibly note "intent to delete" somewhere in the pool records)
2. zero out/free the in-kernel memory in one go
3. update the pool, "hey I deleted the filesystem, all these blocks are now clear"

Having this kind of operation take more than even 10 seconds seems like a huge bug to me, yet it can take many minutes. An order of magnitude off. Yuck.
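The sequence proposed here - note an "intent to delete", return at once, and reclaim blocks later - is essentially a deferred destroy. A toy sketch of that idea (all names invented; not actual ZFS code, which does the reclaim inside the pool's transaction machinery):

```python
# Minimal sketch of a deferred ("async") destroy: the foreground call only
# records an intent-to-delete and returns; a background pass does the slow
# per-block work. All names are illustrative, not real ZFS code.
import queue
import threading

class Pool:
    def __init__(self):
        self.datasets = {}          # name -> list of block addresses
        self.free_blocks = set()
        self._deferred = queue.Queue()
        self._worker = threading.Thread(target=self._reclaim, daemon=True)
        self._worker.start()

    def destroy(self, name):
        """Fast path: unhook the dataset and note the intent to delete."""
        blocks = self.datasets.pop(name)  # steps 1/1.5: dataset disappears now
        self._deferred.put(blocks)        # the slow walk happens later
        return "destroyed"                # caller returns in O(1), not O(blocks)

    def _reclaim(self):
        """Background pass: the slow per-block work (steps 2-3)."""
        while True:
            for addr in self._deferred.get():
                self.free_blocks.add(addr)  # real ZFS also updates space maps, DDT...
            self._deferred.task_done()

pool = Pool()
pool.datasets["random/filesystem"] = [f"blk{i}" for i in range(1000)]
print(pool.destroy("random/filesystem"))  # returns immediately
pool._deferred.join()                      # wait for background reclaim (demo only)
print(len(pool.free_blocks))
```

The trade-off, as the rest of the thread notes, is that the pool's free space only becomes available as the background pass progresses.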
Re: [zfs-discuss] [developer] History of EPERM for unlink() of directories on ZFS?
Eric Schrock wrote:
> On Mon, Jun 25, 2012 at 11:19 AM, wrote:
> > In the very beginning, mkdir(1) was a set-uid application; it used
> > "mknod" to make a directory and then created a link from
> > newdir to newdir/.
> > and from
> > "." to newdir/..
>
> Interesting, guess you learn something new every day :-)
>
> http://minnie.tuhs.org/cgi-bin/utree.pl?file=V7/usr/src/cmd/mkdir.c

This was a nice way to become superuser in those days. Just run a loop making a directory in /tmp, and run another program that tries to remove the directory and replace it with a hardlink to /etc/passwd. Mkdir(1) then did a "chown /etc/passwd"... We tried this, and it took approx. 30 minutes to become superuser this way.

BSD introduced the syscall mkdir(2) to fix this, and this is why UFS was not designed to support link(2) on directories.

BTW: to implement mkdir(2), there was a new struct dirtemplate in the kernel with the following comment:

    /*
     * A virgin directory (no blushing please).
     */
    struct dirtemplate mastertemplate = {
        0, 12, 1, ".",
        0, DIRBLKSIZ - 12, 2, ".."
    };

This was the first time Sun proved it has no humor, as Sun removed that comment...

Jörg

--
EMail: jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
j...@cs.tu-berlin.de (uni) joerg.schill...@fokus.fraunhofer.de (work)
Blog: http://schily.blogspot.com/
URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Re: [zfs-discuss] [developer] History of EPERM for unlink() of directories on ZFS?
On Mon, Jun 25, 2012 at 11:19 AM, wrote:
> In the very beginning, mkdir(1) was a set-uid application; it used
> "mknod" to make a directory and then created a link from
> newdir to newdir/.
> and from
> "." to newdir/..

Interesting, guess you learn something new every day :-)

http://minnie.tuhs.org/cgi-bin/utree.pl?file=V7/usr/src/cmd/mkdir.c

Thanks,
- Eric

--
Eric Schrock Delphix http://blog.delphix.com/eschrock
275 Middlefield Road, Suite 50 Menlo Park, CA 94025 http://www.delphix.com
Re: [zfs-discuss] [developer] History of EPERM for unlink() of directories on ZFS?
Eric Schrock wrote:
> The decision to not support link(2) of directories was very deliberate - it
> is an abomination that never should have been allowed in the first place.
> My guess is that the behavior of unlink(2) on directories is a direct
> side-effect of that (if link isn't supported, then why support unlink?).
> Also worth noting that ZFS also doesn't let you open(2) directories and
> read(2) from them, something (I believe) UFS does allow.

Link/unlink on directories is not a property of UFS. UFS was designed without that feature, but it was added by AT&T with SVr4.

Jörg

--
EMail: jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
j...@cs.tu-berlin.de (uni) joerg.schill...@fokus.fraunhofer.de (work)
Blog: http://schily.blogspot.com/
URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Re: [zfs-discuss] [developer] History of EPERM for unlink() of directories on ZFS?
> The decision to not support link(2) of directories was very deliberate - it
> is an abomination that never should have been allowed in the first place.
> My guess is that the behavior of unlink(2) on directories is a direct
> side-effect of that (if link isn't supported, then why support unlink?).
> Also worth noting that ZFS also doesn't let you open(2) directories and
> read(2) from them, something (I believe) UFS does allow.

In the very beginning, mkdir(1) was a set-uid application; it used "mknod" to make a directory and then created a link from newdir to newdir/. and from "." to newdir/..

Traditionally, this was only allowed for the superuser, and when we added privileges, a special privilege was added for it. I think we should remove it for the other filesystems, too.

Casper
Re: [zfs-discuss] [developer] History of EPERM for unlink() of directories on ZFS?
The decision to not support link(2) of directories was very deliberate - it is an abomination that never should have been allowed in the first place. My guess is that the behavior of unlink(2) on directories is a direct side-effect of that (if link isn't supported, then why support unlink?). Also worth noting that ZFS also doesn't let you open(2) directories and read(2) from them, something (I believe) UFS does allow.

- Eric

On Mon, Jun 25, 2012 at 10:40 AM, Garrett D'Amore wrote:
> I don't know the precise history, but I think it's a mistake to permit
> direct link() or unlink() of directories. I do note that on BSD (MacOS at
> least) unlink returns EPERM if the executing user is not superuser. I do
> see that the man page for unlink() says this on illumos:
>
>     The named file is a directory and {PRIV_SYS_LINKDIR} is not asserted
>     in the effective set of the calling process, or the filesystem
>     implementation does not support unlink() or unlinkat() on directories.
>
> I can't imagine why you'd *ever* want to support unlink() of a *directory*
> -- what's the use case for it anyway (outside of filesystem repair)?
>
> Garrett D'Amore
> garr...@damore.org
>
> On Jun 25, 2012, at 2:23 AM, Lionel Cons wrote:
> > Does someone know the history which led to the EPERM for unlink() of
> > directories on ZFS? Why was this done this way, and not something like
> > allowing the unlink and executing it on the next scrub or remount?
> >
> > Lionel

---
illumos-developer
Archives: https://www.listbox.com/member/archive/182179/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182179/21175057-f8151d0d
Modify Your Subscription: https://www.listbox.com/member/?member_id=21175057&id_secret=21175057-02786781
Powered by Listbox: http://www.listbox.com

--
Eric Schrock Delphix http://blog.delphix.com/eschrock
275 Middlefield Road, Suite 50 Menlo Park, CA 94025 http://www.delphix.com
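Whether open(2)/read(2) on a directory works is filesystem- and OS-dependent, as the message above suggests. A quick probe (on modern Linux, for example, the open(2) succeeds but the read(2) fails with EISDIR; historical UFS instead returned the raw directory entries):

```python
# Probe whether the running OS/filesystem allows read(2) on a directory.
import errno
import os
import tempfile

d = tempfile.mkdtemp()
fd = os.open(d, os.O_RDONLY)   # opening a directory read-only generally works
try:
    os.read(fd, 512)
    result = "readable"        # UFS-style behavior: raw dir entries come back
except OSError as e:
    result = errno.errorcode.get(e.errno, str(e.errno))  # EISDIR on Linux
finally:
    os.close(fd)
    os.rmdir(d)

print("read(2) on a directory:", result)
```

ZFS, per Eric's note, refuses even earlier, at open(2) time.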
Re: [zfs-discuss] [developer] History of EPERM for unlink() of directories on ZFS?
I don't know the precise history, but I think it's a mistake to permit direct link() or unlink() of directories. I do note that on BSD (MacOS at least) unlink returns EPERM if the executing user is not superuser. I do see that the man page for unlink() says this on illumos:

    The named file is a directory and {PRIV_SYS_LINKDIR} is not asserted in
    the effective set of the calling process, or the filesystem
    implementation does not support unlink() or unlinkat() on directories.

I can't imagine why you'd *ever* want to support unlink() of a *directory* -- what's the use case for it anyway (outside of filesystem repair)?

Garrett D'Amore
garr...@damore.org

On Jun 25, 2012, at 2:23 AM, Lionel Cons wrote:
> Does someone know the history which led to the EPERM for unlink() of
> directories on ZFS? Why was this done this way, and not something like
> allowing the unlink and executing it on the next scrub or remount?
>
> Lionel
Re: [zfs-discuss] (fwd) Re: ZFS NFS service hanging on Sunday
In Solaris, ZFS caches many things, so you should have more RAM. If you set up 16 GB of swap, IMHO, RAM should be higher than 4 GB.

Regards

Sent from my iPad

On Jun 25, 2012, at 5:58, tpc...@mklab.ph.rhul.ac.uk wrote:
>> 2012-06-14 19:11, tpc...@mklab.ph.rhul.ac.uk wrote:
>>>> In message <201206141413.q5eedvzq017...@mklab.ph.rhul.ac.uk>, tpc...@mklab.ph.rhul.ac.uk writes:
>>>>> Memory: 2048M phys mem, 32M free mem, 16G total swap, 16G free swap
>>>> My WAG is that your "zpool history" is hanging due to lack of RAM.
>>>
>>> Interesting. In the problem state the system is usually quite responsive, e.g. not memory-thrashing. Under Linux, which I'm more familiar with, 'used memory' = 'total memory' - 'free memory' refers to physical memory being used for data caching by the kernel, which is still available for processes to allocate as needed, together with memory allocated to processes - as opposed to only physical memory already allocated and therefore really 'used'. Does this mean something different under Solaris?
>>
>> Well, it is roughly similar. In Solaris there is a general notion
>
> [snipped]
>
> Dear Jim,
> Thanks for the detailed explanation of ZFS memory usage. Special thanks also to John D Groenveld for the initial suggestion of a lack-of-RAM problem. Since upping the RAM from 2GB to 4GB, the machine has sailed through the last two Sunday mornings w/o problem. I was interested to subsequently discover the Solaris command 'echo ::memstat | mdb -k', which reveals just how much memory ZFS can use.
>
> Best regards
> Tom.
>
> --
> Tom Crane, Dept. Physics, Royal Holloway, University of London, Egham Hill,
> Egham, Surrey, TW20 0EX, England.
> Email: T.Crane@rhul dot ac dot uk
Re: [zfs-discuss] (fwd) Re: ZFS NFS service hanging on Sunday
> > 2012-06-14 19:11, tpc...@mklab.ph.rhul.ac.uk wrote:
> >>> In message <201206141413.q5eedvzq017...@mklab.ph.rhul.ac.uk>, tpc...@mklab.ph.rhul.ac.uk writes:
> >>>> Memory: 2048M phys mem, 32M free mem, 16G total swap, 16G free swap
> >>> My WAG is that your "zpool history" is hanging due to lack of RAM.
> >
> > Interesting. In the problem state the system is usually quite responsive, e.g. not memory-thrashing. Under Linux, which I'm more familiar with, 'used memory' = 'total memory' - 'free memory' refers to physical memory being used for data caching by the kernel, which is still available for processes to allocate as needed, together with memory allocated to processes - as opposed to only physical memory already allocated and therefore really 'used'. Does this mean something different under Solaris?
>
> Well, it is roughly similar. In Solaris there is a general notion

[snipped]

Dear Jim,
Thanks for the detailed explanation of ZFS memory usage. Special thanks also to John D Groenveld for the initial suggestion of a lack-of-RAM problem. Since upping the RAM from 2GB to 4GB, the machine has sailed through the last two Sunday mornings w/o problem. I was interested to subsequently discover the Solaris command 'echo ::memstat | mdb -k', which reveals just how much memory ZFS can use.

Best regards
Tom.

--
Tom Crane, Dept. Physics, Royal Holloway, University of London, Egham Hill,
Egham, Surrey, TW20 0EX, England.
Email: T.Crane@rhul dot ac dot uk
Re: [zfs-discuss] History of EPERM for unlink() of directories on ZFS?
> Does someone know the history which led to the EPERM for unlink() of
> directories on ZFS? Why was this done this way, and not something like
> allowing the unlink and executing it on the next scrub or remount?

It's not about the unlink(), it's about the link() and unlink() together. By not allowing link & unlink of directories, you force the filesystem to contain only trees and not graphs. Allowing them also lets you create directories where ".." points to a directory whose inode cannot be found, simply because it was just removed. The support for link() on directories in ufs has always given issues and would create problems fsck couldn't fix.

To be honest, I think we should also remove this from all other filesystems, and I think ZFS was created this way because all modern filesystems do it that way.

Casper
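Modern kernels enforce the trees-not-graphs rule at the syscall layer for ordinary users; a quick demonstration (the exact errno is OS-dependent: Linux returns EPERM for link(2) on a directory and EISDIR for unlink(2), while POSIX specifies EPERM for both):

```python
# Demonstrate that directories cannot be hardlinked or unlinked directly,
# which is what keeps the namespace a tree rather than an arbitrary graph.
import errno
import os
import tempfile

d = tempfile.mkdtemp()
link_err = unlink_err = None
try:
    try:
        os.link(d, d + ".alias")        # a second name for a directory...
    except OSError as e:
        link_err = errno.errorcode.get(e.errno, str(e.errno))
    try:
        os.unlink(d)                    # ...or removing one name via unlink(2)
    except OSError as e:
        unlink_err = errno.errorcode.get(e.errno, str(e.errno))
finally:
    os.rmdir(d)                         # rmdir(2) is the sanctioned way

print("link(2) on a directory:", link_err)
print("unlink(2) on a directory:", unlink_err)
```

On ZFS the same calls fail for the reasons discussed in this thread, unless the caller holds PRIV_SYS_LINKDIR on Solaris/illumos.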
[zfs-discuss] History of EPERM for unlink() of directories on ZFS?
Does someone know the history which led to the EPERM for unlink() of directories on ZFS? Why was this done this way, and not something like allowing the unlink and executing it on the next scrub or remount?

Lionel