Re: [zfs-discuss] oddity of slow zfs destroy

2012-06-25 Thread Jim Klimov

First, a disclaimer: I do not know how ZFS dataset destruction
is actually implemented, but I can guess at least a couple of
legitimate reasons why a destruction could be slow.

2012-06-25 21:55, Philip Brown wrote:

I ran into something odd today:

zfs destroy -r  random/filesystem

is mind-bogglingly slow, but it seems to me that it shouldn't be.
It's slow because the filesystem has two snapshots on it. Presumably,
it's busy "rolling back" the snapshots.
But I've already declared, via my command line, that I DON'T CARE about the
contents of the filesystem!
Why doesn't zfs simply do:

1. unmount filesystem, if possible (it was possible)
(1.5 possibly note "intent to delete" somewhere in the pool records)
2. zero out/free the in-kernel-memory in one go
3. update the pool, "hey I deleted the filesystem, all these blocks are
now clear"


Basically, your ideal fast destruction would be the pruning of
the dataset tree (the node under which the snapshots' and the
live dataset's blocks are rooted and accounted for). In that
case "everything not allocated is free", or at least it could
be made to work that way. The slow part is most likely a walk
of the block-pointer tree (through all the random on-disk
locations) and some sort of accounting pass in order to release
the blocks. So, what has to be done at this step (speculation follows)?

* Blocks might have been written as deduped; in this case we
  have to decrease the reference counters in DDT - but first
  we have to walk the dataset's branch of the block-pointer
  tree and see if any have the "dedup" bit-flag set.

* A simpler case is the presence of cloned datasets based on
  snapshots of this dataset. Unless you're destroying the whole
  family of sibling datasets, the clones have to be promoted,
  and the referenced blocks have to be reassigned to those
  datasets (including reassignment of the snapshot "ownership").

* Even for the "trivial" step (2) of yours, the freeing of
  memory, we need to know which ARC-cached blocks to free.
  How can we know that without walking the BP tree first?

I listed just a few reasons off the top of my head why a
walk of the whole BP-tree branch is required to free the
blocks referenced by this tree. If any further operations
are needed, such as modifications to DDT, they may delay
the result.
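
To make the cost concrete, here is a tiny self-contained toy model in C.
Every name and structure in it is invented for illustration - it is not
the actual ZFS code, and it omits snapshots and clones entirely - but it
shows the core problem: even though the caller does not care about the
data, every block pointer still has to be visited, because only the walk
reveals whether a block is deduped and still referenced elsewhere, or can
really be freed.

/* Toy model of a destroy-time walk (invented names, not real ZFS code). */
#include <stdio.h>

#define MAX_CHILDREN 4

typedef struct blk {
    int         dedup;                 /* the "dedup" bit-flag            */
    int         ddt_index;             /* entry in the toy dedup table    */
    int         nchildren;             /* 0 means a leaf (data) block     */
    struct blk *child[MAX_CHILDREN];
} blk_t;

static int ddt_refcount[16];           /* toy DDT: reference counters     */

/* Visit every block pointer under 'bp'.  A deduped block may only be
 * freed once its DDT refcount drops to zero; everything else is freed
 * outright.  Either way, the block has to be looked at. */
static long
destroy_walk(blk_t *bp, long *freed)
{
    long visited = 1;
    int i;

    for (i = 0; i < bp->nchildren; i++)        /* recurse into children  */
        visited += destroy_walk(bp->child[i], freed);

    if (bp->dedup && --ddt_refcount[bp->ddt_index] > 0)
        return visited;                /* still referenced elsewhere      */

    (*freed)++;                        /* safe to release this block      */
    return visited;
}

int
main(void)
{
    /* Two leaves share one deduped DDT entry; a third leaf is unique. */
    blk_t leaf_a = { 1, 0, 0, { 0 } };
    blk_t leaf_b = { 1, 0, 0, { 0 } };
    blk_t leaf_c = { 0, 0, 0, { 0 } };
    blk_t root   = { 0, 0, 3, { &leaf_a, &leaf_b, &leaf_c } };
    long freed = 0, visited;

    ddt_refcount[0] = 3;               /* one extra reference elsewhere   */

    visited = destroy_walk(&root, &freed);
    printf("visited %ld block pointers, freed %ld blocks\n", visited, freed);
    return 0;
}

In a real pool each of those child visits is a read from a more or less
random on-disk location, which is where the minutes go.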

In particular, this may be why recent versions of zfs/zpool
have worked towards asynchronous destruction and a "deferred
free" capability. The destroyed branch can be quickly marked
as deleted, and the kernel then does its processing in the
background. In my (and not only my) problematic cases this
could require prodigious amounts of RAM, especially with
dedup processing in play, and could freeze the computer.
However, sometime after ZFSv22, the deferred freeing in
such cases just takes several hard resets to complete,
instead of taking truly forever with no progress ;)

Basically, the steps you outlined should be there already,
in some manner, at least for ZFSv28.

So, the practical questions are:
* your version of zpool/zfs; OS version?
* presence of deduplication on this dataset (and dedup
  support in the OS version - lack of it may mean fewer
  code paths to follow and check, and thus be faster just
  due to that; e.g. Solaris 10 nominally has ZFSv29(?),
  but not all features are implemented as in Solaris 11
  or OpenSolaris builds of similar ZFS versions);
* did you use clones?
* fragmentation (or how busy is the pool while processing
  the deletion, in terms of IOPS)?

HTH,
//Jim


Re: [zfs-discuss] oddity of slow zfs destroy

2012-06-25 Thread Richard Elling
On Jun 25, 2012, at 10:55 AM, Philip Brown wrote:

> I ran into something odd today:
> 
> zfs destroy -r  random/filesystem
> 
> is mind-bogglingly slow, but it seems to me that it shouldn't be.
> It's slow because the filesystem has two snapshots on it. Presumably, it's 
> busy "rolling back" the snapshots.
> But I've already declared, via my command line, that I DON'T CARE about the 
> contents of the filesystem!
> Why doesn't zfs simply do:
> 
> 1. unmount filesystem, if possible (it was possible)
> (1.5 possibly note "intent to delete" somewhere in the pool records)
> 2. zero out/free the in-kernel-memory in one go
> 3. update the pool, "hey I deleted the filesystem, all these blocks are now 
> clear"
> 
> 
> Having this kind of operation take more than even 10 seconds seems like a 
> huge bug to me, yet it can take many minutes. An order of magnitude off. Yuck.

Agree. Asynchronous destroy has been integrated into illumos. Look for it soon
in the distributions derived from illumos. For more information, see Chris
Siden's and Matt Ahrens's discussions of async destroy and ZFS feature flags at
the ZFS meetup in January 2012 here:
http://blog.delphix.com/ahl/2012/zfs10-illumos-meetup/

 -- richard

-- 

ZFS and performance consulting
http://www.RichardElling.com


[zfs-discuss] oddity of slow zfs destroy

2012-06-25 Thread Philip Brown

I ran into something odd today:

zfs destroy -r  random/filesystem

is mind-bogglingly slow, but it seems to me that it shouldn't be.
It's slow because the filesystem has two snapshots on it. Presumably, it's 
busy "rolling back" the snapshots.
But I've already declared, via my command line, that I DON'T CARE about the 
contents of the filesystem!

Why doesn't zfs simply do:

1. unmount filesystem, if possible (it was possible)
(1.5 possibly note "intent to delete" somewhere in the pool records)
2. zero out/free the in-kernel-memory in one go
3. update the pool, "hey I deleted the filesystem, all these blocks are now 
clear"



Having this kind of operation take more than even 10 seconds seems like a 
huge bug to me, yet it can take many minutes. An order of magnitude off. Yuck.




Re: [zfs-discuss] [developer] History of EPERM for unlink() of directories on ZFS?

2012-06-25 Thread Joerg Schilling
Eric Schrock  wrote:

> On Mon, Jun 25, 2012 at 11:19 AM,  wrote:
> >
> >
> > In the very beginning, mkdir(1) was a set-uid application; it used
> > "mknod" to make a directory and then created a link from
> >newdir to newdir/.
> > and from
> >"." to newdir/..
> >
>
> Interesting, guess you learn something new every day :-)
>
> http://minnie.tuhs.org/cgi-bin/utree.pl?file=V7/usr/src/cmd/mkdir.c

This was a nice way to become superuser in those days.

Just run a loop to make a directory in /tmp, and run another program that tries 
to remove the directory and replace it with a hardlink to /etc/passwd. Mkdir(1) 
then did a "chown  /etc/passwd"... We tried this and it took approx. 
30 minutes to become superuser this way.
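
A compilable sketch of roughly what that old set-uid mkdir(1) did (paraphrased
from memory, not the literal V7 code - see the mkdir.c link quoted above for
the real thing); the interesting part is the window between the privileged
mknod() and the chown():

#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
    const char *d = (argc > 1) ? argv[1] : "newdir";
    char dot[256], dotdot[256];

    /* Ran as root via set-uid.  On modern systems mknod() of a directory
     * simply fails (EPERM or EINVAL), which is part of why this scheme
     * is gone. */
    if (mknod(d, S_IFDIR | 0777, 0) < 0) {
        perror("mknod");
        return 1;
    }
    /* RACE WINDOW: another process can unlink the new directory here and
     * hard-link /etc/passwd into its place, so the privileged chown()
     * below hits /etc/passwd instead. */
    chown(d, getuid(), getgid());

    snprintf(dot, sizeof(dot), "%s/.", d);
    snprintf(dotdot, sizeof(dotdot), "%s/..", d);
    link(d, dot);          /* the "." entry */
    link(".", dotdot);     /* the ".." entry, pointing at the parent */
    return 0;
}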

BSD then introduced the syscall mkdir(2) to fix this, and this is why UFS was 
not designed to support link(2) on directories.

BTW: to implement mkdir(2), there was a new struct dirtemplate in the kernel 
with the following comment:

/*
 * A virgin directory (no blushing please).
 */
struct dirtemplate mastertemplate = {
        0, 12, 1, ".",
        0, DIRBLKSIZ - 12, 2, ".."
};


This was the first time Sun proved it has no sense of humor, as Sun removed 
that comment...

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily


Re: [zfs-discuss] [developer] History of EPERM for unlink() of directories on ZFS?

2012-06-25 Thread Eric Schrock
On Mon, Jun 25, 2012 at 11:19 AM,  wrote:
>
>
> In the very beginning, mkdir(1) was a set-uid application; it used
> "mknod" to make a directory and then created a link from
>newdir to newdir/.
> and from
>"." to newdir/..
>

Interesting, guess you learn something new every day :-)

http://minnie.tuhs.org/cgi-bin/utree.pl?file=V7/usr/src/cmd/mkdir.c

Thanks,

- Eric

-- 
Eric Schrock
Delphix
http://blog.delphix.com/eschrock

275 Middlefield Road, Suite 50
Menlo Park, CA 94025
http://www.delphix.com


Re: [zfs-discuss] [developer] History of EPERM for unlink() of directories on ZFS?

2012-06-25 Thread Joerg Schilling
Eric Schrock  wrote:

> The decision to not support link(2) of directories was very deliberate - it
> is an abomination that never should have been allowed in the first place.
> My guess is that the behavior of unlink(2) on directories is a direct
> side-effect of that (if link isn't supported, then why support unlink?).
> Also worth noting that ZFS also doesn't let you open(2) directories and
> read(2) from them, something (I believe) UFS does allow.

Link/unlink on directories is not an inherent property of UFS.

UFS was designed without that feature, but it was added by AT&T with 
SVr4.

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily


Re: [zfs-discuss] [developer] History of EPERM for unlink() of directories on ZFS?

2012-06-25 Thread Casper . Dik

>The decision to not support link(2) of directories was very deliberate - it
>is an abomination that never should have been allowed in the first place.
>My guess is that the behavior of unlink(2) on directories is a direct
>side-effect of that (if link isn't supported, then why support unlink?).
>Also worth noting that ZFS also doesn't let you open(2) directories and
>read(2) from them, something (I believe) UFS does allow.

In the very beginning, mkdir(1) was a set-uid application; it used
"mknod" to make a directory and then created a link from
newdir to newdir/.
and from
"." to newdir/..

Traditionally, this was only allowed for the superuser, and when
we added privileges, a special privilege was added for it.

I think we should remove it from the other filesystems.

Casper


Re: [zfs-discuss] [developer] History of EPERM for unlink() of directories on ZFS?

2012-06-25 Thread Eric Schrock
The decision to not support link(2) of directories was very deliberate - it
is an abomination that never should have been allowed in the first place.
My guess is that the behavior of unlink(2) on directories is a direct
side-effect of that (if link isn't supported, then why support unlink?).
Also worth noting that ZFS also doesn't let you open(2) directories and
read(2) from them, something (I believe) UFS does allow.
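
A quick way to check this on whatever filesystem you have handy (a throwaway
test program, not anything from the ZFS sources):

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
    char buf[512];
    int fd = open(".", O_RDONLY);      /* open the current directory */

    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* UFS reportedly lets this read() return raw directory entries;
     * ZFS refuses the read instead. */
    if (read(fd, buf, sizeof(buf)) < 0)
        printf("read(2) on a directory failed: %s\n", strerror(errno));
    else
        printf("read(2) on a directory returned raw entries\n");
    close(fd);
    return 0;
}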

- Eric

On Mon, Jun 25, 2012 at 10:40 AM, Garrett D'Amore wrote:

> I don't know the precise history, but I think it's a mistake to permit
> direct link() or unlink() of directories.  I do note that on BSD (MacOS at
> least) unlink returns EPERM if the executing user is not superuser.  I do
> see that the man page for unlink() says this on illumos:
>
> The  named   file   is   a   directory   and
> {PRIV_SYS_LINKDIR}  is  not  asserted in the
> effective set of the calling process, or the
> filesystem  implementation  does not support
> unlink() or unlinkat() on directories.
>
> I can't imagine why you'd *ever* want to support unlink() of a *directory*
> -- what's the use case for it anyway (outside of filesystem repair)?
>
> Garrett D'Amore
> garr...@damore.org
>
>
>
> On Jun 25, 2012, at 2:23 AM, Lionel Cons wrote:
>
> > Does someone know the history which led to the EPERM for unlink() of
> > directories on ZFS? Why was this done this way, and not something like
> > allowing the unlink and execute it on the next scrub or remount?
> >
> > Lionel



-- 
Eric Schrock
Delphix
http://blog.delphix.com/eschrock

275 Middlefield Road, Suite 50
Menlo Park, CA 94025
http://www.delphix.com


Re: [zfs-discuss] [developer] History of EPERM for unlink() of directories on ZFS?

2012-06-25 Thread Garrett D'Amore
I don't know the precise history, but I think it's a mistake to permit direct 
link() or unlink() of directories.  I do note that on BSD (MacOS at least) 
unlink returns EPERM if the executing user is not superuser.  I do see that the 
man page for unlink() says this on illumos:

 The  named   file   is   a   directory   and
 {PRIV_SYS_LINKDIR}  is  not  asserted in the
 effective set of the calling process, or the
 filesystem  implementation  does not support
 unlink() or unlinkat() on directories.

I can't imagine why you'd *ever* want to support unlink() of a *directory* -- 
what's the use case for it anyway (outside of filesystem repair)?

Garrett D'Amore
garr...@damore.org



On Jun 25, 2012, at 2:23 AM, Lionel Cons wrote:

> Does someone know the history which led to the EPERM for unlink() of
> directories on ZFS? Why was this done this way, and not something like
> allowing the unlink and execute it on the next scrub or remount?
> 
> Lionel



Re: [zfs-discuss] (fwd) Re: ZFS NFS service hanging on Sunday

2012-06-25 Thread Hung-Sheng Tsao (LaoTsao) Ph.D
In Solaris, ZFS caches many things, so you should have more RAM.
If you set up 18 GB of swap, IMHO, RAM should be higher than 4 GB.
Regards

Sent from my iPad

On Jun 25, 2012, at 5:58, tpc...@mklab.ph.rhul.ac.uk wrote:

>> 
>> 2012-06-14 19:11, tpc...@mklab.ph.rhul.ac.uk wrote:
 
>>>> In message <201206141413.q5eedvzq017...@mklab.ph.rhul.ac.uk>, 
>>>> tpc...@mklab.ph.rhul.ac.uk writes:
>>>>> Memory: 2048M phys mem, 32M free mem, 16G total swap, 16G free swap
>>>> My WAG is that your "zpool history" is hanging due to lack of
>>>> RAM.
>>> 
>>> Interesting.  In the problem state the system is usually quite responsive, 
>>> e.g. not memory thrashing.  Under Linux, which I'm more familiar with, 
>>> 'used memory' (= 'total memory' - 'free memory') refers to physical memory 
>>> used by the kernel for data caching (which is still available for processes 
>>> to allocate as needed) together with memory already allocated to processes, 
>>> as opposed to only physical memory already allocated and therefore really 
>>> 'used'.  Does this mean something different under Solaris ?
>> 
>> Well, it is roughly similar. In Solaris there is a general notion
> 
> [snipped]
> 
> Dear Jim,
>Thanks for the detailed explanation of ZFS memory usage.  Special 
> thanks also to John D Groenveld for the initial suggestion of a lack of RAM
> problem.  Since upping the RAM from 2GB to 4GB the machine has sailed through 
> the last two Sunday mornings w/o problem.  I was interested to
> subsequently discover the Solaris command 'echo ::memstat | mdb -k' which 
> reveals just how much memory ZFS can use.
> 
> Best regards
> Tom.
> 
> --
> Tom Crane, Dept. Physics, Royal Holloway, University of London, Egham Hill,
> Egham, Surrey, TW20 0EX, England.
> Email:  T.Crane@rhul dot ac dot uk


Re: [zfs-discuss] (fwd) Re: ZFS NFS service hanging on Sunday

2012-06-25 Thread TPCzfs
>
> 2012-06-14 19:11, tpc...@mklab.ph.rhul.ac.uk wrote:
> >>
> >> In message <201206141413.q5eedvzq017...@mklab.ph.rhul.ac.uk>, 
> >> tpc...@mklab.ph.r
> >> hul.ac.uk writes:
> >>> Memory: 2048M phys mem, 32M free mem, 16G total swap, 16G free swap
> >> My WAG is that your "zpool history" is hanging due to lack of
> >> RAM.
> >
> > Interesting.  In the problem state the system is usually quite responsive, 
> > e.g. not memory thrashing.  Under Linux, which I'm more familiar with, 
> > 'used memory' (= 'total memory' - 'free memory') refers to physical memory 
> > used by the kernel for data caching (which is still available for processes 
> > to allocate as needed) together with memory already allocated to processes, 
> > as opposed to only physical memory already allocated and therefore really 
> > 'used'.  Does this mean something different under Solaris ?
>
> Well, it is roughly similar. In Solaris there is a general notion

[snipped]

Dear Jim,
Thanks for the detailed explanation of ZFS memory usage.  Special 
thanks also to John D Groenveld for the initial suggestion of a lack of RAM
problem.  Since upping the RAM from 2GB to 4GB the machine has sailed through 
the last two Sunday mornings w/o problem.  I was interested to
subsequently discover the Solaris command 'echo ::memstat | mdb -k' which 
reveals just how much memory ZFS can use.

Best regards
Tom.

--
Tom Crane, Dept. Physics, Royal Holloway, University of London, Egham Hill,
Egham, Surrey, TW20 0EX, England.
Email:  T.Crane@rhul dot ac dot uk


Re: [zfs-discuss] History of EPERM for unlink() of directories on ZFS?

2012-06-25 Thread Casper . Dik

>Does someone know the history which led to the EPERM for unlink() of
>directories on ZFS? Why was this done this way, and not something like
>allowing the unlink and execute it on the next scrub or remount?


It's not about the unlink(), it's about the link() and unlink().
By not allowing link & unlink, you force the filesystem to contain only 
trees and not graphs.

Allowing them also lets you create directories where ".." points to a directory 
whose inode cannot be found, simply because it was just removed.

Support for link() on directories in UFS has always caused issues
and would create problems fsck couldn't fix.

To be honest, I think we should also remove this from all other
filesystems and I think ZFS was created this way because all modern
filesystems do it that way.
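
As a small illustration of the behaviour this thread is about (a throwaway
test, not anything from the ZFS sources), link(2) on a directory fails with
EPERM on ZFS and most other modern filesystems:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>

int
main(void)
{
    /* Create a scratch directory, then try to give it a second name. */
    if (mkdir("demo_dir", 0755) < 0 && errno != EEXIST) {
        perror("mkdir");
        return 1;
    }
    if (link("demo_dir", "demo_link") < 0)
        printf("link(2) on a directory failed: %s\n", strerror(errno));
    else
        printf("link(2) on a directory succeeded - a graph, not a tree\n");
    return 0;
}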

Casper



[zfs-discuss] History of EPERM for unlink() of directories on ZFS?

2012-06-25 Thread Lionel Cons
Does someone know the history which led to the EPERM for unlink() of
directories on ZFS? Why was this done this way, and not something like
allowing the unlink and execute it on the next scrub or remount?

Lionel