[zfs-discuss] Borked zpool is now invulnerable

2006-05-18 Thread Jeremy Teo

Hello,

while testing some code changes, I managed to fail an assertion while
doing a zfs create.

My zpool is now invulnerable to destruction. :(

bash-3.00# zpool destroy -f test_undo
internal error: unexpected error 0 at line 298 of ../common/libzfs_dataset.c

bash-3.00# zpool status
 pool: test_undo
state: ONLINE
scrub: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    test_undo   ONLINE       0     0     0
      c0d1s1    ONLINE       0     0     0
      c0d1s0    ONLINE       0     0     0

errors: No known data errors

How can I destroy this pool so I can use the disk for a new pool? Thanks! :)
--
Regards,
Jeremy
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Borked zpool is now invulnerable

2006-05-18 Thread Darren J Moffat

Jeremy Teo wrote:
> How can I destroy this pool so I can use the disk for a new pool?
> Thanks! :)


dd if=/dev/zero of=/dev/dsk/c0d1s1
dd if=/dev/zero of=/dev/dsk/c0d1s0

that should do it.
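
If you don't want to wait for the whole slice, zeroing just the areas
where ZFS keeps its vdev labels (roughly the first and last 512KB of
each device) should also do the trick - I haven't verified that on your
build, so the full wipe above is the safe bet:

dd if=/dev/zero of=/dev/dsk/c0d1s1 bs=1024k count=4
dd if=/dev/zero of=/dev/dsk/c0d1s0 bs=1024k count=4

and the same again with oseek= pointed at the last few MB of each slice.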

--
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Borked zpool is now invulnerable

2006-05-18 Thread Tim Foster
On Thu, 2006-05-18 at 22:05 +0800, Jeremy Teo wrote:
> My zpool is now invulnerable to destruction. :(

Nifty - does that mean your disk is also invulnerable to hardware errors
too ? [ as in, your typical superhero who gets endowed with special
abilities due to a failed radiation experiment  ;-) ]

> bash-3.00# zpool destroy -f test_undo
> internal error: unexpected error 0 at line 298 of ../common/libzfs_dataset.c

I'd suggest zpool exporting all other pools on the system, then reboot
into failsafe mode, mount your root partition somewhere (like /tmp/a)
then remove /tmp/a/etc/zfs/zpool.cache. Finally, reboot and create a new
pool on the disk which has now been made available, and then zpool
import the others.
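
In command form that's roughly the following (device names are just
examples - substitute your own, and double-check before you rm):

  # zpool export mypool2 mypool3      (any pools you want to keep)
    ... reboot, and pick "Solaris failsafe" from the grub menu, or
        "boot -F failsafe" from the ok prompt on SPARC ...
  # mount /dev/dsk/c0d0s0 /tmp/a      (your real root slice)
  # rm /tmp/a/etc/zfs/zpool.cache
  # reboot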

Hope that helps (all disclaimers about making backups of anything you
care about before doing this sort of thing apply)

Anyone know of a simpler way ?

cheers,
tim
-- 
Tim Foster, Sun Microsystems Inc, Operating Platforms Group
Engineering Operations          http://blogs.sun.com/timf

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Re: zfs snapshot for backup, Quota

2006-05-18 Thread Charlie
Sorry to revive such an old thread.. but I'm struggling here.

I really want to use zfs. Fssnap, SVM, etc all have drawbacks. But I work for a 
University, where everyone has a quota. I'd literally have to create > 10K 
partitions. Is that really your intention? Of course, backups become a huge 
pain now. Even with the scripted idea below, that's cumbersome for both backups 
and (especially) restores.

Why can't we just have user quotas in zfs? :)

Respectfully,
-Charlie
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: zfs snapshot for backup, Quota

2006-05-18 Thread Eric Schrock
On Thu, May 18, 2006 at 11:42:58AM -0700, Charlie wrote:
> Sorry to revive such an old thread.. but I'm struggling here.
> 
> I really want to use zfs. Fssnap, SVM, etc all have drawbacks. But I
> work for a University, where everyone has a quota. I'd literally have
> to create > 10K partitions. Is that really your intention?

Yes.  You'd group them all under a single filesystem in the hierarchy,
allowing you to manage NFS share options, compression, and more from a
single control point.
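
For example, something along these lines (names made up), with the
per-user filesystems inheriting the settings from the parent:

  # zfs create tank/home
  # zfs set sharenfs=rw tank/home
  # zfs set compression=on tank/home
  # zfs create tank/home/alice
  # zfs set quota=512m tank/home/alice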

> Of course, backups become a huge pain now.  below, that's cumbersome
> for both backups and (especially) restores.

Using traditional tools or ZFS send/receive? We are working on RFEs for
recursive snapshots, send, and recv, as well as preserving DSL
properties as part of a 'send', which should make backups of large
filesystem hierarchies much simpler.

> Why can't we just have user quotas in zfs? :)

The fact that per-user quotas exist is really a historical artifact.
With traditional filesystems, it is (effectively) impossible to have a
filesystem per user.  The filesystem is a logical administrative control
point, allowing you to view usage, control properties, perform backups,
take snapshots etc.  For home directory servers, you really want to do
these operations per-user, so logically you'd want to equate the two
(filesystem = user).  Per-user quotas (the most common use of quotas,
but not the only one) were introduced because multiple users had to
share the same filesystem.

ZFS quotas are intentionally not associated with a particular user
because a) it's the logical extension of "filesystems as control point",
b) it's vastly simpler to implement and, most importantly, c) separates
implementation from administrative policy.  ZFS quotas can be set on
filesystems which may represent projects, groups, or any other
abstraction, as well as on entire portions of the hierarchy. This allows
them to be combined in ways that traditional per-user quotas cannot.
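
Quotas also compose up and down the tree, so you can cap a whole subtree
and individual datasets underneath it at the same time, e.g.
(hypothetical names):

  # zfs create tank/projects
  # zfs create tank/projects/genomics
  # zfs set quota=500g tank/projects
  # zfs set quota=100g tank/projects/genomics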

Hope that helps,

- Eric

--
Eric Schrock, Solaris Kernel Development   http://blogs.sun.com/eschrock
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Re: zfs snapshot for backup, Quota

2006-05-18 Thread Frank Fejes
> Why can't we just have user quotas in zfs? :)

+1 to that.  I support a couple environments with group/user quotas that cannot 
move to ZFS since they serve brain-dead apps that read/write from a single 
directory.

I also agree that using even a few hundred mountpoints is more tedious than 
using quotas, but I can get used to that...I just won't use df as often. :)
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Re: zfs snapshot for backup, Quota

2006-05-18 Thread Charlie
Eric Schrock wrote:
> On Thu, May 18, 2006 at 11:42:58AM -0700, Charlie wrote:
>> to create > 10K partitions. Is that really your intention?
>
> Yes.  You'd group them all under a single filesystem in the hierarchy,
> allowing you to manage NFS share options, compression, and more from a
> single control point.

This isn't so bad. I'm going to assume that mounting 10K partitions at
boot doesn't take forever.  :) 

>> Of course, backups become a huge pain now.  ... that's cumbersome
>> for both backups and (especially) restores.
>
> Using traditional tools or ZFS send/receive?

Traditional (amanda). I'm not seeing a way to dump zfs file systems to
tape without resorting to 'zfs send' being piped through gtar or
something. Even then, the only thing I could restore was an entire file
system. (We frequently restore single files for users...)

Perhaps, since zfs isn't limited to one snapshot per FS like fssnap is,
I should be redesigning everything. It sounds like I should look at
using many snapshots, and dumping to tape (each file system, somehow)
less frequently.

Waiting for S10_U2 now  :) 

> Hope that helps,
>
> - Eric

It does. Thanks!

-Charlie
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: zfs snapshot for backup, Quota

2006-05-18 Thread Gregory Shaw
On Thu, 2006-05-18 at 12:12 -0700, Eric Schrock wrote:
> On Thu, May 18, 2006 at 11:42:58AM -0700, Charlie wrote:
> > Sorry to revive such an old thread.. but I'm struggling here.
> > 
> > I really want to use zfs. Fssnap, SVM, etc all have drawbacks. But I
> > work for a University, where everyone has a quota. I'd literally have
> > to create > 10K partitions. Is that really your intention?
> 
> Yes.  You'd group them all under a single filesystem in the hierarchy,
> allowing you to manage NFS share options, compression, and more from a
> single control point.
> 

I'd agree except for backups.  If the pools are going to grow beyond a
reasonable-to-backup and reasonable-to-restore threshold (measured by
the backup window), it would be practical to break it into smaller
pools.

After all, you'll probably have to restore a pool eventually.  If that
will take a week, your users won't be very happy with your solution.

> > Of course, backups become a huge pain now.  below, that's cumbersome
> > for both backups and (especially) restores.
> 
> Using traditional tools or ZFS send/receive? We are working on RFEs for
> recursive snapshots, send, and recv, as well as preserving DSL
> properties as part of a 'send', which should make backups of large
> filesystem hierarchies much simpler.
> 

Using EBS or NetBackup, can I get a single file back from tape only
through the backup system?  That's a big factor for production
environments.  Also, when users request a restore from tape from offsite
backups, they'll usually specify a date range for when the file was
'good'.  To accomplish that, you need to use the backup solution to find
the requisite file.  These 'fishing expeditions' (as I call them) can
take a lot of time if direct access isn't available via the backup tool.

I believe you're referring in the above to using zfs send/recv for
backup to tape.   Until the vendors work with zfs send/recv, it's not a
viable option for filesystem backups in a production environment.

Related to that, does anybody have a timeframe for direct support for
ZFS send/recv (or something similar) in NBU or EBS?

[ quota explanation deleted for brevity ]
> 
> --
> Eric Schrock, Solaris Kernel Development   http://blogs.sun.com/eschrock
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: zfs snapshot for backup, Quota

2006-05-18 Thread Nicolas Williams
On Thu, May 18, 2006 at 02:23:55PM -0600, Gregory Shaw wrote:
> I'd agree except for backups.  If the pools are going to grow beyond a
> reasonable-to-backup and reasonable-to-restore threshold (measured by
> the backup window), it would be practical to break it into smaller
> pools.

Speaking of backups, and particularly when we get to recursive ones, I'd
like control over the filesystem and snapshot names being restored, as
well as control over which snapshots should appear, and overrides for
properties (considering the RFE to have property setting on zfs create).

(Yes, I know one can currently specify the fs name to restore.)

Also, the recursive backup output will need a table of contents and a
tool to list it.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: zfs snapshot for backup, Quota

2006-05-18 Thread Matthew Ahrens
On Thu, May 18, 2006 at 12:46:28PM -0700, Charlie wrote:
> Traditional (amanda). I'm not seeing a way to dump zfs file systems to
> tape without resorting to 'zfs send' being piped through gtar or
> something. Even then, the only thing I could restore was an entire file
> system. (We frequently restore single files for users...)
> 
> Perhaps, since zfs isn't limited to one snapshot per FS like fssnap is,
> I should be redesigning everything. It sounds like I should look at
> using many snapshots, and dumping to tape (each file system, somehow)
> less frequently.

That's right.  With ZFS, there should never be a need to go to tape to
recover an accidentally deleted file, because it's easy[*] to keep lots
of snapshots around.

[*] Well, modulo 6373978 "want to take lots of snapshots quickly ('zfs
snapshot -r')".  I'm working on that...
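
In the meantime a shell loop over 'zfs list' gets you most of the way
there; a rough sketch, with made-up dataset names (and note it isn't
atomic across filesystems, which is what the CR above is about):

  for fs in `zfs list -H -o name -t filesystem -r tank/home`; do
          zfs snapshot $fs@`date +%Y%m%d`
  done

Pruning old snapshots is a similar loop over 'zfs list -t snapshot'.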

--matt
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: zfs snapshot for backup, Quota

2006-05-18 Thread James Dickens

On 5/18/06, Gregory Shaw <[EMAIL PROTECTED]> wrote:

> On Thu, 2006-05-18 at 12:12 -0700, Eric Schrock wrote:
> > On Thu, May 18, 2006 at 11:42:58AM -0700, Charlie wrote:
> > > Sorry to revive such an old thread.. but I'm struggling here.
> > >
> > > I really want to use zfs. Fssnap, SVM, etc all have drawbacks. But I
> > > work for a University, where everyone has a quota. I'd literally have
> > > to create > 10K partitions. Is that really your intention?
> >
> > Yes.  You'd group them all under a single filesystem in the hierarchy,
> > allowing you to manage NFS share options, compression, and more from a
> > single control point.
> >
>
> I'd agree except for backups.  If the pools are going to grow beyond a
> reasonable-to-backup and reasonable-to-restore threshold (measured by
> the backup window), it would be practical to break it into smaller
> pools.
>
> After all, you'll probably have to restore a pool eventually.  If that
> will take a week, your users won't be very happy with your solution.
>
> > > Of course, backups become a huge pain now.  ... that's cumbersome
> > > for both backups and (especially) restores.
> >
> > Using traditional tools or ZFS send/receive? We are working on RFEs for
> > recursive snapshots, send, and recv, as well as preserving DSL
> > properties as part of a 'send', which should make backups of large
> > filesystem hierarchies much simpler.
> >
>
> Using EBS or NetBackup, can I get a single file back from tape only
> through the backup system?  That's a big factor for production
> environments.  Also, when users request a restore from tape from offsite
> backups, they'll usually specify a date range for when the file was
> 'good'.  To accomplish that, you need to use the backup solution to find
> the requisite file.  These 'fishing expeditions' (as I call them) can
> take a lot of time if direct access isn't available via the backup tool.


ZFS basically eliminates the need for single-file restores, because it
has snapshots: the user gets almost instant access to old copies of
files, which is a lot quicker than even the fastest tape library. Just
take daily snapshots and the need to restore a single file from tape is
almost completely eliminated. You can still use NetBackup for
disasters, but for access to a single old file a snapshot is much
easier.

You can also make it possible for users to initiate their own
snapshots when they feel the need arises.
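
For example, with a filesystem per user, a user can usually fish an old
copy out of the snapshot directory without any admin involvement
(snapshot names here are made up):

  $ cd ~/.zfs/snapshot/20060517
  $ cp lost-report.txt ~/

Setting snapdir=visible on the filesystem makes .zfs show up in a plain
ls, if that helps your users find it.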

James Dickens
uadmin.blogspot.com




> I believe you're referring in the above to using zfs send/recv for
> backup to tape.   Until the vendors work with zfs send/recv, it's not a
> viable option for filesystem backups in a production environment.
>
> Related to that, does anybody have a timeframe for direct support for
> ZFS send/recv (or something similar) in NBU or EBS?
>
> [ quota explanation deleted for brevity ]
> >
> > --
> > Eric Schrock, Solaris Kernel Development   http://blogs.sun.com/eschrock
> > ___
> > zfs-discuss mailing list
> > zfs-discuss@opensolaris.org
> > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: zfs snapshot for backup, Quota

2006-05-18 Thread Bill Moore
On Thu, May 18, 2006 at 12:46:28PM -0700, Charlie wrote:
> Eric Schrock wrote:
> > > Using traditional tools or ZFS send/receive?
> 
> Traditional (amanda). I'm not seeing a way to dump zfs file systems to
> tape without resorting to 'zfs send' being piped through gtar or
> something. Even then, the only thing I could restore was an entire file
> system. (We frequently restore single files for users...)

Remember, ZFS is a fully POSIX-compliant filesystem.  Any backup program
that uses system calls to do its work will still function properly.  Why
would you believe that your backup program doesn't work with ZFS?  Have
you actually tried it?  If it doesn't work, that's a big bug for us.

> Perhaps, since zfs isn't limited to one snapshot per FS like fssnap is,
> I should be redesigning everything. It sounds like I should look at
> using many snapshots, and dumping to tape (each file system, somehow)
> less frequently.

That's definitely an option.  You can also tell your backup program to
not stop at filesystem boundaries so you can do entire trees of your
namespace at once.
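
If you want the on-tape image to be consistent, you can also point the
backup program at a snapshot instead of the live tree; roughly (pool
and device names are examples):

  # zfs snapshot tank/home@nightly
  # gtar cf /dev/rmt/0 /tank/home/.zfs/snapshot/nightly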


--Bill
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Re: zfs snapshot for backup, Quota

2006-05-18 Thread Charlie
Bill Moore wrote:
> On Thu, May 18, 2006 at 12:46:28PM -0700, Charlie wrote:
>> Eric Schrock wrote:
>>> Using traditional tools or ZFS send/receive?
>> Traditional (amanda). I'm not seeing a way to dump zfs file systems to
>> tape without resorting to 'zfs send' being piped through gtar or
>> something. Even then, the only thing I could restore was an entire file
>> system. (We frequently restore single files for users...)
> 
> Remember, ZFS is a fully POSIX-compliant filesystem.  Any backup program
> that uses system calls to do its work will still function properly.  Why
> would you believe that your backup program doesn't work with ZFS?  Have
> you actually tried it?  If it doesn't work, that's a big bug for us.

Of course, using system calls isn't an issue. Most backup systems function at a
higher level than read(), however. :)
I was thinking about amanda specifically, and I'd need a zfsdump to do that. The
result is this: if I want incrementals, I must tell amanda to use tar, whereas
using 'dump' is preferred for many reasons.

And 'zfs send' is neat, but only mildly useful.

-Charlie
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: zfs snapshot for backup, Quota

2006-05-18 Thread Gregory Shaw

On Thu, 2006-05-18 at 16:43 -0500, James Dickens wrote:
> On 5/18/06, Gregory Shaw <[EMAIL PROTECTED]> wrote:
> > On Thu, 2006-05-18 at 12:12 -0700, Eric Schrock wrote:
> > > On Thu, May 18, 2006 at 11:42:58AM -0700, Charlie wrote:
> > > > Sorry to revive such an old thread.. but I'm struggling here.
> > > >
> > > > I really want to use zfs. Fssnap, SVM, etc all have drawbacks. But I
> > > > work for a University, where everyone has a quota. I'd literally have
> > > > to create > 10K partitions. Is that really your intention?
> > >
> > > Yes.  You'd group them all under a single filesystem in the hierarchy,
> > > allowing you to manage NFS share options, compression, and more from a
> > > single control point.
> > >
> >
> > I'd agree except for backups.  If the pools are going to grow beyond a
> > reasonable-to-backup and reasonable-to-restore threshold (measured by
> > the backup window), it would be practical to break it into smaller
> > pools.
> >
> > After all, you'll probably have to restore a pool eventually.  If that
> > will take a week, your users won't be very happy with your solution.
> >
> > > > Of course, backups become a huge pain now.  below, that's cumbersome
> > > > for both backups and (especially) restores.
> > >
> > > Using traditional tools or ZFS send/receive? We are working on RFEs for
> > > recursive snapshots, send, and recv, as well as preserving DSL
> > > properties as part of a 'send', which should make backups of large
> > > filesystem hierarchies much simpler.
> > >
> >
> > Using EBS or NetBackup, can I get a single file back from tape only
> > through the backup system?  That's a big factor for production
> > environments.  Also, when users request a restore from tape from offsite
> > backups, they'll usually specify a date range for when the file was
> > 'good'.  To accomplish that, you need to use the backup solution to find
> > the requisite file.  These 'fishing expeditions' (as I call them) can
> > take a lot of time if direct access isn't available via the backup tool.
> 
> ZFS basically eliminates the need for single-file restores, because it
> has snapshots: the user gets almost instant access to old copies of
> files, which is a lot quicker than even the fastest tape library. Just
> take daily snapshots and the need to restore a single file from tape is
> almost completely eliminated. You can still use NetBackup for
> disasters, but for access to a single old file a snapshot is much
> easier.
> 
> You can also make it possible for users to initiate their own
> snapshots when they feel the need arises.
> 
> James Dickens
> uadmin.blogspot.com
> 

The above would be fine for testing.   However, on an active filesystem
that is more than 50% full, you'll find that large amounts of space will
be used by the snapshots.

We currently use a pair of BlueArc Titan fileserver appliances.
They have very similar snapshot functionality.  Currently, we can't
keep more than about 3 days' worth of snapshots (taken every 4 hours)
due to space constraints.

For filesystems that don't move much, 1 snapshot per day for a year may
be practical.  I doubt it, as snapshots have to be managed, and
maintaining 365 snapshots per filesystem (not pool) will be very
difficult.

[ stuff deleted ]

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: zfs snapshot for backup, Quota

2006-05-18 Thread Erik Trimble

On the topic of ZFS snapshots:

does the snapshot just capture the changed _blocks_, or does it 
effectively copy the entire file if any block has changed?


That is, assuming that the snapshot (destination) stays inside the same 
pool space.


-Erik
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: zfs snapshot for backup, Quota

2006-05-18 Thread Nicolas Williams
On Thu, May 18, 2006 at 03:41:13PM -0700, Erik Trimble wrote:
> On the topic of ZFS snapshots:
> 
> does the snapshot just capture the changed _blocks_, or does it 
> effectively copy the entire file if any block has changed?

Incremental sends capture changed blocks.

Snapshots capture all of the FS state as of the time the snapshot is
taken, and they do so in constant time.  Subsequent changes are kept
as changed blocks, i.e. as deltas to the snapshot, in the filesystem
and its clones.
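
You can see this in the space accounting, e.g.:

  # zfs list -o name,used,referenced -t snapshot

where a snapshot's "used" counts only the blocks held exclusively by
that snapshot (blocks that have since been changed or freed in the live
filesystem), while "referenced" is everything it points at.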

> That is, assuming that the snapshot (destination) stays inside the same 
> pool space.

Of course it does.  Er, what do you mean by 'destination'?
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: zfs snapshot for backup, Quota

2006-05-18 Thread Nathan Kroenert
Just piqued my interest on this one - 

How would we enforce quotas of sorts in large filesystems that are
shared? I can see times when I might want lots of users to use the same
directory (and thus, same filesystem) but still want to limit the amount
of space each user can consume.

Thoughts?

Nathan. :)

On Fri, 2006-05-19 at 05:12, Eric Schrock wrote:
> On Thu, May 18, 2006 at 11:42:58AM -0700, Charlie wrote:
> > Sorry to revive such an old thread.. but I'm struggling here.
> > 
> > I really want to use zfs. Fssnap, SVM, etc all have drawbacks. But I
> > work for a University, where everyone has a quota. I'd literally have
> > to create > 10K partitions. Is that really your intention?
> 
> Yes.  You'd group them all under a single filesystem in the hierarchy,
> allowing you to manage NFS share options, compression, and more from a
> single control point.
> 
> > Of course, backups become a huge pain now.  below, that's cumbersome
> > for both backups and (especially) restores.
> 
> Using traditional tools or ZFS send/receive? We are working on RFEs for
> recursive snapshots, send, and recv, as well as preserving DSL
> properties as part of a 'send', which should make backups of large
> filesystem hierarchies much simpler.
> 
> > Why can't we just have user quotas in zfs? :)
> 
> The fact that per-user quotas exist is really a historical artifact.
> With traditional filesystems, it is (effectively) impossible to have a
> filesystem per user.  The filesystem is a logical administrative control
> point, allowing you to view usage, control properties, perform backups,
> take snapshots etc.  For home directory servers, you really want to do
> these operations per-user, so logically you'd want to equate the two
> (filesystem = user).  Per-user quotas (the most common use of quotas,
> but not the only one) were introduced because multiple users had to
> share the same filesystem.
> 
> ZFS quotas are intentionally not associated with a particular user
> because a) it's the logical extension of "filesystems as control point",
> b) it's vastly simpler to implement and, most importantly, c) separates
> implementation from administrative policy.  ZFS quotas can be set on
> filesystems which may represent projects, groups, or any other
> abstraction, as well as on entire portions of the hierarchy. This allows
> them to be combined in ways that traditional per-user quotas cannot.
> 
> Hope that helps,
> 
> - Eric
> 
> --
> Eric Schrock, Solaris Kernel Development   http://blogs.sun.com/eschrock
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
-- 
//
// Nathan Kroenert [EMAIL PROTECTED]  //
// PTS EngineerPhone:  +61 2 9844-5235  //
// Sun ServicesDirect Ext:  x57235  //
// Level 2, 828 Pacific HwyFax:+61 2 9844-5311  //
// Gordon2072  New South Wales   Australia  //
//

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re[7]: [zfs-discuss] Re: Re: Due to 128KB limit in ZFS it can't saturate disks

2006-05-18 Thread Robert Milkowski
Hello Roch,

Monday, May 15, 2006, 3:23:14 PM, you wrote:

RBPE> The question put forth is whether the ZFS 128K blocksize is sufficient
RBPE> to saturate a regular disk. There is a great body of evidence showing
RBPE> that bigger write sizes and a matching large FS cluster size lead
RBPE> to more throughput. The counterpoint is that ZFS schedules its I/O
RBPE> like nothing else seen before and manages to saturate a single disk
RBPE> using enough concurrent 128K I/Os.

Nevertheless I get much more throughput using UFS and writing with
large blocks than using ZFS on the same disk. And the difference is
actually quite big in favor of UFS.


RBPE>  at places. So I am proposing this for review by the community>

RBPE> I first measured the throughput of a write(2) to a raw device using,
RBPE> for instance, this:

RBPE> dd if=/dev/zero of=/dev/rdsk/c1t1d0s0 bs=8192k count=1024

RBPE> On Solaris we would see some overhead from reading the block from
RBPE> /dev/zero and then issuing the write call.  The tightest function that
RBPE> fences the I/O is default_physio().  That function will issue the I/O to
RBPE> the device and then wait for it to complete.  If we take the elapsed time
RBPE> spent in this function and count the bytes that are I/O-ed, this
RBPE> should give a good hint as to the throughput the device is
RBPE> providing.  The above dd command will issue a single I/O at a time
RBPE> (d-script to measure is attached).

RBPE> Trying different blocksizes I see:

RBPE>    (Bytes sent; elapse of phys IO; avg I/O size; throughput)
RBPE>
RBPE>    8 MB;   3576 ms of phys; avg sz : 16 KB;   throughput 2 MB/s
RBPE>    9 MB;   1861 ms of phys; avg sz : 32 KB;   throughput 4 MB/s
RBPE>    31 MB;  3450 ms of phys; avg sz : 64 KB;   throughput 8 MB/s
RBPE>    78 MB;  4932 ms of phys; avg sz : 128 KB;  throughput 15 MB/s
RBPE>    124 MB; 4903 ms of phys; avg sz : 256 KB;  throughput 25 MB/s
RBPE>    178 MB; 4868 ms of phys; avg sz : 512 KB;  throughput 36 MB/s
RBPE>    226 MB; 4824 ms of phys; avg sz : 1024 KB; throughput 46 MB/s
RBPE>    226 MB; 4816 ms of phys; avg sz : 2048 KB; throughput 46 MB/s
RBPE>    32 MB;  686 ms of phys;  avg sz : 4096 KB; throughput 46 MB/s
RBPE>    224 MB; 4741 ms of phys; avg sz : 8192 KB; throughput 47 MB/s

Just to be sure - did you reconfigure the system to actually allow
larger I/O sizes?
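
For reference, allowing bigger physical I/Os on Solaris usually means
bumping a few tunables in /etc/system and rebooting; the names below
are the usual ones, but check the right ones for your driver and
release:

  set maxphys=8388608
  set sd:sd_max_xfer_size=0x800000
  set ssd:ssd_max_xfer_size=0x800000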

RBPE> Now let's see what ZFS gets.  I measure using a single dd process.  ZFS
RBPE> will chunk up data in 128K blocks.  Now the dd command interacts with
RBPE> memory, but the I/Os are scheduled under the control of spa_sync().  So
RBPE> in the d-script (attached) I check for the start of an spa_sync and
RBPE> time it based on elapsed time.  At the same time I gather the number of
RBPE> bytes and keep a count of the I/Os (bdev_strategy) that are being issued.
RBPE> When the spa_sync completes we are sure that all those are on stable
RBPE> storage.  The script is a bit more complex because there are 2 threads
RBPE> that issue spa_sync, but only one of them actually becomes
RBPE> activated.  So the script will print out some spurious lines of output
RBPE> at times.  I measure I/O with the script while this runs:


RBPE> dd if=/dev/zero of=/zfs2/roch/f1 bs=1024k count=8000

RBPE> And I see:

RBPE>1431 MB; 23723 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
RBPE>1387 MB; 23044 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
RBPE>2680 MB; 44209 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
RBPE>1359 MB; 24223 ms of spa_sync; avg sz : 127 KB; throughput 56 MB/s
RBPE>1143 MB; 19183 ms of spa_sync; avg sz : 126 KB; throughput 59 MB/s


RBPE> OK, I cheated.  Here, ZFS is given a full disk to play with.  In this
RBPE> case ZFS enables the write cache.  Note that even with the write cache
RBPE> enabled, when the spa_sync() completes, it will be after a flush of
RBPE> the cache has been executed.  So the 60 MB/sec does correspond to data
RBPE> sent to the platter.  I just tried disabling the cache (with format -e)
RBPE> but I am not sure if that is taken into account by ZFS; results are the
RBPE> same 60 MB/sec.  This will have to be confirmed.

RBPE> With write cache enabled, the physio test reaches 66 MB/s as soon as
RBPE> we are issuing 16KB I/Os.  Here, clearly though, data is not on the
RBPE> platter when the timed function completes.

RBPE> Another variable not fully controlled is the physical (cylinder)
RBPE> location of the I/Os.  It could be that some of the differences come
RBPE> from that.

RBPE> What do I take away ?

RBPE> a single 2MB physical I/O will get 46 MB/sec out of my disk.

RBPE> 35 concurrent 128K I/Os sustained, followed by metadata I/O,
RBPE> followed by a flush of the write cache, allow ZFS to get 60
RBPE> MB/sec out of the same disk.


RBPE> This is what underwrites my belief that the 128K blocksize is
RBPE> sufficiently large. Now, nothing here proves [...]
Re: [zfs-discuss] Re: Re: zfs snapshot for backup, Quota

2006-05-18 Thread Richard Elling
On Fri, 2006-05-19 at 10:18 +1000, Nathan Kroenert wrote:
> Just piqued my interest on this one - 
> 
> How would we enforce quotas of sorts in large filesystems that are
> shared? I can see times when I might want lots of users to use the same
> directory (and thus, same filesystem) but still want to limit the amount
> of space each user can consume.
> 
> Thoughts?

rats.

> Nathan. :)

OK :-)
I've been wondering if I should mention this here, but I went ahead
and blogged about it anyway.
http://blogs.sun.com/roller/page/relling?entry=i_m_tired_of_owning

Anyone who is really clever will easily get past a quota, especially
at a university -- triple that probability for an engineering college.

What it really boils down to is 2 things:
1. denial of service -- how to protect others from disk-hogs
2. contractual obligations -- how to charge the government (in
   the US anyway) for space used for government sponsored
   research... and pass the audit.

A few years ago there was a 3rd thing:
3. how to pay for the disk space.

Today, disk space is cheap.  Really.  All of the current college
students I know carry around USB flash drives with all of their
stuff on it.  And iPods.  If I were a college student, why would
I risk my stuff being stored on the campus servers where "the man"
might want to go snooping?  Or, if you don't really care, use
flickr, myspace, godaddy, gmail, or some other such storage service.
Storage space really is becoming inexpensive.

I'm not sure anybody can fix #2, but #1 can be accomplished within
reason without resorting to user quotas.
 -- richard


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS recovery from a disk losing power

2006-05-18 Thread Sanjay Nadkarni


Since it's not exactly clear what you did with SVM I am assuming the 
following:


You had a file system on top of the mirror and there was some I/O
occurring to the mirror.  The *only* time SVM puts a device into
maintenance is when we receive an EIO from the underlying device.  So,
if a write occurred to the mirror, the write to the powered-off side
failed (returned an EIO) and SVM kept going.  Since all buffers
sent to sd/ssd are marked with B_FAILFAST, the driver timeouts are low
and the device is put into maintenance.


If I understand Eric correctly, ZFS attempts to see if the device is
really gone.  However I am not quite sure what Eric means by:

> We currently only detect device failure when the device "goes away".

Perhaps the issue here is that ldi_open succeeds when it shouldn't,
which confuses ZFS.


Another way to check is to perform the same test without any I/O
occurring to the file system, and then run metastat -i (as root).  This
is similar to a scrub for the volumes.


-Sanjay




Richard Elling wrote:

> On Tue, 2006-05-16 at 10:32 -0700, Eric Schrock wrote:
>> On Wed, May 17, 2006 at 03:22:34AM +1000, grant beattie wrote:
>>> what I find interesting is that the SCSI errors were continuous for 10
>>> minutes before I detached it, ZFS wasn't backing off at all. it was
>>> flooding the VGA console quicker than the console could print it all
>>> :) from what you said above, once per minute would have been more
>>> desirable.
>>
>> The "once per minute" is related to the frequency at which ZFS tries to
>> reopen the device.  Regardless, ZFS will try to issue I/O to the device
>> whenever asked.  If you believe the device is completely broken, the
>> correct procedure (as documented in the ZFS Administration Guide), is to
>> 'zpool offline' the device until you are able to repair it.
>>
>>> I wonder why, given that ZFS knew there was a problem with this disk,
>>> that it wasn't marked FAULTED and the pool DEGRADED?
>>
>> This is the future enhancement that I described below.  We need more
>> sophisticated analysis than simply 'N errors = FAULTED', and that's what
>> FMA provides.  It will allow us to interact with larger fault management
>> (such as correlating SCSI errors, identifying controller failure, and
>> more).  ZFS is intentionally dumb.  Each subsystem is responsible for
>> reporting errors, but coordinated fault diagnosis has to happen at a
>> higher level.
>
> [reason #8752, why pulling disk drives doesn't simulate real failures]
> There are also a number of cases where successful or
> unsuccessful+retryable error codes carry the recommendation to replace
> the drive.  There really isn't a clean way to write such diagnosis
> engines into the various file systems, LVMs, or databases which might
> use disk drives.  Putting that intelligence into an FMA DE and tying
> that into file systems or LVMs is the best way to do this.
> -- richard
>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS recovery from a disk losing power

2006-05-18 Thread grant beattie
On Thu, May 18, 2006 at 11:40:53PM -0600, Sanjay Nadkarni wrote:

> Since it's not exactly clear what you did with SVM I am assuming the 
> following:
> 
> You had a file system on top of the mirror and there was some I/O 
> occurring to the mirror.  The *only* time, SVM puts a device into 
> maintenance is when we receive an EIO from the underlying device.  So, 
> in case a write occurred to the mirror, then the write to the powered 
> off side failed (returned an EIO) and SVM kept going.  Since all buffers 
> sent to sd/ssd are marked with B_FAILFAST, the driver timeouts are low 
> and the device is put into maintenance.

the test was the same in both the SVM and the ZFS case. constant reads
from the mirror device, and unplugging the power. the read throughput
during this test with ZFS drops to around 20% until the device is
manually removed from the pool, after which point it returns to normal.

> If I understand Eric correctly, ZFS attempts to see if the device is 
> really gone.  However I am not quite sure what Eric means by:
> 
> >We currently only detect device failure when the device "goes away". 
> 
> Perhaps the issue here is that ldi_open succeeds when it shouldn't,
> which confuses ZFS.

yes, that seems to be the case. it appears to be caused by the way the
aac card deals with the disk going away - it offlines the disk, and the
LUN is still presented, but it now has zero length.

also, after a disk is offlined by the card, there does not seem to be
a way to tell the card to rescan the bus, so it requires a reboot
(though there is nothing that ZFS can do which would fix that). I
believe it can be done with the "aaccli" program provided by Adaptec,
but that doesn't work with the Solaris-provided aac driver.

> Another way to check is perform the same test, without any I/O 
> occurring to the file system.   Then run metastat -i (as root).  This is 
> similar to scrub for the volumes. 

with no IO activity on the mirror, metastat -i does not detect that
anything is wrong.

with IO activity, SVM offlines the metadevice when it gets a fatal
error from the device.

grant.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss