Re: [zfs-discuss] metadata inconsistency?

2006-07-17 Thread Matthew Ahrens
On Thu, Jul 06, 2006 at 12:46:57AM -0700, Patrick Mauritz wrote:
> Hi,
> after some unscheduled reboots (to put it lightly), I've got an interesting 
> setup on my notebook's zfs partition:
> setup: simple zpool, no raid or mirror, a couple of zfs partitions, one zvol 
> for swap. /foo is one such partition, /foo/bar the directory with the issue.
> 
> directly after the reboot happened:
> $ ls /foo/bar
> test.h
> $ ls -l /foo/bar
> Total 0
> 
> the file wasn't accessible with cat, etc.

This can happen when the file appears in the directory listing (ie.
getdents(2)), but a stat(2) on the file fails.  Why that stat would fail
is a bit of a mystery, given that ls doesn't report the error.

It could be that the underlying hardware has failed, and the directory
is still intact but the file's metadata has been damaged.  (Note, this
would be hardware error, not metadata inconsistency.)

Another possibility is that the file's "inode number" is too large to be
expressed in 32 bits, thus causing a 32-bit stat() to fail.  However,
I don't think that Sun's ls(1) should be issuing any 32-bit stats (even
on a 32-bit system, it should be using stat64).
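
One way to check the 32-bit theory from the shell, now that the file is visible
again (a sketch; the path is the one from your report, and the syscall names are
the usual Solaris ones):

$ ls -li /foo/bar
  (an object number larger than 4294967295 cannot be returned by a 32-bit stat())
$ truss -t stat,lstat,stat64,lstat64 ls -l /foo/bar
  (shows which stat variant ls actually issues, and whether any of them fails)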

> somewhat later (new data appeared on /foo, in /foo/baz):
> $ ls -l /foo/bar
> Total 3
> -rw-r--r-- 1 user group 1400 Jul 6 02:14 test.h
> 
> the content of test.h is the same as the content of /foo/baz/quux now,
> but the refcount is 1!
> 
> $ chmod go-r /foo/baz/quux
> $ ls -l /foo/bar
> Total 3
> -rw--- 1 user group 1400 Jul 6 02:14 test.h

This behavior could also be explained if there is an unknown bug which
causes the object representing the file to be deleted, but not the
directory entry pointing to it.

> anyway, how do I get rid of test.h now without making quux unreadable?
> (the brute force approach would be a new partition, moving data over
> with copying - instead of moving - the troublesome file, just in case
> - not sure if zfs allows for links that cross zfs partitions and thus
> optimizes such moves, then zfs destroy data/test, but there might be a
> better way?)

Before trying to rectify the problem, could you email me the output of
'zpool status' and 'zdb -vvv foo'?  

FYI, there are no cross-filesystem links, even with ZFS.

--matt


Re: [zfs-discuss] Big JBOD: what would you do?

2006-07-17 Thread Richard Elling

[stirring the pot a little...]

Jim Mauro wrote:
I agree with Greg - For ZFS, I'd recommend a larger number of raidz
luns, with a smaller number of disks per LUN, up to 6 disks per raidz lun.


For 6 disks, 3x2-way RAID-1+0 offers better resiliency than RAID-Z
or RAID-Z2.  For 3-5 disks, RAID-Z2 offers better resiliency, even
over split-disk RAID-1+0.
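
For concreteness, a sketch of how the two six-disk layouts compare as zpool
commands (c1t0d0..c1t5d0 are placeholder device names):

# zpool create tank mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0 mirror c1t4d0 c1t5d0
  (3x2-way RAID-1+0)
# zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0
  (6-disk RAID-Z2)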

This will more closely align with performance best practices, so it
would be cool to find common ground in terms of a sweet spot for performance and RAS.


It is clear that a single 46-way RAID-Z or RAID-Z2 zpool won't be
popular :-)
 -- richard


Re: [zfs-discuss] zpool unavailable after reboot

2006-07-17 Thread Eric Schrock
On Tue, Jul 18, 2006 at 10:10:33AM +1000, Nathan Kroenert wrote:
> Jeff -
> 
> That sounds like a great idea... 
> 
> Another idea might be to have a zpool create announce the 'availability'
> of any given configuration, and output the single points of failure.
> 
>   # zpool create mypool a b c
>   NOTICE: This pool has no redundancy. 
>   Without hardware redundancy (raid1 / 5), 
>   a single disk failure will destroy the whole pool.
> 
>   # zpool create mypool raidz a b c
>   NOTICE: This pool has single disk redundancy. 
>   Without hardware redundancy (raid1 / 5), 
>   this pool can survive at most 1 disk failing.
> 
>   # zpool create mypool raidz2 a b c
>   NOTICE: This pool has double disk redundancy. 
>   Without hardware redundancy (raid1 / 5), 
>   this pool can survive at most 2 disks failing.
> 
> It would be especially nice if it was able to detect silly
> configurations too (like adding simple, unreplicated disks to a raidz,
> or something like that, if it's even possible) and announce the
> reduction in reliability.

FYI, zpool(1M) will already detect some variations of "silly" and force
you to use the '-f' option if you really mean it (for add and create).
Examples include using vdevs of different redundancy (raidz + mirror),
as well as using different size devices.  If you have other definitions
of silly, let us know what we should be looking for.
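
One example of a mix that gets flagged today (a sketch; pool and device names are
placeholders, and the exact warning text is omitted):

# zpool create tank mirror c1d0 c1d1 raidz c1d2 c1d3 c1d4
  (refused: the mirror and raidz vdevs have mismatched replication levels)
# zpool create -f tank mirror c1d0 c1d1 raidz c1d2 c1d3 c1d4
  (goes ahead only because you forced it)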

- Eric

--
Eric Schrock, Solaris Kernel Development   http://blogs.sun.com/eschrock


Re: [zfs-discuss] zpool unavailable after reboot

2006-07-17 Thread Nathan Kroenert
Jeff -

That sounds like a great idea... 

Another idea might be to have a zpool create announce the 'availability'
of any given configuration, and output the single points of failure.

# zpool create mypool a b c
NOTICE: This pool has no redundancy. 
Without hardware redundancy (raid1 / 5), 
a single disk failure will destroy the whole pool.

# zpool create mypool raidz a b c
NOTICE: This pool has single disk redundancy. 
Without hardware redundancy (raid1 / 5), 
this pool can survive at most 1 disk failing.

# zpool create mypool raidz2 a b c
NOTICE: This pool has double disk redundancy. 
Without hardware redundancy (raid1 / 5), 
this pool can survive at most 2 disks failing.

It would be especially nice if it was able to detect silly
configurations too (like adding simple, unreplicated disks to a raidz,
or something like that, if it's even possible) and announce the
reduction in reliability.

Thoughts? :)

Nathan.

On Mon, 2006-07-17 at 18:35, Jeff Bonwick wrote:
> > I have a 10 disk raidz pool running Solaris 10 U2, and after a reboot
> > the whole pool became unavailable after apparently losing a disk drive.
> > [...]
> > NAME        STATE     READ WRITE CKSUM
> > data        UNAVAIL      0     0     0  insufficient replicas
> >   c1t0d0    ONLINE       0     0     0
> > [...]
> >   c1t4d0    UNAVAIL      0     0     0  cannot open
> > --
> > 
> > The problem as I see it is that the pool should be able to handle
> > 1 disk error, no?
> 
> If it were a raidz pool, that would be correct.  But according to
> zpool status, it's just a collection of disks with no replication.
> Specifically, compare these two commands:
> 
> (1) zpool create data A B C
> 
> (2) zpool create data raidz A B C
> 
> Assume each disk has 500G capacity.
> 
> The first command will create an unreplicated pool with 1.5T capacity.
> The second will create a single-parity RAID-Z pool with 1.0T capacity.
> 
> My guess is that you intended the latter, but actually typed the former,
> perhaps assuming that RAID-Z was always present.  If so, I apologize for
> not making this clearer.  If you have any suggestions for how we could
> improve the zpool(1M) command or documentation, please let me know.
> 
> One option -- I confess up front that I don't really like it -- would be
> to make 'unreplicated' an explicit replication type (in addition to
> mirror and raidz), so that you couldn't get it by accident:
> 
>   zpool create data unreplicated A B C
> 
> The extra typing would be annoying, but would make it almost impossible
> to get the wrong behavior by accident.
> 
> Jeff
> 
-- 



Re: [zfs-discuss] Big JBOD: what would you do?

2006-07-17 Thread Jim Mauro
I agree with Greg - For ZFS, I'd recommend a larger number of raidz
luns, with a smaller number of disks per LUN, up to 6 disks per raidz lun.

This will more closely align with performance best practices, so it
would be cool to find common ground in terms of a sweet spot for performance and RAS.

/jim


Gregory Shaw wrote:
To maximize the throughput, I'd go with 8 5-disk raid-z{2} luns.  
 Using that configuration, a full-width stripe write should be a 
single operation for each controller.


In production, the application needs would probably dictate the 
resulting disk layout.  If the application doesn't need tons of i/o, 
you could bind more disks together for larger luns...


On Jul 17, 2006, at 3:30 PM, Richard Elling wrote:


ZFS fans,
I'm preparing some analyses on RAS for large JBOD systems such as
the Sun Fire X4500 (aka Thumper).  Since there are zillions of possible
permutations, I need to limit the analyses to some common or desirable
scenarios.  Naturally, I'd like your opinions.  I've already got a few
scenarios in analysis, and I don't want to spoil the brain storming, so
feel free to think outside of the box.

If you had 46 disks to deploy, what combinations would you use?  Why?

Examples,
46-way RAID-0  (I'll do this just to show why you shouldn't do this)
22x2-way RAID-1+0 + 2 hot spares
15x3-way RAID-Z2+0 + 1 hot spare
...

Because some people get all wrapped up with the controllers, assume 5
8-disk SATA controllers plus 1 6-disk controller.  Note: the reliability of
the controllers is much greater than the reliability of the disks, so
the data availability and MTTDL analysis will be dominated by the disks
themselves.  In part, this is due to using SATA/SAS (point-to-point disk
connections) rather than a parallel bus or FC-AL where we would also have
to worry about bus or loop common cause failures.

I will be concentrating on data availability and MTTDL as two views 
of RAS.

The intention is that the interesting combinations will also be analyzed
for performance and we can complete a full performability analysis on 
them.

Thanks
 -- richard


-
Gregory Shaw, IT Architect
Phone: (303) 673-8273Fax: (303) 673-2773
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive ULVL4-382  [EMAIL PROTECTED] 
 (work)
Louisville, CO 80028-4382[EMAIL PROTECTED] 
 (home)
"When Microsoft writes an application for Linux, I've Won." - Linus 
Torvalds









[zfs-discuss] Fun with ZFS and iscsi volumes

2006-07-17 Thread Jason Hoffman

Hi Everyone,

I thought I'd share some benchmarking and playing around that we had
done with making zpools from "disks" that were iSCSI volumes. The
numbers are representative of 6 benchmarking rounds per configuration.


The interesting finding, at least for us, was the filebench varmail
(50:50 reads-writes) results for a RAIDZ pool containing 9
volumes versus the other combinations (we also did 8, 7, 6, 5 and 4 volumes;
groups of 3 and 9 seemed to form the boundary cases).


Regards, Jason


More details at http://svn.joyent.com/public/experiments/equal-sun-iscsi-zfs-fun.txt



## Hardware Setup

- A T1000 server, 8 core, 8GB of RAM
- Equallogic PS300E storage arrays
- Standard 4 dedicated switch arrangements (http://joyeur.com/2006/05/04/what-sysadmins-start-doing-when-hanging-around-designers)


## Software benchmarks

- FileBench (http://www.opensolaris.org/os/community/performance/filebench/) with varmail and webserver workloads

- Bonnie++

## Questions
1) In a zpool of 3x RAIDZ groups of 3 volumes each, can we offline a  
total of 3 "drives" (that could come from 3 different physical  
arrays)? What are the performance differences between 9 online and 6  
online drives?

2) What are the differences between zpools containing
- 3x RAIDZ groups of 3 volumes each,
- a single zpool of 9 volumes with no mirroring or RAIDZ,
- a single RAIDZ group with 9 volumes, and
- two RAIDZ groups with 9 volumes each?
3) Can we saturate a single gigabit connection between the server and  
iSCSI storage?


## Findings
1) Tolerated offlining 3 of 9 drives in a 3x RAIDZ of 3 drives each.  
DEGRADED (6 of 9 online) versus ONLINE (9 of 9)

a) Filebench varmail (50:50 reads-writes):
- 2045.2 ops/s in "state: DEGRADED"
- 2473.0 ops/s in "state: ONLINE"
b) Filebench webserver (90:10 reads-writes)
- 54530.5 ops/s "in state: DEGRADED"
- 54328.1 ops/s "in state: ONLINE".

2) Filebench RAIDZ of 3x3 vs "RAID0" vs RAIDZ of 1x9 vs RAIDZ of 2x9
a) Varmail (50:50 reads-writes):
- 2473.0 ops/s (RAIDZ of 3x3)
- 4316.8 ops/s (RAID0),
- 13144.8 ops/s (RAIDZ of 1x9),
- 11363.7 ops/s (RAIDZ of 2x9)
b) Webserver (90:10 reads-writes):
- 54328.1 ops/s (RAIDZ of 3x3),
- 54386.9 ops/s (RAID0),
- 53960.1 ops/s (RAIDZ of 1x9),
- 56897.2 ops/s (RAIDZ of 2x9)

3) We could saturate a single gigabit connection out to the storage.





Re: [zfs-discuss] ZFS needs a viable backup mechanism

2006-07-17 Thread Matthew Ahrens
On Fri, Jul 07, 2006 at 04:00:38PM -0400, Dale Ghent wrote:
> Add an option to zpool(1M) to dump the pool config as well as the  
> configuration of the volumes within it to an XML file. This file  
> could then be "sucked in" to zpool at a later date to recreate/ 
> replicate the pool and its volume structure in one fell swoop. After  
> that, Just Add Data(tm).

Yep, this has been on our to-do list for quite some time:

RFE #6276640 "zpool config"
RFE #6276912 "zfs config"
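
Until those RFEs land, a rough way to at least capture (though not replay) the
layout by hand might be something like this (a sketch, not a supported config
dump; 'tank' and the output file names are placeholders):

# zpool status -v tank > tank-pool.txt
# zfs list -r -o name,type,mountpoint tank > tank-datasets.txt
# zfs get -r all tank > tank-properties.txt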

--matt


Re[2]: [zfs-discuss] Large device support

2006-07-17 Thread J.P. King



I take it you already have solved the problem.


Yes, my problems went away once my device supported the extended SCSI 
instruction set.


Julian
--
Julian King
Computer Officer, University of Cambridge, Unix Support


Re: [zfs-discuss] Big JBOD: what would you do?

2006-07-17 Thread Gregory Shaw
To maximize the throughput, I'd go with 8 5-disk raid-z{2} luns.  Using that
configuration, a full-width stripe write should be a single operation for each
controller.

In production, the application needs would probably dictate the resulting disk
layout.  If the application doesn't need tons of i/o, you could bind more disks
together for larger luns...

On Jul 17, 2006, at 3:30 PM, Richard Elling wrote:

ZFS fans,
I'm preparing some analyses on RAS for large JBOD systems such as
the Sun Fire X4500 (aka Thumper).  Since there are zillions of possible
permutations, I need to limit the analyses to some common or desirable
scenarios.  Naturally, I'd like your opinions.  I've already got a few
scenarios in analysis, and I don't want to spoil the brain storming, so
feel free to think outside of the box.

If you had 46 disks to deploy, what combinations would you use?  Why?

Examples,
    46-way RAID-0  (I'll do this just to show why you shouldn't do this)
    22x2-way RAID-1+0 + 2 hot spares
    15x3-way RAID-Z2+0 + 1 hot spare
    ...

Because some people get all wrapped up with the controllers, assume 5
8-disk SATA controllers plus 1 6-disk controller.  Note: the reliability of
the controllers is much greater than the reliability of the disks, so
the data availability and MTTDL analysis will be dominated by the disks
themselves.  In part, this is due to using SATA/SAS (point-to-point disk
connections) rather than a parallel bus or FC-AL where we would also have
to worry about bus or loop common cause failures.

I will be concentrating on data availability and MTTDL as two views of RAS.
The intention is that the interesting combinations will also be analyzed
for performance and we can complete a full performability analysis on them.

Thanks
 -- richard

-
Gregory Shaw, IT Architect
Phone: (303) 673-8273        Fax: (303) 673-2773
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive ULVL4-382              [EMAIL PROTECTED] (work)
Louisville, CO 80028-4382                 [EMAIL PROTECTED] (home)
"When Microsoft writes an application for Linux, I've Won." - Linus Torvalds


[zfs-discuss] Big JBOD: what would you do?

2006-07-17 Thread Richard Elling

ZFS fans,
I'm preparing some analyses on RAS for large JBOD systems such as
the Sun Fire X4500 (aka Thumper).  Since there are zillions of possible
permutations, I need to limit the analyses to some common or desirable
scenarios.  Naturally, I'd like your opinions.  I've already got a few
scenarios in analysis, and I don't want to spoil the brain storming, so
feel free to think outside of the box.

If you had 46 disks to deploy, what combinations would you use?  Why?

Examples,
46-way RAID-0  (I'll do this just to show why you shouldn't do this)
22x2-way RAID-1+0 + 2 hot spares
15x3-way RAID-Z2+0 + 1 hot spare
...
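
As a sketch (scaled down to a handful of placeholder devices so the commands stay
readable), those layouts map onto zpool syntax roughly like this:

# zpool create big c0t0d0 c0t1d0 c0t2d0 c0t3d0
  (RAID-0: one plain stripe)
# zpool create big mirror c0t0d0 c1t0d0 mirror c0t1d0 c1t1d0
  (2-way RAID-1+0: more mirror pairs as needed)
# zpool create big raidz2 c0t0d0 c0t1d0 c0t2d0 raidz2 c1t0d0 c1t1d0 c1t2d0
  (RAID-Z2+0: striped 3-disk RAID-Z2 groups)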

Because some people get all wrapped up with the controllers, assume 5
8-disk SATA controllers plus 1 6-disk controller.  Note: the reliability of
the controllers is much greater than the reliability of the disks, so
the data availability and MTTDL analysis will be dominated by the disks
themselves.  In part, this is due to using SATA/SAS (point-to-point disk
connections) rather than a parallel bus or FC-AL where we would also have
to worry about bus or loop common cause failures.

I will be concentrating on data availability and MTTDL as two views of RAS.
The intention is that the interesting combinations will also be analyzed
for performance and we can complete a full performability analysis on them.
Thanks
 -- richard


Fwd: Re[3]: [zfs-discuss] zpool status and CKSUM errors

2006-07-17 Thread Robert Milkowski
Hi.

   Sorry for the forward, but maybe this will be more visible that way.

   I really think something strange is going on here: it's
   virtually impossible that I have a problem with hardware and get
   CKSUM errors (many of them) only for ditto blocks.


This is a forwarded message
From: Robert Milkowski <[EMAIL PROTECTED]>
To: Robert Milkowski <[EMAIL PROTECTED]>
Date: Sunday, July 9, 2006, 8:44:16 PM
Subject: [zfs-discuss] zpool status and CKSUM errors

===8<==Original message text===
Hello Robert,

Thursday, July 6, 2006, 1:49:34 AM, you wrote:

RM> Hello Eric,

RM> Monday, June 12, 2006, 11:21:24 PM, you wrote:

ES>> I reproduced this pretty easily on a lab machine.  I've filed:

ES>> 6437568 ditto block repair is incorrectly propagated to root vdev

ES>> To track this issue.  Keep in mind that you do have a flakey
ES>> controller/lun/something.  If this had been a user data block, your data
ES>> would be gone.


RM> I believe that something else is also happening here.
RM> I can see CKSUM errors on two different servers (v240 and T2000), all
RM> on non-redundant zpools, and every time it looks like the ditto block
RM> helped - hey, that's just improbable.

RM> And while on T2000 from fmdump -ev I get:

RM> Jul 05 19:59:43.8786 ereport.io.fire.pec.btp   
0x14e4b8015f612002
RM> Jul 05 20:05:28.9165 ereport.io.fire.pec.re
0x14e5f951ce12b002
RM> Jul 05 20:05:58.5381 ereport.io.fire.pec.re
0x14e614e78f4c9002
RM> Jul 05 20:05:58.5389 ereport.io.fire.pec.btp   
0x14e614e7b6ddf002
RM> Jul 05 23:34:11.1960 ereport.io.fire.pec.re
0x1513869a6f7a6002
RM> Jul 05 23:34:11.1967 ereport.io.fire.pec.btp   
0x1513869a95196002
RM> Jul 06 00:09:17.1845 ereport.io.fire.pec.re
0x151b2fca4c988002
RM> Jul 06 00:09:17.1852 ereport.io.fire.pec.btp   
0x151b2fca72e6b002


RM> on v240 fmdump shows nothing for over a month and I'm sure I did zpool
RM> clear on that server later.


RM> v240:
RM> bash-3.00# zpool status nfs-s5-s7
RM>   pool: nfs-s5-s7
RM>  state: ONLINE
RM> status: One or more devices has experienced an unrecoverable error.  An
RM> attempt was made to correct the error.  Applications are unaffected.
RM> action: Determine if the device needs to be replaced, and clear the errors
RM> using 'zpool clear' or replace the device with 'zpool replace'.
RM>see: http://www.sun.com/msg/ZFS-8000-9P
RM>  scrub: none requested
RM> config:

RM> NAME                             STATE     READ WRITE CKSUM
RM> nfs-s5-s7                        ONLINE       0     0   167
RM>   c4t600C0FF009258F28706F5201d0  ONLINE       0     0   167

RM> errors: No known data errors
RM> bash-3.00#
RM> bash-3.00# zpool clear nfs-s5-s7
RM> bash-3.00# zpool status nfs-s5-s7
RM>   pool: nfs-s5-s7
RM>  state: ONLINE
RM>  scrub: none requested
RM> config:

RM> NAME                             STATE     READ WRITE CKSUM
RM> nfs-s5-s7                        ONLINE       0     0     0
RM>   c4t600C0FF009258F28706F5201d0  ONLINE       0     0     0

RM> errors: No known data errors
RM> bash-3.00#
RM> bash-3.00# zpool scrub nfs-s5-s7
RM> bash-3.00# zpool status nfs-s5-s7
RM>   pool: nfs-s5-s7
RM>  state: ONLINE
RM>  scrub: scrub in progress, 0.01% done, 269h24m to go
RM> config:

RM> NAME                             STATE     READ WRITE CKSUM
RM> nfs-s5-s7                        ONLINE       0     0     0
RM>   c4t600C0FF009258F28706F5201d0  ONLINE       0     0     0

RM> errors: No known data errors
RM> bash-3.00#

RM> We'll see the result - I hope I won't have to stop it in the
RM> morning. Anyway, I have a feeling that nothing will be reported.


RM> ps. I've got several similar pools on those two servers and I see
RM> CKSUM errors on all of them with the same result - it's almost
RM> impossible.


OK, it actually took several days to complete the scrub.
During the scrub I already saw some CKSUM errors, and now again there are
many of them; however, the scrub itself reported no errors at all.

bash-3.00# zpool status nfs-s5-s7
  pool: nfs-s5-s7
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed with 0 errors on Sun Jul  9 02:56:19 2006
config:

NAME                             STATE     READ WRITE CKSUM
nfs-s5-s7                        ONLINE       0     0    18
  c4t600C0FF009258F28706F5201d0  ONLINE       0     0    18

errors: No known data errors
bash-3.00#
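
For anyone following along, the cross-checks being used here boil down to (a
sketch; the pool name is the one from the output above):

# zpool status -v nfs-s5-s7    (per-vdev READ/WRITE/CKSUM counters, plus any
                                files with known errors)
# fmdump -eV                   (raw error telemetry, to see whether the hardware
                                reported anything at all)
# zpool clear nfs-s5-s7        (reset the counters before the next observation window)
# zpool scrub nfs-s5-s7        (force a pass over every allocated block)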


-- 
Best regards,
 Robert

Re: [zfs-discuss] Re: zpool unavailable after reboot

2006-07-17 Thread eric kustarz

Mikael Kjerrman wrote:


Jeff,

thanks for your answer, and I almost wish I did type it wrong (the easy 
explanation that I messed up :-) but from what I can tell I did get it right

--- zpool commands I ran ---
bash-3.00# grep zpool /.bash_history 
zpool

zpool create data raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c2t0d0 c2t1d0 c2t2d0 
c2t3d0 c2t4d0
zpool list
zpool status
zpool iostat 3
zpool scrub data
zpool status
bash-3.00#
 



And soon we'll store the 'zpool create' (and other subcommands run) 
on-disk.  I'm finishing that up:

6343741 want to store a command history on disk

eric




the other problem I have with this is: why did it kick the disk out? I can
run all sorts of tests on the disk and it is perfectly fine... does it kick out
a random disk upon boot? ;-)


This message posted from opensolaris.org
 





Re: [zfs-discuss] Proposal: delegated administration

2006-07-17 Thread Mark Shellenbaum

James Dickens wrote:

On 7/17/06, Mark Shellenbaum <[EMAIL PROTECTED]> wrote:

The following is the delegated admin model that Matt and I have been
working on.  At this point we are ready for your feedback on the
proposed model.

   -Mark




PERMISSION GRANTING

zfs allow [-l] [-d] <"everyone"|user|group> <ability>[,<ability>...] \
    <filesystem|volume>

zfs allow [-l] [-d] -u <user> <ability>[,<ability>...] <filesystem|volume>
zfs allow [-l] [-d] -g <group> <ability>[,<ability>...] <filesystem|volume>
zfs allow [-l] [-d] -e <ability>[,<ability>...] <filesystem|volume>
zfs allow -c <ability>[,<ability>...] <filesystem|volume>

If no flags are used, the ability will be allowed for the specified
dataset and all of its descendents.

-l "Local" means that the permission will be allowed for the
specified dataset, and not its descendents (unless -d is also
specified).

-d "Descendents" means that the permission will be allowed for
descendent datasets, and not for this dataset (unless -l is also
specified).  (needed for 'zfs allow -d ahrens quota tank/home/ahrens')

When using the first form (without -u, -g, or -e), the
<"everyone"|user|group> argument will be interpreted as the keyword
"everyone" if possible, then as a user if possible, then as a group as
possible.  The "-u ", "-g ", and "-e (everyone)" forms
allow one to specify a user named "everyone", or a group whose name
conflicts with a user (or "everyone").  (note: the -e form is not
necessary since "zfs allow everyone" will always mean the keyword
everyone not the user everyone.)

As a possible extension, multiple 's could be allowed in one
command (eg. 'zfs allow -u ahrens,marks create tank/project')

-c "Create" means that the permission will be granted (Locally) to the
creator on any newly-created descendant filesystems.

Abilities are mostly self explanatory, the ability to run
'zfs [set]  '.  Note, this implicitly collapses the
subcommand and property namespaces into one.  (I think that the 'set' is
superfluous anyway, it would be more convenient to say
'zfs =' anyway.)

create  create descendent datasets
destroy
snapshot
rollback
clone   create clone of any of the ds's snaps
(must also have 'create' ability in clone's 
parent)

promote (must also have 'promote' ability in origin fs)
rename  (must also have 'create' ability in new parent)
mount   mount and unmount the ds
share   share and unshare this ds
sendsend any of the ds's snapshots
receive create a descendent with 'zfs receive'
(must also have 'create' ability)
quota
reservation
volsize
recordsize
mountpoint
sharenfs
checksum
compression
atime
devices
exec
setuid
readonly
zoned
snapdir
aclmode
aclinherit


Hi

just one addition, "all" or "full" attributes, for the case you want
to get full permissions to the user or group

zfs create p1/john
zfs  allow  p1/john john  full

so we don't have to type out every attribute.



I think you wanted

zfs allow john full p1/john

We could have either a "full" or "all" to represent all permissions, but 
the problem with that is that you will then end up granting more 
permissions than are necessary to achieve the desired goal.


If enough people think its useful then we can do it.



James Dickens
uadmin.blogspot.com




Re: [zfs-discuss] Proposal: delegated administration

2006-07-17 Thread Mark Shellenbaum

Glenn Skinner wrote:

The following is a nit-level comment, so I've directed it only to you,
rather than to the entire list.

Date: Mon, 17 Jul 2006 09:57:35 -0600
From: Mark Shellenbaum <[EMAIL PROTECTED]>
Subject: [zfs-discuss] Proposal: delegated administration

The following is the delegated admin model that Matt and I have been 
working on.  At this point we are ready for your feedback on the 
proposed model.


...
PERMISSION REVOKING

zfs unallow [-r] [-l] [-d] \
    <"everyone"|user|group>[,<"everyone"|user|group>...] \
    <ability>[,<ability>...] <filesystem|volume>
zfs unallow [-r] [-l] [-d] -u <user> <ability>[,<ability>...] <filesystem|volume>
zfs unallow [-r] [-l] [-d] -g <group> <ability>[,<ability>...] <filesystem|volume>
zfs unallow [-r] [-l] [-d] -e <ability>[,<ability>...] <filesystem|volume>


Please, can we have "disallow" instead of "unallow"?  The former is a
real word, the latter isn't.

-- Glenn



The reasoning behind unallow was to imply that you are simply removing 
an "allow".  With *disallow* it would sound more like you are denying a 
permission.


  -Mark



Re[2]: [zfs-discuss] Large device support

2006-07-17 Thread Robert Milkowski
Hello J.P.,

Monday, July 17, 2006, 3:57:01 PM, you wrote:

>> Well, if in fact sd/ssd with EFI labels still have a limit of 2TB, then
>> create an SMI label with one slice representing the whole disk and put
>> zfs on that slice. Then manually turn on the write cache.

JPK> How do you suggest that I create a slice representing the whole disk?
JPK> format (with or without -e) only sees 1.19TB

I take it you already have solved the problem.



-- 
Best regards,
 Robertmailto:[EMAIL PROTECTED]
   http://milek.blogspot.com



Re: [zfs-discuss] ZFS bechmarks w/8 disk raid - Quirky results, any thoughts?

2006-07-17 Thread James Dickens

On 7/17/06, Jonathan Wheeler <[EMAIL PROTECTED]> wrote:

Hi All,

I've just built an 8 disk zfs storage box, and I'm in the testing phase before 
I put it into production. I've run into some unusual results, and I was hoping 
the community could offer some suggestions. I've basically made the switch to
Solaris on the promises of ZFS alone (yes I'm that excited about it!), so 
naturally I'm looking forward to some great performance - but it appears I'm 
going to need some help finding all of it.

I was having even lower numbers with filebench, so I decided to dial back to a 
really simple app for testing - bonnie.

The system is a nevada_41 EM64T 3GHz xeon, 1GB ram, with 8x seagate sata II
300GB disks on a Supermicro SAT2-MV8 8 port sata controller, running on a 133MHz
64-bit pci-x bus.
The bottleneck here, by my thinking, should be the disks themselves.
It's not the disk interfaces ('300MB'), the disk bus (300MB each), the pci-x
bus (1.1GB), and I'd hope a 64-bit 3GHz cpu would be sufficient.

Tests were run on a fresh clean zpool, on an idle system. Rogue results were 
dropped, and as you can see below, all tests were run more than once. 8GB
should be far more than the 1GB of RAM that the system has, eliminating caching
issues.

If I've still managed to overlook something in my testing setup, please let me 
know - I sure did try!

Sorry about the formatting - this is bound to end up ugly

Bonnie
  ---Sequential Output ---Sequential Input-- --Random--
  -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
raid0    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
8 disk 8196 78636 93.0 261804 64.2 125585 25.6 72160 95.3 246172 19.1 286.0  2.0
8 disk 8196 79452 93.9 286292 70.2 129163 26.0 72422 95.5 243628 18.9 302.9  2.1

so ~270MB/sec writes - awesome! 240MB/sec reads though - why would this be
LOWER than writes??

  ---Sequential Output ---Sequential Input-- --Random--
  -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
mirror   MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
8 disk   8196 33285 38.6 46033  9.9 33077  6.8 67934 90.4  93445  7.7 230.5  1.3
8 disk   8196 34821 41.4 46136  9.0 32445  6.6 67120 89.1  94403  6.9 210.4  1.8

46MB/sec writes: each disk individually can do better, but I guess keeping 8
disks in sync is hurting performance. The 94MB/sec read figure is interesting. On
the one hand, that's greater than 1 disk's worth, so I'm getting striping
performance out of a mirror GO ZFS. On the other, if I can get striping
performance from mirrored reads, why is it only 94MB/sec? Seemingly it's not
cpu bound.


Now for the important test, raid-z

  ---Sequential Output ---Sequential Input-- --Random--
  -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
raidz    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
8 disk 8196 61785 70.9 142797 29.3 89342 19.9 64197 85.7 320554 32.6 131.3  1.0
8 disk 8196 62869 72.4 131801 26.7 90692 20.7 63986 85.7 306152 33.4 127.3  1.0
8 disk 8196 63103 72.9 128164 25.9 86175 19.4 64126 85.7 320410 32.7 124.5  0.9
7 disk 8196 51103 58.8  93815 19.1 74093 16.1 64705 86.5 331865 32.8 124.9  1.0
7 disk 8196 49446 56.8  93946 18.7 73092 15.8 64708 86.7 331458 32.7 127.1  1.0
7 disk 8196 49831 57.1  81305 16.2 78101 16.9 64698 86.4 331577 32.7 132.4  1.0
6 disk 8196 62360 72.3 157280 33.4 99511 21.9 65360 87.3 288159 27.1 132.7  0.9
6 disk 8196 63291 72.8 152598 29.1 97085 21.4 65546 87.2 292923 26.7 133.4  0.8
4 disk 8196 57965 67.9 123268 27.6 78712 17.1 66635 89.3 189482 15.9 134.1  0.9

I'm getting distinctly non-linear scaling here.
Writes: 4 disks gives me 123MB/sec. Raid0 was giving me 270/8 = 33MB/sec with
cpu to spare (roughly half of what each individual disk should be capable of).
Here I'm getting 123/4 = 30MB/sec, or should that be 123/3 = 41MB/sec?
Using 30 as a baseline, I'd be expecting to see twice that with 8 disks
(240ish?). What I end up with is ~135. Clearly not good scaling at all.
The really interesting numbers happen at 7 disks - it's slower than with 4, in
all tests.
I ran it 3x to be sure.
Note this was a native 7 disk raid-z, it wasn't 8 running in degraded mode with 
7.
Something is really wrong with my write performance here across the board.

Reads: 4 disks gives me 190MB/sec. WOAH! I'm very happy with that. 8 disks
should scale to 380 then; well, 320 isn't all that far off - no biggie.
Looking at the 6 disk raidz is interesting though: 290MB/sec. The disks are
good for 60+MB/sec individually. 290 is 48/disk - note also that this is better
than my raid0 performance?!
Adding another 2 disks to my raidz gives me a mere 30MB/sec extra performance?
Something is going very wrong here too.


I'm not an expert, but it would be great if you could run at least one more test.

Can you try 2x 4 disks i
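
In case it helps, a sketch of that suggestion, reading it as two 4-disk raid-z
groups in one pool (device names are placeholders):

# zpool create tank raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 \
      raidz c2t4d0 c2t5d0 c2t6d0 c2t7d0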

Re: [zfs-discuss] Proposal: delegated administration

2006-07-17 Thread James Dickens

On 7/17/06, Mark Shellenbaum <[EMAIL PROTECTED]> wrote:

The following is the delegated admin model that Matt and I have been
working on.  At this point we are ready for your feedback on the
proposed model.

   -Mark




PERMISSION GRANTING

zfs allow [-l] [-d] <"everyone"|user|group> <ability>[,<ability>...] \
    <filesystem|volume>

zfs allow [-l] [-d] -u <user> <ability>[,<ability>...] <filesystem|volume>
zfs allow [-l] [-d] -g <group> <ability>[,<ability>...] <filesystem|volume>
zfs allow [-l] [-d] -e <ability>[,<ability>...] <filesystem|volume>
zfs allow -c <ability>[,<ability>...] <filesystem|volume>

If no flags are used, the ability will be allowed for the specified
dataset and all of its descendents.

-l "Local" means that the permission will be allowed for the
specified dataset, and not its descendents (unless -d is also
specified).

-d "Descendents" means that the permission will be allowed for
descendent datasets, and not for this dataset (unless -l is also
specified).  (needed for 'zfs allow -d ahrens quota tank/home/ahrens')

When using the first form (without -u, -g, or -e), the
<"everyone"|user|group> argument will be interpreted as the keyword
"everyone" if possible, then as a user if possible, then as a group as
possible.  The "-u ", "-g ", and "-e (everyone)" forms
allow one to specify a user named "everyone", or a group whose name
conflicts with a user (or "everyone").  (note: the -e form is not
necessary since "zfs allow everyone" will always mean the keyword
everyone not the user everyone.)

As a possible extension, multiple 's could be allowed in one
command (eg. 'zfs allow -u ahrens,marks create tank/project')

-c "Create" means that the permission will be granted (Locally) to the
creator on any newly-created descendant filesystems.

Abilities are mostly self explanatory, the ability to run
'zfs [set]  '.  Note, this implicitly collapses the
subcommand and property namespaces into one.  (I think that the 'set' is
superfluous anyway, it would be more convenient to say
'zfs =' anyway.)

create  create descendent datasets
destroy
snapshot
rollback
clone   create clone of any of the ds's snaps
(must also have 'create' ability in clone's parent)
promote (must also have 'promote' ability in origin fs)
rename  (must also have 'create' ability in new parent)
mount   mount and unmount the ds
share   share and unshare this ds
sendsend any of the ds's snapshots
receive create a descendent with 'zfs receive'
(must also have 'create' ability)
quota
reservation
volsize
recordsize
mountpoint
sharenfs
checksum
compression
atime
devices
exec
setuid
readonly
zoned
snapdir
aclmode
aclinherit


Hi

just one addition, "all" or "full" attributes, for the case you want
to get full permissions to the user or group

zfs create p1/john
zfs  allow  p1/john john  full

so we don't have to type out every attribute.


James Dickens
uadmin.blogspot.com





PERMISSION REVOKING

zfs unallow [-r] [-l] [-d] \
    <"everyone"|user|group>[,<"everyone"|user|group>...] \
    <ability>[,<ability>...] <filesystem|volume>
zfs unallow [-r] [-l] [-d] -u <user> <ability>[,<ability>...] <filesystem|volume>
zfs unallow [-r] [-l] [-d] -g <group> <ability>[,<ability>...] <filesystem|volume>
zfs unallow [-r] [-l] [-d] -e <ability>[,<ability>...] <filesystem|volume>

'zfs unallow' removes permissions that were granted with 'zfs allow'.
Note that this does not explicitly deny any permissions; the permissions
may still be allowed by ancestors of the specified dataset.

-l "Local" will cause only the Local permission to be removed.

-d "Descendents" will cause only the Descendant permissions to be
removed.

-r "Recursive" will remove the specified permissions from all descendant
datasets, as if 'zfs unallow' had been run on each descendant.

Note that '-r' removes abilities that have been explicitly set on
descendants, whereas '-d' removes abilities that have been set on *this*
dataset but apply to descendants.


PERMISSION PRINTING

zfs allow [-1] <filesystem|volume>

prints permissions that are set or allowed on this dataset, in the
following format:

  [,...] ()

 is "user", "group", or "everyone"
 is the user or group name, or blank for everyone and create
 can be:
"Local" (ie. set here with -l)
"Descendent" (ie. set here with -d)
"Local+Descendent" (ie. set here with no flags)
"Create" (ie. set here with -c)
"Inherited from " (ie. set on an ancestor without -l)

By default, only one line with a given ,, will be
printed (ie. abilities will be consolidated into one line of output
where possible).

-1 "One" will cause each line of output to print only a single ability,
and a single type (ie. not use "Local+Descendent")



ALLOW EXAMPLE

Lets setup a public build machine where engineers in group "staff" can create
ZFS file systems,clones,snapshots and so on, but you want to allow only
creator of the file system 

Re: [zfs-discuss] Proposal: delegated administration

2006-07-17 Thread Nicolas Williams
On Mon, Jul 17, 2006 at 10:11:35AM -0700, Matthew Ahrens wrote:
> > I want root to create a new filesystem for a new user under
> > the /export/home filesystem, but then have that user get the
> > right privs via inheritance rather than requiring root to run
> > a set of zfs commands.
> 
> In that case, how should the system determine who the "owner" is?  We
> toyed with the idea of figuring out the user based on the last component
> of the filesystem name, but that seemed too tricky, at least for the
> first version.

The owner of the root directory of the ZFS filesystem in question.
Could delegation be derived from the ACL of the directory that would
contain a new ZFS filesystem?

E.g.,

# zfs create pool/foo
# chown joe pool/foo
# su - joe
% zfs create pool/foo/a
% chmod  /pool/foo/a
% exit
# su - jane
% zfs create pool/foo/a/b
% 
...

After all, with cheap filesystems creating a filesystem is almost like
creating a directory (I know, not quite the same, but perhaps close
enough for reusing the add_subdirectory ACE flag).

Nico
-- 


Re: [zfs-discuss] Proposal: delegated administration

2006-07-17 Thread Mark Shellenbaum

Bart Smaalders wrote:

Matthew Ahrens wrote:

On Mon, Jul 17, 2006 at 10:00:44AM -0700, Bart Smaalders wrote:

So as administrator what do I need to do to set
/export/home up for users to be able to create their own
snapshots, create dependent filesystems (but still mounted
underneath their /export/home/usrname)?

In other words, is there a way to specify the rights of the
owner of a filesystem rather than the individual - eg, delayed
evaluation of the owner?

I think you're asking for the -c "Creator" flag.  This allows
permissions (eg, to take snapshots) to be granted to whoever creates 
the

filesystem.  The above example shows how this might be done.

--matt

Actually, I think I mean owner.

I want root to create a new filesystem for a new user under
the /export/home filesystem, but then have that user get the
right privs via inheritance rather than requiring root to run
a set of zfs commands.


In that case, how should the system determine who the "owner" is?  We
toyed with the idea of figuring out the user based on the last component
of the filesystem name, but that seemed too tricky, at least for the
first version.

FYI, here is how you can do it with an additional zfs command:

# zfs create tank/home/barts
# zfs allow barts create,snapshot,... tank/home/barts

--matt


Owner of the top level directory is the owner of the filesystem?



When a file system is created the owner/group of the root of the file 
system are set to the user/group of the user executing the zfs create 
command.   That is also the user that all of the initial create time 
permissions are set to.


  -Mark


Re: [zfs-discuss] Large device support

2006-07-17 Thread Torrey McMahon

Or if you have the right patches ...

http://blogs.sun.com/roller/page/torrey?entry=really_big_luns

Cindy Swearingen wrote:

Hi Julian,

Can you send me the documentation pointer that says 2 TB isn't supported
on the Solaris 10 6/06 release?

The  2 TB limit was lifted in the Solaris 10 1/06 release, as described
here:

http://docs.sun.com/app/docs/doc/817-5093/6mkisoq1j?a=view#ftzen

Thanks,

Cindy



J.P. King wrote:

Well, if in fact sd/ssd with EFI labels still have a limit of 2TB, then
create an SMI label with one slice representing the whole disk and put
zfs on that slice. Then manually turn on the write cache.



Well, in fact it turned out that the firmware on the device needed 
upgrading to support the appropriate SCSI extensions.


The documentation is still wrong in that it suggests that the ssd/sd 
driver shouldn't work with >2TB, but I am happy, so no problem.



Julian
--
Julian King
Computer Officer, University of Cambridge, Unix Support




Re: [zfs-discuss] Proposal: delegated administration

2006-07-17 Thread Bart Smaalders

Matthew Ahrens wrote:

On Mon, Jul 17, 2006 at 10:00:44AM -0700, Bart Smaalders wrote:

So as administrator what do I need to do to set
/export/home up for users to be able to create their own
snapshots, create dependent filesystems (but still mounted
underneath their /export/home/usrname)?

In other words, is there a way to specify the rights of the
owner of a filesystem rather than the individual - eg, delayed
evaluation of the owner?

I think you're asking for the -c "Creator" flag.  This allows
permissions (eg, to take snapshots) to be granted to whoever creates the
filesystem.  The above example shows how this might be done.

--matt

Actually, I think I mean owner.

I want root to create a new filesystem for a new user under
the /export/home filesystem, but then have that user get the
right privs via inheritance rather than requiring root to run
a set of zfs commands.


In that case, how should the system determine who the "owner" is?  We
toyed with the idea of figuring out the user based on the last component
of the filesystem name, but that seemed too tricky, at least for the
first version.

FYI, here is how you can do it with an additional zfs command:

# zfs create tank/home/barts
# zfs allow barts create,snapshot,... tank/home/barts

--matt


Owner of the top level directory is the owner of the filesystem?

- Bart


--
Bart Smaalders  Solaris Kernel Performance
[EMAIL PROTECTED]   http://blogs.sun.com/barts


Re: [zfs-discuss] Proposal: delegated administration

2006-07-17 Thread Matthew Ahrens
On Mon, Jul 17, 2006 at 10:00:44AM -0700, Bart Smaalders wrote:
> >>So as administrator what do I need to do to set
> >>/export/home up for users to be able to create their own
> >>snapshots, create dependent filesystems (but still mounted
> >>underneath their /export/home/usrname)?
> >>
> >>In other words, is there a way to specify the rights of the
> >>owner of a filesystem rather than the individual - eg, delayed
> >>evaluation of the owner?
> >
> >I think you're asking for the -c "Creator" flag.  This allows
> >permissions (eg, to take snapshots) to be granted to whoever creates the
> >filesystem.  The above example shows how this might be done.
> >
> >--matt
> 
> Actually, I think I mean owner.
> 
> I want root to create a new filesystem for a new user under
> the /export/home filesystem, but then have that user get the
> right privs via inheritance rather than requiring root to run
> a set of zfs commands.

In that case, how should the system determine who the "owner" is?  We
toyed with the idea of figuring out the user based on the last component
of the filesystem name, but that seemed too tricky, at least for the
first version.

FYI, here is how you can do it with an additional zfs command:

# zfs create tank/home/barts
# zfs allow barts create,snapshot,... tank/home/barts

--matt


Re: [zfs-discuss] Proposal: delegated administration

2006-07-17 Thread Mark Shellenbaum

Bart Smaalders wrote:

Matthew Ahrens wrote:

On Mon, Jul 17, 2006 at 09:44:28AM -0700, Bart Smaalders wrote:

Mark Shellenbaum wrote:

PERMISSION GRANTING

zfs allow -c <ability>[,<ability>...] <filesystem|volume>

-c "Create" means that the permission will be granted (Locally) to the
creator on any newly-created descendant filesystems.

ALLOW EXAMPLE
Lets setup a public build machine where engineers in group "staff" 
can create ZFS file systems,clones,snapshots and so on, but you want 
to allow only creator of the file system to destroy it.


# zpool create sandbox 
# chmod 1777 /sandbox
# zfs allow -l staff create sandbox
# zfs allow -c create,destroy,snapshot,clone,promote,mount sandbox

So as administrator what do I need to do to set
/export/home up for users to be able to create their own
snapshots, create dependent filesystems (but still mounted
underneath their /export/home/usrname)?

In other words, is there a way to specify the rights of the
owner of a filesystem rather than the individual - eg, delayed
evaluation of the owner?


I think you're asking for the -c "Creator" flag.  This allows
permissions (eg, to take snapshots) to be granted to whoever creates the
filesystem.  The above example shows how this might be done.

--matt


Actually, I think I mean owner.

I want root to create a new filesystem for a new user under
the /export/home filesystem, but then have that user get the
right privs via inheritance rather than requiring root to run
a set of zfs commands.



Yes, you can delegate snapshot,clone,...

# zfs allow <user> snapshot,mount,clone,... pool

that will allow the above permissions to be inherited by all datasets in 
the pool.


If you wanted to open it up even more you could do

# zfs allow everyone snapshot,mount,clone,... pool
That would allow anybody to create a snapshot,clone,...

The -l and -d control the inheritance of the allow permissions.
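
As a sketch of how -l and -d scope a grant under the proposal (the user, ability
and dataset names are placeholders):

# zfs allow -l bob snapshot tank/home     (tank/home itself only)
# zfs allow -d bob snapshot tank/home     (descendents of tank/home only)
# zfs allow bob snapshot tank/home        (tank/home and all of its descendents)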


- Bart





Re: [zfs-discuss] Proposal: delegated administration

2006-07-17 Thread Bart Smaalders

Matthew Ahrens wrote:

On Mon, Jul 17, 2006 at 09:44:28AM -0700, Bart Smaalders wrote:

Mark Shellenbaum wrote:

PERMISSION GRANTING

zfs allow -c <ability>[,<ability>...] <filesystem|volume>

-c "Create" means that the permission will be granted (Locally) to the
creator on any newly-created descendant filesystems.

ALLOW EXAMPLE 

Lets setup a public build machine where engineers in group "staff" can 
create ZFS file systems,clones,snapshots and so on, but you want to allow 
only creator of the file system to destroy it.


# zpool create sandbox 
# chmod 1777 /sandbox
# zfs allow -l staff create sandbox
# zfs allow -c create,destroy,snapshot,clone,promote,mount sandbox

So as administrator what do I need to do to set
/export/home up for users to be able to create their own
snapshots, create dependent filesystems (but still mounted
underneath their /export/home/usrname)?

In other words, is there a way to specify the rights of the
owner of a filesystem rather than the individual - eg, delayed
evaluation of the owner?


I think you're asking for the -c "Creator" flag.  This allows
permissions (eg, to take snapshots) to be granted to whoever creates the
filesystem.  The above example shows how this might be done.

--matt


Actually, I think I mean owner.

I want root to create a new filesystem for a new user under
the /export/home filesystem, but then have that user get the
right privs via inheritance rather than requiring root to run
a set of zfs commands.

- Bart

--
Bart Smaalders  Solaris Kernel Performance
[EMAIL PROTECTED]   http://blogs.sun.com/barts


Re: [zfs-discuss] ZFS bechmarks w/8 disk raid - Quirky results, any thoughts?

2006-07-17 Thread Al Hopper
On Mon, 17 Jul 2006, Roch wrote:

>
> Sorry to plug my own blog but have you had a look at these ?
>
>   http://blogs.sun.com/roller/page/roch?entry=when_to_and_not_to (raidz)
>   http://blogs.sun.com/roller/page/roch?entry=the_dynamics_of_zfs
>
> Also, my thinking is that raid-z is probably more friendly
> when the config contains (power-of-2 + 1) disks (or + 2 for
> raid-z2).

+1

I think that 5 disks for a raidz is the sweet spot IMHO.  But ... YMMV etc.etc.

FWIW: here's a datapoint from a dirty raidz system with 8Gb of RAM & 5 *
300Gb SATA disks:

Version  1.03   --Sequential Output-- --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine      Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
zfs0          16G 88937  99 195973  47 95536  29 75279  95 228022  27 433.9   1
--Sequential Create-- Random Create
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
 16 31812  99 + +++ + +++ 28761  99 + +++ + +++
zfs0,16G,88937,99,195973,47,95536,29,75279,95,228022,27,433.9,1,16,31812,99,+,+++,+,+++,28761,99,+,+++,+,+++

I'm *very* pleased with the current release of ZFS.  That being said, ZFS
can be frustrating at times.  Occasionally it'll issue in excess of 1k I/O
ops a second (IOPS) and you'll say "holy snit, look at..." - and then
there are times you wonder why it won't issue more than ~250 IOPS.  But,
for a Rev 1 filesystem, with the technical complexity of ZFS, this level
of performance is excellent IMHO and I expect that all kinds of
improvements will continue to be made on the code over time.

Jonathan - I expect the answer to your performance expectations is that
ZFS is-what-it-is at the moment.  A suggestion is to split your 8 drives
into a 5 disk raidz pool and a 2 disk mirror with one spare drive
remaining.  Of course this is from my ZFS experience and for my intended
usage and may not apply to your intended application(s).
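
A sketch of that split, with placeholder device names:

# zpool create tank raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0   (5-disk raidz pool)
# zpool create fast mirror c1t5d0 c1t6d0                       (2-disk mirror pool)
  (c1t7d0 kept aside as the spare, swapped in later with 'zpool replace' if a
   disk dies)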

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
   Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006


Re: [zfs-discuss] Proposal: delegated administration

2006-07-17 Thread Matthew Ahrens
On Mon, Jul 17, 2006 at 09:44:28AM -0700, Bart Smaalders wrote:
> Mark Shellenbaum wrote:
> >PERMISSION GRANTING
> >
> > zfs allow -c <ability>[,<ability>...] <filesystem|volume>
> >
> >-c "Create" means that the permission will be granted (Locally) to the
> >creator on any newly-created descendant filesystems.
> >
> >ALLOW EXAMPLE 
> >
> >Lets setup a public build machine where engineers in group "staff" can 
> >create ZFS file systems,clones,snapshots and so on, but you want to allow 
> >only creator of the file system to destroy it.
> >
> ># zpool create sandbox 
> ># chmod 1777 /sandbox
> ># zfs allow -l staff create sandbox
> ># zfs allow -c create,destroy,snapshot,clone,promote,mount sandbox
> 
> So as administrator what do I need to do to set
> /export/home up for users to be able to create their own
> snapshots, create dependent filesystems (but still mounted
> underneath their /export/home/usrname)?
> 
> In other words, is there a way to specify the rights of the
> owner of a filesystem rather than the individual - eg, delayed
> evaluation of the owner?

I think you're asking for the -c "Creator" flag.  This allows
permissions (eg, to take snapshots) to be granted to whoever creates the
filesystem.  The above example shows how this might be done.

--matt


Re: [zfs-discuss] Proposal: delegated administration

2006-07-17 Thread Darren J Moffat

Mark Shellenbaum wrote:
The following is the delegated admin model that Matt and I have been 
working on.  At this point we are ready for your feedback on the 
proposed model.


Overall this looks really good.

I might have some detailed comments after a third reading, but I think 
it certainly covers functionality I need.


--
Darren J Moffat


Re: [zfs-discuss] Proposal: delegated administration

2006-07-17 Thread Bart Smaalders

Mark Shellenbaum wrote:
The following is the delegated admin model that Matt and I have been 
working on.  At this point we are ready for your feedback on the 
proposed model.


  -Mark





PERMISSION GRANTING

zfs allow [-l] [-d] <"everyone"|user|group> <ability>[,<ability>...] \
    <filesystem|volume>

zfs allow [-l] [-d] -u <user> <ability>[,<ability>...] <filesystem|volume>
zfs allow [-l] [-d] -g <group> <ability>[,<ability>...] <filesystem|volume>
zfs allow [-l] [-d] -e <ability>[,<ability>...] <filesystem|volume>
zfs allow -c <ability>[,<ability>...] <filesystem|volume>

If no flags are used, the ability will be allowed for the specified
dataset and all of its descendents.

-l "Local" means that the permission will be allowed for the
specified dataset, and not its descendents (unless -d is also
specified).

-d "Descendents" means that the permission will be allowed for
descendent datasets, and not for this dataset (unless -l is also
specified).  (needed for 'zfs allow -d ahrens quota tank/home/ahrens')

When using the first form (without -u, -g, or -e), the
<"everyone"|user|group> argument will be interpreted as the keyword
"everyone" if possible, then as a user if possible, then as a group as
possible.  The "-u ", "-g ", and "-e (everyone)" forms
allow one to specify a user named "everyone", or a group whose name
conflicts with a user (or "everyone").  (note: the -e form is not
necessary since "zfs allow everyone" will always mean the keyword
everyone not the user everyone.)

As a possible extension, multiple 's could be allowed in one
command (eg. 'zfs allow -u ahrens,marks create tank/project')

-c "Create" means that the permission will be granted (Locally) to the
creator on any newly-created descendant filesystems.

Abilities are mostly self explanatory, the ability to run
'zfs [set]  '.  Note, this implicitly collapses the
subcommand and property namespaces into one.  (I think that the 'set' is
superfluous anyway, it would be more convenient to say
'zfs =' anyway.)

create  create descendent datasets
destroy
snapshot
rollback
clone   create clone of any of the ds's snaps
(must also have 'create' ability in clone's parent)
promote (must also have 'promote' ability in origin fs)
rename  (must also have 'create' ability in new parent)
mount   mount and unmount the ds
share   share and unshare this ds
sendsend any of the ds's snapshots
receive create a descendent with 'zfs receive'
(must also have 'create' ability)
quota
reservation
volsize 
recordsize
mountpoint
sharenfs
checksum
compression
atime
devices
exec
setuid
readonly
zoned
snapdir
aclmode
aclinherit


PERMISSION REVOKING

zfs unallow [-r] [-l] [-d] \
    <"everyone"|user|group>[,<"everyone"|user|group>...] \
    <ability>[,<ability>...] <filesystem|volume>
zfs unallow [-r] [-l] [-d] -u <user> <ability>[,<ability>...] <filesystem|volume>
zfs unallow [-r] [-l] [-d] -g <group> <ability>[,<ability>...] <filesystem|volume>
zfs unallow [-r] [-l] [-d] -e <ability>[,<ability>...] <filesystem|volume>


'zfs unallow' removes permissions that were granted with 'zfs allow'.
Note that this does not explicitly deny any permissions; the permissions
may still be allowed by ancestors of the specified dataset.

-l "Local" will cause only the Local permission to be removed.

-d "Descendents" will cause only the Descendant permissions to be
removed.

-r "Recursive" will remove the specified permissions from all descendant
datasets, as if 'zfs unallow' had been run on each descendant.

Note that '-r' removes abilities that have been explicitly set on
descendants, whereas '-d' removes abilities that have been set on *this*
dataset but apply to descendants.


PERMISSION PRINTING

zfs allow [-1] <dataset>

prints permissions that are set or allowed on this dataset, in the
following format:

  <who> <name> <ability>[,<ability>...] (<type>)

<who> is "user", "group", or "everyone"
<name> is the user or group name, or blank for everyone and create
<type> can be:
"Local" (ie. set here with -l)
"Descendent" (ie. set here with -d)
"Local+Descendent" (ie. set here with no flags)
"Create" (ie. set here with -c)
"Inherited from <ancestor>" (ie. set on an ancestor without -l)

By default, only one line with a given <who>,<name>,<type> will be
printed (ie. abilities will be consolidated into one line of output
where possible).

-1 "One" will cause each line of output to print only a single ability,
and a single type (ie. not use "Local+Descendent")



ALLOW EXAMPLE 

Let's set up a public build machine where engineers in group "staff" can create 
ZFS file systems, clones, snapshots and so on, but you want to allow only the 
creator of the file system to destroy it.


# zpool create sandbox 
# chmod 1777 /sandbox
# zfs allow -l staff create sandbox
# zfs allow -c create,destroy,snapshot,clone,promote,mount sandbox

$ zfs create sandbox/marks

Now verify that a different user can't destroy it

Re: [zfs-discuss] ZFS benchmarks w/8 disk raid - Quirky results, any thoughts?

2006-07-17 Thread Richard Elling

Dana H. Myers wrote:

Jonathan Wheeler wrote:

  ---Sequential Output ---Sequential Input-- --Random--
  -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
mirror   MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
8 disk   8196 33285 38.6 46033  9.9 33077  6.8 67934 90.4  93445  7.7 230.5  1.3 
8 disk   8196 34821 41.4 46136  9.0 32445  6.6 67120 89.1  94403  6.9 210.4  1.8


46MB/sec writes: each disk individually can do better, but I guess keeping 8 
disks in sync is hurting performance. The 94MB/sec reads figure is interesting. On 
the one hand, that's greater than 1 disk's worth, so I'm getting striping 
performance out of a mirror - GO ZFS. On the other, if I can get striping 
performance from mirrored reads, why is it only 94MB/sec? Seemingly it's not 
cpu bound.


I expect a mirror to perform about the same as a single disk for writes, and about
the same as two disks for reads, which seems to be the case here.  Someone from
the ZFS team can correct me, but I tend to believe that reads from a mirror are
scheduled in pairs; it doesn't help the read performance to have 6 more copies of
the same data available.


Is this an 8-way mirror, or a 4x2 RAID-1+0?  For the former, I agree with Dana.
For the latter, you should get more available space and better performance.
8-way mirror:
zpool create blah mirror c1d0 c1d1 c1d2 c1d3 c1d4 c1d5 c1d6 c1d7
4x2-way mirror:
zpool create blag mirror c1d0 c1d1 mirror c1d2 c1d3 mirror c1d4 c1d5 
mirror c1d6 c1d7

 -- richard
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool unavailable after reboot

2006-07-17 Thread Richard Elling

I too have seen this recently, due to a partially failed drive.
When I physically removed the drive, ZFS figured everything out and
I was back up and running.  Alas, I have been unable to recreate it.
There is a bug lurking here; if someone has a more clever way to
test, we might be able to nail it down.
 -- richard

Mikael Kjerrman wrote:

Hi,

so it happened...

I have a 10 disk raidz pool running Solaris 10 U2, and after a reboot the whole 
pool became unavailable after apparently losing a disk drive. (The drive is 
seemingly ok as far as I can tell from other commands.)

--- bootlog ---
Jul 17 09:57:38 expprd fmd: [ID 441519 daemon.error] SUNW-MSG-ID: ZFS-8000-CS, 
TYPE: Fault, VER: 1, SEVERITY: Major
Jul 17 09:57:38 expprd EVENT-TIME: Mon Jul 17 09:57:38 MEST 2006
Jul 17 09:57:38 expprd PLATFORM: SUNW,UltraAX-i2, CSN: -, HOSTNAME: expprd
Jul 17 09:57:38 expprd SOURCE: zfs-diagnosis, REV: 1.0
Jul 17 09:57:38 expprd EVENT-ID: e2fd61f7-a03d-6279-d5a5-9b8755fa1af9
Jul 17 09:57:38 expprd DESC: A ZFS pool failed to open.  Refer to 
http://sun.com/msg/ZFS-8000-CS for more information.
Jul 17 09:57:38 expprd AUTO-RESPONSE: No automated response will occur.
Jul 17 09:57:38 expprd IMPACT: The pool data is unavailable
Jul 17 09:57:38 expprd REC-ACTION: Run 'zpool status -x' and either attach the 
missing device or
Jul 17 09:57:38 expprd  restore from backup.
---

--- zpool status -x ---
bash-3.00# zpool status -x
  pool: data
 state: FAULTED
status: One or more devices could not be opened.  There are insufficient
replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
dataUNAVAIL  0 0 0  insufficient replicas
  c1t0d0ONLINE   0 0 0
  c1t1d0ONLINE   0 0 0
  c1t2d0ONLINE   0 0 0
  c1t3d0ONLINE   0 0 0
  c2t0d0ONLINE   0 0 0
  c2t1d0ONLINE   0 0 0
  c2t2d0ONLINE   0 0 0
  c2t3d0ONLINE   0 0 0
  c2t4d0ONLINE   0 0 0
  c1t4d0UNAVAIL  0 0 0  cannot open
--

The problem as I see it is that the pool should be able to handle 1 disk error, 
no?
And the online, attach, and replace commands don't work when the pool is 
unavailable. I've filed a case with Sun, but thought I'd ask around here to see 
if anyone has experienced this before.


cheers,

//Mikael
 
 
This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zvol of files for Oracle?

2006-07-17 Thread Roch

Robert Milkowski writes:
 > Hello zfs-discuss,
 > 
 >   What would you rather propose for ZFS+ORACLE - zvols or just files
 >   from the performance standpoint?
 >   
 > 
 > -- 
 > Best regards,
 >  Robert  mailto:[EMAIL PROTECTED]
 >  http://milek.blogspot.com
 > 
 > ___
 > zfs-discuss mailing list
 > zfs-discuss@opensolaris.org
 > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Not sure to what extent this would suffer from this:

Synopsis: large writes to zvol synchs too much, better cut down a little
Number: 6428639

At first glance, I would think it's not in the picture, as
Oracle won't issue those jumbo-sized writes.

Right now, I've seen more focus on getting ZFS/Oracle
working well with files because that's what we see as the
demand. But given that Oracle is ready to deal with raw, and that
files bring an extra set of constraints, I would think that
zvol is best. Plus you seem to have an early adopter frame
of mind ;-)
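If you want to experiment with the zvol route, a minimal sketch (the pool name,
volume name and size below are just placeholders):

zfs create -V 10g tank/oradata01
ls -l /dev/zvol/rdsk/tank/oradata01    # hand this raw device to Oracle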

-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS benchmarks w/8 disk raid - Quirky results, any thoughts?

2006-07-17 Thread Roch

Sorry to plug my own blog but have you had a look at these ?

http://blogs.sun.com/roller/page/roch?entry=when_to_and_not_to (raidz)
http://blogs.sun.com/roller/page/roch?entry=the_dynamics_of_zfs

Also, my thinking is that raid-z is probably more friendly
when the config contains (power-of-2 + 1) disks (or + 2 for
raid-z2).
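For example, configs that follow that rule of thumb would look like this (a
sketch only; pool and device names are placeholders):

zpool create tank  raidz  c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0           # 4+1: 2^2 data disks + 1 parity
zpool create dozer raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0    # 4+2: 2^2 data disks + 2 parity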

-r

Jonathan Wheeler writes:
 > Hi All,
 > 
 > I've just built an 8 disk zfs storage box, and I'm in the testing
 > phase before I put it into production. I've run into some unusual
 > results, and I was hoping the community could offer some
 > suggestions. I've basically made the switch to Solaris on the promises
 > of ZFS alone (yes I'm that excited about it!), so naturally I'm
 > looking forward to some great performance - but it appears I'm going
 > to need some help finding all of it. 
 > 
 > I was having even lower numbers with filebench, so I decided to dial
 > back to a really simple app for testing - bonnie. 
 > 
 > The system is an nevada_41 EM64T 3ghz xeon. 1GB ram, with 8x seagate
 > sata II 300GB disks, Supermicro SAT2-MV8 8 port sata controller,
 > running at/on a 133Mhz 64pci-x bus. 
 > The bottleneck here, by my thinking, should be the disks themselves. 
 > It's not the disk interfaces ('300MB'), the disk bus (300MB EACH), the
 > pci-x bus (1.1GB), and I'd hope a 64-bit 3Ghz cpu would be sufficient. 
 > 
 > Tests were run on a fresh clean zpool, on an idle system. Rogue
 > results were dropped, and as you can see below, all tests were run
 > more than once. 8GB should be far more than the 1GB of RAM that the
 > system has, eliminating caching issues. 
 > 
 > If I've still managed to overlook something in my testing setup,
 > please let me know - I sure did try! 
 > 
 > Sorry about the formatting - this is bound to end up ugly
 > 
 > Bonnie
 >   ---Sequential Output ---Sequential Input-- 
 > --Random--
 >   -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- 
 > --Seeks---
 > raid0MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
 > 8 disk   8196 78636 93.0 261804 64.2 125585 25.6 72160 95.3 246172 19.1 
 > 286.0  2.0
 > 8 disk   8196 79452 93.9 286292 70.2 129163 26.0 72422 95.5 243628 18.9 
 > 302.9  2.1
 > 
 > so ~270MB/sec writes - awesome! 240MB/sec reads though - why would this be 
 > LOWER than writes??
 > 
 >   ---Sequential Output ---Sequential Input-- 
 > --Random--
 >   -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- 
 > --Seeks---
 > mirror   MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
 > 8 disk   8196 33285 38.6 46033  9.9 33077  6.8 67934 90.4  93445  7.7 230.5  
 > 1.3 
 > 8 disk   8196 34821 41.4 46136  9.0 32445  6.6 67120 89.1  94403  6.9 210.4  
 > 1.8
 > 
 > 46MB/sec writes: each disk individually can do better, but I guess
 > keeping 8 disks in sync is hurting performance. The 94MB/sec reads figure is
 > interesting. On the one hand, that's greater than 1 disk's worth, so
 > I'm getting striping performance out of a mirror - GO ZFS. On the other,
 > if I can get striping performance from mirrored reads, why is it only
 > 94MB/sec? Seemingly it's not cpu bound. 
 > 
 > 
 > Now for the important test, raid-z
 > 
 >   ---Sequential Output ---Sequential Input-- 
 > --Random--
 >   -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- 
 > --Seeks---
 > raidz  MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec 
 > %CPU
 > 8 disk   8196 61785 70.9 142797 29.3 89342 19.9 64197 85.7 320554 32.6 131.3 
 >  1.0
 > 8 disk   8196 62869 72.4 131801 26.7 90692 20.7 63986 85.7 306152 33.4 127.3 
 >  1.0
 > 8 disk   8196 63103 72.9 128164 25.9 86175 19.4 64126 85.7 320410 32.7 124.5 
 >  0.9
 > 7 disk   8196 51103 58.8  93815 19.1 74093 16.1 64705 86.5 331865 32.8 124.9 
 >  1.0
 > 7 disk   8196 49446 56.8  93946 18.7 73092 15.8 64708 86.7 331458 32.7 127.1 
 >  1.0
 > 7 disk   8196 49831 57.1  81305 16.2 78101 16.9 64698 86.4 331577 32.7 132.4 
 >  1.0
 > 6 disk   8196 62360 72.3 157280 33.4 99511 21.9 65360 87.3 288159 27.1 132.7 
 >  0.9
 > 6 disk   8196 63291 72.8 152598 29.1 97085 21.4 65546 87.2 292923 26.7 133.4 
 >  0.8
 > 4 disk   8196 57965 67.9 123268 27.6 78712 17.1 66635 89.3 189482 15.9 134.1 
 >  0.9
 > 
 > I'm getting distinctly non-linear scaling here.
 > 
 > Writes: 4 disks gives me 123MB/sec. Raid0 was giving me 270/8
 > = 33MB/sec with cpu to spare (roughly half of what each individual disk
 > should be capable of). Here I'm getting 123/4 = 30MB/sec, or should
 > that be 123/3 = 41MB/sec? 
 > Using 30 as a baseline, I'd be expecting to see twice that with 8 disks 
 > (240ish?). What I end up with is ~135 - clearly not good scaling at all.
 > The really interesting numbers happen at 7 disks - it's slower than
 > with 4, in all tests. 
 > I ran it 3x to be sure.
 > Note this was a native 7 disk raid-z, it wasn't 8 running in degraded
 > mode with 7.  Something is 

[zfs-discuss] Proposal: delegated administration

2006-07-17 Thread Mark Shellenbaum
The following is the delegated admin model that Matt and I have been 
working on.  At this point we are ready for your feedback on the 
proposed model.


  -Mark


PERMISSION GRANTING

zfs allow [-l] [-d] <"everyone"|user|group>[,<"everyone"|user|group>...] \
    <ability>[,<ability>...] <dataset>
zfs allow [-l] [-d] -u <user> <ability>[,...] <dataset>
zfs allow [-l] [-d] -g <group> <ability>[,...] <dataset>
zfs allow [-l] [-d] -e <ability>[,...] <dataset>
zfs allow -c <ability>[,...] <dataset>

If no flags are used, the ability will be allowed for the specified
dataset and all of its descendents.

-l "Local" means that the permission will be allowed for the
specified dataset, and not its descendents (unless -d is also
specified).

-d "Descendents" means that the permission will be allowed for
descendent datasets, and not for this dataset (unless -l is also
specified).  (needed for 'zfs allow -d ahrens quota tank/home/ahrens')

When using the first form (without -u, -g, or -e), the
<"everyone"|user|group> argument will be interpreted as the keyword
"everyone" if possible, then as a user if possible, then as a group if
possible.  The "-u <user>", "-g <group>", and "-e (everyone)" forms
allow one to specify a user named "everyone", or a group whose name
conflicts with a user (or "everyone").  (Note: the -e form is not
necessary, since "zfs allow everyone" will always mean the keyword
"everyone", not the user "everyone".)

As a possible extension, multiple <who>s could be allowed in one
command (eg. 'zfs allow -u ahrens,marks create tank/project').

-c "Create" means that the permission will be granted (Locally) to the
creator on any newly-created descendant filesystems.

Abilities are mostly self-explanatory: the ability to run
'zfs [set] <ability> <dataset>'.  Note, this implicitly collapses the
subcommand and property namespaces into one.  (I think that the 'set' is
superfluous anyway; it would be more convenient to just say
'zfs <property>=<value> <dataset>'.)

create  create descendent datasets
destroy
snapshot
rollback
clone   create clone of any of the ds's snaps
(must also have 'create' ability in clone's parent)
promote (must also have 'promote' ability in origin fs)
rename  (must also have 'create' ability in new parent)
mount   mount and unmount the ds
share   share and unshare this ds
send    send any of the ds's snapshots
receive create a descendent with 'zfs receive'
(must also have 'create' ability)
quota
reservation
volsize 
recordsize
mountpoint
sharenfs
checksum
compression
atime
devices
exec
setuid
readonly
zoned
snapdir
aclmode
aclinherit


PERMISSION REVOKING

zfs unallow [-r] [-l] [-d] \
    <"everyone"|user|group>[,<"everyone"|user|group>...] \
    <ability>[,...] <dataset>
zfs unallow [-r] [-l] [-d] -u <user> <ability>[,...] <dataset>
zfs unallow [-r] [-l] [-d] -g <group> <ability>[,...] <dataset>
zfs unallow [-r] [-l] [-d] -e <ability>[,...] <dataset>

'zfs unallow' removes permissions that were granted with 'zfs allow'.
Note that this does not explicitly deny any permissions; the permissions
may still be allowed by ancestors of the specified dataset.

-l "Local" will cause only the Local permission to be removed.

-d "Descendents" will cause only the Descendant permissions to be
removed.

-r "Recursive" will remove the specified permissions from all descendant
datasets, as if 'zfs unallow' had been run on each descendant.

Note that '-r' removes abilities that have been explicitly set on
descendants, whereas '-d' removes abilities that have been set on *this*
dataset but apply to descendants.
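To make the distinction concrete, a sketch using the proposed syntax (the group
and dataset names are just placeholders):

zfs unallow -d staff snapshot tank/proj    # the grant made here no longer applies to descendants
zfs unallow -r staff snapshot tank/proj    # also strips 'snapshot' grants made directly on each descendant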


PERMISSION PRINTING

zfs allow [-1] <dataset>

prints permissions that are set or allowed on this dataset, in the
following format:

  <who> <name> <ability>[,<ability>...] (<type>)

<who> is "user", "group", or "everyone"
<name> is the user or group name, or blank for everyone and create
<type> can be:
"Local" (ie. set here with -l)
"Descendent" (ie. set here with -d)
"Local+Descendent" (ie. set here with no flags)
"Create" (ie. set here with -c)
"Inherited from <ancestor>" (ie. set on an ancestor without -l)

By default, only one line with a given <who>,<name>,<type> will be
printed (ie. abilities will be consolidated into one line of output
where possible).

-1 "One" will cause each line of output to print only a single ability,
and a single type (ie. not use "Local+Descendent")
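For example, the output for a dataset might come out something like this
(purely illustrative; the names and grants are made up):

user      ahrens   destroy,snapshot (Local+Descendent)
group     staff    create (Inherited from tank)
everyone           mount (Local)
                   create,destroy (Create)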



ALLOW EXAMPLE 

Let's set up a public build machine where engineers in group "staff" can create 
ZFS file systems, clones, snapshots and so on, but you want to allow only the 
creator of the file system to destroy it.

# zpool create sandbox 
# chmod 1777 /sandbox
# zfs allow -l staff create sandbox
# zfs allow -c create,destroy,snapshot,clone,promote,mount sandbox

$ zfs create sandbox/marks

Now verify that a different user can't destroy it

% zfs destroy sandbox/marks
cannot destroy 'sandbox/marks': permission denied

Re: [zfs-discuss] ZFS benchmarks w/8 disk raid - Quirky results, any thoughts?

2006-07-17 Thread Dana H. Myers
Jonathan Wheeler wrote:

I'm not a ZFS expert - I'm just an enthusiastic user inside Sun.

Here are some brief observations:

> Bonnie
>   ---Sequential Output ---Sequential Input-- 
> --Random--
>   -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- 
> --Seeks---
> raid0MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
> 8 disk   8196 78636 93.0 261804 64.2 125585 25.6 72160 95.3 246172 19.1 286.0 
>  2.0
> 8 disk   8196 79452 93.9 286292 70.2 129163 26.0 72422 95.5 243628 18.9 302.9 
>  2.1
> 
> so ~270MB/sec writes - awesome! 240MB/sec reads though - why would this be 
> LOWER than writes??

I believe this can happen because ZFS is optimized for writes, though I would tend
to expect a sequential write followed by a sequential read to be about the same if
there's no other filesystem activity during the write.

>   ---Sequential Output ---Sequential Input-- 
> --Random--
>   -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- 
> --Seeks---
> mirror   MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
> 8 disk   8196 33285 38.6 46033  9.9 33077  6.8 67934 90.4  93445  7.7 230.5  
> 1.3 
> 8 disk   8196 34821 41.4 46136  9.0 32445  6.6 67120 89.1  94403  6.9 210.4  
> 1.8
> 
> 46MB/sec writes: each disk individually can do better, but I guess keeping 8 
> disks in sync is hurting performance. The 94MB/sec reads figure is interesting. On 
> the one hand, that's greater than 1 disk's worth, so I'm getting striping 
> performance out of a mirror - GO ZFS. On the other, if I can get striping 
> performance from mirrored reads, why is it only 94MB/sec? Seemingly it's not 
> cpu bound.

I expect a mirror to perform about the same as a single disk for writes, and about
the same as two disks for reads, which seems to be the case here.  Someone from
the ZFS team can correct me, but I tend to believe that reads from a mirror are
scheduled in pairs; it doesn't help the read performance to have 6 more copies of
the same data available.

> Now for the important test, raid-z

I'll have to let the experts dissect this data; it looks a little goofy to me,
too.

Dana
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: half duplex read/write operations to disk sometimes?

2006-07-17 Thread Roch

Hi Sean, You suffer from an extreme bout of 
6429205 each zpool needs to monitor it's  throughput and throttle heavy writers

When this is fixed, your responsiveness will be better.

Note to Mark, Sean is more than willing to test any fix we
would have for this...

-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large device support

2006-07-17 Thread J.P. King

On Mon, 17 Jul 2006, Cindy Swearingen wrote:


Hi Julian,

Can you send me the documentation pointer that says 2 TB isn't supported
on the Solaris 10 6/06 release?


As per my original post:
http://docs.sun.com/app/docs/doc/817-5093/6mkisoq1k?a=view#disksconcepts-17

This doesn't say which version of Solaris 10 it is talking about.


The  2 TB limit was lifted in the Solaris 10 1/06 release, as described
here:

http://docs.sun.com/app/docs/doc/817-5093/6mkisoq1j?a=view#ftzen


Thanks.  That was exactly what I was looking for but had failed to find, 
confirmation that it should work.



Cindy


Julian
--
Julian King
Computer Officer, University of Cambridge, Unix Support
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool unavailable after reboot

2006-07-17 Thread Al Hopper
On Mon, 17 Jul 2006, Darren J Moffat wrote:

> Jeff Bonwick wrote
> > zpool create data unreplicated A B C
> >
> > The extra typing would be annoying, but would make it almost impossible
> > to get the wrong behavior by accident.
>
> I think that is a very good idea from a usability view point.  It is
> better to have to type a few more chars to explicitly say "I know ZFS
> isn't going to do all the data replication" when you run zpool than to
> find out later you aren't protected (by ZFS anyway).

+1

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
   Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Mirroring better with checksums?

2006-07-17 Thread Anton B. Rang
Well, it's not related to RAID-Z at all, but yes, mirroring is better with ZFS. 
The checksums allow bad data on either side of the mirror to be detected, so if 
for some reason one disk is sometimes losing or damaging a write, the other 
disk can provide the good data (and ZFS can tell which copy is correct).
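For example (a sketch; the device names are placeholders), a scrub of a two-way
mirror reads both copies and rewrites any block whose checksum fails from the
good side:

zpool create tank mirror c1t0d0 c1t1d0
zpool scrub tank
zpool status -v tank    # the CKSUM column shows how many bad copies were found (and repaired)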
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool unavailable after reboot

2006-07-17 Thread Darren J Moffat

Jeff Bonwick wrote

zpool create data unreplicated A B C

The extra typing would be annoying, but would make it almost impossible
to get the wrong behavior by accident.


I think that is a very good idea from a usability view point.  It is 
better to have to type a few more chars to explicitly say "I know ZFS 
isn't going to do all the data replication" when you run zpool than to 
find out later you aren't protected (by ZFS anyway).


--
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large device support

2006-07-17 Thread Cindy Swearingen

Hi Julian,

Can you send me the documentation pointer that says 2 TB isn't supported
on the Solaris 10 6/06 release?

The  2 TB limit was lifted in the Solaris 10 1/06 release, as described
here:

http://docs.sun.com/app/docs/doc/817-5093/6mkisoq1j?a=view#ftzen

Thanks,

Cindy



J.P. King wrote:

Well, if in fact sd/ssd with EFI labels still have a limit of 2TB, then
create an SMI label with one slice representing the whole disk and then put
zfs on that slice. Then manually turn on the write cache.



Well, in fact it turned out that the firmware on the device needed 
upgrading to support the appropriate SCSI extensions.


The documentation is still wrong in that it suggests that the ssd/sd 
driver shouldn't work with >2TB, but I am happy, so no problem.



Julian
--
Julian King
Computer Officer, University of Cambridge, Unix Support
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large device support

2006-07-17 Thread J.P. King

Well, if in fact sd/ssd with EFI labels still have a limit of 2TB, then
create an SMI label with one slice representing the whole disk and then put
zfs on that slice. Then manually turn on the write cache.


Well, in fact it turned out that the firmware on the device needed 
upgrading to support the appropriate SCSI extensions.


The documentation is still wrong in that it suggests that the ssd/sd 
driver shouldn't work with >2TB, but I am happy, so no problem.



Julian
--
Julian King
Computer Officer, University of Cambridge, Unix Support
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large device support

2006-07-17 Thread J.P. King

Well, if in fact sd/ssd with EFI labels still have a limit of 2TB, then
create an SMI label with one slice representing the whole disk and then put
zfs on that slice. Then manually turn on the write cache.


How do you suggest that I create a slice representing the whole disk?
format (with or without -e) only sees 1.19TB


Robertmailto:[EMAIL PROTECTED]


Julian
--
Julian King
Computer Officer, University of Cambridge, Unix Support
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Enabling compression/encryption on a populated filesystem

2006-07-17 Thread Luke Scharf




Darren J Moffat wrote:
But the
real question is how do you tell the admin "it's done, now the filesystem
is safe".   With compression you don't generally care if some old stuff
didn't compress (and with the current implementation it has to compress
a certain amount or it gets written uncompressed anyway).  With
encryption the human admin really needs to be told.

As a sysadmin, I'd be happy with another scrub-type command.  Something
with the following meaning: 
"Reapply all block-level properties such as compression,
encryption, and checksum to every block in the volume.  Have the admin
come back tomorrow and run 'zpool status' to see if it's done."  

Mad props if I can do this on a live filesystem (like the other ZFS
commands, which also get mad props for being good tools).

A natural command for this would be something like "zfs blockscrub
tank/volume".  Also, "zpool blockscrub tank" would make sense to me as
well, even though it might touch more data.

Of course, it's easy for me to just say this, since I'm not thinking
about the implementation very deeply...

-Luke





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large device support

2006-07-17 Thread Robert Milkowski
Hello J.P.,

Monday, July 17, 2006, 2:15:56 PM, you wrote:

JPK> Possibly not the right list, but the only appropriate one I knew about.

JPK> I have a Solaris box (just reinstalled to Sol 10 606) with a 3.19TB device
JPK> hanging off it, attached by fibre.

JPK> Solaris refuses to see this device except as a 1.19 TB device.

JPK> Documentation that I have found 
JPK> 
(http://docs.sun.com/app/docs/doc/817-5093/6mkisoq1k?a=view#disksconcepts-17)
JPK> Suggests that this will not work with the ssd or sd drives.

JPK> Is this really the case?  If that isn't supported, what is?

Well, if in fact sd/ssd with EFI labels still have a limit of 2TB, then
create an SMI label with one slice representing the whole disk and then put
zfs on that slice. Then manually turn on the write cache.
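Roughly, the sequence would look like this (just a sketch - the interactive
format steps are abbreviated, and the disk name is a placeholder):

format -e c3t0d0          # label -> choose SMI, then partition -> make slice 0 cover the whole disk
zpool create data c3t0d0s0
format -e c3t0d0          # cache -> write_cache -> enable (ZFS won't manage the cache on a slice)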


-- 
Best regards,
 Robertmailto:[EMAIL PROTECTED]
   http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: zvol Performance

2006-07-17 Thread Neil Perrin

This is change request:

6428639 large writes to zvol synchs too much, better cut down a little

which I have a fix for, but it hasn't been put back.

Neil.

Jürgen Keil wrote On 07/17/06 04:18,:

Further testing revealed
that it wasn't an iSCSI performance issue but a zvol
issue.  Testing on a SATA disk locally, I get these
 numbers (sequential write):

UFS: 38MB/s
ZFS: 38MB/s
Zvol UFS: 6MB/s
Zvol Raw: ~6MB/s

ZFS is nice and fast but Zvol performance just drops
off a cliff.  Suggestion or observations by others
using zvol would be extremely helpful.   



# zfs create -V 1g data/zvol-test
# time dd if=/data/media/sol-10-u2-ga-x86-dvd.iso \
of=/dev/zvol/rdsk/data/zvol-test bs=32k count=10000
10000+0 records in
10000+0 records out
0.08u 9.37s 2:21.56 6.6%

That's ~ 2.3 MB/s.

I do see *frequent* DKIOCFLUSHWRITECACHE ioctls
(one flush write cache ioctl after writing ~36KB of data, needs ~6-7 
milliseconds per flush):


  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02778, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 5736778 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e027c0, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6209599 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02808, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6572132 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02850, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6732316 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02898, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6175876 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e028e0, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6251611 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02928, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 7756397 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02970, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6393356 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e029b8, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6147003 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02a00, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6247036 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02a48, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6061991 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02a90, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6284297 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02ad8, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6174818 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02b20, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6245923 
nsec, error 0



dtrace with stack backtraces:


  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5d1ec10, count 9000
  0  39404  zio_ioctl:entry
  zfs`zil_flush_vdevs+0x144
  zfs`zil_commit+0x311
  zfs`zvol_strategy+0x4bc
  genunix`default_physio+0x308
  genunix`physio+0x1d
  zfs`zvol_write+0x22
  genunix`cdev_write+0x25
  specfs`spec_write+0x4d6
  genunix`fop_write+0x2e
  genunix`write+0x2ae
  unix`sys_sysenter+0x104

  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6638189 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5d1ec58, count 9000
  0  39404  zio_ioctl:entry
  zfs`zil_flush_vdevs+0x144
  zfs`zil_commit+0x311
  zfs`zvol_strategy+0x4bc
  genunix`default_physio+0x308
  genunix`physio+0x1d
  zfs`zvol_write+0x22
  genunix`cdev_write+0x25
  specfs`spec_write+0x4d6
  genunix`fop_write+0x2e
  genunix`write+0x2ae
  unix`sys_sysenter+0x104

  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 7881400 
nsec, error 0


===

[zfs-discuss] ZFS benchmarks w/8 disk raid - Quirky results, any thoughts?

2006-07-17 Thread Jonathan Wheeler
Hi All,

I've just built an 8 disk zfs storage box, and I'm in the testing phase before 
I put it into production. I've run into some unusual results, and I was hoping 
the community could offer some suggestions. I've basically made the switch to 
Solaris on the promises of ZFS alone (yes I'm that excited about it!), so 
naturally I'm looking forward to some great performance - but it appears I'm 
going to need some help finding all of it.

I was having even lower numbers with filebench, so I decided to dial back to a 
really simple app for testing - bonnie.

The system is an nevada_41 EM64T 3ghz xeon. 1GB ram, with 8x seagate sata II 
300GB disks, Supermicro SAT2-MV8 8 port sata controller, running at/on a 133Mhz 
64pci-x bus.
The bottleneck here, by my thinking, should be the disks themselves.
It's not the disk interfaces ('300MB'), the disk bus (300MB EACH), or the pci-x 
bus (1.1GB), and I'd hope a 64-bit 3Ghz cpu would be sufficient.

Tests were run on a fresh clean zpool, on an idle system. Rogue results were 
dropped, and as you can see below, all tests were run more than once. 8GB 
should be far more than the 1GB of RAM that the system has, eliminating caching 
issues.

If I've still managed to overlook something in my testing setup, please let me 
know - I sure did try!

Sorry about the formatting - this is bound to end up ugly

Bonnie
  ---Sequential Output ---Sequential Input-- --Random--
  -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
raid0MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
8 disk   8196 78636 93.0 261804 64.2 125585 25.6 72160 95.3 246172 19.1 286.0  
2.0
8 disk   8196 79452 93.9 286292 70.2 129163 26.0 72422 95.5 243628 18.9 302.9  
2.1

so ~270MB/sec writes - awesome! 240MB/sec reads though - why would this be 
LOWER than writes??

  ---Sequential Output ---Sequential Input-- --Random--
  -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
mirror   MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
8 disk   8196 33285 38.6 46033  9.9 33077  6.8 67934 90.4  93445  7.7 230.5  
1.3 
8 disk   8196 34821 41.4 46136  9.0 32445  6.6 67120 89.1  94403  6.9 210.4  1.8

46MB/sec writes: each disk individually can do better, but I guess keeping 8 
disks in sync is hurting performance. The 94MB/sec reads figure is interesting. On 
the one hand, that's greater than 1 disk's worth, so I'm getting striping 
performance out of a mirror - GO ZFS. On the other, if I can get striping 
performance from mirrored reads, why is it only 94MB/sec? Seemingly it's not 
cpu bound.


Now for the important test, raid-z

  ---Sequential Output ---Sequential Input-- --Random--
  -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
raidz  MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
8 disk   8196 61785 70.9 142797 29.3 89342 19.9 64197 85.7 320554 32.6 131.3  
1.0
8 disk   8196 62869 72.4 131801 26.7 90692 20.7 63986 85.7 306152 33.4 127.3  
1.0
8 disk   8196 63103 72.9 128164 25.9 86175 19.4 64126 85.7 320410 32.7 124.5  
0.9
7 disk   8196 51103 58.8  93815 19.1 74093 16.1 64705 86.5 331865 32.8 124.9  
1.0
7 disk   8196 49446 56.8  93946 18.7 73092 15.8 64708 86.7 331458 32.7 127.1  
1.0
7 disk   8196 49831 57.1  81305 16.2 78101 16.9 64698 86.4 331577 32.7 132.4  
1.0
6 disk   8196 62360 72.3 157280 33.4 99511 21.9 65360 87.3 288159 27.1 132.7  
0.9
6 disk   8196 63291 72.8 152598 29.1 97085 21.4 65546 87.2 292923 26.7 133.4  
0.8
4 disk   8196 57965 67.9 123268 27.6 78712 17.1 66635 89.3 189482 15.9 134.1  
0.9

I'm getting distinctly non-linear scaling here.
Writes: 4 disks gives me 123MB/sec. Raid0 was giving me 270/8 = 33MB/sec with 
cpu to spare (roughly half of what each individual disk should be capable of). 
Here I'm getting 123/4 = 30MB/sec, or should that be 123/3 = 41MB/sec?
Using 30 as a baseline, I'd be expecting to see twice that with 8 disks 
(240ish?). What I end up with is ~135 - clearly not good scaling at all.
The really interesting numbers happen at 7 disks - it's slower than with 4, in 
all tests.
I ran it 3x to be sure.
Note this was a native 7 disk raid-z, it wasn't 8 running in degraded mode with 
7.
Something is really wrong with my write performance here across the board.

Reads: 4 disks gives me 190MB/sec. WOAH! I'm very happy with that. 8 disks 
should scale to 380 then. Well, 320 isn't all that far off - no biggie.
Looking at the 6 disk raidz is interesting though: 290MB/sec. The disks are 
good for 60+MB/sec individually. 290 is 48/disk - note also that this is better 
than my raid0 performance?!
Adding another 2 disks to my raidz gives me a mere 30MB/sec extra performance? 
Something is going very wrong here too.

The 7 disk raidz read test is about what I'd expect (330/7 = 47/disk), but it 
shows that the 8 disk is actually going backwards.

hmm...


I understand that goin

[zfs-discuss] Large device support

2006-07-17 Thread J.P. King


Possibly not the right list, but the only appropriate one I knew about.

I have a Solaris box (just reinstalled to Sol 10 606) with a 3.19TB device 
hanging off it, attached by fibre.


Solaris refuses to see this device except as a 1.19 TB device.

Documentation that I have found 
(http://docs.sun.com/app/docs/doc/817-5093/6mkisoq1k?a=view#disksconcepts-17)

Suggests that this will not work with the ssd or sd drives.

Is this really the case?  If that isn't supported, what is?

Before you ask, yes, we can divide the device up into LUNs and glue it 
back together again in Solaris, but that is a stupid solution, and if I 
were to buy a top of the line Sun storage solution with 330TB I wouldn't 
be pleased if I had to glue 165 virtual disks back together because of 
this.



Julian
--
Julian King
Computer Officer, University of Cambridge, Unix Support
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: howto reduce ?zfs introduced? noise

2006-07-17 Thread Thomas Maier-Komor
Thanks Robert,

that's exactly what I was looking for. I will try it when I come back home 
tomorrow. Is it possible to set this value in /etc/system, too?

Cheers,
Tom
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: zpool unavailable after reboot

2006-07-17 Thread Mikael Kjerrman
Jeff,

thanks for your answer, and I almost wish I had typed it wrong (the easy 
explanation being that I messed up :-) but from what I can tell I did get it right

--- zpool commands I ran ---
bash-3.00# grep zpool /.bash_history 
zpool
zpool create data raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c2t0d0 c2t1d0 c2t2d0 
c2t3d0 c2t4d0
zpool list
zpool status
zpool iostat 3
zpool scrub data
zpool status
bash-3.00#



The other problem I have with this is: why did it kick the disk out? I can 
run all sorts of tests on the disk and it is perfectly fine... does it kick out a 
random disk upon boot? ;-)
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: zvol Performance

2006-07-17 Thread Jürgen Keil
> Further testing revealed
> that it wasn't an iSCSI performance issue but a zvol
> issue.  Testing on a SATA disk locally, I get these
>  numbers (sequential write):
> 
> UFS: 38MB/s
> ZFS: 38MB/s
> Zvol UFS: 6MB/s
> Zvol Raw: ~6MB/s
> 
> ZFS is nice and fast but Zvol performance just drops
> off a cliff.  Suggestion or observations by others
> using zvol would be extremely helpful.   

# zfs create -V 1g data/zvol-test
# time dd if=/data/media/sol-10-u2-ga-x86-dvd.iso \
of=/dev/zvol/rdsk/data/zvol-test bs=32k count=10000
10000+0 records in
10000+0 records out
0.08u 9.37s 2:21.56 6.6%

That's ~ 2.3 MB/s.

I do see *frequent* DKIOCFLUSHWRITECACHE ioctls
(one flush write cache ioctl after writing ~36KB of data, needs ~6-7 
milliseconds per flush):


  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02778, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 5736778 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e027c0, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6209599 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02808, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6572132 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02850, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6732316 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02898, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6175876 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e028e0, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6251611 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02928, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 7756397 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02970, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6393356 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e029b8, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6147003 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02a00, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6247036 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02a48, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6061991 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02a90, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6284297 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02ad8, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6174818 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5e02b20, count 9000
  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6245923 
nsec, error 0



dtrace with stack backtraces:


  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5d1ec10, count 9000
  0  39404  zio_ioctl:entry
  zfs`zil_flush_vdevs+0x144
  zfs`zil_commit+0x311
  zfs`zvol_strategy+0x4bc
  genunix`default_physio+0x308
  genunix`physio+0x1d
  zfs`zvol_write+0x22
  genunix`cdev_write+0x25
  specfs`spec_write+0x4d6
  genunix`fop_write+0x2e
  genunix`write+0x2ae
  unix`sys_sysenter+0x104

  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6638189 
nsec, error 0
  0  12308  bdev_strategy:entry edev 1980047, flags 1080101, bn 
5d1ec58, count 9000
  0  39404  zio_ioctl:entry
  zfs`zil_flush_vdevs+0x144
  zfs`zil_commit+0x311
  zfs`zvol_strategy+0x4bc
  genunix`default_physio+0x308
  genunix`physio+0x1d
  zfs`zvol_write+0x22
  genunix`cdev_write+0x25
  specfs`spec_write+0x4d6
  genunix`fop_write+0x2e
  genunix`write+0x2ae
  unix`sys_sysenter+0x104

  0  38530   vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 7881400 
nsec, error 0


=
#!/usr/sbin/dtrace -s


BEGIN 
{
DKIOC = 0x04 << 8;
DKIOCFLUSHWRITECACHE = DKIOC|34;
}


fbt::bdev_strategy:entry
{
bp = (struct buf *)arg0;
 

Re: [zfs-discuss] zpool unavailable after reboot

2006-07-17 Thread Michael Schuster - Sun Microsystems

Jeff Bonwick wrote:


One option -- I confess up front that I don't really like it -- would be
to make 'unreplicated' an explicit replication type (in addition to
mirror and raidz), so that you couldn't get it by accident:

zpool create data unreplicated A B C

>


The extra typing would be annoying, 


to address the "extra typing": would it be such a bad idea to offer the 
option of .. erm ... options, thus:


zpool create pool [-u|-z|-m|unreplicated|mirror|raidz|..] vdev ...

in addition to the "long" keywords?

Michael
--
Michael Schuster  (+49 89) 46008-2974 / x62974
visit the online support center:  http://www.sun.com/osc/

Recursion, n.: see 'Recursion'
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool unavailable after reboot

2006-07-17 Thread Jeff Bonwick
> I have a 10 disk raidz pool running Solaris 10 U2, and after a reboot
> the whole pool became unavailable after apparently losing a disk drive.
> [...]
> NAMESTATE READ WRITE CKSUM
> dataUNAVAIL  0 0 0  insufficient replicas
>   c1t0d0ONLINE   0 0 0
> [...]
>   c1t4d0UNAVAIL  0 0 0  cannot open
> --
> 
> The problem as I see it is that the pool should be able to handle
> 1 disk error, no?

If it were a raidz pool, that would be correct.  But according to
zpool status, it's just a collection of disks with no replication.
Specifically, compare these two commands:

(1) zpool create data A B C

(2) zpool create data raidz A B C

Assume each disk has 500G capacity.

The first command will create an unreplicated pool with 1.5T capacity.
The second will create a single-parity RAID-Z pool with 1.0T capacity.
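One quick way to see which you actually have is the shape of the 'zpool status'
config: with RAID-Z the disks are grouped under a raidz vdev rather than sitting
directly under the pool (illustrative output only; the labels vary a bit by
release):

  # zpool status data
  ...
          NAME        STATE     READ WRITE CKSUM
          data        ONLINE       0     0     0
            raidz     ONLINE       0     0     0
              A       ONLINE       0     0     0
              B       ONLINE       0     0     0
              C       ONLINE       0     0     0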

My guess is that you intended the latter, but actually typed the former,
perhaps assuming that RAID-Z was always present.  If so, I apologize for
not making this clearer.  If you have any suggestions for how we could
improve the zpool(1M) command or documentation, please let me know.

One option -- I confess up front that I don't really like it -- would be
to make 'unreplicated' an explicit replication type (in addition to
mirror and raidz), so that you couldn't get it by accident:

zpool create data unreplicated A B C

The extra typing would be annoying, but would make it almost impossible
to get the wrong behavior by accident.

Jeff

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zpool unavailable after reboot

2006-07-17 Thread Mikael Kjerrman
Hi,

so it happened...

I have a 10 disk raidz pool running Solaris 10 U2, and after a reboot the whole 
pool became unavailable after apparently losing a disk drive. (The drive is 
seemingly ok as far as I can tell from other commands.)

--- bootlog ---
Jul 17 09:57:38 expprd fmd: [ID 441519 daemon.error] SUNW-MSG-ID: ZFS-8000-CS, 
TYPE: Fault, VER: 1, SEVERITY: Major
Jul 17 09:57:38 expprd EVENT-TIME: Mon Jul 17 09:57:38 MEST 2006
Jul 17 09:57:38 expprd PLATFORM: SUNW,UltraAX-i2, CSN: -, HOSTNAME: expprd
Jul 17 09:57:38 expprd SOURCE: zfs-diagnosis, REV: 1.0
Jul 17 09:57:38 expprd EVENT-ID: e2fd61f7-a03d-6279-d5a5-9b8755fa1af9
Jul 17 09:57:38 expprd DESC: A ZFS pool failed to open.  Refer to 
http://sun.com/msg/ZFS-8000-CS for more information.
Jul 17 09:57:38 expprd AUTO-RESPONSE: No automated response will occur.
Jul 17 09:57:38 expprd IMPACT: The pool data is unavailable
Jul 17 09:57:38 expprd REC-ACTION: Run 'zpool status -x' and either attach the 
missing device or
Jul 17 09:57:38 expprd  restore from backup.
---

--- zpool status -x ---
bash-3.00# zpool status -x
  pool: data
 state: FAULTED
status: One or more devices could not be opened.  There are insufficient
replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
dataUNAVAIL  0 0 0  insufficient replicas
  c1t0d0ONLINE   0 0 0
  c1t1d0ONLINE   0 0 0
  c1t2d0ONLINE   0 0 0
  c1t3d0ONLINE   0 0 0
  c2t0d0ONLINE   0 0 0
  c2t1d0ONLINE   0 0 0
  c2t2d0ONLINE   0 0 0
  c2t3d0ONLINE   0 0 0
  c2t4d0ONLINE   0 0 0
  c1t4d0UNAVAIL  0 0 0  cannot open
--

The problem as I see it is that the pool should be able to handle 1 disk error, 
no?
And the online, attach, and replace commands don't work when the pool is 
unavailable. I've filed a case with Sun, but thought I'd ask around here to see 
if anyone has experienced this before.


cheers,

//Mikael
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zvol Performance

2006-07-17 Thread Ben Rockwood
Hello, 
  I'm curious if anyone would mind sharing their experiences with zvols.  I 
recently started using a zvol as an iSCSI backend and was surprised by the 
performance I was getting.  Further testing revealed that it wasn't an iSCSI 
performance issue but a zvol issue.  Testing on a SATA disk locally, I get 
these numbers (sequential write):

UFS: 38MB/s
ZFS: 38MB/s
Zvol UFS: 6MB/s
Zvol Raw: ~6MB/s

ZFS is nice and fast but zvol performance just drops off a cliff.  Suggestions 
or observations by others using zvols would be extremely helpful.   

My current testing is being done using a debug build of B44 (NV 6/10/06).

benr.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss