Re: [zfs-discuss] Re: ZFS RAID10

2006-08-11 Thread Roch


RM:
   I do not understand: why, in some cases with smaller blocks, could
   writing the block twice actually be faster than writing it once
   every time? I definitely am missing something here...

In addition to what Neil said, I want to add that

when an application's O_DSYNC write covers only part of a file
record, you have the choice of issuing a log I/O that
contains only the newly written data, or doing a full record I/O
(using the up-to-date cached record) along with a small log
I/O to match.

So if you do 8K writes to a file stored using 128K records,
you truly want each 8K write to go to the log and then,
every txg, take the state of the record and I/O that. You
certainly don't want to I/O 128K for every 8K write.

But then if you do a 100K write, it's not as clear a win.
Should I cough up the full 128K I/O now, hoping that the
record will not be modified further before the txg clock
hits? That's part of what goes into zfs_immediate_write_sz.
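
For anyone who wants to experiment with that tunable on a live
system, it can usually be read and changed with mdb. This is a rough
sketch only: the 100K value is arbitrary, the variable is an ssize_t
so the 8-byte formats are assumed for a 64-bit kernel, and the change
does not survive a reboot.

echo 'zfs_immediate_write_sz/E' | mdb -k              # read the current value
echo 'zfs_immediate_write_sz/Z 0t102400' | mdb -kw    # set it to 100K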

And even for full record writes, there are some block
allocation issues that come into play and complicate things
further.

-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: Re[2]: [zfs-discuss] Re: ZFS RAID10

2006-08-11 Thread Roch

Robert Milkowski writes:
  Hello Neil,
  
  Thursday, August 10, 2006, 7:02:58 PM, you wrote:
  
  NP Robert Milkowski wrote:
   Hello Matthew,
   
   Thursday, August 10, 2006, 6:55:41 PM, you wrote:
   
   MA On Thu, Aug 10, 2006 at 06:50:45PM +0200, Robert Milkowski wrote:
   
  btw: wouldn't it be possible to write the block only once (for synchronous
  I/O) and then just point to that block instead of copying it again?
   
   
   MA We actually do exactly that for larger (>32k) blocks.
   
   Why such a limit (32k)?
  
  NP By experimentation that was the cutoff where it was found to be
  NP more efficient. It was recently reduced from 64K with a more
  NP efficient dmu_sync() implementation.
  NP Feel free to experiment with the dynamically changeable tunable:
  
  NP ssize_t zfs_immediate_write_sz = 32768;
  
  
  I've just checked using dtrace on one of our production NFS servers: 90%
  of the time arg5 in zfs_log_write() is exactly 32768, and the rest is
  always smaller.
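
(For reference, a distribution like that can be gathered with a
one-liner along these lines, aggregating on the same arg5 position
mentioned above -- an untested sketch:)

dtrace -n 'fbt::zfs_log_write:entry { @len = quantize(arg5); }'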

Those should not be O_DSYNC though. Are they?

The I/O should be deferred to a subsequent COMMIT, but then
I'm not sure how it's handled.


-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Removing a device from a zfs pool

2006-08-11 Thread Louwtjie Burger
Hi there

Has any consideration been given to this feature...?

I would also agree that this will not only be a testing feature, but will
find its way into production.

It would probably work on the same principle as swap -a and swap -d ;) Just a
little bit more complex.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS performance using slices vs. entire disk?

2006-08-11 Thread Roch

Darren:

   With all of the talk about performance problems due to
   ZFS issuing a sync to force the drives to commit data
   to disk, how much of a benefit is this - especially
   for NFS?

I would not call those things problems; it's more about setting
proper expectations.

My understanding is that enabling the write cache helps by
providing I/O concurrency for drives that do not implement
some other form of command queuing. In other cases, WCE should
not buy much, if anything. I'd be interested in analysing any
case that shows otherwise...
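
For what it's worth, on many SCSI/FC drives the write cache state can
be inspected and toggled from format's expert mode; roughly like the
transcript below. The exact menu entries vary by drive and driver, so
treat this as a sketch, not a recipe.

format -e                 (select the disk when prompted)
format> cache
cache> write_cache
write_cache> display
write_cache> enable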

-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] user quotas vs filesystem quotas?

2006-08-11 Thread Jeff A. Earickson

Hi,

I'm looking at moving two quota-enabled UFS filesystems to ZFS under
Solaris 10 release 6/06, and the quota issue is gnarly.

One filesystem is user home directories and I'm aiming towards the
one zfs filesystem per user model, attempting to use Casper
Dik's auto_home script for on-the-fly zfs filesystem creation.
I'm having problems there, but that is an automounter issue, not
ZFS.

The other filesystem is /var/mail on my mail server.  I've traditionally
run (big) user quotas on mailboxes just in case some malicious
emailer tries to fill up /var/mail.  The notion of having
one zfs filesystem per mailbox seems unwieldy, just to run quotas
per user.

Are there any plans/schemes for per-user quotas within a ZFS filesystem,
akin to the UFS quotaon(1M) mechanism?  I take it that quotaon won't
work with a ZFS filesystem, right?  Suggestions please?  My notion 
right now is to drop quotas for /var/mail.
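
For reference, the one-filesystem-per-user approach amounts to
something like this (pool, user, and quota values here are made up):

zfs create pool/home/jsmith
zfs set quota=500m pool/home/jsmith
zfs get quota pool/home/jsmith

That scales fine for home directories, but as noted above it gets
unwieldy when every mailbox would need its own filesystem just to get
a per-user cap.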


Jeff Earickson
Colby College
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Lots of seeks?

2006-08-11 Thread Anton Rang


On Aug 9, 2006, at 8:18 AM, Roch wrote:




So while I'm feeling optimistic  :-) we really ought to be
  able to do this in two I/O operations. If we have, say, 500K
  of data to write (including all  of the metadata), we should
  be able  to allocate  a contiguous  500K  block on disk  and
  write  that with  a  single  operation.  Then we update  the
  Uberblock.

Hi Anton. Optimistic, a little, yes.

The data blocks should have aggregated quite well into
near-recordsize I/Os; are you sure they did not? No O_DSYNC
in here, right?


When I repeated this with just 512K written in 1K chunks via dd,
I saw six 16K writes.  Those were the largest.  The others were
around 1K-4K.  No O_DSYNC.

  dd if=/dev/zero of=xyz bs=1k count=512

So some writes are being aggregated, but we're missing a lot.
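
(For anyone who wants to reproduce this, the physical write sizes can
be watched with a DTrace io-provider one-liner along these lines -- an
untested sketch:)

dtrace -n 'io:::start /!(args[0]->b_flags & B_READ)/ { @["write bytes"] = quantize(args[0]->b_bcount); }'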


Once the data blocks are on disk we have the information
necessary to update the indirect blocks iteratively up to
the ueberblock. Those are the smaller I/Os; I guess that
because of ditto blocks they go to physically separate
locations, by design.


We shouldn't have to wait for the data blocks to reach disk,
though.  We know where they're going in advance.  One of the
key advantages of the überblock scheme is that we can, in a
sense, speculatively write to disk.  We don't need the tight
ordering that UFS requires to avoid security exposures and
allow the file system to be repaired.  We can lay out all of
the data and metadata, write them all to disk, choose new
locations if the writes fail, etc. and not worry about any
ordering or state issues, because the on-disk image doesn't
change until we commit it.

You're right, the ditto block mechanism will mean that some
writes will be spread around (at least when using a
non-redundant pool like mine), but then we should have at
most three writes followed by the überblock update, assuming
three degrees of replication.


All of these though are normally done asynchronously to
applications, unless the disks are flooded.


Which is a good thing (I think they're asynchronous anyway,
unless the cache is full).


But I follow you in that it may be remotely possible to
reduce the number of iterations in the process by assuming
that the I/O will all succeed, then, if some fails, fix up
the consequences and, when all done, update the ueberblock. I
would not hold my breath quite yet for that.


Hmmm.  I guess my point is that we shouldn't need to iterate
at all.  There are no dependencies between these writes; only
between the complete set of writes and the überblock update.

-- Anton

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: ZFS LVM and EVMS

2006-08-11 Thread Humberto Ramirez
Thanks for replying (I thought nobody would bother.)

So, if I understand correctly, I won't give up ANYTHING available in
EVMS, LVM, or Linux RAID by going to ZFS and RAID-Z.  Right?
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Proposal: zfs create -o

2006-08-11 Thread Eric Schrock
Following up on earlier mail, here's a proposal for create-time
properties.  As usual, any feedback or suggestions are welcome.

For those curious about the implementation, this finds its way all the
way down to the create callback, so that we can pick out true
create-time properties (e.g. volblocksize, future crypto properties).
The remaining properties are handled by the generic creation code.

- Eric

A. INTRODUCTION

A complicated ZFS installation will typically create a number of
datasets, each with their own property settings.  Currently, this
requires several steps, one for creating the dataset, and one for each
property that must be configured:

# zfs create pool/fs
# zfs set compression=on pool/fs
# zfs set mountpoint=/export pool/fs
...

This has several drawbacks, the first of which is simply unnecessary
steps.  For these complicated setups, it would be simpler to create the
dataset and all its properties at the same time.  This has been
requested by the ZFS community, and resulted in the following RFE:

6367103 create-time properties

More importantly, it forces the user to instantiate (and often mount)
the dataset before assigning properties.  In the case of the
'mountpoint' property, it means that we create an inherited mountpoint,
only to be later changed when the property is modified.  This also makes
setting the 'canmount' property (PSARC 2006/XXX) more intuitive.

This RFE is also required for crypto support, as the encryption
algorithm must be known when the filesystem is created.  It also has the
benefit of cleaning up the implementation of other creation-time
properties (volsize and volblocksize) that were previously special
cases.

B. DESCRIPTION

This case adds a new option, 'zfs create -o', which allows for any ZFS
property to be set at creation time.  Multiple '-o' options can appear
in the same subcommand.  Specifying the same property multiple times in
the same command results in an error.  For example:

# zfs create -o compression=on -o mountpoint=/export pool/fs

The option '-o' was chosen over '-p' (for 'property') to reserve this
for a future RFE:

6290249 zfs {create,clone,rename} -p to create parents

The functionality of 'zfs create -b' has been superseded by this new
option, though it will be retained for backwards compatibility.  There
is no plan to formally obsolete or remove this option.  For example:

# zfs create -b 16k -V 10M pool/vol

is equivalent to

# zfs create -o volblocksize=16k -V 10M pool/vol

If '-o volblocksize' is specified in addition to '-b', the resulting
behavior is undefined.

C. MANPAGE CHANGES

TBD

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS LVM and EVMS

2006-08-11 Thread Eric Schrock
No, there are some features we haven't implemented, that may or may not
be available in other RAID solutions.  In particular:

- A ZFS storage pool cannot be 'shrunk', i.e. you cannot remove an entire
  toplevel device (mirror, RAID group, etc.).  Individual devices can be
  removed by attaching and detaching them from existing mirrors, but you
  cannot shrink the overall size of the pool.

- ZFS RAID-Z stripes cannot be expanded.  ZFS storage pools are all
  dynamically striped across all device groups.  So you can add a new
  RAID-Z group (going from one (5+1) group to two (5+1) groups, for
  example; see the sketch below), but you cannot expand an existing
  stripe from (5+1) to (6+1).
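
For example, growing a pool by adding a second RAID-Z group rather
than widening the existing one looks roughly like this (device names
are made up):

zpool create tank raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0   # one 5+1 group
zpool add tank raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0      # now two 5+1 groups

The pool grows and stripes across both groups, but the original 5+1
group itself never gets any wider.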

There are likely other features that are different and/or missing from
other solutions, so it's a little extreme to say you won't give up
ANYTHING.  But in terms of large scale features, there's not much
besides the two above, and remember that you have a lot to gain ;-)

- Eric

On Fri, Aug 11, 2006 at 09:28:58AM -0700, Humberto Ramirez wrote:
 Thanks for replying (I thought nobody would bother.)

 So, if I understand correctly, I won't give up ANYTHING available in
 EVMS, LVM, or Linux RAID by going to ZFS and RAID-Z.  Right?
  
  

--
Eric Schrock, Solaris Kernel Development   http://blogs.sun.com/eschrock
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Difficult to recursive-move ZFS filesystems to another server

2006-08-11 Thread Brad Plecs
Just wanted to point this out -- 

I have a large web tree that used to have UFS user quotas on it.  I converted
to ZFS using the model that each user has their own ZFS filesystem with a
quota instead.  I worked around some NFS/automounter issues, and it now seems
to be working fine.

Except now I have to move it to another server.  The problem is that there
doesn't appear to be any recursive dump/restore command that lets me do this
easily.  'zfs send' and 'zfs receive' only appear to work within filesystem
boundaries.

What I want to do is move all of zfspool/www from server A to server B. 

Each user filesystem underneath zfspool/www: 

 zfspool/www/user-joe
 zfspool/www/user-john
 zfspool/www/user-mary 

...has a unique quota assigned to it. 

There doesn't appear to be a way to move zfspool/www and its descendants en
masse to a new machine with those quotas intact.  I have to script the
recreation of all of the descendant filesystems by hand.

I can move the *data* with tar or rsync easily enough, but it seems silly
that I have to recreate all the descendant filesystems and their
characteristics by hand.
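
In the meantime the recreation can at least be scripted rather than
typed by hand.  A rough, untested sketch: it assumes 'zfs list -H -r -o'
works as on current builds, that the pool already exists on server B,
and that the top-level zfspool/www line and any 'none' quotas get
filtered or edited out.

zfs list -H -r -o name,quota zfspool/www | while read name quota
do
        echo "zfs create $name"
        echo "zfs set quota=$quota $name"
done > recreate-www.sh

Review recreate-www.sh, run it on server B, and then tar or rsync the
data into the recreated filesystems.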

I know the comprehensive dump subject has been brought up before... I'd like
to reiterate a suggestion that it'd be nice if the various commands (zfs
send/receive, zfs snapshot) could optionally include a filesystem's
descendants.  If zfs send could do this and included the filesystem quotas,
it might solve this issue.

Or maybe I'm missing something?
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Lots of seeks?

2006-08-11 Thread Jonathan Adams
On Fri, Aug 11, 2006 at 11:04:06AM -0500, Anton Rang wrote:
 Once the data blocks are on disk we have the information
 necessary to update the indirect blocks iteratively up to
 the ueberblock. Those are the smaller I/Os; I guess that
 because of ditto blocks they go to physically separate
 locations, by design.
 
 We shouldn't have to wait for the data blocks to reach disk,
 though.  We know where they're going in advance.  One of the
 key advantages of the überblock scheme is that we can, in a
 sense, speculatively write to disk.  We don't need the tight
 ordering that UFS requires to avoid security exposures and
 allow the file system to be repaired.  We can lay out all of
 the data and metadata, write them all to disk, choose new
 locations if the writes fail, etc. and not worry about any
 ordering or state issues, because the on-disk image doesn't
 change until we commit it.

 You're right, the ditto block mechanism will mean that some
 writes will be spread around (at least when using a
 non-redundant pool like mine), but then we should have at
 most three writes followed by the überblock update, assuming
 three degrees of replication.

The problem is that you don't know the actual *contents* of the parent block
until *all* of its children have been written to their final locations.
(This is because the block pointer's value depends on the final location.)
The ditto blocks don't really affect this, since they can all be written
out in parallel.

So you end up with the current N phases: data, its parents, their
parents, ..., uberblock.

 But I follow you in that it may be remotely possible to
 reduce the number of iterations in the process by assuming
 that the I/O will all succeed, then, if some fails, fix up
 the consequences and, when all done, update the ueberblock. I
 would not hold my breath quite yet for that.
 
 Hmmm.  I guess my point is that we shouldn't need to iterate
 at all.  There are no dependencies between these writes; only
 between the complete set of writes and the überblock update.

Again, there is a dependency: if a block write fails, you have to re-write it
and all of its parents.  So the best you could do would be:

1. assign locations for all blocks, and update the space bitmaps
   as necessary.
2. update all of the non-Uberdata blocks with their actual
   contents (which requires calculating checksums on all of the
   child blocks).
3. write everything out in parallel.
3a. if any write fails, re-do 1+2 for that block, and 2 for all of its
    parents, then start over at 3 with all of the changed blocks.
4. once everything is on stable storage, update the uberblock.

That's a lot more complicated than the current model, but certainly seems
possible.

Cheers,
- jonathan

(this is only my understanding of how ZFS works;  I could be mistaken)


-- 
Jonathan Adams, Solaris Kernel Development
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Lots of seeks?

2006-08-11 Thread Anton Rang

On Aug 11, 2006, at 12:38 PM, Jonathan Adams wrote:

The problem is that you don't know the actual *contents* of the parent
block until *all* of its children have been written to their final
locations.  (This is because the block pointer's value depends on the
final location.)


But I know where the children are going before I actually write them.
There is a dependency of the parent's contents on the *address* of its
children, but not on the actual write.  We can compute everything that
we are going to write before we start to write.

(Yes, in the event of a write failure we have to recover; but that's
very rare, and can easily be handled -- we just start over, since no
visible state has been changed.)

The ditto blocks don't really affect this, since they can all be written
out in parallel.


The reason they affect my desire of turning the update into a two-phase
commit (make all the changes, then update the überblock) is that the
ditto blocks are deliberately spread across the disk, so we can't collect
them into a single write (for a non-redundant pool, or at least a one-disk
pool -- presumably they wind up on different disks for a two-disk pool,
in which case we can still do a single write per disk).


Again, there is a dependency: if a block write fails, you have to re-write it
and all of its parents.  So the best you could do would be:

1. assign locations for all blocks, and update the space bitmaps
   as necessary.
2. update all of the non-Uberdata blocks with their actual
   contents (which requires calculating checksums on all of the
   child blocks).
3. write everything out in parallel.
3a. if any write fails, re-do 1+2 for that block, and 2 for all of its
    parents, then start over at 3 with all of the changed blocks.
4. once everything is on stable storage, update the uberblock.

That's a lot more complicated than the current model, but certainly seems
possible.


(3a could actually be simplified to simply mark the bad blocks as
unallocatable, and go to 1, but it's more efficient as you describe.)

The eventual advantage, though, is that we get the performance of a single
write (plus, always, the überblock update).  In a heavily loaded system,
the current approach (lots of small writes) won't scale so well.  (Actually
we'd probably want to limit the size of each write to some small value,
like 16 MB, simply to allow the first write to start earlier under fairly
heavy loads.)

As I pointed out earlier, this would require getting scatter/gather support
through the storage subsystem, but the potential win should be quite large.

Something to think about for the future.  :-)

Incidentally, this is part of how QFS gets its performance for streaming
I/O.  We use an allocate-forward policy, allow very large allocation
blocks, and separate the metadata from data.  This allows us to write (or
read) data in fairly large I/O requests, without unnecessary disk head
motion.

Anton

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SPEC SFS97 benchmark of ZFS,UFS,VxFS

2006-08-11 Thread eric kustarz

Leon Koll wrote:


On 8/11/06, eric kustarz [EMAIL PROTECTED] wrote:


Leon Koll wrote:

 ...

 So having 4 pools isn't a recommended config - I would destroy those 4
 pools and just create 1 RAID-0 pool:
 #zpool create sfsrocks c4t00173801014Bd0 c4t00173801014Cd0
 c4t001738010140001Cd0 c4t0017380101400012d0

 each of those devices is a 64GB lun, right?


 I did it - created one pool, 4*64GB in size, and I'm running the benchmark
 now.  I'll update you on the results, but one pool is definitely not what I
 need.  My target is SunCluster with HA-ZFS, where I need 2 or 4 pools per
 node.


Why do you need 2 or 4 pools per node?

If you're doing HA-ZFS (which is SunCluster 3.2 - only available in beta
right now), then you should divide your storage up according to the number of



I know, I run the 3.2  now.


*active* pools.  So say you have 2 nodes and 4 luns (each lun being
64GB), and only need one active node - then you can create one pool of



Having only one active node doesn't look smart to me. I want to distribute
the load between 2 nodes, not have 1 active and 1 standby.
The LUN size in this test is 64GB, but in the real configuration it will be
6TB.



all 4 luns, and attach the 4 luns to both nodes.

The way HA-ZFS basically works is that when the active node fails, it
does a 'zpool export', and the takeover node does a 'zpool import'.  So
both nodes are using the same storage, but they cannot use the same
storage at the same time, see:
http://www.opensolaris.org/jive/thread.jspa?messageID=49617



Yes, it works this way.



If however, you have 2 nodes, 4 luns, and wish both nodes to be active,
then you can divvy up the storage into two pools.  So each node has one
active pool of 2 luns.  All 4 luns are doubly attached to both nodes,
and when one node fails, the takeover node then has 2 active pools.
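
(Using the four LUNs from earlier in the thread, that two-active-pool
split would look roughly like this -- a sketch, pool names arbitrary:)

zpool create tank1 c4t00173801014Bd0 c4t00173801014Cd0
zpool create tank2 c4t001738010140001Cd0 c4t0017380101400012d0

with tank1 active on one node and tank2 on the other, and all four LUNs
cabled to both nodes for failover.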



I agree with you - I can have 2 active pools, not 4 in case of
dual-node cluster.



So how many nodes do you have? and how many do you wish to be active
at a time?



Currently, 2 nodes, both active. If I define 4 pools, I can easily
expand the cluster to a 4-node configuration; that may be a good
reason to have 4 pools.



Ok, that makes sense.



And what was your configuration for VxFS and SVM/UFS?



4 SVM concat volumes (I need a concatenation of 1TB LUNs since I am on
SC3.1, which doesn't support EFI labels) with UFS or VxFS on top.



So you have 2 nodes, 2 file systems (of either UFS or VxFS) on each node?

I'm just trying to make sure it's a fair comparison between ZFS, UFS, and
VxFS.




And now come the questions: my short test showed that the 1-pool config
doesn't behave better than the 4-pool one - with the first, the box
hung; with the second, it didn't.
Why do you think the 1-pool config is better?



I suggested the 1-pool config before I knew you were doing HA-ZFS :)
Purposely dividing up your storage (by creating separate pools) in a
non-clustered environment usually doesn't make sense (root being one
notable exception).


eric
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Proposal expand raidz

2006-08-11 Thread Brad Plecs
Just a data point -- our NetApp filer actually creates additional raid groups
that are added to the greater pool when you add disks, much as zfs does now.
They aren't simply used to expand the one large raid group of the volume.
I've been meaning to rebuild the whole thing to get the use of the multiple
parity disks back.

Ours is a few years old and isn't running the latest software rev, so maybe
they've overcome this now, but I thought I'd mention it.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Difficult to recursive-move ZFS filesystems to another server

2006-08-11 Thread Matthew Ahrens
On Fri, Aug 11, 2006 at 10:02:41AM -0700, Brad Plecs wrote:
 There doesn't appear to be a way to move zfspool/www and its
 decendants en masse to a new machine with those quotas intact.  I have
 to script the recreation of all of the descendant filesystems by hand. 

Yep, you need

6421959 want zfs send to preserve properties ('zfs send -p')
6421958 want recursive zfs send ('zfs send -r')

--matt
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Looking for motherboard/chipset experience, again

2006-08-11 Thread David Dyer-Bennet

What about the Asus M2N-SLI Deluxe motherboard?  It has 7 SATA ports,
supports ECC memory, socket AM2, generally looks very attractive for
my home storage server.  Except that it, and the nvidia nForce 570-SLI
it's built on, don't seem to be on the HCL.  I'm hoping that's just
because nobody has reported it yet.  Anybody run Solaris on it?  Or at least on
any nForce 570-SLI board?  Would you risk buying it to find out
yourself?

I've heard rumors of ZFS in one of the more obscure Linuxes, perhaps
Ubuntu; I suppose that could be a backup plan if I try and Solaris
doesn't work.

I have the general feeling that Linux runs on anything I can buy
today, pretty much, since I've been using it for over a decade and am
somewhat plugged into the community.  I don't yet have the impression
that Solaris runs on most anything, possibly after tracking down a few
drivers.  Does it, really?  Should I not be worrying about this so
much?
--
David Dyer-Bennet, mailto:[EMAIL PROTECTED], http://www.dd-b.net/dd-b/
RKBA: http://www.dd-b.net/carry/
Pics: http://www.dd-b.net/dd-b/SnapshotAlbum/
Dragaera/Steven Brust: http://dragaera.info/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] user quotas vs filesystem quotas?

2006-08-11 Thread Frank Cusack

On August 11, 2006 10:31:50 AM -0400 Jeff A. Earickson [EMAIL PROTECTED] 
wrote:

Suggestions please?


Ideally you'd be able to move to mailboxes in $HOME instead of /var/mail.

-frank
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Question on Zones and memory usage (65120349)

2006-08-11 Thread Jeff Victor
Follow-up: it looks to me like prstat displays the portion of the system's 
physical memory in use by the processes in that zone.


How much memory does that system have?  Something seems amiss, as a V490 can hold 
up to 32GB, and prstat is showing 163GB of physical memory just for fmtest.



Irma Garcia wrote:

Hi All,

Sun Fire V440
Solaris 10
Solaris Resource Manager

Customer wrote the following:

I have a v490 with 4 zones:

tsunami:/#-zoneadm list -iv
ID NAME STATUS PATH
0 global running /
4 fmstage running /fmstage
12 fmprod running /fmprod
15 fmtest running /fmtest

fmtest has a pool assigned to it with access
to 2 cpus. When I run prstat -Z in the
fmtest zone I see:

ZONEID NPROC SIZE RSS MEMORY TIME CPU ZONE
15 192 169G 163G 100% 0:29:55 96% fmtest

on the global zone (tsunami) I see with
prstat -Z:

ZONEID NPROC SIZE RSS MEMORY TIME CPU ZONE
15 188 169G 163G 100% 0:46:00 48% fmtest
0 54 708M 175M 0.1% 2:23:40 0.1% global
12 27 112M 51M 0.0% 0:02:48 0.0% fmprod
4 27 281M 66M 0.0% 0:14:13 0.0% fmstage

Questions:
Does the 100% memory usage on each mean that
the fmtest zone is using all the memory? How
come when I run the top command I see a
different result for memory usage?
What is the best method to tie a certain
percentage of memory to certain zones - rcapd??




Thanks in Advance
Irma


-

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--
--
Jeff VICTOR  Sun Microsystemsjeff.victor @ sun.com
OS AmbassadorSr. Technical Specialist
Solaris 10 Zones FAQ:http://www.opensolaris.org/os/community/zones/faq
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Unreliable ZFS backups or....

2006-08-11 Thread Frank Cusack

On August 11, 2006 5:25:11 PM -0700 Peter Looyenga [EMAIL PROTECTED] wrote:

I looked into backing up ZFS and quite honestly I can't say I am convinced
about its usefulness here when compared to the traditional ufsdump/restore.
While snapshots are nice, they can never substitute for offline backups.


It doesn't seem to me that they are meant to.


However, while you can make one using 'zfs send', it somewhat worries me that
the only way to perform a restore is by restoring the entire filesystem
(/snapshot).  I somewhat shudder at the thought of having to restore
/export/home this way to retrieve but a single file/directory.


You can mitigate this by creating more granular filesystems, e.g. a
filesystem per user homedir.  This has other advantages like per-user
quotas.

-frank
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Question on Zones and memory usage (65120349)

2006-08-11 Thread Mike Gerdts

On 8/11/06, Irma Garcia [EMAIL PROTECTED] wrote:

ZONEID NPROC SIZE RSS MEMORY TIME CPU ZONE
15 188 169G 163G 100% 0:46:00 48% fmtest
0 54 708M 175M 0.1% 2:23:40 0.1% global
12 27 112M 51M 0.0% 0:02:48 0.0% fmprod
4 27 281M 66M 0.0% 0:14:13 0.0% fmstage

Questions?
Does the 100% memory usage on each mean that
the fmtest zone is using all the memory. How
come when I run the top command I see
different result for memory usage.


The %mem column is the sum of the %mem that each process uses.
Unfortunately, that value seems to include the pages that are shared
between many processes (e.g. database files, libc, etc.) without
dividing by the number of processes that have that memory mapped.  In
other words, if you have 50 database processes that have used mmap()
on the same 1 GB database, prstat will think that 50 GB of RAM is used
when only 1 GB is really used.
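
One way to see how much of a single process's RSS is actually private
is pmap; in 'pmap -x' output the Anon column is roughly the memory not
backed by a shared file mapping (the PID below is made up):

pmap -x 12345

The per-mapping lines show which shared segments (libraries, mapped
database files) are inflating the RSS total.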

I have seen prstat report that Oracle workloads on a 15k domain are
using well over a terabyte of memory.  This is kinda hard to do on a
domain with ~300 GB of RAM plus 50 GB of swap.


What is the best method to tie a certain
percentage of memory to certain zones - rcapd??


I *think* that rcapd suffers from the same problem that prstat does
and may cause undesirable behavior.  Because of the way that it works,
I fully expect that if rcapd begins to force pages out, the paging
activity for the piggy workload will cause severe performance
degradation for everything on the machine.  My personal opinion (not
backed by extensive testing) is that rcapd is more likely to do
harm than good.

If the workload that you are trying to control is java-based, consider
using the various java flags to limit heap size.  This will not
protect you against memory leaks in the JVM, but it will protect
against a misbehaving app.  The same is likely true for the stack
size.

If the workload you are trying to control is some other single
process, consider using ulimit to limit the stack and heap size.

Set the size= option for all tmpfs file systems (an example vfstab entry
is sketched below).

Bug the folks that are working on memory sets and swap sets to get
this code out sooner than later.

If running on sun4v, consider LDOM's when they are available (November?).
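
For the tmpfs suggestion above, the cap goes in /etc/vfstab, along
these lines (512m is an arbitrary value):

#device         device          mount   FS      fsck    mount   mount
#to mount       to fsck         point   type    pass    at boot options
swap            -               /tmp    tmpfs   -       yes     size=512m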

Mike

--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss