Re: [zfs-discuss] MySQL, Lustre and ZFS

2008-02-07 Thread Atul Vidwansa
Not sure why you would want these three together, but Lustre and ZFS will
work together in Lustre 1.8. ZFS will be the backend filesystem
for Lustre servers. See:
http://wiki.lustre.org/index.php?title=Lustre_OSS/MDS_with_ZFS_DMU

Cheers,
-Atul

On Feb 7, 2008 8:39 AM, kilamanjaro [EMAIL PROTECTED] wrote:
 Hi all, any thoughts on if and when ZFS, MySQL, and Lustre 1.8 (and
 beyond) will work together and be supported as such by Sun?

 - Network Systems Architect
Advanced Digital Systems Internet




-- 
Atul Vidwansa
Cluster File Systems Inc.
http://www.clusterfs.com


[zfs-discuss] ZFS on Solaris and Mac Leopard

2008-02-07 Thread Klas Heggemann
For some time now, I have had a ZFS pool, created (if I
remember correctly) on my x86 OpenSolaris box
with ZFS version 6, and have it accessible on
my Leopard Mac. I ran the ZFS beta on the Leopard beta
with no problems at all. I've now installed the latest ZFS read/write build
on my Leopard and it works nicely read/write on my MacBook.

It is a pool consisting of one whole disk. On Leopard, 'zpool import' says:

# zpool import
  pool: space
id: 123931456072276617
 state: ONLINE
status: The pool is formatted using an older on-disk version.
action: The pool can be imported using its name or numeric identifier, though
some features will not be available without an explicit 'zpool upgrade'.
config:

space   ONLINE
  disk3 ONLINE

However, on Solaris (5.11 snv_64a, sun4u SPARC) the pool is found but
considered damaged:
# zpool import
  pool: space
id: 123931456072276617
 state: FAULTED
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
   see: http://www.sun.com/msg/ZFS-8000-5E
config:

space   UNAVAIL   insufficient replicas
  c4t0d0s0  UNAVAIL   corrupted data

and the disk is reported 'unknown' by format.


Anyone seen something similar? The pool is still version 6, since
I'd like to use it on standard Leopard, which has read-only ZFS
limited to version 6. Could a cure be to upgrade (the OS and/or ZFS)?
 
 


[zfs-discuss] Is swap still needed on c0d0s1 to get crash dumps?

2008-02-07 Thread Roman Morokutti
Lori Alt writes in the netinstall README that a slice 
should be available for crash dumps. In order to get
this done the following line should be defined within
the profile:

filesys c0[t0]d0s1 auto swap

So my question is: is this still needed, and how do I
access a crash dump if one happens?

Roman
 
 


Re: [zfs-discuss] ZFS Performance Issue

2008-02-07 Thread William Fretts-Saxton
I just installed nv82 so we'll see how that goes.  I'm going to try the 
recordsize idea above as well.

A note about UFS:  I was told by our local admin guru that ZFS turns on
write caching for disks, which is something a UFS file system should not
have turned on, so if I convert the ZFS filesystem to a UFS one, I could be
giving UFS an unrealistic performance boost because the write cache would
still be enabled.
 
 


Re: [zfs-discuss] ZFS Performance Issue

2008-02-07 Thread William Fretts-Saxton
Unfortunately, I don't know the record size of the writes.  Is it as simple as
looking at the size of a file before and after a client request and noting the
difference in size?  This is binary data, so I don't know if that makes a
difference, but the average write size is a lot smaller than the file size.

Should the recordsize be in place BEFORE data is written to the file system, or 
can it be changed after the fact?  I might try a bunch of different settings 
for trial and error.

The I/O is actually done by RRD4J, which is a round-robin database library.  It 
is a Java version of 'rrdtool' which saves data into a binary format, but also 
cleans up the data according to its age, saving less of the older data as 
time goes on.
 
 


Re: [zfs-discuss] zpool destroy core dumps with unavailable iscsi device

2008-02-07 Thread Tim Foster
Hi Ross,

On Thu, 2008-02-07 at 08:30 -0800, Ross wrote:
 While playing around with ZFS and iSCSI devices I've managed to remove
 an iscsi target before removing the zpool.  Now any attempt to delete
 the pool (with or without -f) core dumps zpool.
 
 Any ideas how I get rid of this pool?

Yep, here's one way:  zpool export the other pools on the system, then
delete /etc/zfs/zpool.cache, reboot the machine, then do a zpool import
for each of the other pools you want to keep.
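
For reference, that sequence looks something like this (pool names here are
just examples; 'tank' and 'data' stand in for the pools you want to keep):

# zpool export tank
# zpool export data
# rm /etc/zfs/zpool.cache        (forgets every pool, including the broken one)
# reboot
# zpool import tank
# zpool import data

The broken pool simply never gets imported again, since nothing references it
once the cache file is gone.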

cheers,
tim
-- 
Tim Foster, Sun Microsystems Inc, Solaris Engineering Ops
http://blogs.sun.com/timf



[zfs-discuss] zpool destroy core dumps with unavailable iscsi device

2008-02-07 Thread Ross
While playing around with ZFS and iSCSI devices I've managed to remove an iscsi 
target before removing the zpool.  Now any attempt to delete the pool (with or 
without -f) core dumps zpool.

Any ideas how I get rid of this pool?
 
 


[zfs-discuss] NFS device IDs for snapshot filesystems

2008-02-07 Thread A Darren Dunham
I notice that files within a snapshot show a different deviceID to stat
than the parent file does.  But this is not true when mounted via NFS.

Is this a limitation of the NFS client, or just what the ZFS fileserver
is doing?

Will this change in the future?  With NFS4 mirror mounts?
-- 
Darren Dunham   [EMAIL PROTECTED]
Senior Technical Consultant TAOShttp://www.taos.com/
Got some Dr Pepper?   San Francisco, CA bay area
  This line left intentionally blank to confuse you. 


Re: [zfs-discuss] ZFS Performance Issue

2008-02-07 Thread William Fretts-Saxton
To avoid making multiple posts, I'll just write everything here:

-Moving to nv_82 did not seem to do anything, so it doesn't look like fsync was
the issue.
-Disabling the ZIL didn't do anything either.
-Still playing with 'recsize' values, but it doesn't seem to be doing much... I
don't think I have a good understanding of what exactly is being written... I
think the whole file might be overwritten each time because it's in binary format.
-Setting zfs_nocacheflush, though, got me drastically increased
throughput--client requests took, on average, less than 2 seconds each!

So, in order to use this, I should have a storage array, w/battery backup, 
instead of using the internal drives, correct?  I have the option of using a 
6120 or 6140 array on this system so I might just try that out.
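
For reference, that tunable is set system-wide in /etc/system and needs a
reboot to take effect (this is the commonly documented form; double-check the
name against your build):

* /etc/system
set zfs:zfs_nocacheflush = 1

# echo zfs_nocacheflush/D | mdb -k     (verify after reboot; should print 1)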
 
 


[zfs-discuss] Lost intermediate snapshot; incremental backup still possible?

2008-02-07 Thread Ian
I keep my system synchronized to a USB disk from time to time.  The script 
works by sending incremental snapshots to a pool on the USB disk, then deleting 
those snapshots from the source machine.

A botched script ended up deleting a snapshot that was not successfully 
received on the USB disk.  Now, I've lost the ability to send incrementally 
since the intermediate snapshot is lost.  From what I gather, if I try to send 
a full snapshot, it will require deleting and replacing the dataset on the USB 
disk.  Is there any way around this?
 
 


Re: [zfs-discuss] ZFS Performance Issue

2008-02-07 Thread William Fretts-Saxton
Slight correction.  'recsize' must be a power of 2 so it would be 8192.
 
 


Re: [zfs-discuss] ZFS Performance Issue

2008-02-07 Thread William Fretts-Saxton
One thing I just observed is that the initial file size is 65796 bytes.  When
it gets an update, the file size remains at 65796.

Is there a minimum file size?
 
 


Re: [zfs-discuss] Hardware RAID vs. ZFS RAID

2008-02-07 Thread Jesus Cea

John-Paul Drawneek wrote:
| I guess a USB pendrive would be slower than a
| harddisk. Bad performance
| for the ZIL.

A decent pendrive of mine writes at 3-5MB/s. Sure there are faster
ones, but any desktop harddisk can write at 50MB/s.

If you are *not* talking about consumer grade pendrives, I can't comment.

--
Jesus Cea Avion _/_/  _/_/_/_/_/_/
[EMAIL PROTECTED] http://www.argo.es/~jcea/ _/_/_/_/  _/_/_/_/  _/_/
jabber / xmpp:[EMAIL PROTECTED] _/_/_/_/  _/_/_/_/_/
~   _/_/  _/_/_/_/  _/_/  _/_/
Things are not so easy  _/_/  _/_/_/_/  _/_/_/_/  _/_/
My name is Dump, Core Dump   _/_/_/_/_/_/  _/_/  _/_/
El amor es poner tu felicidad en la felicidad de otro - Leibniz


Re: [zfs-discuss] Hardware RAID vs. ZFS RAID

2008-02-07 Thread Andy Lubel
With my (COTS) LSI 1068- and 1078-based controllers I get consistently
better performance when I export all disks as JBOD (MegaCli -CfgEachDskRaid0).

I even went through all the loops and hoops with 6120's, 6130's and
even some SGI storage, and the result was always the same: better
performance exporting single disks than even the ZFS profiles within
CAM.

---
'pool0':
#zpool create pool0 mirror c2t0d0 c2t1d0
#zpool add pool0 mirror c2t2d0 c2t3d0
#zpool add pool0 mirror c2t4d0 c2t5d0
#zpool add pool0 mirror c2t6d0 c2t7d0

'pool2':
#zpool create pool2 raidz c3t8d0 c3t9d0 c3t10d0 c3t11d0
#zpool add pool2 raidz c3t12d0 c3t13d0 c3t14d0 c3t15d0


I have really learned not to do it this way with raidz and raidz2:

#zpool create pool2 raidz c3t8d0 c3t9d0 c3t10d0 c3t11d0 c3t12d0  
c3t13d0 c3t14d0 c3t15d0


So when is Thumper going to have an all-SAS option? :)


-Andy


On Feb 7, 2008, at 2:28 PM, Joel Miller wrote:

 Much of the complexity in hardware RAID is in the fault detection,  
 isolation, and management.  The fun part is trying to architect a  
 fault-tolerant system when the suppliers of the components can not  
 come close to enumerating most of the possible failure modes.

 What happens when a drive's performance slows down because it is  
 having to go through internal retries more than others?

 What layer gets to declare a drive dead? What happens when you start
 declaring the drives dead one by one because they all seemed to stop
 responding but the problem is not really the drives?

 Hardware RAID systems attempt to deal with problems that are not always
 straightforward... Hopefully we will eventually get similar functionality
 in Solaris...

 Understand that I am a proponent of ZFS, but everything has its use.

 -Joel





Re: [zfs-discuss] zfs send / receive between different opensolaris versions?

2008-02-07 Thread Albert Lee

On Wed, 2008-02-06 at 13:42 -0600, Michael Hale wrote:
 Hello everybody,
 
 I'm thinking of building out a second machine as a backup for our mail  
 spool where I push out regular filesystem snapshots, something like a  
 warm/hot spare situation.
 
 Our mail spool is currently running snv_67 and the new machine would  
 probably be running whatever the latest opensolaris version is (snv_77  
 or later).
 
 My first question is whether or not zfs send / receive is portable  
 between differing releases of opensolaris.  My second question (kind  
 of off topic for this list) is that I was wondering the difficulty  
 involved in upgrading snv_67 to a later version of opensolaris given  
 that we're running a zfs root boot configuration


For your first question, zfs(1) says:

 zfs upgrade [-r] [-V version] [-a | filesystem]

 Upgrades file systems to a  new  on-disk  version.  Once
 this  is done, the file systems will no longer be acces-
 sible on systems running older versions of the software.
 zfs send streams generated from new snapshots of these
 file systems can not  be  accessed  on  systems  running
 older versions of the software.

The format of the stream depends only on the zfs filesystem version
at the time of the snapshot, and since stream formats are backwards
compatible, a system with newer zfs bits can always receive an older
snapshot. The current filesystem version is 3 (not to be confused with
the zpool version, which is at 10), so it's unlikely to have changed recently.
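
To make that concrete, here is a sketch of the direction that always works,
sending from the older (snv_67) box to the newer one.  Pool, dataset and host
names are made up:

# zfs snapshot spool/mail@2008-02-07
# zfs send spool/mail@2008-02-07 | ssh backuphost zfs receive -d backup

# zfs snapshot spool/mail@2008-02-08
# zfs send -i spool/mail@2008-02-07 spool/mail@2008-02-08 | \
    ssh backuphost zfs receive -d backup

Going the other way (a stream from newer-version filesystems onto older bits)
is where the warning above applies.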


The officially supported method for upgrading a zfs boot system is to
BFU (which upgrades ON but breaks package support). However, you should
be able to do an in-place upgrade with the zfs_ttinstall wrapper for
ttinstall (the Solaris text installer). This means booting from CD/DVD
(or netbooting) and then running the script:
http://opensolaris.org/jive/thread.jspa?threadID=46588&tstart=255
You will have to edit it to fit your zfs layout.

 --
 Michael Hale
 [EMAIL PROTECTED]
 Manager of Engineering Support, Enterprise Engineering Group
 Transcom Enhanced Services
 http://www.transcomus.com


-Albert





Re: [zfs-discuss] nfs exporting nested zfs

2008-02-07 Thread Nicolas Williams
On Thu, Feb 07, 2008 at 01:54:58PM -0800, Andrew Tefft wrote:
 Let's say I have a zfs called pool/backups and it contains two
 zfs'es, pool/backups/server1 and pool/backups/server2
 
 I have sharenfs=on for pool/backups and it's inherited by the
 sub-zfs'es. I can then nfs mount pool/backups/server1 or
 pool/backups/server2, no problem.
 
 If I mount pool/backups on a system running Solaris Express build 81,

The NFSv3 client, and the NFSv4 client up to some older snv build (I
forget which), will *not* follow the sub-mounts that exist on the server
side.

In recent snv builds the NFSv4 client will follow the sub-mounts that
exist on the server side.

If you use the -hosts automount map (/net) then the NFSv3 client and
older NFSv4 clients will mount the server-side sub-mounts, but only as
they existed when the automount was made.
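
So on a recent build, explicitly asking for NFSv4 on the client is enough to
get the sub-mounts followed.  A sketch, with server name and paths as
placeholders:

# mount -F nfs -o vers=4 server:/pool/backups /mnt
# ls /mnt/server1      (the sub-filesystem is mounted on first traversal)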

Nico
-- 


Re: [zfs-discuss] ZFS Performance Issue

2008-02-07 Thread William Fretts-Saxton
RRD4J isn't a DB, per se, so it doesn't really have a record size.  In fact,
I don't even know whether data written to the binary file is contiguous, so
the amount written may not directly correlate to a proper record size.

I did run your command and found the size patterns you were talking about:

      462  java                 409
     3320  java                 409
     6819  java                 409
        5  java                1227
        1  java                1692
       16  java                3243

409 is the number of clients I tested, so I assume it means the largest write 
it makes is 6819.  Is that bits or bytes?

Does that mean I should try setting my recordsize equal to the lowest multiple 
of 512 GREATER than 6819? (14 x 512 = 7168)
 
 


[zfs-discuss] nfs exporting nested zfs

2008-02-07 Thread Andrew Tefft
Let's say I have a zfs called pool/backups and it contains two zfs'es, 
pool/backups/server1 and pool/backups/server2

I have sharenfs=on for pool/backups and it's inherited by the sub-zfs'es. I can 
then nfs mount pool/backups/server1 or pool/backups/server2, no problem.

If I mount pool/backups on a system running Solaris Express build 81, I can see 
the contents of pool/backups/server1 and pool/backups/server2 as I'd expect. 
But when I mount pool/backups on Solaris 10 or Solaris 8, I just see empty 
directories for server1 and server2. And if I actually write there, the files 
go in /pool/backups (and they can be seen on the nfs server if I unmount the 
sub-zfs'es). And that's extra bad because if I reboot the nfs server, the 
sub-zfs'es fail to mount because their mountpoints are not empty, and so it 
won't come up in multi-user.

(the whole idea here is that I really want just the one nfs mount, but I want 
to be able to separate the data into separate zfs'es).

So why does this work with the build 81 nfs client, and not others, and is it 
possible to make it work? Right now the number of sub-zfs'es is only a handful 
so I can mount them individually, but it's not the way I want it to work.
 
 


Re: [zfs-discuss] ZFS Performance Issue

2008-02-07 Thread Sanjeev Bagewadi
William,

It should be fairly easy to find the record size using DTrace. Take an
aggregation of the writes happening (aggregate on size for all the write(2)
system calls).

This would give a fair idea of the I/O size pattern.

Does RRD4J have a record size setting? Usually, if it is a database
application, there is a record-size option when the DB is created (based on
my limited knowledge of DBs).

Thanks and regards,
Sanjeev.

PS : Here is a simple script which just aggregates on the write size and 
executable name :
-- snip --
#!/usr/sbin/dtrace -s


syscall::write:entry
{
        /* aggregate the count of write(2) calls by size and executable */
        this->wsize = (size_t)arg2;
        @write[this->wsize, execname] = count();
}
-- snip --
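
To use it (the file name is arbitrary), run it while the client load is
applied and interrupt it to print the aggregation:

# dtrace -s write_size.d
  ... let the requests run for a minute or two, then press Ctrl-C ...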

William Fretts-Saxton wrote:
 Unfortunately, I don't know the record size of the writes.  Is it as simple 
 as looking @ the size of a file, before and after a client request, and 
 noting the difference in size?  This is binary data, so I don't know if that 
 makes a difference, but the average write size is a lot smaller than the file 
 size.  

 Should the recordsize be in place BEFORE data is written to the file system, 
 or can it be changed after the fact?  I might try a bunch of different 
 settings for trial and error.

 The I/O is actually done by RRD4J, which is a round-robin database library.  
 It is a Java version of 'rrdtool' which saves data into a binary format, but 
 also cleans up the data according to its age, saving less of the older data 
 as time goes on.
  
  


-- 
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel:x27521 +91 80 669 27521 



[zfs-discuss] UFS on zvol Cache Questions...

2008-02-07 Thread Brad Diggs
Hello,

I have a unique deployment scenario where the marriage
of ZFS zvol and UFS seem like a perfect match.  Here are
the list of feature requirements for my use case:

* snapshots
* rollback
* copy-on-write
* ZFS level redundancy (mirroring, raidz, ...)
* compression
* filesystem cache control (control what's in and out)
* priming the filesystem cache (dd if=file of=/dev/null)
* control the upper boundary of RAM consumed by the
  filesystem.  This helps me to avoid contention between
  the filesystem cache and my application.

Before zfs came along, I could achieve all but rollback,
copy-on-write and compression through UFS+some volume manager.

I would like to use ZFS but with ZFS I cannot prime the cache
and I don't have the ability to control what is in the cache 
(e.g. like with the directio UFS option).

If I create a ZFS zvol and format it as a UFS filesystem, it
seems like I get the best of both worlds.  Can anyone poke 
holes in this strategy?
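
For what it's worth, a minimal sketch of the setup (pool name, volume name,
size and mount point are all placeholders):

# zfs create -V 64g tank/ufsvol
# zfs set compression=on tank/ufsvol
# newfs /dev/zvol/rdsk/tank/ufsvol
# mount -o forcedirectio /dev/zvol/dsk/tank/ufsvol /export/app
# zfs snapshot tank/ufsvol@baseline

Note the snapshot is taken at the zvol level, underneath UFS, so for a
consistent image you would want to quiesce or lock the UFS filesystem first.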

I think the biggest possible risk factor is if the ZFS zvol
still uses the ARC.  If this is the case, I may be
double-dipping on the filesystem cache: the UFS filesystem
uses some RAM and the ZFS zvol uses some RAM for filesystem cache.
Is this a true statement, or does the zvol use a minimal amount
of system RAM?

Lastly, if I were to try this scenario, does anyone know how to
monitor the RAM consumed by the zvol and UFS?  e.g. Is there a
dtrace script for monitoring ZFS or UFS memory consumption?
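
Two quick checks that should give a rough picture (not a precise per-zvol
accounting): the ARC size kstat, and ::memstat in mdb:

# kstat -p zfs:0:arcstats:size      (current ARC size in bytes; zvol data is cached there)
# echo ::memstat | mdb -k           (kernel / anon / page cache / free breakdown)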

Thanks in advance,
Brad



Re: [zfs-discuss] Hardware RAID vs. ZFS RAID

2008-02-07 Thread Joel Miller
Much of the complexity in hardware RAID is in the fault detection, isolation, 
and management.  The fun part is trying to architect a fault-tolerant system 
when the suppliers of the components can not come close to enumerating most of 
the possible failure modes.

What happens when a drive's performance slows down because it is having to go 
through internal retries more than others?

What layer gets to declare a drive dead? What happens when you start declaring
the drives dead one by one because they all seemed to stop responding but the
problem is not really the drives?

Hardware RAID systems attempt to deal with problems that are not always
straightforward... Hopefully we will eventually get similar functionality in
Solaris...

Understand that I am a proponent of ZFS, but everything has its use.

-Joel
 
 


Re: [zfs-discuss] nfs exporting nested zfs

2008-02-07 Thread Cindy . Swearingen
Because of the mirror mount feature that was integrated into Solaris
Express build 77.

You can read about it on page 20 of the ZFS Admin Guide:

http://opensolaris.org/os/community/zfs/docs/zfsadmin.pdf

Cindy

Andrew Tefft wrote:
 Let's say I have a zfs called pool/backups and it contains two zfs'es, 
 pool/backups/server1 and pool/backups/server2
 
 I have sharenfs=on for pool/backups and it's inherited by the sub-zfs'es. I 
 can then nfs mount pool/backups/server1 or pool/backups/server2, no problem.
 
 If I mount pool/backups on a system running Solaris Express build 81, I can 
 see the contents of pool/backups/server1 and pool/backups/server2 as I'd 
 expect. But when I mount pool/backups on Solaris 10 or Solaris 8, I just see 
 empty directories for server1 and server2. And if I actually write there, the 
 files go in /pool/backups (and they can be seen on the nfs server if I 
 unmount the sub-zfs'es). And that's extra bad because if I reboot the nfs 
 server, the sub-zfs'es fail to mount because their mountpoints are not empty, 
 and so it won't come up in multi-user.
 
 (the whole idea here is that I really want just the one nfs mount, but I want 
 to be able to separate the data into separate zfs'es).
 
 So why does this work with the build 81 nfs client, and not others, and is it 
 possible to make it work? Right now the number of sub-zfs'es is only a 
 handful so I can mount them individually, but it's not the way I want it to 
 work.
  
  


Re: [zfs-discuss] Is swap still needed on c0d0s1 to get crash dumps?

2008-02-07 Thread Richard Elling
Roman Morokutti wrote:
 Lori Alt writes in the netinstall README that a slice 
 should be available for crash dumps. In order to get
 this done the following line should be defined within
 the profile:

 filesys c0[t0]d0s1 auto swap

 So my question is, is this still needed and how to
 access a crash dump if it happened?
   

The dumpadm command can be used to manage dump devices.
The key is that today, you can't use ZFS for a dump device.  So if you
want to collect dumps, you'll need to use a non-ZFS device to do so.
For many people, the historic use is a swap device on slice 1 which
is also the default dump device.  So yes, this will work, but no it is
not a requirement.
 -- richard
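
A quick sketch of checking the setup and pulling a dump out afterwards
(the device name is just an example):

# dumpadm                            (show dump device, savecore directory, enabled/disabled)
# dumpadm -d /dev/dsk/c0t0d0s1       (point the dump device at the swap slice, if needed)
# savecore /var/crash/`uname -n`     (after the panic reboot, if savecore didn't run automatically)
# mdb unix.0 vmcore.0                (examine the saved dump)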



Re: [zfs-discuss] ZFS Performance Issue

2008-02-07 Thread johansen
 -Still playing with 'recsize' values but it doesn't seem to be doing
 much...I don't think I have a good understand of what exactly is being
 written...I think the whole file might be overwritten each time
 because it's in binary format.

The other thing to keep in mind is that the tunables like compression
and recsize only affect newly written blocks.  If you have a bunch of
data that was already laid down on disk and then you change the tunable,
this will only cause new blocks to have the new size.  If you experiment
with this, make sure all of your data has the same blocksize by copying
it over to the new pool once you've changed the properties.
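
For example (dataset names are made up; 8K matches the power-of-two size
discussed elsewhere in this thread):

# zfs create pool/rrd-new
# zfs set recordsize=8k pool/rrd-new
# cp -rp /pool/rrd/. /pool/rrd-new/      (rewriting the files gives them the new block size)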

 -Setting zfs_nocacheflush, though got me drastically increased
 throughput--client requests took, on average, less than 2 seconds
 each!
 
 So, in order to use this, I should have a storage array, w/battery
 backup, instead of using the internal drives, correct?

zfs_nocacheflush should only be used on arrays with a battery backed
cache.  If you use this option on a disk, and you lose power, there's no
guarantee that your write successfully made it out of the cache.

A performance problem when flushing the cache of an individual disk
implies that there's something wrong with the disk or its firmware.  You
can disable the write cache of an individual disk using format(1M).  When you
do this, ZFS won't lose any data, whereas enabling zfs_nocacheflush can
lead to problems.

I'm attaching a DTrace script that will show the cache-flush times
per-vdev.  Remove the zfs_nocacheflush tuneable and re-run your test
while using this DTrace script.  If one particular disk takes longer
than the rest to flush, this should show us.  In that case, we can
disable the write cache on that particular disk.  Otherwise, we'll need
to disable the write cache on all of the disks.

The script is attached as zfs_flushtime.d

Use format(1M) with the -e option to adjust the write_cache settings for
SCSI disks.

-j
#!/usr/sbin/dtrace -Cs
/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the License).
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or http://www.opensolaris.org/os/licensing.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets [] replaced with your own identifying
 * information: Portions Copyright [] [name of copyright owner]
 *
 * CDDL HEADER END
 */

/*
 * Copyright 2008 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.
 */

#define DKIOC                   (0x04 << 8)
#define DKIOCFLUSHWRITECACHE    (DKIOC|34)

fbt:zfs:vdev_disk_io_start:entry
/(args[0]->io_cmd == DKIOCFLUSHWRITECACHE) && (self->traced == 0)/
{
        self->traced = args[0];
        self->start = timestamp;
}

fbt:zfs:vdev_disk_ioctl_done:entry
/args[0] == self->traced/
{
        @a[stringof(self->traced->io_vd->vdev_path)] =
            quantize(timestamp - self->start);
        self->start = 0;
        self->traced = 0;
}
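
To run the attached script (the -C in its #! line is needed so the #define
lines are preprocessed), leave it going during the test and interrupt it to
print the per-vdev flush-latency histograms:

# chmod +x zfs_flushtime.d
# ./zfs_flushtime.d                  (or: dtrace -Cs zfs_flushtime.d)
  ... run the client workload, then press Ctrl-C ...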



Re: [zfs-discuss] ? Removing a disk from a ZFS Storage Pool

2008-02-07 Thread James Andrewartha
Dave Lowenstein wrote:
 Couldn't we move fixing "panic the system if it can't find a LUN" up to
 the front of the line? That one really sucks.

That's controlled by the failmode property of the zpool, added in PSARC
2007/567, which was integrated in build 77.

-- 
James Andrewartha


Re: [zfs-discuss] ZFS Performance Issue

2008-02-07 Thread Vincent Fox
 -Setting zfs_nocacheflush, though got me drastically
 increased throughput--client requests took, on
 average, less than 2 seconds each!
 
 So, in order to use this, I should have a storage
 array, w/battery backup, instead of using the
 internal drives, correct?  I have the option of using
 a 6120 or 6140 array on this system so I might just
 try that out.

We use 3510 and 2540 arrays for Cyrus mail stores which hold about 10K accounts
each.  I recommend going with dual controllers, though, for safety.  Our setups
are really simple: put 2 array units on the SAN, make a pair of RAID-5 LUNs,
then RAID-10 these LUNs together in ZFS.
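
A sketch of that kind of layout (device names are invented; each ZFS mirror
pairs a RAID-5 LUN from one array with one from the other):

# zpool create mailstore mirror c6t0d0 c7t0d0
# zpool add mailstore mirror c6t1d0 c7t1d0     (more mirrored LUN pairs as needed)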
 
 


Re: [zfs-discuss] Hardware RAID vs. ZFS RAID

2008-02-07 Thread Kyle McDonald
Andy Lubel wrote:
 With my (COTS) LSI 1068 and 1078 based controllers I get consistently  
 better performance when I export all disks as JBOD (MegaCli -CfgEachDskRaid0).

   
Is that really 'all disks as JBOD', or is it 'each disk as a single-drive
RAID0'?

It may not sound different on the surface, but I asked in another thread
and others confirmed that if your RAID card has a battery-backed cache,
giving ZFS many single-drive RAID0's is much better than JBOD (using the
'nocacheflush' option may improve it even more).

My understanding is that it's kind of like the best of both worlds. You
get the higher number of spindles and vdevs for ZFS to manage, ZFS gets
to do the redundancy, and the HW RAID cache gives virtually instant
acknowledgement of writes, so that ZFS can be on its way.

So I think many RAID0's is not always the same as JBOD. That's not to
say that even true JBOD doesn't still have an advantage over HW RAID; I
don't know that for sure.

But I think there is a use for HW RAID in ZFS configs which wasn't 
always the theory I've heard.
 I have really learned not to do it this way with raidz and raidz2:

 #zpool create pool2 raidz c3t8d0 c3t9d0 c3t10d0 c3t11d0 c3t12d0  
 c3t13d0 c3t14d0 c3t15d0
   
Why? I know creating raidz's with more than 9-12 devices is discouraged, but
that doesn't cross that threshold.
Is there a reason you'd split 8 disks up into 2 groups of 4? What
experience led you to this?
(Just so I don't have to repeat it. ;) )

   -Kyle


Re: [zfs-discuss] ZFS Performance Issue

2008-02-07 Thread Daniel Cheng
William Fretts-Saxton wrote:
 Unfortunately, I don't know the record size of the writes.  Is it as simple 
 as looking @ the size of a file, before and after a client request, and 
 noting the difference in size?  This is binary data, so I don't know if that 
 makes a difference, but the average write size is a lot smaller than the file 
 size.  
 
 Should the recordsize be in place BEFORE data is written to the file system, 
 or can it be changed after the fact?  I might try a bunch of different 
 settings for trial and error.
 
 The I/O is actually done by RRD4J, which is a round-robin database library.  
 It is a Java version of 'rrdtool' which saves data into a binary format, but 
 also cleans up the data according to its age, saving less of the older data 
 as time goes on.
  

You should tune that at the application level; see
https://rrd4j.dev.java.net/ (down in the performance issues section).

Try the NIO backend and use a smaller (2048?) record size...

-- 
This space was intended to be left blank.
