[zfs-discuss] zpool import hangs

2009-07-08 Thread Nicholas
I am having trouble with a RAID-Z zpool, bigtank, of 5x 750GB drives that will
not import. After having some trouble with this pool, I exported it and attempted a
reimport, only to discover this issue:

I can see the pool by running zpool import, and the devices are all online.
However, running zpool import bigtank, with or without -f, simply causes the entire
system to hang; the keyboard and ssh both become unresponsive. I am very new to
Solaris and the *nix scene, so your help would be greatly appreciated. I am
currently running zdb -e -bcsvL bigtank to check checksums on the pool, but
this has yet to find anything wrong. I really need this data! Please walk
me through this one as best as you can.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Losts of small files vs fewer big files

2009-07-08 Thread Miles Nordin
 dt == Don Turnbull dturnb...@greenplum.com writes:

dt Any idea why this is?

maybe prefetch?

WAG, though.

dt I work with Greenplum which is essentially a number of
dt Postgres database instances clustered together.

haha, yeah I know who you are.  Too bad the open source postgres can't
do that. :/

coughAFFEROcough.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] surprisingly poor performance

2009-07-08 Thread James Andrewartha
James Lever wrote:
 
 On 07/07/2009, at 8:20 PM, James Andrewartha wrote:
 
 Have you tried putting the slog on this controller, either as an SSD or
 regular disk? It's supported by the mega_sas driver, x86 and amd64 only.
 
 What exactly are you suggesting here?  Configure one disk on this array
 as a dedicated ZIL?  Would that improve performance any over using all
 disks with an internal ZIL?

I was mainly thinking about using the battery-backed write cache to
eliminate the NFS latency. There's not much difference between an internal and a
dedicated ZIL if the disks are the same and on the same controller - the
dedicated-ZIL wins come from using SSDs and battery-backed cache.
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Separate_Log_Devices

 Is there a way to disable the write barrier in ZFS in the way you can
 with Linux filesystems (-o barrier=0)?  Would this make any difference?

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Cache_Flushes
might help if the RAID card is still flushing to disk when ZFS asks it to,
even though the data is already safe in the battery-backed cache.
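
For reference, the tunable that guide describes is usually set roughly like this
(a sketch; only safe when every device in the pool sits behind non-volatile,
battery-backed cache):

# echo "set zfs:zfs_nocacheflush = 1" >> /etc/system   (persistent, needs a reboot)
# echo zfs_nocacheflush/W0t1 | mdb -kw                 (live, current boot only)

With this set, ZFS stops issuing cache-flush commands, so the array's write
cache absorbs the synchronous NFS/ZIL writes instead of the disks.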

-- 
James Andrewartha | Sysadmin
Data Analysis Australia Pty Ltd
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] surprisingly poor performance

2009-07-08 Thread Peter Eriksson
You might want to try one thing I just noticed - wrap the log device inside an SVM
(disksuite) metadevice - it works wonders for the performance on my test server
(Sun Fire X4240)... I do wonder what the downsides might be (except for having
to fiddle with DiskSuite again). I.e.:

# zpool create TEST c1t12d0
# format c1t13d0
(Create a 4GB partition 0)
# metadb -f -a -c 3 c1t13d0s0
# metainit d0 1 1 c1t13d0s0
# zpool add TEST log /dev/md/dsk/d0

In my case the disks involved above are:
c1t12d0 146GB 10krpm SAS disk
c1t13d0 32GB Intel X25-E SLC SSD SATA disk

Without the log added, running 'gtar zxf emacs-22.3.tar.gz' over NFS from
another server takes 1:39.2 (almost 2 minutes). With c1t15d0s0 added as log it
takes 1:04.2, but with the same c1t15d0s0 added, wrapped inside an SVM
metadevice, the same operation takes 10.4 seconds...

1:39 vs 0:10 is a pretty good speedup I think...
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] surprisingly poor performance

2009-07-08 Thread Peter Eriksson
Oh, and for completeness: if I wrap 'c1t12d0s0' inside an SVM metadevice and
use that to create the TEST zpool (without a log), I run the same test command
in 36.3 seconds... I.e.:

# metadb -f -a -c3 c1t13d0s0
# metainit d0 1 1 c1t13d0s0
# metainit d2 1 1 c1t12d0s0
# zpool create TEST /dev/md/dsk/d2

If I then add a log to that device:

# zpool add TEST log /dev/md/dsk/d0

the same test (gtar zxf emacs-22.3.tar.gz) runs in 10.1 seconds...
(I.e., not much better than just using a raw disk + SVM-encapsulated log).
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [nfs-discuss] NFS, ZFS ESX

2009-07-08 Thread erik.ableson

Comments in line.

On 7 juil. 09, at 19:36, Dai Ngo wrote:

Without any tuning, the default TCP window size and send buffer size for NFS
connections is around 48KB, which is not very optimal for bulk transfer.
However, the 1.4MB/s write seems to indicate something else is seriously wrong.


My sentiment as well.

iSCSI performance was good, so the network connection seems to be OK (assuming
it's 1GbE).


Yup - I'm running at wire speed on the iSCSI connections.


What is your mount options look like?


Unfortunately, ESX doesn't give any controls over mount options

I don't know what the datastore browser does for copying files, but have you
tried the vanilla 'cp' command?


The datastore browser copy command is just a wrapper for cp, from what I can
gather. All types of copy operations to the NFS volume, even from other
machines, top out at this speed. The NFS/iSCSI connections are on a separate
physical network, so I can't easily plug anything into it to test other mount
options from another machine or OS. I'll try from another VM to see if I can
force a mount with the async option and whether that helps any.


You can also try NFS performance using tmpfs, instead of ZFS, to make sure the
NIC, protocol stack, and NFS are not the culprit.


From what I can observe, it appears that the sync commands issued over the NFS
stack are slowing down the process, even with a reasonable number of disks in
the pool.
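
For anyone who wants to run that tmpfs comparison, a minimal sketch (paths,
hostnames and the test client are illustrative):

server# mkdir /tmp/nfstest                    (/tmp is tmpfs-backed on Solaris)
server# share -F nfs -o rw,anon=0 /tmp/nfstest
client# mount server:/tmp/nfstest /mnt
client# dd if=/dev/zero of=/mnt/testfile bs=65536 count=16384   (~1GB write)

If the tmpfs-backed share is also slow, the problem is in the network/NFS path
rather than in ZFS or the ZIL.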


What I was hoping for was the same (albeit slightly risky) behavior of having
writes cached in RAM and then dumped out to disk in an optimal manner, as with
the local behavior where you see the flush-to-disk operations happening on a
regular cycle. I think this would be doable with an async mount, but I can't
set that on the server side, where it would be used by the servers
automatically.


Erik


erik.ableson wrote:
OK - I'm at my wit's end here as I've looked everywhere to find  
some means of tuning NFS performance with ESX into returning  
something acceptable using osol 2008.11.  I've eliminated  
everything but the NFS portion of the equation and am looking for  
some pointers in the right direction.


Configuration: PE2950 bi pro Xeon, 32Gb RAM with an MD1000 using a  
zpool of 7 mirror vdevs. ESX 3.5 and 4.0.  Pretty much a vanilla  
install across the board, no additional software other than the  
Adaptec StorMan to manage the disks.


local performance via dd - 463MB/s write, 1GB/s read (8Gb file)
iSCSI performance - 90MB/s write, 120MB/s read (800Mb file from a VM)
NFS performance - 1.4MB/s write, 20MB/s read (800Mb file from the  
Service Console, transfer of a 8Gb file via the datastore browser)


I just found the tool latencytop which points the finger at the ZIL  
(tip of the hat to Lejun Zhu).  Ref: http://www.infrageeks.com/zfs/nfsd.png 
  http://www.infrageeks.com/zfs/fsflush.png.  Log file: http://www.infrageeks.com/zfs/latencytop.log 



Now I can understand that there is a performance hit associated  
with this feature of ZFS for ensuring data integrity, but this  
drastic a difference makes no sense whatsoever. The pool is capable  
of handling natively (at worst) 120*7 IOPS and I'm not even seeing  
enough to saturate a USB thumb drive. This still doesn't answer why  
the read performance is so bad either.  According to latencytop,  
the culprit would be genunix`cv_timedwait_sig rpcmod`svc


From my searching it appears that there's no async setting for the  
osol nfsd, and ESX does not offer any mount controls to force an  
async connection.  Other than putting in an SSD as a ZIL (which  
still strikes me as overkill for basic NFS services) I'm looking  
for any information that can bring me up to at least reasonable  
throughput.


Would a dedicated 15K SAS drive help the situation by moving the  
ZIL traffic off to a dedicated device? Significantly? This is the  
sort of thing that I don't want to do without some reasonable  
assurance that it will help since you can't remove a ZIL device  
from a pool at the moment.


Hints and tips appreciated,

Erik
___
nfs-discuss mailing list
nfs-disc...@opensolaris.org




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Hanging receive

2009-07-08 Thread Andrew Robert Nicols
On Wed, Jul 08, 2009 at 08:43:17AM +1200, Ian Collins wrote:
 Ian Collins wrote:
 Brent Jones wrote:
On Fri, Jul 3, 2009 at 8:31 PM, Ian Collins i...@ianshome.com wrote:
  
 Ian Collins wrote:

 I was doing an incremental send between pools, the receive side is
 locked up and no zfs/zpool commands work on that pool.

 The stacks look different from those reported in the earlier ZFS
 snapshot send/recv hangs X4540 servers thread.

 Here is the process information from scat (other commands hanging on
 the pool are also in cv_wait):

   
 Has anyone else seen anything like this?  The box wouldn't even
 reboot, it had to be power cycled.  It locks up on receive regularly
 now.

 I hit this too:
 6826836

 Fixed in 117

http://opensolaris.org/jive/thread.jspa?threadID=104852&tstart=120
   
 I don't think this is the same problem (which is why I started a new
 thread); a single incremental set will eventually lock the pool up,
 pretty much guaranteed each time.

 One more data point: 

 This didn't happen when I had a single pool (stripe of mirrors) on the  
 server.  It started happening when I split the mirrors and created a  
 second pool built from 3 8 drive raidz2 vdevs.  Sending to the new pool  
 (either locally or from another machine) causes the hangs.

And here are my data points:

We were running two X4500s under Nevada 112 but came across this issue on
both of them. On receiving much data through a ZFS receive, they would lock
up. Any zpool or zfs commands would hang and were unkillable. The only way
to resolve the situation was to reboot without syncing disks. I reported
this in some posts back in April
(http://opensolaris.org/jive/click.jspa?searchID=2021762&messageID=368524)

One of them had an old enough zpool and zfs version to down/up/sidegrade to
Solaris 10 u6 and so I made this change.
The thumper running Solaris 10 is now mostly fine - it normally receives an
hourly snapshot with no problem.

The thumper running 112 has continued to experience the issues described by
Ian and others. I've just upgraded to 117 and am having even more issues -
I'm unable to receive or roll back snapshots; instead I see:

506 r...@thumper1:~ cat snap | zfs receive -vF thumperpool
receiving incremental stream of vlepool/m...@200906182000 into 
thumperp...@200906182000
cannot receive incremental stream: most recent snapshot of thumperpool does not
match incremental source

511 r...@thumper1:~ zfs rollback -r thumperpool/m...@200906181800
cannot destroy 'thumperpool/m...@200906181900': dataset already exists

As a result, I'm a bit scuppered. I'm going to try going back to my 112
installation instead to see if that resolves any of my issues.

All of our thumpers have the following disk configuration:
4 x 11 Disk raidz2 arrays with 2 disks as hot spares in a single pool.
2 disks in a mirror for booting.

When zpool locks up on the main pool, I'm still able to get a zpool status
on the boot pool. I can't access any data on the pool which is locked up.

Andrew

-- 
Systems Developer

e: andrew.nic...@luns.net.uk
im: a.nic...@jabber.lancs.ac.uk
t: +44 (0)1524 5 10147

Lancaster University Network Services is a limited company registered in
England and Wales. Registered number: 4311892. Registered office:
University House, Lancaster University, Lancaster, LA1 4YW


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Problem with mounting ZFS from USB drive

2009-07-08 Thread Darren J Moffat

Karl Dalen wrote:

I'm a new user of ZFS and I have an external USB drive which contains
a ZFS pool with file system. It seems that it does not get auto mounted
when I plug in the drive. I'm running osol-0811.

How can I manually mount this drive? It has a pool named rpool on it.
Is there any diagnostics commands that can be used to investigate the
contents of the pool or repair a damaged file system ?

rmformat shows that the physical name of the USB device is: /dev/rdsk/c4t0d0p0
If I try '# zpool import' I get:
  pool: rpool
id: 3765122753259138111
 state: UNAVAIL
status: The pool was last accessed by another system.
action: The pool cannot be imported due to damaged devices or data.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:
rpool   UNAVAIL  newer version
  c4t0d0s0  ONLINE


Did you try this:

zpool import -f rpool someothername

I think there are two reasons it won't import:
1) It was last accessed by another system (or maybe the same
 one but it had a different hostid at the time) so you need
 to use the -f flag.
2) There is probably another pool called rpool (the one you
are running from), right ?

--
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Hanging receive

2009-07-08 Thread Ian Collins

Andrew Robert Nicols wrote:

On Wed, Jul 08, 2009 at 08:43:17AM +1200, Ian Collins wrote:
  

Ian Collins wrote:


Brent Jones wrote:
  

On Fri, Jul 3, 2009 at 8:31 PM, Ian Collins i...@ianshome.com wrote:
 


Ian Collins wrote:
   
  

I was doing an incremental send between pools, the receive side is
locked up and no zfs/zpool commands work on that pool.

The stacks look different from those reported in the earlier ZFS
snapshot send/recv hangs X4540 servers thread.

Here is the process information from scat (other commands hanging on
the pool are also in cv_wait):

  


Has anyone else seen anything like this?  The box wouldn't even
reboot, it had to be power cycled.  It locks up on receive regularly
now.
  

I hit this too:
6826836

Fixed in 117

http://opensolaris.org/jive/thread.jspa?threadID=104852&tstart=120
  

I don't think this is the same problem (which is why I started a new
thread); a single incremental set will eventually lock the pool up,
pretty much guaranteed each time.


  
One more data point: 

This didn't happen when I had a single pool (stripe of mirrors) on the  
server.  It started happening when I split the mirrors and created a  
second pool built from 3 8 drive raidz2 vdevs.  Sending to the new pool  
(either locally or from another machine) causes the hangs.



And here are my data points:

We were running two X4500s under Nevada 112 but came across this issue on
both of them. On receiving much data through a ZFS receive, they would lock
up. Any zpool or zfs commands would hang and were unkillable. The only way
to resolve the situation was to reboot without syncing disks. I reported
this in some posts back in April
(http://opensolaris.org/jive/click.jspa?searchID=2021762&messageID=368524)

One of them had an old enough zpool and zfs version to down/up/sidegrade to
Solaris 10 u6 and so I made this change.
The thumper running Solaris 10 is now mostly fine - it normally receives an
hourly snapshot with no problem.

The thumper running 112 has continued to experience the issues described by
Ian and others. I've just upgraded to 117 and am having even more issues -
I'm unable to receive or roll back snapshots; instead I see:

506 r...@thumper1:~ cat snap | zfs receive -vF thumperpool
receiving incremental stream of vlepool/m...@200906182000 into 
thumperp...@200906182000
cannot receive incremental stream: most recent snapshot of thumperpool does not
match incremental source

511 r...@thumper1:~ zfs rollback -r thumperpool/m...@200906181800
cannot destroy 'thumperpool/m...@200906181900': dataset already exists

  

Thanks for the additional data Andrew.

Can you do a zfs destroy of thumperpool/m...@200906181900?

--
Ian.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Hanging receive

2009-07-08 Thread Andrew Robert Nicols
On Wed, Jul 08, 2009 at 08:31:54PM +1200, Ian Collins wrote:
 Andrew Robert Nicols wrote:

 The thumper running 112 has continued to experience the issues described by
 Ian and others. I've just upgraded to 117 and am having even more issues -
 I'm unable to receive or roll back snapshots; instead I see:

 506 r...@thumper1:~ cat snap | zfs receive -vF thumperpool
 receiving incremental stream of vlepool/m...@200906182000 into 
 thumperp...@200906182000
 cannot receive incremental stream: most recent snapshot of thumperpool does 
 not
 match incremental source

 511 r...@thumper1:~ zfs rollback -r thumperpool/m...@200906181800
 cannot destroy 'thumperpool/m...@200906181900': dataset already exists

   
 Thanks for the additional data Andrew.

 Can you do a zfs destroy of thumperpool/m...@200906181900?

I'm afraid not:

503 r...@thumper1:~ zfs destroy thumperpool/m...@200906181900
cannot destroy 'thumperpool/m...@200906181900': dataset already exists

Andrew

-- 
Systems Developer

e: andrew.nic...@luns.net.uk
im: a.nic...@jabber.lancs.ac.uk
t: +44 (0)1524 5 10147

Lancaster University Network Services is a limited company registered in
England and Wales. Registered number: 4311892. Registered office:
University House, Lancaster University, Lancaster, LA1 4YW


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Problem with mounting ZFS from USB drive

2009-07-08 Thread Victor Latushkin

On 08.07.09 12:30, Darren J Moffat wrote:

Karl Dalen wrote:

I'm a new user of ZFS and I have an external USB drive which contains
a ZFS pool with file system. It seems that it does not get auto mounted
when I plug in the drive. I'm running osol-0811.

How can I manually mount this drive? It has a pool named rpool on it.
Is there any diagnostics commands that can be used to investigate the
contents of the pool or repair a damaged file system ?

rmformat shows that the physical name of the USB device is: 
/dev/rdsk/c4t0d0p0

If I try '# zpool import I get:
  pool: rpool
id: 3765122753259138111
 state: UNAVAIL
status: The pool was last accessed by another system.
action: The pool cannot be imported due to damaged devices or data.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:
rpool   UNAVAIL  newer version
  c4t0d0s0  ONLINE


Did you try this:

zpool import -f rpool someothername

I think there are two reasons it won't import:
1) It was last accessed by another system (or maybe the same
 one but it had a different hostid at the time) so you need
 to use the -f flag.
2) There is probably another pool called rpool (the one you
are running from), right ?


I think this pool is just too modern for the system you are trying to
import it on, as it is UNAVAIL due to a newer version.


Here's an example:

r...@jax # mkfile -n 64m version
r...@jax # zpool create version /var/tmp/version
r...@jax # zpool upgrade version
This system is currently running ZFS pool version 10.

Pool 'version' is already formatted using the current version.
r...@jax #
r...@jax # rcp version theorem:/var/tmp

On the other host:

r...@theorem # zpool upgrade
This system is currently running ZFS version 4.

All pools are formatted using this version.
r...@theorem # zpool import -d /var/tmp
  pool: version
id: 2589325003567752919
 state: FAULTED
status: The pool is formatted using an incompatible version.
action: The pool cannot be imported.  Access the pool on a system 
running newer

software, or recreate the pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-A5
config:

version UNAVAIL   newer version
  /var/tmp/version  ONLINE

r...@theorem #


There's a difference in messages, though, but the older host in my case is
running Solaris 10 U4, which may explain it. Anyway, I think the pool
version is the real reason here.


Victor


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Hanging receive

2009-07-08 Thread Andrew Robert Nicols
On Wed, Jul 08, 2009 at 09:41:12AM +0100, Andrew Robert Nicols wrote:
 On Wed, Jul 08, 2009 at 08:31:54PM +1200, Ian Collins wrote:
  Andrew Robert Nicols wrote:
 
  The thumper running 112 has continued to experience the issues described by
  Ian and others. I've just upgraded to 117 and am having even more issues -
  I'm unable to receive or roll back snapshots; instead I see:
 
  506 r...@thumper1:~ cat snap | zfs receive -vF thumperpool
  receiving incremental stream of vlepool/m...@200906182000 into 
  thumperp...@200906182000
  cannot receive incremental stream: most recent snapshot of thumperpool 
  does not
  match incremental source
 
  511 r...@thumper1:~ zfs rollback -r thumperpool/m...@200906181800
  cannot destroy 'thumperpool/m...@200906181900': dataset already exists
 

  Thanks for the additional data Andrew.
 
  Can you do a zfs destroy of thumperpool/m...@200906181900?
 
 I'm afraid not:
 
 503 r...@thumper1:~ zfs destroy thumperpool/m...@200906181900
 cannot destroy 'thumperpool/m...@200906181900': dataset already exists

Moving back to Nevada 112, I'm once again able to receive snapshots and
destroy datasets as appropriate - thank goodness!

However, I'm fairly sure that in a few hours, with the volume of data I'm
sending I'll see zfs hang.

Can anyone on the list suggest some diagnostics which may be of use when
this happens?

Thanks in advance,

Andrew Nicols

-- 
Systems Developer

e: andrew.nic...@luns.net.uk
im: a.nic...@jabber.lancs.ac.uk
t: +44 (0)1524 5 10147

Lancaster University Network Services is a limited company registered in
England and Wales. Registered number: 4311892. Registered office:
University House, Lancaster University, Lancaster, LA1 4YW


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [nfs-discuss] NFS, ZFS ESX

2009-07-08 Thread Roch

erik.ableson writes:

  Comments in line.
  
  On 7 juil. 09, at 19:36, Dai Ngo wrote:
  
   Without any tuning, the default TCP window size and send buffer size  
   for NFS
   connections is around 48KB which is not very optimal for bulk  
   transfer. However
   the 1.4MB/s write seems to indicate something else is seriously wrong.
  
  My sentiment as well.
  
   iSCSI performance was good, so the network connection seems to be OK  
   (assuming
   it's 1GbE).
  
  Yup - I'm running at wire speed on the iSCSI connections.
  
   What is your mount options look like?
  
  Unfortunately, ESX doesn't give any controls over mount options
  
   I don't know what datastore browser does for copying file, but have  
   you tried
   the vanilla 'cp' command?
  
  The datastore browser copy command is just a wrapper for cp from what  
  I can gather. All types of copy operations to the NFS volume, even  
  from other machines top out at this speed.  The NFS/iSCSI connections  
  are in a separate physical network so I can't easily plug anything  
  into it for testing other mount options from another machine or OS.  
  I'll try from another VM to see if I can't force a mount with the  
  async option to see if that helps any.
  
   You can also try NFS performance using tmpfs, instead of ZFS, to  
   make sure
   NIC, protocol stack, NFS are not the culprit.
  
   From what I can observe, it appears that the sync commands issues  
  over the NFS stack are slowing down the process, even with a  
  reasonable number of disks in the pool.
  
  What I was hoping for was the same behavior (albeit slightly risky) of  
  having writes cached to RAM and then dumped out in an optimal manner  
  to disk, as per the local behavior where you see the flush to disk  
  operations happening on a regular cycle. I think that this would be  
  doable with an async mount, but I can't set this on the server side  
  where it would be used by the servers automatically.
  
  Erik
  

I wouldn't do this; it sounds like you want to have
zil_disable.

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide

If you do, then be prepared to unmount or reboot all clients of
the server in case of a crash in order to clear their
corrupted caches.
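
For reference, the tunable Roch is referring to was set like this at the time
(a sketch; test rigs only, given the warning above - it takes effect for
datasets mounted after the change):

# echo zil_disable/W0t1 | mdb -kw        (live; turn off with W0t0)

or persistently via /etc/system:

set zfs:zil_disable = 1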

This is in no way a ZIL problem nor a ZFS problem.

http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine 

And most NFS appliance providers use some form of write-accelerating
device to try to make the NFS experience closer to local filesystem
behavior.


-r



   erik.ableson wrote:
   OK - I'm at my wit's end here as I've looked everywhere to find  
   some means of tuning NFS performance with ESX into returning  
   something acceptable using osol 2008.11.  I've eliminated  
   everything but the NFS portion of the equation and am looking for  
   some pointers in the right direction.
  
   Configuration: PE2950 bi pro Xeon, 32Gb RAM with an MD1000 using a  
   zpool of 7 mirror vdevs. ESX 3.5 and 4.0.  Pretty much a vanilla  
   install across the board, no additional software other than the  
   Adaptec StorMan to manage the disks.
  
   local performance via dd - 463MB/s write, 1GB/s read (8Gb file)
   iSCSI performance - 90MB/s write, 120MB/s read (800Mb file from a VM)
   NFS performance - 1.4MB/s write, 20MB/s read (800Mb file from the  
   Service Console, transfer of a 8Gb file via the datastore browser)
  
   I just found the tool latencytop which points the finger at the ZIL  
   (tip of the hat to Lejun Zhu).  Ref: 
   http://www.infrageeks.com/zfs/nfsd.png 
 http://www.infrageeks.com/zfs/fsflush.png.  Log file: 
http://www.infrageeks.com/zfs/latencytop.log 
   
  
   Now I can understand that there is a performance hit associated  
   with this feature of ZFS for ensuring data integrity, but this  
   drastic a difference makes no sense whatsoever. The pool is capable  
   of handling natively (at worst) 120*7 IOPS and I'm not even seeing  
   enough to saturate a USB thumb drive. This still doesn't answer why  
   the read performance is so bad either.  According to latencytop,  
   the culprit would be genunix`cv_timedwait_sig rpcmod`svc
  
   From my searching it appears that there's no async setting for the  
   osol nfsd, and ESX does not offer any mount controls to force an  
   async connection.  Other than putting in an SSD as a ZIL (which  
   still strikes me as overkill for basic NFS services) I'm looking  
   for any information that can bring me up to at least reasonable  
   throughput.
  
   Would a dedicated 15K SAS drive help the situation by moving the  
   ZIL traffic off to a dedicated device? Significantly? This is the  
   sort of thing that I don't want to do without some reasonable  
   assurance that it will help since you can't remove a ZIL device  
   from a pool at the moment.
  
   Hints and tips appreciated,
  
   Erik
   ___
   nfs-discuss mailing list
   

Re: [zfs-discuss] NFS load balancing / was: ZFS, ESX , and NFS. oh my!

2009-07-08 Thread Nils Goroll

Hi Miles and All,

this is off-topic, but as the discussion has started here:


Finally, *ALL THIS IS COMPLETELY USELESS FOR NFS* because L4 hashing
can only split up separate TCP flows.


The reason why I have spent some time with
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6817942
is to make NFS load balancing over more than one TCP stream work again.

When rpcmod:clnt_max_conns is set to a value > 1, the NFS client will use
multiple TCP connections.
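
For reference, that client tunable normally goes in /etc/system (the value 8
is just an example) and takes effect after a reboot:

set rpcmod:clnt_max_conns = 8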


Now the next question is which IP addresses and TCP ports are chosen for these
connections; they are not guaranteed to be consecutive, which matters for
getting optimal load distribution with the hashes I've seen in the field.


That's a topic I'll probably revisit..

Nils
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zfs snapshoot of rpool/* to usb removable drives?

2009-07-08 Thread Carl Brewer
G'day,
I'm putting together a LAN server with a couple of terabyte HDDs as a mirror 
(zfs root) on b117 (updated 2009.06).

I want to back up snapshots of all of rpool to a removable drive on a USB port
- simple & cheap backup media for a two-week rolling DR solution - i.e. once a
week a HDD gets swapped out and kept offsite.  I figure ZFS snapshots are
perfect for local backups of files; it's only DR that we need the offsite
backup for.

I created and formatted one drive on the USB interface (hopefully this will 
cope with drives being swapped in and out?), called it 'backup' to confuse 
things :)

zfs list shows :
NAME   USED  AVAIL  REFER  MOUNTPOINT
backup 114K   913G21K  /backup
rpool 16.1G   897G84K  /rpool
rpool/ROOT13.7G   897G19K  legacy
rpool/ROOT/opensolaris37.7M   897G  5.02G  /
rpool/ROOT/opensolaris-1  13.7G   897G  10.9G  /
rpool/cashmore 140K   897G22K  /rpool/cashmore
rpool/dump1018M   897G  1018M  -
rpool/export   270M   897G23K  /export
rpool/export/home  270M   897G   736K  /export/home
rpool/export/home/carl 267M   897G   166M  /export/home/carl
rpool/swap1.09G   898G   101M  -

I've tried this :

zfs snapshot -r rp...@mmddhh
zfs send rp...@mmddhh | zfs receive -F backup/data

eg :

c...@lan2:/backup# zfs snapshot -r rp...@2009070804
c...@lan2:/backup# zfs send rp...@2009070804 | zfs receive -F backup/data

Now I'd expect to see the drive light up and to see some activity, but not much 
seems to happen.

zpool status shows :
# zpool status
  pool: backup
 state: ONLINE
 scrub: none requested
config:

NAME STATE READ WRITE CKSUM
backup   ONLINE   0 0 0
  c10t0d0s0  ONLINE   0 0 0

errors: No known data errors

  pool: rpool
 state: ONLINE
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
rpool   ONLINE   0 0 0
  mirrorONLINE   0 0 0
c8d1s0  ONLINE   0 0 0
c9d1s0  ONLINE   0 0 0

errors: No known data errors

and zfs list -t all :
zfs list -t all | grep back
backup                                           238K   913G    23K  /backup
bac...@zfs-auto-snap:frequent-2009-07-08-23:30    18K      -    21K  -
backup/data                                       84K   913G    84K  /backup/data
backup/d...@2009070804                              0      -    84K  -

So nothing much is getting copied onto the USB drive as far as I can tell. 
Certainly not a few GB of stuff.  Can anyone tell me what I've missed or 
misunderstood?  Does snapshot -r not get all of rpool?

Thankyou!

Carl
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Migrating 10TB of data from NTFS is there a simple way?

2009-07-08 Thread Jim Klimov
First of all, as other posters stressed, your data is not safe while it is
stored as a single copy. Before doing anything to it, make a backup and test
the backup if at all possible. At the very least, do that for any data that is
worth more than the rest of it ;)

As was stressed in other posts, and not yet answered - how much disk space do
you actually have available on your 8 disks? I.e. can you shuffle some files
around in WinXP in order to free up at least one drive? Half of the drives
(ideal)?

How compressible is your data (i.e. videos vs. text files)? Is it already
compressed on the NTFS filesystem (a pointer to freeing up space, if not)?

Depending on how much free space can be actually gained by moving and 
compressing your data, there's a number of scenarios possible, detailed below. 

The point I'm trying to get to is: as soon as you can free up a single drive so 
you 
can move it into the Solaris machine, you can set it up as your ZFS pool.

Initially this pool would only contain a single vdev (a single drive, a mirror 
or a 
raidz group of drives, which may be concatenated to make up the larger pool if 
there's more than one vdev, as detailed below).

You create a filesystem dataset on the pool and enable compression (if your
data can be compressed). In recent Solaris and OpenSolaris you can use gzip-9
to fit the data more tightly on the drive. Also keep in mind that this setting
applies to any data written *after* the value is set, so a dataset can store
data objects written with mixed compression levels if the value is changed on
the fly. Alternatively, and simpler to support, you can make several datasets
with pre-defined compression levels (i.e. so as not to waste CPU cycles
compressing JPEGs and MP3s).
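
A minimal sketch of that layout (device and dataset names are just examples):

# zpool create tank c1t1d0                        (the first freed-up drive)
# zfs create -o compression=gzip-9 tank/docs      (text, dumps - compresses well)
# zfs create -o compression=off    tank/media     (JPEGs, MP3s - already compressed)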

Now, as you copy the data from NTFS to Solaris, you (are expected to) free up 
at 
least one more drive which can be added to the ZFS pool. Its capacity is at this
moment concatenated to the same pool. If you free up many drives at once, you
can go for a raidz vdev.

The best-case scenario is that you free up enough disks to build a redundant
ZFS vdev right away (raidz1, raidz2 or mirror - in order of decreasing capacity
and increasing data protection). Apparently you don't expect to have enough
drives to mirror all the data, so let's skip that idea for now. The raidz
levels require that you free up at least two drives initially. AFAIK raidz
vdevs cannot be expanded at the moment, so the more drives you use initially,
the less overhead capacity you'll lose. As you progress with the data copying,
you can free up some more drives and make another raidz vdev attached to this
pool.

You can use a trick to make a raidz vdev with missing redundancy disks (which
you'd attach and resilver later on). This is possible, but not production
ready in any manner, and prone to losing the data of the whole set of several
drives if anything goes wrong. At my sole risk, I used it to make and populate
a raidz2 pool of 4 devices while I only had 2 drives available at that moment
(the other 2 were the old raid10 mirror's components holding the original
data).

The fake raidz redundancy devices trick is discussed in this thread:
[http://opensolaris.org/jive/thread.jspa?messageID=328360&tstart=0]
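
A rough sketch of the shape of that trick (sizes and device names are
illustrative, and again: there is no redundancy until the fake device is
replaced, so treat it as at-your-own-risk):

# mkfile -n 750g /var/tmp/fakedisk               (sparse file, takes no real space)
# zpool create tank raidz c1t1d0 c1t2d0 /var/tmp/fakedisk
# zpool offline tank /var/tmp/fakedisk           (pool now runs DEGRADED)
# rm /var/tmp/fakedisk
  ...copy data, free up a real drive, then:
# zpool replace tank /var/tmp/fakedisk c1t3d0    (resilver restores redundancy)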

In the worst-case scenario you'll have either a single pool of concatenated
disks, or a number of separate pools - like your separate NTFS filesystems now;
in my opinion, the latter is the lesser of the two evils. With separate ZFS
pools, you can move them around and you only lose one disk's worth of data if
anything (drive, cabling, software, power) goes wrong. With a concatenated
pool, however, you have all of the drives' free space concatenated into one
big available bubble.

That's your choice to make. 

Later on you can expand the single drive vdevs to become mirrors, as you buy or
free up drives.

If you find that your data compresses well, so that you start with a 
single-drive 
concatenation pool and then find that you can free up several drives at once and
use raidz sets, see if you can squeeze out at least 3-4 drives (including a fake
device for raidz redundancy if you choose to try the trick). If you can - start 
a
new pool made with raidz vdevs and migrate the data from single drives to it,
then scrap their pool and reuse them. Remember that you can't currently remove
a vdev from the pool.

For such temporary pools (preferably redundant, but not necessarily) you can
also use any number of older, smaller drives if you can get your hands on
them ;)

On a side note, copying this much data over the LAN will take ages. If your
disks are not too fragmented, you can typically expect 20-40MB/s for large
files. Zillions of small files (or heavily fragmented disks) cause so many
mechanical seeks that speeds can fall to well under 1MB/s. It's easy to see
that copying a single 1.5TB drive can take anywhere from half a day on a
gigabit LAN to about 2-3 days on a 100Mbit LAN (7-10 

Re: [zfs-discuss] zfs snapshoot of rpool/* to usb removable drives?

2009-07-08 Thread Darren J Moffat

Carl Brewer wrote:

G'day,
I'm putting together a LAN server with a couple of terabyte HDDs as a mirror 
(zfs root) on b117 (updated 2009.06).

I want to back up snapshots of all of rpool to a removable drive on a USB port - 
simple  cheap backup media for a two week rolling DR solution - ie: once a 
week a HDD gets swapped out and kept offsite.  I figure ZFS snapshots are perfect 
for local backups of files, it's only DR that we need the offsite backup for.

I created and formatted one drive on the USB interface (hopefully this will 
cope with drives being swapped in and out?), called it 'backup' to confuse 
things :)

zfs list shows :
NAME   USED  AVAIL  REFER  MOUNTPOINT
backup 114K   913G21K  /backup
rpool 16.1G   897G84K  /rpool
rpool/ROOT13.7G   897G19K  legacy
rpool/ROOT/opensolaris37.7M   897G  5.02G  /
rpool/ROOT/opensolaris-1  13.7G   897G  10.9G  /
rpool/cashmore 140K   897G22K  /rpool/cashmore
rpool/dump1018M   897G  1018M  -
rpool/export   270M   897G23K  /export
rpool/export/home  270M   897G   736K  /export/home
rpool/export/home/carl 267M   897G   166M  /export/home/carl
rpool/swap1.09G   898G   101M  -

I've tried this :

zfs snapshot -r rp...@mmddhh
zfs send rp...@mmddhh | zfs receive -F backup/data

eg :

c...@lan2:/backup# zfs snapshot -r rp...@2009070804
c...@lan2:/backup# zfs send rp...@2009070804 | zfs receive -F backup/data


You are missing a -R for the 'zfs send' part.

What you have done there is create snapshots of all the datasets in 
rpool called 2009070804 but you only sent the one of the top level rpool 
dataset.


 -R

 Generate a replication stream  package,  which  will
 replicate  the specified filesystem, and all descen-
 dant file systems, up to the  named  snapshot.  When
 received, all properties, snapshots, descendent file
 systems, and clones are preserved.

 If the -i or -I flags are used in  conjunction  with
 the  -R  flag,  an incremental replication stream is
 generated. The current  values  of  properties,  and
 current  snapshot and file system names are set when
 the stream is received. If the -F flag is  specified
 when  this  stream  is  received, snapshots and file
 systems that do not exist on the  sending  side  are
 destroyed.
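
Put concretely, the command pair from the original post would become something
like this (an untested sketch, reusing the snapshot name shown above):

# zfs snapshot -r rpool@2009070804
# zfs send -R rpool@2009070804 | zfs receive -F backup/data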



--
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] recover data after zpool create

2009-07-08 Thread stephen bond
Kees,

can you provide an example of how to read from dd cylinder by cylinder?

also if a file is fragmented is there a marker at the end of the first piece 
telling where is the second?

Thank you
stephen
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Very slow ZFS write speed to raw zvol

2009-07-08 Thread Leon Verrall
Guys,

Have an opensolairs x86 box running:

SunOS thsudfile01 5.11 snv_111b i86pc i386 i86pc Solaris

This has 2 old qla2200 1Gbit FC cards attached. Each bus is connected to an old 
transtec F/C raid array. This has a couple of large luns that form a single 
large zpool:

r...@thsudfile01:~# zpool status bucket
  pool: bucket
 state: ONLINE
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
bucket  ONLINE   0 0 0
  c5t0d0ONLINE   0 0 0
  c8t3d0ONLINE   0 0 0

errors: No known data errors
r...@thsudfile01:~# zfs list bucket
NAME USED  AVAIL  REFER  MOUNTPOINT
bucket  2.69T  5.31T22K  /bucket

This is being used as an iSCSI target for an ESX 4.0 development environemnt. I 
found the performance to be really poor and found the culprit seems to be write 
performance to the raw zvol. For example on this zfs filesystem allocated as a 
volume:

r...@thsudfile01:~# zfs list bucket/iSCSI/lun1
NAMEUSED  AVAIL  REFER  MOUNTPOINT
bucket/iSCSI/lun1   250G  5.55T  3.64G  -

r...@thsudfile01:~# dd if=/dev/zero of=/dev/zvol/rdsk/bucket/iSCSI/lun1 
bs=65536 count=102400
^C7729+0 records in
7729+0 records out
506527744 bytes (507 MB) copied, 241.707 s, 2.1 MB/

Some zpool iostat 1 1000:

bucket  2.44T  5.68T  0203  0  2.73M
bucket  2.44T  5.68T  0216  0  2.83M
bucket  2.44T  5.68T  0120  63.4K  1.58M
bucket  2.44T  5.68T  2350   190K  16.9M
bucket  2.44T  5.68T  0123  0  1.64M
bucket  2.44T  5.68T  0230  0  3.02M

Read performance from that zvol (assuming /dev/null behaves properly) is fine:

r...@thsudfile01:/bucket/transtec# dd of=/dev/null 
if=/dev/zvol/rdsk/bucket/iSCSI/lun1 bs=65536 count=204800
204800+0 records in
204800+0 records out
13421772800 bytes (13 GB) copied, 47.0256 s, 285 MB/s

Somewhat optimistic that... but iostat shows 100MB/s ish.

Write to a zfs filesystem from that zpool is also fine, here with a a write big 
enough to exhaust the machines 12GB memory:

r...@thsudfile01:/bucket/transtec# dd if=/dev/zero of=FILE bs=65536 count=409600
^C
336645+0 records in
336645+0 records out
22062366720 bytes (22 GB) copied, 176.369 s, 125 MB/s

and bursts of cache flush from iostat:

bucket  2.44T  5.68T  0342  0  38.7M
bucket  2.44T  5.68T  0  1.47K  0   188M
bucket  2.44T  5.68T  0240  0  21.3M
bucket  2.44T  5.68T  0  1.54K  0   191M
bucket  2.44T  5.68T  0  1.49K  0   191M
bucket  2.44T  5.68T  0434  0  44.2M

So we seem to be able to get data down to disk via the cache at a reasonable 
rate and read from a raw zvol OK, but writes  are horribly slow. 

Am I missing something obvious? let me know what info would be diagnostic and 
I'll post it... 

Cheers,

Leon
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Migrating 10TB of data from NTFS is there a simple way?

2009-07-08 Thread Jim Klimov
I meant to add that due to the sheer amount of data (and time needed) to copy,
you really don't want to use copying tools which abort on error, such as MS 
Explorer.

Normally I'd suggest something like FAR in Windows or Midnight Commander in Unix
to copy over networked connections (CIFS shares), or further on - 
tar/cpio/whatever.
These would let you know of errors and/or suggest that you retry copying (if 
errors
were due to environment like LAN switch reset). However, interactive tools would
stall until you answer, and non-interactive tools would not continue copying 
over
what they lost on the first pass.

Overall from my experience, I'd suggest RSync running in a loop with partial-dir
enabled, for either local copying or over-the-net copying. This way rsync takes 
care
of copying only the changed files (or continuing files which failed from the 
point 
where they failed), and it does so without requiring supervision. For Windows 
side
you can look for a project called cwRsync which includes parts of Cygwin to make
the environment for rsync (ssh, openssl, etc).

My typical runs between Unix hosts look like:

solaris# cd /pool/dumpstore/databases
solaris# while ! rsync -vaP --stats --exclude='*.bak' --exclude='temp' 
--partial --append  source:/DUMP/snapshots/mysql . ; do sleep 5; echo = 
`date`: RETRY; done; date

(Slashes at the end of pathnames matter a lot - directory vs. its contents)

For Windows the basic syntax remains nearly the same, I don't want to add
confusion by crafting it out of my head now with nowhere to test.

If your setup is in a LAN and security overhead can be disregarded, use 
'rsync -e rsh' (or use ssh with lightweight algorithms) to not waste CPU on 
encryption.

Alternatively, you can configure the Solaris host to act as an rsync server and 
use
the rsync algorithm (with desired settings) directly.
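
A minimal sketch of that server setup (module name and paths are illustrative):

/etc/rsyncd.conf on the Solaris host:
[dumpstore]
    path = /pool/dumpstore/databases
    read only = false

solaris# rsync --daemon
windows# rsync -vaP --partial --append ./mysql/ solarishost::dumpstore/mysql/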

Also, if your files are not ASCII-named, you might want to look at the rsync
--iconv parameter to recode pathnames. And remember the ZFS 255-byte(!) limit
on names; for Unicode names the effective character length is roughly half
that.

//HTH, Jim
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Migrating 10TB of data from NTFS is there a simple way?

2009-07-08 Thread David Magda
On Wed, July 8, 2009 11:55, Jim Klimov wrote:

 My typical runs between Unix hosts look like:

 solaris# cd /pool/dumpstore/databases
 solaris# while ! rsync -vaP --stats --exclude='*.bak' --exclude='temp'
 --partial --append  source:/DUMP/snapshots/mysql . ; do sleep 5; echo
 = `date`: RETRY; done; date

If possible, also try to use rsync 3.x if you're going to go down that
route. In previous versions it was necessary to traverse the entire file
system to get a file list before starting a transfer.

Starting with 3.0.0 (and when talking to another 3.x), it will send
incremental updates so bits start moving quicker:

  ENHANCEMENTS:

  - A new incremental-recursion algorithm is now used when rsync is talking
to another 3.x version.  This starts the transfer going more quickly
(before all the files have been found), and requires much less memory.
See the --recursive option in the manpage for some restrictions.

http://www.samba.org/ftp/rsync/src/rsync-3.0.0-NEWS


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] recover data after zpool create

2009-07-08 Thread Carson Gaspar

stephen bond wrote:


can you provide an example of how to read from dd cylinder by cylinder?


What's a cylinder? That's a meaningless term these days. You dd byte ranges. 
Pick whatever byte range you want. If you want mythical cylinders, fetch the 
cylinder size from format and use that as your block size for dd. But the 
disks all lie about that, and remap sectors anyway, so I don't see why you would 
possibly care...


--
Carson
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Poor Man's Cluster using zpool export and zpool import

2009-07-08 Thread Shawn Joy
Is it supported to use zpool export and zpool import to manage disk access
between two nodes that have access to the same storage device?

What issues exist if the host currently owning the zpool goes down? In this
case, will using zpool import -f work? Are there possible data corruption
issues?

Thanks,
Shawn
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Poor Man's Cluster using zpool export and zpool import

2009-07-08 Thread Darren J Moffat

Shawn Joy wrote:
Is it supported to use zpool export and zpool import to manage disk access between two nodes that have access to the same storage device. 

What issues exist if the host currently owning the zpool goes down? In this case will using zpool import -f work? Is there possible data corruption issues? 


See the description of the cachefile property in the zpool(1M) man page;
it was put there for this type of export/import clustering.
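
A minimal sketch of how that property is used in this scenario (pool and
device names are illustrative):

# zpool create -o cachefile=none sharedpool c3t0d0   (never auto-imported at boot)
  ...on failover, from the node taking over:
# zpool import -o cachefile=none sharedpool          (add -f only if the other
                                                       node is known to be down)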


--
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] recover from zfs

2009-07-08 Thread Kees Nuyt

(in the spirit of open source, directed back to the list)

On Wed, 8 Jul 2009 14:51:55 + (GMT), Stephen C. Bond
wrote:

Kees,

 can you provide an example of how to read from dd 
 cylinder by cylinder or even better by exact coordinates?

That's hard to do, many disks don't tell you the real
geometry.

dd if=/dev/rdsk/cXtYdZsB of=output_file_name \
   bs=block_size \
   skip=nr_of_blocks_to_skip \
   count=nr_of_blocks_to_copy
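
For example, to pull 128MB starting 1GB into a slice (device name and numbers
purely illustrative):

dd if=/dev/rdsk/c0t1d0s0 of=/var/tmp/chunk.bin \
   bs=1024k skip=1024 count=128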

 also if a file is fragmented is there a marker at the
 end of the first piece telling where is the second?

No. That kind of information is kept in the zfs
administrative blocks. You'll have to study the on-disk
format to get that kind of info.

http://opensolaris.org/os/community/zfs/docs/ondiskformatfinal.pdf
http://blogs.sun.com/storage/en_US/entry/examining_zfs_on_disk_format

The zdb utility (zfs debugging tool) might be of help as
well.

Thank you
Stephen C. Bond
-- 
  (  Kees Nuyt
  )
c[_]
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Poor Man's Cluster using zpool export and zpool import

2009-07-08 Thread Cindy . Swearingen

Hi Shawn,

I have no experience with this configuration, but you might review
the information in this blog:

http://blogs.sun.com/erickustarz/entry/poor_man_s_cluster_end

ZFS is not a cluster file system and yes, possible data corruption
issues exist. Eric mentions this in his blog.

You might also check out the HA-cluster product:

http://opensolaris.org/os/community/ha-clusters/

Cindy


Shawn Joy wrote:
Is it supported to use zpool export and zpool import to manage disk access between two nodes that have access to the same storage device. 

What issues exist if the host currently owning the zpool goes down? In this case will using zpool import -f work? Is there possible data corruption issues? 


Thanks,
Shawn

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Single disk parity

2009-07-08 Thread Mark J Musante

On Wed, 8 Jul 2009, Moore, Joe wrote:

The copies code is nice because it tries to put each copy far away
from the others.  This does have a significant performance impact when
on a single spindle, however, because each logical write will be written
here, and then a disk seek is needed to write it there.


That's true for the worst case, but zfs mitigates that somewhat by 
batching i/o into a transaction group.  This means that i/o is done every 
30 seconds (or 5 seconds, depending on the version you're running), 
allowing multiple writes to be written together in the disparate 
locations.
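
One easy way to observe that batching (pool name is illustrative):

# zpool iostat tank 1

The write column sits near zero and then spikes once per transaction-group
sync interval as the batched writes, including the extra copies, are flushed
out together.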



Regards,
markm
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write I/O stalls

2009-07-08 Thread John Wythe
 I was all ready to write about my frustrations with
 this problem, but I upgraded to snv_117 last night to
 fix some iscsi bugs and now it seems that the write
 throttling is working as described in that blog.

I may have been a little premature. While everything is much improved for Samba
and local disk operations (dd, cp) on snv_117, COMSTAR iSCSI writes still seem
to incur this "write a bit, block, write a bit, block" pattern every 5 seconds.

But on top of that, I am getting relatively poor iSCSI performance for some
reason over a direct gigabit link with MTU=9000. I'm not sure what that is
about yet.
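
A quick sanity check for the jumbo-frame path (interface and address are
illustrative):

# dladm show-link                 (confirm MTU 9000 on both ends)
# ping -s 192.168.10.2 8972 5     (8972 = 9000 - 28 bytes of IP/ICMP header)

If the large pings fail while small ones work, something in the path is not
actually passing 9000-byte frames.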

-John
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] surprisingly poor performance

2009-07-08 Thread Miles Nordin
 pe == Peter Eriksson no-re...@opensolaris.org writes:

pe With c1t15d0s0 added as log it takes 1:04.2, but with the same
pe c1t15d0s0 added, but wrapped inside a SVM metadevice the same
pe operation takes 10.4 seconds...

so now SVM discards cache flushes, too?  great.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Poor Man's Cluster using zpool export and zpool import

2009-07-08 Thread Shawn Joy
Thanks Cindy and Darren
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs root, jumpstart and flash archives

2009-07-08 Thread Jerry K

Hello Lori,

It has been a while since this has been discussed, and I am hoping that
you can provide an update or time estimate.  As we are several months
into Update 7, is there any chance of an Update 7 patch, or are we still
waiting for Update 8?


Also, can you share the CR # that you mentioned in your previous email 
(below), so I can read further into this?


thanks again,

Jerry Kemp


Lori Alt wrote:

Latest is that this will go into an early build of Update 8
and be available as a patch shortly thereafter (shortly
after it's putback, that is.  The patch doesn't have to wait for U8
to be released.)

I will update the CR with this information.

Lori


On 02/18/09 09:12, Jerry K wrote:

Hello Lori,

Any update to this issue, and can you speculate as to if it will be a 
patch to Solaris 10u6, or part of 10u7?


Thanks again,

Jerry


Lori Alt wrote:


This is in the process of being resolved right now.  Stay tuned
for when it will be available.  It might be a patch to Update 6.

In the meantime, you might try this:

http://blogs.sun.com/scottdickson/entry/flashless_system_cloning_with_zfs 



- Lori


On 01/09/09 12:28, Jerry K wrote:
I understand that currently, at least under Solaris 10u6, it is not 
possible to jumpstart a new system with a zfs root using a flash 
archive as a source.


Can anyone comment as to whether this restriction will pass in the 
near term, or if this is a while out (6+ months) before this will be 
possible?


Thanks,

Jerry
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
  

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs root, jumpstart and flash archives

2009-07-08 Thread Bob Friesenhahn

On Wed, 8 Jul 2009, Jerry K wrote:

It has been a while since this has been discussed, and I am hoping that you 
can provide an update, or time estimate.  As we are several months into 
Update 7, is there any chance of an Update 7 patch, or are we still waiting 
for Update 8.


I saw that a Solaris 10 patch for supporting Flash archives on ZFS 
came out about a week ago.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs root, jumpstart and flash archives

2009-07-08 Thread Fredrich Maney
Any idea what the Patch ID was?

fpsm

On Wed, Jul 8, 2009 at 3:43 PM, Bob Friesenhahn
bfrie...@simple.dallas.tx.us wrote:
 On Wed, 8 Jul 2009, Jerry K wrote:

 It has been a while since this has been discussed, and I am hoping that
 you can provide an update, or time estimate.  As we are several months into
 Update 7, is there any chance of an Update 7 patch, or are we still waiting
 for Update 8.

 I saw that a Solaris 10 patch for supporting Flash archives on ZFS came out
 about a week ago.

 Bob
 --
 Bob Friesenhahn
 bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs root, jumpstart and flash archives

2009-07-08 Thread Lori Alt

On 07/08/09 13:43, Bob Friesenhahn wrote:

On Wed, 8 Jul 2009, Jerry K wrote:

It has been a while since this has been discussed, and I am hoping 
that you can provide an update, or time estimate.  As we are several 
months into Update 7, is there any chance of an Update 7 patch, or 
are we still waiting for Update 8.


I saw that a Solaris 10 patch for supporting Flash archives on ZFS 
came out about a week ago.


Correct.  These are the patches:

sparc:
119534-15 : fixes to the /usr/sbin/flarcreate and /usr/sbin/flar command
124630-26: updates to the install software

x86:
119535-15 : fixes to the /usr/sbin/flarcreate and /usr/sbin/flar command
124631-27: updates to the install software


Lori
---BeginMessage---
I received the following message about the patches for zfs flash archive 
support:



The submitted patch has been received as release ready by raid.central and
will be officially released to the Enterprise Services patch databases within
24 - 48 hours (except on weekends or holidays) or submitter will be further
notified of any issues that prevent SunService from releasing it.

Contact patch-mana...@sun.com if there are any further questions.

The patches are:

sparc:
119534-15 : fixes to the /usr/sbin/flarcreate and /usr/sbin/flar command
124630-26: updates to the install software

x86:
119535-15 : fixes to the /usr/sbin/flarcreate and /usr/sbin/flar command
124631-27: updates to the install software

A couple weeks ago, I sent out a mail about the content of these patches 
and about how they should be applied.  I have included that message 
again, below.


Lori

---


I have two pieces of information to convey with this mail.  The first is 
a summary of how flash archives work with zfs as a result of this patch.  
During the discussions of what should be implemented, there was some 
disagreement about what was needed.  I want to summarize what finally got 
implemented, just so there is no confusion.  Second, I want to bring 
everyone up to date on the state of the patch for zfs flash archive 
support.

Overview of ZFS Flash Archive Functionality
-
With this new functionality, it is possible to

- generate flash archives that can be used to install systems to boot off
of ZFS root pools
- perform Jumpstart initial installations of entire systems using these
zfs-type flash archives
- back up an entire root pool (not individual boot environments);
individual datasets within the pool can be excluded using a new -D
option to flarcreate and flar (see the sketch after the limitations
list below).

Here are the limitations:

- Jumpstart installations only.  No interactive install support for
flash archive installs of zfs-rooted systems.  No installation of
individual boot environments using Live Upgrade.
- Full initial install only.  No differential flash installs.
- No hybrid ufs/zfs archives.  Existing (ufs-type) flash archives
can still only be used to install ufs roots.  The new zfs-type
flash archive can only be used to install zfs-rooted systems.
- Although the entire root pool (minus any explicitly excluded
datasets) is archived and installed, only the BE booted at
the time of the flarcreate will be usable after the flash archive
is installed (except for pools archived with the -R rootdir
option, which can be used to archive a root pool other than the
one currently booted).
- The options to flarcreate and flar to include and exclude
individual files in a flash archive are not supported with zfs-type
flash archives.  Only entire datasets may be excluded from a
zfs flar.
- The new pool created by the flash archive install will have the
same name as the pool that was used to generate the flash archive.
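
For illustration, here is a minimal sketch of creating a zfs-type archive
with a dataset excluded.  The -D option is the one described above; the
archive name, the excluded dataset, and the NFS destination are made-up
examples, so check flarcreate(1M) and the patch README for the exact syntax:

# flarcreate -n zfs_s10u6 -D rpool/export \
    /net/installserver/export/flars/zfs_s10u6.flar

Presumably -D can be repeated to exclude additional datasets.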



Status of the ZFS Flash Archive Patches
---
I have received test versions of the patches for zfs flash archive 
support (CR 6690473).

Those patches are:

sparc:
119534-15 : fixes to the /usr/sbin/flarcreate and /usr/sbin/flar command
124630-26: updates to the install software

x86:
119535-15 : fixes to the /usr/sbin/flarcreate and /usr/sbin/flar command
124631-27: updates to the install software

The patches are applied as follows:

The flarcreate/flar patch (119534-15/119535-15) must be applied to the 
system where the flash archive is generated.  The install software patch 
(124630-26/124631-27) must be applied to the install medium (probably a 
netinstall image), since that is where the install software resides.  A 
system being installed with a flash archive image will have to be booted 
from a patched image so that the install software can recognize the 
zfs-type flash archive and handle it correctly.
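
As a sketch of where the patches land (the zip file name is hypothetical;
patchadd itself is standard, but the patch READMEs are the authority for
any special install instructions):

  (on the sparc system where the flash archive will be generated)
# unzip 119534-15.zip
# patchadd 119534-15

The install software patch, 124630-26 or 124631-27, goes onto the netinstall
miniroot instead; see the miniroot patching notes later in this thread.
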
I verified these patches on both sparc and x86 platforms, and as applied to
both Update 6 and Update 7 systems and images.  On Update 6, it is also
necessary to apply the kernel update (KU) patch to the netinstall image
in order for the install to work.  The KU patch is

  

Re: [zfs-discuss] zfs root, jumpstart and flash archives

2009-07-08 Thread Bob Friesenhahn

On Wed, 8 Jul 2009, Fredrich Maney wrote:


Any idea what the Patch ID was?


x86:    119535-15
SPARC:  119534-15

Description of change 6690473: request to have flash support for ZFS 
root install.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] zfs root, jumpstart and flash archives

2009-07-08 Thread Jean-Noël Mattern

Bob,

Patches that allow the creation and installation of a flash archive on a 
zpool are available:


For SPARC:
119534-15 : fixes to the /usr/sbin/flarcreate and /usr/sbin/flar command
124630-26: updates to the install software

For x86:
119535-15 : fixes to the /usr/sbin/flarcreate and /usr/sbin/flar command
124631-27: updates to the install software

You'll have to patch the miniroot of your network boot server in order 
to have it work 
(http://www.sun.com/bigadmin/features/hub_articles/patchmini.jsp).


Jnm.

--

Bob Friesenhahn wrote:

On Wed, 8 Jul 2009, Jerry K wrote:

It has been a while since this has been discussed, and I am hoping 
that you can provide an update, or time estimate.  As we are several 
months into Update 7, is there any chance of an Update 7 patch, or 
are we still waiting for Update 8.


I saw that a Solaris 10 patch for supporting Flash archives on ZFS 
came out about a week ago.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, 
http://www.simplesystems.org/users/bfriesen/

GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] zfs root, jumpstart and flash archives

2009-07-08 Thread Enda O'Connor

Hi
for sparc
119534-15
124630-26


for x86
119535-15
124631-27

Higher revs of these will also suffice.

Note these need to be applied to the miniroot of the jumpstart image so 
that it can then install a zfs flash archive.  Please read the README 
notes in these patches for more specific instructions, including 
instructions on miniroot patching.
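
The READMEs are authoritative, but as a rough sketch of the general shape
of miniroot patching (the image paths here are hypothetical):

  (sparc netinstall image: the miniroot can usually be patched in place)
# patchadd -C /export/install/s10u6/Solaris_10/Tools/Boot 124630-26

  (x86: the compressed miniroot generally has to be unpacked, patched,
   and repacked, along these lines)
# /boot/solaris/bin/root_archive unpackmedia /export/install/s10u6_x86 /tmp/mr
# patchadd -C /tmp/mr 124631-27
# /boot/solaris/bin/root_archive packmedia /export/install/s10u6_x86 /tmp/mr

See also the bigadmin miniroot patching article linked elsewhere in this
thread.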


Enda

Fredrich Maney wrote:

Any idea what the Patch ID was?

fpsm

On Wed, Jul 8, 2009 at 3:43 PM, Bob
Friesenhahnbfrie...@simple.dallas.tx.us wrote:

On Wed, 8 Jul 2009, Jerry K wrote:


It has been a while since this has been discussed, and I am hoping that
you can provide an update, or time estimate.  As we are several months into
Update 7, is there any chance of an Update 7 patch, or are we still waiting
for Update 8.

I saw that a Solaris 10 patch for supporting Flash archives on ZFS came out
about a week ago.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] zfs snapshoot of rpool/* to usb removable drives?

2009-07-08 Thread Carl Brewer
Thank you!  Am I right in thinking that rpool snapshots will include things like 
swap?  If so, is there some way to exclude them?  Much like rsync has --exclude?


Re: [zfs-discuss] zfs snapshoot of rpool/* to usb removable drives?

2009-07-08 Thread Richard Elling

Carl Brewer wrote:

Thankyou!  Am I right in thinking that rpool snapshots will include things like 
swap?  If so, is there some way to exclude them?  Much like rsync has --exclude?
  


No. Snapshots are a feature of the dataset, not the pool.  So you
would have separate snapshot policies for each file system (eg rpool)
and volume (eg swap and dump).
-- richard



Re: [zfs-discuss] zfs snapshoot of rpool/* to usb removable drives?

2009-07-08 Thread Lori Alt

On 07/08/09 15:57, Carl Brewer wrote:

Thankyou!  Am I right in thinking that rpool snapshots will include things like 
swap?  If so, is there some way to exclude them?  Much like rsync has --exclude?
  
By default, the zfs send -R will send all the snapshots, including 
swap and dump.  But you can do the following after taking the snapshot:


# zfs destroy rpool/d...@mmddhh
# zfs destroy rpool/s...@mmddhh

and then do the zfs send -R.  You'll get messages about the missing 
snapshots, but they can be ignored. 
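
Pulling those steps together, a rough sketch (the snapshot name, the backup
pool name, and its device are only examples):

# zfs snapshot -r rpool@20090708
# zfs destroy rpool/dump@20090708
# zfs destroy rpool/swap@20090708
# zpool create backup c5t0d0
# zfs send -R rpool@20090708 | zfs receive -Fd backup

Or redirect the send stream to a file on the external drive instead.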

In order to re-create a bootable pool from your backup, there are 
additional steps required.  A full description of a procedure similar to 
what you are attempting can be found here:


http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide#ZFS_Root_Pool_Recovery


Lori




Re: [zfs-discuss] zfs snapshoot of rpool/* to usb removable drives?

2009-07-08 Thread Daniel Carosone
 Thankyou!  Am I right in thinking that rpool
 snapshots will include things like swap?  If so, is
 there some way to exclude them? 

Hi Carl :)

You can't exclude them from the send -R with something like --exclude, but you 
can make sure there are no such snapshots (which aren't useful anyway) before 
sending, as noted.

As well as deleting them, another way to do this is to not create them in the 
first place.  If you use the snapshots created by Tim's zfs-auto-snapshot 
service, that service observes a property on each dataset that lets you 
exclude that dataset from having snapshots taken.
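
If memory serves, the property the service checks is com.sun:auto-snapshot,
so something along these lines should keep swap and dump out of the automatic
snapshots (treat the property name as an assumption and check the service's
documentation):

# zfs set com.sun:auto-snapshot=false rpool/swap
# zfs set com.sun:auto-snapshot=false rpool/dump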

There are convenient hooks in that service that you can use to facilitate the 
sending step directly once the snapshots are taken, and to use incremental 
sends of the snapshots as well.

You might consider your replication schedule, too - for example, keep 
frequent and maybe even hourly snapshots only on the internal pool, and 
replicate daily and beyond snapshots to the external drive ready for removal. 
  If you arrange your schedule of swapping drives well enough, such that a 
drive returns from offsite storage and is reconnected while the most recent 
snapshot it contains is still present in rpool, then catching it up with the 
week of snapshots it missed while offsite can be quick.


Re: [zfs-discuss] Single disk parity

2009-07-08 Thread Richard Elling

Haudy Kazemi wrote:

Daniel Carosone wrote:

Sorry, don't have a thread reference
to hand just now.



http://www.opensolaris.org/jive/thread.jspa?threadID=100296

Note that there's little empirical evidence that this is directly applicable to 
the kinds of errors (single bit, or otherwise) that a single failing disk 
medium would produce.  Modern disks already include and rely on a lot of ECC as 
part of ordinary operation, below the level usually seen by the host.  These 
mechanisms seem unlikely to return a read with just one (or a few) bit errors.

This strikes me, if implemented, as potentially more applicable to errors 
introduced from other sources (controller/bus transfer errors, non-ecc memory, 
weak power supply, etc).  Still handy.
  


Adding additional data protection options is commendable.  On the 
other hand I feel there are important gaps in the existing feature set 
that are worthy of a higher priority, not the least of which is the 
automatic recovery of uberblock / transaction group problems (see 
Victor Latushkin's recovery technique which I linked to in a recent 
post), 


This does not seem to be a widespread problem.  We do see the
occasional complaint on this forum, but considering the substantial
number of ZFS implementations in existence today, the rate seems
to be quite low.  In other words, the impact does not seem to be high.
Perhaps someone at Sun could comment on the call rate for such
conditions?

followed closely by a zpool shrink or zpool remove command that lets 
you resize pools and disconnect devices without replacing them.  I saw 
postings or blog entries from about 6 months ago that this code was 
'near' as part of solving a resilvering bug but have not seen anything 
else since.  I think many users would like to see improved resilience 
in the existing features and the addition of frequently long requested 
features before other new features are added.  (Exceptions can readily 
be made for new features that are trivially easy to implement and/or 
are not competing for developer time with higher priority features.)


In the meantime, there is the copies flag option that you can use on 
single disks.  With immense drives, even losing 1/2 the capacity to 
copies isn't as traumatic for many people as it was in days gone by.  
(E.g. consider a 500 gb hard drive with copies=2 versus a 128 gb 
SSD).  Of course if you need all that space then it is a no-go.


Space, performance, dependability: you can pick any two.



Related threads that also had ideas on using spare CPU cycles for 
brute force recovery of single bit errors using the checksum:


There is no evidence that the type of unrecoverable read errors we
see are single bit errors.  And while it is possible for an error-handling
code to correct single bit flips, multiple bit flips would remain a
large problem space.  There are error-correcting codes which can correct 
multiple flips, but they quickly become expensive.  This is one reason why 
nobody does RAID-2.

BTW, if you do have the case where unprotected data is not
readable, then I have a little DTrace script that I'd like you to run
which would help determine the extent of the corruption.  This is
one of those studies which doesn't like induced errors ;-)
http://www.richardelling.com/Home/scripts-and-programs-1/zcksummon

The data we do have suggests that magnetic hard disk failures tend
to be spatially clustered.  So there is still the problem of spatial 
diversity, which is rather nicely handled by copies, today.
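
For a single disk, that is just something like (the dataset name is only
an example):

# zfs set copies=2 tank/important

keeping in mind that the extra copies only apply to data written after the
property is set.
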
-- richard



[zfs-discuss] Booting from detached mirror disk

2009-07-08 Thread Sunil Sohani

Hi,

I have a mirrored boot disk and I am able to boot from either disk.  If I 
detach the mirror, would I be able to boot from the detached disk?

Thanks.

Sunil


Re: [zfs-discuss] Booting from detached mirror disk

2009-07-08 Thread William Bauer
Did you run installgrub on both disks:

/usr/sbin/installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/cxtydzs0

Or the equivalent.  If you can't boot from either, how did either become your 
boot disk?

If you want  to use a single mirror member disk to boot from (i.e. for 
testing), I wouldn't detach it.  Boot from it and let ZFS complain about the 
missing member of the mirror for the short term.  That's my idea, but I'm not 
an expert.  This process works for me if I'm testing something regarding my 
mirror or testing a disk or controller.  I don't know how one would boot from a 
detached disk, but other folks will have more experience.
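
A rough sketch of that non-destructive test, with made-up device names:

# installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c0t0d0s0
# installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c0t1d0s0
  (boot from the second disk via the BIOS boot menu, then check)
# zpool status rpool

If you had actually detached the disk, re-mirroring afterwards would need
something like: zpool attach rpool c0t0d0s0 c0t1d0s0.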


Re: [zfs-discuss] zpool import hangs

2009-07-08 Thread William Bauer
Just trying to help, since no one has responded...

Have you tried importing with an alternate root?  We don't know your setup, 
such as other pools, types of controllers and/or disks, or how your pool was 
constructed.

Try importing something like this:

zpool import -R /tank2 -f pool_numeric_identifier

Perhaps there is some overlap between your existing pools and bigtank, so this 
might help to track that down.  You can always export it and re-import with the 
correct root once you get to the bottom of this.
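
As a minimal sketch, running zpool import with no arguments should list the
importable pools along with their numeric ids:

# zpool import
  (note the id reported for bigtank)
# zpool import -R /tank2 -f <id from the listing>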


Re: [zfs-discuss] Booting from detached mirror disk

2009-07-08 Thread William Bauer
By the way, if you try my idea and both disks remain physically attached, both 
should be found and the mirror will be intact, regardless of which disk you 
boot from.  If one is physically disconnected, then you will have complaints 
about the missing disk, but it should still work if everything is configured 
correctly and your BIOS doesn't present you with any difficulties.