bug? ZFS crypto vs. scrub

2011-05-10 Thread Daniel Carosone
Sorry for abusing the mailing list, but I don't know how to report
bugs anymore and have no visibility of whether this is a
known/resolved issue.  So, just in case it is not...

With Solaris 11 Express, scrubbing a pool that contains encrypted
datasets whose keys are not currently loaded reports unrecoverable
read errors. The error count applies to the pool, not to any specific
device, which is also somewhat at odds with the helpful message text
for diagnostic status and suggested action:

  pool: geek
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scan: scrub repaired 0 in 3h8m with 280 errors on Tue May 10 17:12:15 2011
config:

NAME         STATE     READ WRITE CKSUM
geek         ONLINE     280     0     0
  raidz2-0   ONLINE       0     0     0
    c13t0d0  ONLINE       0     0     0
    c13t1d0  ONLINE       0     0     0
    c13t2d0  ONLINE       0     0     0
    c13t3d0  ONLINE       0     0     0
    c13t4d0  ONLINE       0     0     0
    c13t5d0  ONLINE       0     0     0
    c0t0d0   ONLINE       0     0     0
    c0t1d0   ONLINE       0     0     0
    c1t0d0   ONLINE       0     0     0
    c1t1d0   ONLINE       0     0     0


Running zpool status -v lists an error for the same two hex object IDs
in each snapshot, as in the following example:

geek/crypt@zfs-auto-snap_weekly-2011-03-28-22h39:0xfffe
geek/crypt@zfs-auto-snap_weekly-2011-03-28-22h39:0x

When this has happened previously (on this and other pools), mounting
the dataset by supplying the key and rerunning the scrub has removed
the errors.
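For reference, a rough sketch of the sequence that has cleared it before
(assuming the Solaris 11 Express key-load syntax and the dataset name from
the output above; adjust to taste):

  # zfs key -l geek/crypt       (prompts for the wrapping key/passphrase)
  # zfs mount geek/crypt
  # zpool scrub geek
  # zpool clear geek            (clear any leftover error counters)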

For some reason, I can't do that in this case (it keeps complaining
that the key is wrong). That may be a different issue that has also
happened before, which I will post about separately once I'm sure I
didn't just make a typo (twice) when first setting the key.

--
Dan.

___
zfs-crypto-discuss mailing list
zfs-crypto-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-crypto-discuss


Re: [zfs-discuss] ZFS on HP MDS 600

2011-05-10 Thread Brandon High
On Mon, May 9, 2011 at 8:33 AM, Darren Honeyball ml...@spod.net wrote:
 I'm just mulling over the best configuration for this system - our workload 
 is mostly writing millions of small files (around 50k) with occasional reads, 
 and we need to keep as much space as possible.

If space is a priority, then raidz or raidz2 are probably the best
bets. If you're going to have a lot of random iops, then mirrors are
best.

You have some control over the performance-to-space ratio with raidz by
adjusting the width of the raidz vdevs. For instance, mirrors will
provide 34TB of space and the best random iops. 24 x 3-disk raidz vdevs
will have 48TB of space but still have pretty strong random iops
performance. 13 x 5-disk raidz vdevs will give 52TB of space at the
cost of lower random iops.

Testing will help you find the best configuration for your environment.
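For illustration, those layouts map to zpool create invocations along these
lines (a sketch only; pool and device names are hypothetical, and only the
first one or two vdev groups of each layout are shown):

  # mirror pairs:
  zpool create tank mirror c2t0d0 c2t1d0 mirror c2t2d0 c2t3d0
  # 3-disk raidz vdevs:
  zpool create tank raidz c2t0d0 c2t1d0 c2t2d0 raidz c2t3d0 c2t4d0 c2t5d0
  # 5-disk raidz vdevs:
  zpool create tank raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0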

 HP's recommendation for configuring the MDS 600 with ZFS is to let the P212 
 do the raid functions (raid 1+0 is recommended here) by configuring each half 
 of the MDS 600 as a single logical drive (35 drives) and then use a basic zfs 
 pool on top to provide the zfs functionality - to me this would seem to lose 
 a lot of the error checking functions of zfs?

If you configured the two logical drives as a mirror in ZFS, then
you'd still have full protection. Your overhead would be really high
though - 3/4 of your original capacity would be used for data
protection if I understand the recommendation correctly. (You'd use
1/2 of the original capacity for RAID1 in the MDS, then 1/2 of the
remaining for the ZFS mirror.) You could use a non-redundant pool in ZFS
to reduce the overhead, but you sacrifice the self-healing properties
of ZFS when you do that.

 Another option is to use raidz and let zfs handle the smart stuff - as the 
 P212 doesn't support a true dumb JBOD function I'd need to create each drive 
 as a single raid 0 logical drive - are there any drawbacks to doing this? Or 
 would it be better to create slightly larger logical drives using say 2 
 physical drives per logical drive?

Single-device logical drives are required when you can't configure a
card or device as JBOD, and I believe it's usually the recommended
solution. Once you have the LUNs created, you can use ZFS to create
mirrors or raidz vdevs.

 I'm planning on having 2 hot spares - one in each side of the MDS 600, is it 
 also worth using a dedicated ZIL spindle or 2?

It would depend on your workload. (How's that for helpful?)

If you're experiencing a lot of synchronous writes, then a dedicated log
device will help. If you aren't seeing a lot of sync writes, it won't.
The log device doesn't have to be very large, since the ZIL is flushed
on a regular basis. From the Best Practices guide:
For a target throughput of X MB/sec and given that ZFS pushes
transaction groups every 5 seconds (and have 2 outstanding), we also
expect the ZIL to not grow beyond X MB/sec * 10 sec. So to service
100MB/sec of synchronous writes, 1 GB of log device should be
sufficient.
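As a worked example of that rule of thumb, and the command for attaching a
log device (a sketch; the pool and device names here are hypothetical):

  # 100 MB/s of sync writes * ~10 s of outstanding txgs => ~1 GB of slog
  zpool add tank log c5t0d0
  # or mirrored, so a log-device failure can't bite you:
  zpool add tank log mirror c5t0d0 c5t1d0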

If the MDS has a non-volatile cache, there should be little or no need
to use a ZIL.

However, some reports have shown ZFS with a ZIL to be faster than
using non-volatile cache. You should test performance using your
workload.

 Is it worth tweaking zfs_nocacheflush or zfs_vdev_max_pending?

As I mentioned above, if the MDS has a non-volatile cache, then
setting zfs_nocacheflush might help performance.

If you're exporting one LUN per device then you shouldn't need to
adjust the max_pending. If you're exporting larger RAID10 luns from
the MDS, then increasing the value might help for read workloads.
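If you do end up experimenting with those, they go in /etc/system (the
values below are only illustrative, and a reboot is required):

  * /etc/system: illustrative values only
  set zfs:zfs_nocacheflush = 1
  set zfs:zfs_vdev_max_pending = 32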

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] GPU acceleration of ZFS

2011-05-10 Thread Anatoly

Good day,

I think ZFS could take advantage of a GPU for SHA-256 calculation,
encryption and maybe compression. A modern video card, like the ATI HD
5xxx or 6xxx series, can calculate SHA-256 50-100 times faster than a
modern 4-core CPU.

The KGPU project for Linux shows nice results.

'zfs scrub' could then run on high-performance ZFS pools without the
checksum calculations becoming a bottleneck.

The only problem is that there are no AMD/Nvidia drivers for Solaris
that support hardware-assisted OpenCL.


Is anyone interested in it?

Best regards,
Anatoly Legkodymov.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] GPU acceleration of ZFS

2011-05-10 Thread Krunal Desai
On Tue, May 10, 2011 at 11:29 AM, Anatoly legko...@fastmail.fm wrote:
 Good day,

 I think ZFS could take advantage of a GPU for SHA-256 calculation,
 encryption and maybe compression. A modern video card, like the ATI HD 5xxx
 or 6xxx series, can calculate SHA-256 50-100 times faster than a modern
 4-core CPU.
Ignoring optimizations from SIMD extensions like SSE and friends, this
is probably true. However, the GPU also has to deal with the overhead
of data transfer to itself before it can even begin crunching data.
Granted, a PCIe Gen 2 x16 link is quite speedy, but is CPU performance
really so poor that a GPU can still outperform it? My undergrad
thesis dealt with computational acceleration utilizing CUDA, and the
datasets had to scale quite a ways before there was a noticeable
advantage in using a Tesla or similar over a bog-standard i7-920.
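For a rough sense of the CPU-only baseline before worrying about GPU
offload, something like either of these gives a ballpark SHA-256 throughput
number (a measurement sketch; assumes OpenSSL and the Solaris digest(1)
utility are available):

  $ openssl speed sha256
  $ /usr/bin/time dd if=/dev/zero bs=1024k count=1024 | digest -a sha256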

 The only problem that there is no AMD/Nvidia drivers for Solaris that
 support hardware-assisted OpenCL.
This, and keep in mind that most of the professional users here will
likely be using professional hardware, where a simple 8MB Rage XL gets
the job done thanks to the magic of out-of-band management cards and
other such facilities. Even as a home user, I have not put a
high-end video card into my machine; I use a $5 ATI PCI video card that
saw about an hour of use whilst I installed Solaris 11.

-- 
--khd
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] GPU acceleration of ZFS

2011-05-10 Thread Hung-Sheng Tsao (LaoTsao) Ph. D.


IMHO, ZFS needs to run on all kinds of hardware.
T-series CMT servers have had hardware that can help with SHA
calculation since the T1 days, but I have not seen any work in ZFS to
take advantage of it.



On 5/10/2011 11:29 AM, Anatoly wrote:

Good day,

I think ZFS could take advantage of a GPU for SHA-256 calculation,
encryption and maybe compression. A modern video card, like the ATI HD
5xxx or 6xxx series, can calculate SHA-256 50-100 times faster than a
modern 4-core CPU.

The KGPU project for Linux shows nice results.

'zfs scrub' could then run on high-performance ZFS pools without the
checksum calculations becoming a bottleneck.

The only problem is that there are no AMD/Nvidia drivers for Solaris
that support hardware-assisted OpenCL.


Is anyone interested in it?

Best regards,
Anatoly Legkodymov.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] GPU acceleration of ZFS

2011-05-10 Thread C Bergström
On Tue, May 10, 2011 at 10:29 PM, Anatoly legko...@fastmail.fm wrote:
 Good day,

 I think ZFS could take advantage of a GPU for SHA-256 calculation,
 encryption and maybe compression. A modern video card, like the ATI HD 5xxx
 or 6xxx series, can calculate SHA-256 50-100 times faster than a modern
 4-core CPU.

 The KGPU project for Linux shows nice results.

 'zfs scrub' could then run on high-performance ZFS pools without the
 checksum calculations becoming a bottleneck.

 The only problem is that there are no AMD/Nvidia drivers for Solaris that
 support hardware-assisted OpenCL.

 Is anyone interested in it?

This isn't technically true.  The NVIDIA drivers support compute, but
there are other parts of the toolchain missing.  /* I don't know about
ATI/AMD, but I'd guess they likely don't support compute across
platforms */



/* Disclaimer - The company I work for has a working HMPP compiler for
Solaris/FreeBSD and we may soon support CUDA */
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Tuning disk failure detection?

2011-05-10 Thread Ray Van Dolson
We recently had a disk fail on one of our whitebox (SuperMicro) ZFS
arrays (Solaris 10 U9).

The disk began throwing errors like this:

May  5 04:33:44 dev-zfs4 scsi: [ID 243001 kern.warning] WARNING: 
/pci@0,0/pci8086,3410@9/pci15d9,400@0 (mpt_sas0):
May  5 04:33:44 dev-zfs4  mptsas_handle_event_sync: IOCStatus=0x8000, 
IOCLogInfo=0x31110610

And errors for the drive were incrementing in iostat -En output.
Nothing was seen in fmdump.

Unfortunately, it took about three hours for ZFS (or maybe it was MPT)
to decide the drive was actually dead:

May  5 07:41:06 dev-zfs4 scsi: [ID 107833 kern.warning] WARNING: 
/scsi_vhci/disk@g5000c5002cbc76c0 (sd4):
May  5 07:41:06 dev-zfs4  drive offline

During this three hours the I/O performance on this server was pretty
bad and caused issues for us.  Once the drive failed completely, ZFS
pulled in a spare and all was well.

My question is -- is there a way to tune the MPT driver or even ZFS
itself to be more/less aggressive on what it sees as a failure
scenario?

I suppose this would have been handled differently / better if we'd
been using real Sun hardware?

Our other option is to do a better job of watching for log entries
similar to the above and either alert someone or take some sort of
automated action. I'm hoping there's a better way to tune this via
driver or ZFS settings, however.
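One knob I'm aware of is the sd command timeout, set in /etc/system, though
I'm not certain how it interacts with the mpt_sas retry logic, so treat this
purely as a sketch to test rather than a recommendation:

  * /etc/system: shorten the per-command timeout from the 60-second default
  set sd:sd_io_time = 30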

Thanks,
Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] primarycache=metadata seems to force behaviour of secondarycache=metadata

2011-05-10 Thread Brandon High
On Mon, May 9, 2011 at 2:54 PM, Tomas Ögren st...@acc.umu.se wrote:
 Slightly off topic, but we had an IBM RS/6000 43P with a PowerPC 604e
 cpu, which had about 60MB/s memory bandwidth (which is kind of bad for a
 332MHz cpu) and its disks could do 70-80MB/s or so.. in some other
 machine..

It wasn't that long ago that 66MB/s ATA was considered a waste because
no drive could use that much bandwidth. These days a slow drive has
max throughput greater than 110MB/s.

(OK, looking at some online reviews, it was about 13 years ago. Maybe
I'm just old.)

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] COW question

2011-05-10 Thread Francois Marcoux

przemol...@poczta.fm wrote:

On Fri, Jul 07, 2006 at 11:59:29AM +0800, Raymond Xiong wrote:
  

It doesn't. Page 11 of the following slides illustrates how COW
works in ZFS:

http://www.opensolaris.org/os/community/zfs/docs/zfs_last.pdf

Blocks containing active data are never overwritten in place;
instead, a new block is allocated, modified data is written to
it, and then any metadata blocks referencing it are similarly
read, reallocated, and written. To reduce the overhead of this
process, multiple updates are grouped into transaction groups,
and an intent log is used when synchronous write semantics are
required.(from http://en.wikipedia.org/wiki/ZFS)

In a snapshot scenario, COW consumes much less disk space and is
much faster.



It also says that updating the uberblock is an atomic operation. How is
that achieved?

przemol
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
  

The email thread and document below give you information about that:

http://www.opensolaris.org/jive/thread.jspa?messageID=19264#19264
http://www.opensolaris.org/os/community/zfs/docs/ondiskformatfinal.pdf

Francois.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] ZFS and Storage

2011-05-10 Thread przemol...@poczta.fm
On Thu, Jun 29, 2006 at 10:01:15AM +0200, Robert Milkowski wrote:
 Hello przemolicc,
 
 Thursday, June 29, 2006, 8:01:26 AM, you wrote:
 
 ppf On Wed, Jun 28, 2006 at 03:30:28PM +0200, Robert Milkowski wrote:
  ppf What I wanted to point out is Al's example: he wrote about 
  damaged data. Data
  ppf were damaged by firmware, _not_ the disk surface! In such a case ZFS 
  doesn't help. ZFS can
  ppf detect (and repair) errors on the disk surface, bad cables, etc. But it 
  cannot detect and repair
  ppf errors in its (ZFS) code.
  
  Not in its code but definitely in a firmware code in a controller.
 
 ppf As Jeff pointed out: if you mirror two different storage arrays.
 
 I believe it's not only that. There are some classes of problems where even
 in one array ZFS could help with firmware problems (with many controllers
 in an active-active config, like Symmetrix).

Any real example ?

przemol
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] DTrace IO provider and oracle

2011-05-10 Thread przemol...@poczta.fm
On Tue, Aug 08, 2006 at 11:33:28AM -0500, Tao Chen wrote:
 On 8/8/06, przemol...@poczta.fm przemol...@poczta.fm wrote:
 
 Hello,
 
 Solaris 10 GA + latest recommended patches:
 
 while runing dtrace:
 
 bash-3.00# dtrace -n 'io:::start {@[execname, args[2]->fi_pathname] =
 count();}'
 ...
 
   oracle  <none>  2096052
 
 How can I interpret '<none>'? Is it possible to get the full path (like in
 vim)?
 
 
 Section 27.2.3 fileinfo_t of DTrace Guide
 explains in detail why you see 'none' in many cases.
 http://www.sun.com/bigadmin/content/dtrace/d10_latest.pdf
 or
 http://docs.sun.com/app/docs/doc/817-6223/6mlkidllf?a=view
 
 The execname part can also be misleading, as many I/O activities are
 asynchronous (including but not limited to Asynchronous I/O), so whatever
 thread is running on the CPU may have nothing to do with the I/O that's occurring.
 
 This is working as designed and not a problem that is limited to ZFS, IMO.

Thanks, Tao, for the doc pointers. I hadn't noticed them.

przemol
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] raidz DEGRADED state

2011-05-10 Thread Thomas Garner
So there is no current way to specify the creation of a 3 disk raid-z
array with a known missing disk?

On 12/5/06, David Bustos david.bus...@sun.com wrote:
 Quoth Thomas Garner on Thu, Nov 30, 2006 at 06:41:15PM -0500:
  I currently have a 400GB disk that is full of data on a linux system.
  If I buy 2 more disks and put them into a raid-z'ed zfs under solaris,
  is there a generally accepted way to build a degraded array with the
  2 disks, copy the data to the new filesystem, and then move the
  original disk to complete the array?

 No, because we currently can't add disks to a raidz array.  You could
 create a mirror instead and then add in the other disk to make
 a three-way mirror, though.

 Even doing that would be dicey if you only have a single machine,
 though, since Solaris can't natively read the popular Linux filesystems.
 I believe there is freeware to do it, but nothing supported.


 David

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] raidz DEGRADED state

2011-05-10 Thread Krzys
Ah, did not see your follow up. Thanks.

Chris


On Thu, 30 Nov 2006, Cindy Swearingen wrote:

 Sorry, Bart is correct:

  If new_device is not specified, it defaults to old_device. This form
  of replacement is useful after an existing disk has failed and has
  been physically replaced. In this case, the new disk may have the
  same /dev/dsk path as the old device, even though it is actually a
  different disk. ZFS recognizes this.

 cs

 Cindy Swearingen wrote:
 One minor comment is to identify the replacement drive, like this:
 
 # zpool replace mypool2 c3t6d0 c3t7d0
 
 Otherwise, zpool will error...
 
 cs
 
 Bart Smaalders wrote:
 
 Krzys wrote:
 
 
  my drive did go bad on me, how do I replace it? I am running Solaris 10 
  U2. (By the way, I thought U3 would be out in November; will it be out 
  soon? Does anyone know?)
 
 
 [11:35:14] server11: /export/home/me  zpool status -x
   pool: mypool2
  state: DEGRADED
 status: One or more devices could not be opened.  Sufficient replicas 
 exist for
 the pool to continue functioning in a degraded state.
 action: Attach the missing device and online it using 'zpool online'.
see: http://www.sun.com/msg/ZFS-8000-D3
  scrub: none requested
 config:

  NAME        STATE     READ WRITE CKSUM
  mypool2     DEGRADED     0     0     0
    raidz     DEGRADED     0     0     0
      c3t0d0  ONLINE       0     0     0
      c3t1d0  ONLINE       0     0     0
      c3t2d0  ONLINE       0     0     0
      c3t3d0  ONLINE       0     0     0
      c3t4d0  ONLINE       0     0     0
      c3t5d0  ONLINE       0     0     0
      c3t6d0  UNAVAIL      0   679     0  cannot open
 
 errors: No known data errors
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 
 
 
 Shut down the machine, replace the drive, reboot
 and type:
 
 zpool replace mypool2 c3t6d0
 
 
 On earlier versions of ZFS I found it useful to do this
 at the login prompt; it seemed fairly memory intensive.
 
 - Bart
 
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] DTrace IO provider and oracle

2011-05-10 Thread Jim Litchfield
I use this construct to get something better than '<none>':

 args[2]->fi_pathname != "<none>" ? args[2]->fi_pathname :
     args[1]->dev_pathname
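Folded into the earlier one-liner, that might look something like this (a
sketch only, not tested here):

 # dtrace -n 'io:::start { @[execname,
     args[2]->fi_pathname != "<none>" ? args[2]->fi_pathname :
     args[1]->dev_pathname] = count(); }'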

In the latest versions of Solaris 10, you'll see IOs not directly issued by
the app show up as being owned by 'zpool-POOLNAME', where POOLNAME is the
real name of the pool.

In this case, it appears the IOs are being done by the issuing process, which
means they're almost certainly reads. If that is the case, you could capture
the pathname in the read call and pass that down to the start routine (left
as an exercise for the reader).

I also find, especially with oracle, that using the psargs string is much more
informative - curpsinfo->pr_psargs.

Jim
---



- Original Message -
From: przemol...@poczta.fm
To: zfs-discuss@opensolaris.org
Sent: Tuesday, May 10, 2011 10:27:55 AM GMT -08:00 US/Canada Pacific
Subject: Re: [zfs-discuss] DTrace IO provider and oracle

On Tue, Aug 08, 2006 at 11:33:28AM -0500, Tao Chen wrote:
 On 8/8/06, przemol...@poczta.fm przemol...@poczta.fm wrote:
 
 Hello,
 
 Solaris 10 GA + latest recommended patches:
 
 while runing dtrace:
 
 bash-3.00# dtrace -n 'io:::start {@[execname, args[2]->fi_pathname] =
 count();}'
 ...
 
   oracle  <none>  2096052
 
 How can I interpret '<none>'? Is it possible to get the full path (like in
 vim)?
 
 
 Section 27.2.3 fileinfo_t of DTrace Guide
 explains in detail why you see 'none' in many cases.
 http://www.sun.com/bigadmin/content/dtrace/d10_latest.pdf
 or
 http://docs.sun.com/app/docs/doc/817-6223/6mlkidllf?a=view
 
 The execname part can also be misleading, as many I/O activities are
 asynchronous (including but not limited to Asynchronous I/O), so whatever
 thread is running on the CPU may have nothing to do with the I/O that's occurring.
 
 This is working as designed and not a problem that is limited to ZFS, IMO.

Thanks, Tao, for the doc pointers. I hadn't noticed them.

przemol
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] fuser vs. zfs

2011-05-10 Thread Tomas Ögren
On 23 November, 2005 - Benjamin Lewis sent me these 3,0K bytes:

 Hello,
 
 I'm running Solaris Express build 27a on an amd64 machine and
 fuser(1M) isn't behaving
 as I would expect for zfs filesystems.  Various google and
...
  #fuser -c /
  /:[lots of other PIDs] 20617tm [others] 20412cm [others]
  #fuser -c /opt
  /opt:
  #
 
 Nothing at all for /opt.  So it's safe to unmount? Nope:
...
 Has anyone else seen something like this?

Try something less ancient; Solaris 10u9 reports it just fine, for
example. ZFS was pretty newborn when snv27 came out.

/Tomas
-- 
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] fuser vs. zfs

2011-05-10 Thread Tomas Ögren
On 10 May, 2011 - Tomas Ögren sent me these 0,9K bytes:

 On 23 November, 2005 - Benjamin Lewis sent me these 3,0K bytes:
 
  Hello,
  
  I'm running Solaris Express build 27a on an amd64 machine and
  fuser(1M) isn't behaving
  as I would expect for zfs filesystems.  Various google and
 ...
   #fuser -c /
   /:[lots of other PIDs] 20617tm [others] 20412cm [others]
   #fuser -c /opt
   /opt:
   #
  
  Nothing at all for /opt.  So it's safe to unmount? Nope:
 ...
  Has anyone else seen something like this?
 
 Try something less ancient; Solaris 10u9 reports it just fine, for
 example. ZFS was pretty newborn when snv27 came out.

And for someone who is able to read as well: that mail was from 2005 -
when snv27 actually was less ancient ;)

Seems like the moderator queue from yesteryear just got flushed.

Sorry for the noise from my side.

/Tomas
-- 
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Old posts to zfs-discuss

2011-05-10 Thread Bill Rushmore
Sorry for the old posts that some of you are seeing on zfs-discuss.  The 
link between Jive and Mailman was broken, so I fixed that.  However, once 
this was fixed, Jive started sending every single post from the 
zfs-discuss board on Jive to the mailing list.  Quite a few posts were sent 
before I realized what was happening and was able to kill the process.


Bill Rushmore
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Performance problem suggestions?

2011-05-10 Thread Don
I've been going through my iostat, zilstat, and other outputs all to no avail. 
None of my disks ever seem to show outrageous service times, the load on the 
box is never high, and if the darned thing is CPU bound- I'm not even sure 
where to look.

(traversing DDT blocks even if in memory, etc - and kernel times indeed are 
above 50%) as I'm zeroing deleted blocks inside the internal pool. This 
took several days already, but recovered lots of space in my main pool also...
When you say you are zeroing deleted blocks- how are you going about doing that?

Despite claims to the contrary- I can understand ZFS needing some tuning. What 
I can't understand are the baffling differences in performance I see. For 
example- after deleting a large volume- suddenly my performance will skyrocket- 
then gradually degrade- but the question is why?

I'm not running dedup. My disks seem to be largely idle. I have 8 3GHz cores 
that also seem to be idle. I seem to have enough memory. What is ZFS doing 
during this time?

Everything I've read suggests one of two possible causes- too full, or bad 
hardware. Is there anything else that might be an issue here? Another ZFS 
factor I haven't taken into account?

Space seems to be the biggest factor in my performance difference- more free 
space = more performance- but as my fullest disks are less than 70% full, and 
my emptiest disks are less than 10% full- I can't understand why space is an 
issue.

I have a few hardware errors for one of my pool disks- but we're talking about 
a very small number of errors over a long period of time. I'm considering 
replacing this disk but the pool is so slow at times I'm loath to slow it down 
further by doing a replace unless I can be more certain that is going to fix 
the problem.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Performance problem suggestions?

2011-05-10 Thread Jim Klimov
Well, as I wrote in other threads, I have a pool named "pool" on physical 
disks, and a compressed volume in this pool which I loopback-mount over iSCSI 
to make another pool named "dcpool".

When files in dcpool are deleted, blocks are not zeroed out by current ZFS 
and they are still allocated for the physical pool. Now I'm doing essentially 
this to clean up the parent pool:
# dd if=/dev/zero of=/dcpool/nodedup/bigzerofile

This file is in a non-deduped dataset, so from the point of view of dcpool it 
has a huge, growing file filled with zeroes - and its referenced blocks 
overwrite garbage left over from older deleted files and no longer referenced 
by dcpool. However, for the parent pool this is a write of a compressed 
all-zero block which does not need to be referenced, so the pool releases a 
volume block and its referencing metadata block.

This has already released over half a terabyte in my physical pool (compressed 
blocks filled with zeroes are a special case for ZFS and require no, or fewer 
than usual, reference metadata blocks) ;)
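To watch the space actually being released in the parent pool while the
zero-fill runs, plain zpool reporting is enough (a sketch using the pool
names above):

  # zpool list pool dcpool
  # zpool iostat -v pool 10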

However, since I have millions of 4KB blocks for volume data and its metadata, 
I guess fragmentation is quite high, maybe even interleaving one-to-one. One 
way or another, this dcpool never saw IO faster than, say, 15MB/s, and usually 
lingers in the 1-5MB/s range, while I can easily get 30-50MB/s in the pool in 
other datasets (with dynamic block sizes and lengthier contiguous data 
stretches).

Writes were relatively quick for the first virtual terabyte or so, but it has 
been working on the last 100GB for several days now, at several megabytes per 
minute in the dcpool iostat. There are several MB/sec of IO on the hardware 
disks backing this deletion and clean-up, however (as in my examples in the 
previous post)...

As for disks with different fill ratios - it is a commonly discussed 
performance problem. It seems to boil down to this: free space on all disks 
(actually on top-level vdevs) is considered when round-robining writes across 
stripes. Disks that have been in use longer may have very fragmented free 
space on one hand, and not much of it on the other, but ZFS still tries to 
push bits around evenly. And while it's waiting on some disks, others may be 
blocked as well. Something like that...

People on this forum have seen and reported that adding a 100MB file tanked 
their multiterabyte pool's performance, and removing the file boosted it back 
up.

I don't want to mix up other writers' findings; better to search the last 
5-10 pages of forum post headings yourself. It's within the last hundred 
threads or so, I think ;)
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Tuning disk failure detection?

2011-05-10 Thread Jim Klimov
In a recent post r-mexico wrote that they had to parse system messages and 
manually fail the drives on a similar, though different, occasion:

http://opensolaris.org/jive/message.jspa?messageID=515815#515815
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Modify stmf_sbd_lu properties

2011-05-10 Thread Jim Dunham
Don,

 Is it possible to modify the GUID associated with a ZFS volume imported into 
 STMF?
 
 To clarify- I have a ZFS volume I have imported into STMF and export via 
 iscsi. I have a number of snapshots of this volume. I need to temporarily go 
 back to an older snapshot without removing all the more recent ones. I can 
 delete the current sbd LU, clone the snapshot I want to test, and then bring 
 that back in to sbd.
 
 The problem is that you need to use sbdadm create-lu and that creates a new 
 GUID. (sbdadm import-lu on a clone will give you a metafile error).

Take a look at the command set associated with stmfadm, and you should see that 
it has taken on all sbdadm options, and more. I believe you are looking for the 
functionality associated with stmfadm offline-lu and online-lu.
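For Don's clone-swap case, a rough (untested) sketch might be: offline and
delete the current LU, then re-create it on top of the clone while forcing
the old GUID back, and bring it online again. The GUID and zvol path below
are hypothetical, and the exact property name is worth checking against the
stmfadm man page:

  # stmfadm offline-lu 600144f0000000000000000000000001
  # stmfadm delete-lu 600144f0000000000000000000000001
  # stmfadm create-lu -p guid=600144f0000000000000000000000001 \
      /dev/zvol/rdsk/tank/vol-clone
  # stmfadm online-lu 600144f0000000000000000000000001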

- Jim

 Is it possible to change the GUID of the newly imported volume to match the 
 old volume (even if that means changing the guid of the old volume first)?
 
 I had hoped this could be done by dumping the stmf_sdb_lu property from zfs 
 and setting the clones property to this value- but that does not seem to work.
 
 Changing the guid is not an option for these tests. Any ideas?
 -- 
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Tuning disk failure detection?

2011-05-10 Thread Ray Van Dolson
On Tue, May 10, 2011 at 02:42:40PM -0700, Jim Klimov wrote:
 In a recent post r-mexico wrote that they had to parse system
 messages and manually fail the drives on a similar, though
 different, occasion:
 
 http://opensolaris.org/jive/message.jspa?messageID=515815#515815

Thanks Jim, good pointer.

It sounds like our use of SATA disks is likely the problem and we'd
have better error reporting with SAS or some of the nearline SAS
drives (SATA drives with a real SAS controller on them).

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Performance problem suggestions?

2011-05-10 Thread Hung-ShengTsao (Lao Tsao) Ph.D.

It is my understanding that for fast writes you should consider a faster
device (SSD) for the ZIL, and for reads a faster device (SSD) for the L2ARC.
There have been many discussions that for virtualization (V12N) environments
RAID-1 (mirrors) is better than raidz.

On 5/10/2011 3:31 PM, Don wrote:

I've been going through my iostat, zilstat, and other outputs all to no avail. 
None of my disks ever seem to show outrageous service times, the load on the 
box is never high, and if the darned thing is CPU bound- I'm not even sure 
where to look.

(traversing DDT blocks even if in memory, etc - and kernel times indeed are above 50%) as I'm zeroing 
deleted blocks inside the internal pool. This took several days already, but 
recovered lots of space in my main pool also...
When you say you are zeroing deleted blocks- how are you going about doing that?

Despite claims to the contrary- I can understand ZFS needing some tuning. What 
I can't understand are the baffling differences in performance I see. For 
example- after deleting a large volume- suddenly my performance will skyrocket- 
then gradually degrade- but the question is why?

I'm not running dedup. My disks seem to be largely idle. I have 8 3GHz cores 
that also seem to be idle. I seem to have enough memory. What is ZFS doing 
during this time?

Everything I've read suggests one of two possible causes- too full, or bad 
hardware. Is there anything else that might be an issue here? Another ZFS 
factor I haven't taken into account?

Space seems to be the biggest factor in my performance difference- more free 
space = more performance- but as my fullest disks are less than 70% full, and 
my emptiest disks are less than 10% full- I can't understand why space is an 
issue.

I have a few hardware errors for one of my pool disks- but we're talking about 
a very small number of errors over a long period of time. I'm considering 
replacing this disk but the pool is so slow at times I'm loathe to slow it down 
further by doing a replace unless I can be more certain that is going to fix 
the problem.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Tuning disk failure detection?

2011-05-10 Thread Ray Van Dolson
On Tue, May 10, 2011 at 03:57:28PM -0700, Brandon High wrote:
 On Tue, May 10, 2011 at 9:18 AM, Ray Van Dolson rvandol...@esri.com wrote:
  My question is -- is there a way to tune the MPT driver or even ZFS
  itself to be more/less aggressive on what it sees as a failure
  scenario?
 
 You didn't mention what drives you had attached, but I'm guessing they
 were normal desktop drives.
 
 I suspect (but can't confirm) that using enterprise drives with TLER /
 ERC / CCTL would have reported the failure up the stack faster than a
 consumer drive. The drives will report an error after 7 seconds rather
 than retry for several minutes.
 
 You may be able to enable the feature on your drives, depending on the
 manufacturer and firmware revision.
 
 -B

Yup, shoulda included that.  These are regular SATA drives --
supposedly "enterprise", whatever that gives us (most likely a higher
MTBF number).

We'll probably look at going with nearline SAS drives (which only
increases cost slightly) and write a small SEC rule on our syslog server
to watch for 0x3000 errors on servers with SATA disks, so we can at
least be alerted more quickly.
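In case it's useful to anyone else, a minimal SEC rule sketch for that
(the pattern, description and recipient are placeholders to adapt to your
own syslog format):

  type=Single
  ptype=RegExp
  pattern=mptsas_handle_event_sync: IOCStatus=0x8000
  desc=possible failing disk reported by mpt_sas
  action=pipe '$0' /usr/bin/mailx -s 'mpt_sas warning' root@example.com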

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Performance problem suggestions?

2011-05-10 Thread Don
 # dd if=/dev/zero of=/dcpool/nodedup/bigzerofile
Ahh- I misunderstood your pool layout earlier. Now I see what you were doing.

People on this forum have seen and reported that adding a 100MB file tanked
their multiterabyte pool's performance, and removing the file boosted it back up.
Sadly I think several of those posts were mine or those of coworkers.

 Disks that have been in use for a longer time may have very fragmented free
 space on one hand, and not so much of it on another, but ZFS is still trying
 to push bits around evenly. And while it's waiting on some disks, others may
 be blocked as well. Something like that...
This could explain why performance would go up after a large delete but I've 
not seen large wait times for any of my disks. The service time, percent busy, 
and every other metric continues to show nearly idle disks.

If this is the problem- it would be nice if there were a simple zfs or dtrace 
query that would show it to you.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss