Re: [zfs-discuss] zfs/sol10u8 less stable than in sol10u5?

2010-02-13 Thread sean walmsley
We recently patched our X4500 from Sol10 U6 to Sol10 U8 and have not noticed 
anything like what you're seeing. We do not have any SSD devices installed.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] building zpools on device aliases

2009-11-20 Thread Sean Walmsley
Cindy:

Thanks for your reply. These units are located at a remote site
300km away, so you're right about the main issue being able to
map the OS and/or ZFS device to a physical disk. The use of
alias devices was one way we thought to make this mapping more
intuitive, although of course we'd always attempt to double-check
via the JBOD LEDs.

I know that there are big improvements in OpenSolaris related to
device enumeration (as per Eric Schrock's blog), but we're running
Solaris 10 U6, for which:

  - fmtopo -V doesn't (yet) produce any output for disks
  - fmdump doesn't produce any human-readable disk IDs, only GUIDs,
    which then have to be correlated via zdb -C

Sean 


Date: Tue, 17 Nov 2009 16:18:52 -0700
From: Cindy Swearingen cindy.swearin...@sun.com
Subject: Re: [zfs-discuss] building zpools on device aliases
To: sean walmsley s...@fpp.nuclearsafetysolutions.com
Cc: zfs-discuss@opensolaris.org

Hi Sean,

I sympathize with your intentions, but providing pseudo-names for these
disks might cause more confusion than actual help.

The c4t5... name isn't so bad. I've seen worse. :-)

Here are the issues with using the aliases:

- If a device fails on a J4200, an LED will indicate which disk has
failed but will not identify the alias name.
- To prevent confusion with drive failures, you will need to map your
aliases to the disk names, possibly pulling out the disks and relabeling
them with the alias name. You might be able to use
"/usr/lib/fm/fmd/fmtopo -V | grep disks" to do these mappings online.
- We don't know what fmdump will indicate if a disk has problems: the
/dev alias or the real disk name.
- We don't know what else might fail.

Mapping the alias names to real disk names might have to happen under
duress, such as when a disk fails. The physical disk ID on the disk itself
will be part of the expanded device name, not the alias name.

The hardest part is getting the right disks into the pool. After that, it
gets easier. ZFS does a good job of identifying the devices in a pool
and will identify which disk is having problems, as will the fmdump
command. The LEDs on the disks themselves also help with disk replacements.

The wheels are turning to make device administration easier. We're just
not there yet.

Cindy

On 11/16/09 19:17, sean walmsley wrote:
 We have a number of Sun J4200 SAS JBOD arrays which we have multipathed using
 Sun's MPxIO facility. While this is great for reliability, it results in the
 /dev/dsk device IDs changing from cXtYd0 to something virtually unreadable like
 c4t5000C5000B21AC63d0s3.
 
 Since the entries in /dev/{rdsk,dsk} are simply symbolic links anyway, would
 there be any problem with adding alias links to /devices there and building
 our zpools on them? We've tried this and it seems to work fine, producing a
 zpool status similar to the following:
 
 ...
 NAME        STATE     READ WRITE CKSUM
 vol01       ONLINE       0     0     0
   mirror    ONLINE       0     0     0
     top00   ONLINE       0     0     0
     bot00   ONLINE       0     0     0
   mirror    ONLINE       0     0     0
     top01   ONLINE       0     0     0
     bot01   ONLINE       0     0     0
 ...
 
 Here our aliases are topnn and botnn to denote the disks in the top and 
bottom JBODs.
 
 The obvious question is "what happens if the alias link disappears?" We've
 tested this, and ZFS seems to handle it quite nicely by finding the normal
 /dev/dsk link and simply working with that (although it's more difficult to get
 ZFS to use the alias again once it is recreated).
 
 If anyone can think of anything really nasty that we've missed, we'd
 appreciate knowing about it. Alternatively, if there is a better-supported
 means of having ZFS display human-readable device IDs, we're all ears :-)
 
 Perhaps an MPxIO RFE for vanity device names would be in order?

=
Sean Walmsley          sean at fpp . nuclearsafetysolutions dot com
Nuclear Safety Solutions Ltd.  416-592-4608 (V)  416-592-5528 (F)
700 University Ave M/S H04 J19, Toronto, Ontario, M5G 1X6, CANADA

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] building zpools on device aliases

2009-11-16 Thread sean walmsley
We have a number of Sun J4200 SAS JBOD arrays which we have multipathed using 
Sun's MPxIO facility. While this is great for reliability, it results in the 
/dev/dsk device IDs changing from cXtYd0 to something virtually unreadable like 
c4t5000C5000B21AC63d0s3.

Since the entries in /dev/{rdsk,dsk} are simply symbolic links anyway, would 
there be any problem with adding alias links to /devices there and building 
our zpools on them? We've tried this and it seems to work fine, producing a
zpool status similar to the following:

...
NAME        STATE     READ WRITE CKSUM
vol01       ONLINE       0     0     0
  mirror    ONLINE       0     0     0
    top00   ONLINE       0     0     0
    bot00   ONLINE       0     0     0
  mirror    ONLINE       0     0     0
    top01   ONLINE       0     0     0
    bot01   ONLINE       0     0     0
...

Here our aliases are topnn and botnn to denote the disks in the top and 
bottom JBODs.
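
For reference, the aliases were set up roughly like this (a from-memory sketch;
the /devices target shown is illustrative, so copy whatever the real /dev/dsk
entry for the MPxIO name actually points at):

  # see what the existing MPxIO entry points at ...
  ls -l /dev/dsk/c4t5000C5000B21AC63d0s0
  # ... then create alias links with the same targets
  ln -s ../../devices/scsi_vhci/disk@g5000c5000b21ac63:a     /dev/dsk/top00
  ln -s ../../devices/scsi_vhci/disk@g5000c5000b21ac63:a,raw /dev/rdsk/top00
  # repeat for bot00, top01, bot01, ... then build the pool on the alias names
  zpool create vol01 mirror top00 bot00 mirror top01 bot01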

The obvious question is "what happens if the alias link disappears?" We've
tested this, and ZFS seems to handle it quite nicely by finding the normal
/dev/dsk link and simply working with that (although it's more difficult to get
ZFS to use the alias again once it is recreated).

If anyone can think of anything really nasty that we've missed, we'd appreciate
knowing about it. Alternatively, if there is a better-supported means of having
ZFS display human-readable device IDs, we're all ears :-)

Perhaps an MPxIO RFE for vanity device names would be in order?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] cryptic vdev name from fmdump

2009-10-23 Thread sean walmsley
This morning we got a fault management message from one of our production 
servers stating that a fault in one of our pools had been detected and fixed. 
Looking into the error using fmdump gives:

fmdump -v -u 90ea244e-1ea9-4bd6-d2be-e4e7a021f006
TIME                 UUID                                 SUNW-MSG-ID
Oct 22 09:29:05.3448 90ea244e-1ea9-4bd6-d2be-e4e7a021f006 FMD-8000-4M Repaired
  100%  fault.fs.zfs.device

  Problem in: zfs://pool=vol02/vdev=179e471c0732582
     Affects: zfs://pool=vol02/vdev=179e471c0732582
         FRU: -
    Location: -

My question is: how do I relate the vdev name above (179e471c0732582) to an
actual drive? I've checked this ID against the device IDs (cXtYdZ, so obviously
no match) and against all of the disk serial numbers. I've also tried all of
the zpool list and zpool status options with no luck.

I'm sure I'm missing something obvious here, but if anyone can point me in the 
right direction I'd appreciate it!
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cryptic vdev name from fmdump

2009-10-23 Thread sean walmsley
Thanks for this information.

We have a weekly scrub schedule, but I ran another just to be sure :-) It 
completed with 0 errors.

Running fmdump -eV gives:
TIME   CLASS
fmdump: /var/fm/fmd/errlog is empty

Dumping the fault log (no -e) does give some output, but again there are no
human-readable identifiers:

... (some stuff omitted)
(start fault-list[0])
nvlist version: 0
        version = 0x0
        class = fault.fs.zfs.device
        certainty = 0x64
        asru = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x4fcdc2c9d60a5810
                vdev = 0x179e471c0732582
        (end asru)

        resource = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x4fcdc2c9d60a5810
                vdev = 0x179e471c0732582
        (end resource)

(end fault-list[0])

So, I'm still stumped.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cryptic vdev name from fmdump

2009-10-23 Thread sean walmsley
Eric and Richard - thanks for your responses.

I tried both:

  echo "::spa -c" | mdb -k
  zdb -C       (not much of a man page for this one!)

and was able to match the pool ID from the log (hex 4fcdc2c9d60a5810) with both
outputs. As Richard pointed out, I needed to convert the hex value to decimal
to get a match with the zdb output.
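
For the archives, the matching boiled down to something like this (dc is just
one way to do the hex-to-decimal conversion; any converter will do):

  # the pool/vdev GUIDs in the fmdump output are hex; zdb -C prints them in decimal
  echo "16i 4FCDC2C9D60A5810 p" | dc       # decimal form of the pool GUID above
  zdb -C | grep -i guid                    # look for that decimal value here
  echo "::spa -c" | mdb -k | grep -i guid  # the kernel's view of the same config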

In neither case, however, was I able to get a match with the disk vdev ID from
the fmdump output.

It turns out that a disk in this machine was replaced about a month ago, and
sure enough the vdev that was complaining at the time was the 0x179e471c0732582
vdev that is now missing.

What's confusing is that the fmd message I posted about is dated Oct 22, whereas
the original error and replacement happened back in September. Running "fmadm
faulty" on the machine currently doesn't return any issues.

After physically replacing the bad drive and issuing the "zpool replace"
command, I think that we probably issued the "fmadm repair <uuid>" command in
line with what Sun has asked us to do in the past. In our experience, if you
don't do this then fmd will re-issue duplicate complaints about hardware
failures after every reboot until you do. In this case, perhaps a "repair"
wasn't really the appropriate command since we actually replaced the drive.
Would an "fmadm flush" have been better? Perhaps a clean reboot is in order?

So, it looks like the root problem here is that fmd is confused rather than 
there being a real issue with ZFS. Despite this, we're happy to know that we 
can now match vdevs against physical devices using either the mdb trick or zdb.

We've followed Eric's work on ZFS device enumeration for the Fishworks project
with great interest - hopefully this will eventually get extended to the fmdump
output as suggested.

Sean Walmsley
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-13 Thread sean walmsley
Sun X4500 (thumper) with 16GB of memory, running Solaris 10 U6 with patches
current to the end of Feb 2009.

Current ARC size is ~6GB.

ZFS filesystem created in a ~3.2 TB pool consisting of 7 sets of mirrored 500GB
SATA drives.

I used 4000 8MB files for a total of 32GB.

run 1: ~140MB/s average according to zpool iostat
real4m1.11s
user0m10.44s
sys 0m50.76s

run 2: ~37MB/s average according to zpool iostat
real13m53.43s
user0m10.62s
sys 0m55.80s

A zfs unmount followed by a mount of the filesystem returned the performance to 
the run 1 case.

real3m58.16s
user0m11.54s
sys 0m51.95s

In summary, performance on the second run drops to about 30% of the original run.
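
For anyone wanting to reproduce something similar, the shape of the test was
roughly as follows (the pool/dataset paths are illustrative, and the cat-based
read pass is just a stand-in for the actual workload):

  # create 4000 x 8MB files (32GB total), then read them back twice
  i=0
  while [ $i -lt 4000 ]; do
      mkfile 8m /pool01/test/file$i
      i=`expr $i + 1`
  done
  time sh -c 'cat /pool01/test/file* > /dev/null'    # run 1: fast
  time sh -c 'cat /pool01/test/file* > /dev/null'    # run 2: ~30% of run 1
  zfs unmount pool01/test && zfs mount pool01/test   # restores run-1 speed
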
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] questions regarding RFE 6334757 and CR 6322205 disk write cache. thanks (case 11356581)

2009-07-13 Thread sean walmsley
1) Turning on write caching is potentially dangerous because the disk will 
indicate that data has been written (to cache) before it has actually been 
written to non-volatile storage (disk). Since the factory has no way of knowing 
how you'll use your T5140, I'm guessing that they set the disk write caches off 
by default.

2) Since ZFS knows about disk caches and ensures that it issues synchronous 
writes where required, it is safe to turn on write caching when the *ENTIRE* 
disk is used for ZFS. Accordingly, ZFS will attempt to turn on a disk's write 
cache whenever you add the *ENTIRE* disk to a zpool. If you add only a disk 
slice to a zpool, ZFS will not try to turn on write caching since it doesn't 
know whether other portions of the disk will be used for applications which are 
not write-cache safe.

zpool create pool01 c0t0d0     - ZFS will try to turn on the disk write cache,
                                 since it is using the entire disk

zpool create pool02 c0t0d0s1   - ZFS will not try to turn on the disk write
                                 cache (only using one slice)

To avoid future disk replacement problems (e.g. if the replacement disk is
slightly smaller), we generally create a single disk slice that takes up almost
the entire disk and then build our pools on these slices. ZFS doesn't turn on
the write cache in this case, but since we know that the disk is only being
used for ZFS we can (and do!) safely turn on the write cache manually.
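
In practice that looks something like the following (disk names are
illustrative, and the slice is sized interactively in format's partition menu):

  format -e c0t2d0        # partition -> size slice 0 to (almost) the whole disk -> label
  format -e c0t3d0        # give the mirror partner the same layout
  zpool create pool03 mirror c0t2d0s0 c0t3d0s0
  # ZFS leaves the write cache alone for slice-based vdevs, so we enable it by
  # hand afterwards -- see point 3 below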

3) You can change the write (and read) cache settings using the "cache" submenu
of the "format -e" command. If you disable the write cache where it could
safely be enabled, you will only reduce the performance of the system. If you
enable the write cache where it should not be enabled, you run the risk of data
loss and/or corruption in the event of a power loss.
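
From memory, the menu sequence looks roughly like this (the exact prompts may
differ slightly between releases):

  format -e               # pick the disk from the menu
  format> cache
  cache> write_cache
  write_cache> display    # show the current setting
  write_cache> enable     # or disable, as appropriate
  write_cache> quit
  cache> quit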

4) I wouldn't assume any particular setting for FRU parts, although I believe
that Sun parts generally ship with the write caches disabled. Better to
explicitly check using "format -e".
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] questions regarding RFE 6334757 and CR 6322205 disk write cache. thanks (case 11356581)

2009-07-13 Thread sean walmsley
Something caused my original message to get cut off. Here is the full post:

1) Turning on write caching is potentially dangerous because the disk will 
indicate that data has been written (to cache) before it has actually been 
written to non-volatile storage (disk). Since the factory has no way of knowing 
how you'll use your T5140, I'm guessing that they set the disk write caches off 
by default.

2) Since ZFS knows about disk caches and ensures that it issues synchronous 
writes where required, it is safe to turn on write caching when the *ENTIRE* 
disk is used for ZFS. Accordingly, ZFS will attempt to turn on a disk's write 
cache whenever you add the
*ENTIRE* disk to a zpool. If you add only a disk slice to a zpool, ZFS will not 
try to turn on write caching since it doesn't know whether other portions of 
the disk will be used for applications which are not write-cache safe.

zpool create pool01 c0t0d0     - ZFS will try to turn on the disk write cache,
                                 since it is using the entire disk

zpool create pool02 c0t0d0s1   - ZFS will not try to turn on the disk write
                                 cache (only using one slice)

To avoid future disk replacement problems (e.g. if the replacement disk is
slightly smaller), we generally create a single disk slice that takes up almost
the entire disk and then build our pools on these slices. ZFS doesn't turn on
the write cache in this case, but since we know that the disk is only being
used for ZFS we can (and do!) safely turn on the write cache manually.

3) You can change the write (and read) cache settings using the "cache" submenu
of the "format -e" command. If you disable the write cache where it could
safely be enabled, you will only reduce the performance of the system. If you
enable the write cache where it should not be enabled, you run the risk of data
loss and/or corruption in the event of a power loss.

4) I wouldn't assume any particular setting for FRU parts, although I believe
that Sun parts generally ship with the write caches disabled. Better to
explicitly check using "format -e".
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] sharing UFS root and ZFS pool

2008-05-12 Thread sean walmsley
Some additional information: I should have noted that the client could not see 
the thumper1 shares via the automounter.

I've played around with this setup a bit more and it appears that I can 
manually mount both filesystems (e.g. on /tmp/troot and /tmp/tpool), so the ZFS 
and UFS volumes are being shared properly, it's just the automounter that 
doesn't want to deal with both at once.
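
Specifically, both of these work when done by hand (straight NFS mounts of the
two shares):

  mkdir -p /tmp/troot /tmp/tpool
  mount -F nfs thumper1:/       /tmp/troot    # the UFS root share
  mount -F nfs thumper1:/mypool /tmp/tpool    # the ZFS share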

Does the automounter have issues with picking up UFS and ZFS volumes at the 
same time?
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] sharing UFS root and ZFS pool

2008-05-11 Thread sean walmsley
I have a server "thumper1" which exports its root (UFS) filesystem to one
specific server "hoss" via /etc/dfs/dfstab so that we can back up various system
files. When I added a ZFS pool "mypool" to this system, I shared it to hoss and
several other machines using the ZFS sharenfs property.
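
For reference, the relevant pieces are roughly as follows (the share options are
simplified and illustrative, not our exact settings):

  # /etc/dfs/dfstab on thumper1 -- the UFS root export to the backup host
  share -F nfs -o ro=hoss /

  # the ZFS pool, shared via the sharenfs property
  zfs set sharenfs=ro=hoss:otherhost1:otherhost2 mypool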

Prior to adding the pool, hoss could see thumper1's / directory. After adding
the ZFS pool, hoss could only see the thumper1:/mypool (i.e. the ZFS share)
directory. Using unshare/share and "zfs unshare"/"zfs share" in various orders, I
can get hoss to see *EITHER* thumper1's UFS root directory *OR* the ZFS mypool
directory, but never both at the same time. On thumper1, running the share
command shows either / or /mypool shared, but never both at the same time.

This is a bit of a surprise to me since sharing / and /mypool at the same time 
via dfstab would not be an issue if both filesystems were UFS (we've done this 
in the past with no problems).

Any suggestions on how to get this working would be much appreciated.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Administration

2008-04-09 Thread sean walmsley
I haven't used it myself, but the following blog describes an automatic 
snapshot facility:

http://blogs.sun.com/timf/entry/zfs_automatic_snapshots_0_10

I agree that it would be nice to have this type of functionality built into the 
base product, however.
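
In the meantime, a bare-bones stopgap is a root cron job along these lines (the
dataset name is hypothetical, and the facility linked above does this far more
cleanly):

  # crontab entry: dated snapshot of a dataset every night at 01:00
  0 1 * * * /usr/sbin/zfs snapshot mypool/home@auto-`date +\%Y\%m\%d`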
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS needs a viable backup mechanism

2007-08-28 Thread sean walmsley
We mostly rely on AMANDA, but for a simple, compressed, encrypted, 
tape-spanning alternative backup (intended for disaster recovery) we use:

tar cf - files | lzf (quick compression utility) | ssl (to encrypt) | mbuffer 
(which writes to tape and looks after tape changes)

Recovery is exactly the opposite, i.e.:

mbuffer | ssl | lzf | tar xf -

The mbuffer utility (http://www.maier-komor.de/mbuffer.html) has the ability to 
call a script to change tapes, so if you have the mtx utility you're in 
business. Mbuffer's other great advantage is that it buffers reads and/or 
writes, so you can make sure that your tape is writing in decent sized chunks 
(i.e. isn't shoe-shining) even if your system can't keep up with your shiny new 
LTO-4 drive :-)

We did have a problem with mbuffer not automatically detecting EOT on our 
drives, but since we're compressing as part of the pipeline rather than in the 
drive itself we just told mbuffer to swap tapes at ~98% of the physical tape 
size.

Since mbuffer doesn't care where your data stream comes from, I would think 
that you could easily do something like:

zfs send | mbuffer (writes to tape and looks after tape changes)

and

mbuffer (reads from tape and looks after tape changes) | zfs receive
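
Concretely, something like the following (untested; the snapshot name, tape
device, block size and buffer size are all just placeholders):

  # dump a snapshot to tape, buffering in RAM and writing in 256k blocks
  zfs send mypool/fs@weekly | mbuffer -m 2G -s 256k -o /dev/rmt/0n

  # and pull it back off tape into a new filesystem
  mbuffer -m 2G -s 256k -i /dev/rmt/0n | zfs receive mypool/fs_restored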
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss