Re: [zfs-discuss] dedupe is in

2009-11-02 Thread Ross Smith
Ok, thanks everyone then (but still thanks to Victor for the heads up)  :-)


On Mon, Nov 2, 2009 at 4:03 PM, Victor Latushkin
victor.latush...@sun.com wrote:
 On 02.11.09 18:38, Ross wrote:

 Double WOHOO!  Thanks Victor!

 Thanks should go to Tim Haley, Jeff Bonwick and George Wilson ;-)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Tunable iSCSI timeouts - ZFS over iSCSI fix

2009-07-29 Thread Ross Smith
Yup, somebody pointed that out to me last week and I can't wait :-)


On Wed, Jul 29, 2009 at 7:48 PM, Dave dave-...@dubkat.com wrote:
 Anyone (Ross?) creating ZFS pools over iSCSI connections will want to pay
 attention to snv_121 which fixes the 3 minute hang after iSCSI disk
 problems:

 http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=649

 Yay!

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-14 Thread Ross Smith
Hey guys,

I'll let this die in a sec, but I just wanted to say that I've gone
and read the on disk document again this morning, and to be honest
Richard, without the description you just wrote, I really wouldn't
have known that uberblocks are in a 128 entry circular queue that's 4x
redundant.

Please understand that I'm not asking for answers to these notes, this
post is purely to illustrate to you ZFS guys that much as I appreciate
having the ZFS docs available, they are very tough going for anybody
who isn't a ZFS developer.  I consider myself well above average in IT
ability, and I've really spent quite a lot of time in the past year
reading around ZFS, but even so I would definitely have come to the
wrong conclusion regarding uberblocks.

Richard's post I can understand really easily, but in the on disk
format docs, that information is spread over 7 pages of really quite
technical detail, and to be honest, for a user like myself raises as
many questions as it answers:

On page 6 I learn that labels are stored on each vdev, as well as each
disk.  So there will be a label on the pool, mirror (or raid group),
and disk.  I know the disk ones are at the start and end of the disk,
and it sounds like the mirror vdev is in the same place, but where is
the root vdev label?  The example given doesn't mention its location
at all.

Then, on page 7 it sounds like the entire label is overwritten whenever
on-disk data is updated - any time on-disk data is overwritten, there
is potential for error.  To me, it sounds like it's not a 128 entry
queue, but just a group of 4 labels, all of which are overwritten as
data goes to disk.

Then finally, on page 12 the uberblock is mentioned (although as an
aside, the first time I read these docs I had no idea what the
uberblock actually was).  It does say that only one uberblock is
active at a time, but with it being part of the label I'd just assume
these were overwritten as a group..

And that's why I'll often throw ideas out - I can either rely on my
own limited knowledge of ZFS to say if it will work, or I can take
advantage of the excellent community we have here, and post the idea
for all to see.  It's a quick way for good ideas to be improved upon,
and bad ideas consigned to the bin.  I've done it before in my rather
lengthy 'zfs availability' thread.  My thoughts there were thrashed
out nicely, with some quite superb additions (namely the concept of
lop-sided mirrors, which I think are a great idea).

Ross

PS.  I've also found why I thought you had to search for these blocks:
it was after reading this thread, where somebody used mdb to search a
corrupt pool to try to recover data:
http://opensolaris.org/jive/message.jspa?messageID=318009







On Fri, Feb 13, 2009 at 11:09 PM, Richard Elling
richard.ell...@gmail.com wrote:
 Tim wrote:


 On Fri, Feb 13, 2009 at 4:21 PM, Bob Friesenhahn
 bfrie...@simple.dallas.tx.us wrote:

On Fri, 13 Feb 2009, Ross Smith wrote:

However, I've just had another idea.  Since the uberblocks are pretty
vital in recovering a pool, and I believe it's a fair bit of work to
search the disk to find them, might it be a good idea to allow ZFS to
store uberblock locations elsewhere for recovery purposes?


Perhaps it is best to leave decisions on these issues to the ZFS
designers who know how things work.

Previous descriptions from people who do know how things work
didn't make it sound very difficult to find the last 20
uberblocks.  It sounded like they were at known points for any
given pool.

Those folks have surely tired of this discussion by now and are
working on actual code rather than reading idle discussion between
several people who don't know the details of how things work.



 People who don't know how things work often aren't tied down by the
 baggage of knowing how things work.  Which leads to creative solutions those
 who are weighed down didn't think of.  I don't think it hurts in the least
 to throw out some ideas.  If they aren't valid, it's not hard to ignore them
 and move on.  It surely isn't a waste of anyone's time to spend 5 minutes
 reading a response and weighing if the idea is valid or not.

 OTOH, anyone who followed this discussion the last few times, has looked
 at the on-disk format documents, or reviewed the source code would know
 that the uberblocks are kept in a 128-entry circular queue which is 4x
 redundant with 2 copies each at the beginning and end of the vdev.
 Other metadata, by default, is 2x redundant and spatially diverse.

 Clearly, the failure mode being hashed out here has resulted in the defeat
 of those protections. The only real question is how fast Jeff can roll out
 the feature to allow reverting to previous uberblocks.  The procedure for
 doing this by hand has long been known, and was posted on this forum --
 though it is tedious.
 -- richard

Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-13 Thread Ross Smith
On Fri, Feb 13, 2009 at 7:41 PM, Bob Friesenhahn
bfrie...@simple.dallas.tx.us wrote:
 On Fri, 13 Feb 2009, Ross wrote:

 Something like that will have people praising ZFS' ability to safeguard
 their data, and the way it recovers even after system crashes or when
 hardware has gone wrong.  You could even have a "common causes of this
 are..." message, or a link to an online help article if you wanted people to
 be really impressed.

 I see a career in politics for you.  Barring an operating system
 implementation bug, the type of problem you are talking about is due to
 improperly working hardware.  Irreversibly reverting to a previous
 checkpoint may or may not obtain the correct data.  Perhaps it will produce
 a bunch of checksum errors.

Yes, the root cause is improperly working hardware (or an OS bug like
6424510), but with ZFS being a copy-on-write system, when errors occur
with a recent write, the vast majority of pools out there still have
huge amounts of data that is perfectly valid and should be accessible.
Unless I'm misunderstanding something, reverting to a previous
checkpoint gets you back to a state where ZFS knows it's good (or at
least where ZFS can verify whether it's good or not).

You have to consider that even with improperly working hardware, ZFS
has been checksumming data, so if that hardware has been working for
any length of time, you *know* that the data on it is good.

Yes, if you have databases or files there that were mid-write, they
will almost certainly be corrupted.  But at least your filesystem is
back, and it's in as good a state as it's going to be given that in
order for your pool to be in this position, your hardware went wrong
mid-write.

And as an added bonus, if you're using ZFS snapshots, now that your
pool is accessible you have a bunch of backups available, so you can
probably roll corrupted files back to working versions.
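
A minimal sketch of that kind of per-file recovery, with placeholder
pool, snapshot and file names; the .zfs directory is reachable by path
even when the snapdir property is left at hidden:

# ls /tank/home/.zfs/snapshot                    # list the available snapshots
# cp /tank/home/.zfs/snapshot/daily-1/report.odt /tank/home/report.odt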

For me, that is about as good as you can get in terms of handling a
sudden hardware failure.  Everything that is known to be saved to disk
is there, you can verify (with absolute certainty) whether data is ok
or not, and you have backup copies of damaged files.  In the old days
you'd need to be reverting to tape backups for both of these, with
potentially hours of downtime before you even know where you are.
Achieving that in a few seconds (or minutes) is a massive step
forwards.

 There are already people praising ZFS' ability to safeguard their data, and
 the way it recovers even after system crashes or when hardware has gone
 wrong.

Yes there are, but the majority of these are praising the ability of
ZFS checksums to detect bad data, and to repair it when you have
redundancy in your pool.  I've not seen that many cases of people
praising ZFS' recovery ability - uberblock problems seem to have a
nasty habit of leaving you with tons of good, checksummed data on a
pool that you can't get to, and while many hardware problems are dealt
with, others can hang your entire pool.



 Bob
 ==
 Bob Friesenhahn
 bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-13 Thread Ross Smith
On Fri, Feb 13, 2009 at 8:24 PM, Bob Friesenhahn
bfrie...@simple.dallas.tx.us wrote:
 On Fri, 13 Feb 2009, Ross Smith wrote:

 You have to consider that even with improperly working hardware, ZFS
 has been checksumming data, so if that hardware has been working for
 any length of time, you *know* that the data on it is good.

 You only know this if the data has previously been read.

 Assume that the device temporarily stops physically writing, but otherwise
 responds normally to ZFS.  Then the device starts writing again (including a
 recent uberblock), but with a large gap in the writes.  Then the system
 loses power, or crashes.  What happens then?

Well, in that case you're screwed, but if ZFS is known to handle even
corrupted pools automatically, when that happens the immediate
response on the forums is going to be "something really bad has
happened to your hardware", followed by troubleshooting to find out
what.  That beats the response now, where we all know there's every
chance the data is ok and it just can't be gotten to without zdb.

Also, that's a pretty extreme situation since you'd need a device that
is being written to but not read from to fail in this exact way.  It
also needs to have no scrubbing being run, so the problem has remained
undetected.

However, even in that situation, if we assume that it happened and
that these recovery tools are available, ZFS will either report that
your pool is seriously corrupted, indicating a major hardware problem
(and ZFS can now state this with some confidence), or ZFS will be able
to open a previous uberblock, mount your pool and begin a scrub, at
which point all your missing writes will be found too and reported.
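
For completeness, the verification step described here is just a scrub
plus a status check, with a placeholder pool name:

# zpool scrub tank
# zpool status -v tank    # lists any files with unrecoverable errors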

And then you can go back to your snapshots.  :-D



 Bob
 ==
 Bob Friesenhahn
 bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-13 Thread Ross Smith
On Fri, Feb 13, 2009 at 8:24 PM, Bob Friesenhahn
bfrie...@simple.dallas.tx.us wrote:
 On Fri, 13 Feb 2009, Ross Smith wrote:

 You have to consider that even with improperly working hardware, ZFS
 has been checksumming data, so if that hardware has been working for
 any length of time, you *know* that the data on it is good.

 You only know this if the data has previously been read.

 Assume that the device temporarily stops physically writing, but otherwise
 responds normally to ZFS.  Then the device starts writing again (including a
 recent uberblock), but with a large gap in the writes.  Then the system
 loses power, or crashes.  What happens then?

Hey Bob,

Thinking about this a bit more, you've given me an idea:  Would it be
worth ZFS occasionally reading previous uberblocks from the pool, just
to check they are there and working ok?

I wonder if you could do this after a few uberblocks have been
written.  It would seem to be a good way of catching devices that
aren't writing correctly early on, as well as a way of guaranteeing
that previous uberblocks are available to roll back to should a write
go wrong.

I wonder what the upper limit for this kind of write failure is going
to be.  I've seen 30 second delays mentioned in this thread.  How
often are uberblocks written?  Is there any guarantee that we'll
always have more than 30 seconds worth of uberblocks on a drive?
Should ZFS be set so that it keeps either a given number of
uberblocks, or 5 minutes worth of uberblocks, whichever is the larger?
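
A rough back-of-envelope, resting on an assumption about the txg
interval rather than a measured figure: with the 128-slot uberblock
ring described elsewhere in this thread, and a transaction group
landing somewhere between every 5 and every 30 seconds depending on
load, those 128 slots would cover very roughly 10 minutes to an hour of
history, so it's the busy pool committing txgs quickly that shrinks the
window.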

Ross
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-13 Thread Ross Smith
You don't, but that's why I was wondering about time limits.  You have
to have a cut off somewhere, but if you're checking the last few
minutes of uberblocks that really should cope with a lot.  It seems
like a simple enough thing to implement, and if a pool still gets
corrupted with these checks in place, you can absolutely, positively
blame it on the hardware.  :D

However, I've just had another idea.  Since the uberblocks are pretty
vital in recovering a pool, and I believe it's a fair bit of work to
search the disk to find them, might it be a good idea to allow ZFS to
store uberblock locations elsewhere for recovery purposes?

This could be as simple as a USB stick plugged into the server, a
separate drive, or a network server.  I guess even the ZIL device
would work if it's separate hardware.  But knowing the locations of
the uberblocks would save yet more time should recovery be needed.



On Fri, Feb 13, 2009 at 8:59 PM, Bob Friesenhahn
bfrie...@simple.dallas.tx.us wrote:
 On Fri, 13 Feb 2009, Ross Smith wrote:

 Thinking about this a bit more, you've given me an idea:  Would it be
 worth ZFS occasionally reading previous uberblocks from the pool, just
 to check they are there and working ok?

 That sounds like a good idea.  However, how do you know for sure that the
 data returned is not returned from a volatile cache?  If the hardware is
 ignoring cache flush requests, then any data returned may be from a volatile
 cache.

 Bob
 ==
 Bob Friesenhahn
 bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed

2009-02-12 Thread Ross Smith
Heh, yeah, I've thought the same kind of thing in the past.  The
problem is that the argument doesn't really work for system admins.

As far as I'm concerned, the 7000 series is a new hardware platform,
with relatively untested drivers, running a software solution that I
know is prone to locking up when hardware faults are handled badly by
drivers.  Fair enough, that particular solution is out of our price range,
but I would still be very dubious about purchasing it.  At the very
least I'd be waiting a year for other people to work the kinks out of
the drivers.

Which is a shame, because ZFS has so many other great features it's
easily our first choice for a storage platform.  The one and only
concern we have is its reliability.  We have snv_106 running as a test
platform now.  If I felt I could trust ZFS 100% I'd roll it out
tomorrow.



On Thu, Feb 12, 2009 at 4:25 PM, Tim t...@tcsac.net wrote:


 On Thu, Feb 12, 2009 at 9:25 AM, Ross myxi...@googlemail.com wrote:

 This sounds like exactly the kind of problem I've been shouting about for
 6 months or more.  I posted a huge thread on availability on these forums
 because I had concerns over exactly this kind of hanging.

 ZFS doesn't trust hardware or drivers when it comes to your data -
 everything is checksummed.  However, when it comes to seeing whether devices
 are responding, and checking for faults, it blindly trusts whatever the
 hardware or driver tells it.  Unfortunately, that means ZFS is vulnerable to
 any unexpected bug or error in the storage chain.  I've encountered at least
 two hang conditions myself (and I'm not exactly a heavy user), and I've seen
 several others on the forums, including a few on x4500's.

 Now, I do accept that errors like this will be few and far between, but
 they still means you have the risk that a badly handled error condition can
 hang your entire server, instead of just one drive.  Solaris can handle
 things like CPUs or memory going faulty, for crying out loud.  Its RAID
 storage system had better be able to handle a disk failing.

 Sun seem to be taking the approach that these errors should be dealt with
 in the driver layer.  And while that's technically correct, a reliable
 storage system had damn well better be able to keep the server limping along
 while we wait for patches to the storage drivers.

 ZFS absolutely needs an error handling layer between the volume manager
 and the devices.  It needs to timeout items that are not responding, and it
 needs to drop bad devices if they could cause problems elsewhere.

 And yes, I'm repeating myself, but I can't understand why this is not
 being acted on.  Right now the error checking appears to be such that if an
 unexpected, or badly handled error condition occurs in the driver stack, the
 pool or server hangs, whereas the expected behavior would be for just one
 drive to fail.  The absolute worst case scenario should be that an entire
 controller has to be taken offline (and I would hope that the controllers in
 an x4500 would be running separate instances of the driver software).

 None of those conditions should be fatal; good storage designs cope
 with them all, and good error handling at the ZFS layer is absolutely vital
 when you have projects like Comstar introducing more and more types of
 storage device for ZFS to work with.

 Each extra type of storage introduces yet more software into the equation,
 and increases the risk of finding faults like this.  While they will be
 rare, they should be expected, and ZFS should be designed to handle them.


 I'd imagine for the exact same reason short-stroking/right-sizing isn't a
 concern.

 We don't have this problem in the 7000 series, perhaps you should buy one
 of those.

 ;)

 --Tim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-12 Thread Ross Smith
That would be the ideal, but really I'd settle for just improved error
handling and recovery for now.  In the longer term, disabling write
caching by default for USB or Firewire drives might be nice.


On Thu, Feb 12, 2009 at 8:35 PM, Gary Mills mi...@cc.umanitoba.ca wrote:
 On Thu, Feb 12, 2009 at 11:53:40AM -0500, Greg Palmer wrote:
 Ross wrote:
 I can also state with confidence that very, very few of the 100 staff
 working here will even be aware that it's possible to unmount a USB volume
 in windows.  They will all just pull the plug when their work is saved,
 and since they all come to me when they have problems, I think I can
 safely say that pulling USB devices really doesn't tend to corrupt
 filesystems in Windows.  Everybody I know just waits for the light on the
 device to go out.
 
 The key here is that Windows does not cache writes to the USB drive
 unless you go in and specifically enable them. It caches reads but not
 writes. If you enable them you will lose data if you pull the stick out
 before all the data is written. This is the type of safety measure that
 needs to be implemented in ZFS if it is to support the average user
 instead of just the IT professionals.

 That implies that ZFS will have to detect removable devices and treat
 them differently than fixed devices.  It might have to be an option
 that can be enabled for higher performance with reduced data security.

 --
 -Gary Mills--Unix Support--U of M Academic Computing and Networking-

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Data loss bug - sidelined??

2009-02-06 Thread Ross Smith
I can check on Monday, but the system will probably panic... which
doesn't really help :-)

Am I right in thinking failmode=wait is still the default?  If so,
that should be how it's set as this testing was done on a clean
install of snv_106.  From what I've seen, I don't think this is a
problem with the zfs failmode.  It's more of an issue of what happens
in the period *before* zfs realises there's a problem and applies the
failmode.
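
As an aside, failmode is an ordinary pool property, so checking or
switching it for the next round of testing is just (placeholder pool
name):

# zpool get failmode usbtest
# zpool set failmode=panic usbtest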

This time there was just a window of a couple of minutes during which
commands would continue.  In the past I've managed to stretch it out
to hours.

To me the biggest problems are:
- ZFS accepting writes that don't happen (from both before and after
the drive is removed)
- No logging or warning of this in zpool status

I appreciate that if you're using cache, some data loss is pretty much
inevitable when a pool fails, but that should be a few seconds worth
of data at worst, not minutes or hours worth.

Also, if a pool fails completely and there's data in the cache that
hasn't been committed to disk, it would be great if Solaris could
respond by:

- immediately dumping the cache to any (all?) working storage
- prompting the user to fix the pool, or save the cache before
powering down the system

Ross


On Fri, Feb 6, 2009 at 5:49 PM, Richard Elling richard.ell...@gmail.com wrote:
 Ross, this is a pretty good description of what I would expect when
 failmode=continue. What happens when failmode=panic?
 -- richard


 Ross wrote:

 Ok, it's still happening in snv_106:

 I plugged a USB drive into a freshly installed system, and created a
 single disk zpool on it:
 # zpool create usbtest c1t0d0

 I opened the (nautilus?) file manager in gnome, and copied the /etc/X11
 folder to it.  I then copied the /etc/apache folder to it, and at 4:05pm,
 disconnected the drive.

 At this point there are *no* warnings on screen, or any indication that
 there is a problem.  To check that the pool was still working, I created
 duplicates of the two folders on that drive.  That worked without any
 errors, although the drive was physically removed.

 4:07pm
 I ran zpool status, the pool is actually showing as unavailable, so at
 least that has happened faster than my last test.

 The folder is still open in gnome, however any attempt to copy files to or
 from it just hangs the file transfer operation window.

 4:09pm
 /usbtest is still visible in gnome
 Also, I can still open a console and use the folder:

 # cd usbtest
 # ls
  X11   X11 (copy)   apache   apache (copy)

 I also tried:
 # mv X11 X11-test

 That hung, but I saw the X11 folder disappear from the graphical file
 manager, so the system still believes something is working with this pool.

 The main GUI is actually a little messed up now.  The gnome file manager
 window looking at the /usbtest folder has hung.  Also, right-clicking the
 desktop to open a new terminal hangs, leaving the right-click menu on
 screen.

 The main menu still works though, and I can still open a new terminal.

 4:19pm
 Commands such as ls are finally hanging on the pool.

 At this point I tried to reboot, but it appears that isn't working.  I
 used system monitor to kill everything I had running and tried again, but
 that didn't help.

 I had to physically power off the system to reboot.

 After the reboot, as expected, /usbtest still exists (even though the
 drive is disconnected).  I removed that folder and connected the drive.

 ZFS detects the insertion and automounts the drive, but I find that
 although the pool is showing as online and the filesystem shows as mounted
 at /usbtest, the /usbtest directory doesn't exist.

 I had to export and import the pool to get it available, but as expected,
 I've lost data:
 # cd usbtest
 # ls
 X11

 even worse, zfs is completely unaware of this:
 # zpool status -v usbtest
  pool: usbtest
  state: ONLINE
  scrub: none requested
 config:

        NAME      STATE     READ WRITE CKSUM
        usbtest   ONLINE       0     0     0
          c1t0d0  ONLINE       0     0     0

 errors: No known data errors


 So in summary, there are a good few problems here, many of which I've
 already reported as bugs:

 1. ZFS still accepts read and write operations for a faulted pool, causing
 data loss that isn't necessarily reported by zpool status.
 2. Even after writes start to hang, it's still possible to continue
 reading data from a faulted pool.
 3. A faulted pool causes unwanted side effects in the GUI, making the
 system hard to use, and impossible to reboot.
 4. After a hard reset, ZFS does not recover cleanly.  Unused mountpoints
 are left behind.
 5. Automatic mounting of pools doesn't seem to work reliably.
  6. zpool status doesn't inform of any problems mounting the pool.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Data loss bug - sidelined??

2009-02-06 Thread Ross Smith
Something to do with cache was my first thought.  It seems to be able
to read and write from the cache quite happily for some time,
regardless of whether the pool is live.

If you're reading or writing large amounts of data, zfs starts
experiencing IO faults and offlines the pool pretty quickly.  If
you're just working with small datasets, or viewing files that you've
recently opened, it seems you can stretch it out for quite a while.

But yes, it seems that it doesn't enter failmode until the cache is
full.  I would expect it to hit this within 5 seconds (since I believe
that is how often the cache should be writing).


On Fri, Feb 6, 2009 at 7:04 PM, Brent Jones br...@servuhome.net wrote:
 On Fri, Feb 6, 2009 at 10:50 AM, Ross Smith myxi...@googlemail.com wrote:
 I can check on Monday, but the system will probably panic... which
 doesn't really help :-)

 Am I right in thinking failmode=wait is still the default?  If so,
 that should be how it's set as this testing was done on a clean
 install of snv_106.  From what I've seen, I don't think this is a
 problem with the zfs failmode.  It's more of an issue of what happens
 in the period *before* zfs realises there's a problem and applies the
 failmode.

 This time there was just a window of a couple of minutes while
 commands would continue.  In the past I've managed to stretch it out
 to hours.

 To me the biggest problems are:
 - ZFS accepting writes that don't happen (from both before and after
 the drive is removed)
 - No logging or warning of this in zpool status

 I appreciate that if you're using cache, some data loss is pretty much
 inevitable when a pool fails, but that should be a few seconds worth
 of data at worst, not minutes or hours worth.

 Also, if a pool fails completely and there's data in the cache that
 hasn't been committed to disk, it would be great if Solaris could
 respond by:

 - immediately dumping the cache to any (all?) working storage
 - prompting the user to fix the pool, or save the cache before
 powering down the system

 Ross


 On Fri, Feb 6, 2009 at 5:49 PM, Richard Elling richard.ell...@gmail.com 
 wrote:
 Ross, this is a pretty good description of what I would expect when
 failmode=continue. What happens when failmode=panic?
 -- richard


 Ross wrote:

 Ok, it's still happening in snv_106:

 I plugged a USB drive into a freshly installed system, and created a
 single disk zpool on it:
 # zpool create usbtest c1t0d0

 I opened the (nautilus?) file manager in gnome, and copied the /etc/X11
 folder to it.  I then copied the /etc/apache folder to it, and at 4:05pm,
 disconnected the drive.

 At this point there are *no* warnings on screen, or any indication that
 there is a problem.  To check that the pool was still working, I created
 duplicates of the two folders on that drive.  That worked without any
 errors, although the drive was physically removed.

 4:07pm
 I ran zpool status, the pool is actually showing as unavailable, so at
 least that has happened faster than my last test.

 The folder is still open in gnome, however any attempt to copy files to or
 from it just hangs the file transfer operation window.

 4:09pm
 /usbtest is still visible in gnome
 Also, I can still open a console and use the folder:

 # cd usbtest
 # ls
  X11   X11 (copy)   apache   apache (copy)

 I also tried:
 # mv X11 X11-test

 That hung, but I saw the X11 folder disappear from the graphical file
 manager, so the system still believes something is working with this pool.

 The main GUI is actually a little messed up now.  The gnome file manager
 window looking at the /usbtest folder has hung.  Also, right-clicking the
 desktop to open a new terminal hangs, leaving the right-click menu on
 screen.

 The main menu still works though, and I can still open a new terminal.

 4:19pm
 Commands such as ls are finally hanging on the pool.

 At this point I tried to reboot, but it appears that isn't working.  I
 used system monitor to kill everything I had running and tried again, but
 that didn't help.

 I had to physically power off the system to reboot.

 After the reboot, as expected, /usbtest still exists (even though the
 drive is disconnected).  I removed that folder and connected the drive.

 ZFS detects the insertion and automounts the drive, but I find that
 although the pool is showing as online and the filesystem shows as mounted
 at /usbtest, the /usbtest directory doesn't exist.

 I had to export and import the pool to get it available, but as expected,
 I've lost data:
 # cd usbtest
 # ls
 X11

 even worse, zfs is completely unaware of this:
 # zpool status -v usbtest
  pool: usbtest
  state: ONLINE
  scrub: none requested
 config:

        NAME      STATE     READ WRITE CKSUM
        usbtest   ONLINE       0     0     0
          c1t0d0  ONLINE       0     0     0

 errors: No known data errors


 So in summary, there are a good few problems here, many of which I've
 already reported as bugs:

 1. ZFS

Re: [zfs-discuss] Any way to set casesensitivity=mixed on the main pool?

2009-02-04 Thread Ross Smith
It's not intuitive because when you know that -o sets options, an
error message saying that it's not a valid property makes you think
that it's not possible to do what you're trying to do.

Documented and intuitive are very different things.  I do appreciate
that the details are there in the manuals, but for items like this
where it's very easy to pick the wrong one, it helps if the commands
can work with you.

The difference between -o and -O is pretty subtle, I just think that
extra sentence in the error message could save a lot of frustration
when people get mixed up.
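
For reference, a hedged example of the two forms with placeholder pool
and device names; -o sets pool properties while -O sets properties on
the pool's root filesystem:

# zpool create -o casesensitivity=mixed tank c1t0d0   # fails: not a valid pool property
# zpool create -O casesensitivity=mixed tank c1t0d0   # works: file system property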

Ross



On Wed, Feb 4, 2009 at 11:14 AM, Darren J Moffat
darr...@opensolaris.org wrote:
 Ross wrote:

 Good god.  Talk about non intuitive.  Thanks Darren!

 Why isn't that intuitive ?  It is even documented in the man page.

 zpool create [-fn] [-o property=value] ... [-O file-system-
 property=value] ... [-m mountpoint] [-R root] pool vdev ...


 Is it possible for me to suggest a quick change to the zpool error message
 in solaris?  Should I file that as an RFE?  I'm just wondering if the error
 message could be changed to something like:
 property 'casesensitivity' is not a valid pool property.  Did you mean to
 use -O?

 It's just a simple change, but it makes it obvious that it can be done,
 instead of giving the impression that it's not possible.

 Feel free to log the RFE in defect.opensolaris.org.

 --
 Darren J Moffat

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD drives in Sun Fire X4540 or X4500 for dedicated ZIL device

2009-01-23 Thread Ross Smith
That's my understanding too.  One (STEC?) drive as a write cache,
basically a write-optimised SSD, and cheaper, larger, read-optimised
SSDs for the read cache.

I thought it was an odd strategy until I read into SSDs a little more
and realised you really do have to think about your usage cases with
these.  SSDs are very definitely not all alike.
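
A hedged sketch of how those two roles map onto pool configuration,
with placeholder pool and device names:

# zpool add tank log c2t0d0      # write-optimised SSD as a separate intent log (slog)
# zpool add tank cache c3t0d0    # read-optimised SSD as an L2ARC read cache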


On Fri, Jan 23, 2009 at 4:33 PM, Greg Mason gma...@msu.edu wrote:
 If i'm not mistaken (and somebody please correct me if i'm wrong), the Sun
 7000 series storage appliances (the Fishworks boxes) use enterprise SSDs,
 with dram caching. One such product is made by STEC.

 My understanding is that the Sun appliances use one SSD for the ZIL, and one
 as a read cache. For the 7210 (which is basically a Sun Fire X4540), that
 gives you 46 disks and 2 SSDs.

 -Greg


 Bob Friesenhahn wrote:

 On Thu, 22 Jan 2009, Ross wrote:

 However, now I've written that, Sun use SATA (SAS?) SSD's in their high
 end fishworks storage, so I guess it definately works for some use cases.

 But the fishworks (Fishworks is a development team, not a product) write
 cache device is not based on FLASH.  It is based on DRAM.  The difference is
 like night and day. Apparently there can also be a read cache which is based
 on FLASH.

 Bob
 ==
 Bob Friesenhahn
 bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer, http://www.GraphicsMagick.org/

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs list improvements?

2009-01-10 Thread Ross Smith
Hmm... that's a tough one.  To me, it's a trade off either way: using
a -r parameter to specify the depth for zfs list feels more intuitive
than adding extra options to modify the -r behaviour, but I can see
your point.

But then, using -c or -d means there's an optional parameter for zfs
list that you don't have in the other commands anyway.  And would you
have to use -c or -d with -r, or would they work on their own,
providing two ways to achieve very similar functionality?

Also, now you've mentioned that you want to keep things consistent
among all the commands, keeping -c and -d free becomes more important
to me.  You don't know if you might want to use these for another
command later on.

It sounds to me as though whichever way you implement it there's going to
be some potential for confusion, but personally I'd stick with using
-r.  It leaves you with a single syntax for viewing children.  The -r
on the other commands can be modified to give an error message if they
don't support this extra parameter, and it leaves both -c and -d free
to use later on.

Ross



On Fri, Jan 9, 2009 at 7:16 PM, Richard Morris - Sun Microsystems -
Burlington United States richard.mor...@sun.com wrote:
 On 01/09/09 01:44, Ross wrote:

 Can I ask why we need to use -c or -d at all?  We already have -r to
 recursively list children, can't we add an optional depth parameter to that?

 You then have:
 zfs list : shows current level (essentially -r 0)
 zfs list -r : shows all levels (infinite recursion)
 zfs list -r 2 : shows 2 levels of children

 An optional depth argument to -r has already been suggested:
 http://mail.opensolaris.org/pipermail/zfs-discuss/2009-January/054241.html

 However, other zfs subcommands such as destroy, get, rename, and snapshot
 also provide -r options without optional depth arguments.  And it's probably
 good to keep the zfs subcommand option syntax consistent.  On the other hand,
 if all of the zfs subcommands were modified to accept an optional depth
 argument to -r, then this would not be an issue.  But, for example, the top
 level(s) of datasets cannot be destroyed if that would leave orphaned
 datasets.

 BTW, when no dataset is specified, zfs list is the same as zfs list -r
 (infinite recursion).  When a dataset is specified then it shows only the
 current level.

 Does anyone have any non-theoretical situations where a depth option other
 than 1 or 2 would be used?  Are scripts being used to work around this
 problem?

 -- Rich









___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Using zfs mirror as a simple backup mechanism for time-slider.

2008-12-22 Thread Ross Smith
On Fri, Dec 19, 2008 at 6:47 PM, Richard Elling richard.ell...@sun.com wrote:
 Ross wrote:

 Well, I really like the idea of an automatic service to manage
 send/receives to backup devices, so if you guys don't mind, I'm going to
 share some other ideas for features I think would be useful.


 cool.

 One of the first is that you need some kind of capacity management and
 snapshot deletion.  Eventually backup media are going to fill and you need
 to either prompt the user to remove snapshots, or even better, you need to
 manage the media automatically and remove old snapshots to make space for
 new ones.


 I've implemented something like this for a project I'm working on.
 Consider this a research project at this time, though I hope to
 leverage some of the things we learn as we scale up, out, and
 refine the operating procedures.

Way cool :D

 There is a failure mode lurking here.  Suppose you take two sets
 of snapshots: local and remote.  You want to do an incremental
 send, for efficiency.  So you look at the set of snapshots on both
 machines and find the latest, common snapshot.  You will then
 send the list of incrementals from the latest, common through the
 latest snapshot.  On the remote machine, if there are any other
 snapshots not in the list you are sending and newer than the latest,
 common snapshot, then the send/recv will fail.  In practice, this
 means that if you use the zfs-auto-snapshot feature, which will
 automatically destroy older snapshots as it goes (e.g. the default
 policy for frequent is take snapshots every 15 minutes, keep 4), you
 are exposed to exactly this situation.

 If you never have an interruption in your snapshot schedule, you
 can merrily cruise along and not worry about this.  But if there is
 an interruption (for maintenance, perhaps) and a snapshot is
 destroyed on the sender, then you also must make sure it gets
 destroyed on the receiver.  I just polished that code yesterday,
 and it seems to work fine... though it makes folks a little nervous.
 Anyone with an operations orientation will recognize that there
 needs to be a good process wrapped around this, but I haven't
 worked through all of the scenarios on the receiver yet.

Very true.  In this context I think this would be fine.  You would
want a warning to pop up saying that a snapshot has been deleted
locally and will have to be overwritten on the backup, but I think
that would be ok.  If necessary you could have a help page explaining
why - essentially this is a copy of your pool, not just a backup of
your files, and to work it needs an accurate copy of your snapshots.
If you wanted to be really fancy, you could have an option for the
user to view the affected files, but I think that's probably over
complicating things.

I don't suppose there's any way the remote snapshot can be cloned /
separated from the pool just in case somebody wanted to retain access
to the files within it?
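
For context, a minimal sketch of the incremental send/receive being
described, with placeholder dataset and snapshot names; -F on the
receive is what forces the target back to the most recent common
snapshot, which is where that "will have to be overwritten on the
backup" warning would come in:

# zfs send -i tank/docs@monday tank/docs@tuesday | zfs receive -F backup/docs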


 I'm thinking that a setup like time slider would work well, where you
 specify how many of each age of snapshot to keep.  But I would want to be
 able to specify different intervals for different devices.

 eg. I might want just the latest one or two snapshots on a USB disk so I
 can take my files around with me.  On a removable drive however I'd be more
 interested in preserving a lot of daily / weekly backups.  I might even have
 an archive drive that I just store monthly snapshots on.

 What would be really good would be a GUI that can estimate how much space
 is going to be taken up for any configuration.  You could use the existing
 snapshots on disk as a guide, and take an average size for each interval,
 giving you average sizes for hourly, daily, weekly, monthly, etc...


 ha ha, I almost blew coffee out my nose ;-)  I'm sure that once
 the forward time-slider functionality is implemented, it will be
 much easier to manage your storage utilization :-)  So, why am
 I giggling?  My wife just remembered that she hadn't taken her
 photos off the camera lately... 8 GByte SD cards are the vehicle
 of evil destined to wreck your capacity planning :-)

Haha, that's a great image, but I've got some food for thought even with this.

If you think about it, even though 8GB sounds a lot, it's barely over
1% of a 500GB drive, so it's not an unmanageable blip as far as
storage goes.

Also, if you're using the default settings for Tim's backups, you'll
be taking snapshots every 15 minutes, hour, day, week and month.  Now,
when you start you're not going to have any sensible averages for your
monthly snapshot sizes, but you're very rapidly going to get a set of
figures for your 15 minute snapshots.

What I would suggest is to use those to extrapolate forwards to give
very rough estimates of usage early on, with warnings as to how rough
these are.  In time these estimates will improve in accuracy, and your
8GB photo 'blip' should be relatively easily incorporated.

What you could maybe do is have a high and low usage estimate shown in
the GUI.  Early on these will be quite a 

Re: [zfs-discuss] Using zfs mirror as a simple backup mechanism for time-slider.

2008-12-18 Thread Ross Smith
 Absolutely.

 The tool shouldn't need to know that the backup disk is accessed via
 USB, or whatever.  The GUI should, however, present devices
 intelligently, not as cXtYdZ!

Yup, and that's easily achieved by simply prompting for a user
friendly name as devices are attached.  Now you could store that
locally, but it would be relatively easy to drop an XML configuration
file on the device too, allowing the same friendly name to be shown
wherever it's connected.

And this is sounding more and more like something I was thinking of
developing myself.  A proper Sun version would be much better though
(not least because I've never developed anything for Solaris!).
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Using zfs mirror as a simple backup mechanism for time-slider.

2008-12-18 Thread Ross Smith
On Thu, Dec 18, 2008 at 7:11 PM, Nicolas Williams
nicolas.willi...@sun.com wrote:
 On Thu, Dec 18, 2008 at 07:05:44PM +, Ross Smith wrote:
  Absolutely.
 
  The tool shouldn't need to know that the backup disk is accessed via
  USB, or whatever.  The GUI should, however, present devices
  intelligently, not as cXtYdZ!

 Yup, and that's easily achieved by simply prompting for a user
 friendly name as devices are attached.  Now you could store that
 locally, but it would be relatively easy to drop an XML configuration
 file on the device too, allowing the same friendly name to be shown
 wherever it's connected.

 I was thinking more something like:

  - find all disk devices and slices that have ZFS pools on them
  - show users the devices and pool names (and UUIDs and device paths in
   case of conflicts)..

I was thinking that device and pool names are too variable; you need to
be reading serial numbers or IDs from the device and link to that.
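
For what it's worth, a plain import scan already reports something
along these lines for any attached but unimported pools, showing the
pool name and its numeric id alongside the device paths:

# zpool import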

  - let the user pick one.

  - in the case that the user wants to initialize a drive to be a backup
   you need something more complex.

- one possibility is to tell the user when to attach the desired
  backup device, in which case the GUI can detect the addition and
  then it knows that that's the device to use (but be careful to
  check that the user also owns the device so that you don't pick
  the wrong one on multi-seat systems)

- another is to be much smarter about mapping topology to physical
  slots and present a picture to the user that makes sense to the
  user, so the user can click on the device they want.  This is much
  harder.

I was actually thinking of a resident service.  Tim's autobackup
script was capable of firing off backups when it detected the
insertion of a USB drive, and if you've got something sitting there
monitoring drive insertions you could have it prompt the user when new
drives are detected, asking if they should be used for backups.

Of course, you'll need some settings for this so it's not annoying if
people don't want to use it.  A simple tick box on that pop-up dialog
allowing people to say "don't ask me again" would probably do.

You'd then need a second way to assign drives if the user changed
their mind.  I'm thinking this would be to load the software and
select a drive.  Mapping to physical slots would be tricky, I think
you'd be better with a simple view that simply names the type of
interface, the drive size, and shows any current disk labels.  It
would be relatively easy then to recognise the 80GB USB drive you've
just connected.

Also, because you're formatting these drives as ZFS, you're not
restricted to just storing your backups on them.  You can create a
root pool (to contain the XML files, etc), and the backups can then be
saved to a filesystem within that.

That means the drive then functions as both a removable drive, and as
a full backup for your system.
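
A hedged sketch of that layout with placeholder names: a pool created
on the external disk, plus a dedicated filesystem to receive the
backups into.

# zpool create backupdrive c2t0d0    # pool spanning the external drive
# zfs create backupdrive/backups     # filesystem that will hold the received snapshots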
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Using zfs mirror as a simple backup mechanism for time-slider.

2008-12-18 Thread Ross Smith
 Of course, you'll need some settings for this so it's not annoying if
 people don't want to use it.  A simple tick box on that pop-up dialog
 allowing people to say "don't ask me again" would probably do.

 I would like something better than that.  "Don't ask me again" sucks
 when much, much later you want to be asked and you don't know how to get
 the system to ask you.

Only if your UI design doesn't make it easy to discover how to add
devices another way, or turn this setting back on.

My thinking is that this actually won't be the primary way of adding
devices.  It's simply there for ease of use for end users, as an easy
way for them to discover that they can use external drives to backup
their system.

Once you have a backup drive configured, most of the time you're not
going to want to be prompted for other devices.  Users will generally
setup a single external drive for backups, and won't want prompting
every time they insert a USB thumb drive, a digital camera, phone,
etc.

So you need that initial prompt to make the feature discoverable, and
then an easy and obvious way to configure backup devices later.

 You'd then need a second way to assign drives if the user changed
 their mind.  I'm thinking this would be to load the software and
 select a drive.  Mapping to physical slots would be tricky, I think
 you'd be better with a simple view that simply names the type of
 interface, the drive size, and shows any current disk labels.  It
 would be relatively easy then to recognise the 80GB USB drive you've
 just connected.

 Right, so do as I suggested: tell the user to remove the device if it's
 plugged in, then plug it in again.  That way you can know unambiguously
 (unless the user is doing this with more than one device at a time).

That's horrible from a user's point of view though.  Possibly worth
having as a last resort, but I'd rather just let the user pick the
device.  This does have potential as a "help me find my device"
feature though.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Using zfs mirror as a simple backup mechanism for time-slider.

2008-12-18 Thread Ross Smith
 I was thinking more something like:

  - find all disk devices and slices that have ZFS pools on them
  - show users the devices and pool names (and UUIDs and device paths in
  case of conflicts)..


 I was thinking that device and pool names are too variable; you need to
 be reading serial numbers or IDs from the device and link to that.


 Device names are, but there's no harm in showing them if there's
 something else that's less variable.  Pool names are not very variable
 at all.


 I was thinking of something a little different.  Don't worry about
 devices, because you don't send to a device (rather, send to a pool).
 So a simple list of source file systems and a list of destinations
 would do.  I suppose you could work up something with pictures
 and arrows, like Nautilus, but that might just be more confusing
 than useful.

True, but if this is an end user service, you want something that can
create the filesystem for them on their devices.  An advanced mode
that lets you pick any destination filesystem would be good for
network admins, but end users are just going to want to point this at
their USB drive.

 But that is the easy part.  The hard part is dealing with the plethora
 of failure modes...
 -- richard

Heh, my response to this is "who cares?" :-D

This is a high level service; it's purely concerned with "backup
succeeded" or "backup failed", possibly with an "overdue for backup"
prompt if you want to help the user manage the backups.

Any other failure modes can be dealt with by the lower level services
or by the user.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Split responsibility for data with ZFS

2008-12-15 Thread Ross Smith
Forgive me for not understanding the details, but couldn't you also
work backwards through the blocks with ZFS and attempt to recreate the
uberblock?

So if you lost the uberblock, could you (memory and time allowing)
start scanning the disk, looking for orphan blocks that aren't
referenced anywhere else and piece together the top of the tree?

Or roll back to a previous uberblock (or a snapshot uberblock), and
then look to see what blocks are on the disk but not referenced
anywhere.  Is there any way to intelligently work out where those
blocks would be linked by looking at how they interact with the known
data?

Of course, rolling back to a previous uberblock would still be a
massive step forward, and something I think would do much to improve
the perception of ZFS as a tool to reliably store data.

You cannot overstate the difference to the end user between a file
system that on boot says:
"Sorry, can't read your data pool."

And one that says:
"Whoops, the uberblock and all the backups are borked.  Would you
like to roll back to a backup uberblock, or leave the filesystem
offline to repair manually?"

As much as anything else, a simple statement explaining *why* a pool
is inaccessible, and saying just how badly things have gone wrong
helps tons.  Being able to recover anything after that is just the
icing on the cake, especially if it can be done automatically.

Ross

PS.  Sorry for the duplicate Casper, I forgot to cc the list.



On Mon, Dec 15, 2008 at 10:30 AM,  casper@sun.com wrote:

I think the problem for me is not that there's a risk of data loss if
a pool becomes corrupt, but that there are no recovery tools
available.  With UFS, people expect that if the worst happens, fsck
will be able to recover their data in most cases.

 Except, of course, that fsck lies.  It fixes the metadata, and the
 quality of the rest is unknown.

 Anyone using UFS knows that UFS file corruption is common; specifically,
 when using a UFS root and the system panics when trying to
 install a device driver, there's a good chance that some files in
 /etc are corrupt. Some were application problems (some code used
 fsync(fileno(fp)); fclose(fp); it doesn't guarantee anything)


With ZFS you have no such tools, yet Victor has on at least two occasions
shown that it's quite possible to recover pools that were completely unusable
(I believe by making use of old / backup copies of the uberblock).

 True; and certainly ZFS should be able to backtrack.  But it's
 much more likely to happen automatically than by using a recovery
 tool.

 See, fsck could only be written because specific corruptions are known
 and the patterns they have.   With ZFS, you can only go back to
 a certain uberblock, and the pattern will be a surprise.

 Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Split responsibility for data with ZFS

2008-12-15 Thread Ross Smith
I'm not sure I follow how that can happen; I thought ZFS writes were
designed to be atomic?  They either commit properly on disk or they
don't?


On Mon, Dec 15, 2008 at 6:34 PM, Bob Friesenhahn
bfrie...@simple.dallas.tx.us wrote:
 On Mon, 15 Dec 2008, Ross wrote:

 My concern is that ZFS has all this information on disk, it has the
 ability to know exactly what is and isn't corrupted, and it should (at least
 for a system with snapshots) have many, many potential uberblocks to try.
  It should be far, far better than UFS at recovering from these things, but
 for a certain class of faults, when it hits a problem it just stops dead.

 While ZFS knows if a data block is retrieved correctly from disk, a
 correctly retrieved data block does not indicate that the pool isn't
 corrupted.  A block written in the wrong order is a form of corruption.

 Bob
 ==
 Bob Friesenhahn
 bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs not yet suitable for HA applications?

2008-12-05 Thread Ross Smith
Hi Dan, replying in line:

On Fri, Dec 5, 2008 at 9:19 PM, David Anderson [EMAIL PROTECTED] wrote:
 Trying to keep this in the spotlight. Apologies for the lengthy post.

Heh, don't apologise, you should see some of my posts... o_0

 I'd really like to see features as described by Ross in his summary of the
 "Availability: ZFS needs to handle disk removal / driver failure better"
 thread (http://www.opensolaris.org/jive/thread.jspa?messageID=274031#274031 ).
 I'd like to have these/similar features as well. Has there already been
 internal discussions regarding adding this type of functionality to ZFS
 itself, and was there approval, disapproval or no decision?

 Unfortunately my situation has put me in urgent need to find workarounds in
 the meantime.

 My setup: I have two iSCSI target nodes, each with six drives exported via
 iscsi (Storage Nodes). I have a ZFS Node that logs into each target from
 both Storage Nodes and creates a mirrored Zpool with one drive from each
 Storage Node comprising each half of the mirrored vdevs (6 x 2-way mirrors).

 My problem: If a Storage Node crashes completely, is disconnected from the
 network, iscsitgt core dumps, a drive is pulled, or a drive has a problem
 accessing data (read retries), then my ZFS Node hangs while ZFS waits
 patiently for the layers below to report a problem and timeout the devices.
 This can lead to a roughly 3 minute or longer halt when reading OR writing
 to the Zpool on the ZFS node. While this is acceptable in certain
 situations, I have a case where my availability demand is more severe.

 My goal: figure out how to have the zpool pause for NO LONGER than 30
 seconds (roughly within a typical HTTP request timeout) and then issue
 reads/writes to the good devices in the zpool/mirrors while the other side
 comes back online or is fixed.

 My ideas:
  1. In the case of the iscsi targets disappearing (iscsitgt core dump,
 Storage Node crash, Storage Node disconnected from network), I need to lower
 the iSCSI login retry/timeout values. Am I correct in assuming the only way
 to accomplish this is to recompile the iscsi initiator? If so, can someone
 help point me in the right direction (I have never compiled ONNV sources -
 do I need to do this or can I just recompile the iscsi initiator)?

I believe it's possible to just recompile the initiator and install
the new driver.  I have some *very* rough notes that were sent to me
about a year ago, but I've no experience compiling anything in
Solaris, so I don't know how useful they will be.  I'll try to dig
them out in case they help.


   1.a. I'm not sure in what Initiator session states iscsi_sess_max_delay is
 applicable - only for the initial login, or also in the case of reconnect?
 Ross, if you still have your test boxes available, can you please try
 setting set iscsi:iscsi_sess_max_delay = 5 in /etc/system, reboot and try
 failing your iscsi vdevs again? I can't find a case where this was tested
 for quick failover.

Will gladly have a go at this on Monday.
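For reference, the test as I understand it (just a sketch - the 5 second
value is simply the one you suggested, and I'm assuming the variable can be
read back with mdb in the same way as iscsi_rx_max_window):

Add to /etc/system on the ZFS node, then reboot:
set iscsi:iscsi_sess_max_delay = 5

Verify after the reboot:
# echo iscsi_sess_max_delay/D | mdb -k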

1.b. I would much prefer to have bug 649 addressed and fixed rather
 than having to resort to recompiling the iscsi initiator (if
 iscsi_sess_max_delay doesn't work). This seems like a trivial feature to
 implement. How can I sponsor development?

  2. In the case of the iscsi target being reachable, but the physical disk
 is having problems reading/writing data (retryable events that take roughly
 60 seconds to timeout), should I change the iscsi_rx_max_window tunable with
 mdb? Is there a tunable for iscsi_tx? Ross, I know you tried this recently
 in the thread referenced above (with value 15), which resulted in a 60
 second hang. How did you offline the iscsi vol to test this failure? Unless
 iscsi uses a multiple of the value for retries, then maybe the way you
 failed the disk caused the iscsi system to follow a different failure path?
 Unfortunately I don't know of a way to introduce read/write retries to a
 disk while the disk is still reachable and presented via iscsitgt, so I'm
 not sure how to test this.

So far I've just been shutting down the Solaris box hosting the iSCSI
target.  Next step will involve pulling some virtual cables.
Unfortunately I don't think I've got a physical box handy to test
drive failures right now, but my previous testing (of simply pulling
drives) showed that it can be hit and miss as to how well ZFS detects
these types of 'failure'.

Like you I don't know yet how to simulate failures, so I'm doing
simple tests right now, offlining entire drives or computers.
Unfortunately I've found more than enough problems with just those
tests to keep me busy.


2.a With the fix of
 http://bugs.opensolaris.org/view_bug.do?bug_id=6518995 , we can set
 sd_retry_count along with sd_io_time to cause I/O failure when a command
 takes longer than sd_retry_count * sd_io_time. Can (or should) these
 tunables be set on the imported iscsi disks in the ZFS Node, or can/should
 they be applied only to the local disk on 
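The sd tunables mentioned there normally go in /etc/system as well.  A
sketch, with purely illustrative values (check the bug report above before
relying on them):

set sd:sd_io_time = 10
set sd:sd_retry_count = 3

As described above, that should fail an I/O once it takes longer than
sd_retry_count * sd_io_time, i.e. roughly 30 seconds with those numbers,
rather than the much longer defaults.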

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-03 Thread Ross Smith
Yeah, thanks Maurice, I just saw that one this afternoon.  I guess you
can't reboot with iscsi full stop... o_0

And I've seen the iscsi bug before (I was just too lazy to look it up
lol), I've been complaining about that since February.

In fact it's been a bad week for iscsi here, I've managed to crash the
iscsi client twice in the last couple of days too (full kernel dump
crashes), so I'll be filing a bug report on that tomorrow morning when
I get back to the office.

Ross


On Wed, Dec 3, 2008 at 7:39 PM, Maurice Volaski [EMAIL PROTECTED] wrote:
 2.  With iscsi, you can't reboot with sendtargets enabled, static
 discovery still seems to be the order of the day.

 I'm seeing this problem with static discovery:
 http://bugs.opensolaris.org/view_bug.do?bug_id=6775008.

 4.  iSCSI still has a 3 minute timeout, during which time your pool will
 hang, no matter how many redundant drives you have available.

 This is CR 649, http://bugs.opensolaris.org/view_bug.do?bug_id=649,
 which is separate from the boot time timeout, though, and also one that Sun
 so far has been unable to fix!
 --

 Maurice Volaski, [EMAIL PROTECTED]
 Computing Support, Rose F. Kennedy Center
 Albert Einstein College of Medicine of Yeshiva University

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-02 Thread Ross Smith
Hey folks,

I've just followed up on this, testing iSCSI with a raided pool, and
it still appears to be struggling when a device goes offline.

 I don't see how this could work except for mirrored pools.  Would that
 carry enough market to be worthwhile?
 -- richard


 I have to admit, I've not tested this with a raided pool, but since
 all ZFS commands hung when my iSCSI device went offline, I assumed
 that you would get the same effect of the pool hanging if a raid-z2
 pool is waiting for a response from a device.  Mirrored pools do work
 particularly well with this since it gives you the potential to have
 remote mirrors of your data, but if you had a raid-z2 pool, you still
 wouldn't want that hanging if a single device failed.


 zpool commands hanging is CR6667208, and has been fixed in b100.
 http://bugs.opensolaris.org/view_bug.do?bug_id=6667208

 I will go and test the raid scenario though on a current build, just to be
 sure.


 Please.
 -- richard


I've just created a pool using three snv_103 iscsi Targets, with a
fourth install of snv_103 collating those targets into a raidz pool,
and sharing that out over CIFS.

To test the server, while transferring files from a windows
workstation, I powered down one of the three iSCSI targets.  It took a
few minutes to shutdown, but once that happened the windows copy
halted with the error:
The specified network name is no longer available.

At this point, the zfs admin tools still work fine (which is a huge
improvement, well done!), but zpool status still reports that all
three devices are online.

A minute later, I can open the share again, and start another copy.

Thirty seconds after that, zpool status finally reports that the iscsi
device is offline.

So it looks like we have the same problems with that 3 minute delay,
with zpool status reporting wrong information, and the CIFS service
having problems too.

At this point I restarted the iSCSI target, but had problems bringing
it back online.  It appears there's a bug in the initiator, but it's
easily worked around:
http://www.opensolaris.org/jive/thread.jspa?messageID=312981#312981

What was great was that as soon as the iSCSI initiator reconnected,
ZFS started resilvering.

What might not be so great is the fact that all three devices are
showing that they've been resilvered:

# zpool status
  pool: iscsipool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 0h2m with 0 errors on Tue Dec  2 11:04:10 2008
config:

NAME   STATE READ WRITE CKSUM
iscsipool  ONLINE   0 0 0
  raidz1   ONLINE   0 0 0
c2t600144F04933FF6C5056967AC800d0  ONLINE   0
0 0  179K resilvered
c2t600144F04934FAB35056964D9500d0  ONLINE   5
9.88K 0  311M resilvered
c2t600144F04934119E50569675FF00d0  ONLINE   0
0 0  179K resilvered

errors: No known data errors

It's proving a little hard to know exactly what's happening when,
since I've only got a few seconds to log times, and there are delays
with each step.  However, I ran another test using robocopy and was
able to observe the behaviour a little more closely:

Test 2:  Using robocopy for the transfer, and iostat plus zpool status
on the server

10:46:30 - iSCSI server shutdown started
10:52:20 - all drives still online according to zpool status
10:53:30 - robocopy error - The specified network name is no longer available
 - zpool status shows all three drives as online
 - zpool iostat appears to have hung, taking much longer than the 30s
specified to return a result
 - robocopy is now retrying the file, but appears to have hung
10:54:30 - robocopy, CIFS and iostat all start working again, pretty
much simultaneously
 - zpool status now shows the drive as offline

I could probably do with using DTrace to get a better look at this,
but I haven't learnt that yet.  My guess as to what's happening would
be:

- iSCSI target goes offline
- ZFS will not be notified for 3 minutes, but I/O to that device is
essentially hung
- CIFS times out (I suspect this is on the client side with around a
30s timeout, but I can't find the timeout documented anywhere).
- zpool iostat is now waiting, I may be wrong but this doesn't appear
to have benefited from the changes to zpool status
- After 3 minutes, the iSCSI drive goes offline.  The pool carries on
with the remaining two drives, CIFS carries on working, iostat carries
on working.  zpool status however is still out of date.
- zpool status eventually catches up, and reports that the drive has
gone 

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-02 Thread Ross Smith
Hi Richard,

Thanks, I'll give that a try.  I think I just had a kernel dump while
trying to boot this system back up though, I don't think it likes it
if the iscsi targets aren't available during boot.  Again, that rings
a bell, so I'll go see if that's another known bug.

Changing that setting on the fly didn't seem to help, if anything
things are worse this time around.  I changed the timeout to 15
seconds, but didn't restart any services:

# echo iscsi_rx_max_window/D | mdb -k
iscsi_rx_max_window:
iscsi_rx_max_window:180
# echo iscsi_rx_max_window/W0t15 | mdb -kw
iscsi_rx_max_window:0xb4=   0xf
# echo iscsi_rx_max_window/D | mdb -k
iscsi_rx_max_window:
iscsi_rx_max_window:15

After making those changes, and repeating the test, offlining an iscsi
volume hung all the commands running on the pool.  I had three ssh
sessions open, running the following:
# zpool iostat -v iscsipool 10 100
# format > /dev/null
# time zpool status

They hung for what felt like a minute or so.
After that, the CIFS copy timed out.

After the CIFS copy timed out, I tried immediately restarting it.  It
took a few more seconds, but restarted no problem.  Within a few
seconds of that restarting, iostat recovered, and format returned it's
result too.

Around 30 seconds later, zpool status reported two drives, paused
again, then showed the status of the third:

# time zpool status
  pool: iscsipool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 0h0m with 0 errors on Tue Dec  2 16:39:21 2008
config:

NAME   STATE READ WRITE CKSUM
iscsipool  ONLINE   0 0 0
  raidz1   ONLINE   0 0 0
c2t600144F04933FF6C5056967AC800d0  ONLINE   0
0 0  15K resilvered
c2t600144F04934FAB35056964D9500d0  ONLINE   0
0 0  15K resilvered
c2t600144F04934119E50569675FF00d0  ONLINE   0
200 0  24K resilvered

errors: No known data errors

real3m51.774s
user0m0.015s
sys 0m0.100s

Repeating that a few seconds later gives:

# time zpool status
  pool: iscsipool
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: resilver completed after 0h0m with 0 errors on Tue Dec  2 16:39:21 2008
config:

NAME   STATE READ WRITE CKSUM
iscsipool  DEGRADED 0 0 0
  raidz1   DEGRADED 0 0 0
c2t600144F04933FF6C5056967AC800d0  ONLINE   0
0 0  15K resilvered
c2t600144F04934FAB35056964D9500d0  ONLINE   0
0 0  15K resilvered
c2t600144F04934119E50569675FF00d0  UNAVAIL  3
5.80K 0  cannot open

errors: No known data errors

real0m0.272s
user0m0.029s
sys 0m0.169s




On Tue, Dec 2, 2008 at 3:58 PM, Richard Elling [EMAIL PROTECTED] wrote:

..

 iSCSI timeout is set to 180 seconds in the client code.  The only way
 to change is to recompile it, or use mdb.  Since you have this test rig
 setup, and I don't, do you want to experiment with this timeout?
 The variable is actually called iscsi_rx_max_window so if you do
   echo iscsi_rx_max_window/D | mdb -k
 you should see 180
 Change it using something like:
   echo iscsi_rx_max_window/W0t30 | mdb -kw
 to set it to 30 seconds.
 -- richard
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-11-27 Thread Ross Smith
On Fri, Nov 28, 2008 at 5:05 AM, Richard Elling [EMAIL PROTECTED] wrote:
 Ross wrote:

 Well, you're not alone in wanting to use ZFS and iSCSI like that, and in
 fact my change request suggested that this is exactly one of the things that
 could be addressed:

 The idea is really a two stage RFE, since just the first part would have
 benefits.  The key is to improve ZFS availability, without affecting it's
 flexibility, bringing it on par with traditional raid controllers.

 A.  Track response times, allowing for lop sided mirrors, and better
 failure detection.

 I've never seen a study which shows, categorically, that disk or network
 failures are preceded by significant latency changes.  How do we get
 better failure detection from such measurements?

Not preceded by as such, but a disk or network failure will certainly
cause significant latency changes.  If the hardware is down, there's
going to be a sudden, and very large change in latency.  Sure, FMA
will catch most cases, but we've already shown that there are some
cases where it doesn't work too well (and I would argue that's always
going to be possible when you are relying on so many different types
of driver).  This is there to ensure that ZFS can handle *all* cases.


  Many people have requested this since it would facilitate remote live
 mirrors.


 At a minimum, something like VxVM's preferred plex should be reasonably
 easy to implement.

 B.  Use response times to timeout devices, dropping them to an interim
 failure mode while waiting for the official result from the driver.  This
 would prevent redundant pools hanging when waiting for a single device.


 I don't see how this could work except for mirrored pools.  Would that
 carry enough market to be worthwhile?
 -- richard

I have to admit, I've not tested this with a raided pool, but since
all ZFS commands hung when my iSCSI device went offline, I assumed
that you would get the same effect of the pool hanging if a raid-z2
pool is waiting for a response from a device.  Mirrored pools do work
particularly well with this since it gives you the potential to have
remote mirrors of your data, but if you had a raid-z2 pool, you still
wouldn't want that hanging if a single device failed.

I will go and test the raid scenario though on a current build, just to be sure.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-25 Thread Ross Smith
Hey Jeff,

Good to hear there's work going on to address this.

What did you guys think of my idea of ZFS supporting a "waiting for a
response" status for disks as an interim solution that allows the pool
to continue operation while it's waiting for FMA or the driver to
fault the drive?

I do appreciate that it's hard to come up with a definitive "it's dead,
Jim" answer, and I agree that long term the FMA approach will pay
dividends.  But I still feel this is a good short term solution, and
one that would also compliment your long term plans.

My justification for this is that it seems to me that you can split
disk behavior into two states:
- returns data ok
- doesn't return data ok

And for the state where it's not returning data, you can again split
that in two:
- returns wrong data
- doesn't return data

The first of these is already covered by ZFS with its checksums (with
FMA doing the extra work to fault drives), so it's just the second
that needs immediate attention, and for the life of me I can't think
of any situation that a simple timeout wouldn't catch.

Personally I'd love to see two parameters, allowing this behavior to
be turned on if desired, and allowing timeouts to be configured:

zfs-auto-device-timeout
zfs-auto-device-timeout-fail-delay

The first sets whether to use this feature, and configures the maximum
time ZFS will wait for a response from a device before putting it in a
waiting status.  The second would be optional and is the maximum
time ZFS will wait before faulting a device (at which point it's
replaced by a hot spare).
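To make that concrete, I'd picture setting them something like this
(purely hypothetical syntax of course, since neither property exists
today):

# zpool set zfs-auto-device-timeout=5s tank
# zpool set zfs-auto-device-timeout-fail-delay=120s tank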

The reason I think this will work well with the FMA work is that you
can implement this now and have a real improvement in ZFS
availability.  Then, as the other work starts bringing better modeling
for drive timeouts, the parameters can be either removed, or set
automatically by ZFS.

Long term I guess there's also the potential to remove the second
setting if you felt FMA etc ever got reliable enough, but personally I
would always want to have the final fail delay set.  I'd maybe set it
to a long value such as 1-2 minutes to give FMA, etc a fair chance to
find the fault.  But I'd be much happier knowing that the system will
*always* be able to replace a faulty device within a minute or two, no
matter what the FMA system finds.

The key thing is that you're not faulting devices early, so FMA is
still vital.  The idea is purely to let ZFS to keep the pool active by
removing the need for the entire pool to wait on the FMA diagnosis.

As I said before, the driver and firmware are only aware of a single
disk, and I would imagine that FMA also has the same limitation - it's
only going to be looking at a single item and trying to determine
whether it's faulty or not.  Because of that, FMA is going to be
designed to be very careful to avoid false positives, and will likely
take its time to reach an answer in some situations.

ZFS however has the benefit of knowing more about the pool, and in the
vast majority of situations, it should be possible for ZFS to read or
write from other devices while it's waiting for an 'official' result
from any one faulty component.

Ross


On Tue, Nov 25, 2008 at 8:37 AM, Jeff Bonwick [EMAIL PROTECTED] wrote:
 I think we (the ZFS team) all generally agree with you.  The current
 nevada code is much better at handling device failures than it was
 just a few months ago.  And there are additional changes that were
 made for the FishWorks (a.k.a. Amber Road, a.k.a. Sun Storage 7000)
 product line that will make things even better once the FishWorks team
 has a chance to catch its breath and integrate those changes into nevada.
 And then we've got further improvements in the pipeline.

 The reason this is all so much harder than it sounds is that we're
 trying to provide increasingly optimal behavior given a collection of
 devices whose failure modes are largely ill-defined.  (Is the disk
 dead or just slow?  Gone or just temporarily disconnected?  Does this
 burst of bad sectors indicate catastrophic failure, or just localized
 media errors?)  The disks' SMART data is notoriously unreliable, BTW.
 So there's a lot of work underway to model the physical topology of
 the hardware, gather telemetry from the devices, the enclosures,
 the environmental sensors etc, so that we can generate an accurate
 FMA fault diagnosis and then tell ZFS to take appropriate action.

 We have some of this today; it's just a lot of work to complete it.

 Oh, and regarding the original post -- as several readers correctly
 surmised, we weren't faking anything, we just didn't want to wait
 for all the device timeouts.  Because the disks were on USB, which
 is a hotplug-capable bus, unplugging the dead disk generated an
 interrupt that bypassed the timeout.  We could have waited it out,
 but 60 seconds is an eternity on stage.

 Jeff

 On Mon, Nov 24, 2008 at 10:45:18PM -0800, Ross wrote:
 But that's exactly the problem Richard:  AFAIK.

 Can you state that absolutely, 

Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-25 Thread Ross Smith
PS.  I think this also gives you a chance at making the whole problem
much simpler.  Instead of the hard question of "is this faulty?",
you're just trying to say "is it working right now?".

In fact, I'm now wondering if the "waiting for a response" flag
wouldn't be better as "possibly faulty".  That way you could use it
with checksum errors too, possibly with settings as simple as errors
per minute or error percentage.  As with the timeouts, you could
have it off by default (or provide sensible defaults), and let
administrators tweak it for their particular needs.

Imagine a pool with the following settings:
- zfs-auto-device-timeout = 5s
- zfs-auto-device-checksum-fail-limit-epm = 20
- zfs-auto-device-checksum-fail-limit-percent = 10
- zfs-auto-device-fail-delay = 120s

That would allow the pool to flag a device as possibly faulty
regardless of the type of fault, and take immediate proactive action
to safeguard data (generally long before the device is actually
faulted).

A device triggering any of these flags would be enough for ZFS to
start reading from (or writing to) other devices first, and should you
get multiple failures, or problems on a non redundant pool, you always
just revert back to ZFS' current behaviour.

Ross





On Tue, Nov 25, 2008 at 8:37 AM, Jeff Bonwick [EMAIL PROTECTED] wrote:
 I think we (the ZFS team) all generally agree with you.  The current
 nevada code is much better at handling device failures than it was
 just a few months ago.  And there are additional changes that were
 made for the FishWorks (a.k.a. Amber Road, a.k.a. Sun Storage 7000)
 product line that will make things even better once the FishWorks team
 has a chance to catch its breath and integrate those changes into nevada.
 And then we've got further improvements in the pipeline.

 The reason this is all so much harder than it sounds is that we're
 trying to provide increasingly optimal behavior given a collection of
 devices whose failure modes are largely ill-defined.  (Is the disk
 dead or just slow?  Gone or just temporarily disconnected?  Does this
 burst of bad sectors indicate catastrophic failure, or just localized
 media errors?)  The disks' SMART data is notoriously unreliable, BTW.
 So there's a lot of work underway to model the physical topology of
 the hardware, gather telemetry from the devices, the enclosures,
 the environmental sensors etc, so that we can generate an accurate
 FMA fault diagnosis and then tell ZFS to take appropriate action.

 We have some of this today; it's just a lot of work to complete it.

 Oh, and regarding the original post -- as several readers correctly
 surmised, we weren't faking anything, we just didn't want to wait
 for all the device timeouts.  Because the disks were on USB, which
 is a hotplug-capable bus, unplugging the dead disk generated an
 interrupt that bypassed the timeout.  We could have waited it out,
 but 60 seconds is an eternity on stage.

 Jeff

 On Mon, Nov 24, 2008 at 10:45:18PM -0800, Ross wrote:
 But that's exactly the problem Richard:  AFAIK.

 Can you state that absolutely, categorically, there is no failure mode out 
 there (caused by hardware faults, or bad drivers) that won't lock a drive up 
 for hours?  You can't, obviously, which is why we keep saying that ZFS 
 should have this kind of timeout feature.

 For once I agree with Miles, I think he's written a really good writeup of 
 the problem here.  My simple view on it would be this:

 Drives are only aware of themselves as an individual entity.  Their job is 
 to save & restore data to themselves, and drivers are written to minimise 
 any chance of data loss.  So when a drive starts to fail, it makes complete 
 sense for the driver and hardware to be very, very thorough about trying to 
 read or write that data, and to only fail as a last resort.

 I'm not at all surprised that drives take 30 seconds to timeout, nor that 
 they could slow a pool for hours.  That's their job.  They know nothing else 
 about the storage, they just have to do their level best to do as they're 
 told, and will only fail if they absolutely can't store the data.

 The raid controller on the other hand (Netapp / ZFS, etc) knows all about 
 the pool.  It knows if you have half a dozen good drives online, it knows if 
 there are hot spares available, and it *should* also know how quickly the 
 drives under its care usually respond to requests.

 ZFS is perfectly placed to spot when a drive is starting to fail, and to 
 take the appropriate action to safeguard your data.  It has far more 
 information available than a single drive ever will, and should be designed 
 accordingly.

 Expecting the firmware and drivers of individual drives to control the 
 failure modes of your redundant pool is just crazy imo.  You're throwing 
 away some of the biggest benefits of using multiple drives in the first 
 place.
 --
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 

Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-25 Thread Ross Smith
No, I count that as doesn't return data ok, but my post wasn't very
clear at all on that.

Even for a write, the disk will return something to indicate that the
action has completed, so that can also be covered by just those two
scenarios, and right now ZFS can lock the whole pool up if it's
waiting for that response.

My idea is simply to allow the pool to continue operation while
waiting for the drive to fault, even if that's a faulty write.  It
just means that the rest of the operations (reads and writes) can keep
working for the minute (or three) it takes for FMA and the rest of the
chain to flag a device as faulty.

For write operations, the data can be safely committed to the rest of
the pool, with just the outstanding writes for the drive left waiting.
 Then as soon as the device is faulted, the hot spare can kick in, and
the outstanding writes quickly written to the spare.

For single parity, or non redundant volumes there's some benefit in
this.  For dual parity pools there's a massive benefit as your pool
stays available, and your data is still well protected.

Ross



On Tue, Nov 25, 2008 at 10:44 AM,  [EMAIL PROTECTED] wrote:


My justification for this is that it seems to me that you can split
disk behavior into two states:
- returns data ok
- doesn't return data ok


 I think you're missing "won't write".

 There's clearly a difference between "get data from a different copy"
 (which you can fix by retrying against a different part of the redundant
 data) and writing data: the data which can't be written must be kept
 until the drive is faulted.


 Casper


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-25 Thread Ross Smith
Hmm, true.  The idea doesn't work so well if you have a lot of writes,
so there needs to be some thought as to how you handle that.

Just thinking aloud, could the missing writes be written to the log
file on the rest of the pool?  Or temporarily stored somewhere else in
the pool?  Would it be an option to allow up to a certain amount of
writes to be cached in this way while waiting for FMA, and only
suspend writes once that cache is full?

With a large SSD slog device would it be possible to just stream all
writes to the log?  As a further enhancement, might it be possible to
commit writes to the working drives, and just leave the writes for the
bad drive(s) in the slog (potentially saving a lot of space)?

For pools without log devices, I suspect that you would probably need
the administrator to specify the behavior as I can see several options
depending on the raid level and that pools priorities for data
availability / integrity:

Drive fault write cache settings:
default - pool waits for device, no writes occur until device or spare
comes online
slog - writes are cached to slog device until full, then pool reverts
to default behavior (could this be the default with slog devices
present?)
pool - writes are cached to the pool itself, up to a set maximum, and
are written to the device or spare as soon as possible.  This assumes
a single parity pool with the other devices available.  If the upper
limit is reached, or another devices goes faulty, pool reverts to
default behaviour.

Storing directly to the rest of the pool would probably want to be off
by default on single parity pools, but I would imagine that it could
be on by default on dual parity pools.
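If something like this ever got implemented, I'd imagine it surfacing as
another pool property, along the lines of (names and values entirely
hypothetical):

# zpool set fault-write-cache=slog tank
# zpool set fault-write-cache-limit=2g tank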

Would that be enough to allow writes to continue in most circumstances
while the pool waits for FMA?

Ross



On Tue, Nov 25, 2008 at 10:55 AM,  [EMAIL PROTECTED] wrote:


My idea is simply to allow the pool to continue operation while
waiting for the drive to fault, even if that's a faulty write.  It
just means that the rest of the operations (reads and writes) can keep
working for the minute (or three) it takes for FMA and the rest of the
chain to flag a device as faulty.

 Except when you're writing a lot; 3 minutes can cause a 20GB backlog
 for a single disk.

 Casper


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-25 Thread Ross Smith
 The shortcomings of timeouts have been discussed on this list before. How do
 you tell the difference between a drive that is dead and a path that is just
 highly loaded?

A path that is dead is either returning bad data, or isn't returning
anything.  A highly loaded path is by definition reading & writing
lots of data.  I think you're assuming that these are file level
timeouts, when this would actually need to be much lower level.


 Sounds good - devil, meet details, etc.

Yup, I imagine there are going to be a few details to iron out, many
of which will need looking at by somebody a lot more technical than
myself.

Despite that I still think this is a discussion worth having.  So far
I don't think I've seen any situation where this would make things
worse than they are now, and I can think of plenty of cases where it
would be a huge improvement.

Of course, it also probably means a huge amount of work to implement.
I'm just hoping that it's not prohibitively difficult, and that the
ZFS team see the benefits as being worth it.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-25 Thread Ross Smith
I disagree Bob, I think this is a very different function to that
which FMA provides.

As far as I know, FMA doesn't have access to the big picture of pool
configuration that ZFS has, so why shouldn't ZFS use that information
to increase the reliability of the pool while still using FMA to
handle device failures?

The flip side of the argument is that ZFS already checks the data
returned by the hardware.  You might as well say that FMA should deal
with that too since it's responsible for all hardware failures.

The role of ZFS is to manage the pool, availability should be part and
parcel of that.


On Tue, Nov 25, 2008 at 3:57 PM, Bob Friesenhahn
[EMAIL PROTECTED] wrote:
 On Tue, 25 Nov 2008, Ross Smith wrote:

 Good to hear there's work going on to address this.

 What did you guys think of my idea of ZFS supporting a "waiting for a
 response" status for disks as an interim solution that allows the pool
 to continue operation while it's waiting for FMA or the driver to
 fault the drive?

 A stable and sane system never comes with two brains.  It is wrong to put
 this sort of logic into ZFS when ZFS is already depending on FMA to make the
 decisions and Solaris already has an infrastructure to handle faults.  The
 more appropriate solution is that this feature should be in FMA.

 Bob
 ==
 Bob Friesenhahn
 [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] questions on zfs send,receive,backups

2008-11-03 Thread Ross Smith
 Snapshots are not replacements for traditional backup/restore features.
 If you need the latter, use what is currently available on the market.
 -- richard

I'd actually say snapshots do a better job in some circumstances.
Certainly they're being used that way by the desktop team:
http://blogs.sun.com/erwann/entry/zfs_on_the_desktop_zfs

None of this is stuff I'm after personally btw.  This was just my
attempt to interpret the request of the OP.

Although having said that, the ability to restore single files as fast
as you can restore a whole snapshot would be a nice feature.  Is that
something that would be possible?

Say you had a ZFS filesystem containing a 20GB file, with a recent
snapshot.  Is it technically feasible to restore that file by itself
in the same way a whole filesystem is rolled back with zfs rollback?
If the file still existed, would this be a case of redirecting the
file's top level block (dnode?) to the one from the snapshot?  If the
file had been deleted, could you just copy that one block?

Is it that simple, or is there a level of interaction between files
and snapshots that I've missed (I've glanced through the tech specs,
but I'm a long way from fully understanding them).

Ross
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] questions on zfs send,receive,backups

2008-11-03 Thread Ross Smith
 If the file still existed, would this be a case of redirecting the
 file's top level block (dnode?) to the one from the snapshot?  If the
 file had been deleted, could you just copy that one block?

 Is it that simple, or is there a level of interaction between files
 and snapshots that I've missed (I've glanced through the tech specs,
 but I'm a long way from fully understanding them).


 It is as simple as a cp, or drag-n-drop in Nautilus.  The snapshot is
 read-only, so
 there is no need to cp, as long as you don't want to modify it or destroy
 the snapshot.
 -- richard

But that's missing the point here, which was that we want to restore
this file without having to copy the entire thing back.

Doing a cp or a drag-n-drop creates a new copy of the file, taking
time to restore, and allocating extra blocks.  Not a problem for small
files, but not ideal if you're say using ZFS to store virtual
machines, and want to roll back a single 20GB file from a 400GB
filesystem.
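To be concrete, the copy-based restore we're trying to avoid is just
something like this (filesystem and file names made up):

# cp /tank/vms/.zfs/snapshot/nightly/guest.vmdk /tank/vms/guest.vmdk

That works, but it copies and re-allocates the whole 20GB, which is
exactly the part we'd like to skip.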

My question was whether it's technically feasible to roll back a
single file using the approach used for restoring snapshots, making it
an almost instantaneous operation?

ie:  If a snapshot exists that contains the file you want, you know
that all the relevant blocks are already on disk.  You don't want to
copy all of the blocks, but since ZFS follows a tree structure,
couldn't you restore the file by just restoring the one master block
for that file?

I'm just thinking that if it's technically feasible, I might raise an
RFE for this.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] questions on zfs send,receive,backups

2008-11-03 Thread Ross Smith
Hi Darren,

That's storing a dump of a snapshot on external media, but files
within it are not directly accessible.  The work Tim et all are doing
is actually putting a live ZFS filesystem on external media and
sending snapshots to it.

A live ZFS filesystem is far more useful (and reliable) than a dump,
and having the ability to restore individual files from that would be
even better.

It still doesn't help the OP, but I think that's what he was after.

Ross



On Mon, Nov 3, 2008 at 9:55 AM, Darren J Moffat [EMAIL PROTECTED] wrote:
 Ross wrote:

 Ok, I see where you're coming from now, but what you're talking about
 isn't zfs send / receive.  If I'm interpreting correctly, you're talking
 about a couple of features, neither of which is in ZFS yet, and I'd need the
 input of more technical people to know if they are possible.

 1.  The ability to restore individual files from a snapshot, in the same
 way an entire snapshot is restored - simply using the blocks that are
 already stored.

 2.  The ability to store (and restore from) snapshots on external media.

 What makes you say this doesn't work ?  Exactly what do you mean here
 because this will work:

$ zfs send [EMAIL PROTECTED] | dd of=/dev/tape

 Sure it might not be useful and I don't think that is what you mean here, so
 can you expand on "store snapshots on external media"?

 --
 Darren J Moffat

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Disabling COMMIT at NFS level, or disabling ZIL on a per-filesystem basis

2008-10-23 Thread Ross Smith
No problem.  I didn't use mirrored slogs myself, but that's certainly
a step up for reliability.

It's pretty easy to create a boot script to re-create the ramdisk and
re-attach it to the pool too.  So long as you use the same device name
for the ramdisk you can add it each time with a simple zpool replace
pool ramdisk
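
A minimal sketch of such a boot script, assuming a 512MB ramdisk called
slog and a pool called tank (names made up):

#!/sbin/sh
# recreate the ramdisk - it doesn't survive a reboot
ramdiskadm -a slog 512m
# re-attach it in place of the now-missing log device
zpool replace tank /dev/ramdisk/slog

As long as it runs before anything writes to the pool, the slog is back in
place on every boot.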


On Thu, Oct 23, 2008 at 1:56 PM, Constantin Gonzalez
[EMAIL PROTECTED] wrote:
 Hi,

 yes, using slogs is the best solution.

 Meanwhile, using mirrored slogs from other servers' RAM-disks running on UPSs
 seems like an interesting idea, if the reliability of UPS-backed RAM is deemed
 reliable enough for the purposes of the NFS server.

 Thanks for suggesting this!

 Cheers,
   Constantin

 Ross wrote:

 Well, it might be even more of a bodge than disabling the ZIL, but how
 about:

 - Create a 512MB ramdisk, use that for the ZIL
 - Buy a Micro Memory nvram PCI card for £100 or so.
 - Wait 3-6 months, hopefully buy a fully supported PCI-e SSD to replace
 the Micro Memory card.

 The ramdisk isn't an ideal solution, but provided you don't export the
 pool with it offline, it does work.  We used it as a stop gap solution for a
 couple of weeks while waiting for a Micro Memory nvram card.

 Our reasoning was that our server's on a UPS and we figured if something
 crashed badly enough to take out something like the UPS, the motherboard,
 etc, we'd be losing data anyway.  We just made sure we had good backups in
 case the pool got corrupted and crossed our fingers.

 The reason I say wait 3-6 months is that there's a huge amount of activity
 with SSD's at the moment.  Sun said that they were planning to have flash
 storage launched by Christmas, so I figure there's a fair chance that we'll
 see some supported PCIe cards by next Spring.
 --
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

 --
 Constantin Gonzalez  Sun Microsystems GmbH,
 Germany
 Principal Field Technologist
  http://blogs.sun.com/constantin
 Tel.: +49 89/4 60 08-25 91
 http://google.com/search?q=constantin+gonzalez

 Sitz d. Ges.: Sun Microsystems GmbH, Sonnenallee 1, 85551
 Kirchheim-Heimstetten
 Amtsgericht Muenchen: HRB 161028
 Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Dr. Roland Boemer
 Vorsitzender des Aufsichtsrates: Martin Haering

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving zfs send performance

2008-10-16 Thread Ross Smith

 Try to separate the two things:
 (1) Try /dev/zero -> mbuffer ---> network ---> mbuffer > /dev/null
 That should give you wirespeed
I tried that already.  It still gets just 10-11MB/s from this server.
I can get zfs send / receive and mbuffer working at 30MB/s though from a couple 
of test servers (with much lower specs).
 
 (2) Try zfs send | mbuffer > /dev/null
 That should give you an idea how fast zfs send really is locally.
Hmm, that's better than 10MB/s, but the average is still only around 20MB/s:
summary:  942 MByte in 47.4 sec - average of 19.9 MB/s
 
I think that points to another problem though as the send mbuffer is 100% full. 
 Certainly the pool itself doesn't appear under any strain at all while this is 
going on:
 
                capacity     operations    bandwidth
pool          used  avail   read  write   read  write
-----------  -----  -----  -----  -----  -----  -----
rc-pool       732G  1.55T    171     85  21.3M  1.01M
  mirror      144G   320G     38      0  4.78M      0
    c1t1d0       -      -      6      0   779K      0
    c1t2d0       -      -     17      0  2.17M      0
    c2t1d0       -      -     14      0  1.85M      0
  mirror      146G   318G     39      0  4.89M      0
    c1t3d0       -      -     20      0  2.50M      0
    c2t2d0       -      -     13      0  1.63M      0
    c2t0d0       -      -      6      0   779K      0
  mirror      146G   318G     34      0  4.35M      0
    c2t3d0       -      -     19      0  2.39M      0
    c1t5d0       -      -      7      0  1002K      0
    c1t4d0       -      -      7      0  1002K      0
  mirror      148G   316G     23      0  2.93M      0
    c2t4d0       -      -      8      0  1.09M      0
    c2t5d0       -      -      6      0   890K      0
    c1t6d0       -      -      7      0  1002K      0
  mirror      148G   316G     35      0  4.35M      0
    c1t7d0       -      -      6      0   779K      0
    c2t6d0       -      -     12      0  1.52M      0
    c2t7d0       -      -     17      0  2.07M      0
  c3d1p0        12K   504M      0     85      0  1.01M
-----------  -----  -----  -----  -----  -----  -----

Especially when compared to the zfs send stats on my backup server which
managed 30MB/s via mbuffer (being received on a single virtual SATA disk):

                capacity     operations    bandwidth
pool          used  avail   read  write   read  write
-----------  -----  -----  -----  -----  -----  -----
rpool        5.12G  42.6G      0      5      0  27.1K
  c4t0d0s0   5.12G  42.6G      0      5      0  27.1K
-----------  -----  -----  -----  -----  -----  -----
zfspool       431G  4.11T    261      0  31.4M      0
  raidz2      431G  4.11T    261      0  31.4M      0
    c4t1d0       -      -    155      0  6.28M      0
    c4t2d0       -      -    155      0  6.27M      0
    c4t3d0       -      -    155      0  6.27M      0
    c4t4d0       -      -    155      0  6.27M      0
    c4t5d0       -      -    155      0  6.27M      0
-----------  -----  -----  -----  -----  -----  -----
The really ironic thing is that the 30MB/s send / receive was sending to a 
virtual SATA disk which is stored (via sync NFS) on the server I'm having 
problems with...
 
Ross
 
 
_
Win New York holidays with Kellogg’s  Live Search
http://clk.atdmt.com/UKM/go/111354033/direct/01/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving zfs send performance

2008-10-16 Thread Ross Smith


Oh dear god.  Sorry folks, it looks like the new hotmail really doesn't play 
well with the list.  Trying again in plain text:
 
 
 Try to separate the two things:
 
 (1) Try /dev/zero -> mbuffer ---> network ---> mbuffer > /dev/null
 That should give you wirespeed
 
I tried that already.  It still gets just 10-11MB/s from this server.
I can get zfs send / receive and mbuffer working at 30MB/s though from a couple 
of test servers (with much lower specs).
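
For reference, the wirespeed test I ran was along these lines (the port
number is arbitrary, so treat it as a sketch):

receiver# mbuffer -I 10001 -s 128k -m 512M > /dev/null
sender#   cat /dev/zero | mbuffer -s 128k -m 512M -O receiver:10001

and that's the setup that only manages 10-11MB/s from this box.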
 
 (2) Try zfs send | mbuffer > /dev/null
 That should give you an idea how fast zfs send really is locally.
 
Hmm, that's better than 10MB/s, but the average is still only around 20MB/s:
summary:  942 MByte in 47.4 sec - average of 19.9 MB/s
 
I think that points to another problem though as the send mbuffer is 100% full. 
 Certainly the pool itself doesn't appear under any strain at all while this is 
going on:
 
                capacity     operations    bandwidth
pool          used  avail   read  write   read  write
-----------  -----  -----  -----  -----  -----  -----
rc-pool       732G  1.55T    171     85  21.3M  1.01M
  mirror      144G   320G     38      0  4.78M      0
    c1t1d0       -      -      6      0   779K      0
    c1t2d0       -      -     17      0  2.17M      0
    c2t1d0       -      -     14      0  1.85M      0
  mirror      146G   318G     39      0  4.89M      0
    c1t3d0       -      -     20      0  2.50M      0
    c2t2d0       -      -     13      0  1.63M      0
    c2t0d0       -      -      6      0   779K      0
  mirror      146G   318G     34      0  4.35M      0
    c2t3d0       -      -     19      0  2.39M      0
    c1t5d0       -      -      7      0  1002K      0
    c1t4d0       -      -      7      0  1002K      0
  mirror      148G   316G     23      0  2.93M      0
    c2t4d0       -      -      8      0  1.09M      0
    c2t5d0       -      -      6      0   890K      0
    c1t6d0       -      -      7      0  1002K      0
  mirror      148G   316G     35      0  4.35M      0
    c1t7d0       -      -      6      0   779K      0
    c2t6d0       -      -     12      0  1.52M      0
    c2t7d0       -      -     17      0  2.07M      0
  c3d1p0        12K   504M      0     85      0  1.01M
-----------  -----  -----  -----  -----  -----  -----

Especially when compared to the zfs send stats on my backup server which
managed 30MB/s via mbuffer (being received on a single virtual SATA disk):

                capacity     operations    bandwidth
pool          used  avail   read  write   read  write
-----------  -----  -----  -----  -----  -----  -----
rpool        5.12G  42.6G      0      5      0  27.1K
  c4t0d0s0   5.12G  42.6G      0      5      0  27.1K
-----------  -----  -----  -----  -----  -----  -----
zfspool       431G  4.11T    261      0  31.4M      0
  raidz2      431G  4.11T    261      0  31.4M      0
    c4t1d0       -      -    155      0  6.28M      0
    c4t2d0       -      -    155      0  6.27M      0
    c4t3d0       -      -    155      0  6.27M      0
    c4t4d0       -      -    155      0  6.27M      0
    c4t5d0       -      -    155      0  6.27M      0
-----------  -----  -----  -----  -----  -----  -----
The really ironic thing is that the 30MB/s send / receive was sending to a 
virtual SATA disk which is stored (via sync NFS) on the server I'm having 
problems with...
 
Ross

 

 Date: Thu, 16 Oct 2008 14:27:49 +0200
 From: [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 CC: zfs-discuss@opensolaris.org
 Subject: Re: [zfs-discuss] Improving zfs send performance
 
 Hi Ross
 
 Ross wrote:
 Now though I don't think it's network at all. The end result from that 
 thread is that we can't see any errors in the network setup, and using 
 nicstat and NFS I can show that the server is capable of 50-60MB/s over the 
 gigabit link. Nicstat also shows clearly that both zfs send / receive and 
 mbuffer are only sending 1/5 of that amount of data over the network.
 
 I've completely run out of ideas of my own (but I do half expect there's a 
 simple explanation I haven't thought of). Can anybody think of a reason why 
 both zfs send / receive and mbuffer would be so slow?
 
 Try to separate the two things:
 
  (1) Try /dev/zero -> mbuffer ---> network ---> mbuffer > /dev/null
 
 That should give you wirespeed
 
  (2) Try zfs send | mbuffer > /dev/null
 
 That should give you an idea how fast zfs send really is locally.
 
 Carsten
_
Get all your favourite content with the slick new MSN Toolbar - FREE
http://clk.atdmt.com/UKM/go/111354027/direct/01/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving zfs send performance

2008-10-15 Thread Ross Smith

I'm using 2008-05-07 (latest stable), am I right in assuming that one is ok?


 Date: Wed, 15 Oct 2008 13:52:42 +0200
 From: [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]; zfs-discuss@opensolaris.org
 Subject: Re: [zfs-discuss] Improving zfs send performance
 
 Thomas Maier-Komor schrieb:
 BTW: I release a new version of mbuffer today.
 
 WARNING!!!
 
 Sorry people!!!
 
 The latest version of mbuffer has a regression that can CORRUPT output
 if stdout is used. Please fall back to the last version. A fix is on the
 way...
 
 - Thomas

_
Discover Bird's Eye View now with Multimap from Live Search
http://clk.atdmt.com/UKM/go/111354026/direct/01/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving zfs send performance

2008-10-15 Thread Ross Smith

Thanks, that got it working.  I'm still only getting 10MB/s, so it's not solved 
my problem - I've still got a bottleneck somewhere, but mbuffer is a huge 
improvement over standard zfs send / receive.  It makes such a difference when 
you can actually see what's going on.



 Date: Wed, 15 Oct 2008 12:08:14 +0200
 From: [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]; zfs-discuss@opensolaris.org
 Subject: Re: [zfs-discuss] Improving zfs send performance
 
 Ross schrieb:
 Hi,
 
 I'm just doing my first proper send/receive over the network and I'm getting 
 just 9.4MB/s over a gigabit link.  Would you be able to provide an example 
 of how to use mbuffer / socat with ZFS for a Solaris beginner?
 
 thanks,
 
 Ross
 --
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 
 receiver mbuffer -I sender:1 -s 128k -m 512M | zfs receive
 
 sender zfs send mypool/[EMAIL PROTECTED] | mbuffer -s 128k -m
 512M -O receiver:1
 
 BTW: I release a new version of mbuffer today.
 
 HTH,
 Thomas

_
Make a mini you and download it into Windows Live Messenger
http://clk.atdmt.com/UKM/go/111354029/direct/01/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Mirrors braindead?

2008-10-07 Thread Ross Smith

Oh cool, that's great news.  Thanks Eric.



 Date: Tue, 7 Oct 2008 11:50:08 -0700
 From: [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 CC: zfs-discuss@opensolaris.org
 Subject: Re: [zfs-discuss] ZFS Mirrors braindead?
 
 On Tue, Oct 07, 2008 at 11:42:57AM -0700, Ross wrote:
 
 Running zpool status is a complete no no if your array is degraded
 in any way.  This is capable of locking up zfs even when it would
 otherwise have recovered itself.  If you had zpool status hang, this
 probably happened to you.
 
 FYI, this is bug 6667208 fixed in build 100 of nevada.
 
 - Eric
 
 --
 Eric Schrock, Fishworkshttp://blogs.sun.com/eschrock

_
Discover Bird's Eye View now with Multimap from Live Search
http://clk.atdmt.com/UKM/go/111354026/direct/01/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-09-02 Thread Ross Smith

Thinking about it, we could make use of this too.  The ability to add a
remote iSCSI mirror to any pool without sacrificing local performance
could be a huge benefit.
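
In practice that should just be an attach of the remote LUN as an extra
mirror to an existing local disk, something along these lines (target name
and device names made up):

# iscsiadm add static-config iqn.1986-03.com.sun:02:remotelog,192.168.10.20
# zpool attach tank c1t0d0 c5t600144F0494F525353303100d0

The sticking point today is purely how the pool behaves when that remote
device goes away.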


 From: [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 CC: [EMAIL PROTECTED]; zfs-discuss@opensolaris.org
 Subject: Re: Availability: ZFS needs to handle disk removal / driver failure 
 better
 Date: Fri, 29 Aug 2008 09:15:41 +1200
 
 Eric Schrock writes:
  
  A better option would be to not use this to perform FMA diagnosis, but
  instead work into the mirror child selection code.  This has already
  been alluded to before, but it would be cool to keep track of latency
  over time, and use this to both a) prefer one drive over another when
  selecting the child and b) proactively timeout/ignore results from one
  child and select the other if it's taking longer than some historical
  standard deviation.  This keeps away from diagnosing drives as faulty,
  but does allow ZFS to make better choices and maintain response times.
  It shouldn't be hard to keep track of the average and/or standard
  deviation and use it for selection; proactively timing out the slow I/Os
  is much trickier. 
  
 This would be a good solution to the remote iSCSI mirror configuration.  
 I've been working through this situation with a client (we have been 
 comparing ZFS with Cleversafe) and we'd love to be able to get the read 
 performance of the local drives from such a pool. 
 
  As others have mentioned, things get more difficult with writes.  If I
  issue a write to both halves of a mirror, should I return when the first
  one completes, or when both complete?  One possibility is to expose this
  as a tunable, but any such best effort RAS is a little dicey because
  you have very little visibility into the state of the pool in this
  scenario - is my data protected? becomes a very difficult question to
  answer. 
  
 One solution (again, to be used with a remote mirror) is the three way 
 mirror.  If two devices are local and one remote, data is safe once the two 
 local writes return.  I guess the issue then changes from is my data safe 
 to how safe is my data.  I would be reluctant to deploy a remote mirror 
 device without local redundancy, so this probably won't be an uncommon 
 setup.  There would have to be an acceptable window of risk when local data 
 isn't replicated. 
 
 Ian

_
Make a mini you and download it into Windows Live Messenger
http://clk.atdmt.com/UKM/go/111354029/direct/01/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] EMC - top of the table for efficiency, how well would ZFS do?

2008-08-31 Thread Ross Smith

Hey Tim,

I'll admit I just quoted the blog without checking, I seem to remember the 
sales rep I spoke to recommending putting aside 20-50% of my disk for 
snapshots.  Compared to ZFS where I don't need to reserve any space it feels 
very old fashioned.  With ZFS, snapshots just take up as much space as I want 
them to.

The problem though for our usage with NetApp was that we actually couldn't 
reserve enough space for snapshots.  50% of the pool was their maximum, and 
we're interested in running ten years worth of snapshots here, which could see 
us with a pool with just 10% of live data and 90% of the space taken up by 
snapshots.  The NetApp approach was just too restrictive.

Ross


 Date: Sun, 31 Aug 2008 08:08:09 -0700
 From: [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]; zfs-discuss@opensolaris.org
 Subject: Re: [zfs-discuss] EMC - top of the table for efficiency, how well 
 would ZFS do?
 
 Netapp does NOT recommend 100 percent.  Perhaps you should talk to
 netapp or one of their partners who know their tech instead of their
 competitors next time.
 
 Zfs, the way its currently implemented will require roughly the same
 as netapp... Which still isn't 100.
 
 
 
 On 8/30/08, Ross [EMAIL PROTECTED] wrote:
  Just saw this blog post linked from the register, it's EMC pointing out that
  their array wastes less disk space than either HP or NetApp.  I'm loving the
  10% of space they have to reserve for snapshots, and you can't add more o_0.
 
  HP similarly recommend 20% of reserved space for snapshots, and NetApp
  recommend a whopping 100% (that was one reason we didn't buy NetApp
  actually).
 
  Could anybody say how ZFS would match up to these figures?  I'd have thought
  a 14+2 raid-z2 scheme similar to NetApp's would probably be fairest.
 
  http://chucksblog.typepad.com/chucks_blog/2008/08/your-storage-mi.html
 
  Ross
  --
  This message posted from opensolaris.org
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 

_
Make a mini you on Windows Live Messenger!
http://clk.atdmt.com/UKM/go/107571437/direct/01/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] EMC - top of the table for efficiency, how well would ZFS do?

2008-08-31 Thread Ross Smith

Dear god.  Thanks Tim, that's useful info.

The sales rep we spoke to was really trying quite hard to persuade us that 
NetApp was the best solution for us, they spent a couple of months working with 
us, but ultimately we were put off because of those 'limitations'.  They knew 
full well that those were two of our major concerns, but never had an answer 
for us.  That was a big part of the reason we started seriously looking into 
ZFS instead of NetApp.

If nothing else at least I now know a firm to avoid when buying NetApp...

Date: Sun, 31 Aug 2008 11:06:16 -0500
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: Re: [zfs-discuss] EMC - top of the table for efficiency, how well 
would ZFS do?
CC: zfs-discuss@opensolaris.org



On Sun, Aug 31, 2008 at 10:39 AM, Ross Smith [EMAIL PROTECTED] wrote:






Hey Tim,

I'll admit I just quoted the blog without checking, I seem to remember the 
sales rep I spoke to recommending putting aside 20-50% of my disk for 
snapshots.  Compared to ZFS where I don't need to reserve any space it feels 
very old fashioned.  With ZFS, snapshots just take up as much space as I want 
them to.

Your sales rep was an idiot then.  Snapshot reserve isn't required at all; it 
isn't necessary in order to take snapshots.  It's simply a portion of space out of a 
volume that can only be used for snapshots, live data cannot enter into this 
space.  Snapshots, however, can exist on a volume with no snapshot reserve.  
They are in no way limited to the snapshot reserve you've set. Snapshot 
reserve is a guaranteed minimum amount of space out of a volume.  You can set 
it 90% as you mention below, and it will work just fine.


ZFS is no different than NetApp when it comes to snapshots.  I suggest until 
you have a basic understanding of how NetApp software works, not making ANY 
definitive statements about them.  You're sounding like a fool and/or someone 
working for one of their competitors.

 

The problem though for our usage with NetApp was that we actually couldn't 
reserve enough space for snapshots.  50% of the pool was their maximum, and 
we're interested in running ten years worth of snapshots here, which could see 
us with a pool with just 10% of live data and 90% of the space taken up by 
snapshots.  The NetApp approach was just too restrictive.


Ross
 There is not, and never has been a 50% of the pool maximum.  That's also a 
lie.  If you want snapshots to take up 90% of the pool, ONTAP will GLADLY do 
so.  I've got a filer sitting in my lab and would be MORE than happy to post 
the df output of a volume that has snapshots taking up 90% of the volume.



--Tim




_
Win a voice over part with Kung Fu Panda  Live Search   and   100’s of Kung Fu 
Panda prizes to win with Live Search
http://clk.atdmt.com/UKM/go/107571439/direct/01/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-30 Thread Ross Smith

Triple mirroring you say?  That'd be me then :D

The reason I really want to get ZFS timeouts sorted is that our long term goal 
is to mirror that over two servers too, giving us a pool mirrored across two 
servers, each of which is actually a zfs iscsi volume hosted on triply mirrored 
disks.

Oh, and we'll have two sets of online off-site backups running raid-z2, plus a 
set of off-line backups too.

All in all I'm pretty happy with the integrity of the data, wouldn't want to 
use anything other than ZFS for that now.  I'd just like to get the 
availability working a bit better, without having to go back to buying raid 
controllers.  We have big plans for that too; once we get the iSCSI / iSER 
timeout issue sorted our long term availability goals are to have the setup I 
mentioned above hosted out from a pair of clustered Solaris NFS / CIFS servers.

Failover time on the cluster is currently in the order of 5-10 seconds; if I 
can get the detection of a bad iSCSI link down under 2 seconds we'll 
essentially have a worst case scenario of under 15 seconds of downtime.  
Downtime that low is effectively transparent for our users, as all of our 
applications can cope with that seamlessly, and I'd really love to be able to 
do that this calendar year.

Anyway, getting back on topic, it's a good point about moving forward while 
redundancy exists.  I think the flag for specifying the write behavior should 
have that as the default, with the optional setting being to allow the pool to 
continue accepting writes while the pool is in a non redundant state.
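As an aside, the nearest knob that already exists (as far as I know) is the 
pool-wide failmode property, which only covers the case where the pool loses 
access to its devices entirely, not the degraded-but-still-redundant case 
above.  Purely for reference, using my pool name:

# failmode accepts wait, continue or panic
zpool get failmode rc-pool
zpool set failmode=continue rc-pool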

Ross

 Date: Sat, 30 Aug 2008 10:59:19 -0500
 From: [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 CC: zfs-discuss@opensolaris.org
 Subject: Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / 
 driver failure better
 
 On Sat, 30 Aug 2008, Ross wrote:
  while the problem is diagnosed. - With that said, could the write 
  timeout default to on when you have a slog device?  After all, the 
  data is safely committed to the slog, and should remain there until 
  it's written to all devices.  Bob, you seemed the most concerned 
  about writes, would that be enough redundancy for you to be happy to 
  have this on by default?  If not, I'd still be ok having it off by 
  default, we could maybe just include it in the evil tuning guide 
  suggesting that this could be turned on by anybody who has a 
  separate slog device.
 
 It is my impression that the slog device is only used for synchronous 
 writes.  Depending on the system, this could be just a small fraction 
 of the writes.
 
 In my opinion, ZFS's primary goal is to avoid data loss, or 
 consumption of wrong data.  Availability is a lesser goal.
 
 If someone really needs maximum availability then they can go to 
 triple mirroring or some other maximally redundant scheme.  ZFS should 
 do its best to continue moving forward as long as some level of 
 redundancy exists.  There could be an option to allow moving forward 
 with no redundancy at all.
 
 Bob
 ==
 Bob Friesenhahn
 [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
 

_
Win a voice over part with Kung Fu Panda  Live Search   and   100’s of Kung Fu 
Panda prizes to win with Live Search
http://clk.atdmt.com/UKM/go/107571439/direct/01/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Ross Smith

Hi guys,

Bob, my thought was to have this timeout as something that can be optionally 
set by the administrator on a per pool basis.  I'll admit I was mainly thinking 
about reads and hadn't considered the write scenario, but even having thought 
about that it's still a feature I'd like.  After all, this would be a timeout 
set by the administrator based on the longest delay they can afford for that 
storage pool.

Personally, if a SATA disk wasn't responding to any requests after 2 seconds I 
really don't care if an error has been detected, as far as I'm concerned that 
disk is faulty.  I'd be quite happy for the array to drop to a degraded mode 
based on that and for writes to carry on with the rest of the array.

Eric, thanks for the extra details, they're very much appreciated.  It's good 
to hear you're working on this, and I love the idea of doing a B_FAILFAST read 
on both halves of the mirror.

I do have a question though.  From what you're saying, the response time can't 
be consistent across all hardware, so you're once again at the mercy of the 
storage drivers.  Do you know how long B_FAILFAST takes to return a 
response on iSCSI?  If that's over 1-2 seconds I would still consider that too 
slow I'm afraid.

I understand that Sun in general don't want to add fault management to ZFS, but 
I don't see how this particular timeout does anything other than help ZFS when 
it's dealing with such a diverse range of media.  I agree that ZFS can't know 
itself what should be a valid timeout, but that's exactly why this needs to be 
an optional administrator set parameter.  The administrator of a storage array 
who wants to set this certainly knows what a valid timeout is for them, and 
these timeouts are likely to be several orders of magnitude larger than the 
standard response times.  I would configure very different values for my SATA 
drives as for my iSCSI connections, but in each case I would be happier knowing 
that ZFS has more of a chance of catching bad drivers or unexpected scenarios.
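If anyone wants to see what those real response times look like before picking 
a number, iostat will happily show per-device average service times while the 
pool is under load; this is just observation, not a ZFS setting:

# asvc_t is the average service time per device, in milliseconds, refreshed every second
iostat -xn 1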

I very much doubt hardware raid controllers would wait 3 minutes for a drive to 
return a response, they will have their own internal timeouts to know when a 
drive has failed, and while ZFS is dealing with very different hardware I can't 
help but feel it should have that same approach to management of its drives.

However, that said, I'll be more than willing to test the new
B_FAILFAST logic on iSCSI once it's released.  Just let me know when
it's out.


Ross





 Date: Thu, 28 Aug 2008 11:29:21 -0500
 From: [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 CC: zfs-discuss@opensolaris.org
 Subject: Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / 
 driver failure better
 
 On Thu, 28 Aug 2008, Ross wrote:
 
  I believe ZFS should apply the same tough standards to pool 
  availability as it does to data integrity.  A bad checksum makes ZFS 
  read the data from elsewhere, why shouldn't a timeout do the same 
  thing?
 
 A problem is that for some devices, a five minute timeout is ok.  For 
 others, there must be a problem if the device does not respond in a 
 second or two.
 
 If the system or device is simply overwelmed with work, then you would 
 not want the system to go haywire and make the problems much worse.
 
 Which of these do you prefer?
 
o System waits substantial time for devices to (possibly) recover in
  order to ensure that subsequently written data has the least
  chance of being lost.
 
o System immediately ignores slow devices and switches to
  non-redundant non-fail-safe non-fault-tolerant may-lose-your-data
  mode.  When system is under intense load, it automatically
  switches to the may-lose-your-data mode.
 
 Bob
 ==
 Bob Friesenhahn
 [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
 

_
Get Hotmail on your mobile from Vodafone 
http://clk.atdmt.com/UKM/go/107571435/direct/01/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS automatic snapshots 0.11 Early Access

2008-08-27 Thread Ross Smith

That sounds absolutely perfect Tim, thanks.
 
Yes, we'll be sending these to other zfs filesystems, although I haven't looked 
at the send/receive part of your service yet.  What I'd like to do is stage the 
send/receive as files on an external disk, and then receive them remotely from 
that.  I've tested the concept works with a single send/receive operation, but 
haven't looked into the automation yet.
 
The plan is to use usb/firewire/esata disks to do the data transfers rather 
than doing it all over the wire.  We'll do the initial full send/receive 
locally over gigabit to prepare the remote system, and from that point on it 
will just be incremental daily or weekly transfers which should fit fine on an 
80-200GB external drive.
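In case the staging idea is useful to anyone else, it boils down to something 
like this; the pool, filesystem and mount names here are made up, only the zfs 
send/receive syntax is real:

# one-off full copy, done locally over gigabit
zfs snapshot rc-pool/data@base
zfs send rc-pool/data@base | ssh backuphost zfs receive backup/data

# later: stage an incremental stream as a file on the external drive
zfs snapshot rc-pool/data@weekly1
zfs send -i rc-pool/data@base rc-pool/data@weekly1 > /external/data_base-weekly1.zfs

# and replay it at the remote site once the drive arrives
zfs receive backup/data < /external/data_base-weekly1.zfs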
 
When I get around to it I'll be pulling apart your automatic backup code to see 
if I can't get it to fire off the incremental zfs send (or receive) as soon as 
the system detects that the external drive has been attached.
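The trigger part could be as crude as a little polling script along these 
lines (mountpoint and snapshot names made up, no error handling):

#!/sbin/sh
# kick off the staged incremental send once the external drive appears
while true; do
   if [ -d /external ] && [ ! -f /external/.staged ]; then
      zfs send -i rc-pool/data@base rc-pool/data@weekly1 \
         > /external/data_base-weekly1.zfs && touch /external/.staged
   fi
   sleep 60
done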
 
We will be using tape backups too, but those will be our disaster recovery plan 
in case ZFS itself fails, so those will be backups of the raw files, possibly 
using something as simple as tar.  We don't expect to ever need those, but at 
least we'll be safe should we ever experience pool corruption on all four 
servers.
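And the tar side really can be that simple, something like this with made-up 
paths (and obviously we'd test the restore):

# dump the raw files to tape, then list the archive back as a quick check
tar cf /dev/rmt/0 /rc-pool/data
tar tf /dev/rmt/0 | head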
 
And yes, you could say we're paranoid :D
 
Ross
 Date: Wed, 27 Aug 2008 12:14:10 +0100 From: [EMAIL PROTECTED] Subject: Re: 
 [zfs-discuss] ZFS automatic snapshots 0.11 Early Access To: [EMAIL 
 PROTECTED] CC: zfs-discuss@opensolaris.org  On Wed, 2008-08-27 at 03:53 
 -0700, Ross wrote:  We're looking at autohome folders for windows users 
 over CIFS, but I'm  wondering how that is going to affect our backup 
 strategy. I was  hoping to be able to use your automatic snapshot service 
 on these  servers, do you know how that service would work with the 
 autohome  service when filesystems are being created on demand?  If 
 you're using 0.11ea, and you're creating filesystems on the fly, so long as 
 the parent filesystem you're creating a child in has a com.sun:auto-snapshot 
 property set, the child will inherit that zfs user-property, and snapshots 
 will automatically get taken for that child too, no user intervention 
 needed.  The automatic backup stuff in the service (not turned by default) 
 should handle incremental vs. full send/recvs, even on newly created 
 filesystems. If it finds an earlier snapshot for the filesystem, it'll do an 
 incremental send/recv, otherwise it'll send a full snapshot stream first, 
 followed by incremental send/recvs after that.  [ of course, you'd be 
 sending these streams to other zfs filesystems, not just saving the flat zfs 
 send streams to tape, right? ]  cheers, tim  
_
Win a voice over part with Kung Fu Panda  Live Search   and   100’s of Kung Fu 
Panda prizes to win with Live Search
http://clk.atdmt.com/UKM/go/107571439/direct/01/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Best layout for 15 disks?

2008-08-22 Thread Ross Smith

Yup, you got it, and an 8 disk raid-z2 array should still fly for a home system 
:D  I'm guessing you're on gigabit there?  I don't see you having any problems 
hitting the bandwidth limit on it.

Ross


 Date: Fri, 22 Aug 2008 11:11:21 -0700
 From: [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 Subject: Re: [zfs-discuss] Best layout for 15 disks?
 CC: zfs-discuss@opensolaris.org
 
 On 8/22/08, Ross [EMAIL PROTECTED] wrote:
 
  Yes, that looks pretty good mike.  There are a few limitations to that as 
  you add the 2nd raidz2 set, but nothing major.  When you add the extra 
  disks, your original data will still be stored on the first set of disks, 
  if you've any free space left on those you'll then get some data stored 
  across all the disks, and then I think that once the first set are full, 
  zfs will just start using the free space on the newer 8.
 
  It shouldn't be a problem for a home system, and all that will happen 
  silently in the background.  It's just worth knowing that you don't 
  necessarily get the full performance of a 16 disk array when you do it in 
  two stages like that.
 
 that's fine. I'll basically be getting the performance of an 8 disk
 raidz2 at worst, yeah? i'm fine with how the space will be
 distributed. after all this is still a huge improvement over my
 current haphazard setup :P
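For reference, the two-stage expansion described above is just a second raidz2 
vdev added to the same pool; a minimal sketch with made-up device names:

# first 8-disk raidz2 vdev at pool creation
zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0
# later, when the next 8 disks arrive
zpool add tank raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0
zpool status tank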

_
Get Hotmail on your mobile from Vodafone 
http://clk.atdmt.com/UKM/go/107571435/direct/01/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed

2008-08-20 Thread Ross Smith

  Without fail, cfgadm changes the status from disk to sata-port when I
  unplug a device attached to port 6 or 7, but most of the time unplugging
  disks 0-5 results in no change in cfgadm, until I also attach disk 6 or 7.
 
 That does seem inconsistent, or at least, it's not what I'd expect.

Yup, it was an absolute nightmare to diagnose on top of everything else.  It 
definitely doesn't happen in Windows either.  I really want somebody to try snv_94 
on a Thumper to see if you get the same behaviour there, or whether it's unique 
to Supermicro's Marvell card.

  Often the system hung completely when you pulled one of the disks 0-5,
  and wouldn't respond again until you re-inserted it.
  
  I'm 99.99% sure this is a driver issue for this controller.
 
 Have you logged a bug on it yet?

Yup, 6735931.  Added the information about it working in Windows today too.

Ross

_
Get Hotmail on your mobile from Vodafone 
http://clk.atdmt.com/UKM/go/107571435/direct/01/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed

2008-08-15 Thread Ross Smith

Oh god no, I'm already learning three new operating systems, now is not a good 
time to add a fourth.
 
Ross-- Windows admin now working with Ubuntu, OpenSolaris and ESX



Date: Fri, 15 Aug 2008 10:07:31 -0500
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: Re: [zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed
CC: zfs-discuss@opensolaris.org

You could always try FreeBSD :)

--Tim
On Fri, Aug 15, 2008 at 9:44 AM, Ross [EMAIL PROTECTED] wrote:
Haven't a clue, but I've just gotten around to installing windows on this box 
to test and I can confirm that hot plug works just fine in windows.Drives 
appear and dissappear in device manager the second I unplug the hardware.  Any 
drive, either controller.  So far I've done a couple of dozen removals, pulling 
individual drives, or as many as half a dozen at once.  I've even gone as far 
as to immediately pull a drive I only just connected.  Windows has no problems 
at all.Unfortunately for me, Windows doesn't support ZFS...  right now it's 
looking a whole load more stable.Ross
_
Win a voice over part with Kung Fu Panda  Live Search   and   100’s of Kung Fu 
Panda prizes to win with Live Search
http://clk.atdmt.com/UKM/go/107571439/direct/01/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zpool import not working - I broke my pool...

2008-08-06 Thread Ross Smith

Hmm... got a bit more information for you to add to that bug I think.
 
Zpool import also doesn't work if you have mirrored log devices and either one 
of them is offline.
 
I created two ramdisks with:
# ramdiskadm -a rc-pool-zil-1 256m
# ramdiskadm -a rc-pool-zil-2 256m
 
And added them to the pool with:
# zpool add rc-pool log mirror /dev/ramdisk/rc-pool-zil-1 
/dev/ramdisk/rc-pool-zil-2
 
I can reboot fine, the pool imports ok without the ZIL, and I have a script that 
recreates the ramdisks and adds them back to the pool:

#!/sbin/sh
state=$1
case $state in
'start')
   echo 'Starting Ramdisks'
   /usr/sbin/ramdiskadm -a rc-pool-zil-1 256m
   /usr/sbin/ramdiskadm -a rc-pool-zil-2 256m
   echo 'Attaching to ZFS ZIL'
   /usr/sbin/zpool replace test /dev/ramdisk/rc-pool-zil-1
   /usr/sbin/zpool replace test /dev/ramdisk/rc-pool-zil-2
   ;;
'stop')
   ;;
esac
 
However, if I export the pool, and delete one ramdisk to check that the 
mirroring works fine, the import fails:
# zpool export rc-pool
# ramdiskadm -d rc-pool-zil-1
# zpool import rc-pool
cannot import 'rc-pool': one or more devices is currently unavailable
 
Ross
 Date: Mon, 4 Aug 2008 10:42:43 -0600 From: [EMAIL PROTECTED] Subject: Re: 
 [zfs-discuss] Zpool import not working - I broke my pool... To: [EMAIL 
 PROTECTED]; [EMAIL PROTECTED] CC: zfs-discuss@opensolaris.orgRichard 
 Elling wrote:  Ross wrote:  I'm trying to import a pool I just exported 
 but I can't, even -f doesn't help. Every time I try I'm getting an error:  
 cannot import 'rc-pool': one or more devices is currently unavailable  
  Now I suspect the reason it's not happy is that the pool used to have a 
 ZIL :)  Correct. What you want is CR 6707530, log device failure 
 needs some work  http://bugs.opensolaris.org/view_bug.do?bug_id=6707530  
 which Neil has been working on, scheduled for b96.  Actually no. That CR 
 mentioned the problem and talks about splitting out the bug, as it's really 
 a separate problem. I've just done that and here's the new CR which probably 
 won't be visible immediately to you:  6733267 Allow a pool to be imported 
 with a missing slog  Here's the Description:  --- This 
 CR is being broken out from 6707530 log device failure needs some work  
 When Separate Intent logs (slogs) were designed they were given equal status 
 in the pool device tree. This was because they can contain committed changes 
 to the pool. So if one is missing it is assumed to be important to the 
 integrity of the application(s) that wanted the data committed 
 synchronously, and thus a pool cannot be imported with a missing slog. 
 However, we do allow a pool to be missing a slog on boot up if it's in the 
 /etc/zfs/zpool.cache file. So this sends a mixed message.  We should allow 
 a pool to be imported without a slog if -f is used and to not import without 
 -f but perhaps with a better error message.  It's the guidsum check that 
 actually rejects imports with missing devices. We could have a separate 
 guidsum for the main pool devices (non slog/cache). --- 
_
Get Hotmail on your mobile from Vodafone 
http://clk.atdmt.com/UKM/go/107571435/direct/01/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zpool import not working - I broke my pool...

2008-08-05 Thread Ross Smith

Just a thought, before I go and wipe this zpool, is there any way to manually 
recreate the /etc/zfs/zpool.cache file?
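For what it's worth, the cache file is just a record of currently-imported 
pools, so once a pool will import again its entry comes back by itself; and on 
builds that have the cachefile pool property you can point it back at the 
default location.  A sketch, assuming the property is available:

zpool set cachefile=/etc/zfs/zpool.cache rc-pool
zpool get cachefile rc-pool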
 
Ross Date: Mon, 4 Aug 2008 10:42:43 -0600 From: [EMAIL PROTECTED] Subject: 
Re: [zfs-discuss] Zpool import not working - I broke my pool... To: [EMAIL 
PROTECTED]; [EMAIL PROTECTED] CC: zfs-discuss@opensolaris.orgRichard 
Elling wrote:  Ross wrote:  I'm trying to import a pool I just exported 
but I can't, even -f doesn't help. Every time I try I'm getting an error:  
cannot import 'rc-pool': one or more devices is currently unavailable   
Now I suspect the reason it's not happy is that the pool used to have a ZIL :) 
 Correct. What you want is CR 6707530, log device failure needs some 
work  http://bugs.opensolaris.org/view_bug.do?bug_id=6707530  which Neil 
has been working on, scheduled for b96.  Actually no. That CR mentioned the 
problem and talks about splitting out the bug, as it's really a separate 
problem. I've just done that and here's the new CR which probably won't be 
visible immediately to you:  6733267 Allow a pool to be imported with a 
missing slog  Here's the Description:  --- This CR is 
being broken out from 6707530 log device failure needs some work  When 
Separate Intent logs (slogs) were designed they were given equal status in the 
pool device tree. This was because they can contain committed changes to the 
pool. So if one is missing it is assumed to be important to the integrity of 
the application(s) that wanted the data committed synchronously, and thus a 
pool cannot be imported with a missing slog. However, we do allow a pool to be 
missing a slog on boot up if it's in the /etc/zfs/zpool.cache file. So this 
sends a mixed message.  We should allow a pool to be imported without a slog 
if -f is used and to not import without -f but perhaps with a better error 
message.  It's the guidsum check that actually rejects imports with missing 
devices. We could have a separate guidsum for the main pool devices (non 
slog/cache). --- 
_
Win a voice over part with Kung Fu Panda  Live Search   and   100’s of Kung Fu 
Panda prizes to win with Live Search
http://clk.atdmt.com/UKM/go/107571439/direct/01/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zpool import not working - I broke my pool...

2008-08-05 Thread Ross Smith

No, but that's a great idea!  I'm on a UFS root at the moment, will have a look 
at using ZFS next time I re-install.
 Date: Tue, 5 Aug 2008 07:59:35 -0700 From: [EMAIL PROTECTED] Subject: Re: 
 [zfs-discuss] Zpool import not working - I broke my pool... To: [EMAIL 
 PROTECTED] CC: [EMAIL PROTECTED]; zfs-discuss@opensolaris.org  Ross Smith 
 wrote:  Just a thought, before I go and wipe this zpool, is there any way 
 to   manually recreate the /etc/zfs/zpool.cache file?  Do you have a copy 
 in a snapshot? ZFS for root is awesome! -- richard Ross
 Date: Mon, 4 Aug 2008 10:42:43 -0600   From: [EMAIL PROTECTED]   
 Subject: Re: [zfs-discuss] Zpool import not working - I broke my pool...   
 To: [EMAIL PROTECTED]; [EMAIL PROTECTED]   CC: 
 zfs-discuss@opensolaris.org Richard Elling wrote:
 Ross wrote:I'm trying to import a pool I just exported but I can't, 
 even -f   doesn't help. Every time I try I'm getting an error:
 cannot import 'rc-pool': one or more devices is currently   unavailable 
   Now I suspect the reason it's not happy is that the pool used 
 to   have a ZIL :)  Correct. What you want is CR 
 6707530, log device failure needs   some work
 http://bugs.opensolaris.org/view_bug.do?bug_id=6707530which Neil has 
 been working on, scheduled for b96. Actually no. That CR mentioned 
 the problem and talks about splitting out   the bug, as it's really a 
 separate problem. I've just done that and   here's   the new CR which 
 probably won't be visible immediately to you: 6733267 Allow a pool 
 to be imported with a missing slog Here's the Description:
  ---   This CR is being broken out from 6707530 log 
 device failure needs   some work When Separate Intent logs 
 (slogs) were designed they were given   equal status in the pool device 
 tree.   This was because they can contain committed changes to the pool. 
   So if one is missing it is assumed to be important to the integrity   
 of the   application(s) that wanted the data committed synchronously, and 
 thus   a pool cannot be imported with a missing slog.   However, we do 
 allow a pool to be missing a slog on boot up if   it's in the 
 /etc/zfs/zpool.cache file. So this sends a mixed message. We should 
 allow a pool to be imported without a slog if -f is used   and to not 
 import without -f but perhaps with a better error message. It's 
 the guidsum check that actually rejects imports with missing   devices.  
  We could have a separate guidsum for the main pool devices (non   
 slog/cache).   ---  
   
 Get Hotmail on your mobile from Vodafone Try it Now!   
 http://clk.atdmt.com/UKM/go/107571435/direct/01/ 
_
Get Hotmail on your mobile from Vodafone 
http://clk.atdmt.com/UKM/go/107571435/direct/01/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] are these errors dangerous

2008-08-03 Thread Ross Smith

Hi Matt,
 
If it's all 3 disks, I wouldn't have thought it likely to be disk errors, and I 
don't think it's a ZFS fault as such.  You might be better posting the question 
in the storage or help forums to see if anybody there can shed more light on 
this.
 
Ross
 Date: Sun, 3 Aug 2008 16:48:03 +0100 From: [EMAIL PROTECTED] To: [EMAIL 
 PROTECTED] CC: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] are 
 these errors dangerous  Ross wrote:  Hi,First of all, I really 
 should warn you that I'm very new to Solaris, I'll happily share my thoughts 
 but be aware that there's not a lot of experience backing them up.   
 From what you've said, and the logs you've posted I suspect you're hitting 
 recoverable read errors. ZFS wouldn't flag these as no corrupt data has been 
 encountered, but I suspect the device driver is logging them anyway.
 The log you posted all appears to refer to one disk (sd0), my guess would be 
 that you have some hardware faults on that device and if it were me I'd 
 probably be replacing it before it actually fails.I'd check your logs 
 before replacing that disk though, you need to see if it's just that one 
 disk, or if others are affected. Provided you have a redundant ZFS pool, it 
 may be worth offlining that disk, unconfiguring it with cfgadm, and then 
 pulling the drive to see if that does cure the warnings you're getting in the 
 logs.Whatever you do, please keep me posted. Your post has already 
 made me realise it would be a good idea to have a script watching log file 
 sizes to catch problems like this early.Ross  Thanks for your 
 insights, I'm also relatively new to solaris but i've  been on linux for 
 years. I've just read more into the logs and its  giving these errors for 
 all 3 of my disks (sd0,1,2). I'm running a  raidz1, unfortunately without 
 any spares and I'm not too keen on  removing the parity from my pool as I've 
 got a lot of important files  stored there.  I would agree that this seems 
 to be a recoverable error and nothing is  getting corrupted thanks to ZFS. 
 The thing I'm worried about is if the  entire batch is failing slowly and 
 will all die at the same time.  Hopefully some ZFS/hardware guru can 
 comment on this before the world  ends for me :P  Thanks  Matt  No 
 virus found in this outgoing message. Checked by AVG - http://www.avg.com  
 Version: 8.0.138 / Virus Database: 270.5.10/1587 - Release Date: 02/08/2008 
 17:30  
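For anyone following along, the offline-and-unconfigure suggestion quoted 
above amounts to something like this, with made-up pool and device names, and 
only on a redundant pool:

iostat -En                      # per-drive soft/hard/transport error counters
zpool offline tank c1t2d0       # stop ZFS using the suspect disk
cfgadm -c unconfigure sata0/2   # release the port before physically pulling the drive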
_
Win a voice over part with Kung Fu Panda  Live Search   and   100’s of Kung Fu 
Panda prizes to win with Live Search
http://clk.atdmt.com/UKM/go/107571439/direct/01/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can I trust ZFS?

2008-08-01 Thread Ross Smith

Hey Brent,
 
On the Sun hardware like the Thumper you do get a nice bright blue ready to 
remove led as soon as you issue the cfgadm -c unconfigure xxx command.  On 
other hardware it takes a little more care, I'm labelling our drive bays up 
*very* carefully to ensure we always remove the right drive.  Stickers are your 
friend, mine will probably be labelled sata1/0, sata1/1, sata1/2, etc.
 
I know Sun are working to improve the LED support, but I don't know whether 
that support will ever be extended to 3rd party hardware:
http://blogs.sun.com/eschrock/entry/external_storage_enclosures_in_solaris
 
I'd love to use Sun hardware for this, but while things like x2200 servers are 
great value for money, Sun don't have anything even remotely competitive to a 
standard 3U server with 16 SATA bays.  The x4240 is probably closest, but is at 
least double the price.  Even the J4200 arrays are more expensive than this 
entire server.
 
Ross
 
PS.  Once you've tested SCSI removal, could you add your results to my thread?  
I'd love to hear how that went:
http://www.opensolaris.org/jive/thread.jspa?threadID=67837&tstart=0
 
 
 This conversation piques my interest.. I have been reading a lot about 
 Opensolaris/Solaris for the last few weeks. Have even spoken to Sun storage 
 techs about bringing in Thumper/Thor for our storage needs. I have recently 
 brought online a Dell server with a DAS (14 SCSI drives). This will be part 
 of my tests now, 
 physically removing a member of the pool before issuing the removal command 
 for that particular drive.
 One other issue I have now also, how do you physically locate a 
 failing/failed drive in ZFS?
 With hardware RAID sets, if the RAID controller itself detects the error, it 
 will inititate a BLINK command to that 
 drive, so the individual drive is now flashing red/amber/whatever on the RAID 
 enclosure. How would this be possible with ZFS? Say you have a JBOD 
 enclosure, (14, hell maybe 48 drives). Knowing c0d0xx failed is no longer 
 helpful, if only ZFS catches an error. Will you be able to isolate the drive 
 quickly, to replace it? Or will you be going does the enclosure start at 
 logical zero... left to right.. hrmmm
 Thanks
 --  Brent Jones [EMAIL PROTECTED]
 
_
100’s of Nikon cameras to be won with Live Search
http://clk.atdmt.com/UKM/go/101719808/direct/01/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Replacing the boot HDDs in x4500

2008-08-01 Thread Ross Smith

Sorry Ian, I was posting on the forum and missed the word 'disks' from my 
previous post.  I'm still not used to Sun's mutant cross of a message board / 
mailing list.
 
Ross
 Date: Fri, 1 Aug 2008 21:08:08 +1200 From: [EMAIL PROTECTED] To: [EMAIL 
 PROTECTED] CC: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] 
 Replacing the boot HDDs in x4500  Ross wrote:  Wipe the snv_70b disks I 
 meant. What disks? This message makes no sense without context.  
 Context free messages are a pain in the arse for those of us who use the 
 mail list.  Ian
_
Make a mini you on Windows Live Messenger!
http://clk.atdmt.com/UKM/go/107571437/direct/01/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed

2008-07-31 Thread Ross Smith
                        0K      0K    0%   /dev/fd
swap                  4.7G     48K  4.7G    1%   /tmp
swap                  4.7G     76K  4.7G    1%   /var/run
/dev/dsk/c1t0d0s7     425G    4.8G  416G    2%   /export/home
 
6. 10:35am  It's now been two hours, neither zpool status nor zfs list have 
ever finished.  The file copy attempt has also been hung for over an hour 
(although that's not unexpected with 'wait' as the failmode).
 
Richard, you say ZFS is not silently failing, well for me it appears that it 
is.  I can't see any warnings from ZFS, I can't get any status information.  I 
see no way that I could find out what files are going to be lost on this server.
 
Yes, I'm now aware that the pool has hung since file operations are hanging; 
however, had that been my first indication of a problem, I would be left in a 
position where I could find out neither the cause nor the files affected.  I 
don't believe I have any way to find out which operations had completed 
without error but are not yet committed to disk.  I certainly don't get the 
status message you do saying permanent errors have been found in files.
 
I plugged the USB drive back in now, Solaris detected it ok, but ZFS is still 
hung.  The rest of /var/adm/messages is:
Jul 31 09:39:44 unknown smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small packet
Jul 31 09:45:22 unknown /sbin/dhcpagent[95]: [ID 732317 daemon.warning] accept_v4_acknak: ACK packet on nge0 missing mandatory lease option, ignored
Jul 31 09:45:38 unknown last message repeated 5 times
Jul 31 09:51:44 unknown smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small packet
Jul 31 10:03:44 unknown last message repeated 2 times
Jul 31 10:14:27 unknown /sbin/dhcpagent[95]: [ID 732317 daemon.warning] accept_v4_acknak: ACK packet on nge0 missing mandatory lease option, ignored
Jul 31 10:14:45 unknown last message repeated 5 times
Jul 31 10:15:44 unknown smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small packet
Jul 31 10:27:45 unknown smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small packet
Jul 31 10:36:25 unknown usba: [ID 691482 kern.warning] WARNING: /[EMAIL PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL PROTECTED] (scsa2usb0): Reinserted device is accessible again.
Jul 31 10:39:45 unknown smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small packet
Jul 31 10:45:53 unknown /sbin/dhcpagent[95]: [ID 732317 daemon.warning] accept_v4_acknak: ACK packet on nge0 missing mandatory lease option, ignored
Jul 31 10:46:09 unknown last message repeated 5 times
Jul 31 10:51:45 unknown smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small packet
 
7. 10:55am  Gave up on ZFS ever recovering.  A shutdown attempt hung as 
expected.  I hard-reset the computer.
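One other thing that can be checked in this state, since the reply below talks 
about FMA processing e-reports: the fault manager keeps its own logs, which 
should still be readable even while the ZFS tools are hung:

fmadm faulty    # anything currently diagnosed as faulty
fmdump          # list of past diagnoses
fmdump -eV      # the raw error reports in full detail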
 
Ross
 
 
 Date: Wed, 30 Jul 2008 11:17:08 -0700 From: [EMAIL PROTECTED] Subject: Re: 
 [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed To: [EMAIL 
 PROTECTED] CC: zfs-discuss@opensolaris.org  I was able to reproduce this 
 in b93, but might have a different interpretation of the conditions. More 
 below...  Ross Smith wrote:  A little more information today. I had a 
 feeling that ZFS would   continue quite some time before giving an error, 
 and today I've shown   that you can carry on working with the filesystem 
 for at least half an   hour with the disk removed.I suspect on a 
 system with little load you could carry on working for   several hours 
 without any indication that there is a problem. It   looks to me like ZFS 
 is caching reads  writes, and that provided   requests can be fulfilled 
 from the cache, it doesn't care whether the   disk is present or not.  In 
 my USB-flash-disk-sudden-removal-while-writing-big-file-test, 1. I/O to the 
 missing device stopped (as I expected) 2. FMA kicked in, as expected. 3. 
 /var/adm/messages recorded Command failed to complete... device gone. 4. 
 After exactly 9 minutes, 17,951 e-reports had been processed and the 
 diagnosis was complete. FMA logged the following to /var/adm/messages  Jul 
 30 10:33:44 grond scsi: [ID 107833 kern.warning] WARNING:  /[EMAIL 
 PROTECTED],0/pci1458,[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL 
 PROTECTED],0 (sd1): Jul 30 10:33:44 grond Command failed to 
 complete...Device is gone Jul 30 10:42:31 grond fmd: [ID 441519 
 daemon.error] SUNW-MSG-ID:  ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: 
 Major Jul 30 10:42:31 grond EVENT-TIME: Wed Jul 30 10:42:30 PDT 2008 Jul 30 
 10:42:31 grond PLATFORM: , CSN: , HOSTNAME: grond Jul 30 10:42:31 grond 
 SOURCE: zfs-diagnosis, REV: 1.0 Jul 30 10:42:31 grond EVENT-ID: 
 d99769aa-28e8-cf16-d181-945592130525 Jul 30 10:42:31 grond DESC: The number 
 of I/O errors associated with a  ZFS device exceeded Jul 30 10:42:31 grond 
 acceptable levels. Refer to  http://sun.com/msg/ZFS-8000-FD for more 
 information. Jul 30 10:42:31 grond AUTO-RESPONSE: The device has been 
 offlined and  marked

Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed

2008-07-30 Thread Ross Smith

I agree that device drivers should perform the bulk of the fault monitoring; 
however, I disagree that this absolves ZFS of any responsibility for checking 
for errors.  The primary goal of ZFS is to be a filesystem and maintain data 
integrity, and that entails both reading and writing data to the devices.  It 
is no good having checksumming when reading data if you are losing huge 
amounts of data when a disk fails.
 
I'm not saying that ZFS should be monitoring disks and drivers to ensure they 
are working, just that if ZFS attempts to write data and doesn't get the 
response it's expecting, an error should be logged against the device 
regardless of what the driver says.  If ZFS is really about end-to-end data 
integrity, then you do need to consider the possibility of a faulty driver.  
Now I don't know what the root cause of this error is, but I suspect it will be 
either a bad response from the SATA driver, or something within ZFS that is not 
working correctly.  Either way however I believe ZFS should have caught this.
 
It's similar to the iSCSI problem I posted a few months back where the ZFS pool 
hangs for 3 minutes when a device is disconnected.  There's absolutely no need 
for the entire pool to hang when the other half of the mirror is working fine.  
ZFS is often compared to hardware raid controllers, but so far its ability to 
handle problems is falling short.
 
Ross
 
 Date: Wed, 30 Jul 2008 09:48:34 -0500 From: [EMAIL PROTECTED] To: [EMAIL 
 PROTECTED] CC: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] 
 Supermicro AOC-SAT2-MV8 hang when drive removed  On Wed, 30 Jul 2008, Ross 
 wrote:   Imagine you had a raid-z array and pulled a drive as I'm doing 
 here.   Because ZFS isn't aware of the removal it keeps writing to that   
 drive as if it's valid. That means ZFS still believes the array is   online 
 when in fact it should be degrated. If any other drive now   fails, ZFS 
 will consider the status degrated instead of faulted, and   will continue 
 writing data. The problem is, ZFS is writing some of   that data to a drive 
 which doesn't exist, meaning all that data will   be lost on reboot.  
 While I do believe that device drivers. or the fault system, should  notify 
 ZFS when a device fails (and ZFS should appropriately react), I  don't think 
 that ZFS should be responsible for fault monitoring. ZFS  is in a rather 
 poor position for device fault monitoring, and if it  attempts to do so then 
 it will be slow and may misbehave in other  ways. The software which 
 communicates with the device (i.e. the  device driver) is in the best 
 position to monitor the device.  The primary goal of ZFS is to be able to 
 correctly read data which was  successfully committed to disk. There are 
 programming interfaces  (e.g. fsync(), msync()) which may be used to ensure 
 that data is  committed to disk, and which should return an error if there 
 is a  problem. If you were performing your tests over an NFS mount then the 
  results should be considerably different since NFS requests that its  data 
 be committed to disk.  Bob == Bob 
 Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ 
 GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ 
_
Find the best and worst places on the planet
http://clk.atdmt.com/UKM/go/101719807/direct/01/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed

2008-07-29 Thread Ross Smith

A little more information today.  I had a feeling that ZFS would continue quite 
some time before giving an error, and today I've shown that you can carry on 
working with the filesystem for at least half an hour with the disk removed.
 
I suspect on a system with little load you could carry on working for several 
hours without any indication that there is a problem.  It looks to me like ZFS 
is caching reads and writes, and that provided requests can be fulfilled from the 
cache, it doesn't care whether the disk is present or not.
 
I would guess that ZFS is attempting to write to the disk in the background, 
and that this is silently failing.
 
Here's the log of the tests I did today.  After removing the drive, over a 
period of 30 minutes I copied folders to the filesystem, created an archive, 
set permissions, and checked properties.  I did this both in the command line 
and with the graphical file manager tool in Solaris.  Neither reported any 
errors, and all the data could be read and written fine.  Until the reboot, at 
which point all the data was lost, again without error.
 
If you're not interested in the detail, please skip to the end where I've got 
some thoughts on just how many problems there are here.
 
 
# zpool status test
  pool: test
 state: ONLINE
 scrub: none requested
config:

        NAME      STATE     READ WRITE CKSUM
        test      ONLINE       0     0     0
          c2t7d0  ONLINE       0     0     0

errors: No known data errors
# zfs list test
NAME   USED  AVAIL  REFER  MOUNTPOINT
test   243M   228G   242M  /test
# zpool list test
NAME   SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
test   232G   243M   232G     0%  ONLINE  -
 
-- drive removed --
 
# cfgadm |grep sata1/7
sata1/7                        sata-port    empty        unconfigured ok
 
 
-- cfgadmin knows the drive is removed.  How come ZFS does not? --
 
# cp -r /rc-pool/copytest /test/copytest
# zpool list test
NAME  SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
test  232G  73.4M   232G     0%  ONLINE  -
# zfs list test
NAME   USED  AVAIL  REFER  MOUNTPOINT
test   142K   228G    18K  /test
 
 
-- Yup, still up.  Let's start the clock --
 
# date
Tue Jul 29 09:31:33 BST 2008
# du -hs /test/copytest
 667K   /test/copytest
 
 
-- 5 minutes later, still going strong --
 
# date
Tue Jul 29 09:36:30 BST 2008
# zpool list test
NAME  SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
test  232G  73.4M   232G     0%  ONLINE  -
# cp -r /rc-pool/copytest /test/copytest2
# ls /test
copytest   copytest2
# du -h -s /test
 1.3M   /test
# zpool list test
NAME   SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
test   232G  73.4M   232G     0%  ONLINE  -
# find /test | wc -l
    2669
# find //test/copytest | wc -l
    1334
# find /rc-pool/copytest | wc -l
    1334
# du -h -s /rc-pool/copytest
 5.3M   /rc-pool/copytest
 
 
-- Not sure why the original pool has 5.3MB of data when I use du. --
-- File Manager reports that they both have the same size --
 
 
-- 15 minutes later it's still working.  I can read data fine --
# date
Tue Jul 29 09:43:04 BST 2008
# chmod 777 /test/*
# mkdir /rc-pool/test2
# cp -r /test/copytest2 /rc-pool/test2/copytest2
# find /rc-pool/test2/copytest2 | wc -l
    1334
# zpool list test
NAME  SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
test  232G  73.4M   232G     0%  ONLINE  -
 
 
-- and yup, the drive is still offline --
 
# cfgadm | grep sata1/7
sata1/7                        sata-port    empty        unconfigured ok
-- And finally, after 30 minutes the pool is still going strong --
 
# date
Tue Jul 29 09:59:56 BST 2008
# tar -cf /test/copytest.tar /test/copytest/*
# ls -l
total 3
drwxrwxrwx   3 root     root           3 Jul 29 09:30 copytest
-rwxrwxrwx   1 root     root     4626432 Jul 29 09:59 copytest.tar
drwxrwxrwx   3 root     root           3 Jul 29 09:39 copytest2
# zpool list test
NAME   SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
test   232G  73.4M   232G     0%  ONLINE  -
 
After a full 30 minutes there's no indication whatsoever of any problem.  
Checking properties of the folder in File Browser reports 2665 items, totalling 
9.0MB.
 
At this point I tried # zfs set sharesmb=on test.  I didn't really expect it 
to work, and sure enough, that command hung.  zpool status also hung, so I had 
to reboot the server.
 
 
-- Rebooted server --
 
 
Now I found that not only are all the files I've written in the last 30 minutes 
missing, but in fact files that I had deleted several minutes prior to removing 
the drive have re-appeared.
 
 
-- /test mount point is still present, I'll probably have to remove that 
manually --
 
 
# cd /
# ls
bin          export       media        proc         system
boot         home         mnt          rc-pool      test
dev          kernel       net          rc-usb       tmp
devices      lib          opt          root         usr
etc          lost+found   platform     sbin         var
 
 
-- ZFS still has the pool mounted, but at least now it realises it's not 
working --
 
 
# zpool list
NAME  SIZE   

Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed

2008-07-28 Thread Ross Smith

File Browser is the name of the program that Solaris opens when you open 
Computer on the desktop.  It's the default graphical file manager.
 
It does eventually stop copying with an error, but it takes a good long while 
for ZFS to throw up that error, and even when it does, the pool doesn't report 
any problems at all.
 Date: Mon, 28 Jul 2008 13:03:24 -0500 From: [EMAIL PROTECTED] To: [EMAIL 
 PROTECTED] CC: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] 
 Supermicro AOC-SAT2-MV8 hang when drive removed  On Mon, 28 Jul 2008, Ross 
 wrote:   TEST1: Opened File Browser, copied the test data to the pool.  
  Half way through the copy I pulled the drive. THE COPY COMPLETED   
 WITHOUT ERROR. Zpool list reports the pool as online, however zpool   
 status hung as expected.  Are you sure that this reference software you 
 call File Browser  actually responds to errors? Maybe it is typical 
 Linux-derived  software which does not check for or handle errors and ZFS is 
  reporting errors all along while the program pretends to copy the lost  
 files. If you were using Microsoft Windows, its file browser would  probably 
 report Unknown error: 666 but at least you would see an  error dialog and 
 you could visit the Microsoft knowledge base to learn  that message ID 666 
 means Unknown error. The other possibility is  that all of these files fit 
 in the ZFS write cache so the error  reporting is delayed.  The Dtrace 
 Toolkit provides a very useful DTrace script called  'errinfo' which will 
 list every system call which reports and error.  This is very useful and 
 informative. If you run it, you will see  every error reported to the 
 application level.  Bob == Bob 
 Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ 
 GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ 
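The errinfo script Bob mentions ships with the DTraceToolkit; for a quick look 
without the toolkit, a one-liner along the same lines, counting system calls 
that return errors per program and errno, would be:

# run as root; Ctrl-C prints the aggregated counts
dtrace -n 'syscall:::return /errno != 0/ { @[execname, probefunc, errno] = count(); }'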
_
Invite your Facebook friends to chat on Messenger
http://clk.atdmt.com/UKM/go/101719649/direct/01/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed

2008-07-28 Thread Ross Smith

snv_91.  I downloaded snv_94 today so I'll be testing with that tomorrow.
 Date: Mon, 28 Jul 2008 09:58:43 -0700 From: [EMAIL PROTECTED] Subject: Re: 
 [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed To: [EMAIL 
 PROTECTED]  Which OS and revision? -- richard   Ross wrote:  Ok, 
 after doing a lot more testing of this I've found it's not the Supermicro 
 controller causing problems. It's purely ZFS, and it causes some major 
 problems! I've even found one scenario that appears to cause huge data loss 
 without any warning from ZFS - up to 30,000 files and 100MB of data missing 
 after a reboot, with zfs reporting that the pool is OK.   
 ***  1. 
 Solaris handles USB and SATA hot plug fine   If disks are not in use by 
 ZFS, you can unplug USB or SATA devices, cfgadm will recognise the 
 disconnection. USB devices are recognised automatically as you reconnect 
 them, SATA devices need reconfiguring. Cfgadm even recognises the SATA device 
 as an empty bay:   # cfgadm  Ap_Id Type Receptacle Occupant Condition 
  sata1/7 sata-port empty unconfigured ok  usb1/3 unknown empty 
 unconfigured ok   -- insert devices --   # cfgadm  Ap_Id Type 
 Receptacle Occupant Condition  sata1/7 disk connected unconfigured unknown 
  usb1/3 usb-storage connected configured ok   To bring the sata drive 
 online it's just a case of running  # cfgadm -c configure sata1/7
 ***  2. 
 If ZFS is using a hot plug device, disconnecting it will hang all ZFS status 
 tools.   While pools remain accessible, any attempt to run zpool status 
 will hang. I don't know if there is any way to recover these tools once this 
 happens. While this is a pretty big problem in itself, it also makes me worry 
 if other types of error could have the same effect. I see potential for this 
 leaving a server in a state whereby you know there are errors in a pool, but 
 have no way of finding out what those errors might be without rebooting the 
 server.   
 ***  3. 
 Once ZFS status tools are hung the computer will not shut down.   The 
 only way I've found to recover from this is to physically power down the 
 server. The solaris shutdown process simply hangs.   
 ***  4. 
 While reading an offline disk causes errors, writing does not!   *** CAUSES 
 DATA LOSS ***   This is a big one: ZFS can continue writing to an 
 unavailable pool. It doesn't always generate errors (I've seen it copy over 
 100MB before erroring), and if not spotted, this *will* cause data loss after 
 you reboot.   I discovered this while testing how ZFS coped with the 
 removal of a hot plug SATA drive. I knew that the ZFS admin tools were 
 hanging, but that redundant pools remained available. I wanted to see whether 
 it was just the ZFS admin tools that were failing, or whether ZFS was also 
 failing to send appropriate error messages back to the OS.   These are 
 the tests I carried out:   Zpool: Single drive zpool, consisting of one 
 250GB SATA drive in a hot plug bay.  Test data: A folder tree containing 
 19,160 items. 71.1MB in total.   TEST1: Opened File Browser, copied the 
 test data to the pool. Half way through the copy I pulled the drive. THE COPY 
 COMPLETED WITHOUT ERROR. Zpool list reports the pool as online, however zpool 
 status hung as expected.   Not quite believing the results, I rebooted 
 and tried again.   TEST2: Opened File Browser, copied the data to the 
 pool. Pulled the drive half way through. The copy again finished without 
 error. Checking the properties shows 19,160 files in the copy. ZFS list again 
 shows the filesystem as ONLINE.   Now I decided to see how many files I 
 could copy before it errored. I started the copy again. File Browser managed 
 a further 9,171 files before it stopped. That's nearly 30,000 files before 
 any error was detected. Again, despite the copy having finally errored, zpool 
 list shows the pool as online, even though zpool status hangs.   I 
 rebooted the server, and found that after the reboot my first copy contains 
 just 10,952 items, and my second copy is completely missing. That's a loss of 
 almost 20,000 files. Zpool status however reports NO ERRORS.   For the 
 third test I decided to see if these files are actually accessible before the 
 reboot:   TEST3: This time I pulled the drive *before* starting the copy. 
 The copy started much slower this time and only got to 2,939 files before 
 reporting an error. At this point I copied all the files that had been copied 
 to another pool, and then rebooted.   After the reboot, the folder in the 
 test pool had disappeared completely, but the copy I took before rebooting 
 was fine and contains 2,938 items, approximately 12MB of data. Again, zpool 
 status reports no errors.   

[zfs-discuss] FW: please help with raid / failure / rebuild calculations

2008-07-15 Thread Ross Smith



bits vs bytes D'oh! again.  It's a good job I don't do these calculations 
professionally. :-)

 Date: Tue, 15 Jul 2008 02:30:33 -0400
 From: [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 Subject: Re: [zfs-discuss] please help with raid / failure / rebuild calculations
 CC: zfs-discuss@opensolaris.org

 On 
Tue, Jul 15, 2008 at 01:58, Ross [EMAIL PROTECTED] wrote:  However, I'm not 
sure where the 8 is coming from in your calculations. Bits per byte ;)   In 
this case approximately 13/100 or around 1 in 8 odds. Taking into account the 
factor 8, and it's around 8 in 8.  Another possible factor to consider in 
calculations of this nature is that you probably won't get a single bit 
flipped here or there. If drives take 512-byte sectors and apply Hamming codes 
to those 512 bytes to get, say, 548 bytes of coded data that are actually 
written to disk, you need to flip (548-512)/2=16 bytes = 128 bits before you 
cannot correct them from the data you have. Thus, rather than getting one 
incorrect bit in a particular 4096-bit sector, you're likely to get all good 
sectors and one that's complete garbage. Unless the manufacturers' 
specifications account for this, I would say the sector error rate of the 
drive is about 1 in 4*(10**17). I have no idea whether they account for this 
or not, but it'd be interesting (and fairly doable) to test. Write a 1TB disk 
full of known data, then read it and verify. Then repeat until you have seen 
incorrect sectors a few times for a decent sample size, and store elsewhere 
what the sector was supposed to be and what it actually was.  Will
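A quick back-of-the-envelope check of the numbers being discussed, assuming 
the figures behind that 13/100 estimate were roughly 13TB read during a 
rebuild against a 1-in-10^14 spec; only the arithmetic is shown here:

awk 'BEGIN {
   bytes_read = 13 * 1000^4;                       # roughly 13TB read to rebuild
   printf("treating the spec as per byte: %.2f expected errors\n", bytes_read * 1e-14);
   printf("treating the spec as per bit : %.2f expected errors\n", bytes_read * 8 * 1e-14);
}'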

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] J4500 device renumbering

2008-07-15 Thread Ross Smith

It sounds like you might be interested to read up on Eric Schrock's work.  I 
read today about some of the stuff he's been doing to bring integrated fault 
management to Solaris:
http://blogs.sun.com/eschrock/entry/external_storage_enclosures_in_solaris
His last paragraph is great to see, Sun really do seem to be headed in the 
right direction:
 
I often like to joke about the amount of time that I have spent just getting a 
single LED to light. At first glance, it seems like a pretty simple task. But 
to do it in a generic fashion that can be generalized across a wide variety of 
platforms, correlated with physically meaningful labels, and incorporate a 
diverse set of diagnoses (ZFS, SCSI, HBA, etc) requires an awful lot of work. 
Once it's all said and done, however, future platforms will require little to 
no integration work, and you'll be able to see a bad drive generate checksum 
errors in ZFS, resulting in a FMA diagnosis indicating the faulty drive, 
activate a hot spare, and light the fault LED on the drive bay (wherever it may 
be). Only then will we have accomplished our goal of an end-to-end storage 
strategy for Solaris - and hopefully someone besides me will know what it has 
taken to get that little LED to light.
 
Ross
 
 Date: Tue, 15 Jul 2008 12:51:22 -0500 From: [EMAIL PROTECTED] To: [EMAIL 
 PROTECTED] CC: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] J4500 
 device renumbering  On Tue, 15 Jul 2008, Ross wrote:   Well I haven't 
 used a J4500, but when we had an x4500 (Thumper) on   loan they had Solaris 
 pretty well integrated with the hardware.   When a disk failed, I used 
 cfgadm to offline it and as soon as I did   that a bright blue Ready to 
 Remove LED lit up on the drive tray of   the faulty disk, right next to 
 the handle you need to lift to remove   the drive.  That sure sounds a 
 whole lot easier to manage than my setup with a  StorageTek 2540 and each 
 drive as a LUN. The 2540 could detect a  failed drive by itself and turn an 
 LED on, but if ZFS decides that a  drive has failed and the 2540 does not, 
 then I will have to use the  2540's CAM administrative interface and 
 manually set the drive out of  service. I very much doubt that cfgadm will 
 communicate with the 2540  and tell it to do anything.  A little while 
 back I created this table so I could understand how  things were mapped:  
 Disk    Volume    LUN  WWN                                              Device                             ZFS
 ======  ========  ===  ===============================================  =================================  ====
 t85d01  Disk-01     0  60:0A:0B:80:00:3A:8A:0B:00:00:09:61:47:B4:51:BE  c4t600A0B80003A8A0B096147B451BEd0  P3-A
 t85d02  Disk-02     1  60:0A:0B:80:00:39:C9:B5:00:00:0A:9C:47:B4:52:2D  c4t600A0B800039C9B50A9C47B4522Dd0  P6-A
 t85d03  Disk-03     2  60:0A:0B:80:00:39:C9:B5:00:00:0A:A0:47:B4:52:9B  c4t600A0B800039C9B50AA047B4529Bd0  P1-B
 t85d04  Disk-04     3  60:0A:0B:80:00:3A:8A:0B:00:00:09:66:47:B4:53:CE  c4t600A0B80003A8A0B096647B453CEd0  P4-A
 t85d05  Disk-05     4  60:0A:0B:80:00:39:C9:B5:00:00:0A:A4:47:B4:54:4F  c4t600A0B800039C9B50AA447B4544Fd0  P2-B
 t85d06  Disk-06     5  60:0A:0B:80:00:3A:8A:0B:00:00:09:6A:47:B4:55:9E  c4t600A0B80003A8A0B096A47B4559Ed0  P1-A
 t85d07  Disk-07     6  60:0A:0B:80:00:39:C9:B5:00:00:0A:A8:47:B4:56:05  c4t600A0B800039C9B50AA847B45605d0  P3-B
 t85d08  Disk-08     7  60:0A:0B:80:00:3A:8A:0B:00:00:09:6E:47:B4:56:DA  c4t600A0B80003A8A0B096E47B456DAd0  P2-A
 t85d09  Disk-09     8  60:0A:0B:80:00:39:C9:B5:00:00:0A:AC:47:B4:57:39  c4t600A0B800039C9B50AAC47B45739d0  P4-B
 t85d10  Disk-10     9  60:0A:0B:80:00:39:C9:B5:00:00:0A:B0:47:B4:57:AD  c4t600A0B800039C9B50AB047B457ADd0  P5-B
 t85d11  Disk-11    10  60:0A:0B:80:00:3A:8A:0B:00:00:09:73:47:B4:57:D4  c4t600A0B80003A8A0B097347B457D4d0  P5-A
 t85d12  Disk-12    11  60:0A:0B:80:00:39:C9:B5:00:00:0A:B4:47:B4:59:5F  c4t600A0B800039C9B50AB447B4595Fd0  P6-B

 When I selected the drive pairings, it was based on a dump from a multipath 
 utility and it seems that on a chassis level there is no rhyme or reason for 
 the zfs mirror pairings.  This is an area where traditional RAID hardware 
 makes ZFS more difficult to use.

 Bob
_
Find the best and worst places on the planet
http://clk.atdmt.com/UKM/go/101719807/direct/01/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss