Re: [zfs-discuss] dedupe is in

2009-11-02 Thread Ross Smith
Ok, thanks everyone then (but still thanks to Victor for the heads up)  :-)


On Mon, Nov 2, 2009 at 4:03 PM, Victor Latushkin
 wrote:
> On 02.11.09 18:38, Ross wrote:
>>
>> Double WOHOO!  Thanks Victor!
>
> Thanks should go to Tim Haley, Jeff Bonwick and George Wilson ;-)
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Tunable iSCSI timeouts - ZFS over iSCSI fix

2009-07-29 Thread Ross Smith
Yup, somebody pointed that out to me last week and I can't wait :-)


On Wed, Jul 29, 2009 at 7:48 PM, Dave wrote:
> Anyone (Ross?) creating ZFS pools over iSCSI connections will want to pay
> attention to snv_121 which fixes the 3 minute hang after iSCSI disk
> problems:
>
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=649
>
> Yay!
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-14 Thread Ross Smith
Hey guys,

I'll let this die in a sec, but I just wanted to say that I've gone
and read the on-disk format document again this morning, and to be
honest, Richard, without the description you just wrote, I really wouldn't
have known that uberblocks are in a 128-entry circular queue that's 4x
redundant.

Please understand that I'm not asking for answers to these notes, this
post is purely to illustrate to you ZFS guys that much as I appreciate
having the ZFS docs available, they are very tough going for anybody
who isn't a ZFS developer.  I consider myself well above average in IT
ability, and I've really spent quite a lot of time in the past year
reading around ZFS, but even so I would definitely have come to the
wrong conclusion regarding uberblocks.

Richard's post I can understand really easily, but in the on-disk
format docs that information is spread over 7 pages of really quite
technical detail and, to be honest, for a user like myself it raises as
many questions as it answers:

On page 6 I learn that labels are stored on each vdev, as well as each
disk.  So there will be a label on the pool, mirror (or raid group),
and disk.  I know the disk ones are at the start and end of the disk,
and it sounds like the mirror vdev is in the same place, but where is
the root vdev label?  The example given doesn't mention its location
at all.

Then, on page 7 it sounds like the entire label is overwritten whenever
on-disk data is updated - "any time on-disk data is overwritten, there
is potential for error".  To me, it sounds like it's not a 128-entry
queue, but just a group of 4 labels, all of which are overwritten as
data goes to disk.

Then finally, on page 12 the uberblock is mentioned (although as an
aside, the first time I read these docs I had no idea what the
uberblock actually was).  It does say that only one uberblock is
active at a time, but with it being part of the label I'd just assume
these were overwritten as a group..
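
As an aside, for anyone else wanting to check this themselves, something
like the following should show the labels and the active uberblock.  This
is a sketch from memory rather than gospel, and 'tank' and the device name
are just placeholders:

# zdb -l /dev/rdsk/c1t0d0s0    <- dumps the four labels on that device
# zdb -uuu tank                <- prints the active uberblock for the pool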

And that's why I'll often throw ideas out - I can either rely on my
own limited knowledge of ZFS to say if it will work, or I can take
advantage of the excellent community we have here, and post the idea
for all to see.  It's a quick way for good ideas to be improved upon,
and bad ideas consigned to the bin.  I've done it before in my rather
lengthy 'zfs availability' thread.  My thoughts there were thrashed
out nicely, with some quite superb additions (namely the concept of
lopsided mirrors, which I think are a great idea).

Ross

PS.  I've also found out why I thought you had to search for these
blocks: it was after reading this thread, where somebody used mdb to
search a corrupt pool to try to recover data:
http://opensolaris.org/jive/message.jspa?messageID=318009

On Fri, Feb 13, 2009 at 11:09 PM, Richard Elling
 wrote:
> Tim wrote:
>>
>>
>> On Fri, Feb 13, 2009 at 4:21 PM, Bob Friesenhahn
>>  wrote:
>>
>>On Fri, 13 Feb 2009, Ross Smith wrote:
>>
>>However, I've just had another idea.  Since the uberblocks are
>>pretty
>>vital in recovering a pool, and I believe it's a fair bit of
>>work to
>>search the disk to find them.  Might it be a good idea to
>>allow ZFS to
>>store uberblock locations elsewhere for recovery purposes?
>>
>>
>>Perhaps it is best to leave decisions on these issues to the ZFS
>>designers who know how things work.
>>
>>Previous descriptions from people who do know how things work
>>didn't make it sound very difficult to find the last 20
>>uberblocks.  It sounded like they were at known points for any
>>given pool.
>>
>>Those folks have surely tired of this discussion by now and are
>>working on actual code rather than reading idle discussion between
>>several people who don't know the details of how things work.
>>
>>
>>
>> People who "don't know how things work" often aren't tied down by the
>> baggage of knowing how things work.  Which leads to creative solutions those
>> who are weighed down didn't think of.  I don't think it hurts in the least
>> to throw out some ideas.  If they aren't valid, it's not hard to ignore them
>> and move on.  It surely isn't a waste of anyone's time to spend 5 minutes
>> reading a response and weighing if the idea is valid or not.
>
> OTOH, anyone who followed this discussion the last few times, has looked
> at the on-disk format documents, or reviewed the source code would know
> that the uberblocks are kept in a 128-entry circular queue which is 4x
> redundant, with 2 copies each at the beginning and end of the device.

Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-13 Thread Ross Smith
You don't, but that's why I was wondering about time limits.  You have
to have a cut off somewhere, but if you're checking the last few
minutes of uberblocks that really should cope with a lot.  It seems
like a simple enough thing to implement, and if a pool still gets
corrupted with these checks in place, you can absolutely, positively
blame it on the hardware.  :D

However, I've just had another idea.  Since the uberblocks are pretty
vital in recovering a pool, and I believe it's a fair bit of work to
search the disk to find them, might it be a good idea to allow ZFS to
store uberblock locations elsewhere for recovery purposes?

This could be as simple as a USB stick plugged into the server, a
separate drive, or a network server.  I guess even the ZIL device
would work if it's separate hardware.  But knowing the locations of
the uberblocks would save yet more time should recovery be needed.



On Fri, Feb 13, 2009 at 8:59 PM, Bob Friesenhahn
 wrote:
> On Fri, 13 Feb 2009, Ross Smith wrote:
>
>> Thinking about this a bit more, you've given me an idea:  Would it be
>> worth ZFS occasionally reading previous uberblocks from the pool, just
>> to check they are there and working ok?
>
> That sounds like a good idea.  However, how do you know for sure that the
> data returned is not returned from a volatile cache?  If the hardware is
> ignoring cache flush requests, then any data returned may be from a volatile
> cache.
>
> Bob
> ==
> Bob Friesenhahn
> bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
>
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-13 Thread Ross Smith
On Fri, Feb 13, 2009 at 8:24 PM, Bob Friesenhahn
 wrote:
> On Fri, 13 Feb 2009, Ross Smith wrote:
>>
>> You have to consider that even with improperly working hardware, ZFS
>> has been checksumming data, so if that hardware has been working for
>> any length of time, you *know* that the data on it is good.
>
> You only know this if the data has previously been read.
>
> Assume that the device temporarily stops physically writing, but otherwise
> responds normally to ZFS.  Then the device starts writing again (including a
> recent uberblock), but with a large gap in the writes.  Then the system
> loses power, or crashes.  What happens then?

Hey Bob,

Thinking about this a bit more, you've given me an idea:  Would it be
worth ZFS occasionally reading previous uberblocks from the pool, just
to check they are there and working ok?

I wonder if you could do this after a few uberblocks have been
written.  It would seem to be a good way of catching devices that
aren't writing correctly early on, as well as a way of guaranteeing
that previous uberblocks are available to roll back to should a write
go wrong.

I wonder what the upper limit for this kind of write failure is going
to be.  I've seen 30 second delays mentioned in this thread.  How
often are uberblocks written?  Is there any guarantee that we'll
always have more than 30 seconds' worth of uberblocks on a drive?
Should ZFS be set so that it keeps either a given number of
uberblocks, or 5 minutes' worth of uberblocks, whichever is larger?

Ross
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-13 Thread Ross Smith
On Fri, Feb 13, 2009 at 8:24 PM, Bob Friesenhahn
 wrote:
> On Fri, 13 Feb 2009, Ross Smith wrote:
>>
>> You have to consider that even with improperly working hardware, ZFS
>> has been checksumming data, so if that hardware has been working for
>> any length of time, you *know* that the data on it is good.
>
> You only know this if the data has previously been read.
>
> Assume that the device temporarily stops physically writing, but otherwise
> responds normally to ZFS.  Then the device starts writing again (including a
> recent uberblock), but with a large gap in the writes.  Then the system
> loses power, or crashes.  What happens then?

Well, in that case you're screwed, but if ZFS is known to handle even
corrupted pools automatically, then when that happens the immediate
response on the forums is going to be "something really bad has
happened to your hardware", followed by troubleshooting to find out
what.  That's instead of the response now, where we all know there's
every chance the data is ok and just can't be gotten to without zdb.

Also, that's a pretty extreme situation since you'd need a device that
is being written to but not read from to fail in this exact way.  It
also needs to have no scrubbing being run, so the problem has remained
undetected.

However, even in that situation, if we assume that it happened and
that these recovery tools are available, ZFS will either report that
your pool is seriously corrupted, indicating a major hardware problem
(and ZFS can now state this with some confidence), or ZFS will be able
to open a previous uberblock, mount your pool and begin a scrub, at
which point all your missing writes will be found too and reported.

And then you can go back to your snapshots.  :-D


>
> Bob
> ==
> Bob Friesenhahn
> bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
>
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-13 Thread Ross Smith
On Fri, Feb 13, 2009 at 7:41 PM, Bob Friesenhahn
 wrote:
> On Fri, 13 Feb 2009, Ross wrote:
>>
>> Something like that will have people praising ZFS' ability to safeguard
>> their data, and the way it recovers even after system crashes or when
>> hardware has gone wrong.  You could even have a "common causes of this
>> are..." message, or a link to an online help article if you wanted people to
>> be really impressed.
>
> I see a career in politics for you.  Barring an operating system
> implementation bug, the type of problem you are talking about is due to
> improperly working hardware.  Irreversibly reverting to a previous
> checkpoint may or may not obtain the correct data.  Perhaps it will produce
> a bunch of checksum errors.

Yes, the root cause is improperly working hardware (or an OS bug like
6424510), but with ZFS being a copy-on-write system, when errors occur
on a recent write, the vast majority of pools out there will still have
huge amounts of data that is perfectly valid and should be accessible.
Unless I'm misunderstanding something, reverting to a previous
checkpoint gets you back to a state that ZFS knows is good (or at
least one where ZFS can verify whether it's good or not).

You have to consider that even with improperly working hardware, ZFS
has been checksumming data, so if that hardware has been working for
any length of time, you *know* that the data on it is good.

Yes, if you have databases or files there that were mid-write, they
will almost certainly be corrupted.  But at least your filesystem is
back, and it's in as good a state as it's going to be given that in
order for your pool to be in this position, your hardware went wrong
mid-write.

And as an added bonus, if you're using ZFS snapshots, now that your
pool is accessible you have a bunch of backups available, so you can
probably roll corrupted files back to working versions.

For me, that is about as good as you can get in terms of handling a
sudden hardware failure.  Everything that is known to be saved to disk
is there, you can verify (with absolute certainty) whether data is ok
or not, and you have backup copies of damaged files.  In the old days
you'd need to be reverting to tape backups for both of these, with
potentially hours of downtime before you even know where you are.
Achieving that in a few seconds (or minutes) is a massive step
forwards.

> There are already people praising ZFS' ability to safeguard their data, and
> the way it recovers even after system crashes or when hardware has gone
> wrong.

Yes there are, but the majority of these are praising the ability of
ZFS checksums to detect bad data, and to repair it when you have
redundancy in your pool.  I've not seen that many cases of people
praising ZFS' recovery ability - uberblock problems seem to have a
nasty habit of leaving you with tons of good, checksummed data on a
pool that you can't get to, and while many hardware problems are dealt
with, others can hang your entire pool.


>
> Bob
> ==
> Bob Friesenhahn
> bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
>
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-12 Thread Ross Smith
That would be the ideal, but really I'd settle for just improved error
handling and recovery for now.  In the longer term, disabling write
caching by default for USB or Firewire drives might be nice.
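
In the meantime, if I understand the format utility right, you can already
turn the write cache off by hand for a given disk in expert mode.  Something
along these lines, although whether it sticks for a USB device presumably
depends on the driver:

# format -e
(select the USB disk from the list)
format> cache
cache> write_cache
write_cache> disable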


On Thu, Feb 12, 2009 at 8:35 PM, Gary Mills  wrote:
> On Thu, Feb 12, 2009 at 11:53:40AM -0500, Greg Palmer wrote:
>> Ross wrote:
>> >I can also state with confidence that very, very few of the 100 staff
>> >working here will even be aware that it's possible to unmount a USB volume
>> >in windows.  They will all just pull the plug when their work is saved,
>> >and since they all come to me when they have problems, I think I can
>> >safely say that pulling USB devices really doesn't tend to corrupt
>> >filesystems in Windows.  Everybody I know just waits for the light on the
>> >device to go out.
>> >
>> The key here is that Windows does not cache writes to the USB drive
>> unless you go in and specifically enable them. It caches reads but not
>> writes. If you enable them you will lose data if you pull the stick out
>> before all the data is written. This is the type of safety measure that
>> needs to be implemented in ZFS if it is to support the average user
>> instead of just the IT professionals.
>
> That implies that ZFS will have to detect removable devices and treat
> them differently than fixed devices.  It might have to be an option
> that can be enabled for higher performance with reduced data security.
>
> --
> -Gary Mills--Unix Support--U of M Academic Computing and Networking-
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed

2009-02-12 Thread Ross Smith
Heh, yeah, I've thought the same kind of thing in the past.  The
problem is that the argument doesn't really work for system admins.

As far as I'm concerned, the 7000 series is a new hardware platform,
with relatively untested drivers, running a software solution that I
know is prone to locking up when hardware faults are handled badly by
drivers.  Fair enough, that actual solution is out of our price range,
but I would still be very dubious about purchasing it.  At the very
least I'd be waiting a year for other people to work the kinks out of
the drivers.

Which is a shame, because ZFS has so many other great features it's
easily our first choice for a storage platform.  The one and only
concern we have is its reliability.  We have snv_106 running as a test
platform now.  If I felt I could trust ZFS 100% I'd roll it out
tomorrow.



On Thu, Feb 12, 2009 at 4:25 PM, Tim  wrote:
>
>
> On Thu, Feb 12, 2009 at 9:25 AM, Ross  wrote:
>>
>> This sounds like exactly the kind of problem I've been shouting about for
>> 6 months or more.  I posted a huge thread on availability on these forums
>> because I had concerns over exactly this kind of hanging.
>>
>> ZFS doesn't trust hardware or drivers when it comes to your data -
>> everything is checksummed.  However, when it comes to seeing whether devices
>> are responding, and checking for faults, it blindly trusts whatever the
>> hardware or driver tells it.  Unfortunately, that means ZFS is vulnerable to
>> any unexpected bug or error in the storage chain.  I've encountered at least
>> two hang conditions myself (and I'm not exactly a heavy user), and I've seen
>> several others on the forums, including a few on x4500's.
>>
>> Now, I do accept that errors like this will be few and far between, but
>> they still mean you have the risk that a badly handled error condition can
>> hang your entire server, instead of just one drive.  Solaris can handle
>> things like CPUs or memory going faulty, for crying out loud.  Its raid
>> storage system had better be able to handle a disk failing.
>>
>> Sun seem to be taking the approach that these errors should be dealt with
>> in the driver layer.  And while that's technically correct, a reliable
>> storage system had damn well better be able to keep the server limping along
>> while we wait for patches to the storage drivers.
>>
>> ZFS absolutely needs an error handling layer between the volume manager
>> and the devices.  It needs to time out items that are not responding, and it
>> needs to drop bad devices if they could cause problems elsewhere.
>>
>> And yes, I'm repeating myself, but I can't understand why this is not
>> being acted on.  Right now the error checking appears to be such that if an
>> unexpected, or badly handled error condition occurs in the driver stack, the
>> pool or server hangs.  Whereas the expected behavior would be for just one
>> drive to fail.  The absolute worst case scenario should be that an entire
>> controller has to be taken offline (and I would hope that the controllers in
>> an x4500 would be running separate instances of the driver software).
>>
>> None of those conditions should be fatal; good storage designs cope
>> with them all, and good error handling at the ZFS layer is absolutely vital
>> when you have projects like Comstar introducing more and more types of
>> storage device for ZFS to work with.
>>
>> Each extra type of storage introduces yet more software into the equation,
>> and increases the risk of finding faults like this.  While they will be
>> rare, they should be expected, and ZFS should be designed to handle them.
>
>
> I'd imagine for the exact same reason short-stroking/right-sizing isn't a
> concern.
>
> "We don't have this problem in the 7000 series, perhaps you should buy one
> of those".
>
> ;)
>
> --Tim
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Data loss bug - sidelined??

2009-02-06 Thread Ross Smith
Something to do with cache was my first thought.  It seems to be able
to read and write from the cache quite happily for some time,
regardless of whether the pool is live.

If you're reading or writing large amounts of data, zfs starts
experiencing IO faults and offlines the pool pretty quickly.  If
you're just working with small datasets, or viewing files that you've
recently opened, it seems you can stretch it out for quite a while.

But yes, it seems that it doesn't enter failmode until the cache is
full.  I would expect it to hit this within 5 seconds (since I believe
that is how often the cache should be writing).


On Fri, Feb 6, 2009 at 7:04 PM, Brent Jones  wrote:
> On Fri, Feb 6, 2009 at 10:50 AM, Ross Smith  wrote:
>> I can check on Monday, but the system will probably panic... which
>> doesn't really help :-)
>>
>> Am I right in thinking failmode=wait is still the default?  If so,
>> that should be how it's set as this testing was done on a clean
>> install of snv_106.  From what I've seen, I don't think this is a
>> problem with the zfs failmode.  It's more of an issue of what happens
>> in the period *before* zfs realises there's a problem and applies the
>> failmode.
>>
>> This time there was just a window of a couple of minutes while
>> commands would continue.  In the past I've managed to stretch it out
>> to hours.
>>
>> To me the biggest problems are:
>> - ZFS accepting writes that don't happen (from both before and after
>> the drive is removed)
>> - No logging or warning of this in zpool status
>>
>> I appreciate that if you're using cache, some data loss is pretty much
>> inevitable when a pool fails, but that should be a few seconds worth
>> of data at worst, not minutes or hours worth.
>>
>> Also, if a pool fails completely and there's data in the cache that
>> hasn't been committed to disk, it would be great if Solaris could
>> respond by:
>>
>> - immediately dumping the cache to any (all?) working storage
>> - prompting the user to fix the pool, or save the cache before
>> powering down the system
>>
>> Ross
>>
>>
>> On Fri, Feb 6, 2009 at 5:49 PM, Richard Elling  
>> wrote:
>>> Ross, this is a pretty good description of what I would expect when
>>> failmode=continue. What happens when failmode=panic?
>>> -- richard
>>>
>>>
>>> Ross wrote:
>>>>
>>>> Ok, it's still happening in snv_106:
>>>>
>>>> I plugged a USB drive into a freshly installed system, and created a
>>>> single disk zpool on it:
>>>> # zpool create usbtest c1t0d0
>>>>
>>>> I opened the (nautilus?) file manager in gnome, and copied the /etc/X11
>>>> folder to it.  I then copied the /etc/apache folder to it, and at 4:05pm,
>>>> disconnected the drive.
>>>>
>>>> At this point there are *no* warnings on screen, or any indication that
>>>> there is a problem.  To check that the pool was still working, I created
>>>> duplicates of the two folders on that drive.  That worked without any
>>>> errors, although the drive was physically removed.
>>>>
>>>> 4:07pm
>>>> I ran zpool status, the pool is actually showing as unavailable, so at
>>>> least that has happened faster than my last test.
>>>>
>>>> The folder is still open in gnome, however any attempt to copy files to or
>>>> from it just hangs the file transfer operation window.
>>>>
>>>> 4:09pm
>>>> /usbtest is still visible in gnome
>>>> Also, I can still open a console and use the folder:
>>>>
>>>> # cd usbtest
>>>> # ls
>>>> X11           X11 (copy)    apache        apache (copy)
>>>>
>>>> I also tried:
>>>> # mv X11 X11-test
>>>>
>>>> That hung, but I saw the X11 folder disappear from the graphical file
>>>> manager, so the system still believes something is working with this pool.
>>>>
>>>> The main GUI is actually a little messed up now.  The gnome file manager
>>>> window looking at the /usbtest folder has hung.  Also, right-clicking the
>>>> desktop to open a new terminal hangs, leaving the right-click menu on
>>>> screen.
>>>>
>>>> The main menu still works though, and I can still open a new terminal.
>>>>
>>>> 4:19pm
>>>> Commands such as ls are finally hanging on the pool

Re: [zfs-discuss] Data loss bug - sidelined??

2009-02-06 Thread Ross Smith
I can check on Monday, but the system will probably panic... which
doesn't really help :-)

Am I right in thinking failmode=wait is still the default?  If so,
that should be how it's set as this testing was done on a clean
install of snv_106.  From what I've seen, I don't think this is a
problem with the zfs failmode.  It's more of an issue of what happens
in the period *before* zfs realises there's a problem and applies the
failmode.
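
For reference, checking or changing it is just the usual property commands,
something like this with my test pool:

# zpool get failmode usbtest
# zpool set failmode=panic usbtest

but as I say, I don't think the failmode setting itself is the problem here.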

This time there was just a window of a couple of minutes while
commands would continue.  In the past I've managed to stretch it out
to hours.

To me the biggest problems are:
- ZFS accepting writes that don't happen (from both before and after
the drive is removed)
- No logging or warning of this in zpool status

I appreciate that if you're using cache, some data loss is pretty much
inevitable when a pool fails, but that should be a few seconds worth
of data at worst, not minutes or hours worth.

Also, if a pool fails completely and there's data in the cache that
hasn't been committed to disk, it would be great if Solaris could
respond by:

- immediately dumping the cache to any (all?) working storage
- prompting the user to fix the pool, or save the cache before
powering down the system

Ross


On Fri, Feb 6, 2009 at 5:49 PM, Richard Elling  wrote:
> Ross, this is a pretty good description of what I would expect when
> failmode=continue. What happens when failmode=panic?
> -- richard
>
>
> Ross wrote:
>>
>> Ok, it's still happening in snv_106:
>>
>> I plugged a USB drive into a freshly installed system, and created a
>> single disk zpool on it:
>> # zpool create usbtest c1t0d0
>>
>> I opened the (nautilus?) file manager in gnome, and copied the /etc/X11
>> folder to it.  I then copied the /etc/apache folder to it, and at 4:05pm,
>> disconnected the drive.
>>
>> At this point there are *no* warnings on screen, or any indication that
>> there is a problem.  To check that the pool was still working, I created
>> duplicates of the two folders on that drive.  That worked without any
>> errors, although the drive was physically removed.
>>
>> 4:07pm
>> I ran zpool status, the pool is actually showing as unavailable, so at
>> least that has happened faster than my last test.
>>
>> The folder is still open in gnome, however any attempt to copy files to or
>> from it just hangs the file transfer operation window.
>>
>> 4:09pm
>> /usbtest is still visible in gnome
>> Also, I can still open a console and use the folder:
>>
>> # cd usbtest
>> # ls
>> X11           X11 (copy)    apache        apache (copy)
>>
>> I also tried:
>> # mv X11 X11-test
>>
>> That hung, but I saw the X11 folder disappear from the graphical file
>> manager, so the system still believes something is working with this pool.
>>
>> The main GUI is actually a little messed up now.  The gnome file manager
>> window looking at the /usbtest folder has hung.  Also, right-clicking the
>> desktop to open a new terminal hangs, leaving the right-click menu on
>> screen.
>>
>> The main menu still works though, and I can still open a new terminal.
>>
>> 4:19pm
>> Commands such as ls are finally hanging on the pool.
>>
>> At this point I tried to reboot, but it appears that isn't working.  I
>> used system monitor to kill everything I had running and tried again, but
>> that didn't help.
>>
>> I had to physically power off the system to reboot.
>>
>> After the reboot, as expected, /usbtest still exists (even though the
>> drive is disconnected).  I removed that folder and connected the drive.
>>
>> ZFS detects the insertion and automounts the drive, but I find that
>> although the pool is showing as online, and the filesystem shows as mounted
>> at /usbtest.  But the /usbtest directory doesn't exist.
>>
>> I had to export and import the pool to get it available, but as expected,
>> I've lost data:
>> # cd usbtest
>> # ls
>> X11
>>
>> even worse, zfs is completely unaware of this:
>> # zpool status -v usbtest
>>  pool: usbtest
>>  state: ONLINE
>>  scrub: none requested
>> config:
>>
>>        NAME      STATE     READ WRITE CKSUM
>>        usbtest   ONLINE       0     0     0
>>          c1t0d0  ONLINE       0     0     0
>>
>> errors: No known data errors
>>
>>
>> So in summary, there are a good few problems here, many of which I've
>> already reported as bugs:
>>
>> 1. ZFS still accepts read and write operations for a faulted pool, causing
>> data loss that isn't necessarily reported by zpool status.
>> 2. Even after writes start to hang, it's still possible to continue
>> reading data from a faulted pool.
>> 3. A faulted pool causes unwanted side effects in the GUI, making the
>> system hard to use, and impossible to reboot.
>> 4. After a hard reset, ZFS does not recover cleanly.  Unused mountpoints
>> are left behind.
>> 5. Automatic mounting of pools doesn't seem to work reliably.
>> 6. zpool status doesn't inform you of any problems mounting the pool.
>>
>
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Re: [zfs-discuss] Any way to set casesensitivity=mixed on the main pool?

2009-02-04 Thread Ross Smith
It's not intuitive because when you know that -o sets options, an
error message saying that it's not a valid property makes you think
that it's not possible to do what you're trying.

Documented and intuitive are very different things.  I do appreciate
that the details are there in the manuals, but for items like this
where it's very easy to pick the wrong one, it helps if the commands
can work with you.

The difference between -o and -O is pretty subtle, I just think that
extra sentence in the error message could save a lot of frustration
when people get mixed up.
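
To spell out what I mean, the two forms end up looking almost identical on
the command line (pool and device names here are just examples):

# zpool create -o autoreplace=on tank c1t0d0           <- -o sets a pool property
# zpool create -O casesensitivity=mixed tank c1t0d0    <- -O sets a property on the root filesystem

which is exactly why a one-line hint in the error message would help.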

Ross



On Wed, Feb 4, 2009 at 11:14 AM, Darren J Moffat
 wrote:
> Ross wrote:
>>
>> Good god.  Talk about non intuitive.  Thanks Darren!
>
> Why isn't that intuitive ?  It is even documented in the man page.
>
> zpool create [-fn] [-o property=value] ... [-O file-system-
> property=value] ... [-m mountpoint] [-R root] pool vdev ...
>
>
>> Is it possible for me to suggest a quick change to the zpool error message
>> in solaris?  Should I file that as an RFE?  I'm just wondering if the error
>> message could be changed to something like:
>> "property 'casesensitivity' is not a valid pool property.  Did you mean to
>> use -O?"
>>
>> It's just a simple change, but it makes it obvious that it can be done,
>> instead of giving the impression that it's not possible.
>
> Feel free to log the RFE in defect.opensolaris.org.
>
> --
> Darren J Moffat
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD drives in Sun Fire X4540 or X4500 for dedicated ZIL device

2009-01-23 Thread Ross Smith
That's my understanding too.  One (STEC?) drive as a write cache,
basically a write-optimised SSD, and cheaper, larger, read-optimised
SSDs for the read cache.

I thought it was an odd strategy until I read into SSDs a little more
and realised you really do have to think about your usage cases with
these.  SSDs are very definitely not all alike.
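
For anyone wanting to try the same split on their own hardware, my
understanding is that it's just a case of adding the two device classes
separately (device names made up):

# zpool add tank log c2t0d0      <- write-optimised SSD as a dedicated ZIL
# zpool add tank cache c2t1d0    <- larger, read-optimised SSD as a read cache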


On Fri, Jan 23, 2009 at 4:33 PM, Greg Mason  wrote:
> If i'm not mistaken (and somebody please correct me if i'm wrong), the Sun
> 7000 series storage appliances (the Fishworks boxes) use enterprise SSDs,
> with dram caching. One such product is made by STEC.
>
> My understanding is that the Sun appliances use one SSD for the ZIL, and one
> as a read cache. For the 7210 (which is basically a Sun Fire X4540), that
> gives you 46 disks and 2 SSDs.
>
> -Greg
>
>
> Bob Friesenhahn wrote:
>>
>> On Thu, 22 Jan 2009, Ross wrote:
>>
>>> However, now I've written that, Sun use SATA (SAS?) SSD's in their high
>>> end fishworks storage, so I guess it definately works for some use cases.
>>
>> But the "fishworks" (Fishworks is a development team, not a product) write
>> cache device is not based on FLASH.  It is based on DRAM.  The difference is
>> like night and day. Apparently there can also be a read cache which is based
>> on FLASH.
>>
>> Bob
>> ==
>> Bob Friesenhahn
>> bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
>> GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
>>
>> ___
>> zfs-discuss mailing list
>> zfs-discuss@opensolaris.org
>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>>
>>
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs list improvements?

2009-01-10 Thread Ross Smith
Hmm... that's a tough one.  To me, it's a trade-off either way: using
a -r parameter to specify the depth for zfs list feels more intuitive
than adding extra options to modify the -r behaviour, but I can see
your point.

But then, using -c or -d means there's an optional parameter for zfs
list that you don't have in the other commands anyway.  And would you
have to use -c or -d with -r, or would they work on their own,
providing two ways to achieve very similar functionality?

Also, now you've mentioned that you want to keep things consistent
among all the commands, keeping -c and -d free becomes more important
to me.  You don't know if you might want to use these for another
command later on.

It sounds to me like whichever way you implement it there's going to
be some potential for confusion, but personally I'd stick with using
-r.  It leaves you with a single syntax for viewing children.  The -r
option on the other commands can be modified to give an error message
if they don't support this extra parameter, and it leaves both -c and
-d free to use later on.
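
Just to make the comparison concrete, the syntax I'm suggesting would look
something like this (pool name is just an example):

# zfs list -r tank       <- all descendants, as now
# zfs list -r 2 tank     <- proposed: limit output to two levels of children

rather than introducing a separate -c or -d flag that only zfs list understands.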

Ross



On Fri, Jan 9, 2009 at 7:16 PM, Richard Morris - Sun Microsystems -
Burlington United States  wrote:
> On 01/09/09 01:44, Ross wrote:
>>
>> Can I ask why we need to use -c or -d at all?  We already have -r to
>> recursively list children, can't we add an optional depth parameter to that?
>>
>> You then have:
>> zfs list : shows current level (essentially -r 0)
>> zfs list -r : shows all levels (infinite recursion)
>> zfs list -r 2 : shows 2 levels of children
>
> An optional depth argument to -r has already been suggested:
> http://mail.opensolaris.org/pipermail/zfs-discuss/2009-January/054241.html
>
> However, other zfs subcommands such as destroy, get, rename, and snapshot
> also provide -r options without optional depth arguments.  And its probably
> good to keep the zfs subcommand option syntax consistent.  On the other
> hand,
> if all of the zfs subcommands were modified to accept an optional depth
> argument
> to -r, then this would not be an issue.  But, for example, the top level(s)
> of
> datasets cannot be destroyed if that would leave orphaned datasets.
>
> BTW, when no dataset is specified, zfs list is the same as zfs list -r
> (infinite
> recursion).  When a dataset is specified then it shows only the current
> level.
>
> Does anyone have any non-theoretical situations where a depth option other
> than
> 1 or 2 would be used?  Are scripts being used to work around this problem?
>
> -- Rich
>
>
>
>
>
>
>
>
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Using zfs mirror as a simple backup mechanism for time-slider.

2008-12-22 Thread Ross Smith
On Fri, Dec 19, 2008 at 6:47 PM, Richard Elling  wrote:
> Ross wrote:
>>
>> Well, I really like the idea of an automatic service to manage
>> send/receives to backup devices, so if you guys don't mind, I'm going to
>> share some other ideas for features I think would be useful.
>>
>
> cool.
>
>> One of the first is that you need some kind of capacity management and
>> snapshot deletion.  Eventually backup media are going to fill and you need
>> to either prompt the user to remove snapshots, or even better, you need to
>> manage the media automatically and remove old snapshots to make space for
>> new ones.
>>
>
> I've implemented something like this for a project I'm working on.
> Consider this a research project at this time, though I hope to
> leverage some of the things we learn as we scale up, out, and
> refine the operating procedures.

Way cool :D

> There is a failure mode lurking here.  Suppose you take two sets
> of snapshots: local and remote.  You want to do an incremental
> send, for efficiency.  So you look at the set of snapshots on both
> machines and find the latest, common snapshot.  You will then
> send the list of incrementals from the latest, common through the
> latest snapshot.  On the remote machine, if there are any other
> snapshots not in the list you are sending and newer than the latest,
> common snapshot, then the send/recv will fail.  In practice, this
> means that if you use the zfs-auto-snapshot feature, which will
> automatically destroy older snapshots as it goes (eg. the default
> policy for "frequent" is take snapshots every 15 minutes, keep 4).
>
> If you never have an interruption in your snapshot schedule, you
> can merrily cruise along and not worry about this.  But if there is
> an interruption (for maintenance, perhaps) and a snapshot is
> destroyed on the sender, then you also must make sure it gets
> destroyed on the receiver.  I just polished that code yesterday,
> and it seems to work fine... though it makes folks a little nervous.
> Anyone with an operations orientation will recognize that there
> needs to be a good process wrapped around this, but I haven't
> worked through all of the scenarios on the receiver yet.

Very true.  In this context I think this would be fine.  You would
want a warning to pop up saying that a snapshot has been deleted
locally and will have to be overwritten on the backup, but I think
that would be ok.  If necessary you could have a help page explaining
why - essentially this is a copy of your pool, not just a backup of
your files, and to work it needs an accurate copy of your snapshots.
If you wanted to be really fancy, you could have an option for the
user to view the affected files, but I think that's probably
overcomplicating things.

I don't suppose there's any way the remote snapshot can be cloned /
separated from the pool just in case somebody wanted to retain access
to the files within it?
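
(For my own notes, the flow Richard describes is the standard incremental
send/receive pattern, with made-up dataset names:

# zfs send -i tank/home@sunday tank/home@monday | zfs recv backup/home

and the failure case is the receiving side refusing the stream because the
backup holds snapshots newer than the common one that aren't in the list
being sent.)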

>
>> I'm thinking that a setup like time slider would work well, where you
>> specify how many of each age of snapshot to keep.  But I would want to be
>> able to specify different intervals for different devices.
>>
>> eg. I might want just the latest one or two snapshots on a USB disk so I
>> can take my files around with me.  On a removable drive however I'd be more
>> interested in preserving a lot of daily / weekly backups.  I might even have
>> an archive drive that I just store monthly snapshots on.
>>
>> What would be really good would be a GUI that can estimate how much space
>> is going to be taken up for any configuration.  You could use the existing
>> snapshots on disk as a guide, and take an average size for each interval,
>> giving you average sizes for hourly, daily, weekly, monthly, etc...
>>
>
> ha ha, I almost blew coffee out my nose ;-)  I'm sure that once
> the forward time-slider functionality is implemented, it will be
> much easier to manage your storage utilization :-)  So, why am
> I giggling?  My wife just remembered that she hadn't taken her
> photos off the camera lately... 8 GByte SD cards are the vehicle
> of evil destined to wreck your capacity planning :-)

Haha, that's a great image, but I've got some food for thought even with this.

If you think about it, even though 8GB sounds like a lot, it's barely
over 1% of a 500GB drive, so it's not an unmanageable blip as far as
storage goes.

Also, if you're using the default settings for Tim's backups, you'll
be taking snapshots every 15 minutes, hour, day, week and month.  Now,
when you start you're not going to have any sensible averages for your
monthly snapshot sizes, but you're very rapidly going to get a set of
figures for your 15 minute snapshots.

What I would suggest is to use those to extrapolate forwards to give
very rough estimates of usage early on, with warnings as to how rough
these are.  In time these estimates will improve in accuracy, and your
8GB photo 'blip' should be relatively easily incorporated.

What you could maybe do is have a high and low estimate.

Re: [zfs-discuss] Using zfs mirror as a simple backup mechanism for time-slider.

2008-12-18 Thread Ross Smith
 I was thinking more something like:

  - find all disk devices and slices that have ZFS pools on them
  - show users the devices and pool names (and UUIDs and device paths in
  case of conflicts)..

>>>
>>> I was thinking that device & pool names are too variable, you need to
>>> be reading serial numbers or ID's from the device and link to that.
>>>
>>
>> Device names are, but there's no harm in showing them if there's
>> something else that's less variable.  Pool names are not very variable
>> at all.
>>
>
> I was thinking of something a little different.  Don't worry about
> devices, because you don't send to a device (rather, send to a pool).
> So a simple list of source file systems and a list of destinations
> would do.  I suppose you could work up something with pictures
> and arrows, like Nautilus, but that might just be more confusing
> than useful.

True, but if this is an end user service, you want something that can
create the filesystem for them on their devices.  An advanced mode
that lets you pick any destination filesystem would be good for
network admins, but for end users they're just going to want to point
this at their USB drive.

> But that is the easy part.  The hard part is dealing with the plethora
> of failure modes...
> -- richard

Heh, my response to this is who cares? :-D

This is a high-level service; it's purely concerned with "backup
succeeded" or "backup failed", possibly with an "overdue for backup"
prompt if you want to help the user manage the backups.

Any other failure modes can be dealt with by the lower level services
or by the user.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Using zfs mirror as a simple backup mechanism for time-slider.

2008-12-18 Thread Ross Smith
>> Of course, you'll need some settings for this so it's not annoying if
>> people don't want to use it.  A simple tick box on that pop up dialog
>> allowing people to say "don't ask me again" would probably do.
>
> I would like something better than that.  "Don't ask me again" sucks
> when much, much later you want to be asked and you don't know how to get the 
> system to ask you.

Only if your UI design doesn't make it easy to discover how to add
devices another way, or turn this setting back on.

My thinking is that this actually won't be the primary way of adding
devices.  It's simply there for ease of use for end users, as an easy
way for them to discover that they can use external drives to back up
their system.

Once you have a backup drive configured, most of the time you're not
going to want to be prompted for other devices.  Users will generally
setup a single external drive for backups, and won't want prompting
every time they insert a USB thumb drive, a digital camera, phone,
etc.

So you need that initial prompt to make the feature discoverable, and
then an easy and obvious way to configure backup devices later.

>> You'd then need a second way to assign drives if the user changed
>> their mind.  I'm thinking this would be to load the software and
>> select a drive.  Mapping to physical slots would be tricky, I think
>> you'd be better with a simple view that simply names the type of
>> interface, the drive size, and shows any current disk labels.  It
>> would be relatively easy then to recognise the 80GB USB drive you've
>> just connected.
>
> Right, so do as I suggested: tell the user to remove the device if it's
> plugged in, then plug it in again.  That way you can known unambiguously
> (unless the user is doing this with more than one device at a time).

That's horrible from a user's point of view though.  Possibly worth
having as a last resort, but I'd rather just let the user pick the
device.  This does have potential as a "help me find my device"
feature though.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Using zfs mirror as a simple backup mechanism for time-slider.

2008-12-18 Thread Ross Smith
On Thu, Dec 18, 2008 at 7:11 PM, Nicolas Williams
 wrote:
> On Thu, Dec 18, 2008 at 07:05:44PM +0000, Ross Smith wrote:
>> > Absolutely.
>> >
>> > The tool shouldn't need to know that the backup disk is accessed via
>> > USB, or whatever.  The GUI should, however, present devices
>> > intelligently, not as cXtYdZ!
>>
>> Yup, and that's easily achieved by simply prompting for a user
>> friendly name as devices are attached.  Now you could store that
>> locally, but it would be relatively easy to drop an XML configuration
>> file on the device too, allowing the same friendly name to be shown
>> wherever it's connected.
>
> I was thinking more something like:
>
>  - find all disk devices and slices that have ZFS pools on them
>  - show users the devices and pool names (and UUIDs and device paths in
>   case of conflicts)..

I was thinking that device & pool names are too variable; you need to
be reading serial numbers or IDs from the device and linking to those.

>  - let the user pick one.
>
>  - in the case that the user wants to initialize a drive to be a backup
>   you need something more complex.
>
>- one possibility is to tell the user when to attach the desired
>  backup device, in which case the GUI can detect the addition and
>  then it knows that that's the device to use (but be careful to
>  check that the user also owns the device so that you don't pick
>  the wrong one on multi-seat systems)
>
>- another is to be much smarter about mapping topology to physical
>  slots and present a picture to the user that makes sense to the
>  user, so the user can click on the device they want.  This is much
>  harder.

I was actually thinking of a resident service.  Tim's autobackup
script was capable of firing off backups when it detected the
insertion of a USB drive, and if you've got something sitting there
monitoring drive insertions you could have it prompt the user when new
drives are detected, asking if they should be used for backups.

Of course, you'll need some settings for this so it's not annoying if
people don't want to use it.  A simple tick box on that pop up dialog
allowing people to say "don't ask me again" would probably do.

You'd then need a second way to assign drives if the user changed
their mind.  I'm thinking this would be to load the software and
select a drive.  Mapping to physical slots would be tricky, I think
you'd be better with a simple view that simply names the type of
interface, the drive size, and shows any current disk labels.  It
would be relatively easy then to recognise the 80GB USB drive you've
just connected.

Also, because you're formatting these drives as ZFS, you're not
restricted to just storing your backups on them.  You can create a
root pool (to contain the XML files, etc), and the backups can then be
saved to a filesystem within that.

That means the drive then functions as both a removable drive, and as
a full backup for your system.
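
Roughly the layout I have in mind, assuming the USB disk shows up as c5t0d0
(a made-up device name):

# zpool create usbbackup c5t0d0
# zfs create usbbackup/config       <- XML / friendly-name file lives here
# zfs create usbbackup/snapshots    <- received backups get stored under here

leaving the top level free to use as an ordinary removable drive.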
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Using zfs mirror as a simple backup mechanism for time-slider.

2008-12-18 Thread Ross Smith
> Absolutely.
>
> The tool shouldn't need to know that the backup disk is accessed via
> USB, or whatever.  The GUI should, however, present devices
> intelligently, not as cXtYdZ!

Yup, and that's easily achieved by simply prompting for a user
friendly name as devices are attached.  Now you could store that
locally, but it would be relatively easy to drop an XML configuration
file on the device too, allowing the same friendly name to be shown
wherever it's connected.

And this is sounding more and more like something I was thinking of
developing myself.  A proper Sun version would be much better though
(not least because I've never developed anything for Solaris!).
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Need Help Invalidating Uberblock

2008-12-16 Thread Ross Smith
It sounds to me like there are several potentially valid filesystem
uberblocks available - am I understanding this right?

1. There are four copies of the current uberblock.  Any one of these
should be enough to load your pool with no data loss.

2. There are also a few (would love to know how many) previous
uberblocks which will point to a consistent filesystem, but with some
data loss.

3. Failing that, the system could be rolled back to any snapshot
uberblock.  Any data saved since that snapshot will be lost.

Is there any chance at all of automated tools that can take advantage
of all of these for pool recovery?
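
Even a single import flag would go a long way.  I'm imagining something
along these lines (purely hypothetical syntax, nothing like this exists in
the builds we're running as far as I know):

# zpool import -F tank    <- step back through older uberblocks until one imports cleanly

with a report afterwards of how much data had to be rolled back.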




On Tue, Dec 16, 2008 at 11:55 AM, Johan Hartzenberg  wrote:
>
> On Tue, Dec 16, 2008 at 1:43 PM,  wrote:
>>
>>
>> >When current uber-block A is detected to point to a corrupted on-disk
>> > data,
>> >how would "zpool import" (or any other tool for that matter) quickly and
>> >safely know that, once it found an older uber-block "B" that it points to
>> > a
>> >set of blocks which does not include any blocks that has since been freed
>> >and re-allocated and, thus, corrupted?  Eg, without scanning the entire
>> >on-disk structure?
>>
>> Without a scrub, you mean?
>>
>> Not possible, except the first few uberblocks (blocks aren't used until a
>> few uberblocks later)
>>
>> Casper
>
> Does that mean that each of the last "few-minus-1" uberblocks point to a
> consistent version of the file system? Does "few" have a definition?
>
>
>
> --
> Any sufficiently advanced technology is indistinguishable from magic.
>Arthur C. Clarke
>
> My blog: http://initialprogramload.blogspot.com
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Split responsibility for data with ZFS

2008-12-15 Thread Ross Smith
I'm not sure I follow how that can happen - I thought ZFS writes were
designed to be atomic?  They either commit properly on disk or they
don't?


On Mon, Dec 15, 2008 at 6:34 PM, Bob Friesenhahn
 wrote:
> On Mon, 15 Dec 2008, Ross wrote:
>
>> My concern is that ZFS has all this information on disk, it has the
>> ability to know exactly what is and isn't corrupted, and it should (at least
>> for a system with snapshots) have many, many potential uberblocks to try.
>>  It should be far, far better than UFS at recovering from these things, but
>> for a certain class of faults, when it hits a problem it just stops dead.
>
> While ZFS knows if a data block is retrieved correctly from disk, a
> correctly retrieved data block does not indicate that the pool isn't
> "corrupted".  A block written in the wrong order is a form of corruption.
>
> Bob
> ==
> Bob Friesenhahn
> bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
>
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Split responsibility for data with ZFS

2008-12-15 Thread Ross Smith
Forgive me for not understanding the details, but couldn't you also
work backwards through the blocks with ZFS and attempt to recreate the
uberblock?

So if you lost the uberblock, could you (memory and time allowing)
start scanning the disk, looking for orphan blocks that aren't
referenced anywhere else and piece together the top of the tree?

Or roll back to a previous uberblock (or a snapshot uberblock), and
then look to see what blocks are on the disk but not referenced
anywhere.  Is there any way to intelligently work out where those
blocks would be linked by looking at how they interact with the known
data?

Of course, rolling back to a previous uberblock would still be a
massive step forward, and something I think would do much to improve
the perception of ZFS as a tool to reliably store data.

You cannot overstate the difference to the end user between a file
system that on boot says:
"Sorry, can't read your data pool."

With one that says:
"Whoops, the uberblock, and all the backups are borked.  Would you
like to roll back to a backup uberblock, or leave the filesystem
offline to repair manually?"

As much as anything else, a simple statement explaining *why* a pool
is inaccessible, and saying just how badly things have gone wrong
helps tons.  Being able to recover anything after that is just the
icing on the cake, especially if it can be done automatically.

Ross

PS.  Sorry for the duplicate Casper, I forgot to cc the list.



On Mon, Dec 15, 2008 at 10:30 AM,   wrote:
>
>>I think the problem for me is not that there's a risk of data loss if
>>a pool becomes corrupt, but that there are no recovery tools
>>available.  With UFS, people expect that if the worst happens, fsck
>>will be able to recover their data in most cases.
>
> Except, of course, that fsck lies.  In "fixes" the meta data and the
> quality of the rest is unknown.
>
> Anyone using UFS knows that UFS file corruption are common; specifically,
> when using a "UFS root" and the system panic's when trying to
> install a device driver, there's a good chance that some files in
> /etc are corrupt. Some were application problems (some code used
> fsync(fileno(fp)); fclose(fp); it doesn't guarantee anything)
>
>
>>With ZFS you have no such tools, yet Victor has on at least two occasions
>>shown that it's quite possible to recover pools that were completely unusable
>>(I believe by making use of old / backup copies of the uberblock).
>
> True; and certainly ZFS should be able backtrack.  But it's
> much more likely to happen "automatically" then using a recovery
> tool.
>
> See, fsck could only be written because specific corruption are known
> and the patterns they have.   With ZFS, you can only backup to
> a certain uberblock and the pattern will be a surprise.
>
> Casper
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs not yet suitable for HA applications?

2008-12-05 Thread Ross Smith
Hi Dan, replying in line:

On Fri, Dec 5, 2008 at 9:19 PM, David Anderson <[EMAIL PROTECTED]> wrote:
> Trying to keep this in the spotlight. Apologies for the lengthy post.

Heh, don't apologise, you should see some of my posts... o_0

> I'd really like to see features as described by Ross in his summary of the
> "Availability: ZFS needs to handle disk removal / driver failure better"
>  (http://www.opensolaris.org/jive/thread.jspa?messageID=274031 ).
> I'd like to have these/similar features as well. Has there already been
> internal discussions regarding adding this type of functionality to ZFS
> itself, and was there approval, disapproval or no decision?
>
> Unfortunately my situation has put me in urgent need to find workarounds in
> the meantime.
>
> My setup: I have two iSCSI target nodes, each with six drives exported via
> iscsi (Storage Nodes). I have a ZFS Node that logs into each target from
> both Storage Nodes and creates a mirrored Zpool with one drive from each
> Storage Node comprising each half of the mirrored vdevs (6 x 2-way mirrors).
>
> My problem: If a Storage Node crashes completely, is disconnected from the
> network, iscsitgt core dumps, a drive is pulled, or a drive has a problem
> accessing data (read retries), then my ZFS Node hangs while ZFS waits
> patiently for the layers below to report a problem and timeout the devices.
> This can lead to a roughly 3 minute or longer halt when reading OR writing
> to the Zpool on the ZFS node. While this is acceptable in certain
> situations, I have a case where my availability demand is more severe.
>
> My goal: figure out how to have the zpool pause for NO LONGER than 30
> seconds (roughly within a typical HTTP request timeout) and then issue
> reads/writes to the good devices in the zpool/mirrors while the other side
> comes back online or is fixed.
>
> My ideas:
>  1. In the case of the iscsi targets disappearing (iscsitgt core dump,
> Storage Node crash, Storage Node disconnected from network), I need to lower
> the iSCSI login retry/timeout values. Am I correct in assuming the only way
> to accomplish this is to recompile the iscsi initiator? If so, can someone
> help point me in the right direction (I have never compiled ONNV sources -
> do I need to do this or can I just recompile the iscsi initiator)?

I believe it's possible to just recompile the initiator and install
the new driver.  I have some *very* rough notes that were sent to me
about a year ago, but I've no experience compiling anything in
Solaris, so don't know how useful they will be.  I'll try to dig them
out in case they're useful.

>
>   1.a. I'm not sure in what Initiator session states iscsi_sess_max_delay is
> applicable - only for the initial login, or also in the case of reconnect?
> Ross, if you still have your test boxes available, can you please try
> setting "set iscsi:iscsi_sess_max_delay = 5" in /etc/system, reboot and try
> failing your iscsi vdevs again? I can't find a case where this was tested
> for quick failover.

Will gladly have a go at this on Monday.
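
(For my own notes, the test should just be a case of appending the line to
/etc/system, rebooting, and then checking it took effect with the same kind
of mdb one-liner we used for iscsi_rx_max_window - something along these
lines, assuming the symbol resolves the same way:

# echo 'set iscsi:iscsi_sess_max_delay = 5' >> /etc/system
# init 6
# echo iscsi_sess_max_delay/D | mdb -k

which should print 5 if the setting was picked up.)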

>1.b. I would much prefer to have bug 649 addressed and fixed rather
> than having to resort to recompiling the iscsi initiator (if
> iscsi_sess_max_delay doesn't work). This seems like a trivial feature to
> implement. How can I sponsor development?
>
>  2. In the case of the iscsi target being reachable, but the physical disk
> is having problems reading/writing data (retryable events that take roughly
> 60 seconds to timeout), should I change the iscsi_rx_max_window tunable with
> mdb? Is there a tunable for iscsi_tx? Ross, I know you tried this recently
> in the thread referenced above (with value 15), which resulted in a 60
> second hang. How did you offline the iscsi vol to test this failure? Unless
> iscsi uses a multiple of the value for retries, then maybe the way you
> failed the disk caused the iscsi system to follow a different failure path?
> Unfortunately I don't know of a way to introduce read/write retries to a
> disk while the disk is still reachable and presented via iscsitgt, so I'm
> not sure how to test this.

So far I've just been shutting down the Solaris box hosting the iSCSI
target.  Next step will involve pulling some virtual cables.
Unfortunately I don't think I've got a physical box handy to test
drive failures right now, but my previous testing (of simply pulling
drives) showed that it can be hit and miss as to how well ZFS detects
these types of 'failure'.

Like you I don't know yet how to simulate failures, so I'm doing
simple tests right now, offlining entire drives or computers.
Unfortunately I've found more than enough problems with just those
tests to keep me busy.


>2.a With the fix of
> http://bugs.opensolaris.org/view_bug.do?bug_id=6518995 , we can set
> sd_retry_count along with sd_io_time to cause I/O failure when a command
> takes longer than sd_retry_count * sd_io_time. Can (or should) these
> tunables be set on the imported iscsi disks in the ZFS Node?

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-03 Thread Ross Smith
Yeah, thanks Maurice, I just saw that one this afternoon.  I guess you
can't reboot with iscsi full stop... o_0

And I've seen the iscsi bug before (I was just too lazy to look it up
lol), I've been complaining about that since February.

In fact it's been a bad week for iscsi here, I've managed to crash the
iscsi client twice in the last couple of days too (full kernel dump
crashes), so I'll be filing a bug report on that tomorrow morning when
I get back to the office.

Ross


On Wed, Dec 3, 2008 at 7:39 PM, Maurice Volaski <[EMAIL PROTECTED]> wrote:
>> 2.  With iscsi, you can't reboot with sendtargets enabled, static
>> discovery still seems to be the order of the day.
>
> I'm seeing this problem with static discovery:
> http://bugs.opensolaris.org/view_bug.do?bug_id=6775008.
>
>> 4.  iSCSI still has a 3 minute timeout, during which time your pool will
>> hang, no matter how many redundant drives you have available.
>
> This is CR 649, http://bugs.opensolaris.org/view_bug.do?bug_id=649,
> which is separate from the boot time timeout, though, and also one that Sun
> so far has been unable to fix!
> --
>
> Maurice Volaski, [EMAIL PROTECTED]
> Computing Support, Rose F. Kennedy Center
> Albert Einstein College of Medicine of Yeshiva University
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-02 Thread Ross Smith
Hi Richard,

Thanks, I'll give that a try.  I think I just had a kernel dump while
trying to boot this system back up though, I don't think it likes it
if the iscsi targets aren't available during boot.  Again, that rings
a bell, so I'll go see if that's another known bug.

Changing that setting on the fly didn't seem to help, if anything
things are worse this time around.  I changed the timeout to 15
seconds, but didn't restart any services:

# echo iscsi_rx_max_window/D | mdb -k
iscsi_rx_max_window:
iscsi_rx_max_window:180
# echo iscsi_rx_max_window/W0t15 | mdb -kw
iscsi_rx_max_window:0xb4=   0xf
# echo iscsi_rx_max_window/D | mdb -k
iscsi_rx_max_window:
iscsi_rx_max_window:15

After making those changes, and repeating the test, offlining an iscsi
volume hung all the commands running on the pool.  I had three ssh
sessions open, running the following:
# zpool iostat -v iscsipool 10 100
# format < /dev/null
# time zpool status

They hung for what felt like a minute or so.
After that, the CIFS copy timed out.

After the CIFS copy timed out, I tried immediately restarting it.  It
took a few more seconds, but restarted no problem.  Within a few
seconds of that restarting, iostat recovered, and format returned its
result too.

Around 30 seconds later, zpool status reported two drives, paused
again, then showed the status of the third:

# time zpool status
  pool: iscsipool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 0h0m with 0 errors on Tue Dec  2 16:39:21 2008
config:

NAME   STATE READ WRITE CKSUM
iscsipool  ONLINE   0 0 0
  raidz1   ONLINE   0 0 0
c2t600144F04933FF6C5056967AC800d0  ONLINE   0
0 0  15K resilvered
c2t600144F04934FAB35056964D9500d0  ONLINE   0
0 0  15K resilvered
c2t600144F04934119E50569675FF00d0  ONLINE   0
200 0  24K resilvered

errors: No known data errors

real3m51.774s
user0m0.015s
sys 0m0.100s

Repeating that a few seconds later gives:

# time zpool status
  pool: iscsipool
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: resilver completed after 0h0m with 0 errors on Tue Dec  2 16:39:21 2008
config:

NAME   STATE READ WRITE CKSUM
iscsipool  DEGRADED 0 0 0
  raidz1   DEGRADED 0 0 0
c2t600144F04933FF6C5056967AC800d0  ONLINE   0
0 0  15K resilvered
c2t600144F04934FAB35056964D9500d0  ONLINE   0
0 0  15K resilvered
c2t600144F04934119E50569675FF00d0  UNAVAIL  3
5.80K 0  cannot open

errors: No known data errors

real0m0.272s
user0m0.029s
sys 0m0.169s




On Tue, Dec 2, 2008 at 3:58 PM, Richard Elling <[EMAIL PROTECTED]> wrote:

..

> iSCSI timeout is set to 180 seconds in the client code.  The only way
> to change is to recompile it, or use mdb.  Since you have this test rig
> setup, and I don't, do you want to experiment with this timeout?
> The variable is actually called "iscsi_rx_max_window" so if you do
>   echo iscsi_rx_max_window/D | mdb -k
> you should see "180"
> Change it using something like:
>   echo iscsi_rx_max_window/W0t30 | mdb -kw
> to set it to 30 seconds.
> -- richard
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-12-02 Thread Ross Smith
Hey folks,

I've just followed up on this, testing iSCSI with a raided pool, and
it still appears to be struggling when a device goes offline.

>>> I don't see how this could work except for mirrored pools.  Would that
>>> carry enough market to be worthwhile?
>>> -- richard
>>>
>>
>> I have to admit, I've not tested this with a raided pool, but since
>> all ZFS commands hung when my iSCSI device went offline, I assumed
>> that you would get the same effect of the pool hanging if a raid-z2
>> pool is waiting for a response from a device.  Mirrored pools do work
>> particularly well with this since it gives you the potential to have
>> remote mirrors of your data, but if you had a raid-z2 pool, you still
>> wouldn't want that hanging if a single device failed.
>>
>
> zpool commands hanging is CR6667208, and has been fixed in b100.
> http://bugs.opensolaris.org/view_bug.do?bug_id=6667208
>
>> I will go and test the raid scenario though on a current build, just to be
>> sure.
>>
>
> Please.
> -- richard


I've just created a pool using three snv_103 iscsi Targets, with a
fourth install of snv_103 collating those targets into a raidz pool,
and sharing that out over CIFS.

To test the server, while transferring files from a windows
workstation, I powered down one of the three iSCSI targets.  It took a
few minutes to shutdown, but once that happened the windows copy
halted with the error:
"The specified network name is no longer available."

At this point, the zfs admin tools still work fine (which is a huge
improvement, well done!), but zpool status still reports that all
three devices are online.

A minute later, I can open the share again, and start another copy.

Thirty seconds after that, zpool status finally reports that the iscsi
device is offline.

So it looks like we have the same problems with that 3 minute delay,
with zpool status reporting wrong information, and the CIFS service
having problems too.

At this point I restarted the iSCSI target, but had problems bringing
it back online.  It appears there's a bug in the initiator, but it's
easily worked around:
http://www.opensolaris.org/jive/thread.jspa?messageID=312981

What was great was that as soon as the iSCSI initiator reconnected,
ZFS started resilvering.

What might not be so great is the fact that all three devices are
showing that they've been resilvered:

# zpool status
  pool: iscsipool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 0h2m with 0 errors on Tue Dec  2 11:04:10 2008
config:

NAME   STATE READ WRITE CKSUM
iscsipool  ONLINE   0 0 0
  raidz1   ONLINE   0 0 0
c2t600144F04933FF6C5056967AC800d0  ONLINE   0
0 0  179K resilvered
c2t600144F04934FAB35056964D9500d0  ONLINE   5
9.88K 0  311M resilvered
c2t600144F04934119E50569675FF00d0  ONLINE   0
0 0  179K resilvered

errors: No known data errors

It's proving a little hard to know exactly what's happening when,
since I've only got a few seconds to log times, and there are delays
with each step.  However, I ran another test using robocopy and was
able to observe the behaviour a little more closely:

Test 2:  Using robocopy for the transfer, and iostat plus zpool status
on the server

10:46:30 - iSCSI server shutdown started
10:52:20 - all drives still online according to zpool status
10:53:30 - robocopy error - "The specified network name is no longer available"
 - zpool status shows all three drives as online
 - zpool iostat appears to have hung, taking much longer than the 30s
specified to return a result
 - robocopy is now retrying the file, but appears to have hung
10:54:30 - robocopy, CIFS and iostat all start working again, pretty
much simultaneously
 - zpool status now shows the drive as offline

I could probably do with using DTrace to get a better look at this,
but I haven't learnt that yet.  My guess as to what's happening would
be:

- iSCSI target goes offline
- ZFS will not be notified for 3 minutes, but I/O to that device is
essentially hung
- CIFS times out (I suspect this is on the client side with around a
30s timeout, but I can't find the timeout documented anywhere).
- zpool iostat is now waiting, I may be wrong but this doesn't appear
to have benefited from the changes to zpool status
- After 3 minutes, the iSCSI drive goes offline.  The pool carries on
with the remaining two drives, CIFS carries on working, iostat carries
on working.  "zpool status" however is still out of date.
- zpool status eventually catches up

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-11-27 Thread Ross Smith
On Fri, Nov 28, 2008 at 5:05 AM, Richard Elling <[EMAIL PROTECTED]> wrote:
> Ross wrote:
>>
>> Well, you're not alone in wanting to use ZFS and iSCSI like that, and in
>> fact my change request suggested that this is exactly one of the things that
>> could be addressed:
>>
>> "The idea is really a two stage RFE, since just the first part would have
>> benefits.  The key is to improve ZFS availability, without affecting it's
>> flexibility, bringing it on par with traditional raid controllers.
>>
>> A.  Track response times, allowing for lop sided mirrors, and better
>> failure detection.
>
> I've never seen a study which shows, categorically, that disk or network
> failures are preceded by significant latency changes.  How do we get
> "better failure detection" from such measurements?

Not preceded by as such, but a disk or network failure will certainly
cause significant latency changes.  If the hardware is down, there's
going to be a sudden, and very large change in latency.  Sure, FMA
will catch most cases, but we've already shown that there are some
cases where it doesn't work too well (and I would argue that's always
going to be possible when you are relying on so many different types
of driver).  This is there to ensure that ZFS can handle *all* cases.


>>  Many people have requested this since it would facilitate remote live
>> mirrors.
>>
>
> At a minimum, something like VxVM's preferred plex should be reasonably
> easy to implement.
>
>> B.  Use response times to timeout devices, dropping them to an interim
>> failure mode while waiting for the official result from the driver.  This
>> would prevent redundant pools hanging when waiting for a single device."
>>
>
> I don't see how this could work except for mirrored pools.  Would that
> carry enough market to be worthwhile?
> -- richard

I have to admit, I've not tested this with a raided pool, but since
all ZFS commands hung when my iSCSI device went offline, I assumed
that you would get the same effect of the pool hanging if a raid-z2
pool is waiting for a response from a device.  Mirrored pools do work
particularly well with this since it gives you the potential to have
remote mirrors of your data, but if you had a raid-z2 pool, you still
wouldn't want that hanging if a single device failed.

I will go and test the raid scenario though on a current build, just to be sure.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-25 Thread Ross Smith
I disagree Bob, I think this is a very different function to that
which FMA provides.

As far as I know, FMA doesn't have access to the big picture of pool
configuration that ZFS has, so why shouldn't ZFS use that information
to increase the reliability of the pool while still using FMA to
handle device failures?

The flip side of the argument is that ZFS already checks the data
returned by the hardware.  You might as well say that FMA should deal
with that too since it's responsible for all hardware failures.

The role of ZFS is to manage the pool; availability should be part and
parcel of that.


On Tue, Nov 25, 2008 at 3:57 PM, Bob Friesenhahn
<[EMAIL PROTECTED]> wrote:
> On Tue, 25 Nov 2008, Ross Smith wrote:
>>
>> Good to hear there's work going on to address this.
>>
>> What did you guys think to my idea of ZFS supporting a "waiting for a
>> response" status for disks as an interim solution that allows the pool
>> to continue operation while it's waiting for FMA or the driver to
>> fault the drive?
>
> A stable and sane system never comes with "two brains".  It is wrong to put
> this sort of logic into ZFS when ZFS is already depending on FMA to make the
> decisions and Solaris already has an infrastructure to handle faults.  The
> more appropriate solution is that this feature should be in FMA.
>
> Bob
> ==
> Bob Friesenhahn
> [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
>
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-25 Thread Ross Smith
> The shortcomings of timeouts have been discussed on this list before. How do
> you tell the difference between a drive that is dead and a path that is just
> highly loaded?

A path that is dead is either returning bad data, or isn't returning
anything.  A highly loaded path is by definition reading & writing
lots of data.  I think you're assuming that these are file level
timeouts, when this would actually need to be much lower level.


> Sounds good - devil, meet details, etc.

Yup, I imagine there are going to be a few details to iron out, many
of which will need looking at by somebody a lot more technical than
myself.

Despite that I still think this is a discussion worth having.  So far
I don't think I've seen any situation where this would make things
worse than they are now, and I can think of plenty of cases where it
would be a huge improvement.

Of course, it also probably means a huge amount of work to implement.
I'm just hoping that it's not prohibitively difficult, and that the
ZFS team see the benefits as being worth it.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-25 Thread Ross Smith
Hmm, true.  The idea doesn't work so well if you have a lot of writes,
so there needs to be some thought as to how you handle that.

Just thinking aloud, could the missing writes be written to the log
file on the rest of the pool?  Or temporarily stored somewhere else in
the pool?  Would it be an option to allow up to a certain amount of
writes to be cached in this way while waiting for FMA, and only
suspend writes once that cache is full?

With a large SSD slog device would it be possible to just stream all
writes to the log?  As a further enhancement, might it be possible to
commit writes to the working drives, and just leave the writes for the
bad drive(s) in the slog (potentially saving a lot of space)?

For pools without log devices, I suspect that you would probably need
the administrator to specify the behavior as I can see several options
depending on the raid level and that pools priorities for data
availability / integrity:

Drive fault write cache settings:
default - pool waits for device, no writes occur until device or spare
comes online
slog - writes are cached to slog device until full, then pool reverts
to default behavior (could this be the default with slog devices
present?)
pool - writes are cached to the pool itself, up to a set maximum, and
are written to the device or spare as soon as possible.  This assumes
a single parity pool with the other devices available.  If the upper
limit is reached, or another devices goes faulty, pool reverts to
default behaviour.

Storing directly to the rest of the pool would probably want to be off
by default on single parity pools, but I would imagine that it could
be on by default on dual parity pools.

Would that be enough to allow writes to continue in most circumstances
while the pool waits for FMA?

Ross



On Tue, Nov 25, 2008 at 10:55 AM,  <[EMAIL PROTECTED]> wrote:
>
>
>>My idea is simply to allow the pool to continue operation while
>>waiting for the drive to fault, even if that's a faulty write.  It
>>just means that the rest of the operations (reads and writes) can keep
>>working for the minute (or three) it takes for FMA and the rest of the
>>chain to flag a device as faulty.
>
> Except when you're writing a lot; 3 minutes can cause a 20GB backlog
> for a single disk.
>
> Casper
>
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-25 Thread Ross Smith
No, I count that as "doesn't return data ok", but my post wasn't very
clear at all on that.

Even for a write, the disk will return something to indicate that the
action has completed, so that can also be covered by just those two
scenarios, and right now ZFS can lock the whole pool up if it's
waiting for that response.

My idea is simply to allow the pool to continue operation while
waiting for the drive to fault, even if that's a faulty write.  It
just means that the rest of the operations (reads and writes) can keep
working for the minute (or three) it takes for FMA and the rest of the
chain to flag a device as faulty.

For write operations, the data can be safely committed to the rest of
the pool, with just the outstanding writes for the drive left waiting.
 Then as soon as the device is faulted, the hot spare can kick in, and
the outstanding writes quickly written to the spare.

For single parity, or non redundant volumes there's some benefit in
this.  For dual parity pools there's a massive benefit as your pool
stays available, and your data is still well protected.

Ross



On Tue, Nov 25, 2008 at 10:44 AM,  <[EMAIL PROTECTED]> wrote:
>
>
>>My justification for this is that it seems to me that you can split
>>disk behavior into two states:
>>- returns data ok
>>- doesn't return data ok
>
>
> I think you're missing "won't write".
>
> There's clearly a difference between "get data from a different copy"
> which you can fix by retrying data to a different part of the redundant
> data and writing data: the data which can't be written must be kept
> until the drive is faulted.
>
>
> Casper
>
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-25 Thread Ross Smith
PS.  I think this also gives you a chance at making the whole problem
much simpler.  Instead of the hard question of "is this faulty",
you're just trying to say "is it working right now?".

In fact, I'm now wondering if the "waiting for a response" flag
wouldn't be better as "possibly faulty".  That way you could use it
with checksum errors too, possibly with settings as simple as "errors
per minute" or "error percentage".  As with the timeouts, you could
have it off by default (or provide sensible defaults), and let
administrators tweak it for their particular needs.

Imagine a pool with the following settings:
- zfs-auto-device-timeout = 5s
- zfs-auto-device-checksum-fail-limit-epm = 20
- zfs-auto-device-checksum-fail-limit-percent = 10
- zfs-auto-device-fail-delay = 120s

That would allow the pool to flag a device as possibly faulty
regardless of the type of fault, and take immediate proactive action
to safeguard data (generally long before the device is actually
faulted).

A device triggering any of these flags would be enough for ZFS to
start reading from (or writing to) other devices first, and should you
get multiple failures, or problems on a non redundant pool, you always
just revert back to ZFS' current behaviour.
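
Purely to illustrate how I'd imagine an admin using it (none of these
properties exist today - the names above are just my suggestion), it might
look something like:

# zpool set zfs-auto-device-timeout=5s tank
# zpool set zfs-auto-device-fail-delay=120s tank

with everything left off by default so current behaviour doesn't change.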

Ross





On Tue, Nov 25, 2008 at 8:37 AM, Jeff Bonwick <[EMAIL PROTECTED]> wrote:
> I think we (the ZFS team) all generally agree with you.  The current
> nevada code is much better at handling device failures than it was
> just a few months ago.  And there are additional changes that were
> made for the FishWorks (a.k.a. Amber Road, a.k.a. Sun Storage 7000)
> product line that will make things even better once the FishWorks team
> has a chance to catch its breath and integrate those changes into nevada.
> And then we've got further improvements in the pipeline.
>
> The reason this is all so much harder than it sounds is that we're
> trying to provide increasingly optimal behavior given a collection of
> devices whose failure modes are largely ill-defined.  (Is the disk
> dead or just slow?  Gone or just temporarily disconnected?  Does this
> burst of bad sectors indicate catastrophic failure, or just localized
> media errors?)  The disks' SMART data is notoriously unreliable, BTW.
> So there's a lot of work underway to model the physical topology of
> the hardware, gather telemetry from the devices, the enclosures,
> the environmental sensors etc, so that we can generate an accurate
> FMA fault diagnosis and then tell ZFS to take appropriate action.
>
> We have some of this today; it's just a lot of work to complete it.
>
> Oh, and regarding the original post -- as several readers correctly
> surmised, we weren't faking anything, we just didn't want to wait
> for all the device timeouts.  Because the disks were on USB, which
> is a hotplug-capable bus, unplugging the dead disk generated an
> interrupt that bypassed the timeout.  We could have waited it out,
> but 60 seconds is an eternity on stage.
>
> Jeff
>
> On Mon, Nov 24, 2008 at 10:45:18PM -0800, Ross wrote:
>> But that's exactly the problem Richard:  AFAIK.
>>
>> Can you state that absolutely, categorically, there is no failure mode out 
>> there (caused by hardware faults, or bad drivers) that won't lock a drive up 
>> for hours?  You can't, obviously, which is why we keep saying that ZFS 
>> should have this kind of timeout feature.
>>
>> For once I agree with Miles, I think he's written a really good writeup of 
>> the problem here.  My simple view on it would be this:
>>
>> Drives are only aware of themselves as an individual entity.  Their job is 
>> to save & restore data to themselves, and drivers are written to minimise 
>> any chance of data loss.  So when a drive starts to fail, it makes complete 
>> sense for the driver and hardware to be very, very thorough about trying to 
>> read or write that data, and to only fail as a last resort.
>>
>> I'm not at all surprised that drives take 30 seconds to timeout, nor that 
>> they could slow a pool for hours.  That's their job.  They know nothing else 
>> about the storage, they just have to do their level best to do as they're 
>> told, and will only fail if they absolutely can't store the data.
>>
>> The raid controller on the other hand (Netapp / ZFS, etc) knows all about 
>> the pool.  It knows if you have half a dozen good drives online, it knows if 
>> there are hot spares available, and it *should* also know how quickly the 
>> drives under its care usually respond to requests.
>>
>> ZFS is perfectly placed to spot when a drive is starting to fail, and to 
>> take the appropriate action to safeguard your data.  It has far more 
>> information available than a single drive ever will, and should be designed 
>> accordingly.
>>
>> Expecting the firmware and drivers of individual drives to control the 
>> failure modes of your redundant pool is just crazy imo.  You're throwing 
>> away some of the biggest benefits of using multiple drives in the first 
>> place.
>> --
>

Re: [zfs-discuss] "ZFS, Smashing Baby" a fake???

2008-11-25 Thread Ross Smith
Hey Jeff,

Good to hear there's work going on to address this.

What did you guys think to my idea of ZFS supporting a "waiting for a
response" status for disks as an interim solution that allows the pool
to continue operation while it's waiting for FMA or the driver to
fault the drive?

I do appreciate that it's hard to come up with a definative "it's dead
Jim" answer, and I agree that long term the FMA approach will pay
dividends.  But I still feel this is a good short term solution, and
one that would also complement your long term plans.

My justification for this is that it seems to me that you can split
disk behavior into two states:
- returns data ok
- doesn't return data ok

And for the state where it's not returning data, you can again split
that in two:
- returns wrong data
- doesn't return data

The first of these is already covered by ZFS with its checksums (with
FMA doing the extra work to fault drives), so it's just the second
that needs immediate attention, and for the life of me I can't think
of any situation that a simple timeout wouldn't catch.

Personally I'd love to see two parameters, allowing this behavior to
be turned on if desired, and allowing timeouts to be configured:

zfs-auto-device-timeout
zfs-auto-device-timeout-fail-delay

The first sets whether to use this feature, and configures the maximum
time ZFS will wait for a response from a device before putting it in a
"waiting" status.  The second would be optional and is the maximum
time ZFS will wait before faulting a device (at which point it's
replaced by a hot spare).

The reason I think this will work well with the FMA work is that you
can implement this now and have a real improvement in ZFS
availability.  Then, as the other work starts bringing better modeling
for drive timeouts, the parameters can be either removed, or set
automatically by ZFS.

Long term I guess there's also the potential to remove the second
setting if you felt FMA etc ever got reliable enough, but personally I
would always want to have the final fail delay set.  I'd maybe set it
to a long value such as 1-2 minutes to give FMA, etc a fair chance to
find the fault.  But I'd be much happier knowing that the system will
*always* be able to replace a faulty device within a minute or two, no
matter what the FMA system finds.

The key thing is that you're not faulting devices early, so FMA is
still vital.  The idea is purely to let ZFS to keep the pool active by
removing the need for the entire pool to wait on the FMA diagnosis.

As I said before, the driver and firmware are only aware of a single
disk, and I would imagine that FMA also has the same limitation - it's
only going to be looking at a single item and trying to determine
whether it's faulty or not.  Because of that, FMA is going to be
designed to be very careful to avoid false positives, and will likely
take its time to reach an answer in some situations.

ZFS however has the benefit of knowing more about the pool, and in the
vast majority of situations, it should be possible for ZFS to read or
write from other devices while it's waiting for an 'official' result
from any one faulty component.

Ross


On Tue, Nov 25, 2008 at 8:37 AM, Jeff Bonwick <[EMAIL PROTECTED]> wrote:
> I think we (the ZFS team) all generally agree with you.  The current
> nevada code is much better at handling device failures than it was
> just a few months ago.  And there are additional changes that were
> made for the FishWorks (a.k.a. Amber Road, a.k.a. Sun Storage 7000)
> product line that will make things even better once the FishWorks team
> has a chance to catch its breath and integrate those changes into nevada.
> And then we've got further improvements in the pipeline.
>
> The reason this is all so much harder than it sounds is that we're
> trying to provide increasingly optimal behavior given a collection of
> devices whose failure modes are largely ill-defined.  (Is the disk
> dead or just slow?  Gone or just temporarily disconnected?  Does this
> burst of bad sectors indicate catastrophic failure, or just localized
> media errors?)  The disks' SMART data is notoriously unreliable, BTW.
> So there's a lot of work underway to model the physical topology of
> the hardware, gather telemetry from the devices, the enclosures,
> the environmental sensors etc, so that we can generate an accurate
> FMA fault diagnosis and then tell ZFS to take appropriate action.
>
> We have some of this today; it's just a lot of work to complete it.
>
> Oh, and regarding the original post -- as several readers correctly
> surmised, we weren't faking anything, we just didn't want to wait
> for all the device timeouts.  Because the disks were on USB, which
> is a hotplug-capable bus, unplugging the dead disk generated an
> interrupt that bypassed the timeout.  We could have waited it out,
> but 60 seconds is an eternity on stage.
>
> Jeff
>
> On Mon, Nov 24, 2008 at 10:45:18PM -0800, Ross wrote:
>> But that's exactly the problem Richard:  AFAIK.

Re: [zfs-discuss] questions on zfs send,receive,backups

2008-11-03 Thread Ross Smith
>> If the file still existed, would this be a case of redirecting the
>> file's top level block (dnode?) to the one from the snapshot?  If the
>> file had been deleted, could you just copy that one block?
>>
>> Is it that simple, or is there a level of interaction between files
>> and snapshots that I've missed (I've glanced through the tech specs,
>> but I'm a long way from fully understanding them).
>>
>
> It is as simple as a cp, or drag-n-drop in Nautilus.  The snapshot is
> read-only, so
> there is no need to cp, as long as you don't want to modify it or destroy
> the snapshot.
> -- richard

But that's missing the point here, which was that we want to restore
this file without having to copy the entire thing back.

Doing a cp or a drag-n-drop creates a new copy of the file, taking
time to restore, and allocating extra blocks.  Not a problem for small
files, but not ideal if you're say using ZFS to store virtual
machines, and want to roll back a single 20GB file from a 400GB
filesystem.
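
(To be clear, the cp Richard describes is easy enough - something like this,
with made-up paths, pulling straight from the snapshot's hidden .zfs
directory:

cp /tank/vms/.zfs/snapshot/nightly/guest1.vmdk /tank/vms/guest1.vmdk

- but that still re-writes every block of a 20GB file, which is exactly the
cost I'm hoping to avoid.)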

My question was whether it's technically feasible to roll back a
single file using the approach used for restoring snapshots, making it
an almost instantaneous operation?

ie:  If a snapshot exists that contains the file you want, you know
that all the relevant blocks are already on disk.  You don't want to
copy all of the blocks, but since ZFS follows a tree structure,
couldn't you restore the file by just restoring the one master block
for that file?

I'm just thinking that if it's technically feasible, I might raise an
RFE for this.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] questions on zfs send,receive,backups

2008-11-03 Thread Ross Smith
> Snapshots are not replacements for traditional backup/restore features.
> If you need the latter, use what is currently available on the market.
> -- richard

I'd actually say snapshots do a better job in some circumstances.
Certainly they're being used that way by the desktop team:
http://blogs.sun.com/erwann/entry/zfs_on_the_desktop_zfs

None of this is stuff I'm after personally btw.  This was just my
attempt to interpret the request of the OP.

Although having said that, the ability to restore single files as fast
as you can restore a whole snapshot would be a nice feature.  Is that
something that would be possible?

Say you had a ZFS filesystem containing a 20GB file, with a recent
snapshot.  Is it technically feasible to restore that file by itself
in the same way a whole filesystem is rolled back with "zfs rollback"?
If the file still existed, would this be a case of redirecting the
file's top level block (dnode?) to the one from the snapshot?  If the
file had been deleted, could you just copy that one block?

Is it that simple, or is there a level of interaction between files
and snapshots that I've missed (I've glanced through the tech specs,
but I'm a long way from fully understanding them).

Ross
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] questions on zfs send,receive,backups

2008-11-03 Thread Ross Smith
Hi Darren,

That's storing a dump of a snapshot on external media, but files
within it are not directly accessible.  The work Tim et al. are doing
is actually putting a live ZFS filesystem on external media and
sending snapshots to it.

A live ZFS filesystem is far more useful (and reliable) than a dump,
and having the ability to restore individual files from that would be
even better.
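
As a rough sketch of that workflow (pool and dataset names invented), it's
essentially:

# zpool create backup c5t0d0
# zfs snapshot tank/data@2008-11-03
# zfs send tank/data@2008-11-03 | zfs receive backup/data

with later runs using "zfs send -i" so only the changes go to the external
disk, and the backup pool staying a live, browsable filesystem you can scrub.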

It still doesn't help the OP, but I think that's what he was after.

Ross



On Mon, Nov 3, 2008 at 9:55 AM, Darren J Moffat <[EMAIL PROTECTED]> wrote:
> Ross wrote:
>>
>> Ok, I see where you're coming from now, but what you're talking about
>> isn't zfs send / receive.  If I'm interpreting correctly, you're talking
>> about a couple of features, neither of which is in ZFS yet, and I'd need the
>> input of more technical people to know if they are possible.
>>
>> 1.  The ability to restore individual files from a snapshot, in the same
>> way an entire snapshot is restored - simply using the blocks that are
>> already stored.
>>
>> 2.  The ability to store (and restore from) snapshots on external media.
>
> What makes you say this doesn't work ?  Exactly what do you mean here
> because this will work:
>
>$ zfs send [EMAIL PROTECTED] | dd of=/dev/tape
>
> Sure it might not be useful and I don't think that is what you mean here  so
> can you expand on "sotre snapshots on external media.
>
> --
> Darren J Moffat
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Disabling COMMIT at NFS level, or disabling ZIL on a per-filesystem basis

2008-10-23 Thread Ross Smith
No problem.  I didn't use mirrored slogs myself, but that's certainly
a step up for reliability.

It's pretty easy to create a boot script to re-create the ramdisk and
re-attach it to the pool too.  So long as you use the same device name
for the ramdisk you can add it each time with a simple "zpool replace
pool ramdisk"


On Thu, Oct 23, 2008 at 1:56 PM, Constantin Gonzalez
<[EMAIL PROTECTED]> wrote:
> Hi,
>
> yes, using slogs is the best solution.
>
> Meanwhile, using mirrored slogs from other servers' RAM-Disks running on
> UPSs
> seem like an interesting idea, if the reliability of UPS-backed RAM is
> deemed
> reliable enough for the purposes of the NFS server.
>
> Thanks for siggesting this!
>
> Cheers,
>   Constantin
>
> Ross wrote:
>>
>> Well, it might be even more of a bodge than disabling the ZIL, but how
>> about:
>>
>> - Create a 512MB ramdisk, use that for the ZIL
>> - Buy a Micro Memory nvram PCI card for £100 or so.
>> - Wait 3-6 months, hopefully buy a fully supported PCI-e SSD to replace
>> the Micro Memory card.
>>
>> The ramdisk isn't an ideal solution, but provided you don't export the
>> pool with it offline, it does work.  We used it as a stop gap solution for a
>> couple of weeks while waiting for a Micro Memory nvram card.
>>
>> Our reasoning was that our server's on a UPS and we figured if something
>> crashed badly enough to take out something like the UPS, the motherboard,
>> etc, we'd be losing data anyway.  We just made sure we had good backups in
>> case the pool got corrupted and crossed our fingers.
>>
>> The reason I say wait 3-6 months is that there's a huge amount of activity
>> with SSD's at the moment.  Sun said that they were planning to have flash
>> storage launched by Christmas, so I figure there's a fair chance that we'll
>> see some supported PCIe cards by next Spring.
>> --
>> This message posted from opensolaris.org
>> ___
>> zfs-discuss mailing list
>> zfs-discuss@opensolaris.org
>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
> --
> Constantin Gonzalez  Sun Microsystems GmbH,
> Germany
> Principal Field Technologist
>  http://blogs.sun.com/constantin
> Tel.: +49 89/4 60 08-25 91
> http://google.com/search?q=constantin+gonzalez
>
> Sitz d. Ges.: Sun Microsystems GmbH, Sonnenallee 1, 85551
> Kirchheim-Heimstetten
> Amtsgericht Muenchen: HRB 161028
> Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Dr. Roland Boemer
> Vorsitzender des Aufsichtsrates: Martin Haering
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving zfs send performance

2008-10-16 Thread Ross Smith


Oh dear god.  Sorry folks, it looks like the new hotmail really doesn't play 
well with the list.  Trying again in plain text:
 
 
> Try to separate the two things:
> 
> (1) Try /dev/zero -> mbuffer --- network ---> mbuffer> /dev/null
> That should give you wirespeed
 
I tried that already.  It still gets just 10-11MB/s from this server.
I can get zfs send / receive and mbuffer working at 30MB/s though from a couple 
of test servers (with much lower specs).
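
(For reference, the /dev/zero test amounts to something like this, with an
arbitrary port number:

receiver# mbuffer -I sender:10001 -s 128k -m 512M > /dev/null
sender#   dd if=/dev/zero bs=128k | mbuffer -s 128k -m 512M -O receiver:10001

so there are no disks involved at either end.)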
 
> (2) Try zfs send | mbuffer> /dev/null
> That should give you an idea how fast zfs send really is locally.
 
Hmm, that's better than 10MB/s, but the average is still only around 20MB/s:
summary:  942 MByte in 47.4 sec - average of 19.9 MB/s
 
I think that points to another problem though as the send mbuffer is 100% full. 
 Certainly the pool itself doesn't appear under any strain at all while this is 
going on:
 
   capacity operationsbandwidth
pool used  avail   read  write   read  write
--  -  -  -  -  -  -
rc-pool  732G  1.55T171 85  21.3M  1.01M
  mirror 144G   320G 38  0  4.78M  0
c1t1d0  -  -  6  0   779K  0
c1t2d0  -  - 17  0  2.17M  0
c2t1d0  -  - 14  0  1.85M  0
  mirror 146G   318G 39  0  4.89M  0
c1t3d0  -  - 20  0  2.50M  0
c2t2d0  -  - 13  0  1.63M  0
c2t0d0  -  -  6  0   779K  0
  mirror 146G   318G 34  0  4.35M  0
c2t3d0  -  - 19  0  2.39M  0
c1t5d0  -  -  7  0  1002K  0
c1t4d0  -  -  7  0  1002K  0
  mirror 148G   316G 23  0  2.93M  0
c2t4d0  -  -  8  0  1.09M  0
c2t5d0  -  -  6  0   890K  0
c1t6d0  -  -  7  0  1002K  0
  mirror 148G   316G 35  0  4.35M  0
c1t7d0  -  -  6  0   779K  0
c2t6d0  -  - 12  0  1.52M  0
c2t7d0  -  - 17  0  2.07M  0
  c3d1p0  12K   504M  0 85  0  1.01M
--  -  -  -  -  -  -
 
Especially when compared to the zfs send stats on my backup server which 
managed 30MB/s via mbuffer (Being received on a single virtual SATA disk):
   capacity operationsbandwidth
pool used  avail   read  write   read  write
--  -  -  -  -  -  -
rpool   5.12G  42.6G  0  5  0  27.1K
  c4t0d0s0  5.12G  42.6G  0  5  0  27.1K
--  -  -  -  -  -  -
zfspool  431G  4.11T261  0  31.4M  0
  raidz2 431G  4.11T261  0  31.4M  0
c4t1d0  -  -155  0  6.28M  0
c4t2d0  -  -155  0  6.27M  0
c4t3d0  -  -155  0  6.27M  0
c4t4d0  -  -155  0  6.27M  0
c4t5d0  -  -155  0  6.27M  0
--  -  -  -  -  -  -
The really ironic thing is that the 30MB/s send / receive was sending to a 
virtual SATA disk which is stored (via sync NFS) on the server I'm having 
problems with...
 
Ross

 

> Date: Thu, 16 Oct 2008 14:27:49 +0200
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
> CC: zfs-discuss@opensolaris.org
> Subject: Re: [zfs-discuss] Improving zfs send performance
> 
> Hi Ross
> 
> Ross wrote:
>> Now though I don't think it's network at all. The end result from that 
>> thread is that we can't see any errors in the network setup, and using 
>> nicstat and NFS I can show that the server is capable of 50-60MB/s over the 
>> gigabit link. Nicstat also shows clearly that both zfs send / receive and 
>> mbuffer are only sending 1/5 of that amount of data over the network.
>> 
>> I've completely run out of ideas of my own (but I do half expect there's a 
>> simple explanation I haven't thought of). Can anybody think of a reason why 
>> both zfs send / receive and mbuffer would be so slow?
> 
> Try to separate the two things:
> 
> (1) Try /dev/zero -> mbuffer --- network ---> mbuffer> /dev/null
> 
> That should give you wirespeed
> 
> (2) Try zfs send | mbuffer> /dev/null
> 
> That should give you an idea how fast zfs send really is locally.
> 
> Carsten
_
Get all your favourite content with the slick new MSN Toolbar - FREE
http://clk.atdmt.com/UKM/go/111354027/direct/01/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving zfs send performance

2008-10-16 Thread Ross Smith

> Try to separate the two things:
>
> (1) Try /dev/zero -> mbuffer --- network ---> mbuffer > /dev/null
> That should give you wirespeed
I tried that already.  It still gets just 10-11MB/s from this server.
I can get zfs send / receive and mbuffer working at 30MB/s though from a couple 
of test servers (with much lower specs).
 
> (2) Try zfs send | mbuffer > /dev/null
> That should give you an idea how fast zfs send really is locally.
Hmm, that's better than 10MB/s, but the average is still only around 20MB/s:
summary:  942 MByte in 47.4 sec - average of 19.9 MB/s
 
I think that points to another problem though as the send mbuffer is 100% full. 
 Certainly the pool itself doesn't appear under any strain at all while this is 
going on:
 
   capacity operationsbandwidth
pool used  avail   read  write   read  write
--  -  -  -  -  -  -
rc-pool  732G  1.55T171 85  21.3M  1.01M
  mirror 144G   320G 38  0  4.78M  0
c1t1d0  -  -  6  0   779K  0
c1t2d0  -  - 17  0  2.17M  0
c2t1d0  -  - 14  0  1.85M  0
  mirror 146G   318G 39  0  4.89M  0
c1t3d0  -  - 20  0  2.50M  0
c2t2d0  -  - 13  0  1.63M  0
c2t0d0  -  -  6  0   779K  0
  mirror 146G   318G 34  0  4.35M  0
c2t3d0  -  - 19  0  2.39M  0
c1t5d0  -  -  7  0  1002K  0
c1t4d0  -  -  7  0  1002K  0
  mirror 148G   316G 23  0  2.93M  0
c2t4d0  -  -  8  0  1.09M  0
c2t5d0  -  -  6  0   890K  0
c1t6d0  -  -  7  0  1002K  0
  mirror 148G   316G 35  0  4.35M  0
c1t7d0  -  -  6  0   779K  0
c2t6d0  -  - 12  0  1.52M  0
c2t7d0  -  - 17  0  2.07M  0
  c3d1p0  12K   504M  0 85  0  1.01M
--  -  -  -  -  -  -
Especially when compared to the zfs send stats on my backup server which 
managed 30MB/s via mbuffer (Being received on a single virtual SATA disk):
   capacity operationsbandwidth
pool used  avail   read  write   read  write
--  -  -  -  -  -  -
rpool   5.12G  42.6G  0  5  0  27.1K
  c4t0d0s0  5.12G  42.6G  0  5  0  27.1K
--  -  -  -  -  -  -
zfspool  431G  4.11T261  0  31.4M  0
  raidz2 431G  4.11T261  0  31.4M  0
c4t1d0  -  -155  0  6.28M  0
c4t2d0  -  -155  0  6.27M  0
c4t3d0  -  -155  0  6.27M  0
c4t4d0  -  -155  0  6.27M  0
c4t5d0  -  -155  0  6.27M  0
--  -  -  -  -  -  -
The really ironic thing is that the 30MB/s send / receive was sending to a 
virtual SATA disk which is stored (via sync NFS) on the server I'm having 
problems with...
 
Ross
 
 
_
Win New York holidays with Kellogg’s & Live Search
http://clk.atdmt.com/UKM/go/111354033/direct/01/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving zfs send performance

2008-10-15 Thread Ross Smith

I'm using 2008-05-07 (latest stable), am I right in assuming that one is ok?


> Date: Wed, 15 Oct 2008 13:52:42 +0200
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]; zfs-discuss@opensolaris.org
> Subject: Re: [zfs-discuss] Improving zfs send performance
> 
> Thomas Maier-Komor schrieb:
>> BTW: I release a new version of mbuffer today.
> 
> WARNING!!!
> 
> Sorry people!!!
> 
> The latest version of mbuffer has a regression that can CORRUPT output
> if stdout is used. Please fall back to the last version. A fix is on the
> way...
> 
> - Thomas

_
Discover Bird's Eye View now with Multimap from Live Search
http://clk.atdmt.com/UKM/go/111354026/direct/01/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving zfs send performance

2008-10-15 Thread Ross Smith

Thanks, that got it working.  I'm still only getting 10MB/s, so it hasn't solved 
my problem - I've still got a bottleneck somewhere, but mbuffer is a huge 
improvement over standard zfs send / receive.  It makes such a difference when 
you can actually see what's going on.



> Date: Wed, 15 Oct 2008 12:08:14 +0200
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]; zfs-discuss@opensolaris.org
> Subject: Re: [zfs-discuss] Improving zfs send performance
> 
> Ross schrieb:
>> Hi,
>> 
>> I'm just doing my first proper send/receive over the network and I'm getting 
>> just 9.4MB/s over a gigabit link.  Would you be able to provide an example 
>> of how to use mbuffer / socat with ZFS for a Solaris beginner?
>> 
>> thanks,
>> 
>> Ross
>> --
>> This message posted from opensolaris.org
>> ___
>> zfs-discuss mailing list
>> zfs-discuss@opensolaris.org
>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> 
> receiver> mbuffer -I sender:1 -s 128k -m 512M | zfs receive
> 
> sender> zfs send mypool/[EMAIL PROTECTED] | mbuffer -s 128k -m
> 512M -O receiver:1
> 
> BTW: I release a new version of mbuffer today.
> 
> HTH,
> Thomas

_
Make a mini you and download it into Windows Live Messenger
http://clk.atdmt.com/UKM/go/111354029/direct/01/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Mirrors braindead?

2008-10-07 Thread Ross Smith

Oh cool, that's great news.  Thanks Eric.



> Date: Tue, 7 Oct 2008 11:50:08 -0700
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
> CC: zfs-discuss@opensolaris.org
> Subject: Re: [zfs-discuss] ZFS Mirrors braindead?
> 
> On Tue, Oct 07, 2008 at 11:42:57AM -0700, Ross wrote:
>> 
>> Running "zpool status" is a complete no no if your array is degraded
>> in any way.  This is capable of locking up zfs even when it would
>> otherwise have recovered itself.  If you had zpool status hang, this
>> probably happened to you.
> 
> FYI, this is bug 6667208 fixed in build 100 of nevada.
> 
> - Eric
> 
> --
> Eric Schrock, Fishworkshttp://blogs.sun.com/eschrock

_
Discover Bird's Eye View now with Multimap from Live Search
http://clk.atdmt.com/UKM/go/111354026/direct/01/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Scripting zfs send / receive

2008-09-26 Thread Ross Smith

Hi Mertol,
 
Yes, I'm using zfs send -i to just send the changes rather than the whole 
thing.  I'll have a think about your suggestion for deleting snapshots too, 
that does sound like a good idea.
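
The core of the script really boils down to something like this (names here
are just placeholders):

zfs snapshot tank/vmstore@2008-09-26-1015
zfs send -i tank/vmstore@2008-09-26-1000 tank/vmstore@2008-09-26-1015 \
    | ssh backupserver /usr/sbin/zfs receive tank/vmstore

plus the error checking and snapshot housekeeping around it.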
 
Unfortunately I won't be able to synchronise any applications with this script. 
 It's backing up a filestore used by VMware ESX, so could be holding any number 
of machines.  The aim of this is purely to give me a crash consistent backup of 
those virtual machines just in case.  We have other software in place for our 
regular backups, this is our belt & braces disaster recovery copy :).
 
Ross
> Date: Fri, 26 Sep 2008 12:53:06 +0300
> From: [EMAIL PROTECTED]
> Subject: RE: [zfs-discuss] Scripting zfs send / receive
> To: [EMAIL PROTECTED]
>
> Hi Ross ;
>
> I am no expert in scripting but I was a software engineer once :)
> It's good to desing the script to be able to tolerate errors.
>
> Instead of sending the snapshot I'd recommending sending difference of snap
> shots.
>
> Also instead of deleting the oldest snapshot I recommend deleting all but
> newest X number of snapshots. (incase for some reason script is unable to
> delete a snap shot, it will clean them in the next run, as it will always
> leave a fix number of snap shots alive)
>
> Also you may want your script talk to application running on the Fs before
> and after snapshot to make the snapshot consistent.
>
> My 2 cents...
>
> Best regards
> Mertol
>
>
>
> Mertol Ozyoney
> Storage Practice - Sales Manager
>
> Sun Microsystems, TR
> Istanbul TR
> Phone +902123352200
> Mobile +905339310752
> Fax +90212335
> Email [EMAIL PROTECTED]
>
>
>
> -Original Message-
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Ross
> Sent: Friday, September 26, 2008 12:43 PM
> To: zfs-discuss@opensolaris.org
> Subject: [zfs-discuss] Scripting zfs send / receive
>
> Hey folks,
>
> Is anybody able to help a Solaris scripting newbie with this? I want to put
> together an automatic script to take snapshots on one system and send them
> across to another. I've shown the manual process works, but only have a very
> basic idea about how I'm going to automate this.
>
> My current thinking is that I want to put together a cron job that will work
> along these lines:
>
> - Run every 15 mins
> - take a new snapshot of the pool
> - send the snapshot to the remote system with zfs send / receive and ssh.
> (am I right in thinking I can get ssh to work with no password if I create a
> public/private key pair?
> http://www.go2linux.org/ssh-login-using-no-password)
> - send an e-mail alert if zfs send / receive fails for any reason (with the
> text of the failure message)
> - send an e-mail alert if zfs send / receive takes longer than 15 minutes
> and clashes with the next attempt
> - delete the oldest snapshot on both systems if the send / receive worked
>
> Can anybody think of any potential problems I may have missed?
>
> Bearing in mind I've next to no experience in bash scripting, how does the
> following look?
>
> **
> #!/bin/bash
>
> # Prepare variables for e-mail alerts
> SUBJECT="zfs send / receive error"
> EMAIL="[EMAIL PROTECTED]"
>
> NEWSNAP="build filesystem + snapshot name here"
> RESULTS=$(/usr/sbin/zfs snapshot $NEWSNAP)
> # how do I check for a snapshot failure here? Just look for non blank
> $RESULTS?
> if $RESULTS; then
> # send e-mail
> /bin/mail -s $SUBJECT $EMAIL $RESULTS
> exit
> fi
>
> PREVIOUSSNAP="build filesystem + snapshot name here"
> RESULTS=$(/usr/sbin/zfs send -i $NEWSNAP $PREVIOUSSNAP | ssh -l *user*
> *remote-system* /usr/sbin/zfs receive *filesystem*)
> # again, how do I check for error messages here? Do I just look for a blank
> $RESULTS to indicate success?
> if $RESULTS ok; then
> OBSOLETESNAP="build filesystem + name here"
> zfs destroy $OBSOLETESNAP
> ssh -l *user* *remote-system* /usr/sbin/zfs destroy $OBSOLETESNAP
> else
> # send e-mail with error message
> /bin/mail -s $SUBJECT $EMAIL $RESULTS
> fi
> **
>
> One concern I have is what happens if the send / receive takes longer than
> 15 minutes. Do I need to check that manually, or will the script cope with
> this already? Can anybody confirm that it will behave as I am hoping in that
> the script will take the next snapshot, but the send / receive will fail and
> generate an e-mail alert?
>
> thanks,
>
> Ross
> --
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
_
Get all your favourite content with the slick new MSN Toolbar - FREE
http://clk.atdmt.com/UKM/go/111354027/direct/01/

Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-09-02 Thread Ross Smith

Thinking about it, we could make use of this too.  The ability to add a
remote iSCSI mirror to any pool without sacrificing local performance
could be a huge benefit.
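
In other words something along these lines (target name and device made up),
once the latency-aware child selection Eric describes exists:

# iscsiadm add static-config iqn.1986-03.com.sun:02:remotemirror,192.168.1.50:3260
# devfsadm -i iscsi
# zpool attach tank c1t0d0 c4t600144F0AABBCC00d0

which turns an existing local disk into a local/remote mirror pair - the
missing piece today being that a slow or dead remote side drags the whole
pool down with it.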


> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
> CC: [EMAIL PROTECTED]; zfs-discuss@opensolaris.org
> Subject: Re: Availability: ZFS needs to handle disk removal / driver failure 
> better
> Date: Fri, 29 Aug 2008 09:15:41 +1200
> 
> Eric Schrock writes:
> > 
> > A better option would be to not use this to perform FMA diagnosis, but
> > instead work into the mirror child selection code.  This has already
> > been alluded to before, but it would be cool to keep track of latency
> > over time, and use this to both a) prefer one drive over another when
> > selecting the child and b) proactively timeout/ignore results from one
> > child and select the other if it's taking longer than some historical
> > standard deviation.  This keeps away from diagnosing drives as faulty,
> > but does allow ZFS to make better choices and maintain response times.
> > It shouldn't be hard to keep track of the average and/or standard
> > deviation and use it for selection; proactively timing out the slow I/Os
> > is much trickier. 
> > 
> This would be a good solution to the remote iSCSI mirror configuration.  
> I've been working though this situation with a client (we have been 
> comparing ZFS with Cleversafe) and we'd love to be able to get the read 
> performance of the local drives from such a pool. 
> 
> > As others have mentioned, things get more difficult with writes.  If I
> > issue a write to both halves of a mirror, should I return when the first
> > one completes, or when both complete?  One possibility is to expose this
> > as a tunable, but any such "best effort RAS" is a little dicey because
> > you have very little visibility into the state of the pool in this
> > scenario - "is my data protected?" becomes a very difficult question to
> > answer. 
> > 
> One solution (again, to be used with a remote mirror) is the three way 
> mirror.  If two devices are local and one remote, data is safe once the two 
> local writes return.  I guess the issue then changes from "is my data safe" 
> to "how safe is my data".  I would be reluctant to deploy a remote mirror 
> device without local redundancy, so this probably won't be an uncommon 
> setup.  There would have to be an acceptable window of risk when local data 
> isn't replicated. 
> 
> Ian
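
As a concrete sketch of the layout Ian describes (two local sides plus one remote),
with purely illustrative device names:

  # Two local disks and one iSCSI-backed LUN in a single three-way mirror
  # (device names are examples only).
  zpool create tank mirror c1t0d0 c1t1d0 c4t0d0

  # Or attach the remote LUN to an existing two-way mirror later on:
  zpool attach tank c1t0d0 c4t0d0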



Re: [zfs-discuss] EMC - top of the table for efficiency, how well would ZFS do?

2008-08-31 Thread Ross Smith

Dear god.  Thanks Tim, that's useful info.

The sales rep we spoke to was really trying quite hard to persuade us that 
NetApp was the best solution for us, they spent a couple of months working with 
us, but ultimately we were put off because of those 'limitations'.  They knew 
full well that those were two of our major concerns, but never had an answer 
for us.  That was a big part of the reason we started seriously looking into 
ZFS instead of NetApp.

If nothing else at least I now know a firm to avoid when buying NetApp...

Date: Sun, 31 Aug 2008 11:06:16 -0500
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: Re: [zfs-discuss] EMC - top of the table for efficiency, how well 
would ZFS do?
CC: zfs-discuss@opensolaris.org



On Sun, Aug 31, 2008 at 10:39 AM, Ross Smith <[EMAIL PROTECTED]> wrote:






Hey Tim,

I'll admit I just quoted the blog without checking, I seem to remember the 
sales rep I spoke to recommending putting aside 20-50% of my disk for 
snapshots.  Compared to ZFS where I don't need to reserve any space it feels 
very old fashioned.  With ZFS, snapshots just take up as much space as I want 
them to.

Your sales rep was an idiot then.  Snapshot reserve isn't required at all. It 
isn't necessary to take snapshots.  It's simply a portion of space out of a 
volume that can only be used for snapshots, live data cannot enter into this 
space.  Snapshots, however, can exist on a volume with no snapshot reserve.  
They are in no way limited to the "snapshot reserve" you've set. Snapshot 
reserve is a guaranteed minimum amount of space out of a volume.  You can set 
it 90% as you mention below, and it will work just fine.


ZFS is no different than NetApp when it comes to snapshots.  I suggest until 
you have a basic understanding of how NetApp software works, not making ANY 
definitive statements about them.  You're sounding like a fool and/or someone 
working for one of their competitors.

 

The problem though for our usage with NetApp was that we actually couldn't 
reserve enough space for snapshots.  50% of the pool was their maximum, and 
we're interested in running ten years worth of snapshots here, which could see 
us with a pool with just 10% of live data and 90% of the space taken up by 
snapshots.  The NetApp approach was just too restrictive.


Ross
 There is not, and never has been a "50% of the pool maximum".  That's also a 
lie.  If you want snapshots to take up 90% of the pool, ONTAP will GLADLY do 
so.  I've got a filer sitting in my lab and would be MORE than happy to post 
the df output of a volume that has snapshots taking up 90% of the volume.



--Tim






Re: [zfs-discuss] EMC - top of the table for efficiency, how well would ZFS do?

2008-08-31 Thread Ross Smith

Hey Tim,

I'll admit I just quoted the blog without checking, I seem to remember the 
sales rep I spoke to recommending putting aside 20-50% of my disk for 
snapshots.  Compared to ZFS where I don't need to reserve any space it feels 
very old fashioned.  With ZFS, snapshots just take up as much space as I want 
them to.

The problem though for our usage with NetApp was that we actually couldn't 
reserve enough space for snapshots.  50% of the pool was their maximum, and 
we're interested in running ten years worth of snapshots here, which could see 
us with a pool with just 10% of live data and 90% of the space taken up by 
snapshots.  The NetApp approach was just too restrictive.

Ross
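
For what it's worth, that behaviour is easy to see in practice; no space is set
aside up front, and a snapshot only consumes space as the live data diverges from
it (the dataset name is just an example):

  zfs snapshot tank/home@2008-08-31
  zfs list -t snapshot -o name,used,referenced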


> Date: Sun, 31 Aug 2008 08:08:09 -0700
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]; zfs-discuss@opensolaris.org
> Subject: Re: [zfs-discuss] EMC - top of the table for efficiency, how well 
> would ZFS do?
> 
> Netapp does NOT recommend 100 percent.  Perhaps you should talk to
> netapp or one of their partners who know their tech instead of their
> competitors next time.
> 
> Zfs, the way its currently implemented will require roughly the same
> as netapp... Which still isn't 100.
> 
> 
> 
> On 8/30/08, Ross <[EMAIL PROTECTED]> wrote:
> > Just saw this blog post linked from the register, it's EMC pointing out that
> > their array wastes less disk space than either HP or NetApp.  I'm loving the
> > 10% of space they have to reserve for snapshots, and you can't add more o_0.
> >
> > HP similarly recommend 20% of reserved space for snapshots, and NetApp
> > recommend a whopping 100% (that was one reason we didn't buy NetApp
> > actually).
> >
> > Could anybody say how ZFS would match up to these figures?  I'd have thought
> > a 14+2 raid-z2 scheme similar to NFS' would probably be fairest.
> >
> > http://chucksblog.typepad.com/chucks_blog/2008/08/your-storage-mi.html
> >
> > Ross
> > --
> > This message posted from opensolaris.org
> > ___
> > zfs-discuss mailing list
> > zfs-discuss@opensolaris.org
> > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> >



Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-30 Thread Ross Smith

Triple mirroring you say?  That'd be me then :D

The reason I really want to get ZFS timeouts sorted is that our long term goal 
is to mirror that over two servers too, giving us a pool mirrored across two 
servers, each of which is actually a zfs iscsi volume hosted on triply mirrored 
disks.

Oh, and we'll have two sets of online off-site backups running raid-z2, plus a 
set of off-line backups too.

All in all I'm pretty happy with the integrity of the data, wouldn't want to 
use anything other than ZFS for that now.  I'd just like to get the 
availability working a bit better, without having to go back to buying raid 
controllers.  We have big plans for that too; once we get the iSCSI / iSER 
timeout issue sorted our long term availability goals are to have the setup I 
mentioned above hosted out from a pair of clustered Solaris NFS / CIFS servers.

Failover time on the cluster is currently in the order of 5-10 seconds, if I 
can get the detection of a bad iSCSI link down under 2 seconds we'll 
essentially have a worst case scenario of < 15 seconds downtime.  Downtime that 
low means it's effectively transparent for our users as all of our applications 
can cope with that seamlessly, and I'd really love to be able to do that this 
calendar year.

Anyway, getting back on topic, it's a good point about moving forward while 
redundancy exists.  I think the flag for specifying the write behavior should 
have that as the default, with the optional setting being to allow the pool to 
continue accepting writes while the pool is in a non redundant state.

Ross
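
For anyone following along, the pool-level knob that exists today in this area is
the failmode property (wait, continue or panic); it governs what happens when the
pool loses access to its devices rather than the per-write behaviour discussed
above, but it is the closest current setting. A quick sketch, assuming a pool
called tank:

  # The default is "wait", which blocks I/O until the device returns.
  zpool get failmode tank

  # "continue" returns EIO to new writes instead of blocking the whole pool;
  # "panic" panics the host, which suits clustered failover setups.
  zpool set failmode=continue tank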

> Date: Sat, 30 Aug 2008 10:59:19 -0500
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
> CC: zfs-discuss@opensolaris.org
> Subject: Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / 
> driver failure better
> 
> On Sat, 30 Aug 2008, Ross wrote:
> > while the problem is diagnosed. - With that said, could the write 
> > timeout default to on when you have a slog device?  After all, the 
> > data is safely committed to the slog, and should remain there until 
> > it's written to all devices.  Bob, you seemed the most concerned 
> > about writes, would that be enough redundancy for you to be happy to 
> > have this on by default?  If not, I'd still be ok having it off by 
> > default, we could maybe just include it in the evil tuning guide 
> > suggesting that this could be turned on by anybody who has a 
> > separate slog device.
> 
> It is my impression that the slog device is only used for synchronous 
> writes.  Depending on the system, this could be just a small fraction 
> of the writes.
> 
> In my opinion, ZFS's primary goal is to avoid data loss, or 
> consumption of wrong data.  Availability is a lesser goal.
> 
> If someone really needs maximum availability then they can go to 
> triple mirroring or some other maximally redundant scheme.  ZFS should 
> to its best to continue moving forward as long as some level of 
> redundancy exists.  There could be an option to allow moving forward 
> with no redundancy at all.
> 
> Bob
> ==
> Bob Friesenhahn
> [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
> 



Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

2008-08-28 Thread Ross Smith

Hi guys,

Bob, my thought was to have this timeout as something that can be optionally 
set by the administrator on a per pool basis.  I'll admit I was mainly thinking 
about reads and hadn't considered the write scenario, but even having thought 
about that it's still a feature I'd like.  After all, this would be a timeout 
set by the administrator based on the longest delay they can afford for that 
storage pool.

Personally, if a SATA disk wasn't responding to any requests after 2 seconds I 
really don't care if an error has been detected, as far as I'm concerned that 
disk is faulty.  I'd be quite happy for the array to drop to a degraded mode 
based on that and for writes to carry on with the rest of the array.

Eric, thanks for the extra details, they're very much appreciated.  It's good 
to hear you're working on this, and I love the idea of doing a B_FAILFAST read 
on both halves of the mirror.

I do have a question though.  From what you're saying, the response time can't 
be consistent across all hardware, so you're once again at the mercy of the 
storage drivers.  Do you know how long B_FAILFAST takes to return a 
response on iSCSI?  If that's over 1-2 seconds I would still consider that too 
slow I'm afraid.

I understand that Sun in general don't want to add fault management to ZFS, but 
I don't see how this particular timeout does anything other than help ZFS when 
it's dealing with such a diverse range of media.  I agree that ZFS can't know 
itself what should be a valid timeout, but that's exactly why this needs to be 
an optional administrator set parameter.  The administrator of a storage array 
who wants to set this certainly knows what a valid timeout is for them, and 
these timeouts are likely to be several orders of magnitude larger than the 
standard response times.  I would configure very different values for my SATA 
drives as for my iSCSI connections, but in each case I would be happier knowing 
that ZFS has more of a chance of catching bad drivers or unexpected scenarios.

I very much doubt hardware raid controllers would wait 3 minutes for a drive to 
return a response, they will have their own internal timeouts to know when a 
drive has failed, and while ZFS is dealing with very different hardware I can't 
help but feel it should have that same approach to management of its drives.

However, that said, I'll be more than willing to test the new
B_FAILFAST logic on iSCSI once it's released.  Just let me know when
it's out.


Ross
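
There is no per-pool ZFS timeout property like the one proposed here; the closest
existing knob sits a layer down in the sd driver, whose per-command timeout can be
tuned in /etc/system. This is only a sketch, it applies system-wide rather than per
pool, and the value below is purely illustrative:

  # /etc/system fragment (takes effect after a reboot, applies to all sd devices)
  set sd:sd_io_time = 10

  # Check the value currently in the running kernel:
  echo "sd_io_time/D" | mdb -k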





> Date: Thu, 28 Aug 2008 11:29:21 -0500
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
> CC: zfs-discuss@opensolaris.org
> Subject: Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / 
> driver failure better
> 
> On Thu, 28 Aug 2008, Ross wrote:
> >
> > I believe ZFS should apply the same tough standards to pool 
> > availability as it does to data integrity.  A bad checksum makes ZFS 
> > read the data from elsewhere, why shouldn't a timeout do the same 
> > thing?
> 
> A problem is that for some devices, a five minute timeout is ok.  For 
> others, there must be a problem if the device does not respond in a 
> second or two.
> 
> If the system or device is simply overwelmed with work, then you would 
> not want the system to go haywire and make the problems much worse.
> 
> Which of these do you prefer?
> 
>o System waits substantial time for devices to (possibly) recover in
>  order to ensure that subsequently written data has the least
>  chance of being lost.
> 
>o System immediately ignores slow devices and switches to
>  non-redundant non-fail-safe non-fault-tolerant may-lose-your-data
>  mode.  When system is under intense load, it automatically
>  switches to the may-lose-your-data mode.
> 
> Bob
> ==
> Bob Friesenhahn
> [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
> 



Re: [zfs-discuss] ZFS automatic snapshots 0.11 Early Access

2008-08-27 Thread Ross Smith

That sounds absolutely perfect Tim, thanks.
 
Yes, we'll be sending these to other zfs filesystems, although I haven't looked 
at the send/receive part of your service yet.  What I'd like to do is stage the 
send/receive as files on an external disk, and then receive them remotely from 
that.  I've tested the concept works with a single send/receive operation, but 
haven't looked into the automation yet.
 
The plan is to use usb/firewire/esata disks to do the data transfers rather 
than doing it all over the wire.  We'll do the initial full send/receive 
locally over gigabit to prepare the remote system, and from that point on it 
will just be incremental daily or weekly transfers which should fit fine on an 
80-200GB external drive.
 
When I get around to it I'll be pulling apart your automatic backup code to see 
if I can't get it to fire off the incremental zfs send (or receive) as soon as 
the system detects that the external drive has been attached.
 
We will be using tape backups too, but those will be our disaster recovery plan 
in case ZFS itself fails, so those will be backups of the raw files, possibly 
using something as simple as tar.  We don't expect to ever need those, but at 
least we'll be safe should we ever experience pool corruption on all four 
servers.
 
And yes, you could say we're paranoid :D
 
Ross
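
A rough sketch of the staging idea above, assuming the external drive is mounted at
/media/backup and using placeholder snapshot names; checking the exit status of each
step is what tells you the stream is worth carrying to the remote site:

  # On the source server: write the incremental stream to the external disk.
  zfs send -i tank/home@2008-08-26 tank/home@2008-08-27 \
      > /media/backup/home-20080827.zfs || echo "send failed"

  # On the remote server, once the disk has been carried over and mounted:
  zfs receive tank/home < /media/backup/home-20080827.zfs
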
> Date: Wed, 27 Aug 2008 12:14:10 +0100
> From: [EMAIL PROTECTED]
> Subject: Re: [zfs-discuss] ZFS automatic snapshots 0.11 Early Access
> To: [EMAIL PROTECTED]
> CC: zfs-discuss@opensolaris.org
>
> On Wed, 2008-08-27 at 03:53 -0700, Ross wrote:
> > We're looking at autohome folders for windows users over CIFS, but I'm
> > wondering how that is going to affect our backup strategy.  I was
> > hoping to be able to use your automatic snapshot service on these
> > servers, do you know how that service would work with the autohome
> > service when filesystems are being created on demand?
>
> If you're using 0.11ea, and you're creating filesystems on the fly, so
> long as the parent filesystem you're creating a child in has a
> com.sun:auto-snapshot property set, the child will inherit that zfs
> user-property, and snapshots will automatically get taken for that child
> too, no user intervention needed.
>
> The automatic backup stuff in the service (not turned on by default) should
> handle incremental vs. full send/recvs, even on newly created
> filesystems.  If it finds an earlier snapshot for the filesystem, it'll
> do an incremental send/recv, otherwise it'll send a full snapshot stream
> first, followed by incremental send/recvs after that.
>
> [ of course, you'd be sending these streams to other zfs filesystems,
> not just saving the flat zfs send streams to tape, right? ]
>
> cheers,
> tim
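
To illustrate the inheritance Tim describes, a short example with a placeholder
parent filesystem:

  # Tag the parent; children created later (e.g. by the autohome service)
  # inherit the user property and are picked up by the snapshot service.
  zfs set com.sun:auto-snapshot=true tank/home
  zfs create tank/home/newuser
  zfs get -r com.sun:auto-snapshot tank/home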


Re: [zfs-discuss] Best layout for 15 disks?

2008-08-22 Thread Ross Smith

Yup, you got it, and an 8 disk raid-z2 array should still fly for a home system 
:D  I'm guessing you're on gigabit there?  I don't see you having any problems 
hitting the bandwidth limit on it.

Ross


> Date: Fri, 22 Aug 2008 11:11:21 -0700
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
> Subject: Re: [zfs-discuss] Best layout for 15 disks?
> CC: zfs-discuss@opensolaris.org
> 
> On 8/22/08, Ross <[EMAIL PROTECTED]> wrote:
> 
> > Yes, that looks pretty good mike.  There are a few limitations to that as 
> > you add the 2nd raidz2 set, but nothing major.  When you add the extra 
> > disks, your original data will still be stored on the first set of disks, 
> > if you've any free space left on those you'll then get some data stored 
> > across all the disks, and then I think that once the first set are full, 
> > zfs will just start using the free space on the newer 8.
> 
> > It shouldn't be a problem for a home system, and all that will happen 
> > silently in the background.  It's just worth knowing that you don't 
> > necessarily get the full performance of a 16 disk array when you do it in 
> > two stages like that.
> 
> that's fine. I'll basically be getting the performance of an 8 disk
> raidz2 at worst, yeah? i'm fine with how the space will be
> distributed. after all this is still a huge improvement over my
> current haphazard setup :P
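
For reference, growing the pool in two stages as described above looks roughly like
this; the disk names are placeholders:

  # Initial 8-disk raidz2 vdev.
  zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0

  # Later, add a second 8-disk raidz2 vdev.  Existing data stays where it is;
  # new writes are spread across both vdevs as free space allows.
  zpool add tank raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0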



Re: [zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed

2008-08-20 Thread Ross Smith

> > Without fail, cfgadm changes the status from "disk" to "sata-port" when I
> > unplug a device attached to port 6 or 7, but most of the time unplugging
> > disks 0-5 results in no change in cfgadm, until I also attach disk 6 or 7.
> 
> That does seem inconsistent, or at least, it's not what I'd expect.

Yup, was an absolute nightmare to diagnose on top of everything else.  
Definitely doesn't happen in windows too.  I really want somebody to try snv_94 
on a Thumper to see if you get the same behaviour there, or whether it's unique 
to Supermicro's Marvell card.

> > Often the system hung completely when you pulled one of the disks 0-5,
> > and wouldn't respond again until you re-inserted it.
> > 
> > I'm 99.99% sure this is a driver issue for this controller.
> 
> Have you logged a bug on it yet?

Yup, 6735931.  Added the information about it working in Windows today too.

Ross



Re: [zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed

2008-08-15 Thread Ross Smith

Oh god no, I'm already learning three new operating systems, now is not a good 
time to add a fourth.
 
Ross  <-- Windows admin now working with Ubuntu, OpenSolaris and ESX



Date: Fri, 15 Aug 2008 10:07:31 -0500
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: Re: [zfs-discuss] FW: Supermicro AOC-SAT2-MV8 hang when drive removed
CC: zfs-discuss@opensolaris.org

You could always try FreeBSD :)

--Tim

On Fri, Aug 15, 2008 at 9:44 AM, Ross <[EMAIL PROTECTED]> wrote:
Haven't a clue, but I've just gotten around to installing windows on this box to
test and I can confirm that hot plug works just fine in windows.  Drives appear
and disappear in device manager the second I unplug the hardware.  Any drive,
either controller.  So far I've done a couple of dozen removals, pulling
individual drives, or as many as half a dozen at once.  I've even gone as far as
to immediately pull a drive I only just connected.  Windows has no problems at
all.  Unfortunately for me, Windows doesn't support ZFS...  right now it's
looking a whole load more stable.

Ross


Re: [zfs-discuss] Zpool import not working - I broke my pool...

2008-08-06 Thread Ross Smith

Hmm... got a bit more information for you to add to that bug I think.
 
Zpool import also doesn't work if you have mirrored log devices and either one 
of them is offline.
 
I created two ramdisks with:
# ramdiskadm -a rc-pool-zil-1 256m
# ramdiskadm -a rc-pool-zil-2 256m
 
And added them to the pool with:
# zpool add rc-pool log mirror /dev/ramdisk/rc-pool-zil-1 
/dev/ramdisk/rc-pool-zil-2
 
I can reboot fine, the pool imports ok without the ZIL and I have a script that 
recreates the ramdisks and adds them back to the pool:

#!/sbin/sh
state="$1"
case "$state" in
'start')
   echo 'Starting Ramdisks'
   /usr/sbin/ramdiskadm -a rc-pool-zil-1 256m
   /usr/sbin/ramdiskadm -a rc-pool-zil-2 256m
   echo 'Attaching to ZFS ZIL'
   /usr/sbin/zpool replace test /dev/ramdisk/rc-pool-zil-1
   /usr/sbin/zpool replace test /dev/ramdisk/rc-pool-zil-2
   ;;
'stop')
   ;;
esac
 
However, if I export the pool, and delete one ramdisk to check that the 
mirroring works fine, the import fails:
# zpool export rc-pool
# ramdiskadm -d rc-pool-zil-1
# zpool import rc-pool
cannot import 'rc-pool': one or more devices is currently unavailable
 
Ross
> Date: Mon, 4 Aug 2008 10:42:43 -0600
> From: [EMAIL PROTECTED]
> Subject: Re: [zfs-discuss] Zpool import not working - I broke my pool...
> To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> CC: zfs-discuss@opensolaris.org
>
> Richard Elling wrote:
> > Ross wrote:
> >> I'm trying to import a pool I just exported but I can't, even -f doesn't
> >> help.  Every time I try I'm getting an error:
> >> "cannot import 'rc-pool': one or more devices is currently unavailable"
> >>
> >> Now I suspect the reason it's not happy is that the pool used to have a ZIL :)
> >>
> >
> > Correct.  What you want is CR 6707530, log device failure needs some work
> > http://bugs.opensolaris.org/view_bug.do?bug_id=6707530
> > which Neil has been working on, scheduled for b96.
>
> Actually no.  That CR mentioned the problem and talks about splitting out
> the bug, as it's really a separate problem.  I've just done that and here's
> the new CR which probably won't be visible immediately to you:
>
> 6733267 Allow a pool to be imported with a missing slog
>
> Here's the Description:
>
> ---------------
> This CR is being broken out from 6707530 "log device failure needs some work"
>
> When Separate Intent logs (slogs) were designed they were given equal status
> in the pool device tree.
> This was because they can contain committed changes to the pool.
> So if one is missing it is assumed to be important to the integrity of the
> application(s) that wanted the data committed synchronously, and thus
> a pool cannot be imported with a missing slog.
> However, we do allow a pool to be missing a slog on boot up if
> it's in the /etc/zfs/zpool.cache file.  So this sends a mixed message.
>
> We should allow a pool to be imported without a slog if -f is used
> and to not import without "-f" but perhaps with a better error message.
>
> It's the guidsum check that actually rejects imports with missing devices.
> We could have a separate guidsum for the main pool devices (non slog/cache).
> ---------------


Re: [zfs-discuss] Zpool import not working - I broke my pool...

2008-08-05 Thread Ross Smith

No, but that's a great idea!  I'm on a UFS root at the moment, will have a look 
at using ZFS next time I re-install.
> Date: Tue, 5 Aug 2008 07:59:35 -0700
> From: [EMAIL PROTECTED]
> Subject: Re: [zfs-discuss] Zpool import not working - I broke my pool...
> To: [EMAIL PROTECTED]
> CC: [EMAIL PROTECTED]; zfs-discuss@opensolaris.org
>
> Ross Smith wrote:
> > Just a thought, before I go and wipe this zpool, is there any way to
> > manually recreate the /etc/zfs/zpool.cache file?
>
> Do you have a copy in a snapshot?  ZFS for root is awesome!
> -- richard
>
> >
> > Ross
> >
> > > Date: Mon, 4 Aug 2008 10:42:43 -0600
> > > From: [EMAIL PROTECTED]
> > > Subject: Re: [zfs-discuss] Zpool import not working - I broke my pool...
> > > To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> > > CC: zfs-discuss@opensolaris.org
> > >
> > > Richard Elling wrote:
> > > > Ross wrote:
> > > >> I'm trying to import a pool I just exported but I can't, even -f
> > > >> doesn't help.  Every time I try I'm getting an error:
> > > >> "cannot import 'rc-pool': one or more devices is currently unavailable"
> > > >>
> > > >> Now I suspect the reason it's not happy is that the pool used to have a ZIL :)
> > > >>
> > > >
> > > > Correct.  What you want is CR 6707530, log device failure needs some work
> > > > http://bugs.opensolaris.org/view_bug.do?bug_id=6707530
> > > > which Neil has been working on, scheduled for b96.
> > >
> > > Actually no.  That CR mentioned the problem and talks about splitting out
> > > the bug, as it's really a separate problem.  I've just done that and here's
> > > the new CR which probably won't be visible immediately to you:
> > >
> > > 6733267 Allow a pool to be imported with a missing slog
> > >
> > > Here's the Description:
> > >
> > > ---------------
> > > This CR is being broken out from 6707530 "log device failure needs some work"
> > >
> > > When Separate Intent logs (slogs) were designed they were given equal
> > > status in the pool device tree.
> > > This was because they can contain committed changes to the pool.
> > > So if one is missing it is assumed to be important to the integrity of the
> > > application(s) that wanted the data committed synchronously, and thus
> > > a pool cannot be imported with a missing slog.
> > > However, we do allow a pool to be missing a slog on boot up if
> > > it's in the /etc/zfs/zpool.cache file.  So this sends a mixed message.
> > >
> > > We should allow a pool to be imported without a slog if -f is used
> > > and to not import without "-f" but perhaps with a better error message.
> > >
> > > It's the guidsum check that actually rejects imports with missing devices.
> > > We could have a separate guidsum for the main pool devices (non slog/cache).
> > > ---------------


Re: [zfs-discuss] Zpool import not working - I broke my pool...

2008-08-05 Thread Ross Smith

Just a thought, before I go and wipe this zpool, is there any way to manually 
recreate the /etc/zfs/zpool.cache file?
 
Ross

> Date: Mon, 4 Aug 2008 10:42:43 -0600
> From: [EMAIL PROTECTED]
> Subject: Re: [zfs-discuss] Zpool import not working - I broke my pool...
> To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> CC: zfs-discuss@opensolaris.org
>
> Richard Elling wrote:
> > Ross wrote:
> >> I'm trying to import a pool I just exported but I can't, even -f doesn't
> >> help.  Every time I try I'm getting an error:
> >> "cannot import 'rc-pool': one or more devices is currently unavailable"
> >>
> >> Now I suspect the reason it's not happy is that the pool used to have a ZIL :)
> >>
> >
> > Correct.  What you want is CR 6707530, log device failure needs some work
> > http://bugs.opensolaris.org/view_bug.do?bug_id=6707530
> > which Neil has been working on, scheduled for b96.
>
> Actually no.  That CR mentioned the problem and talks about splitting out
> the bug, as it's really a separate problem.  I've just done that and here's
> the new CR which probably won't be visible immediately to you:
>
> 6733267 Allow a pool to be imported with a missing slog
>
> Here's the Description:
>
> ---------------
> This CR is being broken out from 6707530 "log device failure needs some work"
>
> When Separate Intent logs (slogs) were designed they were given equal status
> in the pool device tree.
> This was because they can contain committed changes to the pool.
> So if one is missing it is assumed to be important to the integrity of the
> application(s) that wanted the data committed synchronously, and thus
> a pool cannot be imported with a missing slog.
> However, we do allow a pool to be missing a slog on boot up if
> it's in the /etc/zfs/zpool.cache file.  So this sends a mixed message.
>
> We should allow a pool to be imported without a slog if -f is used
> and to not import without "-f" but perhaps with a better error message.
>
> It's the guidsum check that actually rejects imports with missing devices.
> We could have a separate guidsum for the main pool devices (non slog/cache).
> ---------------


Re: [zfs-discuss] are these errors dangerous

2008-08-03 Thread Ross Smith

Hi Matt,
 
If it's all 3 disks, I wouldn't have thought it likely to be disk errors, and I 
don't think it's a ZFS fault as such.  You might be better posting the question 
in the storage or help forums to see if anybody there can shed more light on 
this.
 
Ross
> Date: Sun, 3 Aug 2008 16:48:03 +0100
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
> CC: zfs-discuss@opensolaris.org
> Subject: Re: [zfs-discuss] are these errors dangerous
>
> Ross wrote:
> > Hi,
> >
> > First of all, I really should warn you that I'm very new to Solaris, I'll
> > happily share my thoughts but be aware that there's not a lot of experience
> > backing them up.
> >
> > From what you've said, and the logs you've posted I suspect you're hitting
> > recoverable read errors.  ZFS wouldn't flag these as no corrupt data has
> > been encountered, but I suspect the device driver is logging them anyway.
> >
> > The log you posted all appears to refer to one disk (sd0), my guess would be
> > that you have some hardware faults on that device and if it were me I'd
> > probably be replacing it before it actually fails.
> >
> > I'd check your logs before replacing that disk though, you need to see if
> > it's just that one disk, or if others are affected.  Provided you have a
> > redundant ZFS pool, it may be worth offlining that disk, unconfiguring it
> > with cfgadm, and then pulling the drive to see if that does cure the
> > warnings you're getting in the logs.
> >
> > Whatever you do, please keep me posted.  Your post has already made me
> > realise it would be a good idea to have a script watching log file sizes to
> > catch problems like this early.
> >
> > Ross
>
> Thanks for your insights, I'm also relatively new to solaris but i've
> been on linux for years.  I've just read more into the logs and its
> giving these errors for all 3 of my disks (sd0,1,2).  I'm running a
> raidz1, unfortunately without any spares and I'm not too keen on
> removing the parity from my pool as I've got a lot of important files
> stored there.
>
> I would agree that this seems to be a recoverable error and nothing is
> getting corrupted thanks to ZFS.  The thing I'm worried about is if the
> entire batch is failing slowly and will all die at the same time.
>
> Hopefully some ZFS/hardware guru can comment on this before the world
> ends for me :P
>
> Thanks
>
> Matt
>
> No virus found in this outgoing message.
> Checked by AVG - http://www.avg.com
> Version: 8.0.138 / Virus Database: 270.5.10/1587 - Release Date: 02/08/2008 17:30


Re: [zfs-discuss] Replacing the boot HDDs in x4500

2008-08-01 Thread Ross Smith

Sorry Ian, I was posting on the forum and missed the word "disks" from my 
previous post.  I'm still not used to Sun's mutant cross of a message board / 
mailing list.
 
Ross
> Date: Fri, 1 Aug 2008 21:08:08 +1200
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
> CC: zfs-discuss@opensolaris.org
> Subject: Re: [zfs-discuss] Replacing the boot HDDs in x4500
>
> Ross wrote:
> > Wipe the snv_70b disks I meant.
> >
> What disks?  This message makes no sense without context.
>
> Context free messages are a pain in the arse for those of us who use the
> mail list.
>
> Ian


Re: [zfs-discuss] Can I trust ZFS?

2008-07-31 Thread Ross Smith

Hey Brent,
 
On the Sun hardware like the Thumper you do get a nice bright blue "ready to 
remove" led as soon as you issue the "cfgadm -c unconfigure xxx" command.  On 
other hardware it takes a little more care, I'm labelling our drive bays up 
*very* carefully to ensure we always remove the right drive.  Stickers are your 
friend, mine will probably be labelled "sata1/0", "sata1/1", "sata1/2", etc.
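
For anyone doing the same on whitebox hardware, the sequence that goes with those
bay labels is roughly the following; the attachment point and device names are
whatever cfgadm and ZFS report on your controller, and this assumes the pool has
enough redundancy to take a disk offline:

  # Find the attachment point of the disk you want to pull.
  cfgadm | grep sata

  # Take it offline in ZFS, then unconfigure the port before pulling the disk.
  zpool offline tank c2t3d0
  cfgadm -c unconfigure sata1/3

  # After fitting the replacement drive:
  cfgadm -c configure sata1/3
  zpool replace tank c2t3d0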
 
I know Sun are working to improve the LED support, but I don't know whether 
that support will ever be extended to 3rd party hardware:
http://blogs.sun.com/eschrock/entry/external_storage_enclosures_in_solaris
 
I'd love to use Sun hardware for this, but while things like x2200 servers are 
great value for money, Sun don't have anything even remotely competative to a 
standard 3U server with 16 SATA bays.  The x4240 is probably closest, but is at 
least double the price.  Even the J4200 arrays are more expensive than this 
entire server.
 
Ross
 
PS.  Once you've tested SCSI removal, could you add your results to my thread, 
would love to hear how that went:
http://www.opensolaris.org/jive/thread.jspa?threadID=67837&tstart=0
 
 
> This conversation piques my interest.. I have been reading a lot about
> Opensolaris/Solaris for the last few weeks.
> Have even spoken to Sun storage techs about bringing in Thumper/Thor for our
> storage needs.
> I have recently brought online a Dell server with a DAS (14 SCSI drives).
> This will be part of my tests now, physically removing a member of the pool
> before issuing the removal command for that particular drive.
>
> One other issue I have now also, how do you physically locate a
> failing/failed drive in ZFS?
>
> With hardware RAID sets, if the RAID controller itself detects the error, it
> will initiate a BLINK command to that drive, so the individual drive is now
> flashing red/amber/whatever on the RAID enclosure.
> How would this be possible with ZFS?  Say you have a JBOD enclosure, (14,
> hell maybe 48 drives).
> Knowing c0d0xx failed is no longer helpful, if only ZFS catches an error.
> Will you be able to isolate the drive quickly, to replace it?  Or will you be
> going "does the enclosure start at logical zero... left to right.. hrmmm"
>
> Thanks
> --
> Brent Jones
> [EMAIL PROTECTED]
 


Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed

2008-07-31 Thread Ross Smith
libc_hwcap2.so.1       7.2G   6.0G   1.1G    85%    /lib/libc.so.1
fd                       0K     0K     0K     0%    /dev/fd
swap                   4.7G    48K   4.7G     1%    /tmp
swap                   4.7G    76K   4.7G     1%    /var/run
/dev/dsk/c1t0d0s7      425G   4.8G   416G     2%    /export/home
 
6. 10:35am  It's now been two hours, neither "zpool status" nor "zfs list" have 
ever finished.  The file copy attempt has also been hung for over an hour 
(although that's not unexpected with 'wait' as the failmode).
 
Richard, you say ZFS is not silently failing, well for me it appears that it 
is.  I can't see any warnings from ZFS, I can't get any status information.  I 
see no way that I could find out what files are going to be lost on this server.
 
Yes, I'm now aware that the pool has hung since file operations are hanging, 
however had that been my first indication of a problem I believe I am now left 
in a position where I cannot find out either the cause, nor the files affected. 
 I don't believe I have any way to find out which operations had completed 
without error, but are not currently committed to disk.  I certainly don't get 
the status message you do saying permanent errors have been found in files.
 
I plugged the USB drive back in now, Solaris detected it ok, but ZFS is still 
hung.  The rest of /var/adm/messages is:
Jul 31 09:39:44 unknown smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small packet
Jul 31 09:45:22 unknown /sbin/dhcpagent[95]: [ID 732317 daemon.warning] accept_v4_acknak: ACK packet on nge0 missing mandatory lease option, ignored
Jul 31 09:45:38 unknown last message repeated 5 times
Jul 31 09:51:44 unknown smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small packet
Jul 31 10:03:44 unknown last message repeated 2 times
Jul 31 10:14:27 unknown /sbin/dhcpagent[95]: [ID 732317 daemon.warning] accept_v4_acknak: ACK packet on nge0 missing mandatory lease option, ignored
Jul 31 10:14:45 unknown last message repeated 5 times
Jul 31 10:15:44 unknown smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small packet
Jul 31 10:27:45 unknown smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small packet
Jul 31 10:36:25 unknown usba: [ID 691482 kern.warning] WARNING: /[EMAIL PROTECTED],0/pci15d9,[EMAIL PROTECTED],1/[EMAIL PROTECTED] (scsa2usb0): Reinserted device is accessible again.
Jul 31 10:39:45 unknown smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small packet
Jul 31 10:45:53 unknown /sbin/dhcpagent[95]: [ID 732317 daemon.warning] accept_v4_acknak: ACK packet on nge0 missing mandatory lease option, ignored
Jul 31 10:46:09 unknown last message repeated 5 times
Jul 31 10:51:45 unknown smbd[603]: [ID 766186 daemon.error] NbtDatagramDecode[11]: too small packet
 
7. 10:55am  Gave up on ZFS ever recovering.  A shutdown attempt hung as 
expected.  I hard-reset the computer.
 
Ross
 
 
> Date: Wed, 30 Jul 2008 11:17:08 -0700
> From: [EMAIL PROTECTED]
> Subject: Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
> To: [EMAIL PROTECTED]
> CC: zfs-discuss@opensolaris.org
>
> I was able to reproduce this in b93, but might have a different
> interpretation of the conditions.  More below...
>
> Ross Smith wrote:
> > A little more information today.  I had a feeling that ZFS would
> > continue quite some time before giving an error, and today I've shown
> > that you can carry on working with the filesystem for at least half an
> > hour with the disk removed.
> >
> > I suspect on a system with little load you could carry on working for
> > several hours without any indication that there is a problem.  It
> > looks to me like ZFS is caching reads & writes, and that provided
> > requests can be fulfilled from the cache, it doesn't care whether the
> > disk is present or not.
>
> In my USB-flash-disk-sudden-removal-while-writing-big-file-test,
> 1. I/O to the missing device stopped (as I expected)
> 2. FMA kicked in, as expected.
> 3. /var/adm/messages recorded "Command failed to complete... device gone."
> 4. After exactly 9 minutes, 17,951 e-reports had been processed and the
>    diagnosis was complete.  FMA logged the following to /var/adm/messages
>
> Jul 30 10:33:44 grond scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],0/pci1458,[EMAIL PROTECTED],1/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd1):
> Jul 30 10:33:44 grond   Command failed to complete...Device is gone
> Jul 30 10:42:31 grond fmd: [ID 441519 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
> Jul 30 10:42:31 grond EVENT-TIME: Wed Jul 30 10:42:30 PDT 2008
> Jul 30

Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed

2008-07-30 Thread Ross Smith

I agree that device drivers should perform the bulk of the fault monitoring, 
however I disagree that this absolves ZFS of any responsibility for checking 
for errors.  The primary goal of ZFS is to be a filesystem and maintain data 
integrity, and that entails both reading and writing data to the devices.  It 
is no good having checksumming when reading data if you are loosing huge 
amounts of data when a disk fails.
 
I'm not saying that ZFS should be monitoring disks and drivers to ensure they 
are working, just that if ZFS attempts to write data and doesn't get the 
response it's expecting, an error should be logged against the device 
regardless of what the driver says.  If ZFS is really about end-to-end data 
integrity, then you do need to consider the possibility of a faulty driver.  
Now I don't know what the root cause of this error is, but I suspect it will be 
either a bad response from the SATA driver, or something within ZFS that is not 
working correctly.  Either way however I believe ZFS should have caught this.
 
It's similar to the iSCSI problem I posted a few months back where the ZFS pool 
hangs for 3 minutes when a device is disconnected.  There's absolutely no need 
for the entire pool to hang when the other half of the mirror is working fine.  
ZFS is often compared to hardware raid controllers, but so far it's ability to 
handle problems is falling short.
 
Ross
 
> Date: Wed, 30 Jul 2008 09:48:34 -0500
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
> CC: zfs-discuss@opensolaris.org
> Subject: Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
>
> On Wed, 30 Jul 2008, Ross wrote:
> >
> > Imagine you had a raid-z array and pulled a drive as I'm doing here.
> > Because ZFS isn't aware of the removal it keeps writing to that
> > drive as if it's valid.  That means ZFS still believes the array is
> > online when in fact it should be degraded.  If any other drive now
> > fails, ZFS will consider the status degraded instead of faulted, and
> > will continue writing data.  The problem is, ZFS is writing some of
> > that data to a drive which doesn't exist, meaning all that data will
> > be lost on reboot.
>
> While I do believe that device drivers, or the fault system, should
> notify ZFS when a device fails (and ZFS should appropriately react), I
> don't think that ZFS should be responsible for fault monitoring.  ZFS
> is in a rather poor position for device fault monitoring, and if it
> attempts to do so then it will be slow and may misbehave in other
> ways.  The software which communicates with the device (i.e. the
> device driver) is in the best position to monitor the device.
>
> The primary goal of ZFS is to be able to correctly read data which was
> successfully committed to disk.  There are programming interfaces
> (e.g. fsync(), msync()) which may be used to ensure that data is
> committed to disk, and which should return an error if there is a
> problem.  If you were performing your tests over an NFS mount then the
> results should be considerably different since NFS requests that its
> data be committed to disk.
>
> Bob
> ======================================
> Bob Friesenhahn
> [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/


Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed

2008-07-29 Thread Ross Smith

A little more information today.  I had a feeling that ZFS would continue quite 
some time before giving an error, and today I've shown that you can carry on 
working with the filesystem for at least half an hour with the disk removed.
 
I suspect on a system with little load you could carry on working for several 
hours without any indication that there is a problem.  It looks to me like ZFS 
is caching reads & writes, and that provided requests can be fulfilled from the 
cache, it doesn't care whether the disk is present or not.
 
I would guess that ZFS is attempting to write to the disk in the background, 
and that this is silently failing.
 
Here's the log of the tests I did today.  After removing the drive, over a 
period of 30 minutes I copied folders to the filesystem, created an archive, 
set permissions, and checked properties.  I did this both in the command line 
and with the graphical file manager tool in Solaris.  Neither reported any 
errors, and all the data could be read & written fine.  Until the reboot, at 
which point all the data was lost, again without error.
 
If you're not interested in the detail, please skip to the end where I've got 
some thoughts on just how many problems there are here.
 
 
# zpool status test
  pool: test
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          c2t7d0    ONLINE       0     0     0

errors: No known data errors
# zfs list test
NAME   USED  AVAIL  REFER  MOUNTPOINT
test   243M   228G   242M  /test
# zpool list test
NAME   SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
test   232G   243M   232G     0%  ONLINE  -

-- drive removed --

# cfgadm | grep sata1/7
sata1/7    sata-port    empty    unconfigured    ok


-- cfgadmin knows the drive is removed.  How come ZFS does not? --

# cp -r /rc-pool/copytest /test/copytest
# zpool list test
NAME  SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
test  232G  73.4M   232G     0%  ONLINE  -
# zfs list test
NAME   USED  AVAIL  REFER  MOUNTPOINT
test   142K   228G    18K  /test


-- Yup, still up.  Let's start the clock --

# date
Tue Jul 29 09:31:33 BST 2008
# du -hs /test/copytest
 667K   /test/copytest


-- 5 minutes later, still going strong --

# date
Tue Jul 29 09:36:30 BST 2008
# zpool list test
NAME  SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
test  232G  73.4M   232G     0%  ONLINE  -
# cp -r /rc-pool/copytest /test/copytest2
# ls /test
copytest   copytest2
# du -h -s /test
 1.3M   /test
# zpool list test
NAME   SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
test   232G  73.4M   232G     0%  ONLINE  -
# find /test | wc -l
    2669
# find //test/copytest | wc -l
    1334
# find /rc-pool/copytest | wc -l
    1334
# du -h -s /rc-pool/copytest
 5.3M   /rc-pool/copytest


-- Not sure why the original pool has 5.3MB of data when I use du. --
-- File Manager reports that they both have the same size --


-- 15 minutes later it's still working.  I can read data fine --

# date
Tue Jul 29 09:43:04 BST 2008
# chmod 777 /test/*
# mkdir /rc-pool/test2
# cp -r /test/copytest2 /rc-pool/test2/copytest2
# find /rc-pool/test2/copytest2 | wc -l
    1334
# zpool list test
NAME  SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
test  232G  73.4M   232G     0%  ONLINE  -


-- and yup, the drive is still offline --

# cfgadm | grep sata1/7
sata1/7    sata-port    empty    unconfigured    ok

-- And finally, after 30 minutes the pool is still going strong --

# date
Tue Jul 29 09:59:56 BST 2008
# tar -cf /test/copytest.tar /test/copytest/*
# ls -l
total 3
drwxrwxrwx   3 root     root           3 Jul 29 09:30 copytest
-rwxrwxrwx   1 root     root     4626432 Jul 29 09:59 copytest.tar
drwxrwxrwx   3 root     root           3 Jul 29 09:39 copytest2
# zpool list test
NAME   SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
test   232G  73.4M   232G     0%  ONLINE  -
 
After a full 30 minutes there's no indication whatsoever of any problem.  
Checking properties of the folder in File Browser reports 2665 items, totalling 
9.0MB.
 
At this point I tried "# zfs set sharesmb=on test".  I didn't really expect it 
to work, and sure enough, that command hung.  zpool status also hung, so I had 
to reboot the server.
 
 
-- Rebooted server --
 
 
Now I found that not only are all the files I've written in the last 30 minutes 
missing, but in fact files that I had deleted several minutes prior to removing 
the drive have re-appeared.
 
 
-- /test mount point is still present, I'll probably have to remove that 
manually --
 
 
# cd /
# ls
bin         export      media       proc        system
boot        home        mnt         rc-pool     test
dev         kernel      net         rc-usb      tmp
devices     lib         opt         root        usr
etc         lost+found  platform    sbin        var


-- ZFS still has the pool mounted, but at least now it realises it's not 
working --


# zpool list
NAME  SIZE

Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed

2008-07-28 Thread Ross Smith

Heh, sounds like there are a few problems with that tool then.  I guess that's 
one of the benefits of me being so new to Solaris.  I'm still learning all the 
command line tools so I'm playing with the graphical stuff as much as possible. 
:)
 
Regarding the delay, I plan to have a go tomorrow and see just how much of a 
delay there can be.  I've definitely had the system up for 10 minutes still 
reading data that's going to disappear on reboot and suspect I can stretch it a 
lot longer than that.
 
The biggest concern for me with the delay is that the data appears fine to all 
intents & purposes.  You can read it off the pool and copy it elsewhere.  There 
doesn't seem to be any indication that it's going to disappear after a reboot.
> Date: Mon, 28 Jul 2008 13:35:21 -0500
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
> Subject: RE: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
>
> On Mon, 28 Jul 2008, Ross Smith wrote:
> >
> > "File Browser" is the name of the program that Solaris opens when
> > you open "Computer" on the desktop.  It's the default graphical file
> > manager.
>
> Got it.  I have brought it up once or twice.  I tend to distrust such
> tools since I am not sure if their implementation is sound.  In fact,
> usually it is not.
>
> Now that you mention this tool, I am going to see what happens when it
> enters my test directory containing a million files.  Hmmm, this turd
> says "Loading" and I see that system error messages are scrolling by
> as fast as dtrace can report them:
>
> nautilus  ioctl  25  Inappropriate ioctl for device
> nautilus  acl    89  Unsupported file system operation
> nautilus  ioctl  25  Inappropriate ioctl for device
> nautilus  acl    89  Unsupported file system operation
> nautilus  ioctl  25  Inappropriate ioctl for device
>
> we shall see if it crashes or if it eventually returns.  Ahhh, it has
> returned and declared that my directory with a million files is
> "(Empty)".  So much for a short stint of trusting this tool.
>
> > It does eventually stop copying with an error, but it takes a good
> > long while for ZFS to throw up that error, and even when it does,
> > the pool doesn't report any problems at all.
>
> The delayed error report may be ok but the pool not reporting a
> problem does not seem very ok.
>
> Bob
> ======================================
> Bob Friesenhahn
> [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/


Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed

2008-07-28 Thread Ross Smith

snv_91.  I downloaded snv_94 today so I'll be testing with that tomorrow.
> Date: Mon, 28 Jul 2008 09:58:43 -0700
> From: [EMAIL PROTECTED]
> Subject: Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
> To: [EMAIL PROTECTED]
>
> Which OS and revision?
> -- richard
>
> Ross wrote:
> > Ok, after doing a lot more testing of this I've found it's not the
> > Supermicro controller causing problems. It's purely ZFS, and it causes some
> > major problems! I've even found one scenario that appears to cause huge
> > data loss without any warning from ZFS - up to 30,000 files and 100MB of
> > data missing after a reboot, with zfs reporting that the pool is OK.
> >
> > ***
> > 1. Solaris handles USB and SATA hot plug fine
> >
> > If disks are not in use by ZFS, you can unplug USB or SATA devices, cfgadm
> > will recognise the disconnection. USB devices are recognised automatically
> > as you reconnect them, SATA devices need reconfiguring. Cfgadm even
> > recognises the SATA device as an empty bay:
> >
> > # cfgadm
> > Ap_Id      Type         Receptacle   Occupant      Condition
> > sata1/7    sata-port    empty        unconfigured  ok
> > usb1/3     unknown      empty        unconfigured  ok
> >
> > -- insert devices --
> >
> > # cfgadm
> > Ap_Id      Type         Receptacle   Occupant      Condition
> > sata1/7    disk         connected    unconfigured  unknown
> > usb1/3     usb-storage  connected    configured    ok
> >
> > To bring the sata drive online it's just a case of running
> > # cfgadm -c configure sata1/7
> >
> > ***
> > 2. If ZFS is using a hot plug device, disconnecting it will hang all ZFS
> > status tools.
> >
> > While pools remain accessible, any attempt to run "zpool status" will hang.
> > I don't know if there is any way to recover these tools once this happens.
> > While this is a pretty big problem in itself, it also makes me worry if
> > other types of error could have the same effect. I see potential for this
> > leaving a server in a state whereby you know there are errors in a pool,
> > but have no way of finding out what those errors might be without rebooting
> > the server.
> >
> > ***
> > 3. Once ZFS status tools are hung the computer will not shut down.
> >
> > The only way I've found to recover from this is to physically power down
> > the server. The Solaris shutdown process simply hangs.
> >
> > ***
> > 4. While reading an offline disk causes errors, writing does not!
> > *** CAUSES DATA LOSS ***
> >
> > This is a big one: ZFS can continue writing to an unavailable pool. It
> > doesn't always generate errors (I've seen it copy over 100MB before
> > erroring), and if not spotted, this *will* cause data loss after you
> > reboot.
> >
> > I discovered this while testing how ZFS coped with the removal of a hot
> > plug SATA drive. I knew that the ZFS admin tools were hanging, but that
> > redundant pools remained available. I wanted to see whether it was just the
> > ZFS admin tools that were failing, or whether ZFS was also failing to send
> > appropriate error messages back to the OS.
> >
> > These are the tests I carried out:
> >
> > Zpool: Single drive zpool, consisting of one 250GB SATA drive in a hot plug
> > bay.
> > Test data: A folder tree containing 19,160 items. 71.1MB in total.
> >
> > TEST1: Opened File Browser, copied the test data to the pool. Half way
> > through the copy I pulled the drive. THE COPY COMPLETED WITHOUT ERROR.
> > Zpool list reports the pool as online, however zpool status hung as
> > expected.
> >
> > Not quite believing the results, I rebooted and tried again.
> >
> > TEST2: Opened File Browser, copied the data to the pool. Pulled the drive
> > half way through. The copy again finished without error. Checking the
> > properties shows 19,160 files in the copy. ZFS list again shows the
> > filesystem as ONLINE.
> >
> > Now I decided to see how many files I could copy before it errored. I
> > started the copy again. File Browser managed a further 9,171 files before
> > it stopped. That's nearly 30,000 files before any error was detected.
> > Again, despite the copy having finally errored, zpool list shows the pool
> > as online, even though zpool status hangs.
> >
> > I rebooted the server, and found that after the reboot my first copy
> > contains just 10,952 items, and my second copy is completely missing.
> > That's a loss of almost 20,000 files. Zpool status however reports NO
> > ERRORS.
> >
> > For the third test I decided to see if these files are actually accessible
> > before the reboot:
> >
> > TEST3: This time I pulled the drive *before* starting the copy. The copy
> > started much slower this time and only got to 2,939 files before reporting
> > an error. At this point I copied all the files that had been copied to
> > another pool, and then rebooted.
> >
> > After the reboot, the folder i

Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed

2008-07-28 Thread Ross Smith

"File Browser" is the name of the program that Solaris opens when you open 
"Computer" on the desktop.  It's the default graphical file manager.
 
It does eventually stop copying with an error, but it takes a good long while 
for ZFS to throw up that error, and even when it does, the pool doesn't report 
any problems at all.
> Date: Mon, 28 Jul 2008 13:03:24 -0500
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
> CC: zfs-discuss@opensolaris.org
> Subject: Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
>
> On Mon, 28 Jul 2008, Ross wrote:
> >
> > TEST1: Opened File Browser, copied the test data to the pool.
> > Half way through the copy I pulled the drive. THE COPY COMPLETED
> > WITHOUT ERROR. Zpool list reports the pool as online, however zpool
> > status hung as expected.
>
> Are you sure that this reference software you call "File Browser"
> actually responds to errors? Maybe it is typical Linux-derived
> software which does not check for or handle errors, and ZFS is
> reporting errors all along while the program pretends to copy the lost
> files. If you were using Microsoft Windows, its file browser would
> probably report "Unknown error: 666", but at least you would see an
> error dialog and you could visit the Microsoft knowledge base to learn
> that message ID 666 means "Unknown error". The other possibility is
> that all of these files fit in the ZFS write cache, so the error
> reporting is delayed.
>
> The DTrace Toolkit provides a very useful DTrace script called
> 'errinfo' which will list every system call that reports an error.
> This is very useful and informative. If you run it, you will see
> every error reported to the application level.
>
> Bob
> ======================================
> Bob Friesenhahn
> [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
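
(For anyone who wants to try what Bob suggests without hunting down the full
DTrace Toolkit, a rough one-liner in the same spirit as errinfo - a sketch
only, not the errinfo script itself:)

  # count system calls that return an error, grouped by process name,
  # syscall and errno; run as root while repeating the copy, Ctrl-C for
  # the summary
  dtrace -n 'syscall:::return /errno != 0/ { @[execname, probefunc, errno] = count(); }'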
_
Invite your Facebook friends to chat on Messenger
http://clk.atdmt.com/UKM/go/101719649/direct/01/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] J4500 device renumbering

2008-07-15 Thread Ross Smith

It sounds like you might be interested to read up on Eric Schrock's work.  I 
read today about some of the stuff he's been doing to bring integrated fault 
management to Solaris:
http://blogs.sun.com/eschrock/entry/external_storage_enclosures_in_solaris
His last paragraph is great to see; Sun really do seem to be headed in the 
right direction:
 
"I often like to joke about the amount of time that I have spent just getting a 
single LED to light. At first glance, it seems like a pretty simple task. But 
to do it in a generic fashion that can be generalized across a wide variety of 
platforms, correlated with physically meaningful labels, and incorporate a 
diverse set of diagnoses (ZFS, SCSI, HBA, etc) requires an awful lot of work. 
Once it's all said and done, however, future platforms will require little to 
no integration work, and you'll be able to see a bad drive generate checksum 
errors in ZFS, resulting in a FMA diagnosis indicating the faulty drive, 
activate a hot spare, and light the fault LED on the drive bay (wherever it may 
be). Only then will we have accomplished our goal of an end-to-end storage 
strategy for Solaris - and hopefully someone besides me will know what it has 
taken to get that little LED to light."
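
(The diagnosis side of this can already be poked at from the command line
today; a quick sketch, with no claim about how much of the enclosure and LED
work Eric describes is wired up on any particular build:)

  # show any active fault diagnoses (ZFS, disk, HBA, ...) known to FMA
  fmadm faulty

  # dump the raw error reports that fed those diagnoses
  fmdump -e

  # and ask ZFS itself which pools are unhealthy
  zpool status -x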
 
Ross
 
> Date: Tue, 15 Jul 2008 12:51:22 -0500
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
> CC: zfs-discuss@opensolaris.org
> Subject: Re: [zfs-discuss] J4500 device renumbering
>
> On Tue, 15 Jul 2008, Ross wrote:
> >
> > Well I haven't used a J4500, but when we had an x4500 (Thumper) on
> > loan they had Solaris pretty well integrated with the hardware.
> > When a disk failed, I used cfgadm to offline it and as soon as I did
> > that a bright blue "Ready to Remove" LED lit up on the drive tray of
> > the faulty disk, right next to the handle you need to lift to remove
> > the drive.
>
> That sure sounds a whole lot easier to manage than my setup with a
> StorageTek 2540 and each drive as a LUN. The 2540 could detect a
> failed drive by itself and turn an LED on, but if ZFS decides that a
> drive has failed and the 2540 does not, then I will have to use the
> 2540's CAM administrative interface and manually set the drive out of
> service. I very much doubt that cfgadm will communicate with the 2540
> and tell it to do anything.
>
> A little while back I created this table so I could understand how
> things were mapped:
>
> Disk    Volume   LUN  WWN                                              Device                             ZFS
> ======  =======  ===  ===============================================  =================================  ====
> t85d01  Disk-01    0  60:0A:0B:80:00:3A:8A:0B:00:00:09:61:47:B4:51:BE  c4t600A0B80003A8A0B096147B451BEd0  P3-A
> t85d02  Disk-02    1  60:0A:0B:80:00:39:C9:B5:00:00:0A:9C:47:B4:52:2D  c4t600A0B800039C9B50A9C47B4522Dd0  P6-A
> t85d03  Disk-03    2  60:0A:0B:80:00:39:C9:B5:00:00:0A:A0:47:B4:52:9B  c4t600A0B800039C9B50AA047B4529Bd0  P1-B
> t85d04  Disk-04    3  60:0A:0B:80:00:3A:8A:0B:00:00:09:66:47:B4:53:CE  c4t600A0B80003A8A0B096647B453CEd0  P4-A
> t85d05  Disk-05    4  60:0A:0B:80:00:39:C9:B5:00:00:0A:A4:47:B4:54:4F  c4t600A0B800039C9B50AA447B4544Fd0  P2-B
> t85d06  Disk-06    5  60:0A:0B:80:00:3A:8A:0B:00:00:09:6A:47:B4:55:9E  c4t600A0B80003A8A0B096A47B4559Ed0  P1-A
> t85d07  Disk-07    6  60:0A:0B:80:00:39:C9:B5:00:00:0A:A8:47:B4:56:05  c4t600A0B800039C9B50AA847B45605d0  P3-B
> t85d08  Disk-08    7  60:0A:0B:80:00:3A:8A:0B:00:00:09:6E:47:B4:56:DA  c4t600A0B80003A8A0B096E47B456DAd0  P2-A
> t85d09  Disk-09    8  60:0A:0B:80:00:39:C9:B5:00:00:0A:AC:47:B4:57:39  c4t600A0B800039C9B50AAC47B45739d0  P4-B
> t85d10  Disk-10    9  60:0A:0B:80:00:39:C9:B5:00:00:0A:B0:47:B4:57:AD  c4t600A0B800039C9B50AB047B457ADd0  P5-B
> t85d11  Disk-11   10  60:0A:0B:80:00:3A:8A:0B:00:00:09:73:47:B4:57:D4  c4t600A0B80003A8A0B097347B457D4d0  P5-A
> t85d12  Disk-12   11  60:0A:0B:80:00:39:C9:B5:00:00:0A:B4:47:B4:59:5F  c4t600A0B800039C9B50AB447B4595Fd0  P6-B
>
> When I selected the drive pairings, it was based on a dump from a
> multipath utility and it seems that on a chassis level there is no
> rhyme or reason for the zfs mirror pairings.
>
> This is an area where traditional RAID hardware makes ZFS more
> difficult to use.
>
> Bob
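
(A rough way to rebuild a mapping like Bob's table from the host side - just a
sketch, with tank as a placeholder pool name and one WWN-style device taken
from the table above as an example:)

  # list the devices each vdev is built from, by c#t#d# name
  zpool status tank

  # the /dev/dsk entry for a LUN is a symlink; listing it shows the full
  # /devices path, which embeds the WWN-based disk node the array presented
  ls -l /dev/dsk/c4t600A0B80003A8A0B096147B451BEd0s0

  # cfgadm shows how the controller sees the attachment points
  cfgadm -al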
_
Find the best and worst places on the planet
http://clk.atdmt.com/UKM/go/101719807/direct/01/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] FW: please help with raid / failure / rebuild calculations

2008-07-15 Thread Ross Smith



Bits vs bytes. D'oh! again.  It's a good job I don't do these calculations 
professionally. :-)

> Date: Tue, 15 Jul 2008 02:30:33 -0400
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
> Subject: Re: [zfs-discuss] please help with raid / failure / rebuild calculations
> CC: zfs-discuss@opensolaris.org
>
> On Tue, Jul 15, 2008 at 01:58, Ross <[EMAIL PROTECTED]> wrote:
> > However, I'm not sure where the 8 is coming from in your calculations.
> Bits per byte ;)
>
> > In this case approximately 13/100 or around 1 in 8 odds.
> Taking into account the factor of 8, it's around 8 in 8.
>
> Another possible factor to consider in calculations of this nature is
> that you probably won't get a single bit flipped here or there. If
> drives take 512-byte sectors and apply Hamming codes to those 512
> bytes to get, say, 548 bytes of coded data that are actually written
> to disk, you need to flip (548-512)/2 = 16 bytes = 128 bits before you
> cannot correct them from the data you have. Thus, rather than getting
> one incorrect bit in a particular 4096-bit sector, you're likely to
> get all good sectors and one that's complete garbage. Unless the
> manufacturers' specifications account for this, I would say the sector
> error rate of the drive is about 1 in 4*(10**17). I have no idea
> whether they account for this or not, but it'd be interesting (and
> fairly doable) to test. Write a 1TB disk full of known data, then
> read it and verify. Then repeat until you have seen incorrect sectors
> a few times for a decent sample size, and store elsewhere what the
> sector was supposed to be and what it actually was.
>
> Will

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss