Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Yes, if you value your data you should change from USB drives to normal drives. I have heard that USB can do some strange things; a normal connection such as SATA is more reliable. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Hello, I'm now thinking there is some _real_ bug in the way zfs handles file systems created with the pool itself (i.e. the tank filesystem when the zpool is tank, usually mounted as /tank). My own experience shows that zfs is unable to send/receive recursively (snapshots, child filesystems) properly when the destination is such a "level 0" file system, i.e. othertank, though everything works as expected when I send to othertank/tank (see my posts). I think you might also be seeing some aspects of this problem. Bruno -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
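For anyone trying to reproduce what Bruno describes, a minimal sketch (pool and snapshot names here are made up) looks something like the following; whether the first receive misbehaves is exactly the open question:

# take a recursive snapshot of everything under "tank"
zfs snapshot -r tank@backup

# replicate into the root dataset of another pool -- the case reported as broken
zfs send -R tank@backup | zfs receive -F othertank

# replicate into a child dataset instead -- the case reported to work
zfs send -R tank@backup | zfs receive -F othertank/tank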
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
I just have to say this, and I don't mean it in a bad way... If you really care about your data, why then use USB drives with loose cables and (apparently) no backup? USB-connected drives for data backup are okay, and for playing around and getting to know ZFS they also seem okay. Using them for online data that you care about and expecting them to be reliable... it's just not the right technology for that, IMHO. ..Remco On 2/13/10 11:23 AM, Andy Stenger wrote: I had a very similar problem. 8 external USB drives running OpenSolaris native. When I moved the machine into a different room and powered it back up (there were a couple of reboots and a couple of broken USB cables and drive shutdowns in between), I got the same error. Losing that much data is definitely a shock. I'm running raidz2 and I would have assumed that two levels of redundancy should be fine to toss a lot of roughness at the pool. After panicking a little, stressing my family out, and some playing with zdb that led nowhere, I did a zpool export mypool zpool import mypool It complained about being unable to mount because the mount point was not empty, so I did umount /mypool/mypool zfs mount mypool/mypool zpool status mypool and to my relief it seems all fine. ls /mypool/mypool does show data. A scrub is running right now to be on the safe side. Thought that might help some folks out there. Cheers! Andy ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
I had a very similar problem. 8 external USB drives running OpenSolaris native. When I moved the machine into a different room and powered it back up (there were a couple of reboots and a couple of broken USB cables and drive shutdowns in between), I got the same error. Losing that much data is definitely a shock. I'm running raidz2 and I would have assumed that two levels of redundancy should be fine to toss a lot of roughness at the pool. After panicking a little, stressing my family out, and some playing with zdb that led nowhere, I did a zpool export mypool zpool import mypool It complained about being unable to mount because the mount point was not empty, so I did umount /mypool/mypool zfs mount mypool/mypool zpool status mypool and to my relief it seems all fine. ls /mypool/mypool does show data. A scrub is running right now to be on the safe side. Thought that might help some folks out there. Cheers! Andy ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
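For reference, the sequence Andy describes, written out one command per line ("mypool" is his pool name; this is just his report restated, not a guaranteed recovery recipe):

zpool export mypool
zpool import mypool        # complained that the mountpoint was not empty
umount /mypool/mypool      # clear the stale directory blocking the mount
zfs mount mypool/mypool
zpool status mypool        # check pool health
ls /mypool/mypool          # confirm the data is visible again
zpool scrub mypool         # verify all checksums, to be on the safe side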
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 4-Aug-09, at 9:28 AM, Roch Bourbonnais wrote: Le 26 juil. 09 à 01:34, Toby Thain a écrit : On 25-Jul-09, at 3:32 PM, Frank Middleton wrote: On 07/25/09 02:50 PM, David Magda wrote: Yes, it can be affected. If the snapshot's data structure / record is underneath the corrupted data in the tree then it won't be able to be reached. Can you comment on if/how mirroring or raidz mitigates this, or tree corruption in general? I have yet to lose a pool even on a machine with fairly pathological problems, but it is mirrored (and copies=2). I was also wondering if you could explain why the ZIL can't repair such damage. Finally, a number of posters blamed VB for ignoring a flush, but according to the evil tuning guide, without any application syncs, ZFS may wait up to 5 seconds before issuing a synch, ^^ of course this can never cause inconsistency. The issue under discussion is inconsistency - unexpected corruption of on-disk structures. and there must be all kinds of failure modes even on bare hardware where it never gets a chance to do one at shutdown. This is interesting if you do ZFS over iscsi because of the possibility of someone tripping over a patch cord or a router blowing a fuse. Doesn't this mean /any/ hardware might have this problem, albeit with much lower probability? The problem is assumed *ordering*. In this respect VB ignoring flushes and real hardware are not going to behave the same. --Toby I agree that noone should be ignoring cache flushes. However the path to corruption must involve some dropped acknowledged I/Os. The ueberblock I/O was issued to stable storage but the blocks it pointed to, which had reached the disk firmware earlier, never make it to stable storage. I can see this scenerio when the disk looses power Or if the host O/S crashes. All this applies to virtual IDE devices alone, of course. iSCSI is a different case entirely as presumably flushes/barriers are processed normally. but I don't see it with cutting power to the guest. Right, in this case it's unlikely or nearly impossible. --Toby When managing a zpool on external storage, do people export the pool before taking snapshots of the guest ? -r Thanks ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 26 July 2009 at 01:34, Toby Thain wrote: On 25-Jul-09, at 3:32 PM, Frank Middleton wrote: On 07/25/09 02:50 PM, David Magda wrote: Yes, it can be affected. If the snapshot's data structure / record is underneath the corrupted data in the tree then it won't be able to be reached. Can you comment on if/how mirroring or raidz mitigates this, or tree corruption in general? I have yet to lose a pool even on a machine with fairly pathological problems, but it is mirrored (and copies=2). I was also wondering if you could explain why the ZIL can't repair such damage. Finally, a number of posters blamed VB for ignoring a flush, but according to the evil tuning guide, without any application syncs, ZFS may wait up to 5 seconds before issuing a synch, and there must be all kinds of failure modes even on bare hardware where it never gets a chance to do one at shutdown. This is interesting if you do ZFS over iscsi because of the possibility of someone tripping over a patch cord or a router blowing a fuse. Doesn't this mean /any/ hardware might have this problem, albeit with much lower probability? The problem is assumed *ordering*. In this respect VB ignoring flushes and real hardware are not going to behave the same. --Toby I agree that no one should be ignoring cache flushes. However the path to corruption must involve some dropped acknowledged I/Os. The uberblock I/O was issued to stable storage, but the blocks it pointed to, which had reached the disk firmware earlier, never made it to stable storage. I can see this scenario when the disk loses power, but I don't see it when cutting power to the guest. When managing a zpool on external storage, do people export the pool before taking snapshots of the guest? -r Thanks ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 19 July 2009 at 16:47, Bob Friesenhahn wrote: On Sun, 19 Jul 2009, Ross wrote: The success of any ZFS implementation is *very* dependent on the hardware you choose to run it on. To clarify: "The success of any filesystem implementation is *very* dependent on the hardware you choose to run it on." ZFS requires that the hardware cache sync works and is respected. Yes. Without taking advantage of the drive caches, zfs would be considerably less performant. That, I'm not so sure about. When ZFS first came out, most pools were built on thumpers with a SATA device driver that did not handle NCQ concurrency. Enabling the write cache on a drive was a necessary way to have the drive firmware handle multiple requests with small service times. Today we've got better device drivers, but we've stopped comparing performance data with on/off settings on the disk write caches. The delta today might be a lot smaller than it used to be (and even less noticeable if one uses a slog on SSD). -r Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
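If anyone wants to redo that on/off comparison, the per-drive write cache can usually be toggled from format's expert mode on Solaris; the menu names below are from memory and apply to SCSI/SAS targets, so treat them as approximate (SATA drives behind some HBAs may not expose the setting at all):

format -e                  # expert mode, then select the disk
# format> cache
# cache> write_cache
# write_cache> display     # show the current setting
# write_cache> disable     # or: enable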
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Have you considered this? *Maybe* a little time travel to an old uberblock could help you? http://www.opensolaris.org/jive/thread.jspa?threadID=85794 -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
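For the curious, that thread boils down to inspecting the uberblock ring with zdb and then importing against an older transaction group. A rough sketch (device and pool names are placeholders, and the recovery-mode import only exists in newer builds, if at all on your release):

zdb -l /dev/rdsk/c1t0d0s0    # dump the vdev labels, which hold the uberblock array
zdb -u tank                  # show the active uberblock / txg of an importable pool
zpool import -F tank         # recovery-mode import: discard the last few txgs (recent builds only)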
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Jul 28, 2009, at 6:34 PM, Eric D. Mudama wrote: On Mon, Jul 27 at 13:50, Richard Elling wrote: On Jul 27, 2009, at 10:27 AM, Eric D. Mudama wrote: Can *someone* please name a single drive+firmware or RAID controller+firmware that ignores FLUSH CACHE / FLUSH CACHE EXT commands? Or worse, responds "ok" when the flush hasn't occurred? two seconds with google shows http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=183771&NewLang=en&Hilite=cache+flush Give it up. These things happen. Not much you can do about it, other than design around it. -- richard That example is a windows-specific, and is a software driver, where the data integrity feature must be manually disabled by the end user. The default behavior was always maximum data protection. I don't think you read the post. It specifically says, "Previous versions of the Promise drivers ignored the flush cache command until system power down. " Promise makes RAID controllers and has a firmware fix for this. This is the kind of thing we face: some performance engineer tries to get an edge by assuming there is only one case where cache flush matters. Another 2 seconds with google shows: http://sunsolve.sun.com/search/document.do?assetkey=1-66-27-1 (interestingly, for this one, fsck also fails) http://sunsolve.sun.com/search/document.do?assetkey=1-21-103622-06-1 http://forums.seagate.com/stx/board/message?board.id=freeagent&message.id=5060&query.id=3999#M5060 But they also get cache flush code wrong in the opposite direction. A good example of that is the notorious Seagate 1.5 TB disk "stutter" problem. NB, for the most part, vendors do not air their dirty laundry (eg bug reports) on the internet for those without support contracts. If you have a support contract, your search may show many more cases. While perhaps analagous at some level, the perpetual "your hardware must be crappy/cheap/not-as-expensive-as-mine" doesn't seem to be a sufficient explanation when things go wrong, like complete loss of a pool. As I said before, it is a systems engineering problem. If you do your own systems engineering, then you should make sure the components you select work as you expect. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Hi James Many thanks for finding & posting that link. I'm sure many people on this forum will be interested in trying out Brad Fitzpatrick's perl script 'diskchecker.pl'. It will be interesting to hear their results. I've not yet had time to work out how Brad's script works. It would be good if others here could take a critical look at it and feed back their comments to the forum. I'm disappointed that I've not had a reply from someone at Sun to explain how they test their hard drives. We've had a few people here quick to claim that most hard drives fail to sync/flush correctly, but AFAIK no one is saying how they know this. Have they actually tested, and if so, how have they tested? Or do they just know because of bad experiences, having lost lots of data? Best Regards Nigel Smith -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
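For reference, here is roughly how Brad's script is meant to be driven, going from his write-up, so treat the exact arguments as approximate. It needs a second machine as an observer, and you hard power-off the machine under test between the create and verify steps:

./diskchecker.pl -l                                       # on the observer machine: listen
./diskchecker.pl -s observer-host create /test/file 500   # on the test machine: write and report every acknowledged sync
# ...pull the power on the test machine mid-run, boot it again, then:
./diskchecker.pl -s observer-host verify /test/file       # reports any block acknowledged as synced but lost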
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Nigel Smith wrote: > David Magda wrote: >> This is also (theoretically) why a drive purchased from Sun is more >> that expensive then a drive purchased from your neighbourhood computer >> shop: Sun (and presumably other manufacturers) takes the time and >> effort to test things to make sure that when a drive says "I've synced >> the data", it actually has synced the data. This testing is what >> you're presumably paying for. > > So how do you test a hard drive to check it does actually sync the data? > How would you do it in theory? > And in practice? > > Now say we are talking about a virtual hard drive, > rather than a physical hard drive. > How would that affect the answer to the above questions? http://brad.livejournal.com/2116715.html has a utility that can be used to test if your systems (including virtual ones) properly sync data to disk when asked to. -- James Andrewartha ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Mon, Jul 27 at 13:50, Richard Elling wrote: On Jul 27, 2009, at 10:27 AM, Eric D. Mudama wrote: Can *someone* please name a single drive+firmware or RAID controller+firmware that ignores FLUSH CACHE / FLUSH CACHE EXT commands? Or worse, responds "ok" when the flush hasn't occurred? two seconds with google shows http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=183771&NewLang=en&Hilite=cache+flush Give it up. These things happen. Not much you can do about it, other than design around it. -- richard That example is Windows-specific, and is a software driver, where the data integrity feature must be manually disabled by the end user. The default behavior was always maximum data protection. While perhaps analogous at some level, the perpetual "your hardware must be crappy/cheap/not-as-expensive-as-mine" doesn't seem to be a sufficient explanation when things go wrong, like complete loss of a pool. -- Eric D. Mudama edmud...@mail.bounceswoosh.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
> This is also (theoretically) why a drive purchased > from Sun is more > that expensive then a drive purchased from your > neighbourhood computer > shop: It's more significant than that. Drives aimed at the consumer market are at a competitive disadvantage if they do handle cache flush correctly (since the popular hardware blog of the day will show that the device is far slower than the competitors that throw away the sync requests). Sun (and presumably other manufacturers) takes > the time and > effort to test things to make sure that when a drive > says "I've synced > the data", it actually has synced the data. This > testing is what > you're presumably paying for. It wouldn't cost any more for commercial vendors to implement cache flush properly, it is just that they are penalized by the market for doing so. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
> > Can *someone* please name a single drive+firmware or > RAID > controller+firmware that ignores FLUSH CACHE / FLUSH > CACHE EXT > commands? Or worse, responds "ok" when the flush > hasn't occurred? I think it would be a shorter list if one were to name the drives/controllers that actually implement a flush properly. > Everyone on this list seems to blame lying hardware > for ignoring > commands, but disks are relatively mature and I can't > believe that > major OEMs would qualify disks or other hardware that > willingly ignore > commands. It seems you have too much faith in major OEMs of storage, considering that 99.9% of the market is personal use, and for which a 2% throughput advantage over a competitor can make or break the profit margin on a device. Ignoring cache requests is guaranteed to get the best drive performance benchmarks regardless of what software is driving the device. For example, it is virtually impossible to find a USB drive that honors cache sync (to do so would require that the device stop completely until a fully synchronous USB transaction had made it to the device and the data had been written). Can you imagine how long a USB drive would sit on store shelves if it actually did do a proper cache sync? While USB is the extreme case, and it does get better the more expensive the drive is, it is still far from a given that any particular device properly handles cache flushes. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
I think people can understand the concept of missing flushes. The big conceptual problem is how this manages to hose an entire filesystem, which is assumed to have rather a lot of data which ZFS has already verified to be ok. Hardware ignoring flushes and losing recent data is understandable, I don't think anybody would argue with that. Losing access to your entire pool and multiple gigabytes of data because a few writes failed is a whole different story, and while I understand how it happens, ZFS appears to be unique among modern filesystems in suffering such a catastrophic failure so often. To give a quick personal example: I can plug a FAT32 USB disk into a Windows system, drag some files to it, and pull that drive at any point. I might lose a few files, but I've never lost the entire filesystem. Even if the absolute worst happened, I know I can run scandisk, chkdsk, or any number of file recovery tools and get my data back. I would never, ever attempt this with ZFS. For a filesystem like ZFS, whose integrity and stability are sold as being way better than existing filesystems, losing your entire pool is a bit of a shock. I know that work is going on to be able to recover pools, and I'll sleep a lot sounder at night once it is available. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 27-Jul-09, at 3:44 PM, Frank Middleton wrote: On 07/27/09 01:27 PM, Eric D. Mudama wrote: Everyone on this list seems to blame lying hardware for ignoring commands, but disks are relatively mature and I can't believe that major OEMs would qualify disks or other hardware that willingly ignore commands. You are absolutely correct, but if the cache flush command never makes it to the disk, then it won't see it. The contention is that by not relaying the cache flush to the disk, No - by COMPLETELY ignoring the flush. VirtualBox caused the OP to lose his pool. IMO this argument is bogus because AFAIK the OP didn't actually power his system down, so the data would still have been in the cache, and presumably have eventually have been written. The out-of-order writes theory is also somewhat dubious, since he was able to write 10TB without VB relaying the cache flushes. Huh? Of course he could. The guest didn't crash while he was doing it! The corruption occurred when the guest crashed (iirc). And the "out of order theory" need not be the *only possible* explanation, but it *is* sufficient. This is all highly hardware dependant, Not in the least. It's a logical problem. and AFAIK no one ever asked the OP what hardware he had, instead, blasting him for running VB on MSWindows. Which is certainly not relevant to my hypothesis of what broke. I don't care what host he is running. The argument is the same for all. Since IIRC he was using raw disk access, it is questionable whether or not MS was to blame, but in general it simply shouldn't be possible to lose a pool under any conditions. How about "when flushes are ignored"? It does raise the question of what happens in general if a cache flush doesn't happen if, for example, a system crashes in such a way that it requires a power cycle to restart, and the cache never gets flushed. Previous explanations have not dented your misunderstanding one iota. The problem is not that an attempted flush did not complete. It was that any and all flushes *prior to crash* were ignored. This is where the failure mode diverges from real hardware. Again, look: A B C FLUSH D E F FLUSH Note that it does not matter *at all* whether the 2nd flush completed. What matters from an integrity point of view is that the *previous* flush was completed (and synchronously). Visualise this on the two scenarios: 1) real hardware: (barring actual defects) that A,B,C were written was guaranteed by the first flush (otherwise D would never have been issued). Integrity of system is intact regardless of whether the 2nd flush completed. 2) VirtualBox: flush never happened. Integrity of system is lost, or at best unknown, if it depends on A,B,C all completing before D. ... Of course the ZIL isn't a journal in the traditional sense, and AFAIK it has no undo capability the way that a DBMS usually has, but it needs to be structured so that bizarre things that happen when something as robust as Solaris crashes don't cause data loss. A lot of engineering effort has been expended in UFS and ZFS to achieve just that. Which is why it's so nutty to undermine that by violating semantics in lower layers. The nightmare scenario is when one disk of a mirror begins to fail and the system comes to a grinding halt where even stop-a doesn't respond, and a power cycle is the only way out. Who knows what writes may or may not have been issued or what the state of the disk cache might be at such a time. Again, if the flush semantics are respected*, this is not a problem. 
--Toby * - "When this operation completes, previous writes are verifiably on durable media**." ** - Durable media meaning physical media in a bare metal environment, and potentially "virtual media" in a virtualised environment. -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
David Magda wrote: > This is also (theoretically) why a drive purchased from Sun is more > that expensive then a drive purchased from your neighbourhood computer > shop: Sun (and presumably other manufacturers) takes the time and > effort to test things to make sure that when a drive says "I've synced > the data", it actually has synced the data. This testing is what > you're presumably paying for. So how do you test a hard drive to check it does actually sync the data? How would you do it in theory? And in practice? Now say we are talking about a virtual hard drive, rather than a physical hard drive. How would that affect the answer to the above questions? Thanks Nigel -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 27-Jul-09, at 15:14, David Magda wrote: Also, I think it may have already been posted, but I haven't found the option to disable VirtualBox' disk cache. Anyone have the incantation handy? http://forums.virtualbox.org/viewtopic.php?f=8&t=13661&start=0 It tells VB not to ignore the sync/flush command. Caching is still enabled (it wasn't the problem). Thanks! As Russell points out in the last post to that thread, it doesn't seem possible to do this with virtual SATA disks? Odd. A. -- Adam Sherman CTO, Versature Corp. Tel: +1.877.498.3772 x113 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Jul 27, 2009, at 10:27 AM, Eric D. Mudama wrote: On Sun, Jul 26 at 1:47, David Magda wrote: On Jul 25, 2009, at 16:30, Carson Gaspar wrote: Frank Middleton wrote: Doesn't this mean /any/ hardware might have this problem, albeit with much lower probability? No. You'll lose unwritten data, but won't corrupt the pool, because the on-disk state will be sane, as long as your iSCSI stack doesn't lie about data commits or ignore cache flush commands. But this entire thread started because Virtual Box's virtual disk / did/ lie about data commits. Why is this so difficult for people to understand? Because most people make the (not unreasonable assumption) that disks save data the way that they're supposed to: that the data goes in is the data that comes out, and that when the OS tells them to empty the buffer that they actually flush it. It's only us storage geeks that generally know the ugly truth that this assumption is not always true. :) Can *someone* please name a single drive+firmware or RAID controller+firmware that ignores FLUSH CACHE / FLUSH CACHE EXT commands? Or worse, responds "ok" when the flush hasn't occurred? two seconds with google shows http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=183771&NewLang=en&Hilite=cache+flush Give it up. These things happen. Not much you can do about it, other than design around it. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 07/27/09 01:27 PM, Eric D. Mudama wrote: Everyone on this list seems to blame lying hardware for ignoring commands, but disks are relatively mature and I can't believe that major OEMs would qualify disks or other hardware that willingly ignore commands. You are absolutely correct, but if the cache flush command never makes it to the disk, then it won't see it. The contention is that by not relaying the cache flush to the disk, VirtualBox caused the OP to lose his pool. IMO this argument is bogus because AFAIK the OP didn't actually power his system down, so the data would still have been in the cache, and presumably would eventually have been written. The out-of-order writes theory is also somewhat dubious, since he was able to write 10TB without VB relaying the cache flushes. This is all highly hardware dependent, and AFAIK no one ever asked the OP what hardware he had, instead blasting him for running VB on MS Windows. Since IIRC he was using raw disk access, it is questionable whether or not MS was to blame, but in general it simply shouldn't be possible to lose a pool under any conditions. It does raise the question of what happens in general if a cache flush doesn't happen if, for example, a system crashes in such a way that it requires a power cycle to restart, and the cache never gets flushed. Do disks with volatile caches attempt to flush the cache by themselves if they detect power down? It seems that the ZFS team recognizes this as a problem, hence the CR to address it. It turns out (at least on this almost 4-year-old blog) http://blogs.sun.com/perrin/entry/the_lumberjack that the ZILs /are/ allocated recursively from the main pool. Unless there is a ZIL for the ZILs, ZFS really isn't fully journalled, and this could be the real explanation for all lost pools and/or file systems. It would be great to hear from the ZFS team that writing a ZIL, presumably a transaction in its own right, is protected somehow (by a ZIL for the ZILs?). Of course the ZIL isn't a journal in the traditional sense, and AFAIK it has no undo capability the way that a DBMS usually has, but it needs to be structured so that bizarre things that happen when something as robust as Solaris crashes don't cause data loss. The nightmare scenario is when one disk of a mirror begins to fail and the system comes to a grinding halt where even stop-a doesn't respond, and a power cycle is the only way out. Who knows what writes may or may not have been issued or what the state of the disk cache might be at such a time. -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Mon, July 27, 2009 13:59, Adam Sherman wrote: > Also, I think it may have already been posted, but I haven't found the > option to disable VirtualBox' disk cache. Anyone have the incantation > handy? http://forums.virtualbox.org/viewtopic.php?f=8&t=13661&start=0 It tells VB not to ignore the sync/flush command. Caching is still enabled (it wasn't the problem). ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
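For the archives, the incantation from that thread is a per-VM extradata key, roughly as below (from memory, so double-check against the forum post); the controller name and LUN number are the parts most likely to differ on your setup, and it is unclear whether the SATA/ahci device honours the same key:

# tell VirtualBox to stop ignoring flush requests on the first IDE disk of the VM "opensolaris"
VBoxManage setextradata "opensolaris" "VBoxInternal/Devices/piix3ide/0/LUN#0/Config/IgnoreFlush" 0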
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Mon, Jul 27, 2009 at 12:54 PM, Chris Ridd wrote: > > On 27 Jul 2009, at 18:49, Thomas Burgess wrote: > >> >> i was under the impression it was virtualbox and it's default setting that >> ignored the command, not the hard drive > > Do other virtualization products (eg VMware, Parallels, Virtual PC) have the > same default behaviour as VirtualBox? I've lost a pool due to LDoms doing the same. This bug seems to be related. http://bugs.opensolaris.org/view_bug.do?bug_id=6684721 -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 27-Jul-09, at 13:54 , Chris Ridd wrote: i was under the impression it was virtualbox and it's default setting that ignored the command, not the hard drive Do other virtualization products (eg VMware, Parallels, Virtual PC) have the same default behaviour as VirtualBox? I've a suspicion they all behave similarly dangerously, but actual data would be useful. Also, I think it may have already been posted, but I haven't found the option to disable VirtualBox' disk cache. Anyone have the incantation handy? Thanks, A -- Adam Sherman CTO, Versature Corp. Tel: +1.877.498.3772 x113 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 27 Jul 2009, at 18:49, Thomas Burgess wrote: i was under the impression it was virtualbox and it's default setting that ignored the command, not the hard drive Do other virtualization products (eg VMware, Parallels, Virtual PC) have the same default behaviour as VirtualBox? I've a suspicion they all behave similarly dangerously, but actual data would be useful. Cheers, Chris ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
i was under the impression it was virtualbox and it's default setting that ignored the command, not the hard drive On Mon, Jul 27, 2009 at 1:27 PM, Eric D. Mudama wrote: > On Sun, Jul 26 at 1:47, David Magda wrote: > >> >> On Jul 25, 2009, at 16:30, Carson Gaspar wrote: >> >> Frank Middleton wrote: >>> >>> Doesn't this mean /any/ hardware might have this problem, albeit with much lower probability? >>> >>> No. You'll lose unwritten data, but won't corrupt the pool, because the >>> on-disk state will be sane, as long as your iSCSI stack doesn't lie about >>> data commits or ignore cache flush commands. >>> >> >> But this entire thread started because Virtual Box's virtual disk / >> did/ lie about data commits. >> >> Why is this so difficult for people to understand? >>> >> >> Because most people make the (not unreasonable assumption) that disks save >> data the way that they're supposed to: that the data goes in is the data >> that comes out, and that when the OS tells them to empty the buffer that >> they actually flush it. >> >> It's only us storage geeks that generally know the ugly truth that this >> assumption is not always true. :) >> > > Can *someone* please name a single drive+firmware or RAID > controller+firmware that ignores FLUSH CACHE / FLUSH CACHE EXT > commands? Or worse, responds "ok" when the flush hasn't occurred? > > Everyone on this list seems to blame lying hardware for ignoring > commands, but disks are relatively mature and I can't believe that > major OEMs would qualify disks or other hardware that willingly ignore > commands. > > --eric > > -- > Eric D. Mudama > edmud...@mail.bounceswoosh.org > > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Sun, Jul 26 at 1:47, David Magda wrote: On Jul 25, 2009, at 16:30, Carson Gaspar wrote: Frank Middleton wrote: Doesn't this mean /any/ hardware might have this problem, albeit with much lower probability? No. You'll lose unwritten data, but won't corrupt the pool, because the on-disk state will be sane, as long as your iSCSI stack doesn't lie about data commits or ignore cache flush commands. But this entire thread started because Virtual Box's virtual disk / did/ lie about data commits. Why is this so difficult for people to understand? Because most people make the (not unreasonable assumption) that disks save data the way that they're supposed to: that the data goes in is the data that comes out, and that when the OS tells them to empty the buffer that they actually flush it. It's only us storage geeks that generally know the ugly truth that this assumption is not always true. :) Can *someone* please name a single drive+firmware or RAID controller+firmware that ignores FLUSH CACHE / FLUSH CACHE EXT commands? Or worse, responds "ok" when the flush hasn't occurred? Everyone on this list seems to blame lying hardware for ignoring commands, but disks are relatively mature and I can't believe that major OEMs would qualify disks or other hardware that willingly ignore commands. --eric -- Eric D. Mudama edmud...@mail.bounceswoosh.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Heh, I'd kill for failures to be handled in 2 or 3 seconds. I saw the failure of a mirrored iSCSI disk lock the entire pool for 3 minutes. That has been addressed now, but device hangs have the potential to be *very* disruptive. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
> That's only one element of it Bob. ZFS also needs > devices to fail quickly and in a predictable manner. > > A consumer grade hard disk could lock up your entire > pool as it fails. The kit Sun supply is more likely > to fail in a manner ZFS can cope with. I agree 100%. Hardware, firmware, and drivers should be fully integrated for a mission-critical app. With the wrong firmware and consumer-grade HDs, disk failures stall the entire pool. I have experience with disks failing and taking two or three seconds for the system to cope with (not just ZFS, but the controller, etc). Leal. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 26-Jul-09, at 11:08 AM, Frank Middleton wrote: On 07/25/09 04:30 PM, Carson Gaspar wrote: No. You'll lose unwritten data, but won't corrupt the pool, because the on-disk state will be sane, as long as your iSCSI stack doesn't lie about data commits or ignore cache flush commands. Why is this so difficult for people to understand? Let me create a simple example for you. Are you sure about this example? AFAIK metadata refers to things like the file's name, atime, ACLs, etc., etc. Your example seems to be more about how a journal works, which has little to do with metatdata other than to manage it. Now if you were too lazy to bother to follow the instructions properly, we could end up with bizarre things. This is what happens when storage lies and re-orders writes across boundaries. On 07/25/09 07:34 PM, Toby Thain wrote: The problem is assumed *ordering*. In this respect VB ignoring flushes and real hardware are not going to behave the same. Why? An ignored flush is ignored. It may be more likely in VB, but it can always happen. And whenever it does: guess what happens? It mystifies me that VB would in some way alter the ordering. Carson already went through a more detailed explanation. Let me try a different one: ZFS issues writes A, B, C, FLUSH, D, E, F. case 1) the semantics of the flush* allow ZFS to presume that A, B, C are all 'committed' at the point that D is issued. You can understand that A, B, C may be done in any order, and D, E, F may be done in any order, due to the numerous abstraction layers involved - all the way down to the disk's internal scheduling. ANY of these layers can affect the ordering of durable, physical writes _in the absence of a flush/barrier_. case 2) but if the flush does NOT occur with the necessary semantics, the ordering of ALL SIX operations is now indeterminate, and by the time ZFS issues D, any of the first 3 (A, B, C) may well not have been committed at all. There is a very good chance this will violate an integrity assumption (I haven't studied the source so I can't point you to a specific design detail or line; rather I am working from how I understand transactional/journaled systems to work. Assuming my argument is valid, I am sure a ZFS engineer can cite a specific violation). As has already been mentioned in this context, I think by David Magda, ordinary hardware will show this problem _if flushes are not functioning_ (an unusual case on bare metal), while on VirtualBox this is the default! ... Doesn't ZIL effectively make ZFS into a journalled file system Of course ZFS is transactional, as are other filesystems and software systems, such as RDBMS. But integrity of such systems depends on a hardware flush primitive that actually works. We are getting hoarse repeating this. --Toby * Essentially 'commit' semantics: Flush synchronously, operation is complete only when data is durably stored. ... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Sun, 26 Jul 2009, David Magda wrote: That's the whole point of this thread: what should happen, or what should the file system do, when the drive (real or virtual) lies about the syncing? It's just as much a problem with any other POSIX file system (which have to deal with fsync(2))--ZFS isn't that special in that regard. The Linux folks went through a protracted debate on a similar issue not too long ago: Zfs is pretty darn special. RAIDed disk setups under Linux or *BSD work differently than zfs in a rather big way. Consider that with a normal software-based RAID setup, you use OS tools to create a virtual RAIDed device (LUN) which appears as a large device that you can then create (e.g. mkfs) a traditional filesystem on top of. Zfs works quite differently in that it uses a pooled design which incorporates several RAID strategies directly. Instead of sending the data to a virtual device which then arranges the underlying data according to a policy (striping, mirror, RAID5), zfs incorporates knowledge of the vdev RAID strategy and intelligently issues data to the disks in an ideal order, executing the disk drive commit requests directly. Zfs removes the RAID obfuscation which exists in traditional RAID systems. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
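A concrete way to see the difference Bob describes (device names below are placeholders): with the traditional stack you build an opaque RAID LUN first and then put a filesystem on top of it, whereas with ZFS the redundancy is declared inside the pool and the filesystem issues and commits I/O with that knowledge:

# traditional layering, with Solaris Volume Manager as the example
metainit d11 1 1 c1t0d0s0       # submirror on the first disk
metainit d12 1 1 c1t1d0s0       # submirror on the second disk
metainit d10 -m d11             # create the mirror from one half
metattach d10 d12               # attach the second half
newfs /dev/md/rdsk/d10          # then put UFS on top of the opaque LUN

# ZFS: the redundancy policy lives in the pool itself, no intermediate LUN
zpool create tank mirror c1t0d0 c1t1d0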
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 07/25/09 04:30 PM, Carson Gaspar wrote: No. You'll lose unwritten data, but won't corrupt the pool, because the on-disk state will be sane, as long as your iSCSI stack doesn't lie about data commits or ignore cache flush commands. Why is this so difficult for people to understand? Let me create a simple example for you. Are you sure about this example? AFAIK metadata refers to things like the file's name, atime, ACLs, etc., etc. Your example seems to be more about how a journal works, which has little to do with metadata other than to manage it. Now if you were too lazy to bother to follow the instructions properly, we could end up with bizarre things. This is what happens when storage lies and re-orders writes across boundaries. On 07/25/09 07:34 PM, Toby Thain wrote: The problem is assumed *ordering*. In this respect VB ignoring flushes and real hardware are not going to behave the same. Why? An ignored flush is ignored. It may be more likely in VB, but it can always happen. It mystifies me that VB would in some way alter the ordering. I wonder if the OP could tell us what actual disks and controller he used to see if the hardware might actually have done out-of-order writes despite the fact that ZFS already does write optimization. Maybe the disk didn't like the physical location of the log relative to the data so it wrote the data first? Even then it isn't obvious why this would cause the pool to be lost. A traditional journalling file system should survive the loss of a flush. Either the log entry was written or it wasn't. Even if the disk, for some bizarre reason, writes some of the actual data before writing the log, the repair process should undo that. If written properly, it will use the information in the most current complete journal entry to repair the file system. Doing syncs is devastating to performance, so usually there's an option to disable them, at the known risk of losing a lot more data. I've been using SPARCs and Solaris from the beginning. Ever since UFS supported journalling, I've never lost a file unless the disk went totally bad, and none since mirroring. Didn't miss fsck either :-) Doesn't ZIL effectively make ZFS into a journalled file system (in another thread, Bob Friesenhahn says it isn't, but I would submit that the general opinion is correct that it is; "log" and "journal" have similar semantics)? The evil tuning guide is pretty emphatic about not disabling it! My intuition (and this is entirely speculative) is that the ZFS ZIL either doesn't contain everything needed to restore the superstructure, or that if it does, the recovery process is ignoring it. I think I read that the ZIL is per-file system, but one hopes it doesn't rely on the superstructure recursively, or this would be impossible to fix (maybe there's a ZIL for the ZILs :) ). On 07/21/09 11:53 AM, George Wilson wrote: We are working on the pool rollback mechanism and hope to have that soon. The ZFS team recognizes that not all hardware is created equal and thus the need for this mechanism. We are using the following CR as the tracker for this work: 6667683 need a way to rollback to an uberblock from a previous txg so maybe this discussion is moot :-) -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Jul 25, 2009, at 15:32, Frank Middleton wrote: Can you comment on if/how mirroring or raidz mitigates this, or tree corruption in general? I have yet to lose a pool even on a machine with fairly pathological problems, but it is mirrored (and copies=2). Presumably at least one of the drives in the mirror or RAID set would have the correct data or non-corrupted data structures. There was a thread a while back on the risks involved in a SAN LUN (served from something like an EMC array), and whether you could trust the array or whether you should mirror LUNs. (I think the consensus was it was best to mirror LUNs--even from SANs, which presumably are more reliable than consumer SATA drives). I was also wondering if you could explain why the ZIL can't repair such damage. Beyond my knowledge. Finally, a number of posters blamed VB for ignoring a flush, but according to the evil tuning guide, without any application syncs, ZFS may wait up to 5 seconds before issuing a synch, and there Yes, it will sync every 5 to 30 seconds, but how do you know the data is actually synced?! If the five second timer triggers and ZFS says "okay, time to sync", and goes through the proper procedures, what happens if the drive lies about the sync operation? What then? That's the whole point of this thread: what should happen, or what should the file system do, when the drive (real or virtual) lies about the syncing? It's just as much a problem with any other POSIX file system (which have to deal with fsync(2))--ZFS isn't that special in that regard. The Linux folks went through a protracted debate on a similar issue not too long ago: http://thunk.org/tytso/blog/2009/03/15/dont-fear-the-fsync/ http://lwn.net/Articles/322823/ tripping over a patch cord or a router blowing a fuse. Doesn't this mean /any/ hardware might have this problem, albeit with much lower probability? Yes, which is why it's always recommended to have redundancy in your configuration (mirroring or RAID-Z). This way, hopefully, at least one drive is in a consistent state. This is also (theoretically) why a drive purchased from Sun is more expensive than a drive purchased from your neighbourhood computer shop: Sun (and presumably other manufacturers) takes the time and effort to test things to make sure that when a drive says "I've synced the data", it actually has synced the data. This testing is what you're presumably paying for. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Jul 25, 2009, at 16:30, Carson Gaspar wrote: Frank Middleton wrote: Doesn't this mean /any/ hardware might have this problem, albeit with much lower probability? No. You'll lose unwritten data, but won't corrupt the pool, because the on-disk state will be sane, as long as your iSCSI stack doesn't lie about data commits or ignore cache flush commands. But this entire thread started because Virtual Box's virtual disk /did/ lie about data commits. Why is this so difficult for people to understand? Because most people make the (not unreasonable) assumption that disks save data the way they're supposed to: that the data that goes in is the data that comes out, and that when the OS tells them to empty the buffer they actually flush it. It's only us storage geeks that generally know the ugly truth that this assumption is not always true. :) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 25-Jul-09, at 3:32 PM, Frank Middleton wrote: On 07/25/09 02:50 PM, David Magda wrote: Yes, it can be affected. If the snapshot's data structure / record is underneath the corrupted data in the tree then it won't be able to be reached. Can you comment on if/how mirroring or raidz mitigates this, or tree corruption in general? I have yet to lose a pool even on a machine with fairly pathological problems, but it is mirrored (and copies=2). I was also wondering if you could explain why the ZIL can't repair such damage. Finally, a number of posters blamed VB for ignoring a flush, but according to the evil tuning guide, without any application syncs, ZFS may wait up to 5 seconds before issuing a synch, and there must be all kinds of failure modes even on bare hardware where it never gets a chance to do one at shutdown. This is interesting if you do ZFS over iscsi because of the possibility of someone tripping over a patch cord or a router blowing a fuse. Doesn't this mean /any/ hardware might have this problem, albeit with much lower probability? The problem is assumed *ordering*. In this respect VB ignoring flushes and real hardware are not going to behave the same. --Toby Thanks ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Frank Middleton wrote: Finally, a number of posters blamed VB for ignoring a flush, but according to the evil tuning guide, without any application syncs, ZFS may wait up to 5 seconds before issuing a synch, and there must be all kinds of failure modes even on bare hardware where it never gets a chance to do one at shutdown. This is interesting if you do ZFS over iscsi because of the possibility of someone tripping over a patch cord or a router blowing a fuse. Doesn't this mean /any/ hardware might have this problem, albeit with much lower probability? No. You'll lose unwritten data, but won't corrupt the pool, because the on-disk state will be sane, as long as your iSCSI stack doesn't lie about data commits or ignore cache flush commands. Why is this so difficult for people to understand? Let me create a simple example for you. Get yourself 4 small pieces of paper, and number them 1 through 4. On piece 1, write "Four" (app write disk A) On piece 2, write "Score" (app write disk B) Place pieces 1 and 2 together on the side (metadata write, cache flush) On piece 3, write "Every" (app overwrite disk A) On piece 4, write "Good" (app overwrite disk B) Place pieces 3 and 4 on top of pieces 1 and 2 (metadata write, cache flush) IFF you obeyed the instructions, the only things you could ever have on the side are nothing, "Four Score", or "Every Good" (we assume that side placement is atomic). You could get killed after writing something on pieces 3 or 4, and lose them, but you could never have garbage. Now if you were too lazy to bother to follow the instructions properly, we could end up with bizarre things. This is what happens when storage lies and re-orders writes across boundaries. -- Carson ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 07/25/09 02:50 PM, David Magda wrote: Yes, it can be affected. If the snapshot's data structure / record is underneath the corrupted data in the tree then it won't be able to be reached. Can you comment on if/how mirroring or raidz mitigates this, or tree corruption in general? I have yet to lose a pool even on a machine with fairly pathological problems, but it is mirrored (and copies=2). I was also wondering if you could explain why the ZIL can't repair such damage. Finally, a number of posters blamed VB for ignoring a flush, but according to the evil tuning guide, without any application syncs, ZFS may wait up to 5 seconds before issuing a synch, and there must be all kinds of failure modes even on bare hardware where it never gets a chance to do one at shutdown. This is interesting if you do ZFS over iscsi because of the possibility of someone tripping over a patch cord or a router blowing a fuse. Doesn't this mean /any/ hardware might have this problem, albeit with much lower probability? Thanks ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
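For anyone wondering what a setup like Frank's looks like in practice, a minimal sketch (disk names invented): the mirror protects against a failed device, and copies=2 additionally stores two copies of each block in that dataset, so single-block damage can often be self-healed even within one side of the mirror (it only applies to data written after the property is set):

zpool create tank mirror c1t0d0 c1t1d0   # two-way mirror
zfs create tank/home
zfs set copies=2 tank/home               # keep two copies of every block written from now on
zpool scrub tank                         # periodically verify all checksums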
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Jul 25, 2009, at 14:17, roland wrote: thanks for the explanation ! one more question: there are situations where the disks doing strange things (like lying) have caused the ZFS data structures to become wonky. The 'broken' data structure will cause all branches underneath it to be lost--and if it's near the top of the tree, it could mean a good portion of the pool is inaccessible. can snapshots also be affected by such issue or are they somewhat "immune" here? Yes, it can be affected. If the snapshot's data structure / record is underneath the corrupted data in the tree then it won't be able to be reached. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Thanks for the explanation! One more question: > there are situations where the disks doing strange things >(like lying) have caused the ZFS data structures to become wonky. The >'broken' data structure will cause all branches underneath it to be >lost--and if it's near the top of the tree, it could mean a good >portion of the pool is inaccessible. Can snapshots also be affected by such an issue, or are they somewhat "immune" here? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Jul 25, 2009, at 12:24, roland wrote: why can I lose a whole 10TB pool including all the snapshots with the logging/transactional nature of zfs? Because ZFS does not (yet) have an (easy) way to go back to a previous state. That's what this bug is about: need a way to rollback to an uberblock from a previous txg http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6667683 While in most cases ZFS will cleanly recover after a non-clean shutdown, there are situations where the disks doing strange things (like lying) have caused the ZFS data structures to become wonky. The 'broken' data structure will cause all branches underneath it to be lost--and if it's near the top of the tree, it could mean a good portion of the pool is inaccessible. Fixing the above bug should hopefully allow users / sysadmins to tell ZFS to go 'back in time' and look up previous versions of the data structures. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
>As soon as you have more than one disk in the equation, then it is >vital that the disks commit their data when requested since otherwise >the data on disk will not be in a consistent state. ok, but doesn't that refer only to the most recent data? why can i lose a whole 10TB pool including all the snapshots with the logging/transactional nature of zfs? isn't the data in the snapshots set to read only so all blocks with snapshotted data don't change over time (and thus give a secure "entry" to a consistent point in time)? ok, these are probably some short-sighted questions, but i'm trying to understand how things could go wrong with zfs and how issues like these happen. on other filesystems, we have tools for fsck as a last resort or tools to recover data from unmountable filesystems. with zfs i don't know of any of these, so it's that "will solaris mount my zfs after the next crash?" question which frightens me a little bit. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Sat, 25 Jul 2009, roland wrote: When that happens, ZFS believes the data is safely written, but a power cut or crash can cause severe problems with the pool. didn't i read a million times that zfs ensures an "always consistent state" and is self healing, too? so, if new blocks are always written at new positions - why can't we just roll back to a point in time (for example the last snapshot) which is known to be safe/consistent? As soon as you have more than one disk in the equation, then it is vital that the disks commit their data when requested since otherwise the data on disk will not be in a consistent state. If the disks simply do whatever they want then some disks will have written the data while other disks will still have it cached. This blows the "consistent state on disk" even though zfs wrote the data in order and did all the right things. Any uncommitted data in disk cache will be forgotten if the system loses power. There is an additional problem if, when the disks finally get around to writing the cached data, they write it in a different order than requested while ignoring the commit request. It is common for the disks to write data in the most efficient order, but they absolutely must commit all of the data when requested so that the checkpoint is valid. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
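If you want to know whether a particular drive in your pool is running with its volatile write cache turned on, the expert mode of format(1M) can show and toggle it on many controllers (not all; SATA disks behind some HBAs will not expose the menu). A rough interactive sketch, with the disk name being just an example:

format -e
(pick the disk, e.g. c1t2d0)
format> cache
cache> write_cache
write_cache> display
write_cache> disable     (only if you are prepared to trade performance for safety)

Note that when ZFS is given whole disks it enables the write cache itself and relies on issuing cache flushes at the right moments, so the real question is whether everything between ZFS and the platters honours those flushes.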
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
>Running this kind of setup absolutely can give you NO guarantees at all. >Virtualisation, OSOL/zfs on WinXP. It's nice to play with and see it >"working" but would I TRUST precious data to it? No way! why not? if i write some data through the virtualization layer which goes straight through to the raw disk - what's the problem? do a snapshot and you can be sure you have a safe state. or not? you can check if you are consistent by doing a scrub. or not? taking buffers/caches into consideration, you could possibly lose some seconds/minutes of work, but doesn't zfs use a transactional design which ensures consistency? so, how can what's being reported here happen, if zfs takes so much care of consistency? >When that happens, ZFS believes the data is safely written, but a power cut or >crash can cause severe problems with the pool. didn't i read a million times that zfs ensures an "always consistent state" and is self healing, too? so, if new blocks are always written at new positions - why can't we just roll back to a point in time (for example the last snapshot) which is known to be safe/consistent? i don't give a shit about the last 5 minutes of work if i can recover my TB sized pool instead. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 07/21/09 01:21 PM, Richard Elling wrote: I never win the lottery either :-) Let's see. Your chance of winning a 49 ball lottery is apparently around 1 in 14*10^6, although it's much better than that because of submatches (smaller payoffs for matches on less than 6 balls). There are about 32*10^6 seconds in a year. If ZFS saves its writes for 30 seconds and batches them out, that means 1 write leaves the buffer exposed for roughly one millionth of a year. If you have 4GB of memory, you might get 50 errors a year, but you say ZFS uses only 1/10 of this for writes, so that memory could see 5 errors/year. If your single write was 1/70th of that (say around 6 MB), your chance of a hit is around 5/70/10^-6 or 1 in 14*10^6, so you are correct! So if you do one 6MB write/year, your chances of a hit in a year are about the same as that of winning a grand slam lottery. Hopefully not every hit will trash a file or pool, but odds are that you'll do many more writes than that, so on the whole I think a ZFS hit is quite a bit more likely than winning the lottery each year :-). Conversely, if you average one big write every 3 minutes or so (20% occupancy), odds are almost certain that you'll get one hit a year. So some SOHO users who do far fewer writes won't see any hits (say) over a 5 year period. But some will, and they will be most unhappy -- calculate your odds and then make a decision! I daresay the PC makers have done this calculation, which is why PCs don't have ECC, and hence IMO make for insufficiently reliable servers. Conclusions from what I've gleaned from all the discussions here: if you are too cheap to opt for mirroring, your best bet is to disable checksumming and set copies=2. If you mirror but don't have ECC then at least set copies=2 and consider disabling checksums. Actually, set copies=2 regardless, so that you have some redundancy if one half of the mirror fails and you have a 10 hour resilver, in which time you could easily get a (real) disk read error. It seems to me some vendor is going to cotton onto the SOHO server problem and make a bundle at the right price point. Sun's offerings seem unfortunately mostly overkill for the SOHO market, although the X4140 looks rather interesting... Shame there aren't any entry level SPARCs any more :-(. Now what would doctors' front offices do if they couldn't blame the computer for being down all the time? It is quite simple -- ZFS sent the flush command and VirtualBox ignored it. Therefore the bits on the persistent store are consistent. But even on the most majestic of hardware, a flush command could be lost, could it not? An obvious case in point is ZFS over iscsi and a router glitch. But the discussion seems to be moot since CR 6667683 is being addressed. Now about those writes to mirrored disks :) Cheers -- Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
To All : The ECC discussion was very interesting as I had never considered it that way! I will be buying ECC memory for my home machine!! You have to make sure your mainboard, chipset and/or CPU support it, otherwise any ECC modules will just work like regular modules. The mainboard needs to have the necessary lanes to either the chipset that supports ECC (in the case of Intel) or the CPU (in the case of AMD). I think all Xeon chipsets do ECC, as do various consumer ones (I only know of X38/X48, there are also some 9xx ones that do). For consumer boards, it's hard to figure out which actually do support it. I have an X48-DQ6 mainboard from Gigabyte, which does it. Regards, -mg ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
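One way to check whether the box is actually running with ECC, rather than just having ECC DIMMs fitted, is to look at what the BIOS reports via SMBIOS. On an x86 Solaris/OpenSolaris system something along these lines should work; the exact field names vary by BIOS, so treat it as a sketch rather than a recipe:

smbios -t SMB_TYPE_MEMARRAY
(look at the ECC field of the physical memory array; "None" means the board is not doing ECC even if the modules themselves are ECC parts)

On a Linux live CD, dmidecode -t memory shows roughly the same information as "Error Correction Type".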
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Once these bits are available in OpenSolaris then users will be able to upgrade rather easily. This would allow you to take a liveCD running these bits and recover older pools. Do you currently have a pool which needs recovery? Thanks, George Alexander Skwar wrote: Hi. Good to know! But how do we deal with that on older systems, which don't have the patch applied, once it is out? Thanks, Alexander On Tuesday, July 21, 2009, George Wilson wrote: Russel wrote: OK. So do we have an zpool import --xtg 56574 mypoolname or help to do it (script?) Russel We are working on the pool rollback mechanism and hope to have that soon. The ZFS team recognizes that not all hardware is created equal and thus the need for this mechanism. We are using the following CR as the tracker for this work: 6667683 need a way to rollback to an uberblock from a previous txg Thanks, George ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Hi. Good to know! But how do we deal with that on older systems, which don't have the patch applied, once it is out? Thanks, Alexander On Tuesday, July 21, 2009, George Wilson wrote: > Russel wrote: > > OK. > > So do we have an zpool import --xtg 56574 mypoolname > or help to do it (script?) > > Russel > > > We are working on the pool rollback mechanism and hope to have that soon. The > ZFS team recognizes that not all hardware is created equal and thus the need > for this mechanism. We are using the following CR as the tracker for this > work: > > 6667683 need a way to rollback to an uberblock from a previous txg > > Thanks, > George > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > -- Alexander -- [[ http://zensursula.net ]] [ Soc. => http://twitter.com/alexs77 | http://www.plurk.com/alexs77 ] [ Mehr => http://zyb.com/alexws77 ] [ Chat => Jabber: alexw...@jabber80.com | Google Talk: a.sk...@gmail.com ] [ Mehr => AIM: alexws77 ] [ $[ $RANDOM % 6 ] = 0 ] && rm -rf / || echo 'CLICK!' ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Thanks for the feedback George. I hope we get the tools soon. At home I have now blown the ZFS pool away and am creating a HW raid-5 set :-( Hopefully in the future when the tools are there I will return to ZFS. To All : The ECC discussion was very interesting as I had never considered it that way! I will be buying ECC memory for my home machine!! Again many many thanks to all who have replied; it has been a very interesting and informative discussion for me. Best regards Russel -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Jul 20, 2009, at 12:48 PM, Frank Middleton wrote: On 07/19/09 06:10 PM, Richard Elling wrote: Not that bad. Uncommitted ZFS data in memory does not tend to live that long. Writes are generally out to media in 30 seconds. Yes, but memory hits are instantaneous. On a reasonably busy system there may be buffers in queue all the time. You may have a buffer in memory for 100uS but it only takes 1nS for that buffer to be clobbered. If that happened to be metadata about to be written to both sides of a mirror than you are toast. Good thing this never happens, right :-) I never win the lottery either :-) Beware, if you go down this path of thought for very long, you'll soon be afraid to get out of bed in the morning... wait... most people actually die in beds, so perhaps you'll be afraid to go to bed instead :-) Not at all. As with any rational business, my servers all have ECC, and getting up and out isn't a problem :-). Maybe I've had too many disks go bad, so I have ECC, mirrors, and backup to a system with ECC and mirrors (and copies=2, as well). Maybe I've read too many of your excellent blogs :-). Sun doesn't even sell machines without ECC. There's a reason for that. Yes, but all of the discussions in this thread can be classified as systems engineering problems, not product design problems. Not sure I follow. We've had this discussion before. OSOL+ZFS lets you build enterprise class systems on cheap hardware that has errors. ZFS gives the illusion of being fragile because it, uniquely, reports these errors. Running OSOL as a VM in VirtualBox using MSWanything as a host is a bit like building on sand, but there's nothing in documentation anywhere to even warn folks that they shouldn't rely on software to get them out of trouble on cheap hardware. ECC is just one (but essential) part of that. It is a systems engineering problem because ZFS is working as designed and VirtualBox is also working as designed. If you file a bug against either, the bug should be closed as "not a defect." That means the responsibility for making sure that the two interoperate lies at the systems level -- where systems engineers do their job. For an analogy, guns don't kill people, bullets kill people. The gun is just a platform for directing bullets. If you shoot yourself in the foot, then the failure is not with the gun or bullet, it is one layer above -- in the system. It hurts when you do that, so don't do that. On 07/19/09 08:29 PM, David Magda wrote: It's a nice-to-have, but at some point we're getting into the tinfoil hat-equivalent of data protection. But it is going to happen! Sun sells only machines with ECC because that is the only way to ensure reliability. Someone who spends weeks building a media server at home isn't going to be happy if they lose one media file let alone a whole pool. At least they should be warned that without ECC at some point they will lose files. I'm not convinced that there is any reasonable scenario for losing an entire pool though, which was the original complaint in this thread. Even trusty old SPARCs occasionally hang without a panic (in my experience especially when a disk is about to go bad). If this happens, and you have to power cycle because even stop-A doesn't respond, are you all saying that there is a risk of losing a pool at that point? Surely the whole point of a journalled file system is that it is pretty much proof against any catastrophe, even the one described initially. There have been a couple of (to me) unconvincing explanations of how this pool was lost. 
It is quite simple -- ZFS sent the flush command and VirtualBox ignored it. Therefore the bits on the persistent store are consistent. Surely if there is a mechanism whereby unflushed i/os can cause fatal metadata corruption, this should be a high priority bug since this can happen on /any/ hardware; it is just more likely if the foundations are shaky, so the explanation must require more than that if it isn't a bug. It isn't a bug in ZFS or VirtualBox. They work as designed. As has been mentioned before, many times, the recovery of the data is now a forensics exercise. All ZFS knows is that the consistency is broken, and it is implementing the policy that consistency is more important than automated access. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Russel wrote: OK. So do we have an zpool import --xtg 56574 mypoolname or help to do it (script?) Russel We are working on the pool rollback mechanism and hope to have that soon. The ZFS team recognizes that not all hardware is created equal and thus the need for this mechanism. We are using the following CR as the tracker for this work: 6667683 need a way to rollback to an uberblock from a previous txg Thanks, George ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
My understanding of the root cause of these issues is that the vast majority are happening with consumer grade hardware that is reporting to ZFS that writes have succeeded, when in fact they are still in the cache. When that happens, ZFS believes the data is safely written, but a power cut or crash can cause severe problems with the pool. This is (I think) the reason for comments about this being a system engineering, not design problem - ZFS assumes the disks are telling the truth and has been designed this way. It is up to the administrator to engineer the server from components that accurately report their status. However, while the majority of these cases are with consumer hardware, the BBC have reported that they hit this problem using Sun T2000 servers and commodity SATA drives, so unless somebody from Sun can say otherwise, I feel that there is still some risk of this occurring on Sun hardware. I feel the ZFS marketing and documentation is very misleading in that it completely ignores the issue of your entire pool being at risk unless you are careful about the hardware used, leading to a lot of stories like this from enthusiasts and early adopters. I also believe ZFS needs recovery tools as a matter of urgency, to protect its reputation if nothing else. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 07/19/09 06:10 PM, Richard Elling wrote: Not that bad. Uncommitted ZFS data in memory does not tend to live that long. Writes are generally out to media in 30 seconds. Yes, but memory hits are instantaneous. On a reasonably busy system there may be buffers in queue all the time. You may have a buffer in memory for 100uS but it only takes 1nS for that buffer to be clobbered. If that happened to be metadata about to be written to both sides of a mirror than you are toast. Good thing this never happens, right :-) Beware, if you go down this path of thought for very long, you'll soon be afraid to get out of bed in the morning... wait... most people actually die in beds, so perhaps you'll be afraid to go to bed instead :-) Not at all. As with any rational business, my servers all have ECC, and getting up and out isn't a problem :-). Maybe I've had too many disks go bad, so I have ECC, mirrors, and backup to a system with ECC and mirrors (and copies=2, as well). Maybe I've read too many of your excellent blogs :-). Sun doesn't even sell machines without ECC. There's a reason for that. Yes, but all of the discussions in this thread can be classified as systems engineering problems, not product design problems. Not sure I follow. We've had this discussion before. OSOL+ZFS lets you build enterprise class systems on cheap hardware that has errors. ZFS gives the illusion of being fragile because it, uniquely, reports these errors. Running OSOL as a VM in VirtualBox using MSWanything as a host is a bit like building on sand, but there's nothing in documentation anywhere to even warn folks that they shouldn't rely on software to get them out of trouble on cheap hardware. ECC is just one (but essential) part of that. On 07/19/09 08:29 PM, David Magda wrote: It's a nice-to-have, but at some point we're getting into the tinfoil hat-equivalent of data protection. But it is going to happen! Sun sells only machines with ECC because that is the only way to ensure reliability. Someone who spends weeks building a media server at home isn't going to be happy if they lose one media file let alone a whole pool. At least they should be warned that without ECC at some point they will lose files. I'm not convinced that there is any reasonable scenario for losing an entire pool though, which was the original complaint in this thread. Even trusty old SPARCs occasionally hang without a panic (in my experience especially when a disk is about to go bad). If this happens, and you have to power cycle because even stop-A doesn't respond, are you all saying that there is a risk of losing a pool at that point? Surely the whole point of a journalled file system is that it is pretty much proof against any catastrophe, even the one described initially. There have been a couple of (to me) unconvincing explanations of how this pool was lost. Surely if there is a mechanism whereby unflushed i/os can cause fatal metadata corruption, this should be a high priority bug since this can happen on /any/ hardware; it is just more likely if the foundations are shaky, so the explanation must require more than that if it isn't a bug. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 20-Jul-09, at 6:26 AM, Russel wrote: Well I did have a UPS on the machine :-) but the machine hung and I had to power it off... (yep it was virtual, but that happens on direct HW too, As has been discussed here before, the failure modes are different as the layer stack from filesystem to disk is obviously very different. --Toby and virtualisation is the happening thing at Sun and elsewhere! I have a version of the data backed up, but will take ages (10 days) to restore). -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
OK. So do we have an zpool import --xtg 56574 mypoolname or help to do it (script?) Russel -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
> the machine hung and I had to power it off. kinda getting off the "zpool import --tgx -3" request, but "hangs" are exceptionally rare and usually a RAM or other hardware issue; Solaris usually abends on software faults. r...@pdm # uptime 9:33am up 1116 day(s), 21:12, 1 user, load average: 0.07, 0.05, 0.05 r...@pdm # date Mon Jul 20 09:33:07 EDT 2009 r...@pdm # uname -a SunOS pdm 5.9 Generic_112233-12 sun4u sparc SUNW,Ultra-250 Rob ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Well I did have a UPS on the machine :-) but the machine hung and I had to power it off... (yep it was virtual, but that happens on direct HW too, and virtualisation is the happening thing at Sun and elsewhere! I have a version of the data backed up, but will take ages (10 days) to restore). -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Sun, 19 Jul 2009, Richard Elling wrote: I do, even though I have a small business. Neither InDesign nor Illustrator will be ported to Linux or OpenSolaris in my lifetime... besides, iTunes rocks and it is the best iPhone developer's environment on the planet. Richard, I think the point that Gavin was trying to make is that a sensible business would commit their valuable data back to a fileserver running on solid hardware with a solid operating system rather than relying on their single-spindle laptops to store their valuable content - not making any statement on the actual desktop platform. For example, I use a mixture of Windows, MacOS, Solaris and OpenBSD around here, but all the valuable data is stored on a zpool located on a SPARC server (obviously with ECC RAM) with UPS power. With Windows around, I like the fact that I don't need to think twice before reinstalling those machines. Andre. -- Andre van Eyssen. mail: an...@purplecow.org jabber: an...@interact.purplecow.org purplecow.org: UNIX for the masses http://www2.purplecow.org purplecow.org: PCOWpix http://pix.purplecow.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Gavin Maltby wrote: Hi, David Magda wrote: On Jul 19, 2009, at 20:13, Gavin Maltby wrote: No, ECC memory is a must too. ZFS checksumming verifies and corrects data read back from a disk, but once it is read from disk it is stashed in memory for your application to use - without ECC you erode confidence that what you read from memory is correct. Right, because once (say) Apple incorporates ZFS into Mac OS X they'll also start shipping MacBooks and iMacs with ECC. If customers were committing valuable business data to MacBooks and iMacs then ECC would be a requirement. I don't know of terribly many customers running their business off of a laptop. I do, even though I have a small business. Neither InDesign nor Illustrator will be ported to Linux or OpenSolaris in my lifetime... besides, iTunes rocks and it is the best iPhone developer's environment on the planet. The bigger problem is that not all of Intel's CPU products do ECC... the embedded and server models do, but it is the low-margin PC market that is willing to make that cost trade-off. If people demanded ECC, like they do in the embedded and server markets, then we wouldn't be having this conversation. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Hi, David Magda wrote: On Jul 19, 2009, at 20:13, Gavin Maltby wrote: No, ECC memory is a must too. ZFS checksumming verifies and corrects data read back from a disk, but once it is read from disk it is stashed in memory for your application to use - without ECC you erode confidence that what you read from memory is correct. Right, because once (say) Apple incorporates ZFS into Mac OS X they'll also start shipping MacBooks and iMacs with ECC. If customers were committing valuable business data to MacBooks and iMacs then ECC would be a requirement. I don't know of terribly many customers running their business off of a laptop. If it's so necessary we might as well have any kernel that has ZFS in it only allow 'zpool create' to be run if the kernel detects ECC modules. Come on. > It's a nice-to-have, but at some point we're getting into the tinfoil hat-equivalent of data protection. On a laptop zfs is a huge amount safer than other filesystems, still has all the great usability features etc - but zfs does not magically turn your laptop into a server-grade system. What you refer to as a tinfoil hat is an essential component of any server if it is housing business-vital data; obviously it is just a nice-to-have on a laptop, but recognise what you're losing. Gavin ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Sun, 19 Jul 2009, David Magda wrote: Right, because once (say) Apple incorporates ZFS into Mac OS X they'll also start shipping MacBooks and iMacs with ECC. If it's so necessary we might as well have any kernel that has ZFS in it only allow 'zpool create' to be run if the kernel detects ECC modules. The MacBooks and iMacs are only used as an execution environment for the Safari web browser. ECC is only necessary for computers which save data somewhere so the MacBook and iMac do not need ECC. Regardless (in order to stay on topic) it is worth mentioning that the 10TB data lost to a failed pool was not lost due to lack of ECC. It was lost because VirtualBox intentionally broke the guest operating system. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Jul 19, 2009, at 20:13, Gavin Maltby wrote: No, ECC memory is a must too. ZFS checksumming verifies and corrects data read back from a disk, but once it is read from disk it is stashed in memory for your application to use - without ECC you erode confidence that what you read from memory is correct. Right, because once (say) Apple incorporates ZFS into Mac OS X they'll also start shipping MacBooks and iMacs with ECC. If it's so necessary we might as well have any kernel that has ZFS in it only allow 'zpool create' to be run if the kernel detects ECC modules. Come on. It's a nice-to-have, but at some point we're getting into the tinfoil hat-equivalent of data protection. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
dick hoogendijk wrote: true. Furthermore, much so-called consumer hardware is very good these days. My guess is ZFS should work quite reliably on that hardware. (i.e. non ECC memory should work fine!) / mirroring is a -must- ! No, ECC memory is a must too. ZFS checksumming verifies and corrects data read back from a disk, but once it is read from disk it is stashed in memory for your application to use - without ECC you erode confidence that what you read from memory is correct. Gavin ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Sun, 19 Jul 2009, Miles Nordin wrote: "r" == Ross writes: "tt" == Toby Thain writes: r> ZFS was never designed to run on consumer hardware, this is markedroid garbage, as well as post-facto apologetics. Don't lower the bar. Don't blame the victim. I think that the standard disclaimer "Always use protection" applies here. Victims who do not use protection should assume substantial guilt for their subsequent woes. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Frank Middleton wrote: On 07/19/09 05:00 AM, dick hoogendijk wrote: (i.e. non ECC memory should work fine!) / mirroring is a -must- ! Yes, mirroring is a must, although it doesn't help much if you have memory errors (see several other threads on this topic): http://en.wikipedia.org/wiki/Dynamic_random_access_memory#Errors_and_error_correction "Tests[ecc]give widely varying error rates, but about 10^-12 error/bit·h is typical, roughly one bit error, per month, per gigabyte of memory." That's roughly 1 per week in 4GB. If 1 error in 50 results in a ZFS hit, that's one/year per user on average. Some get more, some get less.That sounds like pretty bad odds... Not that bad. Uncommitted ZFS data in memory does not tend to live that long. Writes are generally out to media in 30 seconds. Solaris scrubs memory, with a 12-hour cycle time, so memory does not remain untouched for a month. For high-end systems, memory scrubs are also performed by the memory controllers. Beware, if you go down this path of thought for very long, you'll soon be afraid to get out of bed in the morning... wait... most people actually die in beds, so perhaps you'll be afraid to go to bed instead :-) "In most computers used for serious scientific or financial computing and as servers, ECC is the rule rather than the exception, as can be seen by examining manufacturers' specifications." Sun doesn't even sell machines without ECC. There's a reason for that. Yes, but all of the discussions in this thread can be classified as systems engineering problems, not product design problems. If you do your own systems engineering, then add this to your (hopefully long) checklist. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
> "r" == Ross writes: > "tt" == Toby Thain writes: r> ZFS was never designed to run on consumer hardware, this is markedroid garbage, as well as post-facto apologetics. Don't lower the bar. Don't blame the victim. tt> I posted about that insane default, six months ago. Obviously tt> ZFS isn't the only subsystem that this breaks. yes, but remember, in this case the host did not crash, so the insane default should be irrelevant. pgpc8tzQ0aGF0.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Sun, 19 Jul 2009, Frank Middleton wrote: Yes, mirroring is a must, although it doesn't help much if you have memory errors (see several other threads on this topic): http://en.wikipedia.org/wiki/Dynamic_random_access_memory#Errors_and_error_correction "Tests[ecc]give widely varying error rates, but about 10^-12 error/bit·h is typical, roughly one bit error, per month, per gigabyte of memory." That's roughly 1 per week in 4GB. If 1 error in 50 results in a ZFS hit, that's one/year per user on average. Some get more, some get less.That sounds like pretty bad odds... I fail to see anything zfs-specific in the above. It does not have anything more to do with zfs than it does with any other software running on the system. I do have a couple of Windows PCs here without ECC, but they were gifts from other people, and not hardware that I purchased, and not used for any critical application. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 07/19/09 05:00 AM, dick hoogendijk wrote: (i.e. non ECC memory should work fine!) / mirroring is a -must- ! Yes, mirroring is a must, although it doesn't help much if you have memory errors (see several other threads on this topic): http://en.wikipedia.org/wiki/Dynamic_random_access_memory#Errors_and_error_correction "Tests [ecc] give widely varying error rates, but about 10^-12 error/bit·h is typical, roughly one bit error, per month, per gigabyte of memory." That's roughly 1 per week in 4GB. If 1 error in 50 results in a ZFS hit, that's one/year per user on average. Some get more, some get less. That sounds like pretty bad odds... "In most computers used for serious scientific or financial computing and as servers, ECC is the rule rather than the exception, as can be seen by examining manufacturers' specifications." Sun doesn't even sell machines without ECC. There's a reason for that. IMO you'd be nuts to run ZFS on a machine without ECC unless you don't care about losing some or all of the data. Having said that, we have yet to lose an entire pool - this is pretty hard to do! I should add that since setting copies=2 and forcing the files to be copied, there have been no more unrecoverable errors on a particularly low end machine that was plagued with them even with mirrors (and a UPS with a bad battery :-) ). On 19-Jul-09, at 7:12 AM, Russel wrote: As this was not clear to me. I use VB like others use vmware etc to run solaris because its the ONLY way I can, Given that PC hardware is so cheap these days (used SPARCs even cheaper), surely it makes far more sense to build a nice robust OSOL/ZFS based file server *with* ECC. Then you can use iSCSI for your VirtualBox VMs and solve all kinds of interesting problems. But you still need to do backups. My solution for that is to replicate the server and back up to it using zfs send/recv. If a disk fails, you switch to the backup and there are no worries about the second disk of the mirror failing during a resilver. A small price to pay for peace of mind. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
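For anyone wanting to copy that setup, the moving parts are just a property and a replication stream; a rough sketch, where the pool, filesystem, snapshot and host names are examples rather than anything canonical:

# extra intra-pool redundancy; only applies to blocks written after the property is set
zfs set copies=2 tank/data
# replicate the whole tree, including snapshots and properties, to the backup box
zfs snapshot -r tank@backup-20090719
zfs send -R tank@backup-20090719 | ssh backuphost zfs receive -d -F backuppool
# later, send only the changes since the previous backup snapshot
zfs send -R -I tank@backup-20090719 tank@backup-20090726 | ssh backuphost zfs receive -d -F backuppool

The -R/-I send options and the -d/-F receive options exist in the Solaris 10 and OpenSolaris releases contemporary with this thread, but check zfs(1M) on both ends, since an older receiver will not understand newer stream features.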
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
That's only one element of it Bob. ZFS also needs devices to fail quickly and in a predictable manner. A consumer grade hard disk could lock up your entire pool as it fails. The kit Sun supply is more likely to fail in a manner ZFS can cope with. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 19-Jul-09, at 7:12 AM, Russel wrote: Guys guys please chill... First thanks to the info about virtualbox option to bypass the cache (I don't suppose you can give me a reference for that info? (I'll search the VB site :-)) I posted about that insane default, six months ago. Obviously ZFS isn't the only subsystem that this breaks. http://forums.virtualbox.org/viewtopic.php?f=8&t=13661&start=0 As this was not clear to me. I use VB like others use vmware etc to run solaris because its the ONLY way I can, Convenience always has a price. --Toby ... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Sun, 19 Jul 2009, Ross wrote: The success of any ZFS implementation is *very* dependent on the hardware you choose to run it on. To clarify: "The success of any filesystem implementation is *very* dependent on the hardware you choose to run it on." ZFS requires that the hardware cache sync works and is respected. Without taking advantage of the drive caches, zfs would be considerably less performant. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Heh, yes, I assumed similar things Russel. I also assumed that a faulty disk in a raid-z set wouldn't hang my entire pool indefinitely, that hot plugging a drive wouldn't reboot Solaris, and that my pool would continue working after I disconnected one half of an iscsi mirror. I also like yourself assumed that if ZFS is using copy on write, then even after a really nasty crash, the vast majority of my data would be accessible. And I also believed that when I had disconnected every drive from a ZFS pool, that ZFS wouldn't accept writes to it any more... Unfortunately, all of these assumptions turned out to be false. Learning ZFS has been a painful experience. I still like it, but I am very aware of its limitations, and am cautious how I apply it these days. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
From the experience myself and others have had, and Sun's approach with their Amber Road storage (FISHWORKS - fully integrated *hardware* and software), my feeling is very much that ZFS was designed by Sun to run on Sun's own hardware, and as such, they were able to make certain assumptions with their design. ZFS was never designed to run on consumer hardware, it makes assumptions that devices and drivers will always be well behaved when errors occur, and in general is quite fragile if you're running it on the wrong system. On the right hardware, I've no doubt that ZFS is incredibly reliable and easy to manage. On the wrong hardware, disk errors can hang your entire system, hot swap can down your pool, and a power cut or other error can render your entire pool inaccessible. The success of any ZFS implementation is *very* dependent on the hardware you choose to run it on. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Guys guys please chill... First, thanks for the info about the VirtualBox option to bypass the cache (I don't suppose you can give me a reference for that info? (I'll search the VB site :-)) as this was not clear to me. I use VB like others use vmware etc to run solaris because it's the ONLY way I can, as I can't get the drivers for most of the H/W out there in hobby land, so a virtualised system allows me to run my SilImage chipset to link 3Gb/s to the SATA multi-port array I got for £160. ===anyway let's stop there or we will be off topic even more ===just thought you should know why I did it, even BSD does ===not have a driver or I would have gone there to get zfs :-) Anyway... my view on zfs was quite simple: it looked after bit rot, did self healing, and most importantly for me, running it on consumer kit, it seemed to avoid the RAID-5 write hole in the case of a crash! So if stuff falls over, e.g. Windows, VB, OpenSolaris etc, I would not suffer unknown data corruption and would just lose that write, which was fine as the thing crashed. So for a flaky environment ZFS sounds even more like the one you want, LOL. Loved all the technical stuff; I have had rather good deep dives from Sun's best here in the UK/Europe (I'm lucky, as I was a very early employee of Sun, and now work for a major firm :-)). Liked the idea that you can build your own storage server etc etc. I knew most bugs, as I saw them, were fixed in the Jan 09 patch. I THOUGHT/ASSUMED (yes, you should never :-()) that given everything else it would be blatantly obvious that when you try to mount a zpool the thing would either roll back to the last consistent state (that includes the uberblock and metadata, thank you) or have a tool like fsck which lets you do it. BUT, you know, once you start rolling back (just like clearing inodes) you're not going to be in such a good place and you'd need to scrub or something; even if it says these files are now corrupt, FINE, but I DIDN'T lose the filesystem, just a file or two. We should never lose the filesystem. But in ZFS land that's the most likely data-loss fault we have, it sounds. SUMMARY = What I see here is the lack of the (not needed, lol) fsck-type tool. WELL WE DO NEED it; we need to be able to roll back and recover and repair. I have lost data stored on large Sun 6790 arrays and now my home system. So PLEASE, has anyone got a beta version of a tool to perform the roll back? Russel (It will take me 10 days to pull my data off my little drives again, and 5 days to format with raid5 (H/W) and NTFS, which is not what I want, nor is the raid-5 hole :-)) -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Sun, 19 Jul 2009 01:48:40 PDT Ross wrote: > As far as I can see, the ZFS Administrator Guide is sorely lacking in > any warning that you are risking data loss if you run on consumer > grade hardware. And yet, ZFS is not only for NON-consumer grade hardware, is it? The fact that many, many people run "normal" consumer hardware does not rule them out from ZFS, does it? The "best filesystem ever", the "end of all other filesystems" would be nothing more than a dream if that was true. Furthermore, much so-called consumer hardware is very good these days. My guess is ZFS should work quite reliably on that hardware. (i.e. non ECC memory should work fine!) / mirroring is a -must- ! -- Dick Hoogendijk -- PGP/GnuPG key: 01D2433D + http://nagual.nl/ | nevada / OpenSolaris 2010.02 B118 + All that's really worth doing is what we do for others (Lewis Carrol) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
While I agree with Brent, I think this is something that should be stressed in the ZFS documentation. Those of us with long term experience of ZFS know that it's really designed to work with hardware meeting quite specific requirements. Unfortunately, that isn't documented anywhere, and more and more people are being bitten by quite severe dataloss by virtue of the fact that ZFS is far less forgiving than other filesystems when data hasn't been properly written to disk. As far as I can see, the ZFS Administrator Guide is sorely lacking in any warning that you are risking data loss if you run on consumer grade hardware. In fact, the requirements section states nothing more than: "ZFS Hardware and Software Requirements and Recommendations Make sure you review the following hardware and software requirements and recommendations before attempting to use the ZFS software: * A SPARC® or x86 system that is running the or the Solaris 10 6/06 release or later release. * The minimum disk size is 128 Mbytes. The minimum amount of disk space required for a storage pool is approximately 64 Mbytes. * Currently, the minimum amount of memory recommended to install a Solaris system is 768 Mbytes. However, for good ZFS performance, at least one Gbyte or more of memory is recommended. * If you create a mirrored disk configuration, multiple controllers are recommended." -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Sun, 19 Jul 2009 00:00:06 -0700 Brent Jones wrote: > No offense, but you trusted 10TB of important data, running in > OpenSolaris from inside Virtualbox (not stable) on top of Windows XP > (arguably not stable, especially for production) on probably consumer > grade hardware with unknown support for any of the above products? Running this kind of setup absolutely can give you NO guarantees at all. Virtualisation, OSOL/zfs on WinXP. It's nice to play with and see it "working" but would I TRUST precious data to it? No way! -- Dick Hoogendijk -- PGP/GnuPG key: 01D2433D + http://nagual.nl/ | nevada / OpenSolaris 2010.02 B118 + All that's really worth doing is what we do for others (Lewis Carrol) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
I would be interested in how to roll back to certain txg points in case of disaster; that was what Russel was after anyway. Yours Markus Kovero -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Miles Nordin Sent: 19 July 2009 11:24 To: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work >>>>> "bj" == Brent Jones writes: bj> many levels of fail here, pft. Virtualbox isn't unstable in any of my experience. It doesn't by default pass cache flushes from guest to host unless you set VBoxManage setextradata VMNAME "VBoxInternal/Devices/piix3ide/0/LUN#[x]/Config/IgnoreFlush" 0 however OP does not mention the _host_ crashing, so this questionable ``optimization'' should not matter. Yanking the guest's virtual cord is something ZFS is supposed to tolerate: remember the ``crash-consistent backup'' concept (not to mention the ``always consistent on disk'' claim, but really any filesystem even without that claim should tolerate having the guest's virtual cord yanked, or the guest's kernel crashing, without losing all its contents---the claim only means no time-consuming fsck after reboot). bj> to blame ZFS seems misplaced, -1 The fact that it's a known problem doesn't make it not a problem. bj> the subject on this thread especially inflammatory. so what? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
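Until a supported rollback exists, about the only thing you can do from the command line is inspect what is on disk with zdb, which at least tells you which txg the active uberblock points at. zdb is not a stable or supported interface and its options drift between builds, so treat the following as a sketch (pool and device names are examples):

zdb -l /dev/rdsk/c1t0d0s0    (dump the four vdev labels on one disk)
zdb -uuu tank                (print the active uberblock, with its txg and timestamp, of an imported pool)
zdb -e -uuu tank             (the same for an exported or unimportable pool, read straight from the labels)

Actually rewinding to one of the older uberblocks still needs the work tracked in CR 6667683; doing it by hand today means unsupported label surgery, which is exactly why people keep asking for the tool.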
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
> "bj" == Brent Jones writes: bj> many levels of fail here, pft. Virtualbox isn't unstable in any of my experience. It doesn't by default pass cache flushes from guest to host unless you set VBoxManage setextradata VMNAME "VBoxInternal/Devices/piix3ide/0/LUN#[x]/Config/IgnoreFlush" 0 however OP does not mention the _host_ crashing, so this questionable ``optimization'' should not matter. Yanking the guest's virtual cord is something ZFS is supposed to tolerate: remember the ``crash-consistent backup'' concept (not to mention the ``always consistent on disk'' claim, but really any filesystem even without that claim should tolerate having the guest's virtual cord yanked, or the guest's kernel crashing, without losing all its contents---the claim only means no time-consuming fsck after reboot). bj> to blame ZFS seems misplaced, -1 The fact that it's a known problem doesn't make it not a problem. bj> the subject on this thread especially inflammatory. so what? pgpsa1Xq1kR3M.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Sat, Jul 18, 2009 at 7:39 PM, Russel wrote: > Yes you'll find my name all over VB at the moment, but I have found it to be > stable > (don't install the addons disk for solaris!!, use 3.0.2, and for me > winXP32bit and > OpenSolaris 2009.6 has been rock solid, it was (seems) to be opensolaris > failed > with extract_boot_list doesn't belong to 101, but noone on opensol, seems > interested about it as other have reported it to, prob a rare issue. > > But yer, I hope Vicktor or someone will take a look. My worry is that if we > can't recover from this, which a number of people (in variuos forms) have > come accross zfs may be introuble. We had this happen at work about 18 months > ago > lost all the data (20TB)(didn't know about zdb nor did sun support) so we > have start > to back away, but I though since jan 2009 patches things were meant to be > alot better, esp with sun using it in there storage servers now > -- > This message posted from opensolaris.org > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > No offense, but you trusted 10TB of important data, running in OpenSolaris from inside Virtualbox (not stable) on top of Windows XP (arguably not stable, especially for production) on probably consumer grade hardware with unknown support for any of the above products? I'd like to say this was an unfortunate circumstance, but there are many levels of fail here, and to blame ZFS seems misplaced, and the subject on this thread especially inflammatory. -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Yes you'll find my name all over VB at the moment, but I have found it to be stable (don't install the addons disk for Solaris!!, use 3.0.2, and for me winXP32bit and OpenSolaris 2009.6 has been rock solid). It was (it seems) OpenSolaris that failed, with "extract_boot_list doesn't belong to 101", but no one on opensol seems interested in it, as others have reported it too; prob a rare issue. But yer, I hope Vicktor or someone will take a look. My worry is that if we can't recover from this, which a number of people (in various forms) have come across, zfs may be in trouble. We had this happen at work about 18 months ago and lost all the data (20TB) (didn't know about zdb, nor did Sun support), so we have started to back away, but I thought since the Jan 2009 patches things were meant to be a lot better, esp with Sun using it in their storage servers now -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Sorry to hear that, but you do know that VirtualBox is not really stable? VirtualBox does show some instability from time to time. You haven't read the VirtualBox forums? I would advise against VirtualBox for saving all your data in ZFS. I would use OpenSolaris without virtualization. I hope your problem gets fixed, though. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss