Re: [zfs-discuss] zfs-discuss mailing list & opensolaris EOL

2013-02-16 Thread Toby Thain

On 16/02/13 3:51 PM, Sašo Kiselkov wrote:

On 02/16/2013 06:44 PM, Tim Cook wrote:

We've got Oracle employees on the mailing list, that while helpful, in no
way have the authority to speak for company policy.  They've made that
clear on numerous occasions.  And that doesn't change the fact that we
literally have heard NOTHING from Oracle since the closing of OpenSolaris.
0 official statements, so I once again ask: what do you think you were
going to get in response to your questions?

The reason you hear nothing from them on anything official is because it's
a good way to lose your job.


People, let's get down to brass tacks here. Either:

1) You will continue to get zfs-discuss@opensolaris.org e-mail after
March 24th and can carry on as before, or
2) You won't, in which case we welcome everybody at
z...@lists.illumos.org.


Signed up, thanks.

The ZFS list has been very high value and I thank everyone whose wisdom 
I have enjoyed, especially people like you Sašo, Mr Elling, Mr 
Friesenhahn, Mr Harvey, the distinguished Sun and Oracle engineers who 
post here, and many others.


Let the Illumos list thrive.

--Toby



It's really that simple.

Cheers,
--
Saso



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Scrub and checksum permutations

2012-10-27 Thread Toby Thain

On 27/10/12 11:56 AM, Ray Arachelian wrote:

On 10/26/2012 04:29 AM, Karl Wagner wrote:


Does it not store a separate checksum for a parity block? If so, it
should not even need to recalculate the parity: assuming checksums
match for all data and parity blocks, the data is good.
...



Parity is very simple to calculate and doesn't use a lot of CPU - just
slightly more work than reading all the blocks: read all the stripe
blocks on all the drives involved in a stripe, then do a simple XOR
operation across all the data.  The actual checksums are more expensive
as they're MD5 - much nicer when these can be hardware accelerated.


Checksums are MD5??

--Toby
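
(For reference: ZFS block checksums are fletcher2/fletcher4 or SHA-256, not
MD5. The XOR parity arithmetic described above is easy to see in a toy sh
sketch with made-up byte values; this is an illustration, not ZFS code:

  # Parity of a stripe is the XOR of its data blocks; a lost block is
  # recovered by XORing the parity with the surviving blocks.
  d0=0xA5 d1=0x3C
  parity=$(( d0 ^ d1 ))
  printf 'parity=%#x  recovered d1=%#x\n' "$parity" "$(( d0 ^ parity ))"
)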



Also, on x86,  ...

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS best practice for FreeBSD?

2012-10-11 Thread Toby Thain

On 11/10/12 5:47 PM, andy thomas wrote:

...
This doesn't sound like a very good idea to me as surely disk seeks for
swap and for ZFS file I/O are bound to clash, aren't they?



As Phil implied, if your system is swapping, you already have bigger 
problems.


--Toby



Andy



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] single-disk pool - Re: Can the ZFS "copies" attribute substitute HW disk redundancy?

2012-08-02 Thread Toby Thain

On 01/08/12 3:34 PM, opensolarisisdeadlongliveopensolaris wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Jim Klimov

Well, there is at least a couple of failure scenarios where
copies>1 are good:

1) A single-disk pool, as in a laptop. Noise on the bus,
 media degradation, or any other reason to misread or
 miswrite a block can result in a failed pool.


How does mac/win/lin handle this situation?  (Not counting btrfs.)



Is this a trick question? :)

--Toby


Such noise might result in a temporarily faulted pool (blue screen of death) 
that is fully recovered after reboot.  Meanwhile you're always paying for it in 
terms of performance, and it's all solvable via pool redundancy.
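
(Whichever way that trade-off falls, the knob being discussed is the
per-dataset copies property; a minimal sketch with a hypothetical dataset
name, noting that it only applies to blocks written after it is set:

  zfs set copies=2 rpool/export/home
  zfs get copies rpool/export/home
)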



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Apple's ZFS-alike - Re: Does raidzN actually protect against bitrot? If yes - how?

2012-01-15 Thread Toby Thain

On 15/01/12 10:38 AM, Edward Ned Harvey wrote:

...
Linux is going with btrfs.  MS has their own thing.  Oracle continues with
ZFS closed source.  Apple needs a filesystem that doesn't suck, but they're
not showing inclinations toward ZFS or anything else that I know of.



Rumours have long circulated, even before the brief public debacle of 
ZFS in OS X - "is it in Leopard...yes it's in...no it's not...yes it's 
in...oh damn, it's really not" - that Apple is building their own clone 
of ZFS.


--Toby





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-15 Thread Toby Thain

On 15/10/11 2:43 PM, Richard Elling wrote:

On Oct 15, 2011, at 6:14 AM, Edward Ned Harvey wrote:


From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Tim Cook

In my example - probably not a completely clustered FS.
A clustered ZFS pool with datasets individually owned by
specific nodes at any given time would suffice for such
VM farms. This would give users the benefits of ZFS
(resilience, snapshots and clones, shared free space)
merged with the speed of direct disk access instead of
lagging through a storage server accessing these disks.


I think I see a couple of points of disconnect.

#1 - You seem to be assuming storage is slower when it's on a remote storage
server as opposed to a local disk.  While this is typically true over
ethernet, it's not necessarily true over infiniband or fibre channel.


Ethernet has *always* been faster than a HDD. Even back when we had 3/180s
10Mbps Ethernet it was faster than the 30ms average access time for the disks of
the day. I tested a simple server the other day and round-trip for 4KB of data
on a busy 1GbE switch was 0.2ms. Can you show a HDD as fast? Indeed many SSDs
have trouble reaching that rate under load.


Hmm, of course the *latency* of Ethernet has always been much less, but 
I did not see it reaching the *throughput* of a single direct attached 
disk until gigabit.


I'm pretty sure direct attached disk throughput in the Sun 3 era was 
much better than 10Mbit Ethernet could manage. Iirc, NFS on a Sun 3 
running NetBSD over 10B2 was only *just* capable of streaming MP3, with 
tweaking, from my own experiments (I ran 10B2 at home until 2004; hey, 
it was good enough!)


--Toby



Many people today are deploying 10GbE and it is relatively easy to get wire
speed for bandwidth and < 0.1 ms average access for storage.

Today, HDDs aren't fast, and are not getting faster.
  -- richard



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs scripts

2011-09-10 Thread Toby Thain
On 10/09/11 8:31 AM, LaoTsao wrote:
> imho, there is no harm to use & in both cmds
> 

There is a difference.

--T
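
(The difference, roughly: with an ampersand only on the first command, the
shell starts both sends but the script waits only for the second one; with
ampersands on both, the script exits immediately unless it also waits. A
sketch that runs both concurrently and waits for both, reusing the
hypothetical dataset and file names from the quoted script:

  #!/bin/sh
  zfs send pool/filesystem1@100911 > /backup/filesystem1.snap &
  zfs send pool/filesystem2@100911 > /backup/filesystem2.snap &
  wait   # block until both background sends have finished
)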

> Sent from my iPad
> Hung-Sheng Tsao ( LaoTsao) Ph.D
> 
> On Sep 10, 2011, at 4:59, Toby Thain  wrote:
> 
>> On 09/09/11 6:33 AM, Sriram Narayanan wrote:
>>> Plus, you'll need an & character at the end of each command.
>>>
>>
>> Only one of the commands needs to be backgrounded.
>>
>> --Toby
>>
>>> -- Sriram
>>>
>>> On 9/9/11, Tomas Forsman  wrote:
>>>> On 09 September, 2011 - cephas maposah sent me these 0,4K bytes:
>>>>
>>>>> i am trying to come up with a script that incorporates other scripts.
>>>>>
>>>>> eg
>>>>> zfs send pool/filesystem1@100911 > /backup/filesystem1.snap
>>>>> zfs send pool/filesystem2@100911 > /backup/filesystem2.snap
>>>>
>>>> #!/bin/sh
>>>> zfs send pool/filesystem1@100911 > /backup/filesystem1.snap &
>>>> zfs send pool/filesystem2@100911 > /backup/filesystem2.snap
>>>>
>>>> ..?
>>>>
>>>>> i need to incorporate these 2 into a single script with both commands
>>>>> running concurrently.
>>>>
>>>> /Tomas
>>>> --
>>>> Tomas Forsman, st...@acc.umu.se, http://www.acc.umu.se/~stric/
>>>> |- Student at Computing Science, University of Umeå
>>>> `- Sysadmin at {cs,acc}.umu.se
>>>>
>>>
>>
> 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs scripts

2011-09-10 Thread Toby Thain
On 09/09/11 6:33 AM, Sriram Narayanan wrote:
> Plus, you'll need an & character at the end of each command.
> 

Only one of the commands needs to be backgrounded.

--Toby

> -- Sriram
> 
> On 9/9/11, Tomas Forsman  wrote:
>> On 09 September, 2011 - cephas maposah sent me these 0,4K bytes:
>>
>>> i am trying to come up with a script that incorporates other scripts.
>>>
>>> eg
>>> zfs send pool/filesystem1@100911 > /backup/filesystem1.snap
>>> zfs send pool/filesystem2@100911 > /backup/filesystem2.snap
>>
>> #!/bin/sh
>> zfs send pool/filesystem1@100911 > /backup/filesystem1.snap &
>> zfs send pool/filesystem2@100911 > /backup/filesystem2.snap
>>
>> ..?
>>
>>> i need to incorporate these 2 into a single script with both commands
>>> running concurrently.
>>
>> /Tomas
>> --
>> Tomas Forsman, st...@acc.umu.se, http://www.acc.umu.se/~stric/
>> |- Student at Computing Science, University of Umeå
>> `- Sysadmin at {cs,acc}.umu.se
>>
> 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zpool with data errors

2011-06-21 Thread Toby Thain
On 21/06/11 7:54 AM, Todd Urie wrote:
> The volumes sit on HDS SAN.  The only reason for the volumes is to
> prevent inadvertent import of the zpool on two nodes of a cluster
> simultaneously.  Since we're on SAN with Raid internally, didn't seem to
> we would need zfs to provide that redundancy also.

You do if you want self-healing, as Tomas points out. A non-redundant
pool, even on mirrored or RAID storage, offers no ability to recover
from detected errors anywhere on the data path. To gain this benefit of
ZFS, it needs to manage redundancy.

On the upside, ZFS at least *detected* the errors, while other systems
would not.

--Toby
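
(One way to get that benefit while keeping the SAN is to present two LUNs and
let ZFS mirror them itself; a minimal sketch with hypothetical device names:

  # ZFS-managed mirror across two array LUNs: a checksum error found on one
  # side can now be repaired from the other.
  zpool create tank mirror c4t0d0 c4t1d0
  zpool status -x tank
)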

> 
> On Tue, Jun 21, 2011 at 4:17 AM, Remco Lengers  > wrote:
> 
> Todd,
> 
> Is that ZFS on top of VxVM ?  Are those volumes okay? I wonder if
> this is really a sensible combination?
> 
> ..Remco
> 
> 
> On 6/21/11 7:36 AM, Todd Urie wrote:
>> I have a zpool that shows the following from a zpool status -v
>> 
>>...
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] question about COW and snapshots

2011-06-18 Thread Toby Thain
On 18/06/11 12:44 AM, Michael Sullivan wrote:
> ...
> Way off-topic, but Smalltalk and its variants do this by maintaining the 
> state of everything in an operating environment image.
> 

...Which is in memory, so things are rather different from the world of
filesystems.

--Toby

> But then again, I could be wrong.
> 
> Mike
> 
> ---
> Michael Sullivan   
> m...@axsh.us
> http://www.axsh.us/
> Phone: +1-662-259-
> Mobile: +1-662-202-7716
> 
> 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] question about COW and snapshots

2011-06-16 Thread Toby Thain
On 16/06/11 3:09 AM, Simon Walter wrote:
> On 06/16/2011 09:09 AM, Erik Trimble wrote:
>> We had a similar discussion a couple of years ago here, under the
>> title "A Versioning FS". Look through the archives for the full
>> discussion.
>>
>> The gist is that application-level versioning (and consistency) is
>> completely orthogonal to filesystem-level snapshots and consistency. 
>> IMHO, they should never be mixed together - there are way too many
>> corner cases and application-specific memes for a filesystem to ever
>> fully handle file-level versioning and *application*-level data
>> consistency.  Don't mistake one for the other, and, don't try to *use*
>> one for the other.  They're completely different creatures.
>>
> 
> I guess that is true of the current FSs available. Though it would be
> nice to essentially have a versioning FS in the kernel rather than an
> application in userspace. But I digress. I'll use SVN and webdav.


To use Svn correctly here, you have to resolve the same issue. Svn has a
global revision, just as a snapshot is a state for an *entire*
filesystem. You don't seem to have taken that into sufficient account
when talking about ZFS; it doesn't align with your goal of consistency
from the point of view of a *single document*.

You'll only be able to make a useful snapshot in Svn at moments when
*all* documents in the repository are in a consistent state (I'm
assuming this is a multi-user system). That's a much stronger guarantee
than you probably 'require' for your purpose, so it makes me wonder
whether what you really want is a document database (or, to be honest,
an ordinary filesystem; you can "snapshot" single documents in an
ordinary filesystem using say hard links) where the state of each
session/document is *independent*. You can see that the latter model is
much more like Google Docs, not to mention simpler; and the "snapshot"
model is not like it at all.

--Toby
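
(A toy illustration of the hard-link approach mentioned above, with
hypothetical paths; it preserves the old contents only because most
applications save by writing a new file and renaming it over the original:

  mkdir -p /tank/docs/.versions
  ln /tank/docs/report.odt \
     /tank/docs/.versions/report.odt.$(date +%Y%m%d-%H%M%S)
)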

> 
> Thanks for the advice everyone.
> 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] question about COW and snapshots

2011-06-15 Thread Toby Thain
On 15/06/11 8:30 AM, Simon Walter wrote:
> On 06/15/2011 09:01 PM, Toby Thain wrote:
>>>> I know I've certainly had many situations where people wanted to
>>>> snapshot or
>>>> rev individual files everytime they're modified.  As I said - perfect
>>>> example is Google Docs.  Yes it is useful.  But no, it's not what ZFS
>>>> does.
>>> Exactly versions of a whole file, but that is different to a snapshot on
>>> every write.
>>>
>>> How you interpret "on every write" depends on where in the stack you are
>>> coming from.  If you think about an application a "write" is whey you
>>> save the document but at the ZPL layer that is multiple write(2) calls
>>> and maybe even some rename(2)/unlink(2)/close(2) calls as well.
>> That's one big problem with the naive plan of using snapshots.
>>
>> Another one is that snapshots are per-filesystem, while the intention
>> here is to capture a document in one user session. Taking a snapshot
>> will of course say nothing about the state of other user sessions. Any
>> document in the process of being saved by another user, for example,
>> will be corrupt.
> 
> Would it be? I think that's pretty lame for ZFS to corrupt data. 

ZFS isn't corrupting anything (Michael is correct, "inconsistent" would
have been a better word). The inevitable inconsistency *from the
application's perspective* results from the error of thinking a snapshot
is automagically correct for document sessions (which are not aware of
what happens at levels underneath).

Likewise, you can backup your RDBMS with tar, if you like, but the
result may not have integrity from the database's point of view. (Same
applies to filesystem backups with dd, etc, etc).


> If I
> were to manually create a snapshot and two users were writing to the FS,
> how would ZFS handle that? Are you saying it would corrupt the data? I
> thought snapshots could be taken regardless of if there is activity.

Of course they can, but (as Jim explains) this cannot guarantee
consistency on higher levels without *interacting* with higher levels
(for example, quiescing a database, or fully flushing a document).
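
(In other words, the snapshot has to be bracketed by whatever "quiesce" means
at the application layer; a minimal sketch, where app_quiesce and app_resume
are hypothetical stand-ins for that step and the dataset name is made up:

  app_quiesce &&
      zfs snapshot tank/docs@$(date +%Y%m%d-%H%M%S)
  app_resume
)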

> 
> If I monitor (via Dtrace?) for what equates to a "save", would that be
> sufficient? 

Darren explained some reasons why this may not be trivial.

> If It's particular sequence, then should it not be able to
> be monitored for? Since it is a NAS, only one or two daemons will be
> writing to the particular FS. I can get expected behaviour from these
> daemons.
> 
> If it really is a retarded idea, at least with the current FSs
> available, 

It's not a fault of the filesystem. It's just an architectural problem
to solve, most likely on a different layer.

--Toby

> then I'll just use SVN and manage the repos somehow. I just
> thought I'd see if it's an option.
> 
> Anyone know how Google Docs does it?
> 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] question about COW and snapshots

2011-06-15 Thread Toby Thain
On 15/06/11 7:45 AM, Darren J Moffat wrote:
> On 06/15/11 12:29, Edward Ned Harvey wrote:
>>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>>> boun...@opensolaris.org] On Behalf Of Richard Elling
>>>
>>> That would suck worse.
>>
>> Don't mind Richard.  He is of the mind that ZFS is perfect for everything
>> just the way it is, and anybody who wants anything different should
>> adjust
>> their thought process.
> 
> I suspect, rather, that Richard equated "write" with write(2) /
> dmu_write() calls, and that would suck performance-wise.
> 
> I also suspect that what Simon wants isn't a snapshot of every little
> write(2) level call but when the file is completed being updated, maybe
> on close(2) [ but that assumes the app does actually call close() ].
> 

That's how I interpreted it.

>> I know I've certainly had many situations where people wanted to
>> snapshot or
>> rev individual files everytime they're modified.  As I said - perfect
>> example is Google Docs.  Yes it is useful.  But no, it's not what ZFS
>> does.
> 
> Exactly versions of a whole file, but that is different to a snapshot on
> every write.
> 
> How you interpret "on every write" depends on where in the stack you are
> coming from.  If you think about an application a "write" is when you
> save the document but at the ZPL layer that is multiple write(2) calls
> and maybe even some rename(2)/unlink(2)/close(2) calls as well.

That's one big problem with the naive plan of using snapshots.

Another one is that snapshots are per-filesystem, while the intention
here is to capture a document in one user session. Taking a snapshot
will of course say nothing about the state of other user sessions. Any
document in the process of being saved by another user, for example,
will be corrupt.

The proposal seems to be aimed at the wrong part of the stack. The
comparison with Google Docs is revealing.

--Toby

> If you move further down then doing a snapshot on every dmu_write() call
> is fundamentally at odds with how ZFS works.
> 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS receive checksum mismatch

2011-06-09 Thread Toby Thain
On 09/06/11 1:33 PM, Paul Kraus wrote:
> On Thu, Jun 9, 2011 at 1:17 PM, Jim Klimov  wrote:
>> 2011-06-09 18:52, Paul Kraus wrote:
>>>
>>> On Thu, Jun 9, 2011 at 8:59 AM, Jonathan Walker  wrote:
>>>
 New to ZFS, I made a critical error when migrating data and
 configuring zpools according to needs - I stored a snapshot stream to
 a file using "zfs send -R [filesystem]@[snapshot]>[stream_file]".
>>>
>>> Why is this a critical error, I thought you were supposed to be
>>> able to save the output from zfs send to a file (just as with tar or
>>> ufsdump you can save the output to a file or a stream) ?
>>> Was the cause of the checksum mismatch just that the stream data
>>> was stored as a file ? That does not seem right to me.
>>>
>> As recently mentioned on the list (regarding tape backups, I believe)
>> the zfs send stream format was not intended for long-term storage.
> 
> Only due to possible changes in the format.
> 
>> If some bits in the saved file flipped,
> 
> Then you have a bigger problem, namely that the file was corrupted.

This fragility is one of the main reasons it has always been discouraged
(& regularly on this list) as an archive.

--Toby
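
(The usual recommendation is to pipe the stream straight into a live pool
rather than archiving the stream itself, so every received block ends up
checksummed and scrubbable; a sketch with hypothetical pool and host names:

  zfs send -R tank/data@20110609 | \
      ssh backuphost zfs receive -F -d -u backup/tank
)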

> ...
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-08 Thread Toby Thain
On 08/05/11 10:31 AM, Edward Ned Harvey wrote:
>...
> Incidentally, do fsync() and sync return instantly or wait?  Because "time
> sync" might produce 0 sec every time even if there were something waiting to
> be flushed to disk.

The semantics need to be synchronous. Anything else would be a horrible bug.

--Toby
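
(A quick way to observe this on a live system, with a hypothetical scratch
path and no pretence of being a benchmark: dirty some data, then time the
flush.

  dd if=/dev/zero of=/tank/scratch/flushtest bs=1024k count=256
  time sync      # noticeably longer than 'time sync' on an idle system
  rm /tank/scratch/flushtest
)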

> 
> 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-08 Thread Toby Thain
On 06/05/11 9:17 PM, Erik Trimble wrote:
> On 5/6/2011 5:46 PM, Richard Elling wrote:
>> ...
>> Yes, perhaps a bit longer for recursive destruction, but everyone here
>> knows recursion is evil, right? :-)
>>   -- richard

> You, my friend, have obviously never worshipped at the Temple of the
> Lambda Calculus, nor been exposed to the Holy Writ that is "Structure and
> Interpretation of Computer Programs"

As someone who is studying Scheme and SICP, I had no trouble seeing that
Richard was not being serious :)

> (http://mitpress.mit.edu/sicp/full-text/book/book.html).
> 
> I sentence you to a semester of 6.001 problem sets, written by Prof
> Sussman sometime in the 1980s.
> 


--Toby



> (yes, I went to MIT.)
> 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS on MIPS

2011-04-07 Thread Toby Thain
On 07/04/11 7:53 PM, Learner Study wrote:
> Hello,
> 
> I was thinking of moving (porting) ZFS into my linux environment
> (2.6.30sh kernel) on MIPS architecture i.e. instead of using native
> ext4/xfs file systems, I'd like to try out ZFS.
> 
> I tried to google for it but couldn't find anything relevant. Has
> someone done this? Does it make sense? The key motivation of doing
> this is de-duplication support that comes along with ZFS.
> 

The existing ZFS on Linux projects should be relevant. See fairly recent
list postings.

--Toby


> Thanks for any pointers...
> 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and standard backup programs

2011-03-23 Thread Toby Thain
On 23/03/11 12:13 PM, Linder, Doug wrote:
> OK, I know this is only tangentially related to ZFS, but we’re desperate
> and I thought someone might have a clue or idea of what kind of thing to
> look for.  Also, this issue is holding up widespread adoption of ZFS at
> our shop.  It’s making the powers-that-be balk a little –
> understandably.  If we can’t back up stuff on ZFS, we can’t really use it.
> 
>  
> 
> We have a ZFS filesystem that’s guarded by the Vormetric encryption
> product to prevent unauthorized users from reading it.  Our backup
> software, HP’s Data Protector, refuses to back up this dataset even
> though it runs as a user with privileges to read the files.  When we
> guard a ZFS dataset with Vormetric, we get the alerts below in HP DP and
> the data is not backed up.  Any suggestions at all are welcome.
> 
>  
> 
> Note that, yes - files in similarly protected directories on UFS file
> systems do get backed up correctly.  So it has **something** to do with
> ZFS.
> 
>  

Wouldn't this firstly be a question for the vendor of Vormetric?

--Toby

> 
> [Warning] From: v...@hostname.ourdomain.com
>  "/directoryname"  Time: 3/23/2011
> 3:02:25 AM
> 
>   /directoryname
> 
>   Directory is a mount point to a different filesystem.
> 
>   Backed up as empty directory.
> 
>  
> 
> [Minor] From: v...@hostname.ourdomain.com
>  "/directoryname"  Time: 3/23/2011
> 3:02:25 AM
> 
> [ 81:84 ] /directoryname
> 
>   Cannot read ACLs: ([89] Operation not applicable).
> 
>  
> 
>  
> 
> 
> 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] best migration path from Solaris 10

2011-03-18 Thread Toby Thain
On 18/03/11 5:56 PM, Paul B. Henson wrote:
> We've been running Solaris 10 for the past couple of years, primarily to
> leverage zfs to provide storage for about 40,000 faculty, staff, and
> students ... and at this point want to start reevaluating our best
> migration option to move forward from Solaris 10.
> 
> There's really nothing else available that is comparable to zfs (perhaps
> btrfs someday in the indefinite future, but who knows when that day
> might come), so our options would appear to be Solaris 11 Express,
> Nexenta (either NexentaStor or NexentaCore), and OpenIndiana (FreeBSD is
> occasionally mentioned as a possibility, but I don't really see that as
> suitable for our enterprise needs).
> 

You're not the only institution asking this question; here's a couple of
blog posts by Chris Siebenmann:
 * http://utcc.utoronto.ca/~cks/space/blog/solaris/OurFutureWithSolaris
 * http://utcc.utoronto.ca/~cks/space/blog/solaris/OurSolarisAlternatives

regards
--Toby
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Performance

2011-02-27 Thread Toby Thain
On 27/02/11 9:59 AM, Edward Ned Harvey wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of David Blasingame Oracle
>>
>> Keep pool space under 80% utilization to maintain pool performance.
> 
> For what it's worth, the same is true for any other filesystem too. 

I would expect COW puts more pressure on near-full behaviour compared to
write-in-place filesystems. If that's not true, somebody correct me.

--Toby
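
(Checking where a pool stands is cheap; the pool name below is hypothetical,
and the CAP column of zpool list is the percentage of pool space allocated:

  zpool list tank
  zfs list -o space tank   # per-dataset breakdown of used/available space
)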

> What
> really matters is the availability of suitably large sized unused sections
> of the hard drive. ...
> 
> 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] stupid ZFS question - floating point operations

2010-12-22 Thread Toby Thain
On 22/12/10 2:44 PM, Jerry Kemp wrote:
> I have a coworker, who's primary expertise is in another flavor of Unix.
> 
> This coworker lists floating point operations as one of ZFS detriments.
> 

Perhaps he can point you also to the equally mythical competing
filesystem which offers ZFS' advantages.

--Toby

> I'm not really sure what he means specifically, or where he got this
> reference from.
> ...
> Jerry
> 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Ideas for ghetto file server data reliability?

2010-11-15 Thread Toby Thain
On 15/11/10 9:28 PM, Edward Ned Harvey wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Toby Thain
>>
>> The corruption will at least be detected by a scrub, even in cases where
> it
>> cannot be repaired.
> 
> Not necessarily.  Let's suppose you have some bad memory, and no ECC.  Your
> application does 1 + 1 = 3.  Then your application writes the answer to a
> file.  Without ECC, the corruption happened in memory and went undetected.
> Then the corruption was written to file, with a correct checksum.  So in
> fact it's not filesystem corruption, and ZFS will correctly mark the
> filesystem as clean and free of checksum errors.
> 

I meant corruption after the point at which the application passes its
buffer to zfs. But you are right, the checksum could conceivably be
correct in this case as well.



> In conclusion:
> 
> Use ECC if you care about your data.
> Do backups if you care about your data.
> 

Yes. Especially the latter :)

--Toby

> Don't be a cheapskate, or else, don't complain when you get bitten by lack
> of adequate data protection.
> 
> 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Ideas for ghetto file server data reliability?

2010-11-15 Thread Toby Thain
On 15/11/10 7:54 PM, Bryan Horstmann-Allen wrote:
> +--
> | On 2010-11-15 11:27:02, Toby Thain wrote:
> | 
> | > Backups are not going to save you from bad memory writing corrupted data 
> to
> | > disk.
> | 
> | It is, however, a major motive for using ZFS in the first place.
> 
> In this context, not trusting your disks is the motive. If corruption (even
> against metadata) happens in-memory, ZFS will happily write it to disk. 

The corruption will at least be detected by a scrub, even in cases where
it cannot be repaired.

--Toby
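
(For completeness, the check itself is just, with a hypothetical pool name:

  zpool scrub tank
  zpool status -v tank   # lists any files with unrepairable checksum errors
)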

> Has
> this behavior changed in the last 6 months?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Ideas for ghetto file server data reliability?

2010-11-15 Thread Toby Thain
On 15/11/10 10:32 AM, Bryan Horstmann-Allen wrote:
> +--
> | On 2010-11-15 10:21:06, Edward Ned Harvey wrote:
> | 
> | Backups.
> | 
> | Even if you upgrade your hardware to better stuff... with ECC and so on ...
> | There is no substitute for backups.  Period.  If you care about your data,
> | you will do backups.  Period.
> 
> Backups are not going to save you from bad memory writing corrupted data to
> disk.

It is, however, a major motive for using ZFS in the first place.

--Toby

> 
> If your RAM flips a bit and writes garbage to disk, and you back up that
> garbage, guess what: Your backups are full of garbage.
> 
> Invest in ECC RAM and hardware that is, at the least, less likely to screw 
> you.
> 
> Test your backups to ensure you can trust them.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Any limit on pool hierarchy?

2010-11-09 Thread Toby Thain
On 09/11/10 11:46 AM, Maurice Volaski wrote:
> ...
> 

Is that horrendous mess Outlook's fault? If so, please consider not
using it.

--Toby
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] hardware going bad

2010-10-27 Thread Toby Thain
On 27/10/10 4:21 PM, Krunal Desai wrote:
> I believe he meant a memory stress test, i.e. booting with a
> memtest86+ CD and seeing if it passed. 

Correct. The POST tests are not adequate.

--Toby


> Even if the memory is OK, the
> stress from that test may expose defects in the power supply or other
> components.
> 
> Your CPU temperature is 56C, which is not out-of-line for most modern
> CPUs (you didn't state what type of CPU it is). Heck, 56C would be
> positively cool for a NetBurst-based Xeon.
> 
> On Wed, Oct 27, 2010 at 4:17 PM, Harry Putnam  wrote:
>> Toby Thain  writes:
>>
>>> On 27/10/10 3:14 PM, Harry Putnam wrote:
>>>> It seems my hardware is getting bad, and I can't keep the os running
>>>> for more than a few minutes until the machine shuts down.
>>>>
>>>> It will run 15 or 20 minutes and then shutdown
>>>> I haven't found the exact reason for it.
>>>>
>>>
>>> One thing to try is a thorough memory test (few hours).
>>>
>>
>> It does some kind of memory test on bootup.  I recall seeing something
>> about high memory.  And shows all of the 3GB installed
>>
>> I just now saw last time it came down, that the cpu was at 134
>> degrees.
>>
>> And that would have been after it cooled a couple of minutes.
>>
>> I don't think that is astronomical but it may have been a good bit
>> higher under load.  But still wouldn't something show in
>> /var/adm/messages if that were the problem?
>>
>> Are there not a list of standard things to grep for in logs that would
>> indicate various troubles?  Surely system admins would have some kind
>> of reporting tool to get ahead of serious troubles.
>>
>> I've had one or another problem with this machine for a couple of
>> months now so thinking of scrapping it out, and putting a new setup in
>> that roomy  midtower.
>>
>> Where can I find a guide to help me understand how to build up a
>> machine and then plug my existing discs and data into the new OS?
>>
>> I don't mean the hardware part but that part particularly opensolaris
>> and zfs related.
>>
>>
> 
> 
> 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] hardware going bad

2010-10-27 Thread Toby Thain
On 27/10/10 3:14 PM, Harry Putnam wrote:
> It seems my hardware is getting bad, and I can't keep the os running
> for more than a few minutes until the machine shuts down.
> 
> It will run 15 or 20 minutes and then shutdown
> I haven't found the exact reason for it.
> 

One thing to try is a thorough memory test (few hours).

--Toby

> Or really any thing in logs that seems like a reason.
> 
> It may be because I don't know what to look for.
> 
> I have been having some trouble with corrupted data in one pool but
> I thought I'd gotten it cleared up and posted to that effect in
> another thread.
> 
> zpool status on all pools shows thumbs up.
> 
> What are some key words I should be looking for in /var/adm/messages?
> 
> On this next shutdown (the machine is currently running) I'm going
> into bios and see what temperatures are like... but passing my hand
> around the insides of the box seems to indicate nothing unusual.
> 
> I'm not sure how to query the OS for temperatures while its running.
> 
> But if heat is a problem something would be in /var/adm/messages right?
> 
> 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding corrupted files

2010-10-14 Thread Toby Thain


On 14-Oct-10, at 11:48 AM, Edward Ned Harvey wrote:


From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Toby Thain


I don't want to heat up the discussion about ZFS managed discs vs.
HW raids, but if RAID5/6 would be that bad, no one would use it
anymore.


It is. And there's no reason not to point it out. The world has


Well, neither one of the above statements is really fair.

The truth is: raid5/6 are generally not that bad.  Data integrity failures
are not terribly common (maybe one bit per year out of 20 large disks or
something like that.)


Such statistics assume that no part of the stack (drive, cable,  
network, controller, memory, etc) has any fault and is operating  
normally. This is, indeed, the base presumption of RAID (which also  
assumes a perfect error reporting chain).




And in order to reach the conclusion "nobody would use it," the people using
it would have to first *notice* the failure.  Which they don't.  That's kind
of the point.


Indeed it is. And then we could talk about self healing (also missing  
from RAID).


--Toby



Since I started using ZFS in production, about a year ago, on three servers
totaling approx 1.5TB used, I have had precisely one checksum error, which
ZFS corrected.  I have every reason to believe, if that were on a raid5/6,
the error would have gone undetected and nobody would have noticed.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding corrupted files

2010-10-14 Thread Toby Thain


On 14-Oct-10, at 3:27 AM, Stephan Budach wrote:


I'd like to see those docs as well.
As all HW raids are driven by software, of course - and software can  
be buggy.




It's not that the software 'can be buggy' - that's not the point here.  
The point being made is that conventional RAID just doesn't offer data  
*integrity* - it's not a design factor. The necessary mechanisms  
simply aren't there.


Contrariwise, with ZFS, end to end integrity is *designed in*. The  
'papers' which demonstrate this difference are the design documents;  
anyone could start with Mr Bonwick's blog - with which I am sure most  
list readers are already familiar.


http://blogs.sun.com/bonwick/en_US/category/ZFS
e.g. http://blogs.sun.com/bonwick/en_US/entry/zfs_end_to_end_data

I don't want to heat up the discussion about ZFS managed discs vs.  
HW raids, but if RAID5/6 would be that bad, no one would use it  
anymore.


It is. And there's no reason not to point it out. The world has  
changed a lot since RAID was 'state of the art'. It is important to  
understand its limitations (most RAID users apparently don't).


The saddest part is that your experience clearly shows these  
limitations. As expected, the hardware RAID didn't protect your data,  
since it's designed neither to detect nor repair such errors.


If you had been running any other filesystem on your RAID you would  
never even have found out about it until you accessed a damaged part  
of it. Furthermore, backups would probably have been silently corrupt,  
too.


As many other replies have said: The correct solution is to let ZFS,  
and not conventional RAID, manage your redundancy. That's the bottom  
line of any discussion of "ZFS managed discs vs. HW raids". If still  
unclear, read Bonwick's blog posts, or the detailed reply to you from  
Edward Harvey (10/6).


--Toby



So… just post the link and I will take a close look at the docs.

Thanks,
budy
--
This message posted from opensolaris.org


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding corrupted files

2010-10-07 Thread Toby Thain


On 7-Oct-10, at 1:22 AM, Stephan Budach wrote:


Hi Edward,

these are interesting points. I have considered a couple of them,  
when I started playing around with ZFS.


I am not sure whether I disagree with all of your points, but I  
conducted a couple of tests, where I configured my raids as jbods  
and mapped each drive out as a seperate LUN and I couldn't notice a  
difference in performance in any way.





The integrity issue is, however, clear cut. ZFS must manage the  
redundancy.


ZFS just alerted you that your 'FC RAID' doesn't actually provide data  
integrity, & you just lost the 'calculated' bet. :)


--Toby


I'd love to discuss this in a seperate thread, but first I will have  
to check the archives an Google. ;)


Thanks,
budy
--
This message posted from opensolaris.org


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS with Equallogic storage

2010-08-21 Thread Toby Thain


On 21-Aug-10, at 3:06 PM, Ross Walker wrote:

On Aug 21, 2010, at 2:14 PM, Bill Sommerfeld wrote:



On 08/21/10 10:14, Ross Walker wrote:
...
Would I be better off forgoing resiliency for simplicity, putting  
all my faith into the Equallogic to handle data resiliency?


IMHO, no; the resulting system will be significantly more brittle.


Exactly how brittle I guess depends on the Equallogic system.


If you don't let zfs manage redundancy, Bill is correct: it's a more  
fragile system that *cannot* self heal data errors in the (deep)  
stack. Quantifying the increased risk, is a question that Richard  
Elling could probably answer :)


--Toby



-Ross



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Opensolaris is apparently dead

2010-08-17 Thread Toby Thain


On 17-Aug-10, at 1:05 PM, Andrej Podzimek wrote:

I did not say there is something wrong about published reports. I  
often read
them. (Who doesn't?) However, there are no trustworthy reports on  
this topic

yet, since Btrfs is unfinished. Let's see some examples:

(1) http://www.phoronix.com/scan.php?page=article&item=zfs_ext4_btrfs&num=1


My little few yen in this massacre: Phoronix usually compares apples
with oranges and pigs with candies. So be careful.


Nobody said one should blindly trust Phoronix. ;-) In fact I clearly  
said the contrary. I mentioned the famous example of a totally  
absurd "benchmark" that used crippled and crashing code from the ZEN  
patchset to benchmark Reiser4.



Disclaimer: I use Reiser4


A "Killer FS"™. :-)


I had been using Reiser4 for quite a long time before Hans Reiser  
was convicted for the murder of his wife. There was absolutely no  
(objective technical) reason to make a change afterwards. :-)


Thank you, well said!! The 'killer' gag wasn't funny the first time and
it certainly isn't any funnier now. It's in extremely poor taste,  
apart from being childish.


As far as speed is concerned, Reiser4 really is a "Killer FS" (in a  
very positive sense).


Reiser3 is fast and solid too, I like others have used it happily on  
dozens of servers for many years and continue to do so. (At least,  
where I can't use ZFS :-X)



It is now maintained by Edward Shishkin, a former Namesys employee.


Who is also sharing his expertise with the btrfs project, a very  
positive outcome.


--Toby

Patches are available for each kernel version.
(http://www.kernel.org/pub/linux/kernel/people/edward/reiser4/reiser4-for-2.6/)


Admittedly, with the advent of Ext4 and Btrfs, Reiser4 is not so  
"brilliant" any more. Reiser4 could have been a much larger project  
with many features known from today's ZFS/Btrfs (encryption,  
compression and perhaps even snapshots and subvolumes), but long  
disputes around kernel integration and the events around Hans Reiser  
blocked the whole effort and Reiser4 lost its advantage.


Andrej



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Cache flush (or the lack of such) and corruption

2010-07-10 Thread Toby Thain


On 10-Jul-10, at 4:57 PM, Roy Sigurd Karlsbakk wrote:


- Original Message -
Depends on the failure mode. I've spent hundreds (thousands?) of  
hours

attempting to recover data from backup tape because of bad hardware,
firmware,
and file systems. The major difference is that ZFS cares that the  
data

is not
correct, while older file systems did not care about the data.


It still seems like ZFS has a problem with its metadata. Reports of  
loss of pools because of metadata errors is what is worrying me. Can  
you give me any input on how to avoid this?


Roy

It needs to be pointed out that this only causes an integrity problem  
for zfs *when the hardware stack is faulty* (not respecting flush).  
And it obviously then affects all systems which assume working  
barriers (RDBMS, reiser3fs, ext3fs, other journaling systems, etc).


--Toby



Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented
intelligibly. It is an elementary imperative for all pedagogues to avoid
excessive use of idioms of foreign origin. In most cases, adequate and
relevant synonyms exist in Norwegian.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs corruptions in pool

2010-06-08 Thread Toby Thain


On 6-Jun-10, at 7:11 AM, Thomas Maier-Komor wrote:


On 06.06.2010 08:06, devsk wrote:
I had an unclean shutdown because of a hang and suddenly my pool is  
degraded (I realized something is wrong when python dumped core a  
couple of times).


This is before I ran scrub:

 pool: mypool
state: DEGRADED
status: One or more devices has experienced an error resulting in  
data

   corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise  
restore the

   entire pool from backup.
  see: http://www.sun.com/msg/ZFS-8000-8A
scan: scrub repaired 0 in 0h7m with 0 errors on Mon May 31 09:00:27  
2010

config:

   NAMESTATE READ WRITE CKSUM
   mypool  DEGRADED 0 0 0
 c6t0d0s0  DEGRADED 0 0 0  too many errors

errors: Permanent errors have been detected in the following files:

   mypool/ROOT/May25-2010-Image-Update:<0x3041e>
   mypool/ROOT/May25-2010-Image-Update:<0x31524>
   mypool/ROOT/May25-2010-Image-Update:<0x26d24>
   mypool/ROOT/May25-2010-Image-Update:<0x37234>
   //var/pkg/download/d6/d6be0ef348e3c81f18eca38085721f6d6503af7a
   mypool/ROOT/May25-2010-Image-Update:<0x25db3>
   //var/pkg/download/cb/cbb0ff02bcdc6649da3763900363de7cff78ec72
   mypool/ROOT/May25-2010-Image-Update:<0x26cf6>


I ran scrub and this is what it has to say afterwards.

 pool: mypool
state: DEGRADED
status: One or more devices has experienced an unrecoverable  
error.  An
   attempt was made to correct the error.  Applications are  
unaffected.
action: Determine if the device needs to be replaced, and clear the  
errors
   using 'zpool clear' or replace the device with 'zpool  
replace'.

  see: http://www.sun.com/msg/ZFS-8000-9P
scan: scrub repaired 0 in 0h11m with 0 errors on Sat Jun  5  
22:43:54 2010

config:

   NAMESTATE READ WRITE CKSUM
   mypool  DEGRADED 0 0 0
 c6t0d0s0  DEGRADED 0 0 0  too many errors

errors: No known data errors

Few of questions:

1. Have the errors really gone away? Can I just clear and be  
content that errors are really gone?


2. Why did the errors occur anyway if ZFS guarantees on-disk  
consistency? I wasn't writing anything. Those files were definitely  
not being touched when the hang and unclean shutdown happened.


I mean I don't mind if I create or modify a file and it doesn't  
land on disk because on unclean shutdown happened but a bunch of  
unrelated files getting corrupted, is sort of painful to digest.


3. The action says "Determine if the device needs to be replaced".  
How the heck do I do that?



Is it possible that this system runs on a virtual box? At least I've
seen such a thing happen on a Virtual Box but never on a real machine.


As I postulated in the relevant forum thread there:
http://forums.virtualbox.org/viewtopic.php?t=13661
(can't check URL, the site seems down for me atm)



The reason why the errors have gone away might be that metadata has
three copies, IIRC. So if your disk only had corruption in the metadata
area, these errors can be repaired by scrubbing the pool.

The smartmontools might help you figure out if the disk is broken. But
if you only had an unexpected shutdown and now everything is clean after
a scrub, I wouldn't expect the disk to be broken. You can get the
smartmontools from opencsw.org.

If your system is really running on a VirtualBox VM, I'd recommend that you
turn off disk write caching in VirtualBox.


Specifically, stop it from ignoring cache flush. Caching is irrelevant  
if flushes are being correctly handled.


ZFS isn't the only software system that will suffer inconsistencies/ 
corruption in the guest if flushes are ignored, of course.


--Toby
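
(For VirtualBox specifically, the setting being referred to is the per-disk
IgnoreFlush extradata key; a sketch for the first IDE disk of a hypothetical
VM named "osol". The device path differs for SATA and other controllers, and
the VM must be powered off when it is changed:

  VBoxManage setextradata "osol" \
      "VBoxInternal/Devices/piix3ide/0/LUN#0/Config/IgnoreFlush" 0
)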



Search the OpenSolaris forum
of VirtualBox. There is an article somewhere on how to do this. IIRC the
subject is something like 'zfs pool corruption'. But it is also
somewhere in the docs.

HTH,
Thomas


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Consolidating a huge stack of DVDs using ZFS dedup: automation?

2010-03-02 Thread Toby Thain


On 2-Mar-10, at 4:31 PM, valrh...@gmail.com wrote:


Freddie: I think you understand my intent correctly.

This is not about a perfect backup system. The point is that I have  
hundreds of DVDs that I don't particularly want to sort out, but  
they are pretty useless from a management standpoint in their  
current form. ZFS + dedup would be the way to at least get them all  
in one place, where at least I can search, etc.---which is pretty  
much impossible on a stack of disks.


I also don't want file-level dedup, as a lot of these disks are a  
"oh, it's the end of the day; I'm going to burn what I worked on  
today, so if my computer dies I won't be completely stuck on this  
project..."



Wow, you are going to like snapshots and redundancy a whole lot  
better, as a solution to that.


--Toby



File-level dedup would be a nightmare to sort out, because of lots  
of incremental changes---exactly the point of block-level dedup.


This is not an organized archive at all; I just want to consolidate  
a bunch of old disks, in the small case they could be useful, and  
do it without investing much time.


So does anyone know of an autoloader solution that would do this?
--
This message posted from opensolaris.org


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS with hundreds of millions of files

2010-02-24 Thread Toby Thain


On 24-Feb-10, at 3:38 PM, Tomas Ögren wrote:


On 24 February, 2010 - Bob Friesenhahn sent me these 1,0K bytes:


On Wed, 24 Feb 2010, Steve wrote:


The overhead I was thinking of was more in the pointer structures...
(bearing in mind this is a 128 bit file system), I would guess that
memory requirements would be HUGE for all these files...otherwise  
arc

is gonna struggle, and paging system is going mental?


It is not reasonable to assume that zfs has to retain everything in
memory.

I have a directory here containing a million files and it has not  
caused
any strain for zfs at all although it can cause considerable  
stress on

applications.

400 million tiny files is quite a lot and I would hate to use  
anything

but mirrors with so many tiny files.


Another thought is "am I using the correct storage model for this data"?



You're not the only one wondering that. :)

--Toby



/Tomas
--
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Poor ZIL SLC SSD performance

2010-02-19 Thread Toby Thain


On 19-Feb-10, at 5:40 PM, Eugen Leitl wrote:


On Fri, Feb 19, 2010 at 11:17:29PM +0100, Felix Buenemann wrote:


I found the Hyperdrive 5/5M, which is a half-height drive bay sata
ramdisk with battery backup and auto-backup to compact flash at power
failure.
Promises 65,000 IOPS and thus should be great for ZIL. It's pretty
reasonable priced (~230 EUR) and stacked with 4GB or 8GB DDR2-ECC  
should

be more than sufficient.


Wouldn't it be better investing these 300-350 EUR into 16 GByte or  
more of

system memory, and a cheap UPS?



That would depend on the read/write mix, I think?

--Toby





http://www.hyperossystems.co.uk/07042003/hardware.htm


--
Eugen* Leitl  http://leitl.org
__
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives

2010-02-09 Thread Toby Thain


On 9-Feb-10, at 2:02 PM, Frank Cusack wrote:


On 2/9/10 12:03 PM +1100 Daniel Carosone wrote:

Snorcle wants to sell hardware.


LOL ... snorcle

But apparently they don't.  Have you seen the new website?   Seems  
like a

blatant attempt to kill the hardware business to me.



That's very sad. I love, love to spec the "rebooted" Bechtolsheim  
hardware designs.


--Toby

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Recover ZFS Array after OS Crash?

2010-02-05 Thread Toby Thain


On 5-Feb-10, at 11:35 AM, J wrote:


Hi all,

I'm building a whole new server system for my employer, and I  
really want to use OpenSolaris as the OS for the new file server.   
One thing is keeping me back, though: is it possible to recover a  
ZFS Raid Array after the OS crashes?  I've spent hours with Google to no avail.


To be more descriptive, I plan to have a Raid 1 array for the OS,  
and then I will need 3 additional Raid5/RaidZ/etc arrays for data  
archiving, backups and other purposes.  There is plenty of  
documentation on how to recover an array if one of the drives in  
the array fails, but what if the OS crashes?  Since ZFS is a  
software-based RAID, if the OS crashes is it even possible to  
recover any of the arrays?



Being a software system it is inherently more recoverable than  
hardware RAID (the latter is probably only going to be readable on  
exactly the same configuration, and if the constellations are aligned  
just right, and the black rooster has crowed four times, etc).


As Darren says, you can simply take either or both sides of the  
mirror and boot or access the pool on another ZFS-capable system.


It doesn't even have to use the same interfaces; last week I built a  
new Solaris 10 web server and migrated pool data from one half of a  
ZFS pool from the old server, connected by USB/SATA adapter. This  
kind of flexibility (not to mention data integrity) just isn't there  
with HW RAID.


--Toby
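
(The mechanics after a rebuild amount to, with a hypothetical pool name:

  zpool import           # scan attached devices and list importable pools
  zpool import -f tank   # import it; -f if the old OS never exported it
)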


--
This message posted from opensolaris.org


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives

2010-02-03 Thread Toby Thain


On 2-Feb-10, at 10:11 PM, Marc Nicholas wrote:




On Tue, Feb 2, 2010 at 9:52 PM, Toby Thain  
 wrote:


On 2-Feb-10, at 1:54 PM, Orvar Korvar wrote:

100% uptime for 20 years?

So what makes OpenVMS so much more stable than Unix? What is the  
difference?



The short answer is that uptimes like that are VMS *cluster*  
uptimes. Individual hosts don't necessarily have that uptime, but  
the cluster availability is maintained for extremely long periods.


You can probably find more discussion of this in comp.os.vms.

And the 15MB/sec of I/O throughput on that state-of-the-art cluster  
is something to write home about? ;)


Seriously, as someone alluded to earlier, we're not comparing  
apples to apples. And a 9000 series VAX Cluster was one of the  
earlier multi-user systems I worked on for reference ;)


Making that kind of stuff work with modern expectations and  
tolerances is a whole new kettle of fish...



OpenVMS runs on modern gear (Itanium).

--Toby




-marc


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives

2010-02-02 Thread Toby Thain


On 2-Feb-10, at 1:54 PM, Orvar Korvar wrote:


100% uptime for 20 years?

So what makes OpenVMS so much more stable than Unix? What is the  
difference?



The short answer is that uptimes like that are VMS *cluster* uptimes.  
Individual hosts don't necessarily have that uptime, but the cluster  
availability is maintained for extremely long periods.


You can probably find more discussion of this in comp.os.vms.

--Toby


--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] Best 1.5TB drives for consumer RAID?

2010-01-25 Thread Toby Thain


On 25-Jan-10, at 2:59 PM, Freddie Cash wrote:


We have the WDC WD15EADS-00P8B0 1.5 TB Caviar Green drives.

Unfortunately, these drives have the "fixed" firmware and the 8  
second idle timeout cannot be changed.


That sounds like a laptop spec, not a server spec! How silly. Maybe  
you can set up a tickle job to stop them idling during busy periods. :(


--Toby

Since we started replacing these drives in our pool about 6 weeks  
ago (replacing 1 drive per week), the drives have registered almost  
40,000 Load Cycles (head parking cycles).  At this rate, they won't  
last more than a year.  :(  Neither the wdidle3 nor the wdtler  
utilities will work with these drives.


The RE2/RE4-GP drives can be configured with a 5 minute idle timeout.
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] Best 1.5TB drives for consumer RAID?

2010-01-24 Thread Toby Thain


On 24-Jan-10, at 11:26 AM, R.G. Keen wrote:


...
I’ll just blather a bit. The most durable data backup medium humans  
have come up with was invented about 4000-6000 years ago. It’s  
fired cuneiform tablets as used in the Middle East. Perhaps one  
could include stone carvings of Egyptian and/or Maya cultures in  
that. ...


The modern computer era has nothing that even comes close. ...


 And I can’t bet on a really archival data storage technology  
becoming available. It may not get there in my lifetime.



A better digital archival medium may already exist:
http://hardware.slashdot.org/story/09/11/13/019202/Synthetic-Stone-DVD-Claimed-To-Last-1000-Years


--Toby

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs send/receive as backup - reliability?

2010-01-16 Thread Toby Thain


On 16-Jan-10, at 6:51 PM, Mike Gerdts wrote:

On Sat, Jan 16, 2010 at 5:31 PM, Toby Thain  
 wrote:

On 16-Jan-10, at 7:30 AM, Edward Ned Harvey wrote:

I am considering building a modest sized storage system with  
zfs. Some
of the data on this is quite valuable, some small subset to be  
backed
up "forever", and I am evaluating back-up options with that in  
mind.


You don't need to store the "zfs send" data stream on your backup media.
This would be annoying for the reasons mentioned - some risk of being able
to restore in future (although that's a pretty small risk) and inability to
restore with any granularity, i.e. you have to restore the whole FS if you
restore anything at all.

A better approach would be "zfs send" and pipe directly to "zfs receive" on
the external media.  This way, in the future, anything which can read ZFS
can read the backup media, and you have granularity to restore either the
whole FS, or individual things inside there.


There have also been comments about the extreme fragility of the data stream
compared to other archive formats. In general it is strongly discouraged for
these purposes.



Yet it is used in ZFS flash archives on Solaris 10


I can see the temptation, but isn't it a bit under-designed? I think  
Mr Nordin might have ranted about this in the past...


--Toby



and are slated for
use in the successor to flash archives.  This initial proposal seems
to imply using the same mechanism for a system image backup (instead
of just system provisioning).

http://mail.opensolaris.org/pipermail/caiman-discuss/2010-January/015909.html


--
Mike Gerdts
http://mgerdts.blogspot.com/


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs send/receive as backup - reliability?

2010-01-16 Thread Toby Thain


On 16-Jan-10, at 7:30 AM, Edward Ned Harvey wrote:

I am considering building a modest sized storage system with zfs.  Some
of the data on this is quite valuable, some small subset to be backed
up "forever", and I am evaluating back-up options with that in mind.


You don't need to store the "zfs send" data stream on your backup  
media.
This would be annoying for the reasons mentioned - some risk of  
being able
to restore in future (although that's a pretty small risk) and  
inability to
restore with any granularity, i.e. you have to restore the whole FS  
if you

restore anything at all.

A better approach would be "zfs send" and pipe directly to "zfs  
receive" on
the external media.  This way, in the future, anything which can  
read ZFS
can read the backup media, and you have granularity to restore  
either the

whole FS, or individual things inside there.


There have also been comments about the extreme fragility of the data  
stream compared to other archive formats. In general it is strongly  
discouraged for these purposes.


--Toby




Plus, the only way to guarantee the integrity of a "zfs send" data stream is
to perform a "zfs receive" on that data stream.  So by performing a
successful receive, you've guaranteed the datastream is not corrupt.  Yet.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] x4500/x4540 does the internal controllers have a bbu?

2010-01-12 Thread Toby Thain


On 12-Jan-10, at 10:40 PM, Brad wrote:


"(Caching isn't the problem; ordering is.)"

Weird I was reading about a problem where using SSDs (intel x25-e)  
if the power goes out and the data in cache is not flushed, you  
would have loss of data.


Could you elaborate on "ordering"?



ZFS integrity is maintained if the device correctly respects flush/ 
barrier semantics, which, as required, enforce an ordering of  
operations. The synchronous completion of flush guarantees that prior  
writes have durably completed. This is irrespective of write caching.


When a device does not properly flush, all bets are off, because  
inflight data (including unwritten data in the write cache) is not  
written in any determinate manner (you cannot know what was written,  
or in what order). The precondition for an atomic überblock update is  
that the tree of blocks it references has been fully written.


This has been mentioned periodically on the list. I thought somebody  
(Richard Elling?) did a nice capsule summary recently but I can't  
find it, so here are some other past list snippets by more  
knowledgeable people than I.


Neil Perrin, 6 Dec, 2009:


ZFS uses a 3 stage transaction model: Open, Quiescing and Syncing.
Transactions enter in Open. Quiescing is where a new Open stage has
started and waits for transactions that have yet to commit to finish.
Syncing is where all the completed transactions are pushed to the pool
in an atomic manner with the last write being the root of the new tree
of blocks (uberblock).

All the guarantees assume good hardware. As part of the new uberblock update
we flush the write caches of the pool devices. If this is broken all bets
are off.


14 Oct, 2009, James R. Van Artsdalen:


ZFS is different because it uses a different "superblock" every few
seconds (every transaction commit), and more importantly, the top levels
of the filesystem and some pool metadata are moved too.  After every tx
commit the uberblock is in a different place and some of its pointers
are to different places.

Moreover, blocks that were freed by this process are rapidly reclaimed.
The uberblock itself is not reclaimed for another 127 commits - several
minutes - but the things it points to are.  In other words as soon as tx
group N is committed, blocks from N-1 that are no longer referenced are
reclaimed as free space.

What goes wrong when the write fence / cache flush doesn't happen:

As soon as the uberblock for tx group N is written everything from N-1
that is no longer referenced is marked free for reallocation, and these
newly-freed blocks often contain part of the top levels of the N-1 pool
/ filesystems and metadata.

If the uberblock for N is _not_ written to media when it was supposed to
be then ZFS may happily reuse the blocks from N-1 while the uberblock
for N-1 is still the most recent on media, instead of N as ZFS expects.

In other words there might be a window where the most recent uberblock
on disk media (N-1) points to a toplevel directory block that is
overwritten with unrelated data - disaster.

That window closes once uberblock N hits media.  Unfortunately with no
write fence it might be a long time before that happens.  ...


10 Oct, 2009, James Relph quotes Dominic Giampaolo:


"Last, I do not believe that the crash protection scheme used
by ZFS can ever work reliably on drives that drop the flush
track cache request.  The only approach that is guaranteed to
work is to keep enough data in a log that when you remount the
drive, you can replay more data than the drive could have kept
cached."


Nicolas Williams, 13 Feb, 2009:


Also, note that ignoring barriers is effectively as bad as dropping
writes if there's any chance that some writes will never hit the disk
because of, say, power failures.  Imagine 100 txgs, but some writes from
the first txg never hitting the disk because the drive keeps them in the
cache without flushing them for too long, then you pull out the disk, or
power fails -- in that case not even fallback to older txgs will help
you, there'd be nothing that ZFS could do to help you.


Peter Schuller, 10 Feb, 2009:


What's stopping a RAID device from,
for example, ACK:ing an I/O before it is even in the cache? I have not
designed RAID controller firmware so I am not sure how likely that is,
but I don't see it as an impossibility. Disabling flushing because you
have battery backed nvram implies that your battery-backed nvram
guarantees ordering of all writes, and that nothing is ever placed in
said battery backed cache out of order.



Jeff Bonwick, 12 Feb, 2007:


Even if you disable the intent log, the transactional nature
of ZFS ensures preservation of event ordering.  Note that disk caches
don't come into it: ZFS builds up a wad of transactions in memory,
then pushes them out as a transaction group.  That entire group will
either commit or not.  ZFS writes all the new data to new locations,
then flushes all disk write caches, t

Re: [zfs-discuss] x4500/x4540 does the internal controllers have a bbu?

2010-01-12 Thread Toby Thain


On 12-Jan-10, at 5:53 AM, Brad wrote:

Has anyone worked with a x4500/x4540 and know if the internal raid  
controllers have a bbu?  I'm concern that we won't be able to turn  
off the write-cache on the internal hds and SSDs to prevent data  
corruption in case of a power failure.



A power fail won't corrupt data even with write cache enabled, under  
the assumptions about device behaviour recently mentioned on the  
list. (Caching isn't the problem; ordering is.)


The Sun machines must be tested and qualified for correct behaviour.

--Toby


--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] internal backup power supplies?

2010-01-11 Thread Toby Thain


On 11-Jan-10, at 5:59 PM, Daniel Carosone wrote:


With all the recent discussion of SSD's that lack suitable
power-failure cache protection, surely there's an opportunity for a
separate modular solution?

I know there used to be (years and years ago) small internal UPS's
that fit in a few 5.25" drive bays. They were designed to power the
motherboard and peripherals, with the advantage of simplicity and
efficiency that comes from being behind the PC PSU and working
entirely on DC.
...
Does anyone know of such a device being made and sold? Feel like
designing and marketing one, or publising the design?


FWIW I think Google server farm uses something like this.

--Toby



--
Dan.___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] HW raid vs ZFS

2010-01-11 Thread Toby Thain


On 11-Jan-10, at 1:12 PM, Bob Friesenhahn wrote:


On Mon, 11 Jan 2010, Anil wrote:


What is the recommended way to make use of a Hardware RAID  
controller/HBA along with ZFS?

...


Many people will recommend against using RAID5 in "hardware" since  
then zfs is not as capable of repairing errors, and because most  
RAID5 controller cards use a particular format on the drives so  
that the drives become tied to the controller brand/model and it is  
not possible to move the pool to a different system without using  
an identical controller.  If the controller fails and is no longer  
available for purchase, or the controller is found to have a design  
defect, then the pool may be toast.


+1 These drawbacks of proprietary RAID are frequently overlooked.

Marty Scholes had a neat summary in a posting here, 21 October 2009:
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg30452.html

Back when I did storage admin for a smaller company where  
availability was hyper-critical (but we couldn't afford EMC/ 
Veritas), we had a hardware RAID5 array.  After a few years of  
service, we ran into some problems:

* Need to restripe the array?  Screwed.
* Need to replace the array because current one is EOL?  Screwed.
* Array controller barfed for whatever reason?  Screwed.
* Need to flash the controller with latest firmware?  Screwed.
* Need to replace a component on the array, e.g. NIC, controller or  
power supply?  Screwed.

* Need to relocate the array?  Screwed.

If we could stomach downtime or short-lived storage solutions, none  
of this would have mattered.







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-30 Thread Toby Thain


On 29-Dec-09, at 11:53 PM, Ross Walker wrote:

On Dec 29, 2009, at 12:36 PM, Bob Friesenhahn  
 wrote:


...
 However, zfs does not implement "RAID 1" either.  This is easily  
demonstrated since you can unplug one side of the mirror and the  
writes to the zfs mirror will still succeed, catching up the  
mirror which is behind as soon as it is plugged back in.  When  
using mirrors, zfs supports logic which will catch that mirror  
back up (only sending the missing updates) when connectivity  
improves.  With RAID 1 there is no way to recover a mirror other  
than a full copy from the other drive.


That's not completely true these days as a lot of raid  
implementations use bitmaps to track changed blocks and a raid1  
continues to function when the other side disappears. The real  
difference is the mirror implementation in ZFS is in the file  
system and not at an abstracted block-io layer so it is more  
intelligent in its use and layout.


Another important difference is that ZFS has the means to know which  
side of a mirror returned valid data.
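
As a rough illustration only (invented object names, not the actual ZFS code
path), the self-healing read on a mirror amounts to something like this:

    import hashlib

    def mirror_read(block_id, sides, expected_checksum):
        """Try each side of the mirror; repair any side that returned bad data.

        `sides` are hypothetical objects with read(block_id) and
        write(block_id, data); `expected_checksum` comes from the parent
        block pointer (sha256 here stands in for the pool's checksum)."""
        for i, side in enumerate(sides):
            data = side.read(block_id)
            if hashlib.sha256(data).digest() == expected_checksum:
                # Good copy found: rewrite the sides that failed (self-heal).
                for bad in sides[:i]:
                    bad.write(block_id, data)
                return data
        raise IOError("no side of the mirror matched the checksum")

A plain RAID-1 read has no checksum to consult, so it cannot tell which side
is the good one.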


--Toby
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raidz data loss stories?

2009-12-22 Thread Toby Thain


On 22-Dec-09, at 3:33 PM, James Risner wrote:


...
Joerg Moellenkamp:
 I do "consider RAID5 as 'Stripeset with an interleaved  
Parity'", so I don't agree with the strong objection in this thread  
by many about the use of RAID5 to describe what raidz does.  I  
don't think many particularly care about the nuanced differences  
between hardware card RAID5 and raidz, other than knowing they  
would rather have raidz over RAID5.


These are hardly "nuanced differences". The most powerful  
capabilities of ZFS simply aren't available in RAID.


* Because ZFS is labelled a "filesystem", people assume it is  
analogous to a conventional filesystem then make misleading  
comparisons which fail to expose the profound differences;
* or people think it's a RAID or volume manager, assume it's just  
RAID relabelled, and fail to see where it goes beyond.


Of course it is neither, exactly, but a synthesis of the two which is  
far more capable than the two conventionally discrete layers in  
combination. (I know most of the list knows this :)


--Toby





--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] raidz data loss stories?

2009-12-22 Thread Toby Thain


On 22-Dec-09, at 12:42 PM, Roman Naumenko wrote:


On Tue, 22 Dec 2009, Ross Walker wrote:
Applying classic RAID terms to zfs is just plain
wrong and misleading  since zfs does not directly implement these  
classic RAID approaches

even though it re-uses some of the algorithms for data recovery.
Details do matter.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us,


I wouldn't agree.
SUN introduced just another set of marketing names for the well-known  
things, even adding some new functionality.


raid6 is raid6, no matter how you name it: raidz2, raid-dp, raid- 
ADG or somehow else.

Sounds nice, but it's just buzzwords.


The implied equivalence is wrong and confusing. That's the kind of  
mislabelling that Bob was complaining about.


--Toby



--
Roman
ro...@naumenko.ca
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] How do I determine dedupe effectiveness?

2009-12-19 Thread Toby Thain


On 19-Dec-09, at 11:34 AM, Colin Raven wrote:



...
When we are children, we are told that sharing is good.  In the  
case of references, sharing is usually good, but if there is a huge  
amount of sharing, then it can take longer to delete a set of files  
since the mutual references create a "hot spot" which must be  
updated sequentially.


Y'know, that is a GREAT point. Taking this one step further then -  
does that also imply that there's one "hot spot" physically on a  
disk that keeps getting read/written to?


Also, copy-on-write generally means that physical location of updates  
is ever-changing.


--T

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How do I determine dedupe effectiveness?

2009-12-19 Thread Toby Thain


On 19-Dec-09, at 2:01 PM, Colin Raven wrote:




On Sat, Dec 19, 2009 at 19:08, Toby Thain  
 wrote:


On 19-Dec-09, at 11:34 AM, Colin Raven wrote

Then again (not sure how gurus feel on this point) but I have this  
probably naive and foolish belief that snapshots (mostly) oughtta  
reside on a separate physical box/disk_array...



That is not possible, except in the case of a mirror, where one  
side is recoverable separately.
I was referring to zipping up a snapshot and getting it outta Dodge  
onto another physical box, or separate array.


or zfs send



You seem to be confusing "snapshots" with "backup".

No, I wasn't confusing them at all. Backups are backups. Snapshots  
however, do have some limited value as backups. They're no  
substitute, but augment a planned backup schedule rather nicely in  
many situations.


--T___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How do I determine dedupe effectiveness?

2009-12-19 Thread Toby Thain


On 19-Dec-09, at 11:34 AM, Colin Raven wrote:



...
Wait...whoah, hold on.
If snapshots reside within the confines of the pool, are you saying  
that dedup will also count what's contained inside the snapshots?


Snapshots themselves are only references, so yes.

I'm not sure why, but that thought is vaguely disturbing on some  
level.


Then again (not sure how gurus feel on this point) but I have this  
probably naive and foolish belief that snapshots (mostly) oughtta  
reside on a separate physical box/disk_array...



That is not possible, except in the case of a mirror, where one side  
is recoverable separately. You seem to be confusing "snapshots" with  
"backup".



"someplace else" anyway. I say "mostly" because I s'pose keeping 15  
minute snapshots on board is perfectly OK - and in fact handy.  
Hourly...ummm, maybe the same - but Daily/Monthly should reside  
"elsewhere".


When we are children, we are told that sharing is good.  In the  
case of references, sharing is usually good, but if there is a huge  
amount of sharing, then it can take longer to delete a set of files  
since the mutual references create a "hot spot" which must be  
updated sequentially.


Y'know, that is a GREAT point. Taking this one step further then -  
does that also imply that there's one "hot spot" physically on a  
disk that keeps getting read/written to?
if so then your point has even greater merit for more  
reasons...disk wear for starters,


That is not a problem. Disks don't "wear" - it is a non-contact medium.

--Toby


and other stuff too, no doubt.

Files are usually created slowly so we don't notice much impact  
from this sharing, but we expect (hope) that files will be deleted  
almost instantaneously.
Indeed, that's is completely logical. Also, something most of us  
don't spend time thinking about.

...___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How do I determine dedupe effectiveness?

2009-12-19 Thread Toby Thain


On 19-Dec-09, at 4:35 AM, Colin Raven wrote:


...
There is no original, there is no copy. There is one block with  
reference counters.


Many blocks, potentially shared, make up a de-dup'd file. Not sure  
why you write "one" here.




- Fred can rm his "file" (because clearly it isn't a file, it's a  
filename and that's all)
- result: the reference count is decremented by one - the data  
remains on disk.

OR
- Janet can rm her "filename"
- result: the reference count is decremented by one - the data  
remains on disk

OR
- both can rm the filename; the reference count is now decremented by  
two - but there were only two, so now it's really REALLY gone.


That explanation describes hard links.
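
With dedup, roughly speaking (a toy sketch, not the actual on-disk DDT
format), the reference counting happens per block in a table keyed by
checksum, while each user still has an ordinary, independent file:

    import hashlib

    class DedupTable:
        """Toy dedup table: checksum -> (stored block, reference count)."""
        def __init__(self):
            self.entries = {}

        def write_block(self, data):
            key = hashlib.sha256(data).digest()
            block, refs = self.entries.get(key, (data, 0))
            self.entries[key] = (block, refs + 1)   # new reference, data stored once
            return key                              # the file stores this pointer

        def free_block(self, key):
            block, refs = self.entries[key]
            if refs == 1:
                del self.entries[key]               # last reference: space is returned
            else:
                self.entries[key] = (block, refs - 1)

Removing Fred's file decrements the count on each of its blocks; the shared
data only disappears when the last reference anywhere in the pool is freed.
Two hard links, by contrast, are two directory entries naming one file.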

--Toby
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool fragmentation issues?

2009-12-16 Thread Toby Thain


On 16-Dec-09, at 10:47 AM, Bill Sprouse wrote:


Hi Brent,

I'm not sure why Dovecot was chosen.  It was most likely a  
recommendation by a fellow University.  I agree that it is lacking in  
efficiencies in a lot of areas.  I don't think I would be  
successful in suggesting a change at this point as I have already  
suggested a couple of alternatives without success.


(As Damon pointed out) The problem seems not Dovecot per se but the  
choice of mbox format, which is rather self-evidently inefficient.





Do you have a pointer to the "block/parity rewrite" tool  
mentioned below?




It headlines the informal roadmap presented by Jeff Bonwick.

http://www.snia.org/events/storage-developer2009/presentations/monday/JeffBonwick_zfs-What_Next-SDC09.pdf



--Toby



bill

On Dec 15, 2009, at 9:38 PM, Brent Jones wrote:

On Tue, Dec 15, 2009 at 5:28 PM, Bill Sprouse  
 wrote:

Hi Everyone,

I hope this is the right forum for this question.  A customer is  
using a
Thumper as an NFS file server to provide the mail store for  
multiple email
servers (Dovecot).  They find that when a zpool is freshly  
created and

populated with mail boxes, even to the extent of 80-90% capacity,
performance is ok for the users, backups and scrubs take a few  
hours (4TB of
data). There are around 100 file systems.  After running for a  
while (couple
of months) the zpool seems to get "fragmented", backups take 72  
hours and a
scrub takes about 180 hours.  They are running mirrors with about  
5TB usable
per pool (500GB disks).  Being a mail store, the writes and reads  
are small

and random.  Record size has been set to 8k (improved performance
dramatically).  The backup application is Amanda.  Once backups  
become too
tedious, the remedy is to replicate the pool and start over.   
Things get

fast again for a while.

Is this expected behavior given the application (email - small,  
random
writes/reads)?  Are there recommendations for system/ZFS/NFS  
configurations
to improve this sort of thing?  Are there best practices for  
structuring

backups to avoid a directory walk?

Thanks,
bill
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



Any reason in particular they chose to use Dovecot with the old  
Mbox format?

Mbox has been proven many times over to be painfully slow when the
files get larger, and in this day and age, I can't imagine anyone
having smaller than a 50MB mailbox. We have about 30,000 e-mail users
on various systems, and it seems the average size these days is
approaching close to a GB. Though Dovecot has done a lot to improve
the performance of Mbox mailboxes, Maildir might be more rounded for
your system.

I wonder if the "soon to be released" block/parity rewrite tool will
"freshen" up a pool thats heavily fragmented, without having to redo
the pools.

--
Brent Jones
br...@servuhome.net


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] ZFS Boot Recovery after Motherboard Death

2009-12-12 Thread Toby Thain


On 12-Dec-09, at 1:32 PM, Mattias Pantzare wrote:

On Sat, Dec 12, 2009 at 18:08, Richard Elling  
 wrote:

On Dec 12, 2009, at 12:53 AM, dick hoogendijk wrote:


On Sat, 2009-12-12 at 00:22 +, Moritz Willers wrote:

The host identity had - of course - changed with the new  
motherboard

and it no longer recognised the zpool as its own.  'zpool import -f
rpool' to take ownership, reboot and it all worked no problem  
(which

was amazing in itself as I had switched from AMD to Intel ...).


Do I understand correctly if I read this as: OpenSolaris is able to
switch between systems without reinstalling? Just a zfs import -f  
and

everything runs? Wow, that would be an improvemment and would make
things more like *BSD/linux.


Solaris has been able to do that for 20+ years.  Why do you think
it should be broken now?


Solaris has _not_ been able to do that for 20+ years. In fact Sun has
always recommended a reinstall. You could do it if you really knew
how, but it was not easy.

If you switch between identical systems it will of course work fine.



Linux can't do it either, of course, unless one is deliberately using  
a sufficiently generic kernel.


--Toby
(who doesn't really wish to start an O/S pissing contest)



(before zfs that is, now you may have to import the pool on the new
system).
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] Transaction consistency of ZFS

2009-12-06 Thread Toby Thain


On 5-Dec-09, at 9:32 PM, nxyyt wrote:

The "rename trick" may not work here. Even if I renamed the file  
successfully, the data of the file may still reside in memory  
instead of being flushed back to the disk.  If I made any mistake here,  
please correct me. Thank you!


I'll try to find out whether ZFS always binds the same file to  
the same open transaction group. If so, I guess my assumption  
here would be true. Seems like there is only one open  
transaction group at any time. Can anybody give me a definitive  
answer here?


For the ZIL, it must be flushed back to disk in fsync() order, so  
the last append to the file would appear as the last transaction  
log entry in the ZIL for this file, I think. The assumption  
should still hold.


fsync or fdatasync may be too heavyweight for my case because it's  
a write intensive workload.


That's the point, isn't it? :)

I hope replicating the data to different machines to protect the  
data from power outage would be better.


This is the Durability referred to in "ACID". This is a very well  
studied problem, I suggest you look at the literature and  
architecture surrounding transactional databases, if you find that  
tackling this through a POSIX filesystem is problematic.


--Toby


--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] Transaction consistency of ZFS

2009-12-05 Thread Toby Thain


On 5-Dec-09, at 8:32 AM, nxyyt wrote:


Thank you very much for your quick response.

My question is: I want to figure out whether there is data loss  
after a power outage. I have replicas on other machines, so I can  
recover from the data loss. But I need a way to know whether there  
is data loss without comparing the different data replicas.


I suppose if I append a footer to the end of file before I close  
it, I can detect the data loss by validating the footer. Is it a  
work aroud for me ? Or is there a better alternative? In my  
scenario, the file is append-only, no in-place overwrite.


You seem to be looking for fsync() and/or fdatasync(); or, take  
advantage of existing systems with durable commits (e.g. [R]DBMS).
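
For the append-only workload described above, a minimal sketch (my own
illustration; the record format and names are invented) of appending a
record, making it durable, and leaving a footer that can be validated after
a crash:

    import os, struct, zlib

    def append_record(path, payload):
        # Footer = length + CRC32, so a torn or missing tail is detectable.
        footer = struct.pack("<II", len(payload), zlib.crc32(payload) & 0xffffffff)
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        try:
            os.write(fd, payload + footer)
            os.fsync(fd)    # the record is on stable storage when this returns
        finally:
            os.close(fd)

On restart, any record whose footer fails to validate was simply not durable
before the outage. The cost is the fsync per record, which is the heavyweight
part being discussed.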


--Toby


--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] Best practices for zpools on zfs

2009-11-26 Thread Toby Thain


On 26-Nov-09, at 8:57 PM, Richard Elling wrote:


On Nov 26, 2009, at 1:20 PM, Toby Thain wrote:

On 25-Nov-09, at 4:31 PM, Peter Jeremy wrote:

On 2009-Nov-24 14:07:06 -0600, Mike Gerdts   
wrote:

... fill a 128
KB buffer with random data then do bitwise rotations for each
successive use of the buffer.  Unless my math is wrong, it should
allow 128 KB of random data to be used to write 128 GB of data with very
little deduplication or compression.  A much larger data set  
could be
generated with the use of a 128 KB linear feedback shift  
register...


This strikes me as much harder to use than just filling the buffer
with 8/32/64-bit random numbers


I think Mike's reasoning is that a single bit shift (and  
propagation) is cheaper than generating a new random word. After  
the whole buffer is shifted, you have a new very-likely-unique  
block. (This seems like overkill if you know the dedup unit size  
in advance.)


You should be able to get a unique block by shifting one word, as long
as the shift doesn't duplicate the word.


That is true, but you will run out of permutations sooner.

--Toby


  I don't think many of the existing
benchmarks do this, though.
 -- richard



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Best practices for zpools on zfs

2009-11-26 Thread Toby Thain


On 25-Nov-09, at 4:31 PM, Peter Jeremy wrote:


On 2009-Nov-24 14:07:06 -0600, Mike Gerdts  wrote:

... fill a 128
KB buffer with random data then do bitwise rotations for each
successive use of the buffer.  Unless my math is wrong, it should
allow 128 KB of random data to be used to write 128 GB of data with very
little deduplication or compression.  A much larger data set could be
generated with the use of a 128 KB linear feedback shift register...


This strikes me as much harder to use than just filling the buffer
with 8/32/64-bit random numbers


I think Mike's reasoning is that a single bit shift (and propagation)  
is cheaper than generating a new random word. After the whole buffer  
is shifted, you have a new very-likely-unique block. (This seems like  
overkill if you know the dedup unit size in advance.)
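
Something along these lines (a rough sketch of the idea, not Mike's actual
test code) turns one random 128 KB buffer into an arbitrarily long stream of
blocks that neither dedup nor compression will collapse:

    import os

    def unique_blocks(block_size=128 * 1024, count=1024):
        buf = bytearray(os.urandom(block_size))
        for _ in range(count):
            yield bytes(buf)
            # Rotate the whole buffer left by one bit; each rotation yields
            # a block that is very unlikely to match any earlier one.
            carry = (buf[0] >> 7) & 1
            for i in range(len(buf) - 1, -1, -1):
                nxt = (buf[i] >> 7) & 1
                buf[i] = ((buf[i] << 1) & 0xff) | carry
                carry = nxt

Each rotation is one cheap pass over the buffer, and every yielded block
hashes differently, which is what defeats dedup.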


--Toby


from a linear congruential generator,
lagged fibonacci generator, mersenne twister or even random(3)


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RAID-Z and virtualization

2009-11-09 Thread Toby Thain


On 8-Nov-09, at 12:20 PM, Joe Auty wrote:


Tim Cook wrote:


On Sun, Nov 8, 2009 at 2:03 AM, besson3c  wrote:
...

Why not just convert the VM's to run in virtualbox and run Solaris  
directly on the hardware?




That's another possibility, but it depends on how Virtualbox stacks  
up against VMWare Server. At this point a lot of planning would be  
necessary to switch to something else, although this is possibility.


How would Virtualbox stack up against VMWare Server? Last I checked  
it doesn't have a remote console of any sort, which would be a deal  
breaker. Can I disable allocating virtual memory to Virtualbox VMs?  
Can I get my VMs to auto boot in a specific order at runlevel 3?  
Can I control my VMs via the command line?


Yes you certainly can. Works well, even for GUI based guests, as  
there is vm-level VRDP (VNC/Remote Desktop) access as well as  
whatever remote access the guest provides.





I thought Virtualbox was GUI only, designed for Desktop use primarily?


Not at all. Read up on VBoxHeadless.

--Toby



This switch will only make sense if all of this points to a net  
positive.





--Tim



--
Joe Auty
NetMusician: web publishing software for musicians
http://www.netmusician.org
j...@netmusician.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] dedup question

2009-11-03 Thread Toby Thain


On 2-Nov-09, at 3:16 PM, Nicolas Williams wrote:


On Mon, Nov 02, 2009 at 11:01:34AM -0800, Jeremy Kitchen wrote:

forgive my ignorance, but what's the advantage of this new dedup over
the existing compression option?  Wouldn't full-filesystem  
compression

naturally de-dupe?

...
There are many examples where snapshot/clone isn't feasible but dedup
can help.  For example: mail stores (though they can do dedup at the
application layer by using message IDs and hashes).  For example: home
directories (think of users saving documents sent via e-mail).  For
example: source code workspaces (ONNV, Xorg, Linux, whatever), where
users might not think ahead to snapshot/clone a local clone (I also  
tend

to maintain a local SCM clone that I then snapshot/clone to get
workspaces for bug fixes and projects; it's a pain, really).  I'm sure
there are many, many other examples.


A couple that come to mind... Some patterns become much cheaper with  
dedup:


- The Subversion working copy format where you have the reference  
checked out file alongside the working file
- QA/testing system where you might have dozens or hundreds of builds  
or iterations of an application, mostly identical


Exposing checksum metadata might have interesting implications for  
operations like diff, cmp, rsync, even tar.
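
For example (purely hypothetical - no such interface exists today), a
cmp-like tool that could ask the filesystem for per-block checksums would
only need to read the blocks that actually differ:

    def blocks_differing(checksums_a, checksums_b):
        """Compare two files by their per-block checksum lists (hypothetical
        metadata exposed by the filesystem); return block indices that need
        a real read."""
        differing = [i for i, (a, b) in enumerate(zip(checksums_a, checksums_b))
                     if a != b]
        # A length mismatch means the tail blocks differ as well.
        shorter = min(len(checksums_a), len(checksums_b))
        longer = max(len(checksums_a), len(checksums_b))
        differing.extend(range(shorter, longer))
        return differing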


--Toby



The workspace example is particularly interesting: with the
snapshot/clone approach you get to deduplicate the _source code_, but
not the _object code_, while with dedup you get both dedup'ed
automatically.

As for compression, that helps whether you dedup or not, and it  
helps by
about the same factor either way -- dedup and compression are  
unrelated,

really.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] Sniping a bad inode in zfs?

2009-10-27 Thread Toby Thain


On 27-Oct-09, at 1:43 PM, Dale Ghent wrote:


I've have a single-fs, mirrored pool on my hands which recently went
through a bout of corruption. I've managed to clean up a good bit of
it


How did this occur? Isn't a mirrored pool supposed to self heal?

--Toby


but it appears that I'm left with some directories which have bad
refcounts.
...
/dale
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] ZFS mirror resilver process

2009-10-18 Thread Toby Thain


On 18-Oct-09, at 6:41 AM, Adam Mellor wrote:


I too have seen this problem.

I had done a zfs send from my main pool "terra" (6 disk raidz on  
seagate 1TB drives) to a mirror pair of WD Green 1TB drives.
ZFS send was successful; however, I noticed the pool was degraded  
after a while (~1 week), with one of the mirror disks constantly  
re-silvering (40 TB resilvered on a 1TB disk) - something was fishy.


I removed the disk that was getting the re-silver and replaced it  
with another WD Green 1TB (factory new) and added it as a mirror to  
the pool; again it re-silvered successfully. I performed a scrub the  
next day (couple of reboots etc.) and it started re-silvering the  
replaced drive.


I still had most of the data in the original pool; I performed an  
md5sum against some of the original files (~20GB files) and the  
ex-mirror copy, and the md5 sums came back the same.


This doesn't test much; ZFS will use whichever side of the mirror is  
good.


--Toby



I have since blown away the ex-mirror and re-created the zpool  
mirror and copied the data back.

...
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Best way to convert checksums

2009-10-05 Thread Toby Thain


On 5-Oct-09, at 3:32 PM, Miles Nordin wrote:


"bm" == Brandon Mercer  writes:



I'm now starting to feel that I understand this issue,
and I didn't for quite a while.  And that I understand the
risks better, and have a clearer idea of what the possible
fixes are.  And I didn't before.


haha, yes, I think I can explain it to people when advocating ZFS, but
the story goes something like ``ZFS is serious business and pretty
useful, but it has some pretty hilarious problems that you wouldn't
expect


Let's talk about the "hilarious problems" that a naive RAID stack  
has, and most users "don't expect". For a start, no crash safe  
behaviour, and no way to self-heal from unexpected mirror desync.  
Then we could compare always-consistent COW with conventionally  
fragile metadata needing regular consistency checks...




from some of the blog hype you read.  Let me give you a couple
examples of things that still aren't fixed


...and can't be fixed, in RAID, or conventional filesystems.

--Toby


and how the discussion
went...''

...
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Incremental snapshot size

2009-09-30 Thread Toby Thain


On 30-Sep-09, at 10:48 AM, Brian Hubbleday wrote:

I had a 50mb zfs volume that was an iscsi target. This was mounted  
into a Windows system (ntfs) and shared on the network. I used  
notepad.exe on a remote system to add/remove a few bytes at the end  
of a 25mb file.


I'm astonished that's even possible with notepad.

I agree with Richard, it looks like your workflow needs attention.  
Making random edits to very large, remotely stored flat files with  
super-simplistic tools seems in defiance of 5 decades of data  
management technology...


--T


--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] Which directories must be part of rpool?

2009-09-26 Thread Toby Thain


On 26-Sep-09, at 2:55 PM, Frank Middleton wrote:


On 09/26/09 12:11 PM, Toby Thain wrote:


Yes, but unless they fixed it recently (>=RHFC11), Linux doesn't
actually nuke /tmp, which seems to be mapped to disk. One side
effect is that (like MSWindows) AFAIK there isn't a native tmpfs,
...


Are you sure about that? My Linux systems do.

http://lxr.linux.no/linux+v2.6.31/Documentation/filesystems/tmpfs.txt


OK, so you can mount /dev/shm on /tmp and /var/tmp, but that's
not the default,



It has long been the default in Gentoo. This system in particular was  
installed in 2004.



at least as of RHFC10. I have files in /tmp
going back to Feb 2008 :-). Evidently, quoting Wikipedia,
"tmpfs is supported by the Linux kernel from version 2.4 and up."
http://en.wikipedia.org/wiki/TMPFS, FC1 6 years ago. Solaris /tmp
has been a tmpfs since 1990...


The question wasn't "who was first".

--Toby



Now back to the thread...





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Which directories must be part of rpool?

2009-09-26 Thread Toby Thain


On 26-Sep-09, at 9:56 AM, Frank Middleton wrote:


On 09/25/09 09:58 PM, David Magda wrote:
...


Similar definition for [/tmp] Linux FWIW:


Yes, but unless they fixed it recently (>=RHFC11), Linux doesn't  
actually
nuke /tmp, which seems to be mapped to disk. One side effect is  
that (like

MSWindows) AFAIK there isn't a native tmpfs, ...


Are you sure about that? My Linux systems do.

http://lxr.linux.no/linux+v2.6.31/Documentation/filesystems/tmpfs.txt

--Toby



Cheers -- Frank





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] White box server for OpenSolaris

2009-09-25 Thread Toby Thain


On 25-Sep-09, at 2:58 PM, Frank Middleton wrote:


On 09/25/09 11:08 AM, Travis Tabbal wrote:

... haven't heard if it's a known
bug or if it will be fixed in the next version...


Out of courtesy to our host, Sun makes some quite competitive
X86 hardware. I have absolutely no idea how difficult it is
to buy Sun machines retail,


Not very difficult. And there is try and buy.

People overestimate the cost of Sun, and underestimate the real value  
of "fully integrated".


--Toby


but it seems they might be missing
out on an interesting market - robust and scalable SOHO servers
for the DIY gang ...

Cheers -- Frank


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] Books on File Systems and File System Programming

2009-08-14 Thread Toby Thain


On 14-Aug-09, at 11:14 AM, Peter Schow wrote:

On Thu, Aug 13, 2009 at 05:02:46PM -0600, Louis-Frédéric Feuillette  
wrote:

I saw this question on another mailing list, and I too would like to
know. And I have a couple questions of my own.

== Paraphrased from other list ==
Does anyone have any recommendations for books on File Systems and/or
File Systems Programming?
== end ==


Going back ten years, but still a good tutorial:

   "Practical File System Design with the Be File System"
   by Dominic Giampaolo

   http://www.nobius.org/~dbg/practical-file-system-design.pdf



Great cite (that I have not read) because Giampaolo is a noted expert  
on the second part of Louis-Frederic's question, how filesystems are  
merging with databases. Namesys' papers relating to Reiser4 are also  
worth reading in this respect.


--Toby


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work

2009-08-04 Thread Toby Thain


On 4-Aug-09, at 9:28 AM, Roch Bourbonnais wrote:



On 26-Jul-09, at 01:34, Toby Thain wrote:



On 25-Jul-09, at 3:32 PM, Frank Middleton wrote:


On 07/25/09 02:50 PM, David Magda wrote:

Yes, it can be affected. If the snapshot's data structure / record is
underneath the corrupted data in the tree then it won't be able to be
reached.


Can you comment on if/how mirroring or raidz mitigates this, or tree
corruption in general? I have yet to lose a pool even on a machine
with fairly pathological problems, but it is mirrored (and  
copies=2).


I was also wondering if you could explain why the ZIL can't
repair such damage.

Finally, a number of posters blamed VB for ignoring a flush, but
according to the evil tuning guide, without any application syncs,
ZFS may wait up to 5 seconds before issuing a synch,


^^ of course this can never cause inconsistency. The issue under  
discussion is inconsistency - unexpected corruption of on-disk  
structures.



and there
must be all kinds of failure modes even on bare hardware where
it never gets a chance to do one at shutdown. This is interesting
if you do ZFS over iscsi because of the possibility of someone
tripping over a patch cord or a router blowing a fuse. Doesn't
this mean /any/ hardware might have this problem, albeit with much
lower probability?


The problem is assumed *ordering*. In this respect VB ignoring  
flushes and real hardware are not going to behave the same.


--Toby


I agree that no one should be ignoring cache flushes. However the path to
corruption must involve some dropped acknowledged I/Os. The ueberblock I/O
was issued to stable storage but the blocks it pointed to, which had reached
the disk firmware earlier, never made it to stable storage. I can see this
scenario when the disk loses power


Or if the host O/S crashes. All this applies to virtual IDE devices  
alone, of course. iSCSI is a different case entirely as presumably  
flushes/barriers are processed normally.



but I don't see it with cutting power to the guest.


Right, in this case it's unlikely or nearly impossible.

--Toby



When managing a zpool on external storage, do people export the  
pool before taking snapshots of the guest ?


-r






Thanks

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40

2009-07-31 Thread Toby Thain


On 31-Jul-09, at 7:15 PM, Richard Elling wrote:


wow, talk about a knee jerk reaction...

On Jul 31, 2009, at 3:23 PM, Dave Stubbs wrote:


I don't mean to be offensive Russel, but if you do
ever return to ZFS, please promise me that you will
never, ever, EVER run it virtualized on top of NTFS
(a.k.a. worst file system ever) in a production
environment. Microsoft Windows is a horribly
unreliable operating system in situations where
things like protecting against data corruption are
important. Microsoft knows this


Oh WOW!  Whether or not our friend Russel virtualized on top of  
NTFS (he didn't - he used raw disk access) this point is amazing!


This point doesn't matter. VB sits between the guest OS and the raw
disk and drops cache flush requests.

System5 - based on this thread I'd say you can't really make this  
claim at all.  Solaris suffered a crash and the ZFS filesystem  
lost EVERYTHING!  And there aren't even any recovery tools?


As has been described many times over the past few years, there is a manual
procedure.


HANG YOUR HEADS!!!


Recovery from the same situation is EASY on NTFS.  There are piles  
of tools out there that will recover the file system, and failing  
that, locate and extract data.  The key parts of the file system  
are stored in multiple locations on the disk just in case.  It's  
been this way for over 10 years.


ZFS also has redundant metadata written at different places on the  
disk.

ZFS, like NTFS, issues cache flush requests with the expectation that
the disk honors that request.



Can anyone name a widely used transactional or journaled filesystem  
or RDBMS that *doesn't* need working barriers?





 I'd say it seems from this thread that my data is a lot safer on  
NTFS than it is on ZFS!


Nope.  NTFS doesn't know when data is corrupted.  Until it does, it is
blissfully ignorant.



People still choose systems that don't even know which side of a  
mirror is good. Do they ever wonder what happens when you turn off a  
busy RAID-1? Or why checksumming and COW make a difference?


This thread hasn't shaken my preference for ZFS at all; just about  
everything else out there relies on nothing more than dumb luck to  
maintain integrity.


--Toby






I can't believe my eyes as I read all these responses blaming  
system engineering and hiding behind ECC memory excuses and "well,  
you know, ZFS is intended for more Professional systems and not  
consumer devices, etc etc."  My goodness!  You DO realize that Sun  
has this website called opensolaris.org which actually proposes to  
have people use ZFS on commodity hardware, don't you?  I don't see  
a huge warning on that site saying "ATTENTION:  YOU PROBABLY WILL  
LOSE ALL YOUR DATA".


You probably won't lose all of your data. Statistically speaking,  
there

are very few people who have seen this. There are many more cases
where ZFS detected and repaired corruption.
...
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work

2009-07-27 Thread Toby Thain


On 27-Jul-09, at 3:44 PM, Frank Middleton wrote:


On 07/27/09 01:27 PM, Eric D. Mudama wrote:


Everyone on this list seems to blame lying hardware for ignoring
commands, but disks are relatively mature and I can't believe that
major OEMs would qualify disks or other hardware that willingly ignore
commands.


You are absolutely correct, but if the cache flush command never makes
it to the disk, then it won't see it. The contention is that by not
relaying the cache flush to the disk,


No - by COMPLETELY ignoring the flush.


VirtualBox caused the OP to lose
his pool.

IMO this argument is bogus because AFAIK the OP didn't actually power
his system down, so the data would still have been in the cache, and
presumably would eventually have been written. The out-of-order writes
theory is also somewhat dubious, since he was able to write 10TB without
VB relaying the cache flushes.


Huh? Of course he could. The guest didn't crash while he was doing it!

The corruption occurred when the guest crashed (iirc). And the "out  
of order theory" need not be the *only possible* explanation, but it  
*is* sufficient.



This is all highly hardware dependent,


Not in the least. It's a logical problem.


and AFAIK no one ever asked the OP what hardware he had, instead,
blasting him for running VB on MSWindows.


Which is certainly not relevant to my hypothesis of what broke. I  
don't care what host he is running. The argument is the same for all.



Since IIRC he was using raw
disk access, it is questionable whether or not MS was to blame, but
in general it simply shouldn't be possible to lose a pool under
any conditions.


How about "when flushes are ignored"?



It does raise the question of what happens in general if a cache
flush doesn't happen if, for example, a system crashes in such a way
that it requires a power cycle to restart, and the cache never gets
flushed.


Previous explanations have not dented your misunderstanding one iota.

The problem is not that an attempted flush did not complete. It was  
that any and all flushes *prior to crash* were ignored. This is where  
the failure mode diverges from real hardware.


Again, look:

A B C FLUSH D E F FLUSH

Note that it does not matter *at all* whether the 2nd flush  
completed. What matters from an integrity point of view is that the  
*previous* flush was completed (and synchronously). Visualise this on  
the two scenarios:


1) real hardware: (barring actual defects) that A,B,C were written  
was guaranteed by the first flush (otherwise D would never have been  
issued). Integrity of system is intact regardless of whether the 2nd  
flush completed.


2) VirtualBox: flush never happened. Integrity of system is lost, or  
at best unknown, if it depends on A,B,C all completing before D.
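
To make that concrete, a small thought experiment (my own model in Python,
nothing to do with the actual VirtualBox code): simulate a volatile write
cache and see what can be on "media" after a crash:

    import random

    def on_media_after_crash(writes, honor_flush):
        """Model a volatile write cache. `writes` is the stream issued before
        the crash, e.g. ["A", "B", "C", "FLUSH", "D", "E"]. Returns one
        possible on-media outcome."""
        media, cache = [], []
        for op in writes:
            if op == "FLUSH":
                if honor_flush:
                    media.extend(cache)   # barrier: earlier writes now durable
                    cache = []
            else:
                cache.append(op)
        # Whatever was still cached may have trickled out partially, in any order.
        random.shuffle(cache)
        return media + cache[:random.randint(0, len(cache))]

    # honor_flush=True : "D" can only ever appear together with A, B and C.
    # honor_flush=False: a run can yield e.g. ["D"] with A, B and C missing -
    #                    the broken-tree case the uberblock update must never see.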




...

Of course the ZIL isn't a journal in the traditional sense, and
AFAIK it has no undo capability the way that a DBMS usually has,
but it needs to be structured so that bizarre things that happen
when something as robust as Solaris crashes don't cause data loss.


A lot of engineering effort has been expended in UFS and ZFS to  
achieve just that. Which is why it's so nutty to undermine that by  
violating semantics in lower layers.



The nightmare scenario is when one disk of a mirror begins to
fail and the system comes to a grinding halt where even stop-a
doesn't respond, and a power cycle is the only way out. Who
knows what writes may or may not have been issued or what the
state of the disk cache might be at such a time.


Again, if the flush semantics are respected*, this is not a problem.

--Toby

* - "When this operation completes, previous writes are verifiably on  
durable media**."


** - Durable media meaning physical media in a bare metal  
environment, and potentially "virtual media" in a virtualised  
environment.





-- Frank

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] Help with setting up ZFS

2009-07-27 Thread Toby Thain


On 27-Jul-09, at 5:46 AM, erik.ableson wrote:

The zfs send command generates a differential file between the two  
selected snapshots so you can send that to anything you'd like.   
The catch of course is that then you have a collection of files on  
your Linux box that are pretty much useless since you can't mount
them or read the contents in any meaningful way.  If you're running  
a Linux server as the destination the easiest solution is to create  
a virtual machine running the same revision of OpenSolaris as the  
server and use that as a destination.


It doesn't necessarily need a publicly exposed IP address - you can  
get the source to send the differential file to the Linux box and  
then have the VM "import" the file using a recv command to  
integrate the contents into a local ZFS filesystem. I think that  
VirtualBox lets you access shared folders so you could write a  
script to check for new files and then use the recv command to  
process them.


VirtualBox can forward a host port to a guest, so one can ssh from  
outside and process the stream directly. Also note Erik's public key  
idea below.
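
A rough sketch of what I mean (names, port numbers and datasets below
are made up for illustration - the exact NAT port-forwarding setup
depends on your VirtualBox version, so check its manual):

# On the host, forward some host port (say 2222) to the guest's sshd on 22
# using VirtualBox's NAT port forwarding, then from the sending box pipe an
# incremental stream straight into the guest through that port:
zfs send -i tank/data@yesterday tank/data@today | \
    ssh -p 2222 backup@linuxhost zfs recv -F backuppool/data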


--Toby

The trick as always for this kind of thing is determining that the  
file is complete before attempting to import it.


There's some good examples in the ZFS Administration Guide (p187)  
for handling remote transfers.

zfs send tank/ci...@today | ssh newsys zfs recv sandbox/res...@today

For a staged approach you could pipe the output to a compressed  
file and send that over to the Linux box.


Combined with a key exchange between the two systems you don't need  
to keep passwords in your scripts either.


Cheers,

Erik

On 27 juil. 09, at 11:15, Brian wrote:

The ZFS send/receive command can presumably only send the  
filesystem to another OpenSolaris OS, right?  Is there any way
to send it to a normal Linux distribution (ext3)?


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work

2009-07-26 Thread Toby Thain


On 26-Jul-09, at 11:08 AM, Frank Middleton wrote:


On 07/25/09 04:30 PM, Carson Gaspar wrote:


No. You'll lose unwritten data, but won't corrupt the pool, because
the on-disk state will be sane, as long as your iSCSI stack doesn't
lie about data commits or ignore cache flush commands. Why is this so
difficult for people to understand? Let me create a simple example
for you.


Are you sure about this example? AFAIK metadata refers to things like
the file's name, atime, ACLs, etc., etc. Your example seems to be more
about how a journal works, which has little to do with metatdata other
than to manage it.

Now if you were too lazy to bother to follow the instructions  
properly,
we could end up with bizarre things. This is what happens when  
storage

lies and re-orders writes across boundaries.


On 07/25/09 07:34 PM, Toby Thain wrote:

The problem is assumed *ordering*. In this respect VB ignoring  
flushes

and real hardware are not going to behave the same.


Why? An ignored flush is ignored. It may be more likely in VB, but it
can always happen.


And whenever it does: guess what happens?


It mystifies me that VB would in some way alter
the ordering.


Carson already went through a more detailed explanation. Let me try a  
different one:


ZFS issues writes A, B, C, FLUSH, D, E, F.

case 1) the semantics of the flush* allow ZFS to presume that A, B, C  
are all 'committed' at the point that D is issued. You can understand  
that A, B, C may be done in any order, and D, E, F may be done in any  
order, due to the numerous abstraction layers involved - all the way  
down to the disk's internal scheduling. ANY of these layers can  
affect the ordering of durable, physical writes _in the absence of a  
flush/barrier_.


case 2) but if the flush does NOT occur with the necessary semantics,  
the ordering of ALL SIX operations is now indeterminate, and by the  
time ZFS issues D, any of the first 3 (A, B, C) may well not have  
been committed at all. There is a very good chance this will violate  
an integrity assumption (I haven't studied the source so I can't  
point you to a specific design detail or line; rather I am working  
from how I understand transactional/journaled systems to work.  
Assuming my argument is valid, I am sure a ZFS engineer can cite a  
specific violation).


As has already been mentioned in this context, I think by David  
Magda, ordinary hardware will show this problem _if flushes are not  
functioning_ (an unusual case on bare metal), while on VirtualBox  
this is the default!




...

Doesn't ZIL effectively make ZFS into a journalled file system


Of course ZFS is transactional, as are other filesystems and software  
systems, such as RDBMS. But integrity of such systems depends on a  
hardware flush primitive that actually works. We are getting hoarse  
repeating this.


--Toby

* Essentially 'commit' semantics: Flush synchronously, operation is  
complete only when data is durably stored.


...
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work

2009-07-25 Thread Toby Thain


On 25-Jul-09, at 3:32 PM, Frank Middleton wrote:


On 07/25/09 02:50 PM, David Magda wrote:


Yes, it can be affected. If the snapshot's data structure / record is
underneath the corrupted data in the tree then it won't be able to be
reached.


Can you comment on if/how mirroring or raidz mitigates this, or tree
corruption in general? I have yet to lose a pool even on a machine
with fairly pathological problems, but it is mirrored (and copies=2).

I was also wondering if you could explain why the ZIL can't
repair such damage.

Finally, a number of posters blamed VB for ignoring a flush, but
according to the evil tuning guide, without any application syncs,
ZFS may wait up to 5 seconds before issuing a synch, and there
must be all kinds of failure modes even on bare hardware where
it never gets a chance to do one at shutdown. This is interesting
if you do ZFS over iscsi because of the possibility of someone
tripping over a patch cord or a router blowing a fuse. Doesn't
this mean /any/ hardware might have this problem, albeit with much
lower probability?


The problem is assumed *ordering*. In this respect VB ignoring  
flushes and real hardware are not going to behave the same.


--Toby



Thanks

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] The importance of ECC RAM for ZFS

2009-07-24 Thread Toby Thain


On 24-Jul-09, at 6:41 PM, Frank Middleton wrote:


On 07/24/09 04:35 PM, Bob Friesenhahn wrote:

 Regardless, it [VirtualBox] has committed a crime.


But ZFS is a journalled file system! Any hardware can lose a flush;


No, the problematic default in VirtualBox is flushes being *ignored*,  
which has a different failure mode. A host crash under this regime  
can potentially corrupt *any* journaled and transactional system  
(starting with filesystems and RDBMS) in a manner that does not occur  
on properly functioning bare metal that honours flushes, because  
their ordering assumptions no longer hold.


Whether this is 'possible' with a guest-only crash is arguable - I  
don't want to speak for Miles, but I suspect he was reasoning that a  
guest crash would not interact with ignore-flush, as all I/O issued
up until the crash "should" finally complete - making a
guest crash similar to a "real" crash. But the virtualised stack is  
complex enough that I don't know if we can be certain about that.


I would say that ignoring flushes is still a suspect.



it's just more likely in a VM, especially when anything Microsoft
is involved,


I originally saw the problem on a Ubuntu system, 6 months ago. The  
subsystems which broke were ext3fs and InnoDB - both supposedly  
"journaling".



and the whole point of journalling is to prevent things
like this happening.



It can ONLY do that when flushes/barriers/ordering are respected.

--Toby



...
HTH -- Frank







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work

2009-07-20 Thread Toby Thain


On 20-Jul-09, at 6:26 AM, Russel wrote:


Well I did have a UPS on the machine :-)

but the machine hung and I had to power it off...
(yep it was virtual, but that happens on direct HW too,


As has been discussed here before, the failure modes are different as  
the layer stack from filesystem to disk is obviously very different.


--Toby


and virtualisation is the happening thing at Sun and elsewhere!
I have a version of the data backed up, but it will
take ages (10 days) to restore).
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work

2009-07-19 Thread Toby Thain


On 19-Jul-09, at 7:12 AM, Russel wrote:


Guys guys please chill...

First, thanks for the info about the VirtualBox option to bypass the
cache (I don't suppose you can give me a reference for that info?
(I'll search the VB site :-))



I posted about that insane default, six months ago. Obviously ZFS  
isn't the only subsystem that this breaks.

http://forums.virtualbox.org/viewtopic.php?f=8&t=13661&start=0
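
If memory serves, the workaround is a per-VM extradata key (the VM name,
controller and LUN below are placeholders, and it's "ahci" rather than
"piix3ide" for SATA - check the VirtualBox manual for your version):

VBoxManage setextradata "MyOpenSolarisVM" \
    "VBoxInternal/Devices/piix3ide/0/LUN#0/Config/IgnoreFlush" 0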


As this was not clear to me. I use VB
like others use vmware etc to run Solaris because it's the ONLY
way I can,


Convenience always has a price.

--Toby

...
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] De-duplication: possible to identify duplicate files?

2009-07-14 Thread Toby Thain


On 14-Jul-09, at 5:18 PM, Orvar Korvar wrote:

With dedup, will it be possible somehow to identify files that are  
identical but have different names? Then I can find and remove all
duplicates. I know that with dedup, removal is not really needed  
because the duplicate will just be a reference to an existing file.  
But nevertheless I want to keep down the file count.


You can do this on any filesystem easily enough by taking hashes.
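
E.g., a rough sketch with GNU coreutils (on Solaris you'd substitute
something like digest -a md5; the path is only a placeholder):

# Hash every file, sort by hash, then print the groups that share a hash:
find /tank/data -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate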

--Toby


--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] Speeding up resilver on x4500

2009-06-23 Thread Toby Thain


On 23-Jun-09, at 1:58 PM, Erik Trimble wrote:


Richard Elling wrote:

Erik Trimble wrote:
All this discussion hasn't answered one thing for me:   exactly  
_how_ does ZFS do resilvering?  Both in the case of mirrors, and  
of RAIDZ[2] ?


I've seen some mention that it goes in chronological order (which
to me, means that the metadata must be read first) of file  
creation, and that only used blocks are rebuilt, but exactly what  
is the methodology being used?


See Jeff Bonwick's blog on the topic
http://blogs.sun.com/bonwick/entry/smokin_mirrors
-- richard



That's very informative. Thanks, Richard.

So, ZFS walks the used block tree to see what still needs  
rebuilding.   I guess I have two related questions then:


(1) Are these blocks some fixed size (based on the media - usually  
512 bytes), or are they "ZFS blocks" - the fungible size based on  
the requirements of the original file size being written?
(2) is there some reasonable way to read in multiples of these  
blocks in a single IOP?   Theoretically, if the blocks are in  
chronological creation order, they should be (relatively)  
sequential on the drive(s).  Thus, ZFS should be able to read in  
several of them without forcing a random seek.


(I think) the disk's internal scheduling could help out here if they  
are indeed close to physically sequential.


--Toby


That is, you should be able to get multiple blocks in a single IOP.


If we can't get multiple ZFS blocks in one sequential read, we're  
screwed - ZFS is going to be IOPS bound on the replacement disk,  
with no real workaround.  Which means rebuild times for disks with  
lots of small files is going to be hideous.




--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] APPLE: ZFS need bug corrections instead of new func! Or?

2009-06-18 Thread Toby Thain


On 18-Jun-09, at 12:14 PM, Miles Nordin wrote:


"bmm" == Bogdan M Maryniuk  writes:
"tt" == Toby Thain  writes:

...
tt> /. is no person...



... you and I both know it's plausible
speculation that Apple delayed unleashing ZFS on their consumers
because of the lost pool problems.  ZFS doesn't suck, I do use it, I
hope and predict it will get better---so just back off and calm down
with the rotten fruit.  But neither who's saying it nor your not
wanting to hear it makes it less plausible.


In my opinion, a more plausible explanation is: Apple has not made  
ZFS integration a high priority [for 10.6].


There is no doubt Apple has the engineering resources to make it  
perfectly reliable as a component of Mac OS X, if that were a high  
priority goal.


I run OS X but I am not at all tempted to play with ZFS on it there;  
life is too short for betas. If I want ZFS I install Solaris 10.


--Toby



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] APPLE: ZFS need bug corrections instead of new func! Or?

2009-06-17 Thread Toby Thain


On 17-Jun-09, at 5:42 PM, Miles Nordin wrote:


"bmm" == Bogdan M Maryniuk  writes:
"tt" == Toby Thain  writes:
"ok" == Orvar Korvar  writes:


tt> Slashdot was never the place to go for accurate information
tt> about ZFS.

again, the posts in the slashdot thread complaining about corruption
were just pointers to original posts on this list, so attacking the
forum where you saw the pointer instead of the content of its
destination really is clearly _ad hominem_.



Ad foruminem? !! Or did you simply mean, "uncalled-for"?

/. is no person... And most of the thread really was rubbish. If one  
or two posts linked to the mailing list, that doesn't change it.


--Toby



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] APPLE: ZFS need bug corrections instead of new func! Or?

2009-06-17 Thread Toby Thain


On 17-Jun-09, at 7:37 AM, Orvar Korvar wrote:

Ok, so you mean the comments are mostly FUD and bull shit? Because  
there are no bug reports from the whiners? Could this be the case?  
It is mostly FUD? Hmmm...?




Having read the thread, I would say "without a doubt".

Slashdot was never the place to go for accurate information about ZFS.

--Toby


--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] zfs on 32 bit?

2009-06-16 Thread Toby Thain


On 16-Jun-09, at 6:22 PM, Ray Van Dolson wrote:


On Tue, Jun 16, 2009 at 03:16:09PM -0700, milosz wrote:

yeah i pretty much agree with you on this.  the fact that no one has
brought this up before is a pretty good indication of the demand.
there are about 1000 things i'd rather see fixed/improved than max
disk size on a 32bit platform.


I'd say a lot of folks out there have plenty of enterprise-class 32- 
bit

hardware still in production in their datacenters.  I know I do.
Several IBM BladeCenters with 32-bit blades and attached storage...

It would be "nice" to be able to do ZFS on these platforms (>1TB that
is), but I understand if it's not a priority.  But there's certainly a
lot of life left in 32-bit hardware, and not all of it is cheap to
replace.



I bet 1+ TB drives in the right format (e.g. SCSI) aren't exactly  
cheap or even available...


Let's be reminded that this is about maximum size of a single drive  
(or slice?) not dataset or pool.




milosz wrote,



 the fact that no one has
brought this up before is a pretty good indication of the demand.
there are about 1000 things i'd rather see fixed/improved than max
disk size on a 32bit platform.



+1

--Toby



Ray



On Tue, Jun 16, 2009 at 5:55 PM, Neal  
Pollack wrote:

On 06/16/09 02:39 PM, roland wrote:


so, we have a 128bit fs, but only support for 1tb on 32bit?

I'd call that a bug, isn't it?  Is there a bugid for this? ;)



Well, opinion is welcome.
I'd call it an RFE.

With 64 bit versions of the CPU chips so inexpensive these days,
how much money do you want me to invest in moving modern features
and support to old versions of the OS?

I mean, Microsoft could, on a technical level, backport all new  
features

from
Vista and Windows Seven to Windows 95.  But if they did that,  
their current

offering
would lag, since all the engineers would be working on the older  
stuff.


Heck, you can buy a 64 bit CPU motherboard very very cheap.  The  
staff that

we do have
are working on modern features for the 64bit version, rather than  
spending

all their time
"in the rear-view mirror".   Live life forward.  Upgrade.
Changing all the data structures in the 32 bit OS to handle super  
large

disks, is, well, sorta
like trying to get a Pentium II to handle HD Video.  I'm sure,  
with enough

time and money,
you might find a way.  But is it worth it?  Or is it cheaper to  
buy a new

pump?

Neal

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] Apple Removes Nearly All Reference To ZFS

2009-06-10 Thread Toby Thain


On 10-Jun-09, at 7:25 PM, Alex Lam S.L. wrote:

On Thu, Jun 11, 2009 at 2:08 AM, Aaron Blew  
wrote:

That's quite a blanket statement.  MANY companies (including Oracle)
purchased Xserve RAID arrays for important applications because of  
their
price point and capabilities.  You easily could buy two Xserve  
RAIDs and

mirror them for what comparable arrays of the time cost.

-Aaron


I'd very much doubt that, but I guess one can always push their time
budgets around ;-)


Hm, as someone who personally installed a 1st gen 1.1TB (half full)  
Xserve RAID + Xserve in a production environment, back when such a  
configuration cost AUD $40,000, I can tell you that it was child's  
play to set up, and ran flawlessly. The cost halved within a few  
months, iirc. :)


--Toby



Alex.




On Wed, Jun 10, 2009 at 8:53 AM, Bob Friesenhahn
 wrote:


On Wed, 10 Jun 2009, Rodrigo E. De León Plicet wrote:



http://hardware.slashdot.org/story/09/06/09/2336223/Apple- 
Removes-Nearly-All-Reference-To-ZFS


Maybe Apple will drop the server version of OS-X and will  
eliminate their
only server hardware (Xserve) since all it manages to do is lose  
money for

Apple and distracts from releasing the next iPhone?

Only a lunatic would rely on Apple for a mission-critical server
application.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/ 
bfriesen/

GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss










--

Josh Billings  - "Every man has his follies - and often they are the
most interesting thing he has got." -
http://www.brainyquote.com/quotes/authors/j/josh_billings.html
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] Errors on mirrored drive

2009-05-26 Thread Toby Thain


On 26-May-09, at 10:21 AM, Frank Middleton wrote:


On 05/26/09 03:23, casper@sun.com wrote:


And where exactly do you get the second good copy of the data?


From the first. And if it is already bad, as noted previously, this
is no worse than the UFS/ext3 case. If you want total freedom from
this class of errors, use ECC.

If you copy the code you've just doubled your chance of using bad  
memory.
The original copy can be good or bad; the second copy cannot be  
better

than the first copy.


The whole point is that the memory isn't bad. About once a month, 4GB
of memory of any quality can experience 1 bit being flipped, perhaps
more or less often.



What you are proposing does practically nothing to mitigate "random  
bit flips". Think about the probabilities involved. You're testing  
one tiny buffer, very occasionally, for an extremely improbable  
event. It is also nothing to do with ZFS, and leaves every other byte  
of your RAM untested. See the reasoning?


--Toby


...

Cheers -- Frank



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Errors on mirrored drive

2009-05-26 Thread Toby Thain


On 25-May-09, at 11:16 PM, Frank Middleton wrote:


On 05/22/09 21:08, Toby Thain wrote:
Yes, the important thing is to *detect* them, no system can run  
reliably

with bad memory, and that includes any system with ZFS. Doing nutty
things like calculating the checksum twice does not buy anything of
value here.


All memory is "bad" if it doesn't have ECC. There are only varying
degrees of badness. Calculating the checksum twice on its own would
be nutty, as you say, but doing so on a separate copy of the data
might prevent unrecoverable errors


I don't see this at all. The kernel reads the application buffer. How  
does reading it twice buy you anything?? It sounds like you are  
assuming 1) the buffer includes faulty RAM; and 2) the faulty RAM  
reads differently each time. Doesn't that seem statistically unlikely  
to you? And even if you really are chasing this improbable scenario,  
why make ZFS do the job of a memory tester?



after writes to mirrored drives.
You can't detect memory errors if you don't have ECC. But you can
try to mitigate them. Without doing so makes ZFS less reliable than
the memory it is running on. The problem is that ZFS makes any file
with a bad checksum inaccessible, even if one really doesn't care
if the data has been corrupted. A workaround might be a way to allow
such files to be readable despite the bad checksum...


I am not sure what you are trying to say here.



...


How can a machine with bad memory "work fine with ext3"?


It does. It works fine with ZFS too. Just really annoying  
unrecoverable
files every now and then on mirrored drives. This shouldn't happen  
even

with lousy memory and wouldn't (doesn't) with ECC. If there was a way
to examine the files and their checksums, I would be surprised if they
were different (If they were, it would almost certainly be the  
controller

or the PCI bus itself causing the problem). But I speculate that it is
predictable memory hits.


You're making this harder than it really is. Run a memory test. If it  
fails, take the machine out of service until it's fixed. There's no  
reasonable way to keep running faulty hardware.


--Toby



-- Frank



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Errors on mirrored drive

2009-05-22 Thread Toby Thain


On 22-May-09, at 5:24 PM, Frank Middleton wrote:

There have been a number of threads here on the reliability of ZFS  
in the
face of flaky hardware. ZFS certainly runs well on decent (e.g.,  
SPARC)
hardware, but isn't it reasonable to expect it to run well on  
something

less well engineered? I am a real ZFS fan, and I'd hate to see folks
trash it because it appears to be unreliable.

In an attempt to bolster the proposition that there should at least be
an option to buffer the data before checksumming and writing, we've
been doing a lot of testing on presumed flaky (cheap) hardware, with a
peculiar result - see below.

On 04/21/09 12:16, casper@sun.com wrote:


And so what?  You can't write two different checksums; I mean, we're
mirroring the data so it MUST BE THE SAME.  (A different checksum  
would be

wrong: I don't think ZFS will allow different checksums for different
sides of a mirror)


Unless it does a read after write on each disk, how would it know that
the checksums are the same? If the data is damaged before the checksum
is calculated then it is no worse than the ufs/ext3 case. If data +
checksum is damaged whilst the (single) checksum is being calculated,
or after, then the file is already lost before it is even written!
There is a significant probability that this could occur on a machine
with no ecc. Evidently memory concerns /are/ an issue


Yes, the important thing is to *detect* them, no system can run  
reliably with bad memory, and that includes any system with ZFS.  
Doing nutty things like calculating the checksum twice does not buy  
anything of value here.


If the memory is this bad then applications will be dying all over  
the place, compilers will be segfaulting, and databases will be  
writing bad data even before it reaches ZFS.



- this thread
http://opensolaris.org/jive/thread.jspa?messageID=338148 even suggests
including a memory diagnostic with the distribution CD (Fedora already
does so).


Absolutely, memory diags are essential. And you certainly run them if  
you see unexpected behaviour that has no other obvious cause.




Memory diagnostics just test memory. Disk diagnostics just test disks.
ZFS keeps disks pretty busy, so perhaps it loads the power supply
to the point where it heats up and memory glitches are more likely.


Your logic is rather tortuous. If the hardware is that crappy then  
there's not much ZFS can do about it.



It might also explain why errors don't really begin until ~15 minutes
after the busy time starts.

You might argue that this problem could only affect systems doing a
lot of disk i/o and such systems probably have ecc memory. But doing
an o/s install is the one time where a consumer grade computer does
a *lot* of disk i/o for quite a long time and is hence vulnerable.
Ironically,  the Open Solaris installer does not allow for ZFS
mirroring at install time, one time where it might be really  
important!

Now that sounds like a more useful RFE, especially since it would be
relatively easy to implement. Anaconda does it...

A Solaris install writes almost 4*10^10 bits. Quoting Wikipedia, look
at Cypress on ECC, see http://www.edn.com/article/CA454636.html.
Possibly, statistically likely random memory glitches could actually
explain the error rate that is occurring.


You are assuming that the error is the memory being modified after
computing the checksums; I would say that that is unlikely; I  
think it's a
bit more likely that the data gets corrupted when it's handled by  
the disk
controller or the disk itself.  (The data is continuously re- 
written by

the DRAM controller)


See below for an example where a checksum error occurs without the
disk subsystem being involved. There seems to be no other plausible
explanation other than an improbable bug in X86 ZFS itself.

It would have been nice if we were able to recover the contents of  
the
file; if you also know what was supposed to be there, you can diff  
and

then we can find out what was wrong.


"file" on those files resulted in "bus error". Is there a way to  
actually

read a file reported by ZFS as unrecoverable to do just that (and to
separately retrieve the copy from each half of the mirror)?

Maybe this should be a new thread, but I suspect the following
proves that the problem must be memory, and that begs the question
as to how memory glitches can cause fatal ZFS checksum errors.


Of course they can; but they will also break anything else on the  
machine.


...


If a memory that can pass diagnostics for 24 hours at a
stretch can cause glitches in huge datastreams, then IMO it
behooves ZFS to defend itself against them. Buffering disk
i/o on machines with no ECC seems like reasonably cheap
insurance against a whole class of errors, and could make
ZFS usable on PCs that, although they work fine with ext3,


How can a machine with bad memory "work fine with ext3"?

--Toby


fail annoyingly with ZFS. Ironically this wouldn't fix the
peculiar recv problem, which non

Re: [zfs-discuss] [on-discuss] Reliability at power failure?

2009-04-19 Thread Toby Thain


On 19-Apr-09, at 10:38 AM, Uwe Dippel wrote:


casper@sun.com wrote:

We are back at square one; or, at the subject line.
I did a zpool status -v, everything was hunky dory.
Next, a power failure, 2 hours later, and this is what zpool  
status -v thinks:


zpool status -v
 pool: rpool
state: ONLINE
status: One or more devices has experienced an error resulting in  
data

   corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise  
restore the

   entire pool from backup.
  see: http://www.sun.com/msg/ZFS-8000-8A
scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c1d0s0    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

   //etc/svc/repository-boot-20090419_174236

I know, the hard-core defenders of ZFS will repeat for the
umpteenth time that I should be grateful that ZFS can NOTICE and  
inform about the problem.




:-)

The file is created on boot and I assume this was created directly  
after the boot after the  power-failure.


Am I correct in thinking that:
the last boot happened on 2009/04/19_17:42:36
the system hasn't reboot since that time



Good guess, but wrong. Another two to go ...   :)


Others might want to repeat that this is not supposed to happen  
in the first place.




ZFS guarantees that this cannot happen, unless the hardware is
bad.  Bad here means "the hardware doesn't promise what ZFS
believes the hardware promises".


But anything can cause this:

hardware problems:
- bad memory
- bad disk
- bad disk controller
- bad power supply

software problem
- memory corruption through any odd driver
- any part of the zfs stack

My memory would still be a hardware problem.  I remember a  
particular case where ZFS continuously found checksum errors; replacing
the power supply fixed that.




Chances are. And yet Ubuntu, as a dual boot here, never finds anything
wrong, crashes, etc.


Why should it? It isn't designed to do so.



And again, someone will inform me that this is the beauty of ZFS:  
That I know of the corruption.


After a scrub, what I see is:

zpool status -v
 pool: rpool
state: ONLINE
status: One or more devices has experienced an error resulting in data
   corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise  
restore the

   entire pool from backup.
  see: http://www.sun.com/msg/ZFS-8000-8A
scrub: scrub completed after 0h48m with 1 errors on Sun Apr 19  
19:09:26 2009

config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     1
          c1d0s0    ONLINE       0     0     2

errors: Permanent errors have been detected in the following files:

   <0xa6>:<0x4f002>

Which file to replace?


Have you thoroughly checked your hardware?

Why are you running a non-redundant pool?

--Toby



Seriously, what would a normal user be expected to do here? No, I don't
have a backup of a file that has recently been created, true, at  
17:42 on April 19th.
Reinstall? While everything was okay 12 hours ago, after some 30  
crashes due to power-failures, that were - until recently -  
rectified with crashes at boot, Failsafe, reboot.
A system that has been going up and down without much hassle for  
1.5 years, both on OpenSolaris on UFS and Ubuntu?


(Let's not forget the thread started with my question "Why do I  
have to Failsafe so frequently after a power failure, to correct a  
corrupted bootarchive?")


Uwe


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] Errors on mirrored drive

2009-04-17 Thread Toby Thain


On 17-Apr-09, at 11:49 AM, Frank Middleton wrote:


... One might argue that a machine this flaky should
be retired, but it is actually working quite well,


If it has bad memory, you won't get much useful work done on it until  
the memory is replaced - unless you want to risk your data with  
random failues, and potentially waste large amounts of time.


You should do a comprehensive memory test ASAP and replace what's not  
working.


ZFS' job isn't to test your memory, so I think the proposed patch is  
pointless. It also doesn't address the case where the application  
buffer is corrupt.


--T


and perhaps represents
not even the extreme of bad hardware that ZFS might encounter.

Cheers -- Frank


 ___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] How recoverable is an 'unrecoverable error'?

2009-04-16 Thread Toby Thain


On 16-Apr-09, at 5:27 PM, Florian Ermisch wrote:


Uwe Dippel schrieb:

Bob Friesenhahn wrote:


Since it was not reported that user data was impacted, it seems  
likely that there was a read failure (or bad checksum) for ZFS  
metadata which is redundantly stored.
(Maybe I am too much of a linguist to not stumble over the wording  
here.) If it is 'redundant', it is 'recoverable', am I right? Why,  
if this is the case, does scrub not recover it, and scrub even  
fails to correct the CKSUM error as long as it is flagged  
'unrecoverable', but can do exactly that after the 'clear' command?


Ubuntu Linux is unlikely to notice data problems unless the drive  
reports hard errors.  ZFS is much better at checking for errors.
No doubt. But ext3 also seems to need much less attention, far fewer
commands. Which leaves it as a viable alternative. I still hope that
one day ZFS will be maintainable as simply as ext3, or rather do all
that maintenance on its own.  :)

Ext3 has no (optional) redundancy across more than one disc and no
volume management. You need Device Mapper for redundancy (Multiple
Devices or Linux Volume Management) and volume management (LVM again).



And you'll still be lacking checksumming and self healing.
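
A minimal sketch of what those two give you on the ZFS side (device
names below are placeholders):

zpool create tank mirror c0t0d0 c0t1d0   # redundancy plus end-to-end checksums
zpool scrub tank                         # walk every block, repair bad copies from the good side
zpool status -v tank                     # per-device READ/WRITE/CKSUM counters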

--Toby



If you want such features on Linux, Ext3 is the top of at least 2,
probably 3 layers of storage management.
Should I add NFS, CIFS and iSCSI exports or the needlessness of  
resizing

volumes?

You're comparing a single tool with a whole production line.
Sorry for the flaming, but yesterday I spent 4 additional hours at work
recovering a Xen server with a single error somewhere in its
LVM

causing the virtual servers to freeze.


Uwe


Kind Regards, FLorian

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss





Re: [zfs-discuss] Errors on mirrored drive

2009-04-15 Thread Toby Thain


On 15-Apr-09, at 8:31 PM, Frank Middleton wrote:


On 04/15/09 14:30, Bob Friesenhahn wrote:

On Wed, 15 Apr 2009, Frank Middleton wrote:

zpool status shows errors after a pkg image-update
followed by a scrub.


If a corruption occured in the main memory, the backplane, or the  
disk

controller during the writes to these files, then the original data
written could be corrupted, even though you are using mirrors. If the
system experienced a physical shock, or power supply glitch, while  
the

data was written, then it could impact both drives.


Quite. Sounds like an architectural problem. This old machine probably
doesn't have ecc memory (AFAIK still rare on most PCs), but it is on
a serial UPS and isolated from shocks, and this has happened more
than once. These drives on this machine recently passed both the purge
and verify cycles (format/analyze) several times. Unless the data is
written to both drives from the same buffer and checksum (surely  
not!),


Doesn't seem that far-fetched...


it is still unclear how it could get written to *both* drives with a
bad checksum. It looks like the files really are bad - neither of
them can be read - unless ZFS sensibly refuses to allow possibly good
files with bad checksums to be read (cannot read: I/O error).

BTW fmdump -ev doesn't seem to report any disk errors  at all.

So my question remains - even with the grottiest hardware, how can
several files get written with bad checksums to mirrored drives?


Bad RAM would seem a possible cause, wouldn't it?

--Toby


ZFS
has so many cool features this would be easy to live with if there
was a reasonably simple way to get copies of these files to restore
them, short of getting the source and recompiling, or pkg uninstall
followed by install (if you can figure out which pkg(s) the bad files
are in), but it seems to defeat the purpose of software mirroring...







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] ZIL SSD performance testing... -IOzone works great, others not so great

2009-04-10 Thread Toby Thain


On 10-Apr-09, at 5:05 PM, Mark J Musante wrote:


On Fri, 10 Apr 2009, Patrick Skerrett wrote:

degradation) when these write bursts come in, and if I could  
buffer them even for 60 seconds, it would make everything much  
smoother.


ZFS already batches up writes into a transaction group, which  
currently happens every 30 seconds.



Isn't that 5 seconds?
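
If memory serves, the interval is a tunable - zfs_txg_timeout in newer
bits, txg_time in older ones - so treat the names below as
build-dependent:

echo zfs_txg_timeout/D | mdb -k     # read the current value (seconds) on a live kernel

# or pin it in /etc/system (takes effect after the next boot):
set zfs:zfs_txg_timeout = 5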

--T


  Have you tested zfs against a real-world workload?


Regards,
markm
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] Data size grew.. with compression on

2009-04-10 Thread Toby Thain


On 10-Apr-09, at 2:03 PM, Harry Putnam wrote:


David Magda  writes:


On Apr 7, 2009, at 16:43, OpenSolaris Forums wrote:


if you have a snapshot of your files and rsync the same files again,
you need to use the "--inplace" rsync option, otherwise completely new
blocks will be allocated for the new files. That's because rsync
will write an entirely new file and rename it over the old one.




Not sure if this applies here, but I think it's worth mentioning and
not obvious.


With ZFS new blocks will always be allocated: it's a copy-on-write
(COW)

file system.


So who is right here...


As far as I can see - the effect of --inplace would be that new  
blocks are allocated for the deltas, not the whole file, so Daniel  
Rock's finding does not contradict "OpenSolaris Forums". But in  
either case, COW is involved.
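
A quick way to see the effect (dataset and paths below are made up):

zfs snapshot tank/backup@before
rsync -a --inplace --no-whole-file /data/ /tank/backup/
    # --no-whole-file forces the delta algorithm even for local copies
zfs list -o name,used,referenced tank/backup@before
    # USED on the snapshot is the space still held by the superseded blocks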


--Toby


Daniel Rock says he can see on disk that it
doesn't work that way... that is only a small amount of space is taken
when rsyncing in this way.
See his post:

  From: Daniel Rock 
  Subject: Re: Data size grew.. with compression on
  Newsgroups: gmane.os.solaris.opensolaris.zfs
  To: zfs-discuss@opensolaris.org
  Date: Thu, 09 Apr 2009 16:35:07 +0200
  Message-ID: <49de079b.2040...@deadcafe.de>

  [...]

  Johnathon wrote:

ZFS will allocate new blocks either way


  Daniel R replied: No it won't. --inplace doesn't rewrite blocks
  identical on source and target but only blocks which have been
  changed.

  I use rsync to synchronize a directory with a few large files (each
  up to 32 GB). Data normally gets appended to one file until it
  reaches the size limit of 32 GB. Before I used --inplace a snapshot
  needed on average ~16 GB. Now with --inplace it is just a few
  kBytes.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] How do I "mirror" zfs rpool, x4500?

2009-03-17 Thread Toby Thain


On 17-Mar-09, at 3:32 PM, cindy.swearin...@sun.com wrote:


Neal,

You'll need to use the text-based initial install option.
The steps for configuring a ZFS root pool during an initial
install are covered here:

http://opensolaris.org/os/community/zfs/docs/

Page 114:

Example 4–1 Initial Installation of a Bootable ZFS Root File System

Step 3, you'll be presented with the disks to be selected as in  
previous releases. So, for example, to select the boot disks on the  
Thumper,

select both of them:


Right, but what if you didn't realise on that screen that you needed  
to select both to make a mirror? The wording isn't very explicit, in  
my opinion. Yesterday I did my first Solaris 10 ZFS root install and  
didn't interpret this screen correctly. I chose one disk, so I'm in the
OP's situation and want to set up the mirror retrospectively.


I'm using an X2100. Unfortunately when I try to zpool attach, I get a  
Device busy error on the 2nd drive. But probably I'm making a n00b  
error.
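
For the record, the usual recipe as I understand it (device names here
are only examples - see the ZFS Administration Guide for the details):

# Copy the first disk's VTOC (slice layout) to the second; both disks need
# SMI labels - relabel with "format -e" first if the new one is EFI-labelled:
prtvtoc /dev/rdsk/c1t0d0s2 | fmthard -s - /dev/rdsk/c1t1d0s2

# Attach the new slice to the existing root pool device:
zpool attach rpool c1t0d0s0 c1t1d0s0

# Once resilvering completes, install the boot blocks on the second disk (x86):
installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t1d0s0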


--Toby



[x] c5t0d0
[x] c4t0d0
.
.
.


On our lab Thumper, they are c5t0 and c4t0.

Cindy

Neal Pollack wrote:

I'm setting up a new X4500 Thumper, and noticed suggestions/blogs
for setting up two boot disks as a zfs rpool mirror during  
installation.
But I can't seem to find instructions/examples for how to do this  
using

google, the blogs, or the Sun docs for X4500.
Can anyone share some instructions for setting up the rpool mirror
of the boot disks during the Solaris Nevada (SXCE) install?
Thanks,
Neal
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



Re: [zfs-discuss] ZFS GSoC ideas page rough draft

2009-03-14 Thread Toby Thain


On 14-Mar-09, at 12:09 PM, Blake wrote:


I just thought of an enhancement to zfs that would be very helpful in
disaster recovery situations - having zfs cache device serial/model
numbers - the information we see in cfgadm -v.


+1  I haven't needed this but it sounds very sensible. I can imagine  
it could help a lot in some drive replacement situations.
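
In the meantime, the pieces one ends up gluing together by hand (Solaris
commands; the device name below is just an example):

iostat -En                    # vendor, model, firmware and serial number per device

zdb -l /dev/rdsk/c0t1d0s0     # pool name, GUIDs and devid stored in the ZFS label -
                              # handy for matching an orphaned disk back to its pool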


--Toby



I'm feeling the pain of this now as I try to figure out which disks on
my failed filer belonged to my raidz2 pool - zpool status tells me the
pool is faulted (I don't have enough working SATA ports to connect all
the drives from the pool), but doesn't tell me which individual
devices were in that pool (just the devids of the devices).


On Mon, Mar 9, 2009 at 9:30 AM, C.  wrote:


Here's my rough draft of GSoC ideas

http://www.osunix.org/docs/DOC-1022

Also want to thank everyone for their feedback.

Please keep in mind that for creating a stronger application we  
only have a

few days.

We still need to :

1) Find more mentors.  (Please add your name to the doc or confirm  
via email

and which idea you're most interested in)
2) Add contacts from each organization that may be interested  
(OpenSolaris,

FreeBSD...)
3) Finalize the application, student checklist, mentor checklist and
template
4) Start to give ideas for very accurate project descriptions/ 
details (We

have some time for this)

Thanks

./Christopher

---
Community driven OpenSolaris Technology - http://www.osunix.org
blog: http://www.codestrom.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] zfs related google summer of code ideas - your vote

2009-03-05 Thread Toby Thain


On 5-Mar-09, at 2:03 PM, Miles Nordin wrote:


"gm" == Gary Mills  writes:


gm> There are many different components that could contribute to
gm> such errors.

yes of course.

gm> Since only the lower ZFS has data redundancy, only it can
gm> correct the error.

um, no?
...
For writing, application-level checksums do NOT work at all, because
you would write corrupt data to the disk, and notice only later when
you read it back, when there's nothing you can do to fix it.


Right, it would have to be combined with an always read-back policy  
in the application...?


--Toby


  ZFS
redundancy will not help you here either, because you write corrupt
data redundantly!  With a single protection domain for writing, the
write would arrive at ZFS along with a never-regenerated checksum
wrapper-seal attached to it by the something-like-an-NFS-client.  Just
before ZFS sends the write to the disk driver, ZFS would crack the
protection domain open, validate the checksum, reblock the write, and
send it to disk with a new checksum.  (so, ``single protection
domain'' is really a single domain for reads, and two protection
domains for write) If the checksum does not match, ZFS must convince
the writing client to resend---in the write direction I think cached
bad data will be less of a problem.
...

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

