Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-23 Thread Frank . Hofmann
On Wed, 23 Jul 2008, Tharindu Rukshan Bamunuarachchi wrote:

> 10,000 x 700 = 7MB per second ...
> 
> We have this rate for the whole day ...
> 
> 10,000 orders per second is the minimum requirement of modern-day stock 
> exchanges ...
> 
> Cache still helps us for ~1 hour, but after that who will help us ...
> 
> We are using a 2540 for current testing ...
> I have tried the same with a 6140, but no significant improvement ... only 
> one or two hours ...

It might not be exactly what you have in mind, but this "how do I get 
latency down at all costs" thing reminded me of this old paper:

http://www.sun.com/blueprints/1000/layout.pdf

I'm not a storage architect - would someone with more experience in the area 
care to comment on this ? With the huge disks we have these days, the "wide 
thin" idea has gone under a bit - but how do you replace such setups with 
modern arrays, if the workload is such that caches eventually must get 
blown and you're down to spindle speed ?

FrankH.

> 
> Robert Milkowski wrote:
>
>  Hello Tharindu,
> 
> Wednesday, July 23, 2008, 6:35:33 AM, you wrote:
> 
> TRB> Dear Mark/All,
> 
> TRB> Our trading system is writing to local and/or array volumes at 10k 
> TRB> messages per second.
> TRB> Each message is about 700 bytes in size.
> 
> TRB> Before ZFS, we used UFS.
> TRB> Even with UFS, there was a peak every 5 seconds due to fsflush invocation.
> 
> TRB> However, each peak is about ~5ms.
> TRB> Our application cannot recover from such high latency.
> 
> TRB> So we used several tuning parameters (tune_t_* and autoup) to decrease
> TRB> the flush interval.
> TRB> As a result, peaks came down to ~1.5ms. But that is still too high for our
> TRB> application.
> 
> TRB> I believe that if we could reduce the ZFS sync interval down to ~1s, peaks
> TRB> would be reduced to ~1ms or less.
> TRB> We would rather have <1ms peaks every second than a 5ms peak every 5
> TRB> seconds :-)
> 
> TRB> Are there any tunables so I can reduce the ZFS sync interval?
> TRB> If there is no tunable, can I not use "mdb" for the job ...?
> 
> TRB> This is not a general setup and we are OK with an increased I/O rate.
> TRB> Please advise/help.
> 
> txg_time/D
> 
> btw:
>  10,000 * 700 = ~7MB
> 
> What's your storage subsystem? Any, even small, raid device with write
> cache should help.
> 
>
> 
> 
> 
>
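
For what it's worth, the sync interval Robert points at can be inspected
(and, unsupported, changed on a live system) with mdb. A hedged sketch,
assuming your build still names the symbol txg_time and counts it in seconds:

   # echo "txg_time/D" | mdb -k        display the current interval
   # echo "txg_time/W 0t1" | mdb -kw   set it to 1 second (live, unsupported)

The usual caveat applies - a shorter txg interval trades peak latency for
more frequent I/O bursts, as noted above.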

--
No good can come from selling your freedom, not for all the gold in the world,
for the value of this heavenly gift far exceeds that of any fortune on earth.
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS for write-only media?

2008-04-24 Thread Frank . Hofmann
On Thu, 24 Apr 2008, Daniel Rock wrote:

> Joerg Schilling schrieb:
>> WOM  Write-only media
>
> http://www.national.com/rap/files/datasheet.pdf

I love this part of the specification:

Cooling

The 25120 is easily cooled by employment of a six-foot fan,
1/2" from the package. If the device fails, you have exceeded
the ratings. In such cases, more air is recommended.

There was an article in the German c't magazine exactly 13 years ago 
this month that benchmarked various operating systems' null devices. They 
tested an unnamed "hardware null device prototype" - now I finally know 
what that one actually was !

:-)

FrankH.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

2007-12-31 Thread Frank . Hofmann
On Mon, 31 Dec 2007, Darren Reed wrote:

> Frank Hofmann wrote:
>>
>>
>> On Fri, 28 Dec 2007, Darren Reed wrote:
>> [ ... ]
>>> Is this behaviour defined by a standard (such as POSIX or the
>>> VFS design) or are we free to innovate here and do something
>>> that allowed such a shortcut as required?
>>
>> Wrt. to standards, quote from:
>>
>> http://www.opengroup.org/onlinepubs/009695399/functions/rename.html
>>
>> ERRORS
>> The rename() function shall fail if:
>> [ ... ]
>> [EXDEV]
>> [CX]  The links named by new and old are on different file systems
>> and the
>> implementation does not support links between file systems.
>>
>> Hence, it's implementation-dependent, as per IEEE1003.1.
>
> This implies that we'd also have to look at allowing
> link(2) to also function between filesystems where
> rename(2) was going to work without doing a copy,
> correct?  Which I suppose makes sense.

Copy-on-write. rename() is just defined as an "atomic" sequence of:

link(old, new);
unlink(old);

If cross-fs rename is possible, then cross-fs link is as well. It's 
"per-file clone".

Btw, Joerg, this addresses the concern you had in any case. It's cross-fs, 
which means st_dev/st_ino _WILL_ change. Persistence of open files is not 
related to that. If you hold a file open, the st_dev/st_ino associated 
with the open fd will stay around and continue to be accessible with 
fstat() - but not necessarily with stat(). It definitely would not be if 
the file got removed. That a cross-fs rename would remove the file on the 
source fs is, as far as I can see, not violating anything.
The location of the file's data is _NOT_ the only way to derive a unique 
st_dev/st_ino pair.
rename() _within_ a filesystem (as defined by the set of nodes with a 
common st_dev) should preserve st_ino if the fs supports link counts 
larger than one, agreed. But let's not confuse this with cross-fs rename, 
where by definition (cross-fs) st_dev must change. The identity of that 
file has therefore changed.
We're just in the happy situation with ZFS that the low-level storage 
implementation can know that the contents haven't.

That's a sad situation for backup utilities, by the way - a backup tool 
would have no way of finding out that file X on fs A already existed as 
file Z on fs B. So what ? If the file were copied byte by byte, the same 
situation would exist: the contents are identical. The fact that this makes 
backups slower than they could be with an omniscient backup utility is no 
reason to slow file copy/rename operations down.

Happy new year !
FrankH.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

2007-12-28 Thread Frank Hofmann


On Fri, 28 Dec 2007, Joerg Schilling wrote:
[ ... ]
> POSIX grants that st_dev and st_ino together uniquely identify a file
> on a system. As long as neither st_dev nor st_ino change during the
> rename(2) call, POSIX does not prevent this rename operation.

Clarification request: Where's the piece in the standard that forces an 
interpretation:

"rename() operations shall not change st_ino/st_dev"

I don't see where such a requirement would come from.


FrankH.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

2007-12-28 Thread Frank Hofmann


On Fri, 28 Dec 2007, Joerg Schilling wrote:

> Frank Hofmann <[EMAIL PROTECTED]> wrote:
>
>> I don't think the standards would prevent us from adding "cross-fs rename"
>> capabilities. It's beyond the standards as of now, and I'd expect that
>> were it ever added to that it'd be an optional feature as well, to be
>> queried for via e.g. pathconf().
>
> Why do you believe there is a need for a pathconf() call?
> Either rename(2) succeeds or it fails with a cross-device error.

Why do you have a NAME_MAX / SYMLINK_MAX query - you can just as well let 
such requests fail with ENAMETOOLONG.

Why do you have a FILESIZEBITS query - there's EOVERFLOW to tell you.


There's no _need_. But the convenience exists for others as well.
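
A hedged sketch of that convenience - asking up front instead of waiting
for ENAMETOOLONG or EOVERFLOW; a hypothetical _PC_ query for cross-fs
rename would slot in the same way:

#include <unistd.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
        const char *path = (argc > 1) ? argv[1] : ".";

        /* pathconf() returns -1 if the limit is indeterminate. */
        printf("NAME_MAX     = %ld\n", pathconf(path, _PC_NAME_MAX));
        printf("FILESIZEBITS = %ld\n", pathconf(path, _PC_FILESIZEBITS));
        return (0);
}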


FrankH.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

2007-12-28 Thread Frank Hofmann


On Fri, 28 Dec 2007, Darren Reed wrote:
[ ... ]
> Is this behaviour defined by a standard (such as POSIX or the
> VFS design) or are we free to innovate here and do something
> that allowed such a shortcut as required?

Wrt. to standards, quote from:

http://www.opengroup.org/onlinepubs/009695399/functions/rename.html

ERRORS
The rename() function shall fail if:
[ ... ]
[EXDEV]
[CX]  The links named by new and old are on different file systems and
the implementation does not support links between file systems.

Hence, it's implementation-dependent, as per IEEE1003.1.

FrankH.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

2007-12-28 Thread Frank Hofmann


On Fri, 28 Dec 2007, Darren Reed wrote:

> [EMAIL PROTECTED] wrote:
>> On Thu, 27 Dec 2007, [EMAIL PROTECTED] wrote:
>> 
>>> 
 
 I would guess that this is caused by different st_dev values in the new
 filesystem. In such a case, mv copies the files instead of renaming them.
>>> 
>>> 
>>> No, it's because they are different filesystems and the data needs to be
>>> copied; zfs doesn't allow data movement between filesystems within a pool.
>> 
>> It's not ZFS that blocks this by design - it's the VFS framework. 
>> vn_rename() has this piece:
[ ... ]
>> ZFS will never even see such a rename request.
>
> Is this behaviour defined by a standard (such as POSIX or the
> VFS design) or are we free to innovate here and do something
> that allowed such a shortcut as required?
>
> Although I'm not sure the effort required would be worth the
> added complexity (to VFS and ZFS) for such a minor "feature".
>
> Darren

Hi Darren,

I don't think the standards would prevent us from adding "cross-fs rename" 
capabilities. It's beyond the standards as of now, and I'd expect that 
were it ever added to them, it'd be an optional feature as well, to be 
queried for via e.g. pathconf().

The VFS design/framework is "ours" - the OpenSolaris community is free to 
innovate there and change it as desired. It's not on the stability level 
of the DDI. You can't revamp it at a whim, but you can change/extend it.

Precedent exists for things that FS X can do but FS Y cannot, and 
changing the framework to check "does this fs claim to support cross-fs 
rename ?" wouldn't be too hard.

A filesystem could advertise that e.g. via VFSSW capability flags (the 
VSW_* stuff), or via VFS features (VFSFT_*; this is relatively recent, 
added by the CIFS project).

I don't know enough about ZFS internals to help you code the backend 
support, but if you wish to work on it, I'd be happy to help you with the 
framework changes. Those won't be more than ~50 lines.

"Minor feature" ? I guess that depends how you look at it. It would be 
another thing that highlights what noone else but ZFS can do for you.
Who knows what users will do with it in ten years :)

FrankH.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

2007-12-27 Thread Frank . Hofmann
On Thu, 27 Dec 2007, [EMAIL PROTECTED] wrote:

>
>>
>> I would guess that this is caused by different st_dev values in the new
>> filesystem. In such a case, mv copies the files instead of renaming them.
>
>
> No, it's because they are different filesystems and the data needs to be
> copied; zfs doesn't allow data movement between filesystems within a pool.

It's not ZFS that blocks this by design - it's the VFS framework. 
vn_rename() has this piece:

 /*
  * Make sure both the from vnode directory and the to directory
  * are in the same vfs and the to directory is writable.
  * We check fsid's, not vfs pointers, so loopback fs works.
  */
 if (fromvp != tovp) {
         vattr.va_mask = AT_FSID;
         if (error = VOP_GETATTR(fromvp, &vattr, 0, CRED(), NULL))
                 goto out;
         fsid = vattr.va_fsid;
         vattr.va_mask = AT_FSID;
         if (error = VOP_GETATTR(tovp, &vattr, 0, CRED(), NULL))
                 goto out;
         if (fsid != vattr.va_fsid) {
                 error = EXDEV;
                 goto out;
         }
 }

ZFS will never even see such a rename request.

FrankH.

>
> The code inside "mv" would immediately support such renames as it *first*
> checks whether rename works and only then will it try "plan B":
>
>    if (rename(source, target) >= 0)
>            return (0);
>    if (errno != EXDEV) {
>            /* fatal errors */
>    }
>    ... continue with plan B: copy & remove ...
>
>
> Casper
>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>

--
No good can come from selling your freedom, not for all the gold in the world,
for the value of this heavenly gift far exceeds that of any fortune on earth.
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [fuse-discuss] Filesystem Community? [was: SquashFS port, interested?]

2007-11-05 Thread Frank . Hofmann
On Mon, 5 Nov 2007, Mark Phalan wrote:

>
> On Mon, 2007-11-05 at 02:16 -0800, Thomas Lecomte wrote:
>> Hello there -
>>
>> I'm still waiting for an answer from Phillip Lougher [the SquashFS 
>> developer].
>> I had already contacted him some month ago, without any answer though.
>>
>> I'll still write a proposal, and probably start the work soon too.
>
> Sounds good!
>
> *me thinks it would be cool to finally have a generic filesystem
> community*

_Do_ we finally get one ? Can't wait :-)

FrankH.

>
> -M
>
> ___
> fuse-discuss mailing list
> [EMAIL PROTECTED]
> http://mail.opensolaris.org/mailman/listinfo/fuse-discuss
>

--
No good can come from selling your freedom, not for all the gold in the world,
for the value of this heavenly gift far exceeds that of any fortune on earth.
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs corruption w/ sil3114 sata controllers

2007-10-30 Thread Frank . Hofmann
On Tue, 30 Oct 2007, Tomasz Torcz wrote:

> On 10/30/07, Neal Pollack <[EMAIL PROTECTED]> wrote:
>>> I'm experiencing major checksum errors when using a syba silicon image 3114 
>>> based pci sata controller w/ nonraid firmware.  I've tested by copying data 
>>> via sftp and smb.  With everything I've swapped out, I can't fathom this 
>>> being a hardware problem.
>> Even before ZFS, I've had numerous situations where various si3112 and
>> 3114 chips
>> would corrupt data on UFS and PCFS, with very simple  copy and checksum
>> test scripts, doing large bulk transfers.
>
>  Those SIL chips are really broken when used with certain Seagate drives.
> But I have had data corrupted by them with a WD drive also.
> Linux can work around this bug by reducing transfer sizes (and thus
> dramatically impacting speed). Solaris probably doesn't have a workaround.

This might be slightly off-topic for the thread as a whole, but _this_ 
specific thing (reducing transfer sizes) is possible on Solaris as well. As 
documented here:

http://docs.sun.com/app/docs/doc/819-2724/chapter2-29?a=view

You can also read a bit more on the following thread:

http://www.opensolaris.org/jive/thread.jspa?threadID=6866

It's possible to limit this system-wide or per-LUN.
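
A hedged illustration of the system-wide knob (assuming the tunable meant
here is maxphys; the value is only an example - check the tuning guide
linked above for what is appropriate for your HBA):

   * /etc/system: cap physical transfer size at 128 KB
   set maxphys=0x20000

The per-LUN variant goes into the target driver's .conf file; the documents
linked above describe the exact syntax.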

Best regards,
FrankH.

> With this quirk enabled (on Linux), I get at most 20 MB/s from drives,
> but ZFS do not report any corruption. Before I had corruptions hourly.
>
> More info about SIL issue: http://home-tj.org/wiki/index.php/Sil_m15w
> I have Si 3112, but despite SIL claims other chips seem to be affected also.
>
>
> -- 
> Tomasz Torcz
> [EMAIL PROTECTED]
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>

--
No good can come from selling your freedom, not for all the gold in the world,
for the value of this heavenly gift far exceeds that of any fortune on earth.
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] UC Davis Cyrus Incident September 2007

2007-10-18 Thread Frank Hofmann
On Thu, 18 Oct 2007, Mike Gerdts wrote:

> On 10/18/07, Bill Sommerfeld <[EMAIL PROTECTED]> wrote:
>> that sounds like a somewhat mangled description of the cross-calls done
>> to invalidate the TLB on other processors when a page is unmapped.
>> (it certainly doesn't happen on *every* update to a mapped file).
>
> I've seen systems running Veritas Cluster & Oracle Cluster Ready
> Services idle at about 10% sys due to the huge number of monitoring
> scripts that kept firing.  This was on a 12 - 16 CPU 25k domain.  A

Monitoring scripts and mmap users ... URGH :(

That runs into procfs' notorious keenness on locking the address spaces of 
inspected processes. Even as much as an "ls -l /proc/<pid>/" acquires 
address space locks on that process, and I can see how/why this leads to 
CPU spikes when you have an application that heavily uses mmap()/munmap().

One could say, if you want this workload to perform well, trust it to 
perform well and restrain the urge to watch it all the time ...

> quite similar configuration on T2000's had negligible overhead.
> Lesson learned: cross-calls (and thread migrations, and ...) are much
> cheaper on systems with lower latency between CPUs.

And quantum theory tells us: If you hadn't looked, that cat might still be 
living happily ever after ... /proc isn't for free.

FrankH.

>
> -- 
> Mike Gerdts
> http://mgerdts.blogspot.com/
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] use 32-bit inode scripts on zfs?

2007-10-15 Thread Frank Hofmann
On Mon, 15 Oct 2007, Tom Davies wrote:

> Say, for example, old custom 32-bit perl scripts. Can they work with 
> 128-bit ZFS?

That question was posted either here or on some other help aliases 
recently ...

If you have any non-largefile-aware application that must under all 
circumstances be kept alive, run it within a filesystem that's smaller 
than 4GB - or, in the ZFS case, a filesystem with a quota of 4GB.

That'll preserve compatibility "for all eternity".
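
A hedged example (the pool and dataset names are made up, and this assumes
compression is left off so logical file size tracks space used):

   # zfs create -o quota=4g tank/legacy32

Nothing confined to that dataset can then grow a file past what a 32-bit
off_t can address.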

(A more precise answer needs more information about what '32-bit' means in 
the context of the question)

FrankH.

>
>
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] safe zfs-level snapshots with a UFS-on-ZVOL filesystem?

2007-10-08 Thread Frank Hofmann
On Mon, 8 Oct 2007, Dick Davies wrote:

> I had some trouble installing a zone on ZFS with S10u4
> (bug in the postgres packages) that went away when I  used a
> ZVOL-backed UFS filesystem
> for the zonepath.
>
> I thought I'd push on with the experiment (in the hope Live Upgrade
> would be able to upgrade such a zone).
> It's a bit unwieldy, but everything worked reasonably well -
> performance isn't much worse than straight ZFS (it gets much faster
> with compression enabled, but that's another story).
>
> The only fly in the ointment is that ZVOL level snapshots don't
> capture unsynced data up at the FS level. There's a workaround at:
>
>  http://blogs.sun.com/pgdh/entry/taking_ufs_new_places_safely
>
> but I wondered if there was anything else that could be done to avoid
> having to take such measures?
> I don't want to stop writes to get a snap, and I'd really like to avoid UFS
> snapshots if at all possible.

Hmm - "Difficult Problem" (TM :), that is.
UFS, by design, isn't atomic / self-consistent-at-all-times. Not even with 
UFS logging active, because the latter doesn't actually log userdata.

Which is why establishing a snapshot on UFS, whatever method (and that 
includes UFS' fssnap), involves creating a write barrier temporarily. You 
need to flush all data on UFS, while at the same time blocking updates 
beyond that point, before UFS (or, if not using fssnap, the admin) can 
'signal' to the lower levels that a snapshot is safe to create now.
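
A hedged sketch of that sequence for the ZVOL-backed case (mountpoint and
dataset names are made up):

   # lockfs -w /ufs-on-zvol          flush UFS and block further writes
   # zfs snapshot pool/ufsvol@clean
   # lockfs -u /ufs-on-zvol          release the write lock

The snapshot then holds a clean, mountable UFS image - at the price of the
temporary write barrier described below.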

One could conceive a (relatively small) codechange in UFS that would make 
this happen under the hood, e.g. in the form of an ioctl - "flush things 
and signal the 'driver level' ioctl(SNAP_ENABLE_NOW_SAFE)". But that 
would, although you would then not issue the "lockfs -w/u" yourself, still 
involve the temporary write barrier. Just as with fssnap, where this also 
happens under the hood.

Without userdata logging/journaling, it seems pretty difficult to get 
around that barrier constraint, though if you have suggestions for "how 
to do it in UFS", we should start talking about it. 
[EMAIL PROTECTED] is open :)

As said - "Difficult problem".

>
> I tried mounting forcedirectio in the (mistaken) belief that this
> would bypass the UFS
> buffer cache, but it didn't help.

Nope, it's a design issue in UFS; no userdata logging, no point-in-time 
consistency (for "everything").

FrankH.

>
> -- 
> Rasputin :: Jack of All Trades - Master of Nuns
> http://number9.hellooperator.net/
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Memory Usage

2007-09-14 Thread Frank Hofmann
On Fri, 14 Sep 2007, Sergey wrote:

> I am running Solaris U4 x86_64.
>
> Seems that something is changed regarding mdb:
>
> # mdb -k
> Loading modules: [ unix krtld genunix specfs dtrace cpu.AuthenticAMD.15 uppc 
> pcplusmp ufs ip hook neti sctp arp usba fctl nca lofs zfs random nfs sppp 
> crypto ptm ]
>> arc::print -a c_max
> mdb: failed to dereference symbol: unknown symbol name

See the comments at the bottom of:

http://bugs.opensolaris.org/view_bug.do?bug_id=6510807

Best regards,
FrankH.

>
>
>> ::arc -a
> {
>hits = 0x6baba0
>misses = 0x25ceb
>demand_data_hits = 0x2f0bb9
>demand_data_misses = 0x92bc
>demand_metadata_hits = 0x2b50db
>demand_metadata_misses = 0x14c20
>prefetch_data_hits = 0x5bfe
>prefetch_data_misses = 0x1d42
>prefetch_metadata_hits = 0x10f30e
>prefetch_metadata_misses = 0x60cd
>mru_hits = 0x62901
>mru_ghost_hits = 0x9dd5
>mfu_hits = 0x545ea4
>mfu_ghost_hits = 0xb9aa
>deleted = 0xcb5a3
>recycle_miss = 0x131fb
>mutex_miss = 0x1520
>evict_skip = 0x0
>hash_elements = 0x1ea54
>hash_elements_max = 0x40fac
>hash_collisions = 0x138464
>hash_chains = 0x92c7
> [..skipped..]
>
> How can I set/view arc.c_max now?
>
>
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Single SAN Lun presented to 4 Hosts

2007-08-28 Thread Frank Hofmann
On Tue, 28 Aug 2007, David Olsen wrote:

>> On 27/08/2007, at 12:36 AM, Rainer J.H. Brandt wrote:
[ ... ]
>>> I don't see why multiple UFS mounts wouldn't work, if only one
>>> of them has write access.  Can you elaborate?
>>
>> Even with a single writer you would need to be concerned with read
>> cache invalidation on the read-only hosts and (probably harder)
>> ensuring that read hosts don't rely on half-written updates (since
>> UFS doesn't do atomic on-disk updates).

That synchronization issue is always there for shared filesystems. For 
example, the NFS specs mention it explicitly, sections 4.11 / 4.12 of RFC 
1813 for reference. Some quotes:

4.11 Caching policies

The NFS version 3 protocol does not define a policy for
caching on the client or server. In particular, there is no
support for strict cache consistency between a client and
server, nor between different clients. See [Kazar] for a
discussion of the issues of cache synchronization and
mechanisms in several distributed file systems.

4.12 Stable versus unstable writes
[ ... ]
Unfortunately, client A can't tell for sure, so it will need
to retransmit the buffers, thus overwriting the changes from
client B.  Fortunately, write sharing is rare and the
solution matches the current write sharing situation. Without
using locking for synchronization, the behaviour will be
indeterminate.

"Just sharing" a filesystem, even when using something "made to share" 
like NFS, doesn't solve writer/reader cache consistency issues. There 
needs to be a locking / arbitration mechanism (which in NFS is provided by 
rpc.lockd _AND_ the use of flock/fcntl in the applications - and which is 
done by a QFS-private lockmgr daemon for the "shared writer" case) if the 
shared resource isn't readonly-for-everyone.

As long as everyone is a reader, or writes are extremely infrequent, 
"sharing" doesn't cause problems. But if that makes you decide to "simply 
share the SAN because it [seems to] work", think again. Sometimes, a 
little strategic planning is advisable.


FrankH.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Single SAN Lun presented to 4 Hosts

2007-08-28 Thread Frank Hofmann
On Tue, 28 Aug 2007, Charles DeBardeleben wrote:

> Are you sure that UFS writes a-time on read-only filesystems? I do not think
> that it is supposed to. If it does, I think that this is a bug. I have
> mounted
> read-only media before, and not gotten any write errors.
>
> -Charles

I think what might've been _meant_ here is sharing a UFS filesystem via 
NFS to different clients, some or all of which mount that 'NFS export' 
readonly. On the NFS server, you'll still see write activity on the 
backing filesystem - for access time updates.

That's in the context of this thread - "shared filesystem". UFS if 
mounted readonly should not write to the medium. Definitely not for atime 
updates.

FrankH.

>
> David Olsen wrote:
>>> On 27/08/2007, at 12:36 AM, Rainer J.H. Brandt wrote:
>>>> Sorry, this is a bit off-topic, but anyway:
>>>>
>>>> Ronald Kuehn writes:
>>>>> No. You can neither access ZFS nor UFS in that way. Only one
>>>>> host can mount the file system at the same time (read/write or
>>>>> read-only doesn't matter here).
>>>>
>>>> I can see why you wouldn't recommend trying this with UFS
>>>> (only one host knows which data has been committed to the disk),
>>>> but is it really impossible?
>>>>
>>>> I don't see why multiple UFS mounts wouldn't work, if only one
>>>> of them has write access.  Can you elaborate?
>>>
>>> Even with a single writer you would need to be concerned with read
>>> cache invalidation on the read-only hosts and (probably harder)
>>> ensuring that read hosts don't rely on half-written updates (since
>>> UFS doesn't do atomic on-disk updates).
>>>
>>> Even without explicit caching on the read-only hosts there is some
>>> "implicit caching" when, for example, a read host reads a directory
>>> entry and then uses that information to access a file. The file may
>>> have been unlinked in the meantime. This means that you need atomic
>>> reads, as well as writes.
>>>
>>> Boyd
>>> ___
>>> zfs-discuss mailing list
>>> zfs-discuss@opensolaris.org
>>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>>>
>>
>> It's worse than this.  Consider the read-only clients.  When you access a 
>> filesystem object (file, directory, etc.), UFS will write metadata to update 
>> atime.  I believe that there is a noatime option to mount, but I am unsure 
>> as to whether this is sufficient.
>>
>> my 2c.
>> --Dave
>>
>>
>> This message posted from opensolaris.org
>> ___
>> zfs-discuss mailing list
>> zfs-discuss@opensolaris.org
>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>>
>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS with HDS TrueCopy and EMC SRDF

2007-08-03 Thread Frank Hofmann
On Fri, 3 Aug 2007, Damon Atkins wrote:

[ ... ]
> UFS forcedirectio and VxFS closesync ensure that whatever happens, your files 
> will always exist if the program completes. Therefore, with (sync) disk 
> replication the file exists at the other site at its finished size. Introducing 
> DR with disk replication generally means you cannot afford to lose 
> any saved data. UFS forcedirectio has a larger performance hit than VxFS 
> closesync.

Hmm, not quite.

forcedirectio, at least on UFS, depends on the I/O operations meeting 
certain criteria. These are explained in directio(3C):

  DIRECTIO_ON The system behaves as though the application
  is  not  going to reuse the file data in the
  near future. In other words, the  file  data
  is not cached in the system's memory pages.

  When  possible,  data  is  read  or  written
  directly  between  the  application's memory
  and the device when  the  data  is  accessed
  with  read(2)  and write(2) operations. When
  such transfers are not possible, the  system
  switches  back  to the default behavior, but
  just for that  operation.  In  general,  the
  transfer  is possible when the application's
  buffer is  aligned  on  a  two-byte  (short)
  boundary,  the  offset into the file is on a
  device sector boundary, and the size of  the
  operation is a multiple of device sectors.

  This advisory  is  ignored  while  the  file
  associated   with   fildes  is  mapped  (see
  mmap(2)).

So, it all depends on what exactly your workload looks like. If you're 
doing non-blocked writes or writes to nonaligned offsets, and/or mmap 
access, directio is not being done, the advisory AND (!) the mount option 
notwithstanding.
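
For reference, the per-file advisory can also be requested from within the
application, see directio(3C) - a hedged sketch; as quoted above, it only
takes effect for I/O that meets the alignment criteria:

#include <sys/types.h>
#include <sys/fcntl.h>
#include <stdio.h>

int
request_directio(int fd)
{
        /* Advise direct I/O on this descriptor; ignored while the file is mmap'ed. */
        if (directio(fd, DIRECTIO_ON) != 0) {
                perror("directio");
                return (-1);
        }
        return (0);
}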

As far as the hot backup consistency goes:

Do a "lockfs -w", then start the BCV copy, then (once that started) do a 
"lockfs -u".
A writelocked filesystem is "clean"; it does not need to be fsck'ed before 
it can be mounted.

The disadvantage is that write ops to that fs in question will block while 
the lockfs -w is active. But then, you don't need to wait until the BCV 
finished - you only need the consistent state to start with, and can 
unlock immediately as the copy started.

Note that fssnap also writelocks temporarily. So if you have used 
UFS snapshots in the past, a "lockfs -w" ... "lockfs -u" sequence is not 
going to cause you any more impact.

"lockfs -f" is only a best-try-if-I-cannot-writelock. It's no guarantee 
for consistency, because by the time the command returns something else 
can already be writing again.


FrankH.


>
> Cheers
>
>
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS with HDS TrueCopy and EMC SRDF

2007-07-26 Thread Frank Hofmann
On Thu, 26 Jul 2007, Damon Atkins wrote:

> Guys,
>   What is the best way to ask for a feature enhancement to ZFS.
>
> To allow ZFS to be useful for DR disk replication, we need to be able to
> set an option against the pool or file system or both, called close
> sync, i.e. when a program closes a file, any outstanding writes are flushed
> to disk before the close returns to the program.  So when a program
> ends, you are guaranteed that any state information is saved to disk.
> (exit() also results in close being called)
>
> open(xxx, O_DSYNC) is only good if you can alter the source code.  Shell
> scripts' use of awk, head, tail, echo etc. to create output files does not
> use O_DSYNC; when the shell script returns 0, you want to know that all
> the data is on the disk, so if the system crashes the data is still there.
>
> PS it would be nice if UFS had closesync as well, instead of using
> forcedirectio.

I'd implement this via an LD_PRELOAD library, implementing your own 'close', 
so that it not only dispatches to libc`close but also does an fsync() 
call on that file descriptor first.
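
A hedged sketch of such an interposer (compile it as a shared object, e.g.
with cc -Kpic -G, link with -ldl if needed, and run the application with
LD_PRELOAD pointing at it):

#include <dlfcn.h>
#include <unistd.h>

/* Interpose close(): fsync first, then hand over to the real close(). */
int
close(int fd)
{
        static int (*real_close)(int);

        if (real_close == NULL)
                real_close = (int (*)(int))dlsym(RTLD_NEXT, "close");

        (void) fsync(fd);       /* harmlessly fails on sockets/pipes */
        return (real_close(fd));
}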

Or, if really wanting to make sourcecode changes, again change it in 
libc`close(), and make it depend on an environment variable; if 
DO_CLOSE_SYNC is set, perform fsync(); close() instead of just the 
latter.

There's a problem with sync-on-close anyway - mmap for file I/O. Who 
guarantees you that no file contents are being modified after the close() ?

FrankH.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: gzip compression throttles system?

2007-05-03 Thread Frank Hofmann


I'm not quite sure what this test is supposed to show ?

Compressing random data is the perfect way to generate heat.
After all, compression relies on input entropy being low.
But good random generators are characterized by the opposite - output 
entropy being high.
Even a good compressor, operated on a good random generator's output, 
will only end up burning cycles without reducing the data size.


Hence, is the request here for the compressor module to 'adapt' - to 
first-pass check whether the input data is sufficiently low-entropy to 
warrant a compression attempt ?


If not, then what ?

FrankH.

On Thu, 3 May 2007, Jürgen Keil wrote:


The reason you are busy computing SHA1 hashes is you are using
/dev/urandom.  The implementation of drv/random uses
SHA1 for mixing,
actually strictly speaking it is the swrand provider that does that part.


Ahh, ok.

So, instead of using dd reading from /dev/urandom all the time,
I've now used this quick C program to write one /dev/urandom block
over and over to the gzip compressed zpool:

=
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
   int fd;
   char buf[128*1024];

   fd = open("/dev/urandom", O_RDONLY);
   if (fd < 0) {
   perror("open /dev/urandom");
   exit(1);
   }
   if (read(fd, buf, sizeof(buf)) != sizeof(buf)) {
   perror("fill buf from /dev/urandom");
   exit(1);
   }
   close(fd);
   fd = open(argv[1], O_WRONLY|O_CREAT, 0666);
   if (fd < 0) {
   perror(argv[1]);
   exit(1);
   }
   for (;;) {
   if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
   break;
   }
   }
   close(fd);
   exit(0);
}
=


Avoiding the reads from /dev/urandom makes the effect even
more noticeable, the machine now "freezes" for 10+ seconds.

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 00   0 3109  3616  316  1965   17   48   45   2450  85   0  15
 10   0 3127  3797  592  2174   17   63   46   1760  84   0  15
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 00   0 3051  3529  277  2012   14   25   48   2160  83   0  17
 10   0 3065  3739  606  1952   14   37   47   1530  82   0  17
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 00   0 3011  3538  316  2423   26   16   52   2020  81   0  19
 10   0 3019  3698  578  2694   25   23   56   3090  83   0  17

# lockstat -kIW -D 20 sleep 30

Profiling interrupt: 6080 events in 31.341 seconds (194 events/sec)

Count indv cuml rcnt nsec Hottest CPU+PILCaller
---
2068  34%  34% 0.00 1767 cpu[0] deflate_slow
1506  25%  59% 0.00 1721 cpu[1] longest_match
1017  17%  76% 0.00 1833 cpu[1] mach_cpu_idle
 454   7%  83% 0.00 1539 cpu[0] fill_window
 215   4%  87% 0.00 1788 cpu[1] pqdownheap
 152   2%  89% 0.00 1691 cpu[0] copy_block
  89   1%  90% 0.00 1839 cpu[1] z_adler32
  77   1%  92% 0.0036067 cpu[1] do_splx
  64   1%  93% 0.00 2090 cpu[0] bzero
  62   1%  94% 0.00 2082 cpu[0] do_copy_fault_nta
  48   1%  95% 0.00 1976 cpu[0] bcopy
  41   1%  95% 0.0062913 cpu[0] mutex_enter
  27   0%  96% 0.00 1862 cpu[1] build_tree
  19   0%  96% 0.00 1771 cpu[1] gen_bitlen
  17   0%  96% 0.00 1744 cpu[0] bi_reverse
  15   0%  97% 0.00 1783 cpu[0] page_create_va
  15   0%  97% 0.00 1406 cpu[1] fletcher_2_native
  14   0%  97% 0.00 1778 cpu[1] gen_codes
  11   0%  97% 0.00  912 cpu[1]+6   ddi_mem_put8
   5   0%  97% 0.00 3854 cpu[1] fsflush_do_pages
---


It seems the same problem can be observed with "lzjb" compression,
but the pauses with lzjb are much shorter and the kernel consumes
less system cpu time with "lzjb" (which is expected, I think).


This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and Linux

2007-04-13 Thread Frank Hofmann

On Fri, 13 Apr 2007, Ignatich wrote:


Bart Smaalders writes:


Abide by the terms of the CDDL and all is well.  Basically, all you
have to do is make your changes to CDDL'd files available.  What you
do w/ the code you built (load it into MVS, ship a storage appliance,
build a ZFS for Linux) is up to you.


The problem is not with CDDL, GPL is the problem. ATI and nVidia do provide

^^


To sum all of this I see a number of possible solutions for this situation:

[ ... ]

N. Fix the GPL, to enable codesharing with opensource code of other licenses.

(you said above you recognized the problem - so why not fix the problem ?)

[ ... ]
4. GPL ZFS reimplementation project is started. I prefer that way until 1), 
2) or 3) happen.


Reminds me of "Project Harmony". One should try it. Qt got dual-licensed 
in the end. Whether that was due to Harmony "success" or just a business 
decision by Trolltech, who knows, but then there's precedence that seems 
to indicate such an approach may trigger the owner to license as GPL what 
used to be non-GPL code.




I know Sun opened most if not all ZFS related patents for OpenSolaris 
community. So I repeat questions I asked in my first mail:


1. Are those patents limited to CDDL/OpenSolaris code or can by used in 
GPL/Linux too?


2. If GPL code can't use those patented algorithms, will you please provide 
list of ZFS-related patents? RAID-Z and LZJB are most obvious technologies 
which may be patent protected.


These days, the situation with patents in computing is so bad that as a 
software writer, you essentially have no choice but to "wait and see". To 
have even a fairly trivial software project proactively checked against 
potential patent violations would add prohibitive legal costs that no 
independent software writer could shell out.


And just because a piece of software is under GPL doesn't mean it cannot 
violate a patent, and/or that you'd be free to re-use that patented 
technology, embodied in this sourcecode, in a completely different 
project. Licensing the patent and licensing the code are two different 
things, and not all opensource licenses "cover your *ss" wrt. patents.


(but this is really getting off-topic now)

FrankH.

___
zfs-discuss mailing list
[EMAIL PROTECTED]
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Is there any performance problem with hard links in ZFS?

2007-03-26 Thread Frank Hofmann

On Mon, 26 Mar 2007, Viktor Turskyi wrote:


I have tested link performance, and I got these results:
with hardlinks - no problems, reading 5 files one million times takes 38 
seconds.
With symlinks the situation is different - reading 5 files (through 
symlinks) one million times takes 3.5 minutes.
Is this OK? Is it normal that symlinks are so many times slower than hardlinks?


Tested this on UFS or on ZFS ?
How long were the filenames of the links ?

FrankH.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Efficiency when reading the same file blocks

2007-02-27 Thread Frank Hofmann

On Tue, 27 Feb 2007, Jeff Davis wrote:



Given your question are you about to come back with a
case where you are not
seeing this?



As a follow-up, I tested this on UFS and ZFS. UFS does very poorly: the I/O 
rate drops off quickly when you add processes while reading the same blocks 
from the same file at the same time. I don't know why this is, and it would be 
helpful if someone explained it to me.


UFS readahead isn't MT-aware - it starts thrashing when multiple threads 
perform reads of the same blocks. UFS readahead only works if it's a 
single thread per file, as the readahead state, i_nextr, is per-inode 
(and not per-thread) state. Multiple concurrent readers trash this for 
each other, as there's only one per file.




ZFS did a lot better. There did not appear to be any drop-off after the first 
process. There was a drop in I/O rate as I kept adding processes, but in that 
case the CPU was at 100%. I haven't had a chance to test this on a bigger box, 
but I suspect ZFS is able to keep the sequential read going at full speed (at 
least if the blocks happen to be written sequentially).


ZFS caches multiple readahead states - see the leading comment in
usr/src/uts/common/fs/zfs/vdev_cache.c in your favourite workspace.

FrankH.


I did these tests with each process being a "dd if=bigfile of=/dev/null" started at the same time, 
and I measured I/O rate with "zpool iostat mypool 2" and "iostat -Md 2".


This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Is ZFS file system supports short writes ?

2007-02-23 Thread Frank Hofmann

On Fri, 23 Feb 2007, Dan Mick wrote:

So, that would be an "error", and, other than reporting it accurately, what 
would you want ZFS to do to "support" it?


It's not an error for write(2) to return with fewer bytes written than 
requested. In some situations, that's pretty much expected - for 
example, when writing to network sockets. But filesystems may also decide to 
do short writes, e.g. when the write would extend the file 
but the filesystem runs out of space before all of the write has completed; 
it's up to the implementation whether it returns ENOSPC for all of the 
write or whether it returns the number of bytes successfully written. The 
same goes if you exceed the rlimits or quota allocations, or if the write is 
interrupted before completion.




dudekula mastan wrote:
If a write call attempted to write X bytes of data, and the write call writes 
only x (where x < X) bytes ...
 -Masthan



 > Please let me know the ZFS support for short writes ?


In the sense that it does them ? Well, it's UNIX/POSIX standard to do 
them, the write(2) manpage puts it like this:


 If a write() requests that more bytes be written than there
 is room for - for example, if the write would exceed the
 process file size limit (see getrlimit(2) and ulimit(2)), the
 system file size limit, or the free space on the device - only
 as many bytes as there is room for will be written. For
 example, suppose there is space for 20 bytes more in a file
 before reaching a limit. A write() of 512 bytes returns 20.
 The next write() of a non-zero number of bytes gives a
 failure return (except as noted for pipes and FIFO below).

I.e. you get a partial write before a failing write. ZFS behaves like 
this (on quota, definitely - "filesystem full" on ZFS is a bit different 
due to the space needs for COW), just as other filesystems do.
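
For completeness, a hedged sketch of how applications usually cope with
that - retry with the remainder until everything is written or a hard
error comes back:

#include <unistd.h>
#include <errno.h>

ssize_t
write_all(int fd, const void *buf, size_t len)
{
        const char *p = buf;
        size_t left = len;

        while (left > 0) {
                ssize_t n = write(fd, p, left);

                if (n < 0) {
                        if (errno == EINTR)
                                continue;       /* interrupted, retry */
                        return (-1);            /* hard error: ENOSPC, EDQUOT, ... */
                }
                p += n;                         /* short write: advance and retry */
                left -= n;
        }
        return ((ssize_t)len);
}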


Where have you encountered a filesystem _NOT_ supporting this behaviour ?

FrankH.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Implementing fbarrier() on ZFS

2007-02-12 Thread Frank Hofmann

On Mon, 12 Feb 2007, Toby Thain wrote:

[ ... ]
I'm no guru, but would not ZFS already require strict ordering for its 
transactions ... which property Peter was exploiting to get "fbarrier()" for 
free?


It achieves this by flushing the disk write cache when there's a need to 
barrier, which completes outstanding writes.


A "perfect fsync()" for ZFS shouldn't need to do much more; that it 
currently does is something that, as I understand, is being worked on.


FrankH.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Implementing fbarrier() on ZFS

2007-02-12 Thread Frank Hofmann

On Mon, 12 Feb 2007, Chris Csanady wrote:

[ ... ]

> Am I missing something?

How do you guarantee that the disk driver and/or the disk firmware doesn't
reorder writes ?

The only guarantee for in-order writes, on actual storage level, is to
complete the outstanding ones before issuing new ones.


This is true for NCQ with SATA, but SCSI also supports ordered tags,
so it should not be necessary.

At least, that is my understanding.


Except that ZFS doesn't talk SCSI, it talks to a target driver. And that 
one may or may not treat async I/O requests dispatched via its strategy() 
entry point as strictly ordered / non-coalescible / non-cancellable.


See e.g. disksort(9F).

FrankH.



Chris


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Implementing fbarrier() on ZFS

2007-02-12 Thread Frank Hofmann

On Mon, 12 Feb 2007, Peter Schuller wrote:


Hello,

Often fsync() is used not because one cares that some piece of data is on
stable storage, but because one wants to ensure the subsequent I/O operations
are performed after previous I/O operations are on stable storage. In these
cases the latency introduced by an fsync() is completely unnecessary. An
fbarrier() or similar would be extremely useful to get the proper semantics
while still allowing for better performance than what you get with fsync().

My assumption has been that this has not been traditionally implemented for
reasons of implementation complexity.

Given ZFS's copy-on-write transactional model, would it not be almost trivial
to implement fbarrier()? Basically just choose to wrap up the transaction at
the point of fbarrier() and that's it.

Am I missing something?


How do you guarantee that the disk driver and/or the disk firmware doesn't 
reorder writes ?


The only guarantee for in-order writes, on actual storage level, is to 
complete the outstanding ones before issuing new ones.


Or am _I_ now missing something :)

FrankH.



(I do not actually have a use case for this on ZFS, since my experience with
ZFS is thus far limited to my home storage server. But I have wished for an
fbarrier() many many times over the past few years...)

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <[EMAIL PROTECTED]>'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Project Proposal: Availability Suite

2007-02-05 Thread Frank Hofmann


Btw, in case it gets lost amid my devil's advocacy:
A happy +1 from me for the proposal !

FrankH.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Project Proposal: Availability Suite

2007-02-05 Thread Frank Hofmann

On Mon, 5 Feb 2007, Jim Dunham wrote:


Frank,

On Fri, 2 Feb 2007, Torrey McMahon wrote:


Jason J. W. Williams wrote:

Hi Jim,

Thank you very much for the heads up. Unfortunately, we need the
write-cache enabled for the application I was thinking of combining
this with. Sounds like SNDR and ZFS need some more soak time together
before you can use both to their full potential together?


Well... there is the fact that SNDR works with filesystems other than ZFS. 
(Yes, I know this is the ZFS list.) Working around architectural issues 
for ZFS and ZFS alone might cause issues for others.


SNDR has some issues with logging UFS as well. If you start a SNDR live 
copy on an active logging UFS (not _writelocked_), the UFS log state may 
not be copied consistently.


Treading "very" carefully, UFS logging may have issues with being replicated, 
not the other way around. SNDR replication (after synchronizing) maintains a 
write-order consistent volume, thus if there is an issue with UFS logging 
being able to access an SNDR secondary, then UFS logging will also have 
issues with accessing a volume after Solaris crashes. The end result of 
Solaris crashing, or SNDR replication stopping, is a write-ordered, 
crash-consistent volume.


Except that you're not getting user data consistency - because UFS logging 
only does the write-ordered crash consistency for metadata.


In other words, it's possible with UFS logging to see metadata changes 
(file growth/shrink, filling of holes in sparse files) that do not match 
the file contents - AFTER crash recovery.


To get full consistency of data and metadata across crashes / replication 
termination, with a replicator underneath, the filesystem needs a way of 
telling the replicator "and now start/stop replicating please". For the 
filesystem to barrier.


I'm not saying SNDR isn't doing a good job. I'm just saying it could do a 
perfect job if it integrated in this way with the filesystem on top. If 
there were 'start/stop' hooks.


II is a different matter again. It had, for some time - I don't know if 
that's still true - a window where it would EIO writes when enabling the 
image. Neither UFS logging nor ZFS very much likes being told "this 
critical write of yours errored out".


FrankH.



Given that both UFS logging and SNDR are (near) perfect (or there would be a 
flood of escalations), this issue in all cases I've seen to date, is that the 
SNDR primary volume being replicated is mounted with UFS logging enable, but 
the SNDR secondary is not mounted with UFS logging enabled. Once this 
condition happens, the problem can be resolved by fixing /etc/vfstab to 
correct the inconsistent mount options, and then performing an SNDR update 
sync.




If you want a live remote replication facility, it _NEEDS_ to talk to the 
filesystem somehow. There must be a callback mechanism that the filesystem 
could use to tell the replicator "and from exactly now on you start 
replicating". The only entity which can truly give this signal is the 
filesystem itself.


There is an RFE against SNDR for something called "in-line PIT". I hope that 
this work will get done soon.




And no, that's _not_ when the filesystem does a "flush write cache" ioctl. Or 
when the user has just issued a "sync" command or similar.
For ZFS, it'd be when a ZIL transaction is closed (as I understand it), for 
UFS it'd be when the UFS log is fully rolled. There's no notification to 
external entities when these two events happen.


Because ZFS is always on-disk consistent, this is not an issue. So far in ALL 
my testing with replicating ZFS with SNDR, I have not seen ZFS fail!


Of course be careful to not confuse my stated position with another closely 
related scenario. That being accessing ZFS on the remote node via a forced 
import "zpool import -f ", with  active SNDR replication, as ZFS is 
sure to panic the system. ZFS, unlike other filesystems has 0% tolerance to 
corrupted metadata.


Jim


SNDR tries its best to achieve this detection, but without actually 
_stopping_ all I/O (on UFS: writelocking), there's a window of 
vulnerability still open.
And SNDR/II don't stop filesystem I/O - by basic principle. That's how 
they're sold/advertised/intended to be used.


I'm all willing to see SNDR/II go open - we could finally work these issues 
!


FrankH.



I think the best of both worlds approach would be to let SNDR plug-in to 
ZFS along the same lines the crypto stuff will be able to plug in, 
different compression types, etc. There once was a slide that showed how 
that worked ... or I'm hallucinating again.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discu

Re: [zfs-discuss] Project Proposal: Availability Suite

2007-02-05 Thread Frank Hofmann

On Fri, 2 Feb 2007, Torrey McMahon wrote:


Jason J. W. Williams wrote:

Hi Jim,

Thank you very much for the heads up. Unfortunately, we need the
write-cache enabled for the application I was thinking of combining
this with. Sounds like SNDR and ZFS need some more soak time together
before you can use both to their full potential together?


Well... there is the fact that SNDR works with filesystems other than ZFS. (Yes, 
I know this is the ZFS list.) Working around architectural issues for ZFS and 
ZFS alone might cause issues for others.


SNDR has some issues with logging UFS as well. If you start a SNDR live 
copy on an active logging UFS (not _writelocked_), the UFS log state may 
not be copied consistently.


If you want a live remote replication facility, it _NEEDS_ to talk to the 
filesystem somehow. There must be a callback mechanism that the filesystem 
could use to tell the replicator "and from exactly now on you start 
replicating". The only entity which can truly give this signal is the 
filesystem itself.


And no, that's _not_ when the filesystem does a "flush write cache" ioctl. 
Or when the user has just issued a "sync" command or similar.
For ZFS, it'd be when a ZIL transaction is closed (as I understand it), 
for UFS it'd be when the UFS log is fully rolled. There's no notification 
to external entities when these two events happen.
SNDR tries its best to achieve this detection, but without actually 
_stopping_ all I/O (on UFS: writelocking), there's a window of 
vulnerability still open.
And SNDR/II don't stop filesystem I/O - by basic principle. That's how 
they're sold/advertised/intended to be used.


I'm all willing to see SNDR/II go open - we could finally work these 
issues !


FrankH.



I think the best of both worlds approach would be to let SNDR plug-in to ZFS 
along the same lines the crypto stuff will be able to plug in, different 
compression types, etc. There once was a slide that showed how that 
worked ... or I'm hallucinating again.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] using veritas dmp with ZFS (but not vxvm)

2007-01-03 Thread Frank Hofmann

On Wed, 3 Jan 2007, Darren Dunham wrote:


We have some HDS storage that isn't supported by mpxio, so we have to
use veritas dmp to get multipathing.



Whats the recommended way to use DMP storage with ZFS. I want to use
DMP but get at the multipathed virtual luns at as low a level as
possible to avoid using vxvm as much as possible.


I think that means creating a VxVM volume, and then passing that volume
to zpool.  DMP doesn't hand you a "device" that you can use.


It does - but you cannot use it. The /dev/vx/dmp/ devices aren't working 
properly, synchronization isn't done within DMP but within VxVM. There's a 
Veritas advisory and a SunAlert about _not_ putting filesystems directly 
onto vxdmp devices because that causes data corruption on the buf(9S)
linkage fields and subsequent crashes/hangs. If you have access to 
sunsolve.sun.com, see SunAlert 55980, or bug 4789779.





I figure theres no point in having overhead from 2 volume manages if
we can avoid it.


Sure.  I don't see an alternative for using DMP.  Given other reports of
using ZFS on top of SVM metadevices, I wouldn't expect peformance to
drop significantly.


As said, should work, a plain vxvm volume on top of the vxdmp node, then 
give that vxvm volume to ZFS.


FrankH.





Good luck.

--
Darren Dunham   [EMAIL PROTECTED]
Senior Technical Consultant TAOShttp://www.taos.com/
Got some Dr Pepper?   San Francisco, CA bay area
< This line left intentionally blank to confuse you. >
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: [security-discuss] Thoughts on ZFS Secure Delete - without using Crypto

2006-12-20 Thread Frank Hofmann

On Wed, 20 Dec 2006, Pawel Jakub Dawidek wrote:


On Tue, Dec 19, 2006 at 02:04:37PM +, Darren J Moffat wrote:

In case it wasn't clear I am NOT proposing a UI like this:

$ zfs bleach ~/Documents/company-finance.odp

Instead ~/Documents or ~ would be a ZFS file system with a policy set something 
like this:

# zfs set erase=file:zero

Or maybe more like this:

# zfs create -o erase=file -o erasemethod=zero homepool/darrenm

The goal is the same as the goal for things like compression in ZFS, no application 
change it is "free" for the applications.


I like the idea, I really do, but it will be so expensive because of
ZFS' COW model. Not only will file removal or truncation trigger bleaching,
but so will every single file system modification... Heh, well, if privacy of
your data is important enough, you probably don't care too much about
performance. I for one would prefer encryption, which may turn out to be
much faster than bleaching and also more secure.


And this kind of "deep bleaching" would also break if you use snapshots - 
how do you reliably bleach if you need to keep all of the old data 
around? You could only do so once the last snapshot is gone. Kind of 
defeating the idea - automatic but delayed indefinitely till operator 
intervention (deleting the last snapshot).


FrankH.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re[2]: ZFS in a SAN environment

2006-12-20 Thread Frank Hofmann

On Tue, 19 Dec 2006, Anton B. Rang wrote:


"INFORMATION: If a member of this striped zpool becomes unavailable or
develops corruption, Solaris will kernel panic and reboot to protect your data."


OK, I'm puzzled.

Am I the only one on this list who believes that a kernel panic, instead of 
EIO, represents a bug?


I think any use of cmn_err(CE_PANIC,...) should be viewed very critically ... 
because it's either trying to hide that we haven't bothered to create 
recoverability for a known-to-be-problematic situation, or because it's 
used where an ASSERT() should've been used.
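
To make the contrast concrete, a hedged sketch (illustrative driver-style code, 
not actual ZFS source):

#include <sys/cmn_err.h>
#include <sys/errno.h>
#include <sys/debug.h>

/*
 * Illustration only.  The first variant takes the whole machine down on
 * an I/O error; the second reports the failure to the caller.  ASSERT()
 * is for "can't happen" programming errors and compiles away in
 * non-DEBUG builds - it is not an error-handling strategy either.
 */
static int
read_block_panicky(int io_error)
{
	if (io_error != 0)
		cmn_err(CE_PANIC, "device error %d", io_error);	/* panics */
	return (0);
}

static int
read_block_graceful(int io_error)
{
	ASSERT(io_error >= 0);		/* invariant check, not error handling */
	if (io_error != 0)
		return (EIO);		/* let the caller decide what to do */
	return (0);
}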


FrankH.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thoughts on ZFS Secure Delete - without using Crypto

2006-12-19 Thread Frank Hofmann

On Tue, 19 Dec 2006, Darren J Moffat wrote:


Frank Hofmann wrote:
On the technical side, I don't think a new VOP will be needed. This could 
easily be done in VOP_SPACE together with a new per-fs property - bleach 
new block when it's allocated (aka VOP_SPACE directly, or in a backend also 
called e.g. on allocating writes / filling holes), bleach existing block 
when VOP_SPACE is used to "stamp a hole" into a file, aka a request is made 
to bleach the blocks of an existing file.
I.e. make the implementation behind ftruncate()/posix_fallocate() do the 
per-file bleaching if so desired. And that implementation is VOP_SPACE.


That isn't solving the problem though, it solves a different problem.


Well, the thread has taken lots of turns already; the "erase just a file" 
task has been mentioned, and someone threw the idea of "VOP_BLEACH" in.




The problem that I want to be solved is that as files/datasets/pools are 
deleted (not as they are allocated) they are bleached.  In the cases there


VOP_SPACE() does truncation (free) as well as growth (alloc).

would not be a call to posix_fallocate() or ftruncate(), but instead an unlink(2) 
or a zfs destroy or zpool destroy.  Also on hot-sparing a disk - if the old 
disk can still be written to in some way we should do our best to bleach it.


Since VOP_*() requires a filesystem (with "/" specifying "all of this 
fs"), per-zvol or per-vdev "bleaching" clearly needs a different implementation 
vehicle, as you don't have any handle there that you could 
call it with.


I wouldn't just brush the VOP_*() approach aside. The world isn't pure ZFS - 
there's more to it, whether you wish otherwise or not ...


FrankH.





--
Darren J Moffat


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thoughts on ZFS Secure Delete - without using Crypto

2006-12-19 Thread Frank Hofmann

On Tue, 19 Dec 2006, Jonathan Edwards wrote:



On Dec 18, 2006, at 11:54, Darren J Moffat wrote:


[EMAIL PROTECTED] wrote:

Rather than bleaching which doesn't always remove all stains, why can't
we use a word like "erasing" (which is hitherto unused for filesystem use
in Solaris, AFAIK)


and this method doesn't remove all stains from the disk anyway it just 
reduces them so they can't be easily seen ;-)


and if you add the right amount of ammonia is should remove everything .. 
(ahh - fun with trichloramine)


Fluoric acid will dissolve the magnetic film on the platter as well as the 
platter itself. Always keep a PTFE bottle with the stuff in, just in case


;)

On the technical side, I don't think a new VOP will be needed. This could 
easily be done in VOP_SPACE together with a new per-fs property - bleach 
new block when it's allocated (aka VOP_SPACE directly, or in a backend 
also called e.g. on allocating writes / filling holes), bleach existing 
block when VOP_SPACE is used to "stamp a hole" into a file, aka a request 
is made to bleach the blocks of an existing file.
I.e. make the implementation behind ftruncate()/posix_fallocate() do the 
per-file bleaching if so desired. And that implementation is VOP_SPACE.
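
A rough sketch of where such a policy could hang - all of the names below are 
made up, this is not VOP_SPACE code:

#include <sys/types.h>

typedef enum { BLEACH_OFF, BLEACH_ZERO } bleach_policy_t;

/* stand-ins for the real block-level operations (hypothetical) */
static void fs_zero_range(void *vp, off_t off, off_t len) { }
static int  fs_free_range(void *vp, off_t off, off_t len) { return (0); }

/*
 * Sketch: the entry point that frees file blocks (the VOP_SPACE
 * backend) already sees every truncation / hole-punch, so it is the
 * natural place to consult a per-fs "bleach" property and overwrite
 * the blocks before they are released.
 */
static int
fs_space_free(void *vp, off_t off, off_t len, bleach_policy_t pol)
{
	if (pol == BLEACH_ZERO)
		fs_zero_range(vp, off, len);	/* overwrite before freeing */
	return (fs_free_range(vp, off, len));
}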


FrankH.




---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



Re: [zfs-discuss] single memory allocation in the ZFS intent log

2006-10-06 Thread Frank Hofmann

On Thu, 5 Oct 2006, Erblichs wrote:


Casper Dik,

After my posting, I assumed that a code question should be
directed to the ZFS code alias, so I apologize to the people
who don't read code. However, since the discussion is here,
I will post a code proof here. Just use "time program" to get
a generic time frame. It is under 0.1 secs for 500k loops
(each loop removes an obj and puts it back).

It is just to be used as a proof of concept that a simple
pre-alloc'ed set of objects can be accessed so much faster
than allocating and assigning them.


Ok, could you please explain how this piece (and all else, for that 
matter):


/*
 * Get a node structure from the freelist
 */
struct node *
node_getnode()
{
struct node *node;

if ((node = nodefree) == NULL)  /* "shouldn't happen" */
printf("out of nodes");

nodefree = node->node_next;
node->node_next = NULL;

return (node);
}

is multithread-safe ?
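
(For contrast, a minimal sketch of what this getnode would need just to be 
correct under concurrency - one list mutex, nothing Solaris-specific; the 
contention on that single lock is exactly what the per-CPU magazine layer in 
the kernel allocator is there to avoid:)

#include <pthread.h>
#include <stdio.h>

struct node {
	struct node *node_next;
};

static struct node *nodefree;
static pthread_mutex_t nodefree_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Same freelist idea, but with the list protected by a mutex and an
 * explicit failure path instead of falling through on an empty list.
 */
struct node *
node_getnode_locked(void)
{
	struct node *node;

	pthread_mutex_lock(&nodefree_lock);
	node = nodefree;
	if (node == NULL) {
		pthread_mutex_unlock(&nodefree_lock);
		fprintf(stderr, "out of nodes\n");
		return (NULL);
	}
	nodefree = node->node_next;
	pthread_mutex_unlock(&nodefree_lock);
	node->node_next = NULL;
	return (node);
}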

Best wishes,
FrankH.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] single memory allocation in the ZFS intent log

2006-10-04 Thread Frank Hofmann

On Wed, 4 Oct 2006, Erblichs wrote:


Casper Dik,

Yes, I am familiar with Bonwick's slab allocators and tried
them for a wirespeed test of 64-byte pieces, first for 1Gb,
then 100Mb, and lastly 10Mb Ethernet. My results were not
encouraging. I assume it has improved over time.

First, let me ask what happens to the FS if the allocs
in the intent log code are sleeping, waiting for memory.


The same as would happen to the FS with your proposed additional allocator 
layer in if that "freelist" of yours runs out - it'll wait, you'll see a 
latency bubble.


You seem to think it's likely that a kmem_alloc(..., KM_SLEEP) will sleep. 
It's not. Anything but. See below.




IMO, The general problem with memory allocators is:

- getting memory from a "cache" of one's own size/type
  costs orders of magnitude more than just getting some
  off one's own freelist,


This is why the kernel memory allocator in Solaris has two such freelists:

- the per-CPU kmem magazines (you say below 'one step at a time',
  but that step is already done in Solaris kemem)
- the slab cache



- there is a built-in latency to recuperate/steal memory
  from other processes,


Stealing ("reclaim" in Solaris kmem terms) happens if the following three 
conditions are true:


- nothing in the per-CPU magazines
- nothing in the slab cache
- nothing in the quantum caches
- on the attempt to grow the quantum cache, the request to the
  vmem backend finds no readily-available heap to satisfy the
  growth demand immediately



- this stealing forces a sleep and context switches,

- the amount of time to sleep is indeterminate with a single
  call per struct. How long can you sleep for? 100ms or
  250ms or more..

- no process can guarantee a working set,


Yes and no. If your working set is small, use the stack.



In the time when memory was expensive, maybe a global
sharing mechanism would make sense, but when the amount
of memory is somewhat plentiful and cheap,

*** it then makes sense to have a 2-stage implementation:
preallocation of a working set, and then normal allocation
with the added latency.

So, it makes sense to pre-allocate a working set of allocs
by a single alloc call, break up the alloc into needed sizes,
and then alloc from your own free list,


See above - all of that _IS_ already done in Solaris kmem/vmem, with more 
parallelism and more intermediate caching layers designed to bring down 
allocation latency than your simple freelist approach would achieve.




-> if that freelist then empties, maybe then take the extra
overhead with the kmem call. Consider this an expected cost of exceeding
a certain watermark.

But otherwise, if I give you some code for the pre-alloc, I bet 10
allocs from the freelist can be done versus one kmem_alloc call, and
at least 100 to 10k allocs if a sleep occurs on your side.


The same statistics can be made for Solaris kmem - you satisfy the request 
from the per-CPU magazine, you satisfy the request from the slab cache, 
you satisfy the request via immediate vmem backend allocation and a growth 
of the slab cache. All of these with increased latency but without 
sleeping. Sleeping only comes in if you're so tight on memory that you 
need to perform coalescing in the backend, and purge least-recently-used 
things from other kmem caches in favour of new backend requests. Just 
because you chose to say kmem_alloc(...,KM_SLEEP) doesn't mean you _will_ 
sleep. Normally you won't.
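
To make the comparison concrete, a minimal sketch of letting the existing 
allocator handle fixed-size log records - the cache name and record layout are 
made up, and the kmem_cache_create()/kmem_cache_alloc() prototypes are quoted 
from memory of the OpenSolaris sources, so treat the details as illustrative:

#include <sys/types.h>
#include <sys/kmem.h>

/*
 * Sketch only: a kmem cache for fixed-size log records.  An alloc is
 * normally satisfied from the per-CPU magazine without taking a global
 * lock and without sleeping, even with KM_SLEEP.
 */
typedef struct log_rec {
	uint64_t	lr_seq;
	char		lr_data[512];	/* arbitrary payload size */
} log_rec_t;

static kmem_cache_t *log_rec_cache;

void
log_rec_init(void)
{
	log_rec_cache = kmem_cache_create("log_rec_cache",
	    sizeof (log_rec_t), 0, NULL, NULL, NULL, NULL, NULL, 0);
}

log_rec_t *
log_rec_alloc(void)
{
	return (kmem_cache_alloc(log_rec_cache, KM_SLEEP));
}

void
log_rec_free(log_rec_t *lrp)
{
	kmem_cache_free(log_rec_cache, lrp);
}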




Actually, I think it is so bad - why don't you time 1 kmem_free
versus grabbing elements off the freelist?

However, don't trust me - I will drop a snapshot of the code to you
tomorrow if you want, and you can make a single-CPU benchmark comparison.

Your multiple-CPU issue forces me to ask: is it a common
occurrence that 2 or more CPUs are simultaneously requesting
memory for the intent log? If it is, then there should be a
freelist of a low-watermark set of elements per CPU. However,
one thing at a time ...


Of course it's common - have two or more threads do filesystem I/O at the 
same time and you're already there. Which is why, one thing at a time, 
Solaris kmem has had the magazine layer for, I think (it predates my time at 
Sun), around 12 years now - to get SMP scalability. Been there, done that ...




So, do you want that code? It will do a single alloc of X units
and then place them on a freelist. You then time how long it takes to
remove Y elements from the freelist versus 1 kmem_alloc with
a NO_SLEEP arg and report the numbers. Then I would suggest the
call with the smallest sleep possible.

Re: [zfs-discuss] x86 CPU Choice for ZFS

2006-07-07 Thread Frank Hofmann

On Fri, 7 Jul 2006, Darren J Moffat wrote:


Eric Schrock wrote:

On Thu, Jul 06, 2006 at 09:53:32PM +0530, Pramod Batni wrote:

   offtopic query :
   How can ZFS require more VM address space but not more VM ?



The real problem is VA fragmentation, not consumption.  Over time, ZFS's
heavy use of the VM system causes the address space to become
fragmented.  Eventually, we will need to grab a 128k block of contiguous
VA, but can't find a contiguous region, despite having plenty of memory
(physical or virtual).


Interesting,  I saw and helped debug a very similar sounding problem with 
VxVM and VxFS on an E10k with 15TB of EMC storage and 10,000 NFS shares years 
ago.  This was on Solaris 2.6 so even though it was UltraSPARC CPU there was 
still only a 32bit address space.


Jeff Bonwick supplied the fixes for this, I don't remember the details but it 
did help reduce the memory fragmentation.   It does make me wonder though if 
these fixes that were applicable to 32bit SPARC work for 32bit x86.


Not quite comparable. The work that Jeff did then was the conversion of 
the old rmalloc-based heap mgmt. to vmem. The problem with the old 
allocator was that _any_ oversize allocation activity, even if it were a 
growth request from a kmem cache, led to heavy heap fragmentation, and 
the number of fragments in an rmalloc-based mechanism (see rmalloc(9F)) is 
limited. Vmem scales here, and the quantum caches (which is the part that 
got backported to 2.6) as an intermediate "band aid" also significantly 
reduce the number of calls into the heap allocator backend.
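
For illustration, a sketch of the vmem interface side of this - 
vmem_create()/vmem_alloc() are private kernel interfaces, and the prototype 
below is written from memory of the OpenSolaris sources, so treat it strictly 
as a sketch:

#include <sys/types.h>
#include <sys/vmem.h>

static vmem_t *example_arena;

/*
 * Sketch: the qcache_max argument is the interesting part - requests up
 * to that size are served out of per-size "quantum caches" (kmem caches
 * layered on the arena) instead of hitting the arena backend, which is
 * what took the pressure off the old rmalloc-style kernelmap.
 */
void
example_arena_init(void *base, size_t size)
{
	example_arena = vmem_create("example_arena", base, size,
	    4096,		/* quantum, e.g. one page */
	    NULL, NULL, NULL,	/* no import/release functions, no source */
	    16 * 4096,		/* qcache_max: quantum-cache small requests */
	    VM_SLEEP);
}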


vmem allows the heap to fragment - and still to function - which is a 
striking difference to rmalloc. Once the (determined at map creation time) 
number of slots in a resource map is reached, it doesn't matter whether 
there'd be free mem in the heap, you can't get at it unless you happen to 
request _exactly_ the size of an existing fragment. Otherwise, you'd need 
to split a fragment, creating two/three new ones, which you can't as there 
is no slot - Fragmentation with the pre-8 rmalloc heap is pathological. 
It's not with vmem, vmem allows the heap to work even if heavily 
fragmented. But if you have a heavy "oversize consumer", the long-term 
effect of that will be that all vmem arenas larger than the "most 
frequently used 'big' size" become empty. ZFS will make all free spans 
accumulate in the 128kB one under high load.


Ok, all that babbling in short: In Solaris 2.6, heap fragmentation was a
pathological scaling problem that led to a system hang sooner or later 
because of kernelmap exhaustion. The Vmem/quantum cache heap does function 
even if the heap gets very fragmented - it scales. It doesn't remove the 
possibility of the heap to fragment, but it deals with that gracefully. 
What still is there, though, is the ability of a kernel memory consumer to 
cause heap fragmentation - vmem can't solve the issue that if you allocate 
and free a huge number of N-sized slabs in random ways over time, the heap 
will in the end contain mostly N-sized fragments. That's what happens with 
ZFS.


FrankH.

==
No good can come from selling your freedom, not for all gold of the world,
for the value of this heavenly gift exceeds that of any fortune on earth.
==
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Trying to replicate ZFS self-heal demo and not seeing fixed error

2006-05-09 Thread Frank Hofmann

On Tue, 9 May 2006, Darren J Moffat wrote:


Paul van der Zwan wrote:

I just booted up Minix 3.1.1 today in Qemu and noticed to my surprise
that it has a disk nameing scheme similar to what Solaris uses.
It has c?d?p?s? - note that both p (PC FDISK, I assume) and s are used,


HP-UX uses the same scheme.



I think any system descending from the old SysV branch has the c?t?d?s? 
naming convention.
I don't remember which version first used it but as far as I remember it 
was already used in the mid 80's.


There is a difference though as far as I can tell.  Sometimes on Solaris we 
have p? for fdisk partitioning included and sometimes we don't; similarly we 
sometimes don't have t? for target.  Personally I'd prefer us to be 
consistent always, even if it leads to names like /dev/dsk/c0t0d0p0s0 if we 
are talking about the first Solaris VTOC slice, c0t0d0 for the whole disk, and 
c0t0d0p0 for the whole Solaris VTOC.


I second the call for consistency, but think that this means dumping 
partitions/slices from the actual device name. A disk is a disk - one unit 
of storage. How it is subdivided and how/whether the subdivisions are made 
available as device nodes should not be the worry of the disk driver, but 
rather that of an independent layer. The way it is now may have a history 
but that doesn't make it less confusing to me :(


The problem with the 'p' and 's' nodes is that they're _not_ used in 
consistent fashions. You already noticed that Solaris/SPARC doesn't have 
'p' nodes, and if e.g. you take a Solaris/SPARC disk and attach it to a 
Solaris/x86 machine, you won't see its 's' nodes either, and vice versa. 
How clean is that ? Why on earth do we use different names for 'whole 
disk' on SPARC/x86 ?


In short, why is it inevitable to deal with disks _only_ if they have 
labels ? Why no separate labeling layer ?




Then there is the issue of referencing FAT filesystems inside Windows 
Extended partitions, which would give rise to stuff like this 
/dev/dsk/c0t0d0p0:1 at the moment :-) which is only really understood by 
pcfs.


"understood" gives too much credit. PCFS acts on seeing this syntax, a 
well-trained animal. That it actually understands what it does (and worse, 
why it does so) is a bit far-fetched. And of course on SPARC, you'd rather 
use /dev/dsk/c0t0d0s2:1 ... if you know ...





--
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




Re: [zfs-discuss] Re: XATTRs, ZAP and the Mac

2006-05-04 Thread Frank Hofmann



ZFS must support POSIX semantics, part of which is hard links. Hard
links allow you to create multiple names (directory entries) for the
same file. Therefore, all UNIX filesystems have chosen to store the
file information separately for the directory entries (otherwise, you'd
have multiple copies, and need pointers between all of them so you could
update them all -- yuck).


For what it's worth, some file systems have chosen to special-case hard links
because they are rare and the directory/inode split hurts performance.  Apple's
HFS is a case in point.  The file metadata ("inode") is part of the directory entry, 
so that no additional disk access is required to retrieve it.  If the file is a hard 
link, this metadata is a pointer to the shared metadata for the file.


Yes, Microsoft's FAT does it the same way - the dirent is the inode.

This creates locking nightmares in its own right - directory scans/updates 
may be blocking file access; at the very least, the two race. It might 
have advantages in some situations, and simplifies the metadata 
implementation - but at least to me, it also causes headaches ... and an 
upset stomach every now and then ...
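
Purely to illustrate the structural difference being discussed - toy 
declarations, not any real on-disk format:

#include <stdint.h>

/*
 * "Split" model (classic UNIX): many directory entries may reference
 * one inode, so hard links are trivial, but each lookup needs a second
 * fetch to reach the metadata.
 */
struct split_inode {
	uint32_t	size;
	uint32_t	nlink;		/* how many names point here */
};

struct split_dirent {
	char		name[32];
	uint64_t	inode_no;	/* indirection to shared metadata */
};

/*
 * "Embedded" model (FAT-style): the metadata lives in the directory
 * entry itself, so lookup is one access, but a second name for the same
 * file has nowhere natural to point, and directory scans and file
 * updates contend on the same structure.
 */
struct embedded_dirent {
	char		name[32];
	uint32_t	size;		/* metadata stored in the dirent */
	uint32_t	first_cluster;
};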



FrankH.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss