Re: [zfs-discuss] Implementing fbarrier() on ZFS

2007-02-13 Thread Roch - PAE

  >
  > Given ZFS's copy-on-write transactional model, would it not be almost trivial
  > to implement fbarrier()? Basically just choose to wrap up the transaction at
  > the point of fbarrier() and that's it.
  >
  > Am I missing something?

  How do you guarantee that the disk driver and/or the disk firmware doesn't 
  reorder writes ?

  The only guarantee for in-order writes, on actual storage level, is to 
  complete the outstanding ones before issuing new ones.

  Or am _I_ now missing something :)

  FrankH.

As Jeff said, ZFS guarantees that the write(2)s are ordered, in the
sense that either they show up in the order supplied or they
don't show up at all.

So as the transaction closes, we can issue all the I/Os we
want in whatever order we choose (more or less), then flush
the caches. Up to here, none of the I/O would actually be
visible upon a reboot.

But then we update the ueberblock, flush the cache, and
we're done. All writes associated with a transaction
group show up at once in the main tree.
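
To make the sequence concrete, here is a minimal user-space sketch of
the same pattern (not ZFS code: fsync() stands in for the disk
write-cache flush, the offsets and the fixed-location "uberblock"
record are made up, and the data is assumed to fit in one 4K slot):

    /* Sketch of the copy-on-write commit pattern described above. */
    #include <sys/types.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    #define UBER_OFF   0            /* fake uberblock at a fixed offset    */
    #define DATA_BASE  4096         /* new data always goes to fresh space */

    int
    commit_group(int fd, const char *buf, size_t len, uint64_t txg)
    {
        /* 1. Write all new data to previously unused locations (COW). */
        off_t where = DATA_BASE + (off_t)txg * 4096;
        if (pwrite(fd, buf, len, where) != (ssize_t)len)
            return (-1);

        /* 2. Flush: nothing written so far is reachable from the root. */
        if (fsync(fd) != 0)
            return (-1);

        /* 3. Switch the root: point the "uberblock" at transaction txg. */
        uint64_t uber[2] = { txg, (uint64_t)where };
        if (pwrite(fd, uber, sizeof (uber), UBER_OFF) !=
            (ssize_t)sizeof (uber))
            return (-1);

        /* 4. Flush again so the new root itself is durable. */
        return (fsync(fd));
    }

A crash before step 3 leaves the old root intact; a crash after step 4
leaves the new one. Either way, the group's writes appear all at once
or not at all, which is exactly the ordering guarantee above.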

-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Implementing fbarrier() on ZFS

2007-02-13 Thread Peter Schuller
>  > That is interesting. Could this account for disproportionate kernel
>  > CPU usage for applications that perform I/O one byte at a time, as
>  > compared to other filesystems? (Nevermind that the application
>  > shouldn't do that to begin with.)
> 
> I just quickly measured this (overwriting files in CHUNKS);
> This is a software benchmark (I/O is a non-factor)
> 
>   CHUNK   ZFS vs UFS
> 
>   1B  4X slower
>   1K  2X slower
>   8K  25% slower
>   32K equal
>   64K 30% faster
> 
> Quick and dirty but I think it paints a picture.
> I can't really answer your question though.

I should probably have said "other filesystems on other platforms", I
did not really compare properly on the Solaris box. In this case it
was actually BitTorrent (the official Python client) that was
completely CPU bound in kernel space, and tracing showed single-byte
I/O.

Regardless, the above stats are interesting and I suppose consistent
with what one might expect, from previous discussion on this list.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <[EMAIL PROTECTED]>'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Implementing fbarrier() on ZFS

2007-02-13 Thread Roch - PAE
Peter Schuller writes:
 > > I agree about the usefulness of fbarrier() vs. fsync(), BTW.  The cool
 > > thing is that on ZFS, fbarrier() is a no-op.  It's implicit after
 > > every system call.
 > 
 > That is interesting. Could this account for disproportionate kernel
 > CPU usage for applications that perform I/O one byte at a time, as
 > compared to other filesystems? (Nevermind that the application
 > shouldn't do that to begin with.)

I just quickly measured this (overwriting files in CHUNKS);
This is a software benchmark (I/O is a non-factor)

CHUNK   ZFS vs UFS

1B  4X slower
1K  2X slower
8K  25% slower
32K equal
64K 30% faster

Quick and dirty but I think it paints a picture.
I can't really answer your question though.
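
For anyone who wants to reproduce something similar, a rough sketch of
such a chunked-overwrite micro-benchmark (my reconstruction, not the
actual test program used above; file name and sizes are arbitrary):

    /* Overwrite an existing file in CHUNK-sized write(2)s and time it.
     * Usage: ./chunkbench <existing-file> <chunk-bytes> <total-bytes> */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
        if (argc != 4) {
            fprintf(stderr, "usage: %s <file> <chunk> <total>\n", argv[0]);
            return (1);
        }
        size_t chunk = strtoul(argv[2], NULL, 0);
        size_t total = strtoul(argv[3], NULL, 0);
        char *buf = malloc(chunk);
        memset(buf, 'x', chunk);

        int fd = open(argv[1], O_WRONLY);   /* overwrite, don't truncate */
        if (fd < 0) {
            perror("open");
            return (1);
        }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t done = 0; done < total; done += chunk) {
            if (write(fd, buf, chunk) != (ssize_t)chunk) {
                perror("write");
                return (1);
            }
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) +
            (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%zu-byte chunks: %.3f s (no fsync; CPU-bound path)\n",
            chunk, secs);
        close(fd);
        free(buf);
        return (0);
    }

Since there is no fsync() in the loop, the numbers mostly reflect the
per-write(2) CPU cost, which is what the table above is comparing.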

-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Implementing fbarrier() on ZFS

2007-02-12 Thread Jeff Bonwick

That is interesting. Could this account for disproportionate kernel
CPU usage for applications that perform I/O one byte at a time, as
compared to other filesystems? (Nevermind that the application
shouldn't do that to begin with.)


No, this is entirely a matter of CPU efficiency in the current code.
There are two issues; we know what they are; and we're fixing them.

The first is that as we translate from znode to dnode, we throw away
information along the way -- we go from znode to object number (fast),
but then we have to do an object lookup to get from object number to
dnode (slow, by comparison -- or more to the point, slow relative to
the cost of writing a single byte).  But this is just stupid, since
we already have a dnode pointer sitting right there in the znode.
We just need to fix our internal interfaces to expose it.

The second problem is that we're not very fast at partial-block
updates.  Again, this is entirely a matter of code efficiency,
not anything fundamental.
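
In other words, something along these lines for the first issue (purely
illustrative pseudo-C with made-up names, not the real ZFS structures or
interfaces): resolve the object-number-to-dnode mapping once and cache
the pointer in the znode, instead of repeating the lookup on every write.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct dnode dnode_t;

    typedef struct znode {
        uint64_t  z_id;       /* object number -- cheap to get           */
        dnode_t  *z_dnode;    /* hypothetical cached pointer (the fix)   */
    } znode_t;

    /* stands in for the slow per-write lookup by object number */
    extern dnode_t *dnode_lookup_by_objnum(uint64_t objnum);

    static dnode_t *
    znode_to_dnode(znode_t *zp)
    {
        if (zp->z_dnode == NULL)                      /* resolve once...  */
            zp->z_dnode = dnode_lookup_by_objnum(zp->z_id);
        return (zp->z_dnode);                         /* ...then reuse it */
    }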


I still would love to see something like fbarrier() defined by some
standard (de facto or otherwise) to make the distinction between
ordered writes and guaranteed persistence more easily exploited in the
general case for applications, and encourage filesystems/storage
systems to optimize for that case (i.e., not have fbarrier() simply
fsync()).


Totally agree.

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Implementing fbarrier() on ZFS

2007-02-12 Thread Peter Schuller
> I agree about the usefulness of fbarrier() vs. fsync(), BTW.  The cool
> thing is that on ZFS, fbarrier() is a no-op.  It's implicit after
> every system call.

That is interesting. Could this account for disproportionate kernel
CPU usage for applications that perform I/O one byte at a time, as
compared to other filesystems? (Nevermind that the application
shouldn't do that to begin with.)

But the fact that you effectively have an fbarrier() is extremely
nice. Guess that is yet another reason to prefer ZFS for certain
(granted, very specific) cases.

I still would love to see something like fbarrier() defined by some
standard (de facto or otherwise) to make the distinction between
ordered writes and guaranteed persistence more easily exploited in the
general case for applications, and encourage filesystems/storage
systems to optimize for that case (i.e., not have fbarrier() simply
fsync()).

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <[EMAIL PROTECTED]>'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Implementing fbarrier() on ZFS

2007-02-12 Thread Peter Schuller
> That said, actually implementing the underlying mechanisms may not be
> worth the trouble.  It is only a matter of time before disks have fast
> non-volatile memory like PRAM or MRAM, and then the need to do
> explicit cache management basically disappears.

I meant fbarrier() as a syscall exposed to userland, like fsync(), so
that userland applications can achieve ordered semantics without
synchronous writes. Whether or not ZFS in turn manages to eliminate
synchronous writes by some feature of the underlying storage mechanism
is a separate issue. But even if not, an fbarrier() exposes an
asynchronous method of ensuring relative order of I/O operations to
userland, which is often useful.
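
As a sketch of what I mean (fbarrier() here is of course hypothetical --
no such call exists today, and fsync() is the heavyweight stand-in; the
function, offsets and sizes are made up for illustration):

    #include <stddef.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* hypothetical ordering-only call -- not a real syscall */
    extern int fbarrier(int fd);

    void
    update_record(int fd,
        const char *journal, size_t jlen, off_t joff,
        const char *data, size_t dlen, off_t doff)
    {
        (void) pwrite(fd, journal, jlen, joff);  /* 1. log the intent      */
        (void) fbarrier(fd);                     /* 2. order, don't block  */
        (void) pwrite(fd, data, dlen, doff);     /* 3. overwrite in place  */
        /*
         * After a crash, the on-disk state may contain both writes, only
         * the journal entry, or neither -- but never the overwrite without
         * its journal entry.  fsync() at step 2 would give the same
         * ordering, at the cost of blocking until the journal entry is on
         * stable storage.
         */
    }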

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <[EMAIL PROTECTED]>'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Implementing fbarrier() on ZFS

2007-02-12 Thread Jeff Bonwick

Do you agree that there is a major tradeoff of
"builds up a wad of transactions in memory"?


I don't think so.  We trigger a transaction group commit when we
have lots of dirty data, or 5 seconds elapse, whichever comes first.
In other words, we don't let updates get stale.
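
The decision logic amounts to something like this (a sketch of the
policy, not the actual ZFS code; the dirty-data threshold is illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    #define TXG_TIMEOUT_SEC    5                 /* ~5 seconds          */
    #define TXG_DIRTY_LIMIT    (64ULL << 20)     /* made-up byte limit  */

    static bool
    txg_should_commit(uint64_t dirty_bytes, uint64_t secs_since_open)
    {
        /* commit when either condition trips, whichever comes first */
        return (dirty_bytes >= TXG_DIRTY_LIMIT ||
            secs_since_open >= TXG_TIMEOUT_SEC);
    }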

Jeff

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Implementing fbarrier() on ZFS

2007-02-12 Thread Erblichs
Jeff Bonwick,

Do you agree that there is a major tradeoff of
"builds up a wad of transactions in memory"?

We lose the changes if we have an unstable
environment.

Thus, I don't quite understand why a 2-phase
approach to commits isn't done. First, take the
transactions as they come and do a minimal amount
of delayed writing. If the number of transactions
builds up, then convert to the delayed-write scheme.

The assumption is that not all ZFS envs are write-heavy
versus write-once, read-many type accesses.
My assumption is that attribute/metadata reading
outweighs all other accesses.

Wouldn't this approach allow minimal outstanding
transactions and favor read access? Yes, the assumption
is that once the "wad" is started, the amount of writing
could be substantial and thus the amount of available
bandwidth for reading is reduced. This would then allow
for more N states to be available. Right?

Second, there are multiple uses of "then" (then pushes,
then flushes all disk..., then writes the new uberblock,
then flushes the caches again), which seems to have
some level of possible parallelism that should reduce the
latency from the start to the final write. Or did you just
say that for simplicity's sake?

Mitchell Erblich
---


Jeff Bonwick wrote:
> 
> Toby Thain wrote:
> > I'm no guru, but would not ZFS already require strict ordering for its
> > transactions ... which property Peter was exploiting to get "fbarrier()"
> > for free?
> 
> Exactly.  Even if you disable the intent log, the transactional nature
> of ZFS ensures preservation of event ordering.  Note that disk caches
> don't come into it: ZFS builds up a wad of transactions in memory,
> then pushes them out as a transaction group.  That entire group will
> either commit or not.  ZFS writes all the new data to new locations,
> then flushes all disk write caches, then writes the new uberblock,
> then flushes the caches again.  Thus you can lose power at any point
> in the middle of committing transaction group N, and you're guaranteed
> that upon reboot, everything will either be at state N or state N-1.
> 
> I agree about the usefulness of fbarrier() vs. fsync(), BTW.  The cool
> thing is that on ZFS, fbarrier() is a no-op.  It's implicit after
> every system call.
> 
> Jeff
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Implementing fbarrier() on ZFS

2007-02-12 Thread Jeff Bonwick

Toby Thain wrote:
I'm no guru, but would not ZFS already require strict ordering for its 
transactions ... which property Peter was exploiting to get "fbarrier()" 
for free?


Exactly.  Even if you disable the intent log, the transactional nature
of ZFS ensures preservation of event ordering.  Note that disk caches
don't come into it: ZFS builds up a wad of transactions in memory,
then pushes them out as a transaction group.  That entire group will
either commit or not.  ZFS writes all the new data to new locations,
then flushes all disk write caches, then writes the new uberblock,
then flushes the caches again.  Thus you can lose power at any point
in the middle of committing transaction group N, and you're guaranteed
that upon reboot, everything will either be at state N or state N-1.

I agree about the usefulness of fbarrier() vs. fsync(), BTW.  The cool
thing is that on ZFS, fbarrier() is a no-op.  It's implicit after
every system call.

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Implementing fbarrier() on ZFS

2007-02-12 Thread Chris Csanady

2007/2/12, Frank Hofmann <[EMAIL PROTECTED]>:

On Mon, 12 Feb 2007, Chris Csanady wrote:

> This is true for NCQ with SATA, but SCSI also supports ordered tags,
> so it should not be necessary.
>
> At least, that is my understanding.

Except that ZFS doesn't talk SCSI, it talks to a target driver. And that
one may or may not treat async I/O requests dispatched via its strategy()
entry point as strictly ordered / non-coalescible / non-cancellable.

See e.g. disksort(9F).


Yes, however, this functionality could be exposed through the target
driver.  While the implementation does not (yet) take full advantage
of ordered tags, Linux does provide an interface to do this:

   http://www.mjmwired.net/kernel/Documentation/block/barrier.txt


From a correctness standpoint, the interface seems worthwhile, even if
the mechanisms are never implemented.  It just feels wrong to execute
a synchronize cache command from ZFS, when often that is not the
intention.  The changes to ZFS itself would be very minor.

That said, actually implementing the underlying mechanisms may not be
worth the trouble.  It is only a matter of time before disks have fast
non-volatile memory like PRAM or MRAM, and then the need to do
explicit cache management basically disappears.

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Implementing fbarrier() on ZFS

2007-02-12 Thread Frank Hofmann

On Mon, 12 Feb 2007, Toby Thain wrote:

[ ... ]
I'm no guru, but would not ZFS already require strict ordering for its 
transactions ... which property Peter was exploiting to get "fbarrier()" for 
free?


It achieves this by flushing the disk write cache when there's a need to
barrier, which completes outstanding writes.

A "perfect fsync()" for ZFS shouldn't need to do much more; that it does
right now is something that, as I understand it, is being worked on.


FrankH.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Implementing fbarrier() on ZFS

2007-02-12 Thread Toby Thain


On 12-Feb-07, at 5:55 PM, Frank Hofmann wrote:


On Mon, 12 Feb 2007, Peter Schuller wrote:


Hello,

Often fsync() is used not because one cares that some piece of data is
on stable storage, but because one wants to ensure the subsequent I/O
operations are performed after previous I/O operations are on stable
storage. In these cases the latency introduced by an fsync() is
completely unnecessary. An fbarrier() or similar would be extremely
useful to get the proper semantics while still allowing for better
performance than what you get with fsync().

My assumption has been that this has not been traditionally implemented
for reasons of implementation complexity.

Given ZFS's copy-on-write transactional model, would it not be almost
trivial to implement fbarrier()? Basically just choose to wrap up the
transaction at the point of fbarrier() and that's it.

Am I missing something?


How do you guarantee that the disk driver and/or the disk firmware
doesn't reorder writes ?


The only guarantee for in-order writes, on actual storage level, is to
complete the outstanding ones before issuing new ones.


Or am _I_ now missing something :)


I'm no guru, but would not ZFS already require strict ordering for its
transactions ... which property Peter was exploiting to get "fbarrier()"
for free?


--Toby
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Implementing fbarrier() on ZFS

2007-02-12 Thread Bart Smaalders

Peter Schuller wrote:

Hello,

Often fsync() is used not because one cares that some piece of data is on 
stable storage, but because one wants to ensure the subsequent I/O operations 
are performed after previous I/O operations are on stable storage. In these 
cases the latency introduced by an fsync() is completely unnecessary. An 
fbarrier() or similar would be extremely useful to get the proper semantics 
while still allowing for better performance than what you get with fsync().


My assumption has been that this has not been traditionally implemented for 
reasons of implementation complexity.


Given ZFS's copy-on-write transactional model, would it not be almost trivial 
to implement fbarrier()? Basically just choose to wrap up the transaction at 
the point of fbarrier() and that's it.


Am I missing something?

(I do not actually have a use case for this on ZFS, since my experience with 
ZFS is thus far limited to my home storage server. But I have wished for an 
fbarrier() many many times over the past few years...)




Hmmm... is store ordering what you're looking for?  E.g.,
make sure that in the case of power failure, all previous writes
will be visible after reboot if any subsequent write is visible.


- Bart


--
Bart Smaalders  Solaris Kernel Performance
[EMAIL PROTECTED]   http://blogs.sun.com/barts
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Implementing fbarrier() on ZFS

2007-02-12 Thread Frank Hofmann

On Mon, 12 Feb 2007, Chris Csanady wrote:

[ ... ]

> Am I missing something?

How do you guarantee that the disk driver and/or the disk firmware doesn't
reorder writes ?

The only guarantee for in-order writes, on actual storage level, is to
complete the outstanding ones before issuing new ones.


This is true for NCQ with SATA, but SCSI also supports ordered tags,
so it should not be necessary.

At least, that is my understanding.


Except that ZFS doesn't talk SCSI, it talks to a target driver. And that 
one may or may not treat async I/O requests dispatched via its strategy() 
entry point as strictly ordered / non-coalescible / non-cancellable.


See e.g. disksort(9F).

FrankH.



Chris


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Implementing fbarrier() on ZFS

2007-02-12 Thread Chris Csanady

2007/2/12, Frank Hofmann <[EMAIL PROTECTED]>:

On Mon, 12 Feb 2007, Peter Schuller wrote:

> Hello,
>
> Often fsync() is used not because one cares that some piece of data is on
> stable storage, but because one wants to ensure the subsequent I/O operations
> are performed after previous I/O operations are on stable storage. In these
> cases the latency introduced by an fsync() is completely unnecessary. An
> fbarrier() or similar would be extremely useful to get the proper semantics
> while still allowing for better performance than what you get with fsync().
>
> My assumption has been that this has not been traditionally implemented for
> reasons of implementation complexity.
>
> Given ZFS's copy-on-write transactional model, would it not be almost trivial
> to implement fbarrier()? Basically just choose to wrap up the transaction at
> the point of fbarrier() and that's it.
>
> Am I missing something?

How do you guarantee that the disk driver and/or the disk firmware doesn't
reorder writes ?

The only guarantee for in-order writes, on actual storage level, is to
complete the outstanding ones before issuing new ones.


This is true for NCQ with SATA, but SCSI also supports ordered tags,
so it should not be necessary.

At least, that is my understanding.

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Implementing fbarrier() on ZFS

2007-02-12 Thread Frank Hofmann

On Mon, 12 Feb 2007, Peter Schuller wrote:


Hello,

Often fsync() is used not because one cares that some piece of data is on
stable storage, but because one wants to ensure the subsequent I/O operations
are performed after previous I/O operations are on stable storage. In these
cases the latency introduced by an fsync() is completely unnecessary. An
fbarrier() or similar would be extremely useful to get the proper semantics
while still allowing for better performance than what you get with fsync().

My assumption has been that this has not been traditionally implemented for
reasons of implementation complexity.

Given ZFS's copy-on-write transactional model, would it not be almost trivial
to implement fbarrier()? Basically just choose to wrap up the transaction at
the point of fbarrier() and that's it.

Am I missing something?


How do you guarantee that the disk driver and/or the disk firmware doesn't 
reorder writes ?


The only guarantee for in-order writes, on actual storage level, is to 
complete the outstanding ones before issuing new ones.


Or am _I_ now missing something :)

FrankH.



(I do not actually have a use case for this on ZFS, since my experience with
ZFS is thus far limited to my home storage server. But I have wished for an
fbarrier() many many times over the past few years...)

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <[EMAIL PROTECTED]>'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Implementing fbarrier() on ZFS

2007-02-12 Thread Peter Schuller
Hello,

Often fsync() is used not because one cares that some piece of data is on 
stable storage, but because one wants to ensure the subsequent I/O operations 
are performed after previous I/O operations are on stable storage. In these 
cases the latency introduced by an fsync() is completely unnecessary. An 
fbarrier() or similar would be extremely useful to get the proper semantics 
while still allowing for better performance than what you get with fsync().

My assumption has been that this has not been traditionally implemented for 
reasons of implementation complexity.

Given ZFS's copy-on-write transactional model, would it not be almost trivial 
to implement fbarrier()? Basically just choose to wrap up the transaction at 
the point of fbarrier() and that's it.

Am I missing something?

(I do not actually have a use case for this on ZFS, since my experience with 
ZFS is thus far limited to my home storage server. But I have wished for an 
fbarrier() many many times over the past few years...)

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <[EMAIL PROTECTED]>'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss