Re: [zfs-discuss] Intent logs vs Journaling

2008-01-08 Thread Thomas Maier-Komor
 
 the ZIL is always there in host memory, even when no synchronous writes
 are being done, since the POSIX fsync() call could be made on an open
 write channel at any time, requiring all to-date writes on that channel
 to be committed to persistent store before it returns to the application
 ... it's cheaper to write the ZIL at this point than to force the entire
 5 sec buffer out prematurely
 

I have a question that is related to this topic: Why is there only a (tunable) 
5 second threshold and not also an additional threshold for the buffer size 
(e.g. 50MB)?

Sometimes I see my system writing huge amounts of data to a ZFS file system,
but the disks staying idle for 5 seconds, although memory consumption is
already quite high and it really would make sense (from my uneducated point of
view as an observer) to start writing the data to the disks. I think this
leads to the pumping effect that has been mentioned previously in one of the
forums here.

Can anybody comment on this?

TIA,
Thomas
 
 


Re: [zfs-discuss] Intent logs vs Journaling

2008-01-08 Thread Bill Moloney
 I have a question that is related to this topic: Why
 is there only a (tunable) 5 second threshold and not
 also an additional threshold for the buffer size
 (e.g. 50MB)?
 
 Sometimes I see my system writing huge amounts of
 data to a ZFS file system, but the disks staying
 idle for 5 seconds, although memory consumption is
 already quite high and it really would make sense
 (from my uneducated point of view as an observer)
 to start writing the data to the disks. I think this
 leads to the pumping effect that has been mentioned
 previously in one of the forums here.
 
 Can anybody comment on this?
 
 TIA,
 Thomas

because ZFS always writes to a new location on the disk, premature writing
can often result in redundant work ... a single host write to a ZFS object
results in the need to rewrite all of the changed data and meta-data leading
to that object

if a follow-up write to the same object occurs quickly, this entire path,
once again, has to be recreated, even though only a small portion of it is
actually different from the previous version

if both versions were written to disk, the result would be to physically write 
potentially large amounts of nearly duplicate information over and over
again, resulting in logically vacant bandwidth
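
to make the pattern concrete, here is a minimal sketch (in C, with a made-up
file path and sizes) of the kind of small, frequent sequential writes being
described ... if each of these 4KB writes were flushed immediately, the same
block path would be rewritten on disk over and over:

/* toy workload: many small sequential appends to one file ... if each
 * write were flushed to disk immediately, the whole copy-on-write path
 * (data block + indirect blocks + dnode + ...) would be rewritten for
 * every 4KB of new data; batching them in the transaction group
 * rewrites that path far less often ... the path is made up */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    memset(buf, 'x', sizeof(buf));

    int fd = open("/tank/testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* 16K appends of 4KB each = 64MB of small, back-to-back writes */
    for (int i = 0; i < 16384; i++) {
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
            perror("write");
            return 1;
        }
    }
    close(fd);
    return 0;
}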

consolidating these writes in host cache eliminates some redundant disk
writing, resulting in more productive bandwidth ... providing some ability to
tune the consolidation time window and/or the accumulated cache size may
seem like a reasonable thing to do, but I think that it's typically a moving
target, and depending on an adaptive, built-in algorithm to dynamically set
these marks (as ZFS claims it does) seems like a better choice

...Bill
 
 


Re: [zfs-discuss] Intent logs vs Journaling

2008-01-08 Thread Casper . Dik


 consolidating these writes in host cache eliminates some redundant disk
 writing, resulting in more productive bandwidth ... providing some ability to
 tune the consolidation time window and/or the accumulated cache size may
 seem like a reasonable thing to do, but I think that it's typically a moving
 target, and depending on an adaptive, built-in algorithm to dynamically set
 these marks (as ZFS claims it does) seems like a better choice


But it seems that when we're talking about full-block writes (such as
sequential file writes) ZFS could do a bit better.

And as long as there is bandwidth left to the disk and the controllers, it
is difficult to argue that the work is redundant.  If it's free in that
sense, it doesn't matter whether it is redundant.  But if it turns out NOT
to have been redundant, you save a lot.

Casper



Re: [zfs-discuss] Intent logs vs Journaling

2008-01-08 Thread Bill Moloney
 But it seems that when we're talking about full-block
 writes (such as sequential file writes) ZFS could do
 a bit better.
 
 And as long as there is bandwidth left to the disk
 and the controllers, it is difficult to argue that
 the work is redundant.  If it's free in that sense,
 it doesn't matter whether it is redundant.  But if it
 turns out NOT to have been redundant, you save a lot.
 

I think this is why an adaptive algorithm makes sense ... in situations where
an application issues frequent, progressive small writes, the amount
of redundant disk access can be significant, and longer consolidation times
may make sense ... larger writes (>= the FS block size) would benefit less
from longer consolidation times, and shorter thresholds could provide more
usable bandwidth

to get a sense of the issue here, I've done some write testing to previously
written files in a ZFS file system, and the choice of write element size
shows some big swings in actual vs data-driven bandwidth

when I launch a set of threads, each of which writes 4KB buffers
sequentially to its own file, I observe that for 60GB of application
writes, the disks see 230+GB of IO (reads and writes):
data-driven BW = ~41 MB/sec (my 60GB in ~1500 sec)
actual BW = ~157 MB/sec (the 230+GB in ~1500 sec)

if I do the same writes with 128KB buffers (the block size of my pool),
the same 60GB of writes only generates 95GB of disk IO (reads and writes):
data-driven BW = ~85 MB/sec (my 60GB in ~700 sec)
actual BW = ~134.6 MB/sec (the 95+GB in ~700 sec)

in the first case, longer consolidation times would have led to less total IO
and better data-driven BW, while in the second case shorter consolidation
times would have worked better
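
for reference, here is a rough sketch of the kind of test harness I'm
describing ... this is a simplified reconstruction, not the actual test code,
and the path prefix, thread count and per-thread totals are assumptions:

/* simplified reconstruction of the write test described above: each
 * thread sequentially writes fixed-size buffers to its own file ...
 * compile with -lpthread and vary BUFSZ (4KB vs 128KB) to compare how
 * much physical disk IO each byte of application data generates */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS   8
#define BUFSZ      (4 * 1024)                  /* try 128 * 1024 as well */
#define PERTHREAD  (8LL * 1024 * 1024 * 1024)  /* ~8GB per thread */

static void *writer(void *arg)
{
    long id = (long)arg;
    char path[64];
    snprintf(path, sizeof(path), "/tank/bench/file.%ld", id);

    char *buf = malloc(BUFSZ);
    if (buf == NULL) return NULL;
    memset(buf, 'x', BUFSZ);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); free(buf); return NULL; }

    for (long long done = 0; done < PERTHREAD; done += BUFSZ)
        if (write(fd, buf, BUFSZ) != BUFSZ) { perror("write"); break; }

    close(fd);
    free(buf);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, writer, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}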

as far as redundant writes possibly occupying free bandwidth (and thus
costing nothing), I think you also have to consider the related costs of
additional block scavenging, and less available free space at any specific 
instant, possibly limiting the sequentiality of the next write ... of
course there's also the additional device stress

in any case, I agree with you that ZFS could do a better job in this area,
but it's not as simple as just looking for large or small IOs ...
sequential vs random access patterns also play a big role (as you point out)

I expect (and hope) that the adaptive algorithms will mature over time,
eventually providing better behavior over a broader set of operating conditions
... Bill
 
 


Re: [zfs-discuss] Intent logs vs Journaling

2008-01-07 Thread Neil Perrin


parvez shaikh wrote:
 Hello,
 
 I am learning ZFS, its design and layout.
 
 I would like to understand how intent logs are different from journals.
 
 Journals too are logs of updates that ensure the consistency of a file system
 across crashes. The purpose of an intent log appears to be the same. I hope
 I am not missing something important in these concepts.

There is a difference. A journal contains the transactions necessary to
make the on-disk fs consistent. The ZFS intent log is not needed for consistency.
Here's an extract from http://blogs.sun.com/perrin/entry/the_lumberjack :


ZFS is always consistent on disk due to its transaction model. Unix system 
calls can be considered as transactions which are aggregated into a transaction 
group for performance and committed together periodically. Either everything 
commits or nothing does. That is, if the power goes out, then the transactions in
the pool are never partial. This commit happens fairly infrequently -
typically a few seconds between each transaction group commit.

Some applications, such as databases, need assurance that say the data they 
wrote or mkdir they just executed is on stable storage, and so they request 
synchronous semantics such as O_DSYNC (when opening a file), or execute 
fsync(fd) after a series of changes to a file descriptor. Obviously waiting 
seconds for the transaction group to commit before returning from the system 
call is not a high performance solution. Thus the ZFS Intent Log (ZIL) was born.
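
For illustration, here is a minimal sketch of the two ways an application can
request those synchronous semantics (the file paths are placeholders):

/* two ways an application requests synchronous write semantics; both
 * force ZFS to commit the data via the ZIL before the call returns,
 * instead of waiting for the next transaction group commit ... the
 * file paths below are placeholders */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char msg[] = "critical record\n";

    /* option 1: open with O_DSYNC, so every write() returns only
     * after the data has reached stable storage */
    int fd = open("/tank/db/logfile", O_WRONLY | O_CREAT | O_DSYNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (write(fd, msg, sizeof(msg) - 1) < 0) perror("write");
    close(fd);

    /* option 2: write normally, then call fsync(), which must not
     * return until all writes made so far on this descriptor are on
     * stable storage */
    fd = open("/tank/db/datafile", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (write(fd, msg, sizeof(msg) - 1) < 0) perror("write");
    if (fsync(fd) < 0) perror("fsync");
    close(fd);
    return 0;
}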


 
 Also, I read that updates in ZFS are intrinsically atomic, but I can't
 understand how they are intrinsically atomic:
 http://weblog.infoworld.com/yager/archives/2007/10/suns_zfs_is_clo.html
 
 I would be grateful if someone can address my query
 
 Thanks
 
 


Re: [zfs-discuss] Intent logs vs Journaling

2008-01-07 Thread Bill Moloney
file system journals may support a variety of availability models, ranging from
simple support for fast recovery (return to consistency) with possible data
loss, to models that attempt to support synchronous write semantics with no
data loss on failure, along with fast recovery

the simpler models use a persistent caching scheme for file system meta-data
that can be used to limit the possible sources of file system corruption,
avoiding a complete fsck run after a failure ... the journal specifies the only
possible sources of corruption, allowing a quick check-and-recover mechanism
... here the journal is always written with meta-data changes (at least), 
before the actual updated meta-data in question is over-written to its old
location on disk ... after a failure, the journal indicates what meta-data 
must be checked for consistency
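
as a toy illustration of that ordering (not any real file system's on-disk
format ... the offsets and record contents are invented), the journal record
must reach stable storage before the meta-data is over-written in place:

/* toy write-ahead ordering for a journaling file system (illustration
 * only, not a real on-disk format) ... the journal record describing a
 * meta-data change must reach stable storage before the meta-data block
 * is over-written in place, so that after a crash the journal names
 * exactly which blocks may be inconsistent */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define JOURNAL_OFF   0L      /* invented layout: journal area */
#define METADATA_OFF  4096L   /* invented layout: meta-data home */

static int journaled_update(int dev, const char *rec, size_t reclen,
                            const char *meta, size_t metalen)
{
    /* 1. append the intent record to the journal ... */
    if (pwrite(dev, rec, reclen, JOURNAL_OFF) != (ssize_t)reclen)
        return -1;
    /* 2. ... and force it to stable storage */
    if (fsync(dev) < 0)
        return -1;
    /* 3. only now is it safe to over-write the meta-data in place */
    if (pwrite(dev, meta, metalen, METADATA_OFF) != (ssize_t)metalen)
        return -1;
    return 0;
}

int main(void)
{
    const char rec[]  = "set-size inode=7 len=42";
    const char meta[] = "inode 7: len 42";
    int dev = open("/var/tmp/fakedisk", O_RDWR | O_CREAT, 0644);
    if (dev < 0) { perror("open"); return 1; }
    if (journaled_update(dev, rec, sizeof(rec) - 1,
                         meta, sizeof(meta) - 1) < 0)
        perror("journaled_update");
    close(dev);
    return 0;
}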

more elaborate models may cache both data and meta-data, to support 
limited data loss, synchronous writes and fast recovery ... newer file systems
often let you choose among these features

since ZFS never updates any data or meta-data in place (anything written into a
pool is always written to a new, unused location), it does not have the same
consistency issues that traditional file systems have to deal with ... a ZFS
pool is always in a consistent state, moving from an old state to a new state
only after the new state has been completely committed to persistent store ...
the final update to a new state depends on a single atomic write that either
succeeds (moving the system to a consistent new state) or fails, leaving the
system in its current consistent state ... there can be no interim inconsistent
state
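
a toy sketch of that commit scheme (offsets and record contents invented for
illustration) ... every new version goes to a fresh location, and the one
small in-place write of the root pointer is the atomic commit point:

/* toy copy-on-write commit (illustration only ... the real ZFS
 * uberblock mechanism is more elaborate) ... new data always lands at
 * a fresh offset, and the single small root-pointer write is the
 * atomic commit point: before it, readers see the old state, after
 * it, the new one */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define ROOT_OFF 0L   /* invented fixed home of the root pointer */

int main(void)
{
    int dev = open("/var/tmp/fakepool", O_RDWR | O_CREAT, 0644);
    if (dev < 0) { perror("open"); return 1; }

    /* 1. write the new version of the data to an unused location */
    uint64_t new_off = 8192;   /* fresh offset, nothing over-written */
    const char data[] = "new file contents";
    if (pwrite(dev, data, sizeof(data), (off_t)new_off) < 0)
        perror("pwrite data");
    fsync(dev);                /* the new state is fully persistent */

    /* 2. one atomic write of the root pointer commits the new state;
     * a crash before this leaves the old root pointing at a complete,
     * consistent old state */
    if (pwrite(dev, &new_off, sizeof(new_off), ROOT_OFF) < 0)
        perror("pwrite root");
    fsync(dev);
    close(dev);
    return 0;
}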

a ZFS pool builds its new state information in host memory for some period of
time (about 5 seconds), as host IOs are generated by various applications ...
at the end of this period these buffers are written to fresh locations on
persistent store as described above, meaning that application writes are
treated asynchronously by default, and in the face of a failure, some amount of
information that has been accumulating in host memory can be lost

if an application requires synchronous writes and a guarantee of no data loss,
then ZFS must somehow get the written information to persistent store before
the application's write call returns ... this is where the intent log comes in
... the system call information (including the data) involved in a synchronous
write operation is written to the intent log on persistent store before the
write call returns ... but the information is also written into the host
memory buffer scheduled for its 5 sec updates (just as if it were an
asynchronous write) ... at the end of the 5 sec update period the new host
buffers are written to disk, and, once committed, the information written to
the ZIL is no longer needed and can be jettisoned (so the ZIL never needs to
be very large)
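
a toy model of that double path (invented structure, not the real ZIL code)
... the synchronous write is logged and flushed immediately so the caller can
be released, joins the in-memory buffer with everything else, and the log
becomes disposable once the periodic commit lands:

/* toy model of the ZIL write path (invented structures ... not the
 * real implementation) ... a synchronous write is (a) appended to the
 * intent log and flushed so the caller can be released, and (b)
 * buffered in memory with all other writes for the periodic group
 * commit; once that commit is on disk, the log records are obsolete */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static char membuf[1 << 20];   /* in-memory txg buffer (toy) */
static size_t memlen;
static int logfd, poolfd;

/* synchronous write: log + flush before returning to the caller */
static void sync_write(const char *data, size_t len)
{
    write(logfd, data, len);    /* append the intent record */
    fsync(logfd);               /* caller may now be released */
    if (memlen + len <= sizeof(membuf)) {
        memcpy(membuf + memlen, data, len);  /* also joins the txg buffer */
        memlen += len;
    }
}

/* periodic (e.g. every 5 sec) group commit */
static void txg_commit(void)
{
    write(poolfd, membuf, memlen);  /* write the accumulated state */
    fsync(poolfd);                  /* now fully persistent */
    memlen = 0;
    ftruncate(logfd, 0);            /* ZIL records now obsolete */
}

int main(void)
{
    logfd  = open("/var/tmp/toy-zil",  O_RDWR | O_CREAT | O_TRUNC, 0644);
    poolfd = open("/var/tmp/toy-pool", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (logfd < 0 || poolfd < 0) { perror("open"); return 1; }
    sync_write("record A\n", 9);
    sync_write("record B\n", 9);
    txg_commit();
    close(logfd); close(poolfd);
    return 0;
}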

if the system fails, the accumulated but not flushed host buffer information
will be lost, but the ZIL records will already be on disk for any synchronous
writes and can be replayed when the host comes back up, or the pool is
imported by some other living host ... the pool, of course, always comes up
in a consistent state, but any ZIL records can be incorporated into a new 
consistent state before the pool is fully imported for use

the ZIL is always there in host memory, even when no synchronous writes
are being done, since the POSIX fsync() call could be made on an open 
write channel at any time, requiring all to-date writes on that channel
to be committed to persistent store before it returns to the application
... it's cheaper to write the ZIL at this point than to force the entire 5 sec
buffer out prematurely

synchronous writes can clearly have a significant negative performance 
impact in ZFS (or any other system) by forcing writes to disk before having a
chance to do more efficient, aggregated writes (the 5 second type), but
the ZIL solution in ZFS provides a good trade-off with a lot of room to
choose among various levels of performance and potential data loss ...
this is especially true with the recent addition of separate ZIL device
specification ... a small, fast (nvram type) device can be designated for
ZIL use, leaving slower spindle disks for the rest of the pool 

hope this helps ... Bill
 
 