Re: Tailmerging for Ext2

2000-07-26 Thread Alexander Viro



On Wed, 26 Jul 2000, Daniel Phillips wrote:

> Stephen asked me some sharp questions about how this would work,
> and after I answered them to his satisfaction he asked me if I
> would have time to implement this feature.  I said yes, and went
> on to write an initial design document describing the required
> modifications to Ext's handling of inodes, and a prototype
> algorithm for doing the tail merging.

Here is one more for you:
Suppose we grow the last fragment/tail/whatever. Do you copy the
data out of that shared block? If so, how do you update buffer_heads in
pages that cover the relocated data? (Same goes for reiserfs, if they are
doing something similar). BTW, our implementation of UFS is fucked up in
that respect, so variant from there will not work.




Re: Tailmerging for Ext2

2000-07-26 Thread Stephen C. Tweedie

Hi,

On Wed, Jul 26, 2000 at 02:05:11PM -0400, Alexander Viro wrote:
 
> Here is one more for you:
> Suppose we grow the last fragment/tail/whatever. Do you copy the
> data out of that shared block? If so, how do you update buffer_heads in
> pages that cover the relocated data? (Same goes for reiserfs, if they are
> doing something similar). BTW, our implementation of UFS is fucked up in
> that respect, so variant from there will not work.

For tail writes, I'd imagine we would just end up using the page cache
as a virtual cache as NFS uses it, and doing plain copy into the
buffer cache pages.

Cheers,
 Stephen



Re: Tailmerging for Ext2

2000-07-26 Thread Alexander Viro



On Wed, 26 Jul 2000, Stephen C. Tweedie wrote:

> Hi,
>
> On Wed, Jul 26, 2000 at 02:05:11PM -0400, Alexander Viro wrote:
> >
> > Here is one more for you:
> > Suppose we grow the last fragment/tail/whatever. Do you copy the
> > data out of that shared block? If so, how do you update buffer_heads in
> > pages that cover the relocated data? (Same goes for reiserfs, if they are
> > doing something similar). BTW, our implementation of UFS is fucked up in
> > that respect, so variant from there will not work.
>
> For tail writes, I'd imagine we would just end up using the page cache
> as a virtual cache as NFS uses it, and doing plain copy into the
> buffer cache pages.

Ouch. I _really_ don't like it - we end up with special behaviour on one
page in the pagecache. And getting data migration from buffer cache to
page cache, which is Not Nice(tm). Yuck... Besides, when do we decide that
tail is going to be, erm, merged? What will happen with the page then?




Re: Tailmerging for Ext2

2000-07-26 Thread Daniel Phillips

On Wed, 26 Jul 2000, Alexander Viro wrote:
> On Wed, 26 Jul 2000, Daniel Phillips wrote:
>
> > Stephen asked me some sharp questions about how this would work,
> > and after I answered them to his satisfaction he asked me if I
> > would have time to implement this feature.  I said yes, and went
> > on to write an initial design document describing the required
> > modifications to Ext's handling of inodes, and a prototype
> > algorithm for doing the tail merging.
>
> Here is one more for you:
> Suppose we grow the last fragment/tail/whatever. Do you copy the
> data out of that shared block? If so, how do you update buffer_heads in
> pages that cover the relocated data? (Same goes for reiserfs, if they are
> doing something similar). BTW, our implementation of UFS is fucked up in
> that respect, so variant from there will not work.

Please bear in mind that I don't pretend to be an expert on the VFS, and
especially its latest incarnation in 2.4.0.  I'm coming to grips with it now. 
Notwithstanding that, I'll try to provide some insight anyway.

> Suppose we grow the last fragment/tail/whatever. Do you copy the
> data out of that shared block?

Yes, except possibly in the case where the fragment grows by an amount that
will still fit in the shared block.  Even in that case, you might want to
ignore the possible optimization and copy it out mindlessly, on the assumption
that another write is coming soon.  My plan is to do the incremental merging at
file close time.
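
The choice described above can be written down as a small userspace sketch.
This is not ext2 code: tail_on_grow(), the enum, and the idea that the free
space in the shared block is available as a byte count are all assumptions
made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcome of growing a file whose tail sits in a shared block. */
enum tail_action {
    TAIL_GROW_IN_PLACE,  /* the longer tail still fits in the shared block */
    TAIL_COPY_OUT        /* move the tail into a block of its own */
};

/*
 * free_bytes:     unused space left in the shared block (assumed to be
 *                 tracked somewhere; the thread does not say where).
 * eager_copy_out: nonzero to skip the in-place optimization and always
 *                 copy out, on the assumption that another write is coming.
 */
static enum tail_action tail_on_grow(size_t old_len, size_t new_len,
                                     size_t free_bytes, int eager_copy_out)
{
    assert(new_len >= old_len);  /* this sketch only models growth */
    if (!eager_copy_out && new_len - old_len <= free_bytes)
        return TAIL_GROW_IN_PLACE;
    return TAIL_COPY_OUT;
}
```

Re-merging is deliberately absent here; in the plan above it would happen
later, at file close time.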

> If so, how do you update buffer_heads in
> pages that cover the relocated data?

We have to be sure that if blocks are buffered then they are buffered in
exactly one place and you always access them through the buffer hash
table.  So far so good, but the picture gets murkier for me when you talk
about the page cache.  I'm not clear yet on the details of how the buffer cache
interacts with the page cache, and perhaps you can help shed some light on
that.  Until I am clear on it, I'll hold off commenting.

> (Same goes for reiserfs, if they are doing something similar).

I don't know exactly what ReiserFS does - I just heard Hans mention the term
'tail merging' and I could see that it was a good idea.

> BTW, our implementation of UFS is fucked up in that respect, so variant
> from there will not work.

I'm not sure what you mean there...

-- 
Daniel



Re: Tailmerging for Ext2

2000-07-26 Thread Alexander Viro



On Wed, 26 Jul 2000, Daniel Phillips wrote:

> > If so, how do you update buffer_heads in
> > pages that cover the relocated data?
>
> We have to be sure that if blocks are buffered then they are buffered in
> exactly one place and you always access them through the buffer hash
> table.  So far so good, but the picture gets murkier for me when you talk

Not. Data normally is in page. Buffer_heads are not included into buffer
cache. They are referred from the struct page and their ->b_data just
points to appropriate pieces of page. You can not get them via bread().
At all. Buffer cache is only for metadata.
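
A toy userspace model of that layout, for readers who (like Daniel) are new
to it. The struct and function names are made up for illustration and are not
the kernel's definitions; the only point is that each b_data aliases a slice
of the page rather than holding a second copy of the data.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096
#define MODEL_MAX_BH    8          /* 4096 / 512, the smallest block size here */

struct bh_model {
    char  *b_data;                 /* points into the page, not a separate copy */
    size_t b_size;
};

struct page_model {
    char            data[MODEL_PAGE_SIZE];
    struct bh_model bh[MODEL_MAX_BH];
    int             nr_bh;
};

/* Carve the page into blocksize-sized pieces, one model buffer head each. */
static int attach_buffers(struct page_model *pg, size_t blocksize)
{
    assert(blocksize >= 512 && MODEL_PAGE_SIZE % blocksize == 0);
    pg->nr_bh = (int)(MODEL_PAGE_SIZE / blocksize);
    for (int i = 0; i < pg->nr_bh; i++) {
        pg->bh[i].b_data = pg->data + (size_t)i * blocksize;
        pg->bh[i].b_size = blocksize;
    }
    return pg->nr_bh;
}
```

With 1k blocks this yields four buffer heads per page, and a store through
bh[1].b_data lands at byte 1024 of the page itself.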

> > BTW, our implementation of UFS is fucked up in that respect, so variant
> > from there will not work.
>
> I'm not sure what you mean there...

I mean that UFS has the same problem (relocation of the last fragment) and
our implementation is fucked up (== does not deal with that properly and
eats data). So if you will look for existing solutions - forget about the
UFS one; it isn't. UFS will need fixing, but that's a separate story...




Re: Tailmerging for Ext2

2000-07-26 Thread Stephen C. Tweedie

Hi,

On Wed, Jul 26, 2000 at 02:56:01PM -0400, Alexander Viro wrote:
 
> Not. Data normally is in page. Buffer_heads are not included into buffer
> cache. They are referred from the struct page and their ->b_data just
> points to appropriate pieces of page. You can not get them via bread().
> At all. Buffer cache is only for metadata.

Only in the default usage.  There's no reason at all why we can't use
separate buffer and page cache aliases of the same data for tails as a
special case.

Cheers,
 Stephen



Re: Tailmerging for Ext2

2000-07-26 Thread Stephen C. Tweedie

Hi,

On Wed, Jul 26, 2000 at 02:41:44PM -0400, Alexander Viro wrote:

> > For tail writes, I'd imagine we would just end up using the page cache
> > as a virtual cache as NFS uses it, and doing plain copy into the
> > buffer cache pages.
>
> Ouch. I _really_ don't like it - we end up with special behaviour on one
> page in the pagecache.

Correct.  But it's all inside the filesystem, so there is zero VFS
impact.  And we're talking about non-block-aligned data for tails, so
we simply don't have a choice in this case.

> And getting data migration from buffer cache to
> page cache, which is Not Nice(tm).

Not preferred for bulk data, perhaps, but the VFS should cope just
fine.

> Yuck... Besides, when do we decide that
> tail is going to be, erm, merged? What will happen with the page then?

To the page?  Nothing.  To the buffer?  It gets updated with the new
contents of disk.  Page == virtual contents.  Buffer == physical
contents.  Plain and simple.

Cheers,
 Stephen
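
A rough userspace sketch of the "plain copy" Stephen describes: the page
holds the file's virtual contents, and the tail bytes are copied into
whatever slot the tail occupies in the shared on-disk block. The function
name, the 1k block size, and the slot_off parameter are assumptions made for
illustration; the thread does not specify how slots in a shared block are
laid out.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_BLOCK_SIZE 1024   /* assumed block size for this sketch */

/*
 * Copy tail_len bytes starting at page + tail_off (virtual contents) into
 * the shared block buffer at slot_off (physical contents).
 * Returns 0 on success, -1 if the tail would overrun the block.
 */
static int tail_flush_to_block(const char *page, size_t tail_off,
                               size_t tail_len, char *block, size_t slot_off)
{
    if (slot_off > MODEL_BLOCK_SIZE || tail_len > MODEL_BLOCK_SIZE - slot_off)
        return -1;
    memcpy(block + slot_off, page + tail_off, tail_len);
    return 0;
}
```

Other tails sharing the block are untouched, which is why the page cannot
simply be mapped onto the block in the usual way.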




Re: Tailmerging for Ext2

2000-07-26 Thread Alexander Viro



On Wed, 26 Jul 2000, Stephen C. Tweedie wrote:

> Hi,
>
> On Wed, Jul 26, 2000 at 02:56:01PM -0400, Alexander Viro wrote:
> >
> > Not. Data normally is in page. Buffer_heads are not included into buffer
> > cache. They are referred from the struct page and their ->b_data just
> > points to appropriate pieces of page. You can not get them via bread().
> > At all. Buffer cache is only for metadata.
>
> Only in the default usage.  There's no reason at all why we can't use
> separate buffer and page cache aliases of the same data for tails as a
> special case.

In theory - yes, but doing that will require a _lot_ of accurate thinking
about possible races. IOW, I'm afraid that transitions tail-normal block
will be race-prone. Paint me over-cautious, but after you-know-what... Oh,
well... I'm not saying that it's impossible, but I _really_ recommend to
take a hard look at race scenarios - there is a potential for plenty of
them.




Re: Tailmerging for Ext2

2000-07-26 Thread Alexander Viro



On Wed, 26 Jul 2000, Stephen C. Tweedie wrote:

> Hi,
>
> On Wed, Jul 26, 2000 at 02:41:44PM -0400, Alexander Viro wrote:
>
> > > For tail writes, I'd imagine we would just end up using the page cache
> > > as a virtual cache as NFS uses it, and doing plain copy into the
> > > buffer cache pages.
> >
> > Ouch. I _really_ don't like it - we end up with special behaviour on one
> > page in the pagecache.
>
> Correct.  But it's all inside the filesystem, so there is zero VFS
> impact.  And we're talking about non-block-aligned data for tails, so
> we simply don't have a choice in this case.

<shrug> Sure, it's not a VFS problem (albeit it _will_ require accurate
playing with unmap_() in buffer.c), but ext2 problems are pretty
interesting too...

> > And getting data migration from buffer cache to
> > page cache, which is Not Nice(tm).
>
> Not preferred for bulk data, perhaps, but the VFS should cope just
> fine.
>
> > Yuck... Besides, when do we decide that
> > tail is going to be, erm, merged? What will happen with the page then?
>
> To the page?  Nothing.  To the buffer?  It gets updated with the new
> contents of disk.  Page == virtual contents.  Buffer == physical
> contents.  Plain and simple.

Erm? Consider that: huge lseek() + write past the end of file. Woops - got
to unmerge the tail (it's an internal block now) and we've got no
knowledge of IO going on the page. Again, IO may be asynchronous - no
protection from i_sem for us. After that page becomes a regular one,
right? Looks like a change of state to me...




Re: Tailmerging for Ext2

2000-07-26 Thread Chris Mason



On Wed, 26 Jul 2000, Daniel Phillips wrote:

> I don't know exactly what ReiserFS does - I just heard Hans mention the term
> 'tail merging' and I could see that it was a good idea.
 

I'll give the quick and dirty answer, if people want more details, let me
know.  In 2.2, reiserfs_file_write deals directly with tails.  It appends
to them if there is room in the packed block, or converts them if there
isn't.  If reiserfs_file_write is called with a buffer size > 512 bytes,
it tries to write into full blocks instead of tails.  This limits the
overhead when you cp/untar to create new files.

In both releases, there is locking on the tail to prevent races, and we
don't bother with tails on files > 16k (configurable).

For 2.4, the functions work like this:

reiserfs_get_block converts the tail into its own block and points
the buffer head at the new block.

reiserfs_readpage reads directly from the tail into the page, leaves the
buffer head mapped, and sets b_blocknr to 0.

reiserfs_writepage and reiserfs_prepare_write both check for mapped buffer
heads with a block number of 0 in the page.  If found, they are unmapped.
Then block_write_full_page or block_prepare_write is called.

reiserfs_truncate deals directly with the tail.  If the last block is
packed back into the tail, it is unmapped from the page cache.

reiserfs_file_release will check to see if the tail needs to be repacked,
and use truncate (without changing i_size) to pack the tail.

-chris
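
The "mapped, but block number 0" convention in the 2.4 list above can be
modelled in a few lines of userspace C. The struct and function below are
illustrative stand-ins, not reiserfs or kernel code; they show only the check
that reiserfs_writepage and reiserfs_prepare_write are described as making
before handing the page to the generic block helpers.

```c
#include <stddef.h>

/* Minimal stand-in for the two buffer head fields the convention uses. */
struct bh_model {
    int           mapped;      /* nonzero if the buffer claims a disk mapping */
    unsigned long b_blocknr;   /* 0 marks "data was read out of a tail" */
};

/*
 * Unmap every buffer that is mapped with block number 0, so that a later
 * get_block call has to give it a real block. Returns how many were unmapped.
 */
static int unmap_tail_buffers(struct bh_model *bh, int n)
{
    int unmapped = 0;

    for (int i = 0; i < n; i++) {
        if (bh[i].mapped && bh[i].b_blocknr == 0) {
            bh[i].mapped = 0;
            unmapped++;
        }
    }
    return unmapped;
}
```

Buffers that are unmapped, or mapped to a nonzero block, pass through
unchanged.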






Re: Tailmerging for Ext2

2000-07-26 Thread Alexander Viro



On Wed, 26 Jul 2000, Chris Mason wrote:

> In both releases, there is locking on the tail to prevent races, and we
> don't bother with tails on files > 16k (configurable).

What granularity do you have? (for tail size, that is).




Re: Tailmerging for Ext2

2000-07-26 Thread Chris Mason



On Wed, 26 Jul 2000, Alexander Viro wrote:

 
 
> On Wed, 26 Jul 2000, Chris Mason wrote:
>
> > In both releases, there is locking on the tail to prevent races, and we
> > don't bother with tails on files > 16k (configurable).
>
> What granularity do you have? (for tail size, that is).
 

From 1 byte to almost the blocksize (4k). But there is a macro for
deciding when to use a tail, which varies it based on the file size.  If
the file is > 12k, it won't have a tail bigger than 1k; an 8k file won't have
a tail bigger than 2k.

Of course, this is just a guess about the right balance between space and
performance...

-chris
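
The thresholds quoted above can be turned into a small predicate. This is a
sketch built only from the numbers in this thread, not the actual reiserfs
macro: the function name is invented, and the exact boundary comparisons
(strict versus non-strict) are assumptions.

```c
#define KB               1024UL
#define MODEL_BLOCK_SIZE (4 * KB)   /* blocksize mentioned in the thread */

/*
 * Decide whether a file of file_size bytes keeps its last partial block as
 * a packed tail. The tail is whatever is left past the last full block.
 */
static int file_keeps_tail(unsigned long file_size)
{
    unsigned long tail = file_size % MODEL_BLOCK_SIZE;

    if (tail == 0)
        return 0;                    /* nothing left over to pack */
    if (file_size > 16 * KB)
        return 0;                    /* large files: no tails at all */
    if (file_size > 12 * KB)
        return tail <= 1 * KB;       /* tail capped at 1k */
    if (file_size > 8 * KB)
        return tail <= 2 * KB;       /* tail capped at 2k */
    return 1;                        /* small files: any tail size */
}
```

For example, a 10k file (8k plus a 2k tail) keeps its tail under these
assumptions, while a file one byte longer does not.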