Re: Tailmerging for Ext2
On Wed, 26 Jul 2000, Daniel Phillips wrote: Stephen asked me some sharp questions about how this would work, and after I answered them to his satisfaction he asked me if I would have time to implement this feature. I said yes, and went on to write an initial design document describing the required modifications to Ext's handling of inodes, and a prototype algorithm for doing the tail merging. Here is one more for you: Suppose we grow the last fragment/tail/whatever. Do you copy the data out of that shared block? If so, how do you update buffer_heads in pages that cover the relocated data? (Same goes for reiserfs, if they are doing something similar). BTW, our implementation of UFS is fucked up in that respect, so variant from there will not work.
Re: Tailmerging for Ext2
Hi, On Wed, Jul 26, 2000 at 02:05:11PM -0400, Alexander Viro wrote: Here is one more for you: Suppose we grow the last fragment/tail/whatever. Do you copy the data out of that shared block? If so, how do you update buffer_heads in pages that cover the relocated data? (Same goes for reiserfs, if they are doing something similar). BTW, our implementation of UFS is fucked up in that respect, so variant from there will not work. For tail writes, I'd imagine we would just end up using the page cache as a virtual cache as NFS uses it, and doing plain copy into the buffer cache pages. Cheers, Stephen
Re: Tailmerging for Ext2
On Wed, 26 Jul 2000, Stephen C. Tweedie wrote: Hi, On Wed, Jul 26, 2000 at 02:05:11PM -0400, Alexander Viro wrote: Here is one more for you: Suppose we grow the last fragment/tail/whatever. Do you copy the data out of that shared block? If so, how do you update buffer_heads in pages that cover the relocated data? (Same goes for reiserfs, if they are doing something similar). BTW, our implementation of UFS is fucked up in that respect, so variant from there will not work. For tail writes, I'd imagine we would just end up using the page cache as a virtual cache as NFS uses it, and doing plain copy into the buffer cache pages. Ouch. I _really_ don't like it - we end up with special behaviour on one page in the pagecache. And getting data migration from buffer cache to page cache, which is Not Nice(tm). Yuck... Besides, when do we decide that tail is going to be, erm, merged? What will happen with the page then?
Re: Tailmerging for Ext2
On Wed, 26 Jul 2000, Alexander Viro wrote: On Wed, 26 Jul 2000, Daniel Phillips wrote: Stephen asked me some sharp questions about how this would work, and after I answered them to his satisfaction he asked me if I would have time to implement this feature. I said yes, and went on to write an initial design document describing the required modifications to Ext's handling of inodes, and a prototype algorithm for doing the tail merging. Here is one more for you: Suppose we grow the last fragment/tail/whatever. Do you copy the data out of that shared block? If so, how do you update buffer_heads in pages that cover the relocated data? (Same goes for reiserfs, if they are doing something similar). BTW, our implementation of UFS is fucked up in that respect, so variant from there will not work. Please bear in mind that I don't pretend to be an expert on the VFS, and especially its latest incarnation in 2.4.0. I'm coming to grips with it now. Notwithstanding that, I'll try to provide some insight anyway. Suppose we grow the last fragment/tail/whatever. Do you copy the data out of that shared block? Yes, except possibly in the case where the fragment grows by an amount that will still fit in the shared block. Even in that case, you might want to ignore the possible optimization and copy it out mindlessly, on the assumption that another write is coming soon. My plan is to do the incremental merging at file close time. If so, how do you update buffer_heads in pages that cover the relocated data? We have to be sure that if blocks are buffered then they are buffered in exactly one place and you always access them through the buffer hash table. So far so good, but the picture gets murkier for me when you talk about the page cache. I'm not clear yet on the details of how the buffer cache interacts with the page cache, and perhaps you can help shed some light on that. Until I am clear on it, I'll hold off commenting.
(Same goes for reiserfs, if they are doing something similar). I don't know exactly what ReiserFS does - I just heard Hans mention the term 'tail merging' and I could see that it was a good idea. BTW, our implementation of UFS is fucked up in that respect, so variant from there will not work. I'm not sure what you mean there... -- Daniel
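Daniel's rule for a growing tail can be written down as a small decision function. This is only a sketch of the policy as described in the message above, not ext2 code: the function name, the `shared_free` parameter (bytes still free in the shared block) and the `eager` flag (the "copy it out mindlessly" option) are all hypothetical.

```python
def must_copy_out(grow_by, shared_free, eager=False):
    """Decide whether a growing tail leaves its shared block.

    grow_by     -- bytes the tail grows by (hypothetical parameter)
    shared_free -- bytes still unused in the shared block (hypothetical)
    eager       -- skip the in-place optimization and always copy out,
                   on the assumption that another write is coming soon
    """
    if eager:
        return True
    # The tail may stay put only if the growth still fits in the shared block.
    return grow_by > shared_free
```

Under this model the re-merge itself would happen later, at file close time, as the message proposes.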
Re: Tailmerging for Ext2
On Wed, 26 Jul 2000, Daniel Phillips wrote: If so, how do you update buffer_heads in pages that cover the relocated data? We have to be sure that if blocks are buffered then they are buffered in exactly one place and you always access them through the buffer hash table. So far so good, but the picture gets murkier for me when you talk Not. Data normally is in page. Buffer_heads are not included into buffer cache. They are referred from the struct page and their ->b_data just points to appropriate pieces of page. You can not get them via bread(). At all. Buffer cache is only for metadata. BTW, our implementation of UFS is fucked up in that respect, so variant from there will not work. I'm not sure what you mean there... I mean that UFS has the same problem (relocation of the last fragment) and our implementation is fucked up (== does not deal with that properly and eats data). So if you will look for existing solutions - forget about the UFS one; it isn't. UFS will need fixing, but that's a separate story...
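The point that a data buffer_head owns no storage, and that its b_data merely points into the page, can be modelled in a few lines. This is a toy Python model, not kernel code: the `Page` and `BufferHead` classes, `attach_buffers`, and the 4096/1024 sizes are illustrative assumptions. Only the name `b_data` comes from the message.

```python
PAGE_SIZE = 4096    # assumed page size
BLOCK_SIZE = 1024   # assumed filesystem block size

class Page:
    """Stand-in for a page cache page: it owns the actual bytes."""
    def __init__(self):
        self.data = bytearray(PAGE_SIZE)
        self.buffers = []

class BufferHead:
    """Stand-in for a buffer_head: b_data is a window into the page."""
    def __init__(self, page, index):
        start = index * BLOCK_SIZE
        # memoryview slicing aliases the page's bytes instead of copying them
        self.b_data = memoryview(page.data)[start:start + BLOCK_SIZE]

def attach_buffers(page):
    """Give the page one buffer head per block-sized piece."""
    page.buffers = [BufferHead(page, i) for i in range(PAGE_SIZE // BLOCK_SIZE)]
    return page.buffers
```

Writing through `b_data` changes the page and vice versa, which is why relocating a tail's data means fixing up the buffer heads hanging off the page rather than looking anything up in a separate cache.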
Re: Tailmerging for Ext2
Hi, On Wed, Jul 26, 2000 at 02:56:01PM -0400, Alexander Viro wrote: Not. Data normally is in page. Buffer_heads are not included into buffer cache. They are referred from the struct page and their ->b_data just points to appropriate pieces of page. You can not get them via bread(). At all. Buffer cache is only for metadata. Only in the default usage. There's no reason at all why we can't use separate buffer and page cache aliases of the same data for tails as a special case. Cheers, Stephen
Re: Tailmerging for Ext2
Hi, On Wed, Jul 26, 2000 at 02:41:44PM -0400, Alexander Viro wrote: For tail writes, I'd imagine we would just end up using the page cache as a virtual cache as NFS uses it, and doing plain copy into the buffer cache pages. Ouch. I _really_ don't like it - we end up with special behaviour on one page in the pagecache. Correct. But it's all inside the filesystem, so there is zero VFS impact. And we're talking about non-block-aligned data for tails, so we simply don't have a choice in this case. And getting data migration from buffer cache to page cache, which is Not Nice(tm). Not preferred for bulk data, perhaps, but the VFS should cope just fine. Yuck... Besides, when do we decide that tail is going to be, erm, merged? What will happen with the page then? To the page? Nothing. To the buffer? It gets updated with the new contents of disk. Page == virtual contents. Buffer == physical contents. Plain and simple. Cheers, Stephen
Re: Tailmerging for Ext2
On Wed, 26 Jul 2000, Stephen C. Tweedie wrote: Hi, On Wed, Jul 26, 2000 at 02:56:01PM -0400, Alexander Viro wrote: Not. Data normally is in page. Buffer_heads are not included into buffer cache. They are referred from the struct page and their ->b_data just points to appropriate pieces of page. You can not get them via bread(). At all. Buffer cache is only for metadata. Only in the default usage. There's no reason at all why we can't use separate buffer and page cache aliases of the same data for tails as a special case. In theory - yes, but doing that will require a _lot_ of accurate thinking about possible races. IOW, I'm afraid that transitions tail-normal block will be race-prone. Paint me over-cautious, but after you-know-what... Oh, well... I'm not saying that it's impossible, but I _really_ recommend to take a hard look at race scenarios - there is a potential for plenty of them.
Re: Tailmerging for Ext2
On Wed, 26 Jul 2000, Stephen C. Tweedie wrote: Hi, On Wed, Jul 26, 2000 at 02:41:44PM -0400, Alexander Viro wrote: For tail writes, I'd imagine we would just end up using the page cache as a virtual cache as NFS uses it, and doing plain copy into the buffer cache pages. Ouch. I _really_ don't like it - we end up with special behaviour on one page in the pagecache. Correct. But it's all inside the filesystem, so there is zero VFS impact. And we're talking about non-block-aligned data for tails, so we simply don't have a choice in this case. shrug Sure, it's not a VFS problem (albeit it _will_ require accurate playing with unmap_() in buffer.c), but ext2 problems are pretty interesting too... And getting data migration from buffer cache to page cache, which is Not Nice(tm). Not preferred for bulk data, perhaps, but the VFS should cope just fine. Yuck... Besides, when do we decide that tail is going to be, erm, merged? What will happen with the page then? To the page? Nothing. To the buffer? It gets updated with the new contents of disk. Page == virtual contents. Buffer == physical contents. Plain and simple. Erm? Consider that: huge lseek() + write past the end of file. Woops - got to unmerge the tail (it's an internal block now) and we've got no knowledge of IO going on the page. Again, IO may be asynchronous - no protection from i_sem for us. After that page becomes a regular one, right? Looks like a change of state to me...
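The lseek() scenario comes down to simple block arithmetic: once a write extends the file into a later block, the block that held the tail is no longer the last one. A minimal sketch of that check follows. The function name, its parameters and the 1024-byte default block size are assumptions for illustration, and the sketch deliberately ignores the locking and asynchronous-IO questions raised in the message.

```python
def tail_must_be_unmerged(i_size, write_pos, write_len, block_size=1024):
    """Return True if a write turns the current tail into an internal block.

    i_size is the file size before the write. A file whose size is a
    multiple of block_size has no partial last block, hence no tail.
    """
    if i_size % block_size == 0 or write_len == 0:
        return False
    new_size = max(i_size, write_pos + write_len)
    old_last_block = (i_size - 1) // block_size
    new_last_block = (new_size - 1) // block_size
    # The old tail block becomes internal once the file ends in a later block.
    return new_last_block > old_last_block
```

A write that only grows the tail within its own block returns False here. Whether the grown tail still fits in the shared block is a separate question.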
Re: Tailmerging for Ext2
On Wed, 26 Jul 2000, Daniel Phillips wrote: I don't know exactly what ReiserFS does - I just heard Hans mention the term 'tail merging' and I could see that it was a good idea. I'll give the quick and dirty answer, if people want more details, let me know. In 2.2, reiserfs_file_write deals directly with tails. It appends to them if there is room in the packed block, or converts them if there isn't. If reiserfs_file_write is called with a buffer size > 512 bytes, it tries to write into full blocks instead of tails. This limits the overhead when you cp/untar to create new files. In both releases, there is locking on the tail to prevent races, and we don't bother with tails on files > 16k (configurable). For 2.4, the functions work like this: reiserfs_get_block converts the tail into its own block and points the buffer head at the new block. reiserfs_readpage reads directly from the tail into the page, leaves the buffer head mapped, and sets b_blocknr to 0. reiserfs_writepage and reiserfs_prepare_write both check for mapped buffer heads with a block number of 0 in the page. If found, they are unmapped. Then block_write_full_page or block_prepare_write is called. reiserfs_truncate deals directly with the tail. If the last block is packed back into the tail, it is unmapped from the page cache. reiserfs_file_release will check to see if the tail needs to be repacked, and use truncate (without changing i_size) to pack the tail. -chris
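The 2.4 scheme described above uses "mapped, with block number 0" as a marker for a buffer whose contents came from a packed tail. The following is a toy state model of that convention, not ReiserFS code: the `BufferHead` class and the three helper functions are hypothetical names. Only `b_blocknr` and the behaviour being mimicked come from the message.

```python
class BufferHead:
    """Toy buffer head: only the two fields the description relies on."""
    def __init__(self):
        self.mapped = False
        self.b_blocknr = None

def mark_read_from_tail(bh):
    """Mimic the readpage step: leave the buffer mapped, block number 0."""
    bh.mapped = True
    bh.b_blocknr = 0

def unmap_tail_buffers(buffers):
    """Mimic the writepage/prepare_write pre-check.

    Any mapped buffer with block number 0 is unmapped, so the generic
    write path that follows has to ask for a real block. Returns how
    many buffers were unmapped.
    """
    unmapped = 0
    for bh in buffers:
        if bh.mapped and bh.b_blocknr == 0:
            bh.mapped = False
            bh.b_blocknr = None
            unmapped += 1
    return unmapped

def convert_tail(bh, new_blocknr):
    """Mimic the get_block step: point the buffer at the tail's own block."""
    bh.mapped = True
    bh.b_blocknr = new_blocknr
```

In this model a write to a page read from a tail goes through `unmap_tail_buffers` first and `convert_tail` afterwards, which mirrors the order given in the message.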
Re: Tailmerging for Ext2
On Wed, 26 Jul 2000, Chris Mason wrote: In both releases, there is locking on the tail to prevent races, and we don't bother with tails on files > 16k (configurable). What granularity do you have? (for tail size, that is).
Re: Tailmerging for Ext2
On Wed, 26 Jul 2000, Alexander Viro wrote: On Wed, 26 Jul 2000, Chris Mason wrote: In both releases, there is locking on the tail to prevent races, and we don't bother with tails on files > 16k (configurable). What granularity do you have? (for tail size, that is). From 1 byte to almost the blocksize (4k). But, there is a macro for deciding when to use a tail, which varies it based on the file size. If the file is > 12k, it won't have a tail bigger than 1k, an 8k file won't have a tail bigger than 2k. Of course, this is just a guess about the right balance between space and performance... -chris
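The size-dependent policy Chris describes can be sketched as a pair of small functions. This is not the ReiserFS macro itself. The function names are hypothetical, and the exact boundaries are assumptions read off the figures in the message: no tail above four blocks (16k), at most a quarter block above three blocks (12k), at most half a block from two blocks (8k), otherwise "almost the blocksize", taken here as one byte less than a block.

```python
def max_tail_size(file_size, block_size=4096):
    """Largest tail allowed for a file of this size (assumed thresholds)."""
    if file_size > 4 * block_size:      # files over 16k: no tail at all
        return 0
    if file_size > 3 * block_size:      # over 12k: tail of at most 1k
        return block_size // 4
    if file_size >= 2 * block_size:     # 8k and up: tail of at most 2k
        return block_size // 2
    return block_size - 1               # small files: almost a whole block

def should_pack_tail(file_size, block_size=4096):
    """True if the partial last block would be stored as a packed tail."""
    tail = file_size % block_size
    if tail == 0:
        return False                    # file ends on a block boundary
    return tail <= max_tail_size(file_size, block_size)
```

The shrinking limit reflects the trade-off named in the message: the larger the file, the less space a packed tail saves relative to the extra work of handling it.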