Re: [patch 06/44] mm: trim more holes

2007-04-24 Thread Neil Brown
On Tuesday April 24, [EMAIL PROTECTED] wrote:
 
 If prepare_write fails with AOP_TRUNCATED_PAGE, or if commit_write fails, then
 we may have failed the write operation despite prepare_write having
 instantiated blocks past i_size. Fix this, and consolidate the trimming into
 one place.
 
..
 @@ -2025,40 +2012,53 @@ generic_file_buffered_write(struct kiocb
   cur_iov, iov_offset, bytes);
   flush_dcache_page(page);
   status = a_ops-commit_write(file, page, offset, offset+bytes);
 - if (status == AOP_TRUNCATED_PAGE) {
 - page_cache_release(page);
 - continue;
 + if (unlikely(status  0))
 + goto fs_write_aop_error;
 + if (unlikely(copied != bytes)) {
 + status = -EFAULT;
 + goto fs_write_aop_error;
   }

It isn't clear to me that you are handling the case
   status == AOP_TRUNCATED_PAGE
here.  AOP_TRUNCATED_PAGE is  0 (0x80001 to be precise)

Maybe -commit_write cannot return AOP_TRUNCATED_PAGE.  If that is
true, then a comment to that effect (i.e. that the old code was wrong)
in the change log might easy review. 

Or did I miss something?

Thanks,
NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 06/44] mm: trim more holes

2007-04-24 Thread Nick Piggin
On Tue, Apr 24, 2007 at 04:07:35PM +1000, Neil Brown wrote:
 On Tuesday April 24, [EMAIL PROTECTED] wrote:
  
  If prepare_write fails with AOP_TRUNCATED_PAGE, or if commit_write fails, 
  then
  we may have failed the write operation despite prepare_write having
  instantiated blocks past i_size. Fix this, and consolidate the trimming into
  one place.
  
 ..
  @@ -2025,40 +2012,53 @@ generic_file_buffered_write(struct kiocb
  cur_iov, iov_offset, bytes);
  flush_dcache_page(page);
  status = a_ops-commit_write(file, page, offset, offset+bytes);
  -   if (status == AOP_TRUNCATED_PAGE) {
  -   page_cache_release(page);
  -   continue;
  +   if (unlikely(status  0))
  +   goto fs_write_aop_error;
  +   if (unlikely(copied != bytes)) {
  +   status = -EFAULT;
  +   goto fs_write_aop_error;
  }
 
 It isn't clear to me that you are handling the case
status == AOP_TRUNCATED_PAGE
 here.  AOP_TRUNCATED_PAGE is  0 (0x80001 to be precise)

Yes, you are right there.


 Maybe -commit_write cannot return AOP_TRUNCATED_PAGE.  If that is
 true, then a comment to that effect (i.e. that the old code was wrong)
 in the change log might easy review. 
 
 Or did I miss something?

Actually, it seems that the old ocfs2 code (in mainline, not -mm) can
return AOP_TRUNCATED_PAGE from commit_write.

So that line should be changed to
+   if (unlikely(status  0 || status == AOP_TRUNCATED_PAGE)) 

Although we get rid of it in a subsequent patch anyway.

Thanks,
Nick

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 12/44] fs: introduce write_begin, write_end, and perform_write aops

2007-04-24 Thread Neil Brown
On Tuesday April 24, [EMAIL PROTECTED] wrote:
 +  write_begin: This is intended as a replacement for prepare_write. Called
 +by the generic buffered write code to ask the filesystem to prepare
 +to write len bytes at the given offset in the file. flags is a field
 +for AOP_FLAG_xxx flags, described in include/linux/mm.h.

Putting This is intended as a replacement.. there sees a bit
dangerous.  It could well accidentally remain when the documentation
for prepare_write gets removed.  I would make it a separate paragraph
and flesh it out.  And include text from prepare_write before that
gets removed.

   write_begin:
 This is intended as a replacement for prepare_write.  The key 
 differences being that:
- it returns a locked page (in *pagep) rather than
  being given a pre-locked page:
- it can pass arbitrary state to write_end rather than
  having to hide stuff in some filesystem-internal
  data structure 
- The (largely undocumented) flags option.

 Called by  the generic bufferred write code to ask an
 address_space to prepare to write len bytes at the given
 offset in the file.

 The address_space should check that the write will be able to
 complete, by allocating space if necessary and doing any other
 internal housekeeping.  If the write will update parts of any
 basic-blocks on storage, then those blocks should be pre-read
 (if they haven't been read already) so that the updated blocks
 can be written out properly.
 The possible flags are listed in include/linux/fs.h (not
 mm.h) and include
AOP_FLAG_UNINTERRUPTIBLE:
It is unclear how this should be used.  No
current code handles it.

(together with the rest...)
 +
 +The filesystem must return the locked pagecache page for the caller
 +to write into.
 +
 +A void * may be returned in fsdata, which then gets passed into
 +write_end.
 +
 +Returns  0 on failure, in which case all cleanup must be done and
 +write_end not called. 0 on success, in which case write_end must
 +be called.


As you are not including perform_write in the current patchset, maybe
it is best not to include the documentation yet either?

 +  perform_write: This is a single-call, bulk version of write_begin/write_end
 +operations. It is only used in the buffered write path (write_begin
 +must still be implemented), and not for in-kernel writes to 
 pagecache.
 +It takes an iov_iter structure, which provides a descriptor for the
 +source data (and has associated iov_iter_xxx helpers to operate on
 +that data). There are also file, mapping, and pos arguments, which
 +specify the destination of the data.
 +
 +Returns  0 on failure if nothing was written out, otherwise returns
 +the number of bytes copied into pagecache.
 +
 +fs/libfs.c provides a reasonable template to start with, 
 demonstrating
 +iov_iter routines, and iteration over the destination pagecache.
 +

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 12/44] fs: introduce write_begin, write_end, and perform_write aops

2007-04-24 Thread Nick Piggin
On Tue, Apr 24, 2007 at 04:59:47PM +1000, Neil Brown wrote:
 On Tuesday April 24, [EMAIL PROTECTED] wrote:
  +  write_begin: This is intended as a replacement for prepare_write. Called
  +by the generic buffered write code to ask the filesystem to prepare
  +to write len bytes at the given offset in the file. flags is a 
  field
  +for AOP_FLAG_xxx flags, described in include/linux/mm.h.
 
 Putting This is intended as a replacement.. there sees a bit
 dangerous.  It could well accidentally remain when the documentation
 for prepare_write gets removed.  I would make it a separate paragraph
 and flesh it out.  And include text from prepare_write before that
 gets removed.
 
write_begin:
  This is intended as a replacement for prepare_write.  The key 
  differences being that:
   - it returns a locked page (in *pagep) rather than
   being given a pre-locked page:
   - it can pass arbitrary state to write_end rather than
 having to hide stuff in some filesystem-internal
 data structure 
   - The (largely undocumented) flags option.
 
  Called by  the generic bufferred write code to ask an
  address_space to prepare to write len bytes at the given
  offset in the file.
 
The address_space should check that the write will be able to
complete, by allocating space if necessary and doing any other
internal housekeeping.  If the write will update parts of any
basic-blocks on storage, then those blocks should be pre-read
(if they haven't been read already) so that the updated blocks
can be written out properly.
The possible flags are listed in include/linux/fs.h (not
mm.h) and include
   AOP_FLAG_UNINTERRUPTIBLE:
   It is unclear how this should be used.  No
   current code handles it.

Yeah, reasonable points. I'll do an incremental patch to clean up
some of the documentation.

BTW. AOP_FLAG_UNINTERRUPTIBLE can be used by filesystems to avoid
an initial read or other sequence they might be using to handle the
case of a short write. ecryptfs uses it, others can too.

For buffered writes, this doesn't get passed in (unless they are
coming from kernel space), so I was debating whether to have it at
all.  However, in the previous API, _nobody_ had to worry about
short writes, so this flag means I avoid making an API that is
fundamentally less performant in some situations.

 
 (together with the rest...)
  +
  +The filesystem must return the locked pagecache page for the caller
  +to write into.
  +
  +A void * may be returned in fsdata, which then gets passed into
  +write_end.
  +
  +Returns  0 on failure, in which case all cleanup must be done and
  +write_end not called. 0 on success, in which case write_end must
  +be called.
 
 
 As you are not including perform_write in the current patchset, maybe
 it is best not to include the documentation yet either?

Right, missed that, thanks!

Nick
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 12/44] fs: introduce write_begin, write_end, and perform_write aops

2007-04-24 Thread Neil Brown
On Tuesday April 24, [EMAIL PROTECTED] wrote:
 
 BTW. AOP_FLAG_UNINTERRUPTIBLE can be used by filesystems to avoid
 an initial read or other sequence they might be using to handle the
 case of a short write. ecryptfs uses it, others can too.
 
 For buffered writes, this doesn't get passed in (unless they are
 coming from kernel space), so I was debating whether to have it at
 all.  However, in the previous API, _nobody_ had to worry about
 short writes, so this flag means I avoid making an API that is
 fundamentally less performant in some situations.

Ahhh I think I get it now.

  In general, the address_space must cope with the possibility that
  fewer than the expected number of bytes is copied.  This may leave
  parts of the page with invalid data.  This can be handled by
  pre-loading the page with valid data, however this may cause a
  significant performance cost.
  The write_begin/write_end interface provide two mechanism by which
  this case can be handled more efficiently.
  1/ The AOP_FLAG_UNINTERRUPTIBLE flag declares that the write will
not be partial (maybe a different name? AOP_FLAG_NO_PARTIAL).
If that is set, inefficient preparation can be avoided.  However the
most common write paths will never set this flag.
  2/ The return from write_end can declare that fewer bytes have been
accepted. e.g. part of the page may have been loaded from backing
store, overwriting some of the newly written bytes.  If this
return value is reduced, a new write_begin/write_end cycle
may be called to attempt to write the bytes again.

Also
+  write_end: After a successful write_begin, and data copy, write_end must
+be called. len is the original len passed to write_begin, and copied
+is the amount that was able to be copied (they must be equal if
+write_begin was called with intr == 0).
+

That should be ... called without AOP_FLAG_UNINTERRUPTIBLE being
set.
And that was able to be copied is misleading, as the copy is not done in
write_end.  Maybe that was accepted.

It seems to make sense now.  I might try re-reviewing the patches based
on this improved understanding only a public holiday looms :-)

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 12/44] fs: introduce write_begin, write_end, and perform_write aops

2007-04-24 Thread Nick Piggin
On Tue, Apr 24, 2007 at 05:49:48PM +1000, Neil Brown wrote:
 On Tuesday April 24, [EMAIL PROTECTED] wrote:
  
  BTW. AOP_FLAG_UNINTERRUPTIBLE can be used by filesystems to avoid
  an initial read or other sequence they might be using to handle the
  case of a short write. ecryptfs uses it, others can too.
  
  For buffered writes, this doesn't get passed in (unless they are
  coming from kernel space), so I was debating whether to have it at
  all.  However, in the previous API, _nobody_ had to worry about
  short writes, so this flag means I avoid making an API that is
  fundamentally less performant in some situations.
 
 Ahhh I think I get it now.
 
   In general, the address_space must cope with the possibility that
   fewer than the expected number of bytes is copied.  This may leave
   parts of the page with invalid data.  This can be handled by
   pre-loading the page with valid data, however this may cause a
   significant performance cost.

Right. Bringing the page uptodate at write_begin-time is probably
the simplest way to handle it. However, more sophisticated schemes
are possible. For example, the generic block routines can recover
at write_end-time, and probably can't make use of the flag to do
things much better...


   The write_begin/write_end interface provide two mechanism by which
   this case can be handled more efficiently.
   1/ The AOP_FLAG_UNINTERRUPTIBLE flag declares that the write will
 not be partial (maybe a different name? AOP_FLAG_NO_PARTIAL).
 If that is set, inefficient preparation can be avoided.  However the
 most common write paths will never set this flag.

Yes, loop, nfsd, and filesystem-specific pagecache modification
(eg. ext2 directories) are probably the main things that use it.


   2/ The return from write_end can declare that fewer bytes have been
 accepted. e.g. part of the page may have been loaded from backing
 store, overwriting some of the newly written bytes.  If this
 return value is reduced, a new write_begin/write_end cycle
 may be called to attempt to write the bytes again.

Yeah, although you'd have to be careful not to overwrite things if
the page is uptodate (because uptodate *really* means uptodate --
ie.  it is the only thing we have to synchronise buffered reads from
returning the data to userspace).


 
 Also
 +  write_end: After a successful write_begin, and data copy, write_end must
 +be called. len is the original len passed to write_begin, and copied
 +is the amount that was able to be copied (they must be equal if
 +write_begin was called with intr == 0).
 +
 
 That should be ... called without AOP_FLAG_UNINTERRUPTIBLE being
 set.
 And that was able to be copied is misleading, as the copy is not done in
 write_end.  Maybe that was accepted.

Thanks, very good eyes and good suggestions.

Actually I'm a bit worried about this copied vs accepted thing -- we've
already copied some number of bytes into the pagecache by the time write_end
is called. So if the filesystem accepts less and the pagecache page is marked
uptodate, then the pagecache is now out of sync with the filesystem. There
are a few places where it looks like we get this wrong... but that's for a
future patch :P
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG: Dentry still in use during umount in 2.6.21-rc5-git6

2007-04-24 Thread Jan Kara
 One of my autoboot test clients gave me this during shutdown. It used
 reiserfs and autofs and NFS heavily.
  Jeff has a fix for this bug so it should go away soon... Thanks for
report anyway :).

Honza

 Unmounting file systems
 BUG: Dentry 8100f3693a40{i=2352220,n=xattrs} still in use (1) [unmount of 
 reiserfs sda9]
 [ cut here ]
 kernel BUG at 
 /mnt/dm-2/newautoboot/autoboot/lsrc/mainline/linux/fs/dcache.c:623!
 invalid opcode:  [1] SMP 
 CPU 1 
 Modules linked in:
 Pid: 15791, comm: umount Not tainted 2.6.21-rc5-git6 #44
 RIP: 0010:[80287454]  [80287454] 
 shrink_dcache_for_umount_subtree+0x178/0x250
 RSP: 0018:8100f5f67e18  EFLAGS: 00010292
 RAX: 0060 RBX: 8100f3693a40 RCX: 5207
 RDX:  RSI: 0046 RDI: 00014661
 RBP: 8100f6dc9cc0 R08: 00a0 R09: 0005
 R10:  R11:  R12: 8100f3693aa0
 R13: 00014661 R14: 0050ea70 R15: 0050ead0
 FS:  2adc863a86d0() GS:8100f7fdc1c0() knlGS:b7be38d0
 CS:  0010 DS:  ES:  CR0: 8005003b
 CR2: 2adc8626a688 CR3: f628b000 CR4: 06e0
 Process umount (pid: 15791, threadinfo 8100f5f66000, task 
 8100f7a08100)
 Stack:  810004dab218 810004dab000 80558860 810004dab000
   8028815b 810004dab000 8027a1a5
   8100f6c50980 806c1600 8027a2a4
 Call Trace:
  [8028815b] shrink_dcache_for_umount+0x2f/0x3d
  [8027a1a5] generic_shutdown_super+0x19/0xf2
  [8027a2a4] kill_block_super+0x26/0x3b
  [8027a350] deactivate_super+0x47/0x60
  [8028bf94] sys_umount+0x1f7/0x22a
  [8027b64d] sys_newstat+0x19/0x31
  [8020932e] system_call+0x7e/0x83
 
 
 Code: 0f 0b eb fe 48 8b 6b 28 48 39 dd 75 04 31 ed eb 04 f0 ff 4d 
 RIP  [80287454] shrink_dcache_for_umount_subtree+0x178/0x250
  RSP 8100f5f67e18
 /etc/init.d/boot.d/K14boot.localfs: line 93: 15791 Segmentation fault  
 umount -avt noproc,nonfs,nonfs4,nosmbfs,nocifs,notmpfs
 
 
 
 -Andi
 -
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 13/44] mm: restore KERNEL_DS optimisations

2007-04-24 Thread Christoph Hellwig
On Tue, Apr 24, 2007 at 11:23:59AM +1000, Nick Piggin wrote:
 Restore the KERNEL_DS optimisation, especially helpful to the 2copy write
 path.
 
 This may be a pretty questionable gain in most cases, especially after the
 legacy 2copy write path is removed, but it doesn't cost much.

Well, it gets removed later and sets a bad precedence.  Instead of
adding hacks we should have proper methods for kernel-space read/writes.
Especially as the latter are a lot simpler and most of the magic
in this patch series is not needed.  I'll start this work once
your patch series is in.

In general there seems to be a lot of stuff in the earlier patches
that just goes away later and doesn't make much sense in the series.
Is there a good reason not to simply consolidate out those changes
completely?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 16/44] rd convert to new aops

2007-04-24 Thread Christoph Hellwig
 + page = __grab_cache_page(mapping, index);

btw, __grab_cache_page should probably get a more descriptive and
non-__-prefixed name now that it's used all over the place.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 13/44] mm: restore KERNEL_DS optimisations

2007-04-24 Thread Nick Piggin
On Tue, Apr 24, 2007 at 11:43:18AM +0100, Christoph Hellwig wrote:
 On Tue, Apr 24, 2007 at 11:23:59AM +1000, Nick Piggin wrote:
  Restore the KERNEL_DS optimisation, especially helpful to the 2copy write
  path.
  
  This may be a pretty questionable gain in most cases, especially after the
  legacy 2copy write path is removed, but it doesn't cost much.
 
 Well, it gets removed later and sets a bad precedence.  Instead of
 adding hacks we should have proper methods for kernel-space read/writes.
 Especially as the latter are a lot simpler and most of the magic
 in this patch series is not needed.  I'll start this work once
 your patch series is in.

It was removed earlier and put back in here. I agree it isn't so
important, but again it does help that the patchset introduces no
obvious regression. You could remove it in your patchset?


 In general there seems to be a lot of stuff in the earlier patches
 that just goes away later and doesn't make much sense in the series.
 Is there a good reason not to simply consolidate out those changes
 completely?

I guess the first half of the patchset -- the slow deadlock fix for
the old prepare_write path -- came about because that's the only
reasonable way I could find to fix it. I initially thought it would
take a lot longer to convert all filesystems and that we might want
to stay compatible for a while, which is why I wanted to ensure that
was working.

Basically I can't really see which ones you think I should merge and
be able retain a working kernel?

Granted there are a couple of bugfixes and some slightly orthogonal
cleanups in there, but I just thought I'd submit them in the same
series because it was a little easier for me.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 16/44] rd convert to new aops

2007-04-24 Thread Nick Piggin
On Tue, Apr 24, 2007 at 11:46:47AM +0100, Christoph Hellwig wrote:
  +   page = __grab_cache_page(mapping, index);
 
 btw, __grab_cache_page should probably get a more descriptive and
 non-__-prefixed name now that it's used all over the place.

Agreed. Suggestions? ;)

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 16/44] rd convert to new aops

2007-04-24 Thread Christoph Hellwig
On Tue, Apr 24, 2007 at 01:05:49PM +0200, Nick Piggin wrote:
 On Tue, Apr 24, 2007 at 11:46:47AM +0100, Christoph Hellwig wrote:
   + page = __grab_cache_page(mapping, index);
  
  btw, __grab_cache_page should probably get a more descriptive and
  non-__-prefixed name now that it's used all over the place.
 
 Agreed. Suggestions? ;)

find_or_create_cache_page given that's it's like find_or_create_page +
add_to_page_cache?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG: Dentry still in use during umount in 2.6.21-rc5-git6

2007-04-24 Thread Andi Kleen
On Tuesday 24 April 2007 12:40:24 Jan Kara wrote:
  One of my autoboot test clients gave me this during shutdown. It used
  reiserfs and autofs and NFS heavily.
   Jeff has a fix for this bug so it should go away soon... Thanks for
 report anyway :).

Well I hit two more -- see other mails if you're interested.

-Andi

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 16/44] rd convert to new aops

2007-04-24 Thread Nick Piggin
On Tue, Apr 24, 2007 at 12:11:19PM +0100, Christoph Hellwig wrote:
 On Tue, Apr 24, 2007 at 01:05:49PM +0200, Nick Piggin wrote:
  On Tue, Apr 24, 2007 at 11:46:47AM +0100, Christoph Hellwig wrote:
+   page = __grab_cache_page(mapping, index);
   
   btw, __grab_cache_page should probably get a more descriptive and
   non-__-prefixed name now that it's used all over the place.
  
  Agreed. Suggestions? ;)
 
 find_or_create_cache_page given that's it's like find_or_create_page +
 add_to_page_cache?

find_or_create_page adds to page cache as well, though :P

All those random little slightly different allocators scattered over
filemap.c are a bit annoying. Basically I think it would be better
to have a single variant that takes gfp_mask of both the pagecache
page, and the radix-tree insertion. Then serveral things can be
converted to use it.

I was going to try doing that after this patchset. Or do you think it
would be better to get the __grab_cache_page name right in the
first place?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 16/44] rd convert to new aops

2007-04-24 Thread Christoph Hellwig
On Tue, Apr 24, 2007 at 01:16:53PM +0200, Nick Piggin wrote:
  find_or_create_cache_page given that's it's like find_or_create_page +
  add_to_page_cache?
 
 find_or_create_page adds to page cache as well, though :P
 
 All those random little slightly different allocators scattered over
 filemap.c are a bit annoying. Basically I think it would be better
 to have a single variant that takes gfp_mask of both the pagecache
 page, and the radix-tree insertion. Then serveral things can be
 converted to use it.
 
 I was going to try doing that after this patchset. Or do you think it
 would be better to get the __grab_cache_page name right in the
 first place?

If you plan to fix things up afterwards there's not point in doing
the renaming first.  Would be nice to get both into the same release
in the end, though :)
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 16/44] rd convert to new aops

2007-04-24 Thread Nick Piggin
On Tue, Apr 24, 2007 at 12:18:11PM +0100, Christoph Hellwig wrote:
 On Tue, Apr 24, 2007 at 01:16:53PM +0200, Nick Piggin wrote:
  I was going to try doing that after this patchset. Or do you think it
  would be better to get the __grab_cache_page name right in the
  first place?
 
 If you plan to fix things up afterwards there's not point in doing
 the renaming first.  Would be nice to get both into the same release
 in the end, though :)

OK, will do.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 16/44] rd convert to new aops

2007-04-24 Thread Neil Brown
On Tuesday April 24, [EMAIL PROTECTED] wrote:
 On Tue, Apr 24, 2007 at 12:11:19PM +0100, Christoph Hellwig wrote:
  On Tue, Apr 24, 2007 at 01:05:49PM +0200, Nick Piggin wrote:
   On Tue, Apr 24, 2007 at 11:46:47AM +0100, Christoph Hellwig wrote:
 + page = __grab_cache_page(mapping, index);

btw, __grab_cache_page should probably get a more descriptive and
non-__-prefixed name now that it's used all over the place.
   
   Agreed. Suggestions? ;)
  
  find_or_create_cache_page given that's it's like find_or_create_page +
  add_to_page_cache?
 
 find_or_create_page adds to page cache as well, though :P

I would really like to see the word 'lock' in there, as it does
return a locked page, and when I first saw __grab_cache_page recently
there was an unlock_page afterwards and I couldn't see where the lock
happened, and I was confused for a little while.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-24 Thread Nikita Danilov
Amit Gud writes:

Hello,

  
  This is an initial implementation of ChunkFS technique, briefly discussed
  at: http://lwn.net/Articles/190222 and 
  http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf

I have a couple of questions about chunkfs repair process. 

First, as I understand it, each continuation inode is a sparse file,
mapping some subset of logical file blocks into block numbers. Then it
seems, that during final phase fsck has to check that these partial
mappings are consistent, for example, that no two different continuation
inodes for a given file contain a block number for the same offset. This
check requires scan of all chunks (rather than of only active during
crash), which seems to return us back to the scalability problem
chunkfs tries to address.

Second, it is not clear how, under assumption of bugs in the file system
code (which paper makes at the very beginning), fsck can limit itself
only to the chunks that were active at the moment of crash.

[...]

  
  Best,
  AG

Nikita.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Interface for the new fallocate() system call

2007-04-24 Thread Amit K. Arora
On Fri, Apr 20, 2007 at 10:59:18AM -0400, Jakub Jelinek wrote:
 On Fri, Apr 20, 2007 at 07:21:46PM +0530, Amit K. Arora wrote:
  Ok.
  In this case we may have to consider following things:
  
  1) Obviously, for this glibc will have to call fallocate() syscall with
  different arguments on s390, than other archs. I think this should be
  doable and should not be an issue with glibc folks (right?).
 
 glibc can cope with this easily, will just add
 sysdeps/unix/sysv/linux/s390/fallocate.c or something similar to override
 the generic Linux implementation.
 
  2) we also need to see how strace behaves in this case. With little
  knowledge that I have of strace, I don't think it should depend on
  argument ordering of a system call on different archs (since it uses
  ptrace internally and that should take care of it). But, it will be
  nice if someone can confirm this.
 
 strace would solve this with #ifdef mess, it already does that in many
 places so guess another few lines don't make it significantly worse.

I will work on the revised fallocate patchset and will post it soon.

Thanks!
--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: review: don't block non-blocking writes when frozen

2007-04-24 Thread Christoph Hellwig
On Tue, Apr 24, 2007 at 09:13:37AM +1000, David Chinner wrote:
 So, given the catch-22 you've just presented us can we revisit the
 nfsd non-blocking I/O issue again?  This affects anyone using DM
 snapshots on their NFS servers and has nothing to do with HSMs
 or DMAPI...
 
 FWIW, you can still do non-blocking userspace I/O to a file, so this
 XFS patch is still valid for mainline (that's how I tested it).

I've been investigating the situation a little bit more, and here's
what's going on:

 - SuS allows for O_NONBLOCK on regular files as per
   http://www.opengroup.org/onlinepubs/007908799/xsh/write.html
 - actually implementing O_NONBLOCK semantics for regular fixes
   breaks userspace when poll/select claims files are ready to
   read/write but they aren't, see http://lkml.org/lkml/2004/10/17/17

So we can't really expose O_NONBLOCK on regular files to userspace,
and we need to make sure in common code this does not happen.

EJUKEBOX on snaphots does make sense, though.  Can you please send
a full patchseries for nfsd, the common code and the xfs writepath
so that this actually gets used and behaviour is consistant for all
(or at least most) filesystems?

Also now that the patch goes to mainline please kill ugly FILP_DELAY_FLAG
and just check the flags directly.  And it should probably only check
O_NONBLOCK.  The only architecture having O_NDELAY different from
O_NONBLOCK is sparc, and it already translates the value for us.



-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ChunkFS - measuring cross-chunk references

2007-04-24 Thread Theodore Tso
On Mon, Apr 23, 2007 at 06:02:29PM -0700, Arjan van de Ven wrote:
 
  The other thing which we should consider is that chunkfs really
  requires a 64-bit inode number space, which means either we only allow
 
 does it?
 I'd think it needs a chunk space number and a 32 bit local inode
 number ;) (same for blocks)
 

But that means that the number which gets exported to userspace via
the stat system call will need more than 32 bits worth of ino_t

- Ted

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-24 Thread David Lang

On Tue, 24 Apr 2007, Nikita Danilov wrote:


Amit Gud writes:

Hello,


 This is an initial implementation of ChunkFS technique, briefly discussed
 at: http://lwn.net/Articles/190222 and
 http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf

I have a couple of questions about chunkfs repair process.

First, as I understand it, each continuation inode is a sparse file,
mapping some subset of logical file blocks into block numbers. Then it
seems, that during final phase fsck has to check that these partial
mappings are consistent, for example, that no two different continuation
inodes for a given file contain a block number for the same offset. This
check requires scan of all chunks (rather than of only active during
crash), which seems to return us back to the scalability problem
chunkfs tries to address.


not quite.

this checking is a O(n^2) or worse problem, and it can eat a lot of memory in 
the process. with chunkfs you divide the problem by a large constant (100 or 
more) for the checks of individual chunks. after those are done then the final 
pass checking the cross-chunk links doesn't have to keep track of everything, it 
only needs to check those links and what they point to


any ability to mark a filesystem as 'clean' and then not have to check it on 
reboot is a bonus on top of this.


David Lang


Second, it is not clear how, under assumption of bugs in the file system
code (which paper makes at the very beginning), fsck can limit itself
only to the chunks that were active at the moment of crash.

[...]


 Best,
 AG

Nikita.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-24 Thread David Lang

On Tue, 24 Apr 2007, Nikita Danilov wrote:


David Lang writes:
 On Tue, 24 Apr 2007, Nikita Danilov wrote:

  Amit Gud writes:
 
  Hello,
 
  
   This is an initial implementation of ChunkFS technique, briefly discussed
   at: http://lwn.net/Articles/190222 and
   http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
 
  I have a couple of questions about chunkfs repair process.
 
  First, as I understand it, each continuation inode is a sparse file,
  mapping some subset of logical file blocks into block numbers. Then it
  seems, that during final phase fsck has to check that these partial
  mappings are consistent, for example, that no two different continuation
  inodes for a given file contain a block number for the same offset. This
  check requires scan of all chunks (rather than of only active during
  crash), which seems to return us back to the scalability problem
  chunkfs tries to address.

 not quite.

 this checking is a O(n^2) or worse problem, and it can eat a lot of memory in
 the process. with chunkfs you divide the problem by a large constant (100 or
 more) for the checks of individual chunks. after those are done then the final
 pass checking the cross-chunk links doesn't have to keep track of everything, 
it
 only needs to check those links and what they point to

Maybe I failed to describe the problem presicely.

Suppose that all chunks have been checked. After that, for every inode
I0 having continuations I1, I2, ... In, one has to check that every
logical block is presented in at most one of these inodes. For this one
has to read I0, with all its indirect (double-indirect, triple-indirect)
blocks, then read I1 with all its indirect blocks, etc. And to repeat
this for every inode with continuations.

In the worst case (every inode has a continuation in every chunk) this
obviously is as bad as un-chunked fsck. But even in the average case,
total amount of io necessary for this operation is proportional to the
_total_ file system size, rather than to the chunk size.


actually, it should be proportional to the number of continuation nodes. The 
expectation (and design) is that they are rare.


If you get into the worst-case situation of all of them being continuation 
nodes, then you are actually worse off then you were to start with (as you are 
saying), but numbers from people's real filesystems (assuming a chunk size equal 
to a block cluster size) indicates that we are more on the order of a fraction 
of a percent of the nodes. and the expectation is that since the chunk sizes 
will be substantially larger then the block cluster sizes this should get 
reduced even more.


David Lang
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-24 Thread Amit Gud

Nikita Danilov wrote:

Maybe I failed to describe the problem presicely.

Suppose that all chunks have been checked. After that, for every inode
I0 having continuations I1, I2, ... In, one has to check that every
logical block is presented in at most one of these inodes. For this one
has to read I0, with all its indirect (double-indirect, triple-indirect)
blocks, then read I1 with all its indirect blocks, etc. And to repeat
this for every inode with continuations.

In the worst case (every inode has a continuation in every chunk) this
obviously is as bad as un-chunked fsck. But even in the average case,
total amount of io necessary for this operation is proportional to the
_total_ file system size, rather than to the chunk size.



Perhaps, I should talk about how continuation inodes are managed / 
located on disk. (This is how it is in my current implementation)


Right now, there is no distinction between an inode and continuation 
inode (also referred to as 'cnode' below), except for the 
EXT2_IS_CONT_FL flag. Every inode holds a list of static number of 
inodes, currently limited to 4.


The structure looks like this:

 -- --
| cnode 0  |--| cnode 0  |-- to another cnode or NULL
 -- --
| cnode 1  |-  | cnode 1  |-
 -- |   --  |
| cnode 2  |--  |  | cnode 2  |--   |
 --  |  |   --  |   |
| cnode 3  | |  |  | cnode 3  | |   |
 --  |  |   --  |   |
  |  |  ||  |   |

   inodes   inodes or NULL

I.e. only first cnode in the list carries forward the chain if all the 
slots are occupied.


Every cnode# field contains
{
ino_t cnode;
__u32 start;/* starting logical block number */
__u32 end;  /* ending logical block number */
}

(current implementation has just one field: cnode)

I thought of this structure to avoid recursion and / or use of any data 
structure while traversing the continuation inodes.


Additional flag, EXT2_SPARSE_CONT_FL would indicate whether the inode 
has any sparse portions. 'start' and 'end' fields are used to speed-up 
finding a cnode given a logical block number without the need of 
actually reading the inode - this can be done away with, perhaps more 
conveniently by, pinning the cnodes in memory as and when read.


Now going back to the Nikita's question, all the cnodes for an inode 
need to be scanned iff 'start' field or number of blocks or flag 
EXT2_SPARSE_CONT_FL in any of its cnodes is altered.


And yes, the whole attempt is to reduce the number of continuation inodes.

Comments, suggestions welcome.


AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 0/8] mount ownership and unprivileged mount syscall (v4)

2007-04-24 Thread Karel Zak
On Fri, Apr 20, 2007 at 12:25:32PM +0200, Miklos Szeredi wrote:
 The following extra security measures are taken for unprivileged
 mounts:
 
  - usermounts are limited by a sysctl tunable
  - force nosuid,nodev mount options on the created mount

 The original userspace user= solution also implies the noexec
 option by default (you can override the default by exec option).
 
 It means the kernel based solution is not fully compatible ;-(

Karel

-- 
 Karel Zak  [EMAIL PROTECTED]
 
 Red Hat Czech s.r.o.
 Purkynova 99/71, 612 45 Brno, Czech Republic
 Reg.id: CZ27690016
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ChunkFS - measuring cross-chunk references

2007-04-24 Thread Karuna sagar K

On 4/24/07, Theodore Tso [EMAIL PROTECTED] wrote:

On Mon, Apr 23, 2007 at 02:53:33PM -0600, Andreas Dilger wrote:

.

It would also be good to distinguish between directories referencing
files in another chunk, and directories referencing subdirectories in
another chunk (which would be simpler to handle, given the topological
restrictions on directories, as compared to files and hard links).



Modified the tool to distinguish between
1. cross references between directories and files
2. cross references between directories and sub directories
3. cross references within a file (due to huge file size)

Below is the result from / partition of ext3 file system:

Number of files = 221794
Number of directories = 24457
Total size = 8193116 KB
Total data stored = 7187392 KB
Size of block groups = 131072 KB
Number of inodes per block group = 16288
No. of cross references between directories and sub-directories = 7791
No. of cross references between directories and file = 657
Total no. of cross references = 62018 (dir ref = 8448, file ref = 53570)

Thanks for the suggestions.


There may also be special things we will need to do to handle
scenarios such as BackupPC, where if it looks like a directory
contains a huge number of hard links to a particular chunk, we'll need
to make sure that directory is either created in the right chunk
(possibly with hints from the application) or migrated to the right
chunk (but this might cause the inode number of the directory to
change --- maybe we allow this as long as the directory has never been
stat'ed, so that the inode number has never been observed).

The other thing which we should consider is that chunkfs really
requires a 64-bit inode number space, which means either we only allow
it on 64-bit systems, or we need to consider a migration so that even
on 32-bit platforms, stat() functions like stat64(), insofar that it
uses a stat structure which returns a 64-bit ino_t.

   - Ted




Thanks,
Karuna


cref.tar.bz2
Description: BZip2 compressed data


ChunkFS - measuring cross-chunk references

2007-04-24 Thread Karuna sagar K

On 4/24/07, Theodore Tso [EMAIL PROTECTED] wrote:

On Mon, Apr 23, 2007 at 02:53:33PM -0600, Andreas Dilger wrote:

.

It would also be good to distinguish between directories referencing
files in another chunk, and directories referencing subdirectories in
another chunk (which would be simpler to handle, given the topological
restrictions on directories, as compared to files and hard links).



Modified the tool to distinguish between
1. cross references between directories and files
2. cross references between directories and sub directories
3. cross references within a file (due to huge file size)

Below is the result from / partition of ext3 file system:

Number of files = 221794
Number of directories = 24457
Total size = 8193116 KB
Total data stored = 7187392 KB
Size of block groups = 131072 KB
Number of inodes per block group = 16288
No. of cross references between directories and sub-directories = 7791
No. of cross references between directories and file = 657
Total no. of cross references = 62018 (dir ref = 8448, file ref = 53570)

Thanks for the suggestions.


There may also be special things we will need to do to handle
scenarios such as BackupPC, where if it looks like a directory
contains a huge number of hard links to a particular chunk, we'll need
to make sure that directory is either created in the right chunk
(possibly with hints from the application) or migrated to the right
chunk (but this might cause the inode number of the directory to
change --- maybe we allow this as long as the directory has never been
stat'ed, so that the inode number has never been observed).

The other thing which we should consider is that chunkfs really
requires a 64-bit inode number space, which means either we only allow
it on 64-bit systems, or we need to consider a migration so that even
on 32-bit platforms, stat() functions like stat64(), insofar that it
uses a stat structure which returns a 64-bit ino_t.

   - Ted




Thanks,
Karuna


cref.tar.bz2
Description: BZip2 compressed data


Re: [patch 0/8] mount ownership and unprivileged mount syscall (v4)

2007-04-24 Thread Eric W. Biederman
Karel Zak [EMAIL PROTECTED] writes:

 On Fri, Apr 20, 2007 at 12:25:32PM +0200, Miklos Szeredi wrote:
 The following extra security measures are taken for unprivileged
 mounts:
 
  - usermounts are limited by a sysctl tunable
  - force nosuid,nodev mount options on the created mount

  The original userspace user= solution also implies the noexec
  option by default (you can override the default by exec option).
  
  It means the kernel based solution is not fully compatible ;-(

Why noexec?  Either it was a silly or arbitrary decision, or
our kernel design may be incomplete.

Now I can see not wanting to support executables if you are locking
down a system.  The classic don't execute a program from a CD just because
the CD was stuck in the drive problem.

So I can see how executing code from an untrusted source could prevent
exploitation of other problems, and we certainly don't want to do it
automatically.

Eric
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html