Re: [PATCH] ext4: dir inode reservation V3

2007-11-13 Thread Alex Tomas

hmm. so you trade 265% degradation of creation for 40% improvement of unlink?

thanks, Alex

Coly Li wrote:

normal ext4 ext4 with dir inode 
reservation
mount options:  -o data=writeback   -o 
data=writeback,dir_ireserve=low
Create dirs:real0m49.101s   real2m59.703s
Create files:   real24m17.962s  real21m8.161s
Unlink all: real24m43.788s  real17m29.862s
Creating dirs with dir inode reservation is slower than normal ext4 as 
predicted, because allocating
directory inodes in non-linear order will cause extra hard disk seeking and 
block I/O. Creating
files with dir inode reservation is 13% faster than normal ext4. Unlink all the 
directories and
files is 29.2% faster as expected.
When number of directories is increased, the performance improvement will be 
more considerable. More
benchmark result will be posted here if necessary, because I need more time to 
run more test cases.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] ext4: dir inode reservation V3

2007-11-13 Thread Alex Tomas

hmm. so you trade 265% degradation of creation for 40% improvement of unlink?

thanks, Alex

Coly Li wrote:

normal ext4 ext4 with dir inode 
reservation
mount options:  -o data=writeback   -o 
data=writeback,dir_ireserve=low
Create dirs:real0m49.101s   real2m59.703s
Create files:   real24m17.962s  real21m8.161s
Unlink all: real24m43.788s  real17m29.862s
Creating dirs with dir inode reservation is slower than normal ext4 as 
predicted, because allocating
directory inodes in non-linear order will cause extra hard disk seeking and 
block I/O. Creating
files with dir inode reservation is 13% faster than normal ext4. Unlink all the 
directories and
files is 29.2% faster as expected.
When number of directories is increased, the performance improvement will be 
more considerable. More
benchmark result will be posted here if necessary, because I need more time to 
run more test cases.


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-19 Thread Alex Tomas
On 9/19/07, David Chinner <[EMAIL PROTECTED]> wrote:
> The problem is this: to alter the fundamental block size of the
> filesystem we also need to alter the data block size and that is
> exactly the piece that linux does not support right now.  So while
> we have the capability to use large block sizes in certain
> filesystems, we can't use that capability until the data path
> supports it.

it's much simpler to teach fs to understand multipage data (like
multipage bitmap scan, multipage extent search, etc) then deal with mm
fragmentation. IMHO. at same time you don't bust IO traffic with
non-used space.

-- 
thanks, Alex

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-19 Thread Alex Tomas
On 9/19/07, David Chinner [EMAIL PROTECTED] wrote:
 The problem is this: to alter the fundamental block size of the
 filesystem we also need to alter the data block size and that is
 exactly the piece that linux does not support right now.  So while
 we have the capability to use large block sizes in certain
 filesystems, we can't use that capability until the data path
 supports it.

it's much simpler to teach fs to understand multipage data (like
multipage bitmap scan, multipage extent search, etc) then deal with mm
fragmentation. IMHO. at same time you don't bust IO traffic with
non-used space.

-- 
thanks, Alex

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] ext4:fix unexpected error from ext4_reserve_global

2007-06-19 Thread Alex Tomas

ACK, of course.

thanks, Alex

Mingming Cao wrote:

On Thu, 2007-06-14 at 19:29 +0400, Dmitriy Monakhov wrote:

I just cant belive my eyes then i saw this at the first time...
simple test: strace dd if=/dev/zero of=/mnt/file

Thanks for reporting it. 


open("/dev/zero", O_RDONLY) = 0
close(1)= 0
open("/mnt/test/file", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 1
read(0, "\0\0\0\0\0\0\0\0\0"..., 512) = 512
write(1, "\0\0\0\0\0\0\0\0"..., 512) = 512
read(0, "\0\0\0\0\0\0\0\0\0"..., 512) = 512
write(1, "\0\0\0\0\0\0\0\0"..., 512) = -1 ENOENT (No such fil
e or directory)

This strange error returned from ext4_reserve_global().
It's just typo because:
a) In fact this is 100% ENOSPC situation
b) simular function ext4_reserve_local() returns -ENOSPC


I agree.

Patch is put in ext4 patch queue. Alex if you can Ack, that would be
great.


Thanks,

Mingming

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
---
 fs/ext4/balloc.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 4d7bfd2..43ae8f8 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -1920,7 +1920,7 @@ int ext4_reserve_global(struct super_block *sb, int 
blocks)
 {
struct ext4_sb_info *sbi = EXT4_SB(sb);
struct ext4_reservation_slot *rs;
-   int i, rc = -ENOENT;
+   int i, rc = -ENOSPC;
__u64 free = 0;

rs = sbi->s_reservation_slots;


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] ext4:fix unexpected error from ext4_reserve_global

2007-06-19 Thread Alex Tomas

ACK, of course.

thanks, Alex

Mingming Cao wrote:

On Thu, 2007-06-14 at 19:29 +0400, Dmitriy Monakhov wrote:

I just cant belive my eyes then i saw this at the first time...
simple test: strace dd if=/dev/zero of=/mnt/file

Thanks for reporting it. 


open(/dev/zero, O_RDONLY) = 0
close(1)= 0
open(/mnt/test/file, O_WRONLY|O_CREAT|O_TRUNC, 0666) = 1
read(0, \0\0\0\0\0\0\0\0\0..., 512) = 512
write(1, \0\0\0\0\0\0\0\0..., 512) = 512
read(0, \0\0\0\0\0\0\0\0\0..., 512) = 512
write(1, \0\0\0\0\0\0\0\0..., 512) = -1 ENOENT (No such fil
e or directory)

This strange error returned from ext4_reserve_global().
It's just typo because:
a) In fact this is 100% ENOSPC situation
b) simular function ext4_reserve_local() returns -ENOSPC


I agree.

Patch is put in ext4 patch queue. Alex if you can Ack, that would be
great.


Thanks,

Mingming

Signed-off-by: Dmitriy Monakhov [EMAIL PROTECTED]
---
 fs/ext4/balloc.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 4d7bfd2..43ae8f8 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -1920,7 +1920,7 @@ int ext4_reserve_global(struct super_block *sb, int 
blocks)
 {
struct ext4_sb_info *sbi = EXT4_SB(sb);
struct ext4_reservation_slot *rs;
-   int i, rc = -ENOENT;
+   int i, rc = -ENOSPC;
__u64 free = 0;

rs = sbi-s_reservation_slots;


-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html



-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-04 Thread Alex Tomas

Andrew Morton wrote:

I'm still not understanding.  The terms you're using are a bit ambiguous.

What does "find some dirty unallocated blocks" mean?  Find a page which is
dirty and which does not have a disk mapping?

Normally the above operation would be implemented via
ext4_writeback_writepage(), and it runs under lock_page().


I'm mostly worried about delayed allocation case. My impression was that
holding number of pages locked isn't a good idea, even if they're locked
in index order. so, I was going to turn number of pages writeback, then
allocate blocks for all of them at once, then put proper blocknr's into
bh's (or PG_mappedtodisk?).





going to commit
find inode I dirty
do NOT find these blocks because they're
  allocated only, but pages/bhs aren't 
mapped
  to them
start commit


I think you're assuming here that commit would be using ->t_sync_datalist
to locate dirty buffer_heads.


nope, I mean sb->inode->page walk.


But under this proposal, t_sync_datalist just gets removed: the new
ordered-data mode _only_ need to do the sb->inode->page walk.  So if I'm
understanding you, the way in which we'd handle any such race is to make
kjournald's writeback of the dirty pages block in lock_page().  Once it
gets the page lock it can look to see if some other thread has mapped the
page to disk.


if I'm right holding number of pages locked, then they won't be locked, but
writeback. of course kjournald can block on writeback as well, but how does
it find pages with *newly allocated* blocks only?


It may turn out that kjournald needs a private way of getting at the
I_DIRTY_PAGES inodes to do this properly, but I don't _think_ so.  If we
had the radix-tree-of-dirty-inodes thing then that's easy enough to do
anyway, with a tagged search.  But I expect that a single pass through the
superblock's dirty inodes would suffice for ordered-data.  Files which
have chattr +j would screw things up, as usual.


not dirty inodes only, but rather some fast way to find pages with newly
allocated pages.


I assume (hope) that your delayed allocation code implements
->writepages()?  Doing the allocation one-page-at-a-time sounds painful...


indeed. this is a root cause of all this complexity.

thanks, Alex


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-04 Thread Alex Tomas

Andrew Morton wrote:

On Fri, 04 May 2007 10:18:12 +0400 Alex Tomas <[EMAIL PROTECTED]> wrote:


Andrew Morton wrote:

Yes, there can be issues with needing to allocate journal space within the
context of a commit.  But

no-no, this isn't required. we only need to mark pages/blocks within
transaction, otherwise race is possible when we allocate blocks in transaction,
then transacton starts to commit, then we mark pages/blocks to be flushed
before commit.


I don't understand.  Can you please describe the race in more detail?


if I understood your idea right, then in data=ordered mode, commit thread writes
all dirty mapped blocks before real commit.

say, we have two threads: t1 is a thread doing flushing and t2 is a commit 
thread

t1  t2
find dirty inode I
find some dirty unallocated blocks
journal_start()
allocate blocks
attach them to I
journal_stop()

going to commit
find inode I dirty
do NOT find these blocks because they're
  allocated only, but pages/bhs aren't 
mapped
  to them
start commit


map pages/bhs to just allocate blocks


so, either we mark pages/bhs someway within journal_start()--journal_stop() or
commit thread should do lookup for all dirty pages. the latter doesn't sound 
nice, IMHO.

thanks, Alex



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-04 Thread Alex Tomas

Andrew Morton wrote:

Yes, there can be issues with needing to allocate journal space within the
context of a commit.  But


no-no, this isn't required. we only need to mark pages/blocks within
transaction, otherwise race is possible when we allocate blocks in transaction,
then transacton starts to commit, then we mark pages/blocks to be flushed
before commit.


a) If the page has newly allocated space on disk then the metadata which
   refers to that page is already in the journal: no new journal space
   needed.

b) If the page doesn't have space allocated on disk then we don't need
   to write it out at ordered-mode commit time, because the post-recovery
   filesystem will not have any references to that page.

c) If the page is dirty due to overwrite then no metadata update was required.

IOW, under what circumstances would an ordered-mode commit need to allocate
space for a delayed-allocate page?


no need to allocate space within commit thread, I think. only to take care
of the race I described above. in hackish version of data=ordered for delayed
allocation I used counter of submitted bio's with newly-allocated blocks and
commit thread waits for the counter to reach 0.



However b) might lead to the hey-my-file-is-full-of-zeroes problem.



thanks, Alex

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-04 Thread Alex Tomas

Andrew Morton wrote:

Yes, there can be issues with needing to allocate journal space within the
context of a commit.  But


no-no, this isn't required. we only need to mark pages/blocks within
transaction, otherwise race is possible when we allocate blocks in transaction,
then transacton starts to commit, then we mark pages/blocks to be flushed
before commit.


a) If the page has newly allocated space on disk then the metadata which
   refers to that page is already in the journal: no new journal space
   needed.

b) If the page doesn't have space allocated on disk then we don't need
   to write it out at ordered-mode commit time, because the post-recovery
   filesystem will not have any references to that page.

c) If the page is dirty due to overwrite then no metadata update was required.

IOW, under what circumstances would an ordered-mode commit need to allocate
space for a delayed-allocate page?


no need to allocate space within commit thread, I think. only to take care
of the race I described above. in hackish version of data=ordered for delayed
allocation I used counter of submitted bio's with newly-allocated blocks and
commit thread waits for the counter to reach 0.



However b) might lead to the hey-my-file-is-full-of-zeroes problem.



thanks, Alex

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-04 Thread Alex Tomas

Andrew Morton wrote:

On Fri, 04 May 2007 10:18:12 +0400 Alex Tomas [EMAIL PROTECTED] wrote:


Andrew Morton wrote:

Yes, there can be issues with needing to allocate journal space within the
context of a commit.  But

no-no, this isn't required. we only need to mark pages/blocks within
transaction, otherwise race is possible when we allocate blocks in transaction,
then transacton starts to commit, then we mark pages/blocks to be flushed
before commit.


I don't understand.  Can you please describe the race in more detail?


if I understood your idea right, then in data=ordered mode, commit thread writes
all dirty mapped blocks before real commit.

say, we have two threads: t1 is a thread doing flushing and t2 is a commit 
thread

t1  t2
find dirty inode I
find some dirty unallocated blocks
journal_start()
allocate blocks
attach them to I
journal_stop()

going to commit
find inode I dirty
do NOT find these blocks because they're
  allocated only, but pages/bhs aren't 
mapped
  to them
start commit


map pages/bhs to just allocate blocks


so, either we mark pages/bhs someway within journal_start()--journal_stop() or
commit thread should do lookup for all dirty pages. the latter doesn't sound 
nice, IMHO.

thanks, Alex



-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-04 Thread Alex Tomas

Andrew Morton wrote:

I'm still not understanding.  The terms you're using are a bit ambiguous.

What does find some dirty unallocated blocks mean?  Find a page which is
dirty and which does not have a disk mapping?

Normally the above operation would be implemented via
ext4_writeback_writepage(), and it runs under lock_page().


I'm mostly worried about delayed allocation case. My impression was that
holding number of pages locked isn't a good idea, even if they're locked
in index order. so, I was going to turn number of pages writeback, then
allocate blocks for all of them at once, then put proper blocknr's into
bh's (or PG_mappedtodisk?).





going to commit
find inode I dirty
do NOT find these blocks because they're
  allocated only, but pages/bhs aren't 
mapped
  to them
start commit


I think you're assuming here that commit would be using -t_sync_datalist
to locate dirty buffer_heads.


nope, I mean sb-inode-page walk.


But under this proposal, t_sync_datalist just gets removed: the new
ordered-data mode _only_ need to do the sb-inode-page walk.  So if I'm
understanding you, the way in which we'd handle any such race is to make
kjournald's writeback of the dirty pages block in lock_page().  Once it
gets the page lock it can look to see if some other thread has mapped the
page to disk.


if I'm right holding number of pages locked, then they won't be locked, but
writeback. of course kjournald can block on writeback as well, but how does
it find pages with *newly allocated* blocks only?


It may turn out that kjournald needs a private way of getting at the
I_DIRTY_PAGES inodes to do this properly, but I don't _think_ so.  If we
had the radix-tree-of-dirty-inodes thing then that's easy enough to do
anyway, with a tagged search.  But I expect that a single pass through the
superblock's dirty inodes would suffice for ordered-data.  Files which
have chattr +j would screw things up, as usual.


not dirty inodes only, but rather some fast way to find pages with newly
allocated pages.


I assume (hope) that your delayed allocation code implements
-writepages()?  Doing the allocation one-page-at-a-time sounds painful...


indeed. this is a root cause of all this complexity.

thanks, Alex


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-03 Thread Alex Tomas

Andrew Morton wrote:

We can make great improvements here, and I've (twice) previously decribed
how: hoist the entire ordered-mode data handling out of ext3, and out of
the buffer_head layer and move it up into the VFS pagecache layer. 
Basically, do ordered-data with a commit-time inode walk, calling

do_sync_mapping_range().

Do it in the VFS.  Make reiserfs use it, remove reiserfs ordered-mode too. 
Make XFS use it, fix the hey-my-files-are-all-full-of-zeroes problem there.


I'm not sure it's that easy.

if we move to pages, then we have to mark pages to be flushed holding
transaction open. now take delayed allocation into account: we need
to allocate number of blocks at once and then mark all pages mapped,
again within context of the same transaction. so, an implementation
would look like the following?

generic_writepages() {
/* collect set of contig. dirty pages */
foo_get_blocks() {
foo_journal_start();
foo_new_blocks();
foo_attach_blocks_to_inode();
generic_mark_pages_mapped();
foo_journal_stop();
}
}

another question is will it scale well given number of dirty inodes
can be much larger than number of inodes with dirty mapped blocks
(in delayed allocation case, for example) ?

thanks, Alex




-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

2007-05-03 Thread Alex Tomas

Andrew Morton wrote:

We can make great improvements here, and I've (twice) previously decribed
how: hoist the entire ordered-mode data handling out of ext3, and out of
the buffer_head layer and move it up into the VFS pagecache layer. 
Basically, do ordered-data with a commit-time inode walk, calling

do_sync_mapping_range().

Do it in the VFS.  Make reiserfs use it, remove reiserfs ordered-mode too. 
Make XFS use it, fix the hey-my-files-are-all-full-of-zeroes problem there.


I'm not sure it's that easy.

if we move to pages, then we have to mark pages to be flushed holding
transaction open. now take delayed allocation into account: we need
to allocate number of blocks at once and then mark all pages mapped,
again within context of the same transaction. so, an implementation
would look like the following?

generic_writepages() {
/* collect set of contig. dirty pages */
foo_get_blocks() {
foo_journal_start();
foo_new_blocks();
foo_attach_blocks_to_inode();
generic_mark_pages_mapped();
foo_journal_stop();
}
}

another question is will it scale well given number of dirty inodes
can be much larger than number of inodes with dirty mapped blocks
(in delayed allocation case, for example) ?

thanks, Alex




-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.21-ext4-1

2007-04-30 Thread Alex Tomas

Theodore Ts'o wrote:

P.S.  One bug which I've noted --- if there is a failure due to disk
filling up, running e2fsck on the filesystem will show that the i_blocks
fields on the inodes where there was a failure to allocate disk blocks
are left incorrect.  I'm guessing this is a bug in the delayed
allocation patches.  Alex, when you have a moment, could you take a
look?  Thanks!!


definitely. thanks for the report.

thanks, Alex


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.21-ext4-1

2007-04-30 Thread Alex Tomas

Theodore Ts'o wrote:

P.S.  One bug which I've noted --- if there is a failure due to disk
filling up, running e2fsck on the filesystem will show that the i_blocks
fields on the inodes where there was a failure to allocate disk blocks
are left incorrect.  I'm guessing this is a bug in the delayed
allocation patches.  Alex, when you have a moment, could you take a
look?  Thanks!!


definitely. thanks for the report.

thanks, Alex


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: O_DIRECT question

2007-01-17 Thread Alex Tomas

I think one problem with mmap/msync is that they can't maintain
i_size atomically like regular write does. so, one needs to
implement own i_size management in userspace.

thanks, Alex

> Side note: the only reason O_DIRECT exists is because database people are 
> too used to it, because other OS's haven't had enough taste to tell them 
> to do it right, so they've historically hacked their OS to get out of the 
> way.

> As a result, our madvise and/or posix_fadvise interfaces may not be all 
> that strong, because people sadly don't use them that much. It's a sad 
> example of a totally broken interface (O_DIRECT) resulting in better 
> interfaces not getting used, and then not getting as much development 
> effort put into them.

> So O_DIRECT not only is a total disaster from a design standpoint (just 
> look at all the crap it results in), it also indirectly has hurt better 
> interfaces. For example, POSIX_FADV_NOREUSE (which _could_ be a useful and 
> clean interface to make sure we don't pollute memory unnecessarily with 
> cached pages after they are all done) ends up being a no-op ;/

> Sad. And it's one of those self-fulfilling prophecies. Still, I hope some 
> day we can just rip the damn disaster out.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: O_DIRECT question

2007-01-17 Thread Alex Tomas

I think one problem with mmap/msync is that they can't maintain
i_size atomically like regular write does. so, one needs to
implement own i_size management in userspace.

thanks, Alex

 Side note: the only reason O_DIRECT exists is because database people are 
 too used to it, because other OS's haven't had enough taste to tell them 
 to do it right, so they've historically hacked their OS to get out of the 
 way.

 As a result, our madvise and/or posix_fadvise interfaces may not be all 
 that strong, because people sadly don't use them that much. It's a sad 
 example of a totally broken interface (O_DIRECT) resulting in better 
 interfaces not getting used, and then not getting as much development 
 effort put into them.

 So O_DIRECT not only is a total disaster from a design standpoint (just 
 look at all the crap it results in), it also indirectly has hurt better 
 interfaces. For example, POSIX_FADV_NOREUSE (which _could_ be a useful and 
 clean interface to make sure we don't pollute memory unnecessarily with 
 cached pages after they are all done) ends up being a no-op ;/

 Sad. And it's one of those self-fulfilling prophecies. Still, I hope some 
 day we can just rip the damn disaster out.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] return ENOENT from ext3_link when racing with unlink

2007-01-16 Thread Alex Tomas
> Peter Staubach (PS) writes:


 PS> Just out of curosity, what keeps i_nlink from going to 0 immediately
 PS> after the new test is executed?

i_mutex in vfs_link() and vfs_unlink()

thanks, Alex
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] return ENOENT from ext3_link when racing with unlink

2007-01-16 Thread Alex Tomas
 Peter Staubach (PS) writes:


 PS Just out of curosity, what keeps i_nlink from going to 0 immediately
 PS after the new test is executed?

i_mutex in vfs_link() and vfs_unlink()

thanks, Alex
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race

2007-01-12 Thread Alex Tomas
> Eric Sandeen (ES) writes:

 ES> Al says "no" and I'm not arguing.  :)

 ES> Apparently this may be OK with some filesystems, and Al says he doesn't
 ES> want to know about i_nlink in the vfs in any case.

well, generic_drop_inode() uses i_nlink ...

 ES> But I suppose there may be other filesystems which DO care, and should
 ES> be checking if they're not.

this is why I thought VFS could take care.

thanks, Alex
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race

2007-01-12 Thread Alex Tomas
> Eric Sandeen (ES) writes:
 ES> I tend to agree, chatting w/ Al I think he does too.  :)  I'll test
 ES> a patch that kicks out ext3_link() with -ENOENT at the top, and resubmit
 ES> that if things go well.

shouldn't VFS do that?

thanks, Alex
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race

2007-01-12 Thread Alex Tomas
> Eric Sandeen (ES) writes:

 ES> so I think it's possible that link can sneak in there & find it after
 ES> the mutex is dropped...?  Is this ok? :)  It's certainly -happening-
 ES> anyway

yes, but it shouldn't allow to re-link such inode back, IMHO.
a filesystem may start some non-revertable activity in its
unlink method.

thanks, Alex
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race

2007-01-12 Thread Alex Tomas

ah, it seems vfs_link() doesn't check whether source is still alive.
for example, in mkdir case vfs_mkdir() calls may_create() and
checks the parent is still there:
if (IS_DEADDIR(dir))
return -ENOENT;

VFS doesn't set S_DEAD on regular files, but we could check i_nlink.

thanks, Alex

>>>>> Alex Tomas (AT) writes:

 AT> interesting ..

 AT> I thought VFS doesn't allow concurrent operations.
 AT> if unlink goes first, then link should wait on the
 AT> parent's i_mutex and then found no source name.

 AT> thanks, Alex

>>>>> Eric Sandeen (ES) writes:

 ES> )
 ES> I've been looking at a case where many threads are opening, unlinking, and
 ES> hardlinking files on ext3 .  At unmount time I see an oops, because the 
superblock's
 ES> orphan list points to a freed inode.

 ES> I did some tracing of the inodes, and it looks like this:

 ES> ext3_unlink():[/src/linux-2.6.18/fs/ext3/namei.c:2123] adding orphan
 ES> i_state:0x7 cpu:1 i_count:2 i_nlink:0

 ES> ext3_orphan_add():[/src/linux-2.6.18/fs/ext3/namei.c:1890] ext3_orphan_add
 ES> i_state:0x7 cpu:1 i_count:2 i_nlink:0

 ES> iput():[/src/linux-2.6.18/fs/inode.c:1139] iput enter
 ES> i_state:0x7 cpu:1 i_count:2 i_nlink:0

 ES> ext3_link():[/src/linux-2.6.18/fs/ext3/namei.c:2202] ext3_link enter
 ES> i_state:0x7 cpu:3 i_count:1 i_nlink:0

 ES> ext3_inc_count():[/src/linux-2.6.18/fs/ext3/namei.c:1627] done
 ES> i_state:0x7 cpu:3 i_count:1 i_nlink:1

 ES> The unlink gets there first, finds i_count > 0 (in use) but nlink goes to 
0, so
 ES> it puts it on the orphan inode list.  Then link comes along, and bumps the 
link
 ES> back up to 1.  So now we are on the orphan inode list, but we are not 
unlinked.

 ES> Eventually when count goes to 0, and we still have 1 link, again no action 
is
 ES> taken to remove the inode from the orphan list, because it is still linked 
(i.e.
 ES> we don't go through ext3_delete())

 ES> When this inode is eventually freed, the sb orphan list gets corrupted, 
because 
 ES> we have freed it without first removing it from the orphan list.

 ES> I think the simple solution is to remove the inode from the orphan list
 ES> when we bump the link back up from 0 to 1.  I put that test in there 
because
 ES> there are other potential reasons that we might be on the list (truncates,
 ES> direct IO).

 ES> Comments?

 ES> Thanks,
 ES> -Eric

 ES> p.s. ext3_inc_count and ext3_dec_count seem misnamed, have an unused
 ES> arg, and are very infrequently called.  I'll probably submit a patch
 ES> to just put the single line of code into the caller, too.

 ES> ---

 ES> Remove inode from the orphan list in ext3_link() if we might have
 ES> raced with ext3_unlink(), which potentially put it on the list.
 ES> If we're on the list with nlink > 0, we'll never get cleaned up
 ES> properly and eventually may corrupt the list.

 ES> Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>

 ES> Index: linux-2.6.19/fs/ext3/namei.c
 ES> ===
 ES> --- linux-2.6.19.orig/fs/ext3/namei.c
 ES> +++ linux-2.6.19/fs/ext3/namei.c
 ES> @@ -2204,6 +2204,9 @@ retry:
 inode-> i_ctime = CURRENT_TIME_SEC;
 ES> ext3_inc_count(handle, inode);
 ES> atomic_inc(>i_count);
 ES> +  /* did we race w/ unlink? */
 ES> +  if (inode->i_nlink == 1)
 ES> +  ext3_orphan_del(handle, inode);
 
 ES> err = ext3_add_nondir(handle, dentry, inode);
 ES> ext3_journal_stop(handle);


 ES> -
 ES> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
 ES> the body of a message to [EMAIL PROTECTED]
 ES> More majordomo info at  http://vger.kernel.org/majordomo-info.html
 AT> -
 AT> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
 AT> the body of a message to [EMAIL PROTECTED]
 AT> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race

2007-01-12 Thread Alex Tomas

interesting ..

I thought VFS doesn't allow concurrent operations.
if unlink goes first, then link should wait on the
parent's i_mutex and then found no source name.

thanks, Alex

> Eric Sandeen (ES) writes:

 ES> )
 ES> I've been looking at a case where many threads are opening, unlinking, and
 ES> hardlinking files on ext3 .  At unmount time I see an oops, because the 
superblock's
 ES> orphan list points to a freed inode.

 ES> I did some tracing of the inodes, and it looks like this:

 ES>   ext3_unlink():[/src/linux-2.6.18/fs/ext3/namei.c:2123] adding orphan
 ES>   i_state:0x7 cpu:1 i_count:2 i_nlink:0

 ES>   ext3_orphan_add():[/src/linux-2.6.18/fs/ext3/namei.c:1890] 
ext3_orphan_add
 ES>   i_state:0x7 cpu:1 i_count:2 i_nlink:0

 ES>   iput():[/src/linux-2.6.18/fs/inode.c:1139] iput enter
 ES>   i_state:0x7 cpu:1 i_count:2 i_nlink:0

 ES>   ext3_link():[/src/linux-2.6.18/fs/ext3/namei.c:2202] ext3_link enter
 ES>   i_state:0x7 cpu:3 i_count:1 i_nlink:0

 ES>   ext3_inc_count():[/src/linux-2.6.18/fs/ext3/namei.c:1627] done
 ES>   i_state:0x7 cpu:3 i_count:1 i_nlink:1

 ES> The unlink gets there first, finds i_count > 0 (in use) but nlink goes to 
0, so
 ES> it puts it on the orphan inode list.  Then link comes along, and bumps the 
link
 ES> back up to 1.  So now we are on the orphan inode list, but we are not 
unlinked.

 ES> Eventually when count goes to 0, and we still have 1 link, again no action 
is
 ES> taken to remove the inode from the orphan list, because it is still linked 
(i.e.
 ES> we don't go through ext3_delete())

 ES> When this inode is eventually freed, the sb orphan list gets corrupted, 
because 
 ES> we have freed it without first removing it from the orphan list.

 ES> I think the simple solution is to remove the inode from the orphan list
 ES> when we bump the link back up from 0 to 1.  I put that test in there 
because
 ES> there are other potential reasons that we might be on the list (truncates,
 ES> direct IO).

 ES> Comments?

 ES> Thanks,
 ES> -Eric

 ES> p.s. ext3_inc_count and ext3_dec_count seem misnamed, have an unused
 ES> arg, and are very infrequently called.  I'll probably submit a patch
 ES> to just put the single line of code into the caller, too.

 ES> ---

 ES> Remove inode from the orphan list in ext3_link() if we might have
 ES> raced with ext3_unlink(), which potentially put it on the list.
 ES> If we're on the list with nlink > 0, we'll never get cleaned up
 ES> properly and eventually may corrupt the list.

 ES> Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>

 ES> Index: linux-2.6.19/fs/ext3/namei.c
 ES> ===
 ES> --- linux-2.6.19.orig/fs/ext3/namei.c
 ES> +++ linux-2.6.19/fs/ext3/namei.c
 ES> @@ -2204,6 +2204,9 @@ retry:
 inode-> i_ctime = CURRENT_TIME_SEC;
 ES>ext3_inc_count(handle, inode);
 ES>atomic_inc(>i_count);
 ES> +  /* did we race w/ unlink? */
 ES> +  if (inode->i_nlink == 1)
 ES> +  ext3_orphan_del(handle, inode);
 
 ES>err = ext3_add_nondir(handle, dentry, inode);
 ES>ext3_journal_stop(handle);


 ES> -
 ES> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
 ES> the body of a message to [EMAIL PROTECTED]
 ES> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race

2007-01-12 Thread Alex Tomas

interesting ..

I thought VFS doesn't allow concurrent operations.
if unlink goes first, then link should wait on the
parent's i_mutex and then found no source name.

thanks, Alex

 Eric Sandeen (ES) writes:

 ES )
 ES I've been looking at a case where many threads are opening, unlinking, and
 ES hardlinking files on ext3 .  At unmount time I see an oops, because the 
superblock's
 ES orphan list points to a freed inode.

 ES I did some tracing of the inodes, and it looks like this:

 ES   ext3_unlink():[/src/linux-2.6.18/fs/ext3/namei.c:2123] adding orphan
 ES   i_state:0x7 cpu:1 i_count:2 i_nlink:0

 ES   ext3_orphan_add():[/src/linux-2.6.18/fs/ext3/namei.c:1890] 
ext3_orphan_add
 ES   i_state:0x7 cpu:1 i_count:2 i_nlink:0

 ES   iput():[/src/linux-2.6.18/fs/inode.c:1139] iput enter
 ES   i_state:0x7 cpu:1 i_count:2 i_nlink:0

 ES   ext3_link():[/src/linux-2.6.18/fs/ext3/namei.c:2202] ext3_link enter
 ES   i_state:0x7 cpu:3 i_count:1 i_nlink:0

 ES   ext3_inc_count():[/src/linux-2.6.18/fs/ext3/namei.c:1627] done
 ES   i_state:0x7 cpu:3 i_count:1 i_nlink:1

 ES The unlink gets there first, finds i_count  0 (in use) but nlink goes to 
0, so
 ES it puts it on the orphan inode list.  Then link comes along, and bumps the 
link
 ES back up to 1.  So now we are on the orphan inode list, but we are not 
unlinked.

 ES Eventually when count goes to 0, and we still have 1 link, again no action 
is
 ES taken to remove the inode from the orphan list, because it is still linked 
(i.e.
 ES we don't go through ext3_delete())

 ES When this inode is eventually freed, the sb orphan list gets corrupted, 
because 
 ES we have freed it without first removing it from the orphan list.

 ES I think the simple solution is to remove the inode from the orphan list
 ES when we bump the link back up from 0 to 1.  I put that test in there 
because
 ES there are other potential reasons that we might be on the list (truncates,
 ES direct IO).

 ES Comments?

 ES Thanks,
 ES -Eric

 ES p.s. ext3_inc_count and ext3_dec_count seem misnamed, have an unused
 ES arg, and are very infrequently called.  I'll probably submit a patch
 ES to just put the single line of code into the caller, too.

 ES ---

 ES Remove inode from the orphan list in ext3_link() if we might have
 ES raced with ext3_unlink(), which potentially put it on the list.
 ES If we're on the list with nlink  0, we'll never get cleaned up
 ES properly and eventually may corrupt the list.

 ES Signed-off-by: Eric Sandeen [EMAIL PROTECTED]

 ES Index: linux-2.6.19/fs/ext3/namei.c
 ES ===
 ES --- linux-2.6.19.orig/fs/ext3/namei.c
 ES +++ linux-2.6.19/fs/ext3/namei.c
 ES @@ -2204,6 +2204,9 @@ retry:
 inode- i_ctime = CURRENT_TIME_SEC;
 ESext3_inc_count(handle, inode);
 ESatomic_inc(inode-i_count);
 ES +  /* did we race w/ unlink? */
 ES +  if (inode-i_nlink == 1)
 ES +  ext3_orphan_del(handle, inode);
 
 ESerr = ext3_add_nondir(handle, dentry, inode);
 ESext3_journal_stop(handle);


 ES -
 ES To unsubscribe from this list: send the line unsubscribe linux-ext4 in
 ES the body of a message to [EMAIL PROTECTED]
 ES More majordomo info at  http://vger.kernel.org/majordomo-info.html
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race

2007-01-12 Thread Alex Tomas

ah, it seems vfs_link() doesn't check whether source is still alive.
for example, in mkdir case vfs_mkdir() calls may_create() and
checks the parent is still there:
if (IS_DEADDIR(dir))
return -ENOENT;

VFS doesn't set S_DEAD on regular files, but we could check i_nlink.

thanks, Alex

 Alex Tomas (AT) writes:

 AT interesting ..

 AT I thought VFS doesn't allow concurrent operations.
 AT if unlink goes first, then link should wait on the
 AT parent's i_mutex and then found no source name.

 AT thanks, Alex

 Eric Sandeen (ES) writes:

 ES )
 ES I've been looking at a case where many threads are opening, unlinking, and
 ES hardlinking files on ext3 .  At unmount time I see an oops, because the 
superblock's
 ES orphan list points to a freed inode.

 ES I did some tracing of the inodes, and it looks like this:

 ES ext3_unlink():[/src/linux-2.6.18/fs/ext3/namei.c:2123] adding orphan
 ES i_state:0x7 cpu:1 i_count:2 i_nlink:0

 ES ext3_orphan_add():[/src/linux-2.6.18/fs/ext3/namei.c:1890] ext3_orphan_add
 ES i_state:0x7 cpu:1 i_count:2 i_nlink:0

 ES iput():[/src/linux-2.6.18/fs/inode.c:1139] iput enter
 ES i_state:0x7 cpu:1 i_count:2 i_nlink:0

 ES ext3_link():[/src/linux-2.6.18/fs/ext3/namei.c:2202] ext3_link enter
 ES i_state:0x7 cpu:3 i_count:1 i_nlink:0

 ES ext3_inc_count():[/src/linux-2.6.18/fs/ext3/namei.c:1627] done
 ES i_state:0x7 cpu:3 i_count:1 i_nlink:1

 ES The unlink gets there first, finds i_count  0 (in use) but nlink goes to 
0, so
 ES it puts it on the orphan inode list.  Then link comes along, and bumps the 
link
 ES back up to 1.  So now we are on the orphan inode list, but we are not 
unlinked.

 ES Eventually when count goes to 0, and we still have 1 link, again no action 
is
 ES taken to remove the inode from the orphan list, because it is still linked 
(i.e.
 ES we don't go through ext3_delete())

 ES When this inode is eventually freed, the sb orphan list gets corrupted, 
because 
 ES we have freed it without first removing it from the orphan list.

 ES I think the simple solution is to remove the inode from the orphan list
 ES when we bump the link back up from 0 to 1.  I put that test in there 
because
 ES there are other potential reasons that we might be on the list (truncates,
 ES direct IO).

 ES Comments?

 ES Thanks,
 ES -Eric

 ES p.s. ext3_inc_count and ext3_dec_count seem misnamed, have an unused
 ES arg, and are very infrequently called.  I'll probably submit a patch
 ES to just put the single line of code into the caller, too.

 ES ---

 ES Remove inode from the orphan list in ext3_link() if we might have
 ES raced with ext3_unlink(), which potentially put it on the list.
 ES If we're on the list with nlink  0, we'll never get cleaned up
 ES properly and eventually may corrupt the list.

 ES Signed-off-by: Eric Sandeen [EMAIL PROTECTED]

 ES Index: linux-2.6.19/fs/ext3/namei.c
 ES ===
 ES --- linux-2.6.19.orig/fs/ext3/namei.c
 ES +++ linux-2.6.19/fs/ext3/namei.c
 ES @@ -2204,6 +2204,9 @@ retry:
 inode- i_ctime = CURRENT_TIME_SEC;
 ES ext3_inc_count(handle, inode);
 ES atomic_inc(inode-i_count);
 ES +  /* did we race w/ unlink? */
 ES +  if (inode-i_nlink == 1)
 ES +  ext3_orphan_del(handle, inode);
 
 ES err = ext3_add_nondir(handle, dentry, inode);
 ES ext3_journal_stop(handle);


 ES -
 ES To unsubscribe from this list: send the line unsubscribe linux-ext4 in
 ES the body of a message to [EMAIL PROTECTED]
 ES More majordomo info at  http://vger.kernel.org/majordomo-info.html
 AT -
 AT To unsubscribe from this list: send the line unsubscribe linux-ext4 in
 AT the body of a message to [EMAIL PROTECTED]
 AT More majordomo info at  http://vger.kernel.org/majordomo-info.html
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race

2007-01-12 Thread Alex Tomas
 Eric Sandeen (ES) writes:

 ES so I think it's possible that link can sneak in there  find it after
 ES the mutex is dropped...?  Is this ok? :)  It's certainly -happening-
 ES anyway

yes, but it shouldn't allow to re-link such inode back, IMHO.
a filesystem may start some non-revertable activity in its
unlink method.

thanks, Alex
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race

2007-01-12 Thread Alex Tomas
 Eric Sandeen (ES) writes:
 ES I tend to agree, chatting w/ Al I think he does too.  :)  I'll test
 ES a patch that kicks out ext3_link() with -ENOENT at the top, and resubmit
 ES that if things go well.

shouldn't VFS do that?

thanks, Alex
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race

2007-01-12 Thread Alex Tomas
 Eric Sandeen (ES) writes:

 ES Al says no and I'm not arguing.  :)

 ES Apparently this may be OK with some filesystems, and Al says he doesn't
 ES want to know about i_nlink in the vfs in any case.

well, generic_drop_inode() uses i_nlink ...

 ES But I suppose there may be other filesystems which DO care, and should
 ES be checking if they're not.

this is why I thought VFS could take care.

thanks, Alex
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] delayed allocation for ext4

2006-12-28 Thread Alex Tomas
> David Chinner (DC) writes:

 DC> So that mean's we'll have 2 separate mechanisms for marking
 DC> pages as delalloc. XFS uses the BH_delay flag to indicate
 DC> that a buffer (block) attached to the page is using delalloc.
 >> 
 >> well, for blocksize=pagesize we can save 56 bytes on every page.

 DC> Sure, but it means that ext4 w/ delalloc won't work on lots of
 DC> machines

it does not currenly. but I'm going to implement that. not sure
whether it's worth to have two different codepaths for
block size=page size and block size < page size.

 DC> FWIW, how does this mechanism deal with block size < page size?
 DC> Don't you have to track delalloc on a block basis rather than
 DC> a page basis?
 >> 
 >> I'm still thinking how better to deal with that w/o much code duplication.

 DC> Code duplication in ext4, or across all filesystems?

given what Andrew said about moving the code into VFS, it's more
about all filesystems.

thanks, Alex
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] delayed allocation for ext4

2006-12-28 Thread Alex Tomas
 David Chinner (DC) writes:

 DC So that mean's we'll have 2 separate mechanisms for marking
 DC pages as delalloc. XFS uses the BH_delay flag to indicate
 DC that a buffer (block) attached to the page is using delalloc.
  
  well, for blocksize=pagesize we can save 56 bytes on every page.

 DC Sure, but it means that ext4 w/ delalloc won't work on lots of
 DC machines

it does not currenly. but I'm going to implement that. not sure
whether it's worth to have two different codepaths for
block size=page size and block size  page size.

 DC FWIW, how does this mechanism deal with block size  page size?
 DC Don't you have to track delalloc on a block basis rather than
 DC a page basis?
  
  I'm still thinking how better to deal with that w/o much code duplication.

 DC Code duplication in ext4, or across all filesystems?

given what Andrew said about moving the code into VFS, it's more
about all filesystems.

thanks, Alex
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] ext4-block-reservation.patch

2006-12-23 Thread Alex Tomas

Hi,

> Andrew Morton (AM) writes:

 AM> Should be cacheline_aligned_in_smp.

 AM> That's assuming it needs to be cacheline aligned at all.  It can consume a
 AM> lot of space.

the idea is to make block reservation cheap because it's called
for every page. 

 AM> 

 AM> oh, this should be allocated with alloc_percpu(), in which case the
 AM> open-coded alignment can perhaps go away.

got it.

 >> +
 >> +int ext4_reserve_local(struct super_block *sb, int blocks)
 >> +{
 >> +   struct ext4_sb_info *sbi = EXT4_SB(sb);
 >> +   struct ext4_reservation_slot *rs;
 >> +   int rc = -ENOSPC;
 >> +
 >> +   preempt_disable();
 >> +   rs = sbi->s_reservation_slots + smp_processor_id();

 AM> use get_cpu() here.

ok.

 >> +void ext4_rebalance_reservation(struct ext4_reservation_slot *rs, __u64 
 >> free)
 >> +{
 >> +   int i, used_slots = 0;
 >> +   __u64 chunk;
 >> +
 >> +   /* let's know what slots have been used */
 >> +   for (i = 0; i < NR_CPUS; i++)
 >> +   if (rs[i].rs_reserved || i == smp_processor_id())
 >> +   used_slots++;
 >> +
 >> +   /* chunk is a number of block every used
 >> +* slot will get. make sure it isn't 0 */
 >> +   chunk = free + used_slots - 1;
 >> +   do_div(chunk, used_slots);
 >> +
 >> +   for (i = 0; i < NR_CPUS; i++) {

 AM> all these NR_CPUS loops need to go away.  Use either
 AM> for_each_possible_cpu() or, preferably, for_each_online_cpu() and a hotplug
 AM> notifier.

hmm, i see.

 AM> Why is this code using per-cpu data at all, btw?  These optimisations tend
 AM> to be marginal in filesystems.  What is the perfomance impact of making
 AM> this data be single-superblock-wide-instance?

well, even on 2way box a single-lock reservation was in top10.

thanks, Alex
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] delayed allocation for ext4

2006-12-23 Thread Alex Tomas
> Christoph Hellwig (CH) writes:

 CH> Note that recording delayed alloc state at a page granularity in addition
 CH> to just the buffer heads has a lot of advantages aswell and would help
 CH> xfs, too.  But I think it makes a lot more sense to record it as a radix
 CH> tree tag to speed up the gang lookups for delalloc conversion.

please, exaplein about radix tree tag. in ext4-delalloc I use this
bit the only way - to avoid multiple reservation space for same
page. I guess you need to find non-allocated pages. probably to
flush them and update number of reserved blocks in case of -ENOSPC?

thanks, Alex
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] delayed allocation for ext4

2006-12-23 Thread Alex Tomas

Good day,

> David Chinner (DC) writes:

 DC> So that mean's we'll have 2 separate mechanisms for marking
 DC> pages as delalloc. XFS uses the BH_delay flag to indicate
 DC> that a buffer (block) attached to the page is using delalloc.

well, for blocksize=pagesize we can save 56 bytes on every page.

 DC> FWIW, how does this mechanism deal with block size < page size?
 DC> Don't you have to track delalloc on a block basis rather than
 DC> a page basis?

I'm still thinking how better to deal with that w/o much code duplication.

 DC> Ah, that's why you can get away with a page flag - you've ignored
 DC> the partial page delay state problem. Any plans to use the
 DC> existing method in the future so we will be able to use ext4 delalloc
 DC> on machines with a page size larger than 4k?

what do you mean by "exsiting"? BH_delay?


thanks, Alex
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] delayed allocation for ext4

2006-12-23 Thread Alex Tomas

Good day,

 David Chinner (DC) writes:

 DC So that mean's we'll have 2 separate mechanisms for marking
 DC pages as delalloc. XFS uses the BH_delay flag to indicate
 DC that a buffer (block) attached to the page is using delalloc.

well, for blocksize=pagesize we can save 56 bytes on every page.

 DC FWIW, how does this mechanism deal with block size  page size?
 DC Don't you have to track delalloc on a block basis rather than
 DC a page basis?

I'm still thinking how better to deal with that w/o much code duplication.

 DC Ah, that's why you can get away with a page flag - you've ignored
 DC the partial page delay state problem. Any plans to use the
 DC existing method in the future so we will be able to use ext4 delalloc
 DC on machines with a page size larger than 4k?

what do you mean by exsiting? BH_delay?


thanks, Alex
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] delayed allocation for ext4

2006-12-23 Thread Alex Tomas
 Christoph Hellwig (CH) writes:

 CH Note that recording delayed alloc state at a page granularity in addition
 CH to just the buffer heads has a lot of advantages aswell and would help
 CH xfs, too.  But I think it makes a lot more sense to record it as a radix
 CH tree tag to speed up the gang lookups for delalloc conversion.

please, exaplein about radix tree tag. in ext4-delalloc I use this
bit the only way - to avoid multiple reservation space for same
page. I guess you need to find non-allocated pages. probably to
flush them and update number of reserved blocks in case of -ENOSPC?

thanks, Alex
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] ext4-block-reservation.patch

2006-12-23 Thread Alex Tomas

Hi,

 Andrew Morton (AM) writes:

 AM Should be cacheline_aligned_in_smp.

 AM That's assuming it needs to be cacheline aligned at all.  It can consume a
 AM lot of space.

the idea is to make block reservation cheap because it's called
for every page. 

 AM looks

 AM oh, this should be allocated with alloc_percpu(), in which case the
 AM open-coded alignment can perhaps go away.

got it.

  +
  +int ext4_reserve_local(struct super_block *sb, int blocks)
  +{
  +   struct ext4_sb_info *sbi = EXT4_SB(sb);
  +   struct ext4_reservation_slot *rs;
  +   int rc = -ENOSPC;
  +
  +   preempt_disable();
  +   rs = sbi-s_reservation_slots + smp_processor_id();

 AM use get_cpu() here.

ok.

  +void ext4_rebalance_reservation(struct ext4_reservation_slot *rs, __u64 
  free)
  +{
  +   int i, used_slots = 0;
  +   __u64 chunk;
  +
  +   /* let's know what slots have been used */
  +   for (i = 0; i  NR_CPUS; i++)
  +   if (rs[i].rs_reserved || i == smp_processor_id())
  +   used_slots++;
  +
  +   /* chunk is a number of block every used
  +* slot will get. make sure it isn't 0 */
  +   chunk = free + used_slots - 1;
  +   do_div(chunk, used_slots);
  +
  +   for (i = 0; i  NR_CPUS; i++) {

 AM all these NR_CPUS loops need to go away.  Use either
 AM for_each_possible_cpu() or, preferably, for_each_online_cpu() and a hotplug
 AM notifier.

hmm, i see.

 AM Why is this code using per-cpu data at all, btw?  These optimisations tend
 AM to be marginal in filesystems.  What is the perfomance impact of making
 AM this data be single-superblock-wide-instance?

well, even on 2way box a single-lock reservation was in top10.

thanks, Alex
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC] ext4-block-reservation.patch

2006-12-22 Thread Alex Tomas


Index: linux-2.6.20-rc1/include/linux/ext4_fs.h
===
--- linux-2.6.20-rc1.orig/include/linux/ext4_fs.h   2006-12-14 
04:14:23.0 +0300
+++ linux-2.6.20-rc1/include/linux/ext4_fs.h2006-12-22 20:21:12.0 
+0300
@@ -201,6 +201,7 @@ struct ext4_group_desc
 #define EXT4_STATE_JDATA   0x0001 /* journaled data exists */
 #define EXT4_STATE_NEW 0x0002 /* inode is newly created */
 #define EXT4_STATE_XATTR   0x0004 /* has in-inode xattrs */
+#define EXT4_STATE_BLOCKS_RESERVED 0x0008 /* blocks reserved */
 
 /* Used to pass group descriptor data when online resize is done */
 struct ext4_new_group_input {
@@ -808,6 +809,10 @@ extern struct ext4_group_desc * ext4_get
 extern int ext4_should_retry_alloc(struct super_block *sb, int *retries);
 extern void ext4_init_block_alloc_info(struct inode *);
 extern void ext4_rsv_window_add(struct super_block *sb, struct 
ext4_reserve_window_node *rsv);
+int ext4_reserve_init(struct super_block *sb);
+void ext4_reserve_release(struct super_block *sb);
+void ext4_release_blocks(struct super_block *sb, int blocks);
+int ext4_reserve_blocks(struct super_block *sb, int blocks);
 
 /* dir.c */
 extern int ext4_check_dir_entry(const char *, struct inode *,
Index: linux-2.6.20-rc1/include/linux/ext4_fs_sb.h
===
--- linux-2.6.20-rc1.orig/include/linux/ext4_fs_sb.h2006-12-14 
04:14:23.0 +0300
+++ linux-2.6.20-rc1/include/linux/ext4_fs_sb.h 2006-12-22 20:20:10.0 
+0300
@@ -24,6 +24,8 @@
 #endif
 #include 
 
+struct ext4_reservation_slot;
+
 /*
  * third extended-fs super-block data in memory
  */
@@ -65,6 +67,9 @@ struct ext4_sb_info {
struct rb_root s_rsv_window_root;
struct ext4_reserve_window_node s_rsv_window_head;
 
+   /* global reservation structures */
+   struct ext4_reservation_slot *s_reservation_slots;
+
/* Journaling */
struct inode * s_journal_inode;
struct journal_s * s_journal;
Index: linux-2.6.20-rc1/fs/ext4/super.c
===
--- linux-2.6.20-rc1.orig/fs/ext4/super.c   2006-12-14 04:14:23.0 
+0300
+++ linux-2.6.20-rc1/fs/ext4/super.c2006-12-22 20:20:10.0 +0300
@@ -439,6 +439,7 @@ static void ext4_put_super (struct super
struct ext4_super_block *es = sbi->s_es;
int i;
 
+   ext4_reserve_release(sb);
ext4_ext_release(sb);
ext4_xattr_put_super(sb);
jbd2_journal_destroy(sbi->s_journal);
@@ -1867,6 +1868,7 @@ static int ext4_fill_super (struct super
"writeback");
 
ext4_ext_init(sb);
+   ext4_reserve_init(sb);
 
lock_kernel();
return 0;
Index: linux-2.6.20-rc1/fs/ext4/balloc.c
===
--- linux-2.6.20-rc1.orig/fs/ext4/balloc.c  2006-12-14 04:14:23.0 
+0300
+++ linux-2.6.20-rc1/fs/ext4/balloc.c   2006-12-22 20:32:11.0 +0300
@@ -630,8 +630,10 @@ void ext4_free_blocks(handle_t *handle, 
return;
}
ext4_free_blocks_sb(handle, sb, block, count, _freed_blocks);
-   if (dquot_freed_blocks)
+   if (dquot_freed_blocks) {
+   ext4_release_blocks(sb, dquot_freed_blocks);
DQUOT_FREE_BLOCK(inode, dquot_freed_blocks);
+   }
return;
 }
 
@@ -1440,7 +1442,7 @@ ext4_fsblk_t ext4_new_blocks(handle_t *h
struct ext4_sb_info *sbi;
struct ext4_reserve_window_node *my_rsv = NULL;
struct ext4_block_alloc_info *block_i;
-   unsigned short windowsz = 0;
+   unsigned short windowsz = 0, reserved = 0;
 #ifdef EXT4FS_DEBUG
static int goal_hits, goal_attempts;
 #endif
@@ -1462,6 +1464,13 @@ ext4_fsblk_t ext4_new_blocks(handle_t *h
return 0;
}
 
+   if (!(EXT4_I(inode)->i_state & EXT4_STATE_BLOCKS_RESERVED)) {
+   *errp = ext4_reserve_blocks(sb, num);
+   if (*errp)
+   return 0;
+   reserved = num;
+   }
+
sbi = EXT4_SB(sb);
es = EXT4_SB(sb)->s_es;
ext4_debug("goal=%lu.\n", goal);
@@ -1674,8 +1683,11 @@ out:
/*
 * Undo the block allocation
 */
-   if (!performed_allocation)
+   if (!performed_allocation) {
DQUOT_FREE_BLOCK(inode, *count);
+   if (reserved)
+   ext4_release_blocks(sb, reserved);
+   }
brelse(bitmap_bh);
return 0;
 }
@@ -1834,3 +1846,161 @@ unsigned long ext4_bg_num_gdb(struct sup
return ext4_bg_num_gdb_meta(sb,group);
 
 }
+
+/*
+ * reservation.c contains routines to reserve blocks.
+ * we need this for delayed allocation, otherwise we
+ * could meet -ENOSPC at flush time
+ */
+
+/*
+ * as ->commit_write() where we're going to reserve
+ * 

[RFC] ext4-delayed-allocation.patch

2006-12-22 Thread Alex Tomas
sv0, Opt_quota, Opt_noquota,
Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
-   Opt_grpquota, Opt_extents,
+   Opt_grpquota, Opt_extents, Opt_delayed_alloc,
 };
 
 static match_table_t tokens = {
@@ -780,6 +788,7 @@ static match_table_t tokens = {
{Opt_usrquota, "usrquota"},
{Opt_barrier, "barrier=%u"},
{Opt_extents, "extents"},
+   {Opt_delayed_alloc, "delalloc"},
{Opt_err, NULL},
{Opt_resize, "resize"},
 };
@@ -1094,6 +1103,9 @@ clear_qf_name:
else
clear_opt(sbi->s_mount_opt, BARRIER);
break;
+   case Opt_delayed_alloc:
+   set_opt(sbi->s_mount_opt, DELAYED_ALLOC);
+   break;
case Opt_ignore:
break;
case Opt_resize:
@@ -1869,6 +1881,7 @@ static int ext4_fill_super (struct super
 
ext4_ext_init(sb);
ext4_reserve_init(sb);
+   ext4_wb_init(sb);
 
lock_kernel();
return 0;
Index: linux-2.6.20-rc1/fs/ext4/extents.c
===
--- linux-2.6.20-rc1.orig/fs/ext4/extents.c 2006-12-14 04:14:23.0 
+0300
+++ linux-2.6.20-rc1/fs/ext4/extents.c  2006-12-22 22:56:04.0 +0300
@@ -2159,6 +2159,36 @@ int ext4_ext_writepage_trans_blocks(stru
return needed;
 }
 
+int ext4_ext_calc_metadata_amount(struct inode *inode, int blocks)
+{
+   int lcap, icap, rcap, leafs, idxs, num;
+
+   rcap = ext4_ext_space_root(inode);
+   if (blocks <= rcap) {
+   /* all extents fit to the root */
+   return 0;
+   }
+
+   rcap = ext4_ext_space_root_idx(inode);
+   lcap = ext4_ext_space_block(inode);
+   icap = ext4_ext_space_block_idx(inode);
+
+   num = leafs = (blocks + lcap - 1) / lcap;
+   if (leafs <= rcap) {
+   /* all pointers to leafs fit to the root */
+   return leafs;
+   }
+
+   /* ok. we need separate index block(s) to link all leaf blocks */
+   idxs = (leafs + icap - 1) / icap;
+   do {
+   num += idxs;
+   idxs = (idxs + icap - 1) / icap;
+   } while (idxs > rcap);
+
+   return num;
+}
+
 EXPORT_SYMBOL(ext4_mark_inode_dirty);
 EXPORT_SYMBOL(ext4_ext_invalidate_cache);
 EXPORT_SYMBOL(ext4_ext_insert_extent);
Index: linux-2.6.20-rc1/fs/ext4/Makefile
===
--- linux-2.6.20-rc1.orig/fs/ext4/Makefile  2006-12-14 04:14:23.0 
+0300
+++ linux-2.6.20-rc1/fs/ext4/Makefile   2006-12-22 22:56:04.0 +0300
@@ -6,7 +6,7 @@ obj-$(CONFIG_EXT4DEV_FS) += ext4dev.o
 
 ext4dev-y  := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o \
   ioctl.o namei.o super.o symlink.o hash.o resize.o extents.o \
-  ext4_jbd2.o
+  ext4_jbd2.o writeback.o
 
 ext4dev-$(CONFIG_EXT4DEV_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o
 ext4dev-$(CONFIG_EXT4DEV_FS_POSIX_ACL) += acl.o
Index: linux-2.6.20-rc1/fs/ext4/writeback.c
===
--- linux-2.6.20-rc1.orig/fs/ext4/writeback.c   2006-11-30 15:32:10.563465031 
+0300
+++ linux-2.6.20-rc1/fs/ext4/writeback.c2006-12-22 22:59:33.0 
+0300
@@ -0,0 +1,1167 @@
+/*
+ * Copyright (c) 2003-2006, Cluster File Systems, Inc, [EMAIL PROTECTED]
+ * Written by Alex Tomas <[EMAIL PROTECTED]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public Licens
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-
+ */
+
+/*
+ * TODO:
+ *   MUST:
+ * - flush dirty pages in -ENOSPC case in order to free reserved blocks
+ * - direct I/O support
+ * - blocksize != PAGE_CACHE_SIZE support
+ * - store last unwritten page in ext4_wb_writepages() and
+ *   continue from it in a next run
+ *   WISH:
+ * - should ext4_wb_writepage() try to flush neighbours?
+ * - ext4_wb_block_truncate_page() must flush partial truncated pages
+ * - reservation can be done per write-request in ext4_file_write()
+ *   rather than per-page in ext4_wb_commit_write() -- it's quite
+ *   expensive to recalculate amount of required metadata for evey page
+ * - re-allocation to improve layout
+ */
+
+#include 
+#include 
+

[RFC] booked-page-flag.patch

2006-12-22 Thread Alex Tomas


Index: linux-2.6.20-rc1/include/linux/page-flags.h
===
--- linux-2.6.20-rc1.orig/include/linux/page-flags.h2006-12-14 
04:14:23.0 +0300
+++ linux-2.6.20-rc1/include/linux/page-flags.h 2006-12-22 20:05:31.0 
+0300
@@ -90,6 +90,7 @@
 #define PG_reclaim 17  /* To be reclaimed asap */
 #define PG_nosave_free 18  /* Used for system suspend/resume */
 #define PG_buddy   19  /* Page is free, on buddy lists */
+#define PG_booked  20  /* Has blocks reserved on-disk */
 
 
 #if (BITS_PER_LONG > 32)
@@ -230,6 +231,10 @@ static inline void SetPageUptodate(struc
 #define SetPageMappedToDisk(page) set_bit(PG_mappedtodisk, &(page)->flags)
 #define ClearPageMappedToDisk(page) clear_bit(PG_mappedtodisk, &(page)->flags)
 
+#define PageBooked(page)   test_bit(PG_booked, &(page)->flags)
+#define SetPageBooked(page)set_bit(PG_booked, &(page)->flags)
+#define ClearPageBooked(page)  clear_bit(PG_booked, &(page)->flags)
+
 #define PageReclaim(page)  test_bit(PG_reclaim, &(page)->flags)
 #define SetPageReclaim(page)   set_bit(PG_reclaim, &(page)->flags)
 #define ClearPageReclaim(page) clear_bit(PG_reclaim, &(page)->flags)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC] delayed allocation for ext4

2006-12-22 Thread Alex Tomas

Good day,

probably the previous set of patches (including mballoc/lg)
is too large. so, I reworked delayed allocation a bit so
that it can be used on top of regular balloc, though it
still can be used with extents-enabled files only.

this time series contains just 3 patches:

 - booked-page-flag.patch
   adds PG_booked bit to page->flags. it's used in delayed
   allocation to mark space is already reserved for page
   (including possible metadata)

 - ext4-block-reservation.patch
   this is scalable free space management. every time we
   delay allocation of some page, a space (including metadata)
   should be reserved

 - ext4-delayed-allocation.patch
   delayed allocation itself, enabled by "delalloc" mount option.
   extents support is also required. currently it works only
   with blocksize=pagesize.


all the patches can be used in 
ftp://ftp.clusterfs.com/pub/people/alex/2.6.20-rc1/

the series passed basic tests like dd/dbench/fsx.

any comments/questions are very welcome.

thanks, Alex
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC] delayed allocation for ext4

2006-12-22 Thread Alex Tomas

Good day,

probably the previous set of patches (including mballoc/lg)
is too large. so, I reworked delayed allocation a bit so
that it can be used on top of regular balloc, though it
still can be used with extents-enabled files only.

this time series contains just 3 patches:

 - booked-page-flag.patch
   adds PG_booked bit to page-flags. it's used in delayed
   allocation to mark space is already reserved for page
   (including possible metadata)

 - ext4-block-reservation.patch
   this is scalable free space management. every time we
   delay allocation of some page, a space (including metadata)
   should be reserved

 - ext4-delayed-allocation.patch
   delayed allocation itself, enabled by delalloc mount option.
   extents support is also required. currently it works only
   with blocksize=pagesize.


all the patches can be used in 
ftp://ftp.clusterfs.com/pub/people/alex/2.6.20-rc1/

the series passed basic tests like dd/dbench/fsx.

any comments/questions are very welcome.

thanks, Alex
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC] booked-page-flag.patch

2006-12-22 Thread Alex Tomas


Index: linux-2.6.20-rc1/include/linux/page-flags.h
===
--- linux-2.6.20-rc1.orig/include/linux/page-flags.h2006-12-14 
04:14:23.0 +0300
+++ linux-2.6.20-rc1/include/linux/page-flags.h 2006-12-22 20:05:31.0 
+0300
@@ -90,6 +90,7 @@
 #define PG_reclaim 17  /* To be reclaimed asap */
 #define PG_nosave_free 18  /* Used for system suspend/resume */
 #define PG_buddy   19  /* Page is free, on buddy lists */
+#define PG_booked  20  /* Has blocks reserved on-disk */
 
 
 #if (BITS_PER_LONG  32)
@@ -230,6 +231,10 @@ static inline void SetPageUptodate(struc
 #define SetPageMappedToDisk(page) set_bit(PG_mappedtodisk, (page)-flags)
 #define ClearPageMappedToDisk(page) clear_bit(PG_mappedtodisk, (page)-flags)
 
+#define PageBooked(page)   test_bit(PG_booked, (page)-flags)
+#define SetPageBooked(page)set_bit(PG_booked, (page)-flags)
+#define ClearPageBooked(page)  clear_bit(PG_booked, (page)-flags)
+
 #define PageReclaim(page)  test_bit(PG_reclaim, (page)-flags)
 #define SetPageReclaim(page)   set_bit(PG_reclaim, (page)-flags)
 #define ClearPageReclaim(page) clear_bit(PG_reclaim, (page)-flags)
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC] ext4-delayed-allocation.patch

2006-12-22 Thread Alex Tomas
, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
-   Opt_grpquota, Opt_extents,
+   Opt_grpquota, Opt_extents, Opt_delayed_alloc,
 };
 
 static match_table_t tokens = {
@@ -780,6 +788,7 @@ static match_table_t tokens = {
{Opt_usrquota, usrquota},
{Opt_barrier, barrier=%u},
{Opt_extents, extents},
+   {Opt_delayed_alloc, delalloc},
{Opt_err, NULL},
{Opt_resize, resize},
 };
@@ -1094,6 +1103,9 @@ clear_qf_name:
else
clear_opt(sbi-s_mount_opt, BARRIER);
break;
+   case Opt_delayed_alloc:
+   set_opt(sbi-s_mount_opt, DELAYED_ALLOC);
+   break;
case Opt_ignore:
break;
case Opt_resize:
@@ -1869,6 +1881,7 @@ static int ext4_fill_super (struct super
 
ext4_ext_init(sb);
ext4_reserve_init(sb);
+   ext4_wb_init(sb);
 
lock_kernel();
return 0;
Index: linux-2.6.20-rc1/fs/ext4/extents.c
===
--- linux-2.6.20-rc1.orig/fs/ext4/extents.c 2006-12-14 04:14:23.0 
+0300
+++ linux-2.6.20-rc1/fs/ext4/extents.c  2006-12-22 22:56:04.0 +0300
@@ -2159,6 +2159,36 @@ int ext4_ext_writepage_trans_blocks(stru
return needed;
 }
 
+int ext4_ext_calc_metadata_amount(struct inode *inode, int blocks)
+{
+   int lcap, icap, rcap, leafs, idxs, num;
+
+   rcap = ext4_ext_space_root(inode);
+   if (blocks = rcap) {
+   /* all extents fit to the root */
+   return 0;
+   }
+
+   rcap = ext4_ext_space_root_idx(inode);
+   lcap = ext4_ext_space_block(inode);
+   icap = ext4_ext_space_block_idx(inode);
+
+   num = leafs = (blocks + lcap - 1) / lcap;
+   if (leafs = rcap) {
+   /* all pointers to leafs fit to the root */
+   return leafs;
+   }
+
+   /* ok. we need separate index block(s) to link all leaf blocks */
+   idxs = (leafs + icap - 1) / icap;
+   do {
+   num += idxs;
+   idxs = (idxs + icap - 1) / icap;
+   } while (idxs  rcap);
+
+   return num;
+}
+
 EXPORT_SYMBOL(ext4_mark_inode_dirty);
 EXPORT_SYMBOL(ext4_ext_invalidate_cache);
 EXPORT_SYMBOL(ext4_ext_insert_extent);
Index: linux-2.6.20-rc1/fs/ext4/Makefile
===
--- linux-2.6.20-rc1.orig/fs/ext4/Makefile  2006-12-14 04:14:23.0 
+0300
+++ linux-2.6.20-rc1/fs/ext4/Makefile   2006-12-22 22:56:04.0 +0300
@@ -6,7 +6,7 @@ obj-$(CONFIG_EXT4DEV_FS) += ext4dev.o
 
 ext4dev-y  := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o \
   ioctl.o namei.o super.o symlink.o hash.o resize.o extents.o \
-  ext4_jbd2.o
+  ext4_jbd2.o writeback.o
 
 ext4dev-$(CONFIG_EXT4DEV_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o
 ext4dev-$(CONFIG_EXT4DEV_FS_POSIX_ACL) += acl.o
Index: linux-2.6.20-rc1/fs/ext4/writeback.c
===
--- linux-2.6.20-rc1.orig/fs/ext4/writeback.c   2006-11-30 15:32:10.563465031 
+0300
+++ linux-2.6.20-rc1/fs/ext4/writeback.c2006-12-22 22:59:33.0 
+0300
@@ -0,0 +1,1167 @@
+/*
+ * Copyright (c) 2003-2006, Cluster File Systems, Inc, [EMAIL PROTECTED]
+ * Written by Alex Tomas [EMAIL PROTECTED]
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public Licens
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-
+ */
+
+/*
+ * TODO:
+ *   MUST:
+ * - flush dirty pages in -ENOSPC case in order to free reserved blocks
+ * - direct I/O support
+ * - blocksize != PAGE_CACHE_SIZE support
+ * - store last unwritten page in ext4_wb_writepages() and
+ *   continue from it in a next run
+ *   WISH:
+ * - should ext4_wb_writepage() try to flush neighbours?
+ * - ext4_wb_block_truncate_page() must flush partial truncated pages
+ * - reservation can be done per write-request in ext4_file_write()
+ *   rather than per-page in ext4_wb_commit_write() -- it's quite
+ *   expensive to recalculate amount of required metadata for evey page
+ * - re-allocation to improve layout
+ */
+
+#include linux/module.h
+#include linux/fs.h
+#include linux/bio.h
+#include linux/time.h
+#include linux/ext4_jbd2.h
+#include linux/jbd.h
+#include linux

[RFC] ext4-block-reservation.patch

2006-12-22 Thread Alex Tomas


Index: linux-2.6.20-rc1/include/linux/ext4_fs.h
===
--- linux-2.6.20-rc1.orig/include/linux/ext4_fs.h   2006-12-14 
04:14:23.0 +0300
+++ linux-2.6.20-rc1/include/linux/ext4_fs.h2006-12-22 20:21:12.0 
+0300
@@ -201,6 +201,7 @@ struct ext4_group_desc
 #define EXT4_STATE_JDATA   0x0001 /* journaled data exists */
 #define EXT4_STATE_NEW 0x0002 /* inode is newly created */
 #define EXT4_STATE_XATTR   0x0004 /* has in-inode xattrs */
+#define EXT4_STATE_BLOCKS_RESERVED 0x0008 /* blocks reserved */
 
 /* Used to pass group descriptor data when online resize is done */
 struct ext4_new_group_input {
@@ -808,6 +809,10 @@ extern struct ext4_group_desc * ext4_get
 extern int ext4_should_retry_alloc(struct super_block *sb, int *retries);
 extern void ext4_init_block_alloc_info(struct inode *);
 extern void ext4_rsv_window_add(struct super_block *sb, struct 
ext4_reserve_window_node *rsv);
+int ext4_reserve_init(struct super_block *sb);
+void ext4_reserve_release(struct super_block *sb);
+void ext4_release_blocks(struct super_block *sb, int blocks);
+int ext4_reserve_blocks(struct super_block *sb, int blocks);
 
 /* dir.c */
 extern int ext4_check_dir_entry(const char *, struct inode *,
Index: linux-2.6.20-rc1/include/linux/ext4_fs_sb.h
===
--- linux-2.6.20-rc1.orig/include/linux/ext4_fs_sb.h2006-12-14 
04:14:23.0 +0300
+++ linux-2.6.20-rc1/include/linux/ext4_fs_sb.h 2006-12-22 20:20:10.0 
+0300
@@ -24,6 +24,8 @@
 #endif
 #include linux/rbtree.h
 
+struct ext4_reservation_slot;
+
 /*
  * third extended-fs super-block data in memory
  */
@@ -65,6 +67,9 @@ struct ext4_sb_info {
struct rb_root s_rsv_window_root;
struct ext4_reserve_window_node s_rsv_window_head;
 
+   /* global reservation structures */
+   struct ext4_reservation_slot *s_reservation_slots;
+
/* Journaling */
struct inode * s_journal_inode;
struct journal_s * s_journal;
Index: linux-2.6.20-rc1/fs/ext4/super.c
===
--- linux-2.6.20-rc1.orig/fs/ext4/super.c   2006-12-14 04:14:23.0 
+0300
+++ linux-2.6.20-rc1/fs/ext4/super.c2006-12-22 20:20:10.0 +0300
@@ -439,6 +439,7 @@ static void ext4_put_super (struct super
struct ext4_super_block *es = sbi-s_es;
int i;
 
+   ext4_reserve_release(sb);
ext4_ext_release(sb);
ext4_xattr_put_super(sb);
jbd2_journal_destroy(sbi-s_journal);
@@ -1867,6 +1868,7 @@ static int ext4_fill_super (struct super
writeback);
 
ext4_ext_init(sb);
+   ext4_reserve_init(sb);
 
lock_kernel();
return 0;
Index: linux-2.6.20-rc1/fs/ext4/balloc.c
===
--- linux-2.6.20-rc1.orig/fs/ext4/balloc.c  2006-12-14 04:14:23.0 
+0300
+++ linux-2.6.20-rc1/fs/ext4/balloc.c   2006-12-22 20:32:11.0 +0300
@@ -630,8 +630,10 @@ void ext4_free_blocks(handle_t *handle, 
return;
}
ext4_free_blocks_sb(handle, sb, block, count, dquot_freed_blocks);
-   if (dquot_freed_blocks)
+   if (dquot_freed_blocks) {
+   ext4_release_blocks(sb, dquot_freed_blocks);
DQUOT_FREE_BLOCK(inode, dquot_freed_blocks);
+   }
return;
 }
 
@@ -1440,7 +1442,7 @@ ext4_fsblk_t ext4_new_blocks(handle_t *h
struct ext4_sb_info *sbi;
struct ext4_reserve_window_node *my_rsv = NULL;
struct ext4_block_alloc_info *block_i;
-   unsigned short windowsz = 0;
+   unsigned short windowsz = 0, reserved = 0;
 #ifdef EXT4FS_DEBUG
static int goal_hits, goal_attempts;
 #endif
@@ -1462,6 +1464,13 @@ ext4_fsblk_t ext4_new_blocks(handle_t *h
return 0;
}
 
+   if (!(EXT4_I(inode)-i_state  EXT4_STATE_BLOCKS_RESERVED)) {
+   *errp = ext4_reserve_blocks(sb, num);
+   if (*errp)
+   return 0;
+   reserved = num;
+   }
+
sbi = EXT4_SB(sb);
es = EXT4_SB(sb)-s_es;
ext4_debug(goal=%lu.\n, goal);
@@ -1674,8 +1683,11 @@ out:
/*
 * Undo the block allocation
 */
-   if (!performed_allocation)
+   if (!performed_allocation) {
DQUOT_FREE_BLOCK(inode, *count);
+   if (reserved)
+   ext4_release_blocks(sb, reserved);
+   }
brelse(bitmap_bh);
return 0;
 }
@@ -1834,3 +1846,161 @@ unsigned long ext4_bg_num_gdb(struct sup
return ext4_bg_num_gdb_meta(sb,group);
 
 }
+
+/*
+ * reservation.c contains routines to reserve blocks.
+ * we need this for delayed allocation, otherwise we
+ * could meet -ENOSPC at flush time
+ */
+
+/*
+ * as -commit_write() where we're going to 

Re: Boot failure with ext2 and initrds

2006-11-16 Thread Alex Tomas
> Andrew Morton (AM) writes:

 AM> What lock protects the fields in struct ext[234]_reserve_window from being
 AM> concurrently modified by two CPUs?  None, it seems.  Ditto
 AM> ext[234]_reserve_window_node.  i_mutex will cover it for write(), but not
 AM> for pageout over a file hole.  If we end up with a zero- or negative-sized
 AM> window then odd things might happen.

truncate_mutex?

thanks, Alex

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Boot failure with ext2 and initrds

2006-11-16 Thread Alex Tomas
 Andrew Morton (AM) writes:

 AM What lock protects the fields in struct ext[234]_reserve_window from being
 AM concurrently modified by two CPUs?  None, it seems.  Ditto
 AM ext[234]_reserve_window_node.  i_mutex will cover it for write(), but not
 AM for pageout over a file hole.  If we end up with a zero- or negative-sized
 AM window then odd things might happen.

truncate_mutex?

thanks, Alex

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] pdirops: vfs patch

2005-02-23 Thread Alex Tomas
> Jan Blunck (JB) writes:

 JB> Nope, d_alloc() is setting d_flags to DCACHE_UNHASHED. Therefore it is not 
found
 JB> by __d_lookup() until it is rehashed which is implicit done by ->lookup().

that means we can have two processes allocated dentry for
same name. they'll call ->lookup() each against own dentry,
fill them and hash. so, we'll get two equal dentries. 
is that OK?

thanks, Alex

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] pdirops: vfs patch

2005-02-23 Thread Alex Tomas
 Jan Blunck (JB) writes:

 JB Nope, d_alloc() is setting d_flags to DCACHE_UNHASHED. Therefore it is not 
found
 JB by __d_lookup() until it is rehashed which is implicit done by -lookup().

that means we can have two processes allocated dentry for
same name. they'll call -lookup() each against own dentry,
fill them and hash. so, we'll get two equal dentries. 
is that OK?

thanks, Alex

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] pdirops: vfs patch

2005-02-22 Thread Alex Tomas
> Jan Blunck (JB) writes:

 JB> i_sem does NOT protect the dcache. Also not in real_lookup(). The lock 
must be
 JB> acquired for ->lookup() and because we might sleep on i_sem, we have to 
get it
 JB> early and check for repopulation of the dcache.

dentry is part of dcache, right? i_sem protects dentry from being
returned with incomplete data inside, right?


thanks, Alex

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] pdirops: vfs patch

2005-02-22 Thread Alex Tomas
> Jan Blunck (JB) writes:

 >> 1) i_sem protects dcache too

 JB> Where? i_sem is the per-inode lock, and shouldn't be used else.

read comments in fs/namei.c:read_lookup()

 >> 2) tmpfs has no "own" data, so we can use it this way (see 2nd patch)
 >> 3) I have pdirops patch for ext3, but it needs some cleaning ...

 JB> I think you didn't get my point.

 JB> 1) Your approach is duplicating the locking effort for regular filesystem
 JB> (like ext2):
 JB> a) locking with s_pdirops_sems
 JB> b) locking the low-level filesystem data
 JB> It's cool that it speeds up tmpfs, but I don't think that this legatimate 
the
 JB> doubled locking for every other filesystem.
 JB> I'm not sure that it also increases performance for regular filesystems, 
if you
 JB> do the locking right.

we've already done this for ext3. it works.
it speeds some loads up significantly.
especially on big directories.
and you can control this via mount option,
so almost zero cost for fs that doesn't support pdirops.

 JB> 2) In my opinion, a superblock-wide semaphore array which allows 1024
 JB> different (different names and different operations) accesses to ONE single
 JB> inode (which is the data it should protect) is not a good idea.

yes, it has some weakness, i'm reworking vfs patch to avoid inter-inode
collisions.

thanks, Alex

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] pdirops: vfs patch

2005-02-22 Thread Alex Tomas
 Jan Blunck (JB) writes:

  1) i_sem protects dcache too

 JB Where? i_sem is the per-inode lock, and shouldn't be used else.

read comments in fs/namei.c:read_lookup()

  2) tmpfs has no own data, so we can use it this way (see 2nd patch)
  3) I have pdirops patch for ext3, but it needs some cleaning ...

 JB I think you didn't get my point.

 JB 1) Your approach is duplicating the locking effort for regular filesystem
 JB (like ext2):
 JB a) locking with s_pdirops_sems
 JB b) locking the low-level filesystem data
 JB It's cool that it speeds up tmpfs, but I don't think that this legatimate 
the
 JB doubled locking for every other filesystem.
 JB I'm not sure that it also increases performance for regular filesystems, 
if you
 JB do the locking right.

we've already done this for ext3. it works.
it speeds some loads up significantly.
especially on big directories.
and you can control this via mount option,
so almost zero cost for fs that doesn't support pdirops.

 JB 2) In my opinion, a superblock-wide semaphore array which allows 1024
 JB different (different names and different operations) accesses to ONE single
 JB inode (which is the data it should protect) is not a good idea.

yes, it has some weakness, i'm reworking vfs patch to avoid inter-inode
collisions.

thanks, Alex

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] pdirops: vfs patch

2005-02-22 Thread Alex Tomas
 Jan Blunck (JB) writes:

 JB i_sem does NOT protect the dcache. Also not in real_lookup(). The lock 
must be
 JB acquired for -lookup() and because we might sleep on i_sem, we have to 
get it
 JB early and check for repopulation of the dcache.

dentry is part of dcache, right? i_sem protects dentry from being
returned with incomplete data inside, right?


thanks, Alex

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] pdirops: vfs patch

2005-02-20 Thread Alex Tomas
> Jan Blunck (JB) writes:

 JB> With luck you have s_pdirops_size (or 1024) different renames altering
 JB> concurrently one directory inode. Therefore you need a lock protecting
 JB> your filesystem data. This is basically the job done by i_sem. So in
 JB> my opinion you only move "The Problem" from the VFS to the lowlevel
 JB> filesystems. But then there is no need for i_sem or your
 JB> s_pdirops_sems anymore.

1) i_sem protects dcache too
2) tmpfs has no "own" data, so we can use it this way (see 2nd patch)
3) I have pdirops patch for ext3, but it needs some cleaning ...

thanks, Alex

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] pdirops: vfs patch

2005-02-20 Thread Alex Tomas
 Jan Blunck (JB) writes:

 JB With luck you have s_pdirops_size (or 1024) different renames altering
 JB concurrently one directory inode. Therefore you need a lock protecting
 JB your filesystem data. This is basically the job done by i_sem. So in
 JB my opinion you only move The Problem from the VFS to the lowlevel
 JB filesystems. But then there is no need for i_sem or your
 JB s_pdirops_sems anymore.

1) i_sem protects dcache too
2) tmpfs has no own data, so we can use it this way (see 2nd patch)
3) I have pdirops patch for ext3, but it needs some cleaning ...

thanks, Alex

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] pdirops: tmpfs patch

2005-02-19 Thread Alex Tomas


Index: linux-2.6.10/mm/shmem.c
===
--- linux-2.6.10.orig/mm/shmem.c2005-01-28 19:32:16.0 +0300
+++ linux-2.6.10/mm/shmem.c 2005-02-19 20:05:32.642599576 +0300
@@ -1849,7 +1849,7 @@
 #endif
 };
 
-static int shmem_parse_options(char *options, int *mode, uid_t *uid, gid_t 
*gid, unsigned long *blocks, unsigned long *inodes)
+static int shmem_parse_options(char *options, int *mode, uid_t *uid, gid_t 
*gid, unsigned long *blocks, unsigned long *inodes, struct super_block *sb)
 {
char *this_char, *value, *rest;
 
@@ -1858,6 +1858,9 @@
continue;
if ((value = strchr(this_char,'=')) != NULL) {
*value++ = 0;
+   } else if (!strcmp(this_char,"pdirops")) {
+   sb->s_flags |= S_PDIROPS;
+   continue;
} else {
printk(KERN_ERR
"tmpfs: No value for mount option '%s'\n",
@@ -1928,7 +1931,7 @@
max_blocks = sbinfo->max_blocks;
max_inodes = sbinfo->max_inodes;
}
-   if (shmem_parse_options(data, NULL, NULL, NULL, _blocks, 
_inodes))
+   if (shmem_parse_options(data, NULL, NULL, NULL, _blocks, 
_inodes, sb))
return -EINVAL;
/* Keep it simple: disallow limited <-> unlimited remount */
if ((max_blocks || max_inodes) == !sbinfo)
@@ -1978,7 +1981,7 @@
inodes = blocks;
 
if (shmem_parse_options(data, ,
-   , , , ))
+   , , , , sb))
return -EINVAL;
}
 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] pdirops: vfs patch

2005-02-19 Thread Alex Tomas


 fs/inode.c |1 
 fs/namei.c |   66 ++---
 include/linux/fs.h |   11 
 3 files changed, 54 insertions(+), 24 deletions(-)

Index: linux-2.6.10/fs/namei.c
===
--- linux-2.6.10.orig/fs/namei.c2005-01-28 19:32:13.0 +0300
+++ linux-2.6.10/fs/namei.c 2005-02-19 20:40:05.763437304 +0300
@@ -104,6 +104,28 @@
  * any extra contention...
  */
 
+static inline struct semaphore * lock_sem(struct inode *dir, struct qstr *name)
+{
+   if (IS_PDIROPS(dir)) {
+   struct super_block *sb;
+   /* name->hash expected to be already calculated */
+   sb = dir->i_sb;
+   BUG_ON(sb->s_pdirops_sems == NULL);
+   return sb->s_pdirops_sems + name->hash % sb->s_pdirops_size;
+   }
+   return >i_sem;
+}
+
+static inline void lock_dir(struct inode *dir, struct qstr *name)
+{
+   down(lock_sem(dir, name));
+}
+
+static inline void unlock_dir(struct inode *dir, struct qstr *name)
+{
+   up(lock_sem(dir, name));
+}
+
 /* In order to reduce some races, while at the same time doing additional
  * checking and hopefully speeding things up, we copy filenames to the
  * kernel data space before using them..
@@ -380,7 +402,7 @@
struct dentry * result;
struct inode *dir = parent->d_inode;
 
-   down(>i_sem);
+   lock_dir(dir, name);
/*
 * First re-do the cached lookup just in case it was created
 * while we waited for the directory semaphore..
@@ -406,7 +428,7 @@
else
result = dentry;
}
-   up(>i_sem);
+   unlock_dir(dir, name);
return result;
}
 
@@ -414,7 +436,7 @@
 * Uhhuh! Nasty case: the cache was re-populated while
 * we waited on the semaphore. Need to revalidate.
 */
-   up(>i_sem);
+   unlock_dir(dir, name);
if (result->d_op && result->d_op->d_revalidate) {
if (!result->d_op->d_revalidate(result, nd) && 
!d_invalidate(result)) {
dput(result);
@@ -1182,12 +1204,26 @@
 /*
  * p1 and p2 should be directories on the same fs.
  */
-struct dentry *lock_rename(struct dentry *p1, struct dentry *p2)
+struct dentry *lock_rename(struct dentry *p1, struct qstr *n1,
+   struct dentry *p2, struct qstr *n2)
 {
struct dentry *p;
 
if (p1 == p2) {
-   down(>d_inode->i_sem);
+   if (IS_PDIROPS(p1->d_inode)) {
+   unsigned int h1, h2;
+   h1 = n1->hash % p1->d_inode->i_sb->s_pdirops_size;
+   h2 = n2->hash % p2->d_inode->i_sb->s_pdirops_size;
+   if (h1 < h2) {
+   lock_dir(p1->d_inode, n1);
+   lock_dir(p2->d_inode, n2);
+   } else if (h1 > h2) {
+   lock_dir(p2->d_inode, n2);
+   lock_dir(p1->d_inode, n1);
+   } else
+   lock_dir(p1->d_inode, n1);
+   } else
+   down(>d_inode->i_sem);
return NULL;
}
 
@@ -1195,31 +1231,35 @@
 
for (p = p1; p->d_parent != p; p = p->d_parent) {
if (p->d_parent == p2) {
-   down(>d_inode->i_sem);
-   down(>d_inode->i_sem);
+   lock_dir(p2->d_inode, n2);
+   lock_dir(p1->d_inode, n1);
return p;
}
}
 
for (p = p2; p->d_parent != p; p = p->d_parent) {
if (p->d_parent == p1) {
-   down(>d_inode->i_sem);
-   down(>d_inode->i_sem);
+   lock_dir(p1->d_inode, n1);
+   lock_dir(p2->d_inode, n2);
return p;
}
}
 
-   down(>d_inode->i_sem);
-   down(>d_inode->i_sem);
+   lock_dir(p1->d_inode, n1);
+   lock_dir(p2->d_inode, n2);
return NULL;
 }
 
-void unlock_rename(struct dentry *p1, struct dentry *p2)
+void unlock_rename(struct dentry *p1, struct qstr *n1,
+   struct dentry *p2, struct qstr *n2)
 {
-   up(>d_inode->i_sem);
+   unlock_dir(p1->d_inode, n1);
if (p1 != p2) {
-   up(>d_inode->i_sem);
+   unlock_dir(p2->d_inode, n2);
up(>d_inode->i_sb->s_vfs_rename_sem);
+   } else if (IS_PDIROPS(p1->d_inode) && 
+   n1->hash != n2->hash) {
+   unlock_dir(p2->d_inode, n2);
}
 }
 
@@ -1386,13 +1426,13 @@
 
dir = nd->dentry;
nd->flags &= ~LOOKUP_PARENT;
-   down(>d_inode->i_sem);
+   lock_dir(dir->d_inode, >last);
   

[RFC] parallel directory operations

2005-02-19 Thread Alex Tomas

Good day Al and all

could you review couple patches that implement $subj
for vfs and tmpfs. In short the idea is that we can
protect operations taking semaphore related for set
of names. definitely, protection at vfs layer isn't
enough and filesystem will need to protect their own
structures by itself, but in some cases vfs patch is
enough. for example, tmpfs. during some loads one can
see quite high load in /tmp. being mounted as tmpfs
on big smp, we can get high contention on i_sem.

probably someone could try more-less real load?

I wrote simple program that spawn few processes, then
chdir to the given directory, then loops creating and
unlinking file. The test box is dual PIII-1GHz:


run 1: 2 processes create/unlink file on regular tmpfs

[EMAIL PROTECTED] root]# mount -t tmpfs none /test
[EMAIL PROTECTED] root]# (cd /test; time  /root/crunl ./f 100 2)
#1998: 100 iterations, create/unlink ./f-0-1998
#1999: 100 iterations, create/unlink ./f-1-1999
#384: done
#384: done
wait for completion ... OK
real0m36.224s
user0m0.823s
sys 0m47.994s


run 2: 2 processes create/unlink file on tmpfs + pdirops

[EMAIL PROTECTED] root]# mount -t tmpfs -o pdirops none /test
[EMAIL PROTECTED] root]# (cd /test; time  /root/crunl ./f 100 2)
#1992: 100 iterations, create/unlink ./f-0-1992
#1993: 100 iterations, create/unlink ./f-1-1993
#384: done
#384: done
wait for completion ... OK
real0m15.108s
user0m0.592s
sys 0m29.406s


run 3: 1 process creates/unlinks file on regular tmpfs

[EMAIL PROTECTED] root]# mount -t tmpfs none /test
[EMAIL PROTECTED] root]# (cd /test; time  /root/crunl ./f 100 1)
#2004: 100 iterations, create/unlink ./f-0-2004
#384: done
wait for completion ... OK
real0m11.950s
user0m0.262s
sys 0m7.465s

run 4: 1 process creates/unlinks file on tmpf + pdirops

[EMAIL PROTECTED] root]# mount -t tmpfs -o pdirops none /test
[EMAIL PROTECTED] root]# (cd /test; time  /root/crunl ./f 100 1)
#2009: 100 iterations, create/unlink ./f-0-2009
#384: done
wait for completion ... OK
real0m8.047s
user0m0.243s
sys 0m7.646s


2 processes creating/unlinking on regular tmpfs cause ~200K context switches:

   procs  memory  swap  io system  cpu
 r  b  w   swpd   free   buff  cache   si   sobibo   incs us sy id
 2  0  0  0 1005760   6616  1092800 0 0 1007 215095  1 64 35
 2  0  0  0 1005760   6616  1092800 0 0 1007 213580  1 67 32
 2  0  0  0 1005760   6616  1092800 0 0 1007 214445  1 63 36
 2  0  0  0 1005760   6616  1092800 0 0 1007 216250  1 63 36


2 processes creating/unlinking on tmpfs + pdirops cause ~44 context switches:

   procs  memory  swap  io system  cpu
 r  b  w   swpd   free   buff  cache   si   sobibo   incs us sy id
 2  0  1  0 1001824   6632  1091200 0 0 100841  2 98  0
 2  0  2  0 1002144   6632  1091200 0 0 100845  2 98  0
 2  0  2  0 1001632   6632  1091200 0 0 100747  2 98  0


the next benchmark is rename. two processes generate random name, create file,
generate new name, rename created file in new name and unlink:

run 5: regular tmpfs

[EMAIL PROTECTED] root]# mount -t tmpfs none /test
[EMAIL PROTECTED] root]# (cd /test; time  /root/rndrename ./f 100 2)
#2036: 100 iterations
#2037: 100 iterations
wait for completion ... OK
real1m22.381s
user0m10.254s
sys 1m50.214s

run 6: tmpfs + pdirops

[EMAIL PROTECTED] root]# mount -t tmpfs -o pdirops none /test
[EMAIL PROTECTED] root]# (cd /test; time  /root/rndrename ./f 100 2)
#2044: 100 iterations
#2045: 100 iterations
wait for completion ... OK

real0m39.403s
user0m9.411s
sys 1m8.626s


thanks, Alex

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC] parallel directory operations

2005-02-19 Thread Alex Tomas

Good day Al and all

could you review couple patches that implement $subj
for vfs and tmpfs. In short the idea is that we can
protect operations taking semaphore related for set
of names. definitely, protection at vfs layer isn't
enough and filesystem will need to protect their own
structures by itself, but in some cases vfs patch is
enough. for example, tmpfs. during some loads one can
see quite high load in /tmp. being mounted as tmpfs
on big smp, we can get high contention on i_sem.

probably someone could try more-less real load?

I wrote simple program that spawn few processes, then
chdir to the given directory, then loops creating and
unlinking file. The test box is dual PIII-1GHz:


run 1: 2 processes create/unlink file on regular tmpfs

[EMAIL PROTECTED] root]# mount -t tmpfs none /test
[EMAIL PROTECTED] root]# (cd /test; time  /root/crunl ./f 100 2)
#1998: 100 iterations, create/unlink ./f-0-1998
#1999: 100 iterations, create/unlink ./f-1-1999
#384: done
#384: done
wait for completion ... OK
real0m36.224s
user0m0.823s
sys 0m47.994s


run 2: 2 processes create/unlink file on tmpfs + pdirops

[EMAIL PROTECTED] root]# mount -t tmpfs -o pdirops none /test
[EMAIL PROTECTED] root]# (cd /test; time  /root/crunl ./f 100 2)
#1992: 100 iterations, create/unlink ./f-0-1992
#1993: 100 iterations, create/unlink ./f-1-1993
#384: done
#384: done
wait for completion ... OK
real0m15.108s
user0m0.592s
sys 0m29.406s


run 3: 1 process creates/unlinks file on regular tmpfs

[EMAIL PROTECTED] root]# mount -t tmpfs none /test
[EMAIL PROTECTED] root]# (cd /test; time  /root/crunl ./f 100 1)
#2004: 100 iterations, create/unlink ./f-0-2004
#384: done
wait for completion ... OK
real0m11.950s
user0m0.262s
sys 0m7.465s

run 4: 1 process creates/unlinks file on tmpf + pdirops

[EMAIL PROTECTED] root]# mount -t tmpfs -o pdirops none /test
[EMAIL PROTECTED] root]# (cd /test; time  /root/crunl ./f 100 1)
#2009: 100 iterations, create/unlink ./f-0-2009
#384: done
wait for completion ... OK
real0m8.047s
user0m0.243s
sys 0m7.646s


2 processes creating/unlinking on regular tmpfs cause ~200K context switches:

   procs  memory  swap  io system  cpu
 r  b  w   swpd   free   buff  cache   si   sobibo   incs us sy id
 2  0  0  0 1005760   6616  1092800 0 0 1007 215095  1 64 35
 2  0  0  0 1005760   6616  1092800 0 0 1007 213580  1 67 32
 2  0  0  0 1005760   6616  1092800 0 0 1007 214445  1 63 36
 2  0  0  0 1005760   6616  1092800 0 0 1007 216250  1 63 36


2 processes creating/unlinking on tmpfs + pdirops cause ~44 context switches:

   procs  memory  swap  io system  cpu
 r  b  w   swpd   free   buff  cache   si   sobibo   incs us sy id
 2  0  1  0 1001824   6632  1091200 0 0 100841  2 98  0
 2  0  2  0 1002144   6632  1091200 0 0 100845  2 98  0
 2  0  2  0 1001632   6632  1091200 0 0 100747  2 98  0


the next benchmark is rename. two processes generate random name, create file,
generate new name, rename created file in new name and unlink:

run 5: regular tmpfs

[EMAIL PROTECTED] root]# mount -t tmpfs none /test
[EMAIL PROTECTED] root]# (cd /test; time  /root/rndrename ./f 100 2)
#2036: 100 iterations
#2037: 100 iterations
wait for completion ... OK
real1m22.381s
user0m10.254s
sys 1m50.214s

run 6: tmpfs + pdirops

[EMAIL PROTECTED] root]# mount -t tmpfs -o pdirops none /test
[EMAIL PROTECTED] root]# (cd /test; time  /root/rndrename ./f 100 2)
#2044: 100 iterations
#2045: 100 iterations
wait for completion ... OK

real0m39.403s
user0m9.411s
sys 1m8.626s


thanks, Alex

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] pdirops: vfs patch

2005-02-19 Thread Alex Tomas


 fs/inode.c |1 
 fs/namei.c |   66 ++---
 include/linux/fs.h |   11 
 3 files changed, 54 insertions(+), 24 deletions(-)

Index: linux-2.6.10/fs/namei.c
===
--- linux-2.6.10.orig/fs/namei.c2005-01-28 19:32:13.0 +0300
+++ linux-2.6.10/fs/namei.c 2005-02-19 20:40:05.763437304 +0300
@@ -104,6 +104,28 @@
  * any extra contention...
  */
 
+static inline struct semaphore * lock_sem(struct inode *dir, struct qstr *name)
+{
+   if (IS_PDIROPS(dir)) {
+   struct super_block *sb;
+   /* name-hash expected to be already calculated */
+   sb = dir-i_sb;
+   BUG_ON(sb-s_pdirops_sems == NULL);
+   return sb-s_pdirops_sems + name-hash % sb-s_pdirops_size;
+   }
+   return dir-i_sem;
+}
+
+static inline void lock_dir(struct inode *dir, struct qstr *name)
+{
+   down(lock_sem(dir, name));
+}
+
+static inline void unlock_dir(struct inode *dir, struct qstr *name)
+{
+   up(lock_sem(dir, name));
+}
+
 /* In order to reduce some races, while at the same time doing additional
  * checking and hopefully speeding things up, we copy filenames to the
  * kernel data space before using them..
@@ -380,7 +402,7 @@
struct dentry * result;
struct inode *dir = parent-d_inode;
 
-   down(dir-i_sem);
+   lock_dir(dir, name);
/*
 * First re-do the cached lookup just in case it was created
 * while we waited for the directory semaphore..
@@ -406,7 +428,7 @@
else
result = dentry;
}
-   up(dir-i_sem);
+   unlock_dir(dir, name);
return result;
}
 
@@ -414,7 +436,7 @@
 * Uhhuh! Nasty case: the cache was re-populated while
 * we waited on the semaphore. Need to revalidate.
 */
-   up(dir-i_sem);
+   unlock_dir(dir, name);
if (result-d_op  result-d_op-d_revalidate) {
if (!result-d_op-d_revalidate(result, nd)  
!d_invalidate(result)) {
dput(result);
@@ -1182,12 +1204,26 @@
 /*
  * p1 and p2 should be directories on the same fs.
  */
-struct dentry *lock_rename(struct dentry *p1, struct dentry *p2)
+struct dentry *lock_rename(struct dentry *p1, struct qstr *n1,
+   struct dentry *p2, struct qstr *n2)
 {
struct dentry *p;
 
if (p1 == p2) {
-   down(p1-d_inode-i_sem);
+   if (IS_PDIROPS(p1-d_inode)) {
+   unsigned int h1, h2;
+   h1 = n1-hash % p1-d_inode-i_sb-s_pdirops_size;
+   h2 = n2-hash % p2-d_inode-i_sb-s_pdirops_size;
+   if (h1  h2) {
+   lock_dir(p1-d_inode, n1);
+   lock_dir(p2-d_inode, n2);
+   } else if (h1  h2) {
+   lock_dir(p2-d_inode, n2);
+   lock_dir(p1-d_inode, n1);
+   } else
+   lock_dir(p1-d_inode, n1);
+   } else
+   down(p1-d_inode-i_sem);
return NULL;
}
 
@@ -1195,31 +1231,35 @@
 
for (p = p1; p-d_parent != p; p = p-d_parent) {
if (p-d_parent == p2) {
-   down(p2-d_inode-i_sem);
-   down(p1-d_inode-i_sem);
+   lock_dir(p2-d_inode, n2);
+   lock_dir(p1-d_inode, n1);
return p;
}
}
 
for (p = p2; p-d_parent != p; p = p-d_parent) {
if (p-d_parent == p1) {
-   down(p1-d_inode-i_sem);
-   down(p2-d_inode-i_sem);
+   lock_dir(p1-d_inode, n1);
+   lock_dir(p2-d_inode, n2);
return p;
}
}
 
-   down(p1-d_inode-i_sem);
-   down(p2-d_inode-i_sem);
+   lock_dir(p1-d_inode, n1);
+   lock_dir(p2-d_inode, n2);
return NULL;
 }
 
-void unlock_rename(struct dentry *p1, struct dentry *p2)
+void unlock_rename(struct dentry *p1, struct qstr *n1,
+   struct dentry *p2, struct qstr *n2)
 {
-   up(p1-d_inode-i_sem);
+   unlock_dir(p1-d_inode, n1);
if (p1 != p2) {
-   up(p2-d_inode-i_sem);
+   unlock_dir(p2-d_inode, n2);
up(p1-d_inode-i_sb-s_vfs_rename_sem);
+   } else if (IS_PDIROPS(p1-d_inode)  
+   n1-hash != n2-hash) {
+   unlock_dir(p2-d_inode, n2);
}
 }
 
@@ -1386,13 +1426,13 @@
 
dir = nd-dentry;
nd-flags = ~LOOKUP_PARENT;
-   down(dir-d_inode-i_sem);
+   lock_dir(dir-d_inode, nd-last);
dentry = 

Re: [RFC] pdirops: tmpfs patch

2005-02-19 Thread Alex Tomas


Index: linux-2.6.10/mm/shmem.c
===
--- linux-2.6.10.orig/mm/shmem.c2005-01-28 19:32:16.0 +0300
+++ linux-2.6.10/mm/shmem.c 2005-02-19 20:05:32.642599576 +0300
@@ -1849,7 +1849,7 @@
 #endif
 };
 
-static int shmem_parse_options(char *options, int *mode, uid_t *uid, gid_t 
*gid, unsigned long *blocks, unsigned long *inodes)
+static int shmem_parse_options(char *options, int *mode, uid_t *uid, gid_t 
*gid, unsigned long *blocks, unsigned long *inodes, struct super_block *sb)
 {
char *this_char, *value, *rest;
 
@@ -1858,6 +1858,9 @@
continue;
if ((value = strchr(this_char,'=')) != NULL) {
*value++ = 0;
+   } else if (!strcmp(this_char,pdirops)) {
+   sb-s_flags |= S_PDIROPS;
+   continue;
} else {
printk(KERN_ERR
tmpfs: No value for mount option '%s'\n,
@@ -1928,7 +1931,7 @@
max_blocks = sbinfo-max_blocks;
max_inodes = sbinfo-max_inodes;
}
-   if (shmem_parse_options(data, NULL, NULL, NULL, max_blocks, 
max_inodes))
+   if (shmem_parse_options(data, NULL, NULL, NULL, max_blocks, 
max_inodes, sb))
return -EINVAL;
/* Keep it simple: disallow limited - unlimited remount */
if ((max_blocks || max_inodes) == !sbinfo)
@@ -1978,7 +1981,7 @@
inodes = blocks;
 
if (shmem_parse_options(data, mode,
-   uid, gid, blocks, inodes))
+   uid, gid, blocks, inodes, sb))
return -EINVAL;
}
 

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Ext2-devel] Re: Latest ext3 patches (extents, mballoc, delayed allocation)

2005-02-15 Thread Alex Tomas
> Sonny Rao (SR) writes:

 SR> Alex, small buglet, If the FIBMAP-ioctl get's called on a file with
 SR> delayed allocation, you need to flush it (or at least allocate) before
 SR> returning the mappings.   This doesn't seem to work properly at
 SR> present.  

good catch. thanks. 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Ext2-devel] Re: Latest ext3 patches (extents, mballoc, delayed allocation)

2005-02-15 Thread Alex Tomas
 Sonny Rao (SR) writes:

 SR Alex, small buglet, If the FIBMAP-ioctl get's called on a file with
 SR delayed allocation, you need to flush it (or at least allocate) before
 SR returning the mappings.   This doesn't seem to work properly at
 SR present.  

good catch. thanks. 

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Latest ext3 patches (extents, mballoc, delayed allocation)

2005-02-11 Thread Alex Tomas

Good day all, 

I've updated the patchset against 2.6.10. A bunch of bugs have been
fixed and mballoc now behaves smarter a bit. Extents and mballoc 
patches collects some stats they print upon umount. NOTE: they must
not be used to store important data. A lot of things are to be done.

Please review. Any comments and suggestions are very welcome.

The patches are too big to send in a message (previous time they got
rejected). I put them in ftp://ftp.clusterfs.com/pub/people/alex/2.6.10

The following names are used:
  ext3rs - ext3 mounted with data=writeback,reservation options
  ext3i  - ext3 mounted with extents,mballoc,delalloc options
  reiser - reiserfs v3

I did some benchmarking. Two systems were used:
  SMP - dual PIII-1GHz, 1GB, FC-connected to 2 disks raid0
  UP  - PIII-933MHz, 256MB, FC-connected to 2 disks raid0



The tests dd500 and dd1000 generate 500M and 1000M using dd on fresh fs.
Real time varies run from run because of disks, but sys.time is stable.

  SMPUP
TEST / FS real   sys real   sys
dd500/ext2  - 9.14   2.3319.03  2.50
dd500/ext3  - 12.23  4.0618.99  4.19
dd500/ext3rs- 13.39  4.0115.08  4.16
dd500/ext3i - 9.19   2.3119.07  2.52
dd500/reiser- 7.95   2.8721.23  3.09
dd500/xfs   - 17.88  2.4219.25  2.67

dd1000/ext2 - 37.47  4.6944.59  5.02
dd1000/ext3 - 38.03  7.9040.77  8.31
dd1000/ext3rs   - 34.62  7.9540.51  8.30
dd1000/ext3i- 33.73  4.6538.74  4.93
dd1000/reiser   - 29.29  5.7944.88  6.08
dd1000/xfs  - 37.11  5.0339.98  5.27



The test untar unpacks linux-2.6.0.tar:

  SMP   UP 
TEST / FS real   sys real   sys
untar/ext2  - 9.21   2.4727.31  2.57
untar/ext3  - 15.63  3.3834.54  3.61
untar/ext3rs- 15.91  3.2735.98  3.65
untar/ext3i - 8.33   2.7016.75  2.84
untar/reiser- 13.38  5.4725.88  5.96
untar/xfs   - 44.62  5.6451.92  4.88



The next test is dbench:

   SMP  UP
TEST / FS real   sys MB/s   real   sys MB/s
dbench1/ext2- 5.87   1.4363.72565.93   1.5463.79
dbench1/ext3- 2.46   1.75139.7943.60   1.85131.372
dbench1/ext3rs  - 2.46   1.73140.3783.55   1.85133.071
dbench1/ext3i   - 2.19   1.48156.9762.28   1.53150.403
dbench1/reiser  - 2.80   2.05122.7612.88   2.11119.815
dbench1/xfs - 2.83   1.81122.1592.87   1.91119.564

dbench4/ext2- 4.99   7.13274.2169.96   6.22137.316
dbench4/ext3- 5.79   8.64236.85811.49  7.40130.146
dbench4/ext3rs  - 5.80   8.57236.35611.45  7.38130.467
dbench4/ext3i   - 5.16   7.16265.6219.25   6.31147.872
dbench4/reiser  - 7.11   10.85   192.49111.85  8.59115.392
dbench4/xfs - 6.60   8.88207.59911.98  7.69114.177

dbench8/ext2- 16.04  14.27   181.93 18.32  12.14   149.228
dbench8/ext3- 18.87  17.04   165.21423.77  14.88   119.995
dbench8/ext3rs  - 11.52  17.21   237.64723.35  15.04   123.088
dbench8/ext3i   - 11.20  14.66   268.05221.00  12.49   130.223
dbench8/reiser  - 13.92  21.65   196.30624.98  17.67   114.083
dbench8/xfs - 14.84  18.23   184.26325.84  15.53   105.743

dbench16/ext2   - 20.39  28.79   267.97947.37  25.13   115.582
dbench16/ext3   - 25.69  34.54   212.74553.43  30.78   102.266
dbench16/ext3rs - 24.47  34.36   223.37 54.68  30.44   100.082
dbench16/ext3i  - 22.94  29.40   238.21544.89  25.73   121.962
dbench16/reiser - 28.56  43.86   191.40756.48  36.41   96.9686
dbench16/xfs- 33.49  36.94   163.15978.68  32.56   69.4764

dbench32/ext2   - 43.54  58.87   250.947108.1  51.24   101.354
dbench32/ext3   - 61.83  70.66   176.707139.84 63.77   78.8818
dbench32/ext3rs - 67.24  71.69   162.651145.03 63.38   75.5155
dbench32/ext3i  - 55.29  59.41   197.757100.24 52.50   109.26
dbench32/reiser - 76.32  93.45   143.127128.77 77.94   85.7179
dbench32/xfs- 119.4  88.09   91.4746670.45 76.19   16.298



The followins crazy listing shows tiobench's results for SMP box:

Sequential Reads
  File  Blk   Num   Avg CPU
IdentifierSize  Size  Thr   Rate  (CPU%)  Latency   Eff
 -- - ---  -- -- - -
ext2  512   40961   40.95 12.80% 0.095   320
ext3  512   40961   45.03 13.99% 0.086   322
ext3rs512   40961   50.53 

Re: Latest ext3 patches (extents, mballoc, delayed allocation)

2005-02-11 Thread Alex Tomas

Good day all, 

I've updated the patchset against 2.6.10. A bunch of bugs have been
fixed and mballoc now behaves smarter a bit. Extents and mballoc 
patches collects some stats they print upon umount. NOTE: they must
not be used to store important data. A lot of things are to be done.

Please review. Any comments and suggestions are very welcome.

The patches are too big to send in a message (previous time they got
rejected). I put them in ftp://ftp.clusterfs.com/pub/people/alex/2.6.10

The following names are used:
  ext3rs - ext3 mounted with data=writeback,reservation options
  ext3i  - ext3 mounted with extents,mballoc,delalloc options
  reiser - reiserfs v3

I did some benchmarking. Two systems were used:
  SMP - dual PIII-1GHz, 1GB, FC-connected to 2 disks raid0
  UP  - PIII-933MHz, 256MB, FC-connected to 2 disks raid0



The tests dd500 and dd1000 generate 500M and 1000M using dd on fresh fs.
Real time varies run from run because of disks, but sys.time is stable.

  SMPUP
TEST / FS real   sys real   sys
dd500/ext2  - 9.14   2.3319.03  2.50
dd500/ext3  - 12.23  4.0618.99  4.19
dd500/ext3rs- 13.39  4.0115.08  4.16
dd500/ext3i - 9.19   2.3119.07  2.52
dd500/reiser- 7.95   2.8721.23  3.09
dd500/xfs   - 17.88  2.4219.25  2.67

dd1000/ext2 - 37.47  4.6944.59  5.02
dd1000/ext3 - 38.03  7.9040.77  8.31
dd1000/ext3rs   - 34.62  7.9540.51  8.30
dd1000/ext3i- 33.73  4.6538.74  4.93
dd1000/reiser   - 29.29  5.7944.88  6.08
dd1000/xfs  - 37.11  5.0339.98  5.27



The test untar unpacks linux-2.6.0.tar:

  SMP   UP 
TEST / FS real   sys real   sys
untar/ext2  - 9.21   2.4727.31  2.57
untar/ext3  - 15.63  3.3834.54  3.61
untar/ext3rs- 15.91  3.2735.98  3.65
untar/ext3i - 8.33   2.7016.75  2.84
untar/reiser- 13.38  5.4725.88  5.96
untar/xfs   - 44.62  5.6451.92  4.88



The next test is dbench:

   SMP  UP
TEST / FS real   sys MB/s   real   sys MB/s
dbench1/ext2- 5.87   1.4363.72565.93   1.5463.79
dbench1/ext3- 2.46   1.75139.7943.60   1.85131.372
dbench1/ext3rs  - 2.46   1.73140.3783.55   1.85133.071
dbench1/ext3i   - 2.19   1.48156.9762.28   1.53150.403
dbench1/reiser  - 2.80   2.05122.7612.88   2.11119.815
dbench1/xfs - 2.83   1.81122.1592.87   1.91119.564

dbench4/ext2- 4.99   7.13274.2169.96   6.22137.316
dbench4/ext3- 5.79   8.64236.85811.49  7.40130.146
dbench4/ext3rs  - 5.80   8.57236.35611.45  7.38130.467
dbench4/ext3i   - 5.16   7.16265.6219.25   6.31147.872
dbench4/reiser  - 7.11   10.85   192.49111.85  8.59115.392
dbench4/xfs - 6.60   8.88207.59911.98  7.69114.177

dbench8/ext2- 16.04  14.27   181.93 18.32  12.14   149.228
dbench8/ext3- 18.87  17.04   165.21423.77  14.88   119.995
dbench8/ext3rs  - 11.52  17.21   237.64723.35  15.04   123.088
dbench8/ext3i   - 11.20  14.66   268.05221.00  12.49   130.223
dbench8/reiser  - 13.92  21.65   196.30624.98  17.67   114.083
dbench8/xfs - 14.84  18.23   184.26325.84  15.53   105.743

dbench16/ext2   - 20.39  28.79   267.97947.37  25.13   115.582
dbench16/ext3   - 25.69  34.54   212.74553.43  30.78   102.266
dbench16/ext3rs - 24.47  34.36   223.37 54.68  30.44   100.082
dbench16/ext3i  - 22.94  29.40   238.21544.89  25.73   121.962
dbench16/reiser - 28.56  43.86   191.40756.48  36.41   96.9686
dbench16/xfs- 33.49  36.94   163.15978.68  32.56   69.4764

dbench32/ext2   - 43.54  58.87   250.947108.1  51.24   101.354
dbench32/ext3   - 61.83  70.66   176.707139.84 63.77   78.8818
dbench32/ext3rs - 67.24  71.69   162.651145.03 63.38   75.5155
dbench32/ext3i  - 55.29  59.41   197.757100.24 52.50   109.26
dbench32/reiser - 76.32  93.45   143.127128.77 77.94   85.7179
dbench32/xfs- 119.4  88.09   91.4746670.45 76.19   16.298



The followins crazy listing shows tiobench's results for SMP box:

Sequential Reads
  File  Blk   Num   Avg CPU
IdentifierSize  Size  Thr   Rate  (CPU%)  Latency   Eff
 -- - ---  -- -- - -
ext2  512   40961   40.95 12.80% 0.095   320
ext3  512   40961   45.03 13.99% 0.086   322
ext3rs512   40961   50.53 

Re: [Ext2-devel] [PATCH] JBD: journal_release_buffer()

2005-01-30 Thread Alex Tomas
>>>>> Stephen C Tweedie (SCT) writes:

 SCT> Hi,
 SCT> On Tue, 2005-01-25 at 19:30, Alex Tomas wrote:

 >> >> journal_dirty_metadata(handle, bh)
 >> >> {
 >> >> transaction->t_reserved--;
 >> >> handle->h_buffer_credits--;
 >> >> if (jh->b_tcount > 0) {
 >> >> /* modifed, no need to track it any more */
 >> >>  transaction-> t_outstanding_credits++;
 >> >>jh-> b_tcount = -1;
 >> >>  }
 >> >> }
 >> 
 SCT> Actually, the whole thing can be wrapped in if (jh->b_tcount > 0) {}, I
 SCT> think.

 >> the idea is:
 >> 1) the sooner we drop reservation, the higher probability to cover many
 >> changes by single transaction

 SCT> But that's exactly why we _don't_ want to do this.  You're dropping the
 SCT> reservation, but remember, we return unused handle credits to the
 SCT> transaction at the end of the handle's life.

yup, you're right. so, here is an implementation of this.
tested on UP/SMP by dbench/fsx. Stephen, Andrew, could you
review the patch and give it a run? 

thanks, Alex


fix against credits leak in journal_release_buffer()

The idea is to charge a buffer in journal_dirty_metadata(),
not in journal_get_*_access()). Each buffer has flag call
journal_dirty_metadata() sets on the buffer.

Signed-off-by: Alex Tomas <[EMAIL PROTECTED]>

Index: linux-2.6.10/include/linux/journal-head.h
===
--- linux-2.6.10.orig/include/linux/journal-head.h  2003-06-24 
18:05:26.0 +0400
+++ linux-2.6.10/include/linux/journal-head.h   2005-01-29 03:20:05.0 
+0300
@@ -32,6 +32,13 @@
unsigned b_jlist;
 
/*
+* This flag signals the buffer has been modified by
+* the currently running transaction
+* [jbd_lock_bh_state()]
+*/
+   unsigned b_modified;
+
+   /*
 * Copy of the buffer data frozen for writing to the log.
 * [jbd_lock_bh_state()]
 */
Index: linux-2.6.10/include/linux/ext3_jbd.h
===
--- linux-2.6.10.orig/include/linux/ext3_jbd.h  2005-01-28 19:32:15.0 
+0300
+++ linux-2.6.10/include/linux/ext3_jbd.h   2005-01-29 03:13:41.0 
+0300
@@ -113,9 +113,9 @@
 
 static inline int
 __ext3_journal_get_undo_access(const char *where, handle_t *handle,
-   struct buffer_head *bh, int *credits)
+   struct buffer_head *bh)
 {
-   int err = journal_get_undo_access(handle, bh, credits);
+   int err = journal_get_undo_access(handle, bh);
if (err)
ext3_journal_abort_handle(where, __FUNCTION__, bh, handle,err);
return err;
@@ -123,19 +123,18 @@
 
 static inline int
 __ext3_journal_get_write_access(const char *where, handle_t *handle,
-   struct buffer_head *bh, int *credits)
+   struct buffer_head *bh)
 {
-   int err = journal_get_write_access(handle, bh, credits);
+   int err = journal_get_write_access(handle, bh);
if (err)
ext3_journal_abort_handle(where, __FUNCTION__, bh, handle,err);
return err;
 }
 
 static inline void
-ext3_journal_release_buffer(handle_t *handle, struct buffer_head *bh,
-   int credits)
+ext3_journal_release_buffer(handle_t *handle, struct buffer_head *bh)
 {
-   journal_release_buffer(handle, bh, credits);
+   journal_release_buffer(handle, bh);
 }
 
 static inline void
@@ -175,12 +174,10 @@
 }
 
 
-#define ext3_journal_get_undo_access(handle, bh, credits) \
-   __ext3_journal_get_undo_access(__FUNCTION__, (handle), (bh), (credits))
+#define ext3_journal_get_undo_access(handle, bh) \
+   __ext3_journal_get_undo_access(__FUNCTION__, (handle), (bh))
 #define ext3_journal_get_write_access(handle, bh) \
-   __ext3_journal_get_write_access(__FUNCTION__, (handle), (bh), NULL)
-#define ext3_journal_get_write_access_credits(handle, bh, credits) \
-   __ext3_journal_get_write_access(__FUNCTION__, (handle), (bh), (credits))
+   __ext3_journal_get_write_access(__FUNCTION__, (handle), (bh))
 #define ext3_journal_revoke(handle, blocknr, bh) \
__ext3_journal_revoke(__FUNCTION__, (handle), (blocknr), (bh))
 #define ext3_journal_get_create_access(handle, bh) \
Index: linux-2.6.10/include/linux/jbd.h
===
--- linux-2.6.10.orig/include/linux/jbd.h   2005-01-28 19:32:15.0 
+0300
+++ linux-2.6.10/include/linux/jbd.h2005-01-29 03:13:41.0 +0300
@@ -865,15 +865,12 @@
 extern handle_t *journal_start(journal_t *, int nblocks);
 extern int  journal_restart (handle_t *, int nblocks);
 extern int  journal_exte

Re: [Ext2-devel] [PATCH] JBD: journal_release_buffer()

2005-01-25 Thread Alex Tomas
> Stephen C Tweedie (SCT) writes:

 >> journal_dirty_metadata(handle, bh)
 >> {
 >> transaction->t_reserved--;
 >> handle->h_buffer_credits--;
 >> if (jh->b_tcount > 0) {
 >> /* modifed, no need to track it any more */
 >>  transaction-> t_outstanding_credits++;
 >>jh-> b_tcount = -1;
 >>  }
 >> }

 SCT> Actually, the whole thing can be wrapped in if (jh->b_tcount > 0) {}, I
 SCT> think.  If we have already charged the transaction for this buffer, then
 SCT> there's no need to charge the handle for it again.  That's going to be
 SCT> particularly important for truncate(), where we are continually updating
 SCT> the same blocks (eg. bitmap, indirect blocks), so we want to make sure
 SCT> that after the first journal_dirty_metadata() call, no further charge is
 SCT> made.

the idea is:
1) the sooner we drop reservation, the higher probability to cover many
   changes by single transaction
1) having h_buffer_credits being decremented for each modification
   could help us to debug handle overflow situations

 SCT> If we do that, do we in fact need t_reserved at all?

hmm. if t_outstanding_credits holds number of modified buffers,
then we need sum of all running h_buffer_credits to protect
from transaction overflow. t_reserved is sum of h_buffer_credits.


thanks, Alex

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Ext2-devel] [PATCH] JBD: journal_release_buffer()

2005-01-25 Thread Alex Tomas

Hi, could you review the following solution?


 t_outstanding_credits - number of _modified_ blocks in the transaction
 t_reserved - number of blocks all running handle reserved

 transaction size = t_outstanding_credits + t_reserved;

 



#define TSIZE(t)((t)->t_outstanding_credits + (t)->t_reserved)

journal_start(blocks)
{
if (TSIZE(transaction) + blocks > MAX)
wait_for_space(journal);

transaction->t_reserved += blocks;
handle->h_buffer_credits = blocks;
}


journal_get_write_access(handle, bh)
{
if (jh->b_tcount >= 0)
jh->b_tcount++;
}

journal_dirty_metadata(handle, bh)
{
transaction->t_reserved--;
handle->h_buffer_credits--;
if (jh->b_tcount > 0) {
/* modifed, no need to track it any more */
transaction->t_outstanding_credits++;
jh->b_tcount = -1;
}
}

journal_release_buffer(handle, bh)
{
if (jh->b_tcount > 0) {
/* it's not modified yet */
jh->b_tcount--;
if (jh->b_tcount == 0) {
   /* remove from the transaction */
}
}
}

journal_stop(handle)
{
transaction->t_outstanding_credits -= handle->h_buffer_credits;
}


thanks, Alex


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Ext2-devel] [PATCH] JBD: journal_release_buffer()

2005-01-25 Thread Alex Tomas

Hi, could you review the following solution?


 t_outstanding_credits - number of _modified_ blocks in the transaction
 t_reserved - number of blocks all running handle reserved

 transaction size = t_outstanding_credits + t_reserved;

 



#define TSIZE(t)((t)-t_outstanding_credits + (t)-t_reserved)

journal_start(blocks)
{
if (TSIZE(transaction) + blocks  MAX)
wait_for_space(journal);

transaction-t_reserved += blocks;
handle-h_buffer_credits = blocks;
}


journal_get_write_access(handle, bh)
{
if (jh-b_tcount = 0)
jh-b_tcount++;
}

journal_dirty_metadata(handle, bh)
{
transaction-t_reserved--;
handle-h_buffer_credits--;
if (jh-b_tcount  0) {
/* modifed, no need to track it any more */
transaction-t_outstanding_credits++;
jh-b_tcount = -1;
}
}

journal_release_buffer(handle, bh)
{
if (jh-b_tcount  0) {
/* it's not modified yet */
jh-b_tcount--;
if (jh-b_tcount == 0) {
   /* remove from the transaction */
}
}
}

journal_stop(handle)
{
transaction-t_outstanding_credits -= handle-h_buffer_credits;
}


thanks, Alex


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Ext2-devel] [PATCH] JBD: journal_release_buffer()

2005-01-25 Thread Alex Tomas
 Stephen C Tweedie (SCT) writes:

  journal_dirty_metadata(handle, bh)
  {
  transaction-t_reserved--;
  handle-h_buffer_credits--;
  if (jh-b_tcount  0) {
  /* modifed, no need to track it any more */
   transaction- t_outstanding_credits++;
 jh- b_tcount = -1;
   }
  }

 SCT Actually, the whole thing can be wrapped in if (jh-b_tcount  0) {}, I
 SCT think.  If we have already charged the transaction for this buffer, then
 SCT there's no need to charge the handle for it again.  That's going to be
 SCT particularly important for truncate(), where we are continually updating
 SCT the same blocks (eg. bitmap, indirect blocks), so we want to make sure
 SCT that after the first journal_dirty_metadata() call, no further charge is
 SCT made.

the idea is:
1) the sooner we drop reservation, the higher probability to cover many
   changes by single transaction
1) having h_buffer_credits being decremented for each modification
   could help us to debug handle overflow situations

 SCT If we do that, do we in fact need t_reserved at all?

hmm. if t_outstanding_credits holds number of modified buffers,
then we need sum of all running h_buffer_credits to protect
from transaction overflow. t_reserved is sum of h_buffer_credits.


thanks, Alex

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Ext2-devel] [PATCH] JBD: journal_release_buffer()

2005-01-24 Thread Alex Tomas
> Stephen C Tweedie (SCT) writes:

 >> +   /* return credit back to the handle if it was really spent */
 >> +   if (credits) {
 >> +   handle->h_buffer_credits++; 
 >> +  spin_lock(>h_transaction->t_handle_lock);
 >> +  handle->h_transaction->t_outstanding_credits++;
 >> +  spin_lock(>h_transaction->t_handle_lock);
 >> +  }

 SCT> That returns the credit to A (satisfying ext3), but you just grew
 SCT> t_outstanding_credits, thus growing the journal commitments without
 SCT> checking if it's safe to do so or being able to handle failure.  So it
 SCT> just reintroduces the original bug.

incremented h_buffer_credits will be subtracted from incremented
t_outstanding_credits in journal_stop(). so, there is no imbalance
at this point. 

then, if (b_tcount == 0) we can correct t_outstanind_credits or
h_buffer_credits. wrong?

let's try another way ... we have two processes: P1 and P2. they
access block B.

the code is:
if (credits != 0) {
   handle->h_buffer_credits++;
   transaction->t_outstanding_credits++;
}
if (jh->b_tcount == 0)
   transcation->t_outstanding_credits--;

case 1:
 P1 accesses B (*credits=1)
 P1 releases B

(credits != 0) h1->h_buffer_credits++;
(credits != 0) transaction->t_outstanding_credits++;
(b_tcount == 0) transaction->t_outstanding_credits--;

OUTPUT: transaction->t_outstanding_credits -= 1


case 2:
 P1 accesses B (*credits=1)
 P2 accesses B (*credits=0)
 P1 releases B
 P2 modifies B

(credits != 0) h1->h_buffer_credits++;
(credits != 0) transaction->t_outstainding_credits++;
(b_tcount != 0)

 OUTPUT: journal_stop() will subtract incremented h_buffer_credits
 from incremented t_outstading_credits => no changes


case 3:
 P1 accesses B (*credits=1)
 P2 accesses B (*credits=0)
 P2 releases B
 P1 modifies B

(credits != 0)
(credits != 0)
(b_tcount == 0)

 OUTPUT: no changes


case 4:
 P1 accesses B (*credits=1)
 P2 accesses B (*credits=0)
 P2 releases B
 P1 releases B

(credits != 0) 
(credits != 0)
(b_tcount == 0)

(credits != 0) h1->h_buffer_credits++;
(credits != 0) transaction->t_outstanding_credits++;
(b_tcount == 0) transaction->t_outstanding_credits--;

 OUTPUT: P2 will change nothing, P1 will drop the buffer and
 correct t_outstanding_credits


case 5:
 P1 accesses B (*credits=1)
 P2 accesses B (*credits=0)
 P1 releases B
 P2 releases B

(credits != 0) h1->h_buffer_credits++;
(credits != 0) transaction->t_outstanding_credits++;
(b_tcount == 0)

(credits != 0)
(credits != 0)
(b_tcount == 0) transaction->t_outstanding_credits--;

 OUTPUT: P1 will change own credits, P2 will drop the buffer
 and correct t_outstanding_credits



thanks, Alex

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Ext2-devel] [PATCH] JBD: log space management optimization

2005-01-24 Thread Alex Tomas
>>>>> Stephen C Tweedie (SCT) writes:

 SCT> If you returned them to the handle directly, it would be slightly more
 SCT> efficient.

good point. thanks. here is the fixed patch.


during truncate ext3 calls journal_forget() for freed blocks, but
before these blocks go to the transaction and jbd reserves space
in log for them (->t_outstanding_credits). also, journal_forget()
removes these blocks from the transaction, but doesn't correct
log space reservation. for example, removal of 500MB file reserves
136 blocks, but only 10 blocks go to the log. a commit is expensive
and correct reservation allows us to avoid needless commits. here
is the patch. tested on UP.



Signed-off-by: Alex Tomas <[EMAIL PROTECTED]>
Index: linux-2.6.7/fs/jbd/transaction.c
===
--- linux-2.6.7.orig/fs/jbd/transaction.c   2004-08-26 17:12:40.0 
+0400
+++ linux-2.6.7/fs/jbd/transaction.c2005-01-24 22:51:34.0 +0300
@@ -1204,6 +1204,7 @@
transaction_t *transaction = handle->h_transaction;
journal_t *journal = transaction->t_journal;
struct journal_head *jh;
+   int drop_reserve = 0;
 
BUFFER_TRACE(bh, "entry");
 
@@ -1227,6 +1228,7 @@
J_ASSERT_JH(jh, !jh->b_committed_data);
 
__journal_unfile_buffer(jh);
+   drop_reserve = 1;
 
/* 
 * We are no longer going to journal this buffer.
@@ -1249,7 +1251,7 @@
spin_unlock(>j_list_lock);
jbd_unlock_bh_state(bh);
__bforget(bh);
-   return;
+   goto drop;
}
}
} else if (jh->b_transaction) {
@@ -1264,6 +1266,7 @@
if (jh->b_next_transaction) {
J_ASSERT(jh->b_next_transaction == transaction);
jh->b_next_transaction = NULL;
+   drop_reserve = 1;
}
}
 
@@ -1271,6 +1274,13 @@
spin_unlock(>j_list_lock);
jbd_unlock_bh_state(bh);
__brelse(bh);
+
+drop:
+   if (drop_reserve) {
+   /* no need to reserve log space for this block -bzzz */
+   handle->h_buffer_credits++;
+   }
+
return;
 }
 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Ext2-devel] [PATCH] JBD: journal_release_buffer()

2005-01-24 Thread Alex Tomas
> Stephen C Tweedie (SCT) writes:

 >> +   /* return credit back to the handle if it was really spent */
 >> +   if (credits)
 >> +   handle->h_buffer_credits++; 

 >> +   jh->b_tcount--;
 >> +   if (jh->b_tcount == 0) {
 >> +   /* 
 >> +* this was last reference to the block from the current
 >> +* transaction and we'd like to return credit to the
 >> +* whole transaction -bzzz
 >> +*/
 >> +   if (!credits)
 >> +   handle->h_buffer_credits++; 

 SCT> I think there's a problem here.

 SCT> What if:
 SCT>   Process A gets write access, and is the first to do so (*credits=1)
 SCT>   Processes B gets write access (*credits=0)
 SCT>   B modifies the buffer and finishes
 SCT>   A looks again, sees B's modifications, and releases the buffer because
 SCT> it's no use any more.

 SCT> Now, B's release didn't return credits.  The bh is part of the
 SCT> transaction and was not released.

hmmm. that's a good catch. so, with this patch A increments h_buffer_credits
and this one will go to the t_outstanding_credits while the buffer is still
part of the transaction. indeed, an imbalance.

probably something like the following would be enough?

 +  /* return credit back to the handle if it was really spent */
 +  if (credits) {
 +  handle->h_buffer_credits++; 
 +  spin_lock(>h_transaction->t_handle_lock);
 +  handle->h_transaction->t_outstanding_credits++;
 +  spin_lock(>h_transaction->t_handle_lock);
 +  }

thanks, Alex


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Ext2-devel] [PATCH] JBD: fix against journal overflow

2005-01-24 Thread Alex Tomas
> Stephen C Tweedie (SCT) writes:

 SCT>   /*
 SCT>* Be pessimistic here about the number of those free blocks which
 SCT>* might be required for log descriptor control blocks.
 SCT>*/
 SCT>   ...
 SCT>   left -= (left >> 3);

oops. i overlooked this line. so, the fix becomes minor improvement patch ;)
thanks!

do you think the following can be improved?

/*
 * @@@ AKPM: This seems rather over-defensive.  We're giving commit
 * a _lot_ of headroom: 1/4 of the journal plus the size of
 * the committing transaction.  Really, we only need to give it
 * committing_transaction->t_outstanding_credits plus "enough" for
 * the log control blocks.
 * Also, this test is inconsitent with the matching one in
 * journal_extend().
 */
if (__log_space_left(journal) < jbd_space_needed(journal)) {


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Ext2-devel] [PATCH] JBD: fix against journal overflow

2005-01-24 Thread Alex Tomas
> Stephen C Tweedie (SCT) writes:
 SCT> I don't see how that "limit" is relevant here.  wbuf is nothing but the
 SCT> size of the IO batches we pass to ll_rw_block() during that commit
 SCT> phase.  j_free affects the total size of space the *entire* commit has
 SCT> to run into, and (as akpm has commented with a big marker beside it)
 SCT> start_this_handle() reserves a *lot* of headroom for the extra space
 SCT> that may be needed for transaction metadata.



/* If there's no more to do, or if the descriptor is full,
   let the IO rip! */

if (bufs == ARRAY_SIZE(wbuf) ||
commit_transaction->t_buffers == NULL ||
space_left < sizeof(journal_block_tag_t) + 16) {



/* Force a new descriptor to be generated next
   time round the loop. */
descriptor = NULL;
bufs = 0;

^^^


 SCT> Have you really seen this patch make a difference in testing?

of course


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Ext2-devel] [PATCH] JBD: fix against journal overflow

2005-01-24 Thread Alex Tomas
 Stephen C Tweedie (SCT) writes:
 SCT I don't see how that limit is relevant here.  wbuf is nothing but the
 SCT size of the IO batches we pass to ll_rw_block() during that commit
 SCT phase.  j_free affects the total size of space the *entire* commit has
 SCT to run into, and (as akpm has commented with a big marker beside it)
 SCT start_this_handle() reserves a *lot* of headroom for the extra space
 SCT that may be needed for transaction metadata.



/* If there's no more to do, or if the descriptor is full,
   let the IO rip! */

if (bufs == ARRAY_SIZE(wbuf) ||
commit_transaction-t_buffers == NULL ||
space_left  sizeof(journal_block_tag_t) + 16) {



/* Force a new descriptor to be generated next
   time round the loop. */
descriptor = NULL;
bufs = 0;

^^^


 SCT Have you really seen this patch make a difference in testing?

of course


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Ext2-devel] [PATCH] JBD: fix against journal overflow

2005-01-24 Thread Alex Tomas
 Stephen C Tweedie (SCT) writes:

 SCT   /*
 SCT* Be pessimistic here about the number of those free blocks which
 SCT* might be required for log descriptor control blocks.
 SCT*/
 SCT   ...
 SCT   left -= (left  3);

oops. i overlooked this line. so, the fix becomes minor improvement patch ;)
thanks!

do you think the following can be improved?

/*
 * @@@ AKPM: This seems rather over-defensive.  We're giving commit
 * a _lot_ of headroom: 1/4 of the journal plus the size of
 * the committing transaction.  Really, we only need to give it
 * committing_transaction-t_outstanding_credits plus enough for
 * the log control blocks.
 * Also, this test is inconsitent with the matching one in
 * journal_extend().
 */
if (__log_space_left(journal)  jbd_space_needed(journal)) {


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Ext2-devel] [PATCH] JBD: journal_release_buffer()

2005-01-24 Thread Alex Tomas
 Stephen C Tweedie (SCT) writes:

  +   /* return credit back to the handle if it was really spent */
  +   if (credits)
  +   handle-h_buffer_credits++; 

  +   jh-b_tcount--;
  +   if (jh-b_tcount == 0) {
  +   /* 
  +* this was last reference to the block from the current
  +* transaction and we'd like to return credit to the
  +* whole transaction -bzzz
  +*/
  +   if (!credits)
  +   handle-h_buffer_credits++; 

 SCT I think there's a problem here.

 SCT What if:
 SCT   Process A gets write access, and is the first to do so (*credits=1)
 SCT   Processes B gets write access (*credits=0)
 SCT   B modifies the buffer and finishes
 SCT   A looks again, sees B's modifications, and releases the buffer because
 SCT it's no use any more.

 SCT Now, B's release didn't return credits.  The bh is part of the
 SCT transaction and was not released.

hmmm. that's a good catch. so, with this patch A increments h_buffer_credits
and this one will go to the t_outstanding_credits while the buffer is still
part of the transaction. indeed, an imbalance.

probably something like the following would be enough?

 +  /* return credit back to the handle if it was really spent */
 +  if (credits) {
 +  handle-h_buffer_credits++; 
 +  spin_lock(handle-h_transaction-t_handle_lock);
 +  handle-h_transaction-t_outstanding_credits++;
 +  spin_lock(handle-h_transaction-t_handle_lock);
 +  }

thanks, Alex


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Ext2-devel] [PATCH] JBD: log space management optimization

2005-01-24 Thread Alex Tomas
 Stephen C Tweedie (SCT) writes:

 SCT If you returned them to the handle directly, it would be slightly more
 SCT efficient.

good point. thanks. here is the fixed patch.


during truncate ext3 calls journal_forget() for freed blocks, but
before these blocks go to the transaction and jbd reserves space
in log for them (-t_outstanding_credits). also, journal_forget()
removes these blocks from the transaction, but doesn't correct
log space reservation. for example, removal of 500MB file reserves
136 blocks, but only 10 blocks go to the log. a commit is expensive
and correct reservation allows us to avoid needless commits. here
is the patch. tested on UP.



Signed-off-by: Alex Tomas [EMAIL PROTECTED]
Index: linux-2.6.7/fs/jbd/transaction.c
===
--- linux-2.6.7.orig/fs/jbd/transaction.c   2004-08-26 17:12:40.0 
+0400
+++ linux-2.6.7/fs/jbd/transaction.c2005-01-24 22:51:34.0 +0300
@@ -1204,6 +1204,7 @@
transaction_t *transaction = handle-h_transaction;
journal_t *journal = transaction-t_journal;
struct journal_head *jh;
+   int drop_reserve = 0;
 
BUFFER_TRACE(bh, entry);
 
@@ -1227,6 +1228,7 @@
J_ASSERT_JH(jh, !jh-b_committed_data);
 
__journal_unfile_buffer(jh);
+   drop_reserve = 1;
 
/* 
 * We are no longer going to journal this buffer.
@@ -1249,7 +1251,7 @@
spin_unlock(journal-j_list_lock);
jbd_unlock_bh_state(bh);
__bforget(bh);
-   return;
+   goto drop;
}
}
} else if (jh-b_transaction) {
@@ -1264,6 +1266,7 @@
if (jh-b_next_transaction) {
J_ASSERT(jh-b_next_transaction == transaction);
jh-b_next_transaction = NULL;
+   drop_reserve = 1;
}
}
 
@@ -1271,6 +1274,13 @@
spin_unlock(journal-j_list_lock);
jbd_unlock_bh_state(bh);
__brelse(bh);
+
+drop:
+   if (drop_reserve) {
+   /* no need to reserve log space for this block -bzzz */
+   handle-h_buffer_credits++;
+   }
+
return;
 }
 

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Ext2-devel] [PATCH] JBD: journal_release_buffer()

2005-01-24 Thread Alex Tomas
 Stephen C Tweedie (SCT) writes:

  +   /* return credit back to the handle if it was really spent */
  +   if (credits) {
  +   handle-h_buffer_credits++; 
  +  spin_lock(handle-h_transaction-t_handle_lock);
  +  handle-h_transaction-t_outstanding_credits++;
  +  spin_lock(handle-h_transaction-t_handle_lock);
  +  }

 SCT That returns the credit to A (satisfying ext3), but you just grew
 SCT t_outstanding_credits, thus growing the journal commitments without
 SCT checking if it's safe to do so or being able to handle failure.  So it
 SCT just reintroduces the original bug.

incremented h_buffer_credits will be subtracted from incremented
t_outstanding_credits in journal_stop(). so, there is no imbalance
at this point. 

then, if (b_tcount == 0) we can correct t_outstanind_credits or
h_buffer_credits. wrong?

let's try another way ... we have two processes: P1 and P2. they
access block B.

the code is:
if (credits != 0) {
   handle-h_buffer_credits++;
   transaction-t_outstanding_credits++;
}
if (jh-b_tcount == 0)
   transcation-t_outstanding_credits--;

case 1:
 P1 accesses B (*credits=1)
 P1 releases B

(credits != 0) h1-h_buffer_credits++;
(credits != 0) transaction-t_outstanding_credits++;
(b_tcount == 0) transaction-t_outstanding_credits--;

OUTPUT: transaction-t_outstanding_credits -= 1


case 2:
 P1 accesses B (*credits=1)
 P2 accesses B (*credits=0)
 P1 releases B
 P2 modifies B

(credits != 0) h1-h_buffer_credits++;
(credits != 0) transaction-t_outstainding_credits++;
(b_tcount != 0)

 OUTPUT: journal_stop() will subtract incremented h_buffer_credits
 from incremented t_outstading_credits = no changes


case 3:
 P1 accesses B (*credits=1)
 P2 accesses B (*credits=0)
 P2 releases B
 P1 modifies B

(credits != 0)
(credits != 0)
(b_tcount == 0)

 OUTPUT: no changes


case 4:
 P1 accesses B (*credits=1)
 P2 accesses B (*credits=0)
 P2 releases B
 P1 releases B

(credits != 0) 
(credits != 0)
(b_tcount == 0)

(credits != 0) h1-h_buffer_credits++;
(credits != 0) transaction-t_outstanding_credits++;
(b_tcount == 0) transaction-t_outstanding_credits--;

 OUTPUT: P2 will change nothing, P1 will drop the buffer and
 correct t_outstanding_credits


case 5:
 P1 accesses B (*credits=1)
 P2 accesses B (*credits=0)
 P1 releases B
 P2 releases B

(credits != 0) h1-h_buffer_credits++;
(credits != 0) transaction-t_outstanding_credits++;
(b_tcount == 0)

(credits != 0)
(credits != 0)
(b_tcount == 0) transaction-t_outstanding_credits--;

 OUTPUT: P1 will change own credits, P2 will drop the buffer
 and correct t_outstanding_credits



thanks, Alex

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] JBD: log space management optimization

2005-01-19 Thread Alex Tomas

Good day,

during truncate ext3 calls journal_forget() for freed blocks, but
before these blocks go to the transaction and jbd reserves space
in log for them (->t_outstanding_credits). also, journal_forget()
removes these blocks from the transaction, but doesn't correct
log space reservation. for example, removal of 500MB file reserves
136 blocks, but only 10 blocks go to the log. a commit is expensive
and correct reservation allows us to avoid needless commits. here
is the patch. tested on UP.

thanks, Alex


Signed-off-by: Alex Tomas <[EMAIL PROTECTED]>

Index: linux-2.6.7/fs/jbd/transaction.c
===
--- linux-2.6.7.orig/fs/jbd/transaction.c   2004-08-26 17:12:40.0 
+0400
+++ linux-2.6.7/fs/jbd/transaction.c2005-01-19 17:23:30.058160408 +0300
@@ -1204,6 +1257,7 @@
transaction_t *transaction = handle->h_transaction;
journal_t *journal = transaction->t_journal;
struct journal_head *jh;
+   int drop_reserve = 0;
 
BUFFER_TRACE(bh, "entry");
 
@@ -1227,6 +1281,7 @@
J_ASSERT_JH(jh, !jh->b_committed_data);
 
__journal_unfile_buffer(jh);
+   drop_reserve = 1;
 
/* 
 * We are no longer going to journal this buffer.
@@ -1249,7 +1304,7 @@
spin_unlock(>j_list_lock);
jbd_unlock_bh_state(bh);
__bforget(bh);
-   return;
+   goto drop;
}
}
} else if (jh->b_transaction) {
@@ -1264,6 +1319,7 @@
if (jh->b_next_transaction) {
J_ASSERT(jh->b_next_transaction == transaction);
jh->b_next_transaction = NULL;
+   drop_reserve = 1;
}
}
 
@@ -1271,6 +1327,15 @@
spin_unlock(>j_list_lock);
jbd_unlock_bh_state(bh);
__brelse(bh);
+
+drop:
+   if (drop_reserve) {
+   /* no need to reserve log space for this block -bzzz */
+   spin_lock(>t_handle_lock);
+   transaction->t_outstanding_credits--;
+   spin_unlock(>t_handle_lock);
+   }
+
return;
 }
 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] JBD: journal_release_buffer()

2005-01-19 Thread Alex Tomas

Good day,

journal_release_buffer() can cause journal overflow in some
(very rare) conditions. Please, take a look at possible fix.

Th journal_head structure holds number of users of the buffer
for current transaction. The routines do_get_write_access()
and journal_get_create_access() tracks this number:
1) resets it to zero if the block's becoming part of the current
   transaction
2) increments it

journal_release_buffer() decrements it and if it's 0, then the
blocks isn't member of the transaction.

The patch has been tested on UP with dbench and tool that
uses xattr very much. 


Signed-off-by: Alex Tomas <[EMAIL PROTECTED]>

Index: linux-2.6.7/include/linux/journal-head.h
===
--- linux-2.6.7.orig/include/linux/journal-head.h   2003-06-24 
18:05:26.0 +0400
+++ linux-2.6.7/include/linux/journal-head.h2005-01-19 14:09:59.0 
+0300
@@ -80,6 +80,11 @@
 * [j_list_lock]
 */
struct journal_head *b_cpnext, *b_cpprev;
+
+   /*
+* counter to track users of the buffer in current transaction
+*/
+   int b_tcount;
 };
 
 #endif /* JOURNAL_HEAD_H_INCLUDED */
Index: linux-2.6.7/fs/jbd/transaction.c
===
--- linux-2.6.7.orig/fs/jbd/transaction.c   2004-08-26 17:12:40.0 
+0400
+++ linux-2.6.7/fs/jbd/transaction.c2005-01-19 17:23:30.058160408 +0300
@@ -611,6 +611,10 @@
handle->h_buffer_credits--;
if (credits)
(*credits)++;
+
+   /* the block's becoming member of the trasaction -bzzz */
+   jh->b_tcount = 0;
+
goto done;
}
 
@@ -694,6 +698,9 @@
if (credits)
(*credits)++;
 
+   /* the block's becoming member of the trasaction -bzzz */
+   jh->b_tcount = 0;
+
/*
 * Finally, if the buffer is not journaled right now, we need to make
 * sure it doesn't get written to disk before the caller actually
@@ -723,6 +730,11 @@
memcpy(jh->b_frozen_data, source+offset, jh2bh(jh)->b_size);
kunmap_atomic(source, KM_USER0);
}
+
+   /* track all references to the block to be able to recognize the
+* situation when the buffer is not part of transaction -bzzz */
+   jh->b_tcount++;
+
jbd_unlock_bh_state(bh);
 
/*
@@ -822,11 +834,20 @@
jh->b_transaction = transaction;
JBUFFER_TRACE(jh, "file as BJ_Reserved");
__journal_file_buffer(jh, transaction, BJ_Reserved);
+   jh->b_tcount = 0;
} else if (jh->b_transaction == journal->j_committing_transaction) {
JBUFFER_TRACE(jh, "set next transaction");
jh->b_next_transaction = transaction;
+   jh->b_tcount = 0;
}
spin_unlock(>j_list_lock);
+
+   /*
+* track all reference to the block to be able to recognize
+* the situation when the buffer is not part of transaction -bzzz
+*/
+   jh->b_tcount++;
+
jbd_unlock_bh_state(bh);
 
/*
@@ -1178,8 +1199,40 @@
 void
 journal_release_buffer(handle_t *handle, struct buffer_head *bh, int credits)
 {
+   journal_t *journal = handle->h_transaction->t_journal;
+   struct journal_head *jh = bh2jh(bh);
+
BUFFER_TRACE(bh, "entry");
-   handle->h_buffer_credits += credits;
+
+   /* return credit back to the handle if it was really spent */
+   if (credits)
+   handle->h_buffer_credits++; 
+
+   jbd_lock_bh_state(bh);
+   J_ASSERT(jh->b_tcount > 0);
+
+   jh->b_tcount--;
+   if (jh->b_tcount == 0) {
+   /* we can drop it from the transaction -bzzz */
+   J_ASSERT(jh->b_transaction == handle->h_transaction ||
+   jh->b_next_transaction == 
handle->h_transaction);
+   if (jh->b_transaction == handle->h_transaction) {
+   spin_lock(>j_list_lock);
+   __journal_unfile_buffer(jh);
+   spin_unlock(>j_list_lock);
+   } else if(jh->b_next_transaction) {
+   jh->b_next_transaction = NULL;
+   }
+
+   /* 
+* this was last reference to the block from the current
+* transaction and we'd like to return credit to the
+* whole transaction -bzzz
+*/
+   if (!credits)
+   handle->h_buffer_credits++; 
+   }
+   jbd_unlock_bh_state(bh);
 }
 
 /** 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] JBD: fix against journal overflow

2005-01-19 Thread Alex Tomas

Good day,

under some quite high load, jbd can hit J_ASSERT(journal->j_free > 1)
in journal_next_log_block(). The cause is the following:

journal_commit_transaction()
{
struct buffer_head *wbuf[64];



/* If there's no more to do, or if the descriptor is full,
   let the IO rip! */

if (bufs == ARRAY_SIZE(wbuf) ||
commit_transaction->t_buffers == NULL ||
space_left < sizeof(journal_block_tag_t) + 16) {

so, the real limit isn't size of journal descriptor, but wbuf.

__log_space_left()
{
/*
 * Be pessimistic here about the number of those free blocks which
 * might be required for log descriptor control blocks.
 */

#define MIN_LOG_RESERVED_BLOCKS 32 /* Allow for rounding errors */

left -= MIN_LOG_RESERVED_BLOCKS;


so, jbd expects upto 32 blocks to be used for internal purpose. but
with 64 blocks a descriptor limit we easily can break the limit.

The fix allocates wbuf in journal_init_dev(). That's enough because
we have only commit thread per filesystem.

thank, Alex


The journal_commit_transaction() generates too many descriptor blocks
because static array wbuf can hold 64 blocks only. The fix is to have
persistent array big enough to hold max. possible blocks.

Signed-off-by: Alex Tomas <[EMAIL PROTECTED]>

Index: linux-2.6.7/include/linux/jbd.h
===
--- linux-2.6.7.orig/include/linux/jbd.h2004-08-26 17:12:16.0 
+0400
+++ linux-2.6.7/include/linux/jbd.h 2005-01-19 17:08:33.144512008 +0300
@@ -826,6 +826,12 @@
struct jbd_revoke_table_s *j_revoke_table[2];
 
/*
+* array of bhs for journal_commit_transaction
+*/
+   struct buffer_head  **j_wbuf;
+   int j_wbufsize;
+
+   /*
 * An opaque pointer to fs-private information.  ext3 puts its
 * superblock pointer here
 */
Index: linux-2.6.7/fs/jbd/commit.c
===
--- linux-2.6.7.orig/fs/jbd/commit.c2004-08-26 17:12:40.0 +0400
+++ linux-2.6.7/fs/jbd/commit.c 2005-01-19 17:28:32.965111552 +0300
@@ -103,7 +103,7 @@
 {
transaction_t *commit_transaction;
struct journal_head *jh, *new_jh, *descriptor;
-   struct buffer_head *wbuf[64];
+   struct buffer_head **wbuf = journal->j_wbuf;
int bufs;
int flags;
int err;
@@ -271,7 +283,7 @@
BUFFER_TRACE(bh, "start journal writeout");
get_bh(bh);
wbuf[bufs++] = bh;
-   if (bufs == ARRAY_SIZE(wbuf)) {
+   if (bufs == journal->j_wbufsize) {
jbd_debug(2, "submit %d writes\n",
bufs);
spin_unlock(>j_list_lock);
@@ -488,7 +500,7 @@
/* If there's no more to do, or if the descriptor is full,
   let the IO rip! */
 
-   if (bufs == ARRAY_SIZE(wbuf) ||
+   if (bufs == journal->j_wbufsize ||
commit_transaction->t_buffers == NULL ||
space_left < sizeof(journal_block_tag_t) + 16) {
 
Index: linux-2.6.7/fs/jbd/journal.c
===
--- linux-2.6.7.orig/fs/jbd/journal.c   2005-01-19 12:07:59.0 +0300
+++ linux-2.6.7/fs/jbd/journal.c2005-01-19 17:11:08.589880720 +0300
@@ -687,6 +687,7 @@
 {
journal_t *journal = journal_init_common();
struct buffer_head *bh;
+   int n;
 
if (!journal)
return NULL;
@@ -702,6 +703,17 @@
journal->j_sb_buffer = bh;
journal->j_superblock = (journal_superblock_t *)bh->b_data;
 
+   /* journal descriptor can store upto n blocks -bzzz */
+   n = journal->j_blocksize / sizeof(journal_block_tag_t);
+   journal->j_wbufsize = n;
+   journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+   if (!journal->j_wbuf) {
+   printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
+   __FUNCTION__);
+   kfree(journal);
+   journal = NULL;
+   }
+   
return journal;
 }
  
@@ -717,7 +729,7 @@
 {
struct buffer_head *bh;
journal_t *journal = journal_init_common();
-   int err;
+   int err, n;
unsigned long blocknr;
 
if (!journal)
@@ -734,6 +746,17 @@
journal->j_maxlen = inode->i_size >> inode->i_sb->s_blocksize_bits;
journal->j_blocksize = inode->i_sb->s_blocksize;
 
+   /* journal descriptor can store 

[PATCH] JBD: fix against journal overflow

2005-01-19 Thread Alex Tomas

Good day,

under some quite high load, jbd can hit J_ASSERT(journal-j_free  1)
in journal_next_log_block(). The cause is the following:

journal_commit_transaction()
{
struct buffer_head *wbuf[64];



/* If there's no more to do, or if the descriptor is full,
   let the IO rip! */

if (bufs == ARRAY_SIZE(wbuf) ||
commit_transaction-t_buffers == NULL ||
space_left  sizeof(journal_block_tag_t) + 16) {

so, the real limit isn't size of journal descriptor, but wbuf.

__log_space_left()
{
/*
 * Be pessimistic here about the number of those free blocks which
 * might be required for log descriptor control blocks.
 */

#define MIN_LOG_RESERVED_BLOCKS 32 /* Allow for rounding errors */

left -= MIN_LOG_RESERVED_BLOCKS;


so, jbd expects upto 32 blocks to be used for internal purpose. but
with 64 blocks a descriptor limit we easily can break the limit.

The fix allocates wbuf in journal_init_dev(). That's enough because
we have only commit thread per filesystem.

thank, Alex


The journal_commit_transaction() generates too many descriptor blocks
because static array wbuf can hold 64 blocks only. The fix is to have
persistent array big enough to hold max. possible blocks.

Signed-off-by: Alex Tomas [EMAIL PROTECTED]

Index: linux-2.6.7/include/linux/jbd.h
===
--- linux-2.6.7.orig/include/linux/jbd.h2004-08-26 17:12:16.0 
+0400
+++ linux-2.6.7/include/linux/jbd.h 2005-01-19 17:08:33.144512008 +0300
@@ -826,6 +826,12 @@
struct jbd_revoke_table_s *j_revoke_table[2];
 
/*
+* array of bhs for journal_commit_transaction
+*/
+   struct buffer_head  **j_wbuf;
+   int j_wbufsize;
+
+   /*
 * An opaque pointer to fs-private information.  ext3 puts its
 * superblock pointer here
 */
Index: linux-2.6.7/fs/jbd/commit.c
===
--- linux-2.6.7.orig/fs/jbd/commit.c2004-08-26 17:12:40.0 +0400
+++ linux-2.6.7/fs/jbd/commit.c 2005-01-19 17:28:32.965111552 +0300
@@ -103,7 +103,7 @@
 {
transaction_t *commit_transaction;
struct journal_head *jh, *new_jh, *descriptor;
-   struct buffer_head *wbuf[64];
+   struct buffer_head **wbuf = journal-j_wbuf;
int bufs;
int flags;
int err;
@@ -271,7 +283,7 @@
BUFFER_TRACE(bh, start journal writeout);
get_bh(bh);
wbuf[bufs++] = bh;
-   if (bufs == ARRAY_SIZE(wbuf)) {
+   if (bufs == journal-j_wbufsize) {
jbd_debug(2, submit %d writes\n,
bufs);
spin_unlock(journal-j_list_lock);
@@ -488,7 +500,7 @@
/* If there's no more to do, or if the descriptor is full,
   let the IO rip! */
 
-   if (bufs == ARRAY_SIZE(wbuf) ||
+   if (bufs == journal-j_wbufsize ||
commit_transaction-t_buffers == NULL ||
space_left  sizeof(journal_block_tag_t) + 16) {
 
Index: linux-2.6.7/fs/jbd/journal.c
===
--- linux-2.6.7.orig/fs/jbd/journal.c   2005-01-19 12:07:59.0 +0300
+++ linux-2.6.7/fs/jbd/journal.c2005-01-19 17:11:08.589880720 +0300
@@ -687,6 +687,7 @@
 {
journal_t *journal = journal_init_common();
struct buffer_head *bh;
+   int n;
 
if (!journal)
return NULL;
@@ -702,6 +703,17 @@
journal-j_sb_buffer = bh;
journal-j_superblock = (journal_superblock_t *)bh-b_data;
 
+   /* journal descriptor can store upto n blocks -bzzz */
+   n = journal-j_blocksize / sizeof(journal_block_tag_t);
+   journal-j_wbufsize = n;
+   journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+   if (!journal-j_wbuf) {
+   printk(KERN_ERR %s: Cant allocate bhs for commit thread\n,
+   __FUNCTION__);
+   kfree(journal);
+   journal = NULL;
+   }
+   
return journal;
 }
  
@@ -717,7 +729,7 @@
 {
struct buffer_head *bh;
journal_t *journal = journal_init_common();
-   int err;
+   int err, n;
unsigned long blocknr;
 
if (!journal)
@@ -734,6 +746,17 @@
journal-j_maxlen = inode-i_size  inode-i_sb-s_blocksize_bits;
journal-j_blocksize = inode-i_sb-s_blocksize;
 
+   /* journal descriptor can store upto n blocks -bzzz */
+   n = journal-j_blocksize / sizeof(journal_block_tag_t);
+   journal-j_wbufsize = n;
+   journal-j_wbuf

[PATCH] JBD: journal_release_buffer()

2005-01-19 Thread Alex Tomas

Good day,

journal_release_buffer() can cause journal overflow in some
(very rare) conditions. Please, take a look at possible fix.

Th journal_head structure holds number of users of the buffer
for current transaction. The routines do_get_write_access()
and journal_get_create_access() tracks this number:
1) resets it to zero if the block's becoming part of the current
   transaction
2) increments it

journal_release_buffer() decrements it and if it's 0, then the
blocks isn't member of the transaction.

The patch has been tested on UP with dbench and tool that
uses xattr very much. 


Signed-off-by: Alex Tomas [EMAIL PROTECTED]

Index: linux-2.6.7/include/linux/journal-head.h
===
--- linux-2.6.7.orig/include/linux/journal-head.h   2003-06-24 
18:05:26.0 +0400
+++ linux-2.6.7/include/linux/journal-head.h2005-01-19 14:09:59.0 
+0300
@@ -80,6 +80,11 @@
 * [j_list_lock]
 */
struct journal_head *b_cpnext, *b_cpprev;
+
+   /*
+* counter to track users of the buffer in current transaction
+*/
+   int b_tcount;
 };
 
 #endif /* JOURNAL_HEAD_H_INCLUDED */
Index: linux-2.6.7/fs/jbd/transaction.c
===
--- linux-2.6.7.orig/fs/jbd/transaction.c   2004-08-26 17:12:40.0 
+0400
+++ linux-2.6.7/fs/jbd/transaction.c2005-01-19 17:23:30.058160408 +0300
@@ -611,6 +611,10 @@
handle-h_buffer_credits--;
if (credits)
(*credits)++;
+
+   /* the block's becoming member of the trasaction -bzzz */
+   jh-b_tcount = 0;
+
goto done;
}
 
@@ -694,6 +698,9 @@
if (credits)
(*credits)++;
 
+   /* the block's becoming member of the trasaction -bzzz */
+   jh-b_tcount = 0;
+
/*
 * Finally, if the buffer is not journaled right now, we need to make
 * sure it doesn't get written to disk before the caller actually
@@ -723,6 +730,11 @@
memcpy(jh-b_frozen_data, source+offset, jh2bh(jh)-b_size);
kunmap_atomic(source, KM_USER0);
}
+
+   /* track all references to the block to be able to recognize the
+* situation when the buffer is not part of transaction -bzzz */
+   jh-b_tcount++;
+
jbd_unlock_bh_state(bh);
 
/*
@@ -822,11 +834,20 @@
jh-b_transaction = transaction;
JBUFFER_TRACE(jh, file as BJ_Reserved);
__journal_file_buffer(jh, transaction, BJ_Reserved);
+   jh-b_tcount = 0;
} else if (jh-b_transaction == journal-j_committing_transaction) {
JBUFFER_TRACE(jh, set next transaction);
jh-b_next_transaction = transaction;
+   jh-b_tcount = 0;
}
spin_unlock(journal-j_list_lock);
+
+   /*
+* track all reference to the block to be able to recognize
+* the situation when the buffer is not part of transaction -bzzz
+*/
+   jh-b_tcount++;
+
jbd_unlock_bh_state(bh);
 
/*
@@ -1178,8 +1199,40 @@
 void
 journal_release_buffer(handle_t *handle, struct buffer_head *bh, int credits)
 {
+   journal_t *journal = handle-h_transaction-t_journal;
+   struct journal_head *jh = bh2jh(bh);
+
BUFFER_TRACE(bh, entry);
-   handle-h_buffer_credits += credits;
+
+   /* return credit back to the handle if it was really spent */
+   if (credits)
+   handle-h_buffer_credits++; 
+
+   jbd_lock_bh_state(bh);
+   J_ASSERT(jh-b_tcount  0);
+
+   jh-b_tcount--;
+   if (jh-b_tcount == 0) {
+   /* we can drop it from the transaction -bzzz */
+   J_ASSERT(jh-b_transaction == handle-h_transaction ||
+   jh-b_next_transaction == 
handle-h_transaction);
+   if (jh-b_transaction == handle-h_transaction) {
+   spin_lock(journal-j_list_lock);
+   __journal_unfile_buffer(jh);
+   spin_unlock(journal-j_list_lock);
+   } else if(jh-b_next_transaction) {
+   jh-b_next_transaction = NULL;
+   }
+
+   /* 
+* this was last reference to the block from the current
+* transaction and we'd like to return credit to the
+* whole transaction -bzzz
+*/
+   if (!credits)
+   handle-h_buffer_credits++; 
+   }
+   jbd_unlock_bh_state(bh);
 }
 
 /** 

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] JBD: log space management optimization

2005-01-19 Thread Alex Tomas

Good day,

during truncate ext3 calls journal_forget() for freed blocks, but
before these blocks go to the transaction and jbd reserves space
in log for them (-t_outstanding_credits). also, journal_forget()
removes these blocks from the transaction, but doesn't correct
log space reservation. for example, removal of 500MB file reserves
136 blocks, but only 10 blocks go to the log. a commit is expensive
and correct reservation allows us to avoid needless commits. here
is the patch. tested on UP.

thanks, Alex


Signed-off-by: Alex Tomas [EMAIL PROTECTED]

Index: linux-2.6.7/fs/jbd/transaction.c
===
--- linux-2.6.7.orig/fs/jbd/transaction.c   2004-08-26 17:12:40.0 
+0400
+++ linux-2.6.7/fs/jbd/transaction.c2005-01-19 17:23:30.058160408 +0300
@@ -1204,6 +1257,7 @@
transaction_t *transaction = handle-h_transaction;
journal_t *journal = transaction-t_journal;
struct journal_head *jh;
+   int drop_reserve = 0;
 
BUFFER_TRACE(bh, entry);
 
@@ -1227,6 +1281,7 @@
J_ASSERT_JH(jh, !jh-b_committed_data);
 
__journal_unfile_buffer(jh);
+   drop_reserve = 1;
 
/* 
 * We are no longer going to journal this buffer.
@@ -1249,7 +1304,7 @@
spin_unlock(journal-j_list_lock);
jbd_unlock_bh_state(bh);
__bforget(bh);
-   return;
+   goto drop;
}
}
} else if (jh-b_transaction) {
@@ -1264,6 +1319,7 @@
if (jh-b_next_transaction) {
J_ASSERT(jh-b_next_transaction == transaction);
jh-b_next_transaction = NULL;
+   drop_reserve = 1;
}
}
 
@@ -1271,6 +1327,15 @@
spin_unlock(journal-j_list_lock);
jbd_unlock_bh_state(bh);
__brelse(bh);
+
+drop:
+   if (drop_reserve) {
+   /* no need to reserve log space for this block -bzzz */
+   spin_lock(transaction-t_handle_lock);
+   transaction-t_outstanding_credits--;
+   spin_unlock(transaction-t_handle_lock);
+   }
+
return;
 }
 

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/