hmm. so you trade 265% degradation of creation for 40% improvement of unlink?
thanks, Alex
Coly Li wrote:
                normal ext4          ext4 with dir inode reservation
mount options:  -o data=writeback    -o data=writeback,dir_ireserve=low
On 9/19/07, David Chinner <[EMAIL PROTECTED]> wrote:
> The problem is this: to alter the fundamental block size of the
> filesystem we also need to alter the data block size and that is
> exactly the piece that linux does not support right now. So while
> we have the capability to use large block sizes
ACK, of course.
thanks, Alex
Mingming Cao wrote:
On Thu, 2007-06-14 at 19:29 +0400, Dmitriy Monakhov wrote:
I just couldn't believe my eyes when I saw this for the first time...
simple test: strace dd if=/dev/zero of=/mnt/file
Thanks for reporting it.
open("/dev/zero", O_RDONLY) = 0
Andrew Morton wrote:
I'm still not understanding. The terms you're using are a bit ambiguous.
What does "find some dirty unallocated blocks" mean? Find a page which is
dirty and which does not have a disk mapping?
Normally the above operation would be implemented via
Andrew Morton wrote:
On Fri, 04 May 2007 10:18:12 +0400 Alex Tomas <[EMAIL PROTECTED]> wrote:
Andrew Morton wrote:
Yes, there can be issues with needing to allocate journal space within the
context of a commit. But
no-no, this isn't required. we only need to mark pages/blocks
Andrew Morton wrote:
Yes, there can be issues with needing to allocate journal space within the
context of a commit. But
no-no, this isn't required. we only need to mark pages/blocks within
transaction, otherwise a race is possible when we allocate blocks in transaction,
then transaction starts
Andrew Morton wrote:
We can make great improvements here, and I've (twice) previously described
how: hoist the entire ordered-mode data handling out of ext3, and out of
the buffer_head layer and move it up into the VFS pagecache layer.
Basically, do ordered-data with a commit-time inode walk,
Theodore Ts'o wrote:
P.S. One bug which I've noted --- if there is a failure due to disk
filling up, running e2fsck on the filesystem will show that the i_blocks
fields on the inodes where there was a failure to allocate disk blocks
are left incorrect. I'm guessing this is a bug in the delayed
I think one problem with mmap/msync is that they can't maintain
i_size atomically like regular write does. so, one needs to
implement own i_size management in userspace.
thanks, Alex
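One way to picture "own i_size management in userspace" is a file that is preallocated once and carries its logical length in a small header. The sketch below is only an illustration under that assumption: the `mapped_log` layout and the `log_open()`/`log_append()` helpers are hypothetical names, not something from this thread, and the ordering shown (flush data, then publish the size) is one possible policy rather than a guarantee of crash consistency.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical on-disk layout: a header carrying the logical size, then data. */
struct mapped_log {
    uint64_t used;   /* logical size in bytes, maintained by the application */
    char data[];     /* payload area, preallocated up to the chosen capacity */
};

/* Preallocate the file with ftruncate() and map header plus payload. */
struct mapped_log *log_open(const char *path, size_t capacity)
{
    size_t total = sizeof(struct mapped_log) + capacity;
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    void *p;

    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)total) != 0) {
        close(fd);
        return NULL;
    }
    p = mmap(NULL, total, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);   /* the mapping stays valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : (struct mapped_log *)p;
}

/* Copy the payload first, flush it, and only then publish the new size. */
int log_append(struct mapped_log *log, size_t capacity, const void *buf, size_t len)
{
    size_t total = sizeof(struct mapped_log) + capacity;

    if (len > capacity || log->used > capacity - len)
        return -1;
    memcpy(log->data + log->used, buf, len);
    if (msync(log, total, MS_SYNC) != 0)
        return -1;
    log->used += len;
    return msync(log, total, MS_SYNC);
}
```

The file size seen by stat() stays at the preallocated capacity; only the `used` field tells a reader how much of it is meaningful, which is the bookkeeping that write() would otherwise do through i_size.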
> Side note: the only reason O_DIRECT exists is because database people are
> too used to it, because other OS's
> Peter Staubach (PS) writes:
PS> Just out of curiosity, what keeps i_nlink from going to 0 immediately
PS> after the new test is executed?
i_mutex in vfs_link() and vfs_unlink()
thanks, Alex
> Eric Sandeen (ES) writes:
ES> Al says "no" and I'm not arguing. :)
ES> Apparently this may be OK with some filesystems, and Al says he doesn't
ES> want to know about i_nlink in the vfs in any case.
well, generic_drop_inode() uses i_nlink ...
ES> But I suppose there may be other filesystems
> Eric Sandeen (ES) writes:
ES> I tend to agree, chatting w/ Al I think he does too. :) I'll test
ES> a patch that kicks out ext3_link() with -ENOENT at the top, and resubmit
ES> that if things go well.
shouldn't VFS do that?
thanks, Alex
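The check being discussed (refuse to link an inode whose link count has already reached zero) can be modelled in a few lines of userspace C. This is a sketch under stated assumptions, not the ext3 or VFS code: `model_inode`, `model_link()` and `model_unlink()` are hypothetical names, and the caller is assumed to hold whatever lock serializes link and unlink.

```c
#include <errno.h>

/* Hypothetical userspace stand-in for an inode; only the link count is modelled. */
struct model_inode {
    unsigned int nlink;
};

/* Refuse to add a name to an inode whose link count already dropped to zero. */
int model_link(struct model_inode *inode)
{
    if (inode->nlink == 0)
        return -ENOENT;
    inode->nlink++;
    return 0;
}

/* Drop one name; the caller is assumed to hold the lock serializing link/unlink. */
void model_unlink(struct model_inode *inode)
{
    if (inode->nlink > 0)
        inode->nlink--;
}
```

Whether such a check lives in each filesystem's link method or once in the VFS is exactly the open question in this exchange; the sketch takes no side on that.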
> Eric Sandeen (ES) writes:
ES> so I think it's possible that link can sneak in there & find it after
ES> the mutex is dropped...? Is this ok? :) It's certainly -happening-
ES> anyway
yes, but it shouldn't be allowed to re-link such an inode back, IMHO.
a filesystem may start some non-revertible
.
thanks, Alex
>>>>> Alex Tomas (AT) writes:
AT> interesting ..
AT> I thought VFS doesn't allow concurrent operations.
AT> if unlink goes first, then link should wait on the
AT> parent's i_mutex and then find no source name.
AT> thanks, Alex
>>>>>
interesting ..
I thought VFS doesn't allow concurrent operations.
if unlink goes first, then link should wait on the
parent's i_mutex and then find no source name.
thanks, Alex
> Eric Sandeen (ES) writes:
ES> )
ES> I've been looking at a case where many threads are opening, unlinking,
> David Chinner (DC) writes:
DC> So that means we'll have 2 separate mechanisms for marking
DC> pages as delalloc. XFS uses the BH_delay flag to indicate
DC> that a buffer (block) attached to the page is using delalloc.
>>
>> well, for blocksize=pagesize we can save 56 bytes on every page.
Hi,
> Andrew Morton (AM) writes:
AM> Should be cacheline_aligned_in_smp.
AM> That's assuming it needs to be cacheline aligned at all. It can consume a
AM> lot of space.
the idea is to make block reservation cheap because it's called
for every page.
AM>
AM> oh, this should be
> Christoph Hellwig (CH) writes:
CH> Note that recording delayed alloc state at a page granularity in addition
CH> to just the buffer heads has a lot of advantages as well and would help
CH> xfs, too. But I think it makes a lot more sense to record it as a radix
CH> tree tag to speed up the gang
Good day,
> David Chinner (DC) writes:
DC> So that means we'll have 2 separate mechanisms for marking
DC> pages as delalloc. XFS uses the BH_delay flag to indicate
DC> that a buffer (block) attached to the page is using delalloc.
well, for blocksize=pagesize we can save 56 bytes on every page.
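To get a feel for what 56 bytes per page adds up to, here is a small arithmetic sketch. The 56-byte figure is taken from the message above; the helper name `delalloc_bytes_saved()` and the example cache and page sizes are assumptions chosen only for illustration.

```c
#include <stdint.h>

/* Per-page saving quoted in the thread for the blocksize == pagesize case. */
#define SAVED_PER_PAGE 56u

/* Total bytes saved for a page cache of cache_bytes made of page_size pages. */
uint64_t delalloc_bytes_saved(uint64_t cache_bytes, uint64_t page_size)
{
    if (page_size == 0)
        return 0;
    return (cache_bytes / page_size) * SAVED_PER_PAGE;
}
```

For example, 1 GiB of cached data in 4096-byte pages is 262144 pages, so the saving would be 262144 * 56 = 14680064 bytes, or 14 MiB.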
Index: linux-2.6.20-rc1/include/linux/ext4_fs.h
===
--- linux-2.6.20-rc1.orig/include/linux/ext4_fs.h 2006-12-14 04:14:23.0 +0300
+++ linux-2.6.20-rc1/include/linux/ext4_fs.h 2006-12-22 20:21:12.0 +0300
@@
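/ext4/writeback.c 2006-12-22 22:59:33.0 +0300
@@ -0,0 +1,1167 @@
+/*
+ * Copyright (c) 2003-2006, Cluster File Systems, Inc, [EMAIL PROTECTED]
+ * Written by Alex Tomas <[EMAIL PROTECTED]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed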
Index: linux-2.6.20-rc1/include/linux/page-flags.h
===
--- linux-2.6.20-rc1.orig/include/linux/page-flags.h 2006-12-14 04:14:23.0 +0300
+++ linux-2.6.20-rc1/include/linux/page-flags.h 2006-12-22 20:05:31.0 +0300
Good day,
probably the previous set of patches (including mballoc/lg)
is too large. so, I reworked delayed allocation a bit so
that it can be used on top of regular balloc, though it
still can be used with extents-enabled files only.
this time series contains just 3 patches:
-
> Andrew Morton (AM) writes:
AM> What lock protects the fields in struct ext[234]_reserve_window from being
AM> concurrently modified by two CPUs? None, it seems. Ditto
AM> ext[234]_reserve_window_node. i_mutex will cover it for write(), but not
AM> for pageout over a file hole. If we end up
> Jan Blunck (JB) writes:
JB> Nope, d_alloc() is setting d_flags to DCACHE_UNHASHED. Therefore it is not found
JB> by __d_lookup() until it is rehashed which is implicitly done by ->lookup().
that means we can have two processes allocated dentry for
same name. they'll call ->lookup() each against
> Jan Blunck (JB) writes:
JB> i_sem does NOT protect the dcache. Also not in real_lookup(). The lock must be
JB> acquired for ->lookup() and because we might sleep on i_sem, we have to get it
JB> early and check for repopulation of the dcache.
dentry is part of dcache, right? i_sem protects
> Jan Blunck (JB) writes:
>> 1) i_sem protects dcache too
JB> Where? i_sem is the per-inode lock, and shouldn't be used else.
read comments in fs/namei.c:real_lookup()
>> 2) tmpfs has no "own" data, so we can use it this way (see 2nd patch)
>> 3) I have pdirops patch for ext3, but it needs some
> Jan Blunck (JB) writes:
JB> With luck you have s_pdirops_size (or 1024) different renames altering
JB> concurrently one directory inode. Therefore you need a lock protecting
JB> your filesystem data. This is basically the job done by i_sem. So in
JB> my opinion you only move "The Problem" from the
Index: linux-2.6.10/mm/shmem.c
===
--- linux-2.6.10.orig/mm/shmem.c 2005-01-28 19:32:16.0 +0300
+++ linux-2.6.10/mm/shmem.c 2005-02-19 20:05:32.642599576 +0300
@@ -1849,7 +1849,7 @@
#endif
};
-static int
 fs/inode.c         |  1
 fs/namei.c         | 66 ++---
 include/linux/fs.h | 11
 3 files changed, 54 insertions(+), 24 deletions(-)
Index: linux-2.6.10/fs/namei.c
===
Good day Al and all
could you review couple patches that implement $subj
for vfs and tmpfs. In short the idea is that we can
protect operations taking semaphore related for set
of names. definitely, protection at vfs layer isn't
enough and filesystem will need to protect their own
structures by
> Sonny Rao (SR) writes:
SR> Alex, small buglet: if the FIBMAP ioctl gets called on a file with
SR> delayed allocation, you need to flush it (or at least allocate) before
SR> returning the mappings. This doesn't seem to work properly at
SR> present.
good catch. thanks.
Good day all,
I've updated the patchset against 2.6.10. A bunch of bugs have been
fixed and mballoc now behaves a bit smarter. The extents and mballoc
patches collect some stats that they print upon umount. NOTE: they must
not be used to store important data. A lot of things remain to be done.
Please
>>>>> Stephen C Tweedie (SCT) writes:
SCT> Hi,
SCT> On Tue, 2005-01-25 at 19:30, Alex Tomas wrote:
>> >> journal_dirty_metadata(handle, bh)
>> >> {
>> >> transaction->t_reserved--;
>> >> handle->h_buf
> Stephen C Tweedie (SCT) writes:
>> journal_dirty_metadata(handle, bh)
>> {
>>     transaction->t_reserved--;
>>     handle->h_buffer_credits--;
>>     if (jh->b_tcount > 0) {
>>         /* modified, no need to track it any more */
>>         transaction->t_outstanding_credits++;
>>         jh->b_tcount = -1;
Hi, could you review the following solution?
t_outstanding_credits - number of _modified_ blocks in the transaction
t_reserved - number of blocks all running handles have reserved
transaction size = t_outstanding_credits + t_reserved;
#define TSIZE(t) ((t)->t_outstanding_credits +
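The accounting described above can be tried out in a small userspace model. This is only a sketch under assumptions, not the jbd patch: the `model_*` names are hypothetical, every dirtied block is assumed to be new to the transaction, and the rule that a stopping handle gives back its unused reservation is my reading of the proposal rather than something the snippet spells out.

```c
/* Hypothetical model of the two counters proposed in the message. */
struct model_transaction {
    int t_outstanding_credits; /* blocks already modified in this transaction */
    int t_reserved;            /* blocks still reserved by running handles */
};

struct model_handle {
    struct model_transaction *t;
    int h_buffer_credits;      /* reservation this handle has not used yet */
};

/* transaction size = t_outstanding_credits + t_reserved */
int model_tsize(const struct model_transaction *t)
{
    return t->t_outstanding_credits + t->t_reserved;
}

/* Starting a handle reserves nblocks against the transaction. */
void model_start(struct model_handle *h, struct model_transaction *t, int nblocks)
{
    h->t = t;
    h->h_buffer_credits = nblocks;
    t->t_reserved += nblocks;
}

/* Dirtying a new block turns one reserved credit into one modified block. */
void model_dirty(struct model_handle *h)
{
    h->h_buffer_credits--;
    h->t->t_reserved--;
    h->t->t_outstanding_credits++;
}

/* Assumption: stopping a handle returns whatever it reserved but never used. */
void model_stop(struct model_handle *h)
{
    h->t->t_reserved -= h->h_buffer_credits;
    h->h_buffer_credits = 0;
}
```

With a handle that reserves 136 blocks and dirties 10 of them, the model's transaction size is 136 while the handle runs and drops to 10 once it stops, which is the kind of over-reservation the change is meant to stop charging for.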
> Stephen C Tweedie (SCT) writes:
>> + /* return credit back to the handle if it was really spent */
>> + if (credits) {
>> + handle->h_buffer_credits++;
>> + spin_lock(&handle->h_transaction->t_handle_lock);
>> + handle->h_transaction->t_outstanding_credits++;
>> +
is expensive
and correct reservation allows us to avoid needless commits. here
is the patch. tested on UP.
Signed-off-by: Alex Tomas <[EMAIL PROTECTED]>
Index: linux-2.6.7/fs/jbd/transaction.c
===
--- linux-2.6.7.orig/fs/jbd/transaction.c 2004-08-26 17:12:40.0
> Stephen C Tweedie (SCT) writes:
>> + /* return credit back to the handle if it was really spent */
>> + if (credits)
>> + handle->h_buffer_credits++;
>> + jh->b_tcount--;
>> + if (jh->b_tcount == 0) {
>> + /*
>> +  * this was last reference to the block from the
> Stephen C Tweedie (SCT) writes:
SCT> /*
SCT>  * Be pessimistic here about the number of those free blocks which
SCT>  * might be required for log descriptor control blocks.
SCT>  */
SCT> ...
SCT> left -= (left >> 3);
oops. i overlooked this line. so, the fix becomes minor improvement
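The quoted line holds back one eighth of the free blocks (rounded down) as headroom for descriptor blocks. A tiny sketch of that arithmetic, with `pessimistic_free()` as a hypothetical helper name used only here:

```c
/* Reserve left/8 (integer shift, rounded down) and return what remains. */
unsigned long pessimistic_free(unsigned long left)
{
    left -= (left >> 3);
    return left;
}
```

For instance, 1024 free blocks become 896 usable ones, while values below 8 are left unchanged because the shift yields zero.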
> Stephen C Tweedie (SCT) writes:
SCT> I don't see how that "limit" is relevant here. wbuf is nothing but the
SCT> size of the IO batches we pass to ll_rw_block() during that commit
SCT> phase. j_free affects the total size of space the *entire* commit has
SCT> to run into, and (as akpm has
ion. for example, removal of 500MB file reserves
136 blocks, but only 10 blocks go to the log. a commit is expensive
and correct reservation allows us to avoid needless commits. here
is the patch. tested on UP.
thanks, Alex
Signed-off-by: Alex Tomas <[EMAIL PROTECTED]>
Index: linux-2.6
.
Signed-off-by: Alex Tomas <[EMAIL PROTECTED]>
Index: linux-2.6.7/include/linux/journal-head.h
===
--- linux-2.6.7.orig/include/linux/journal-head.h 2003-06-24 18:05:26.0 +0400
+++ linux-2.6.7/include/linux/journal-head.h
y descriptor blocks
because static array wbuf can hold 64 blocks only. The fix is to have
persistent array big enough to hold max. possible blocks.
Signed-off-by: Alex Tomas <[EMAIL PROTECTED]>
Index: linux-2.6.7/include/linux/jbd.h
===