[patch] e2fsprogs time value is interpreted as signed

2007-06-17 Thread Dmitriy Monakhov
after bug: http://bugzilla.kernel.org/show_bug.cgi?id=5079
was fixed by:
commit 4d7bf11d649c72621ca31b8ea12b9c94af380e63
Author: Markus Rechberger <[EMAIL PROTECTED]>
Date:   Tue May 8 00:23:39 2007 -0700
__u32 time value becomes interpreat as signed value. This patch add
coresponding fix to debug fs routine.

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
---
 debugfs/util.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/debugfs/util.c b/debugfs/util.c
index c6096ab..53a0813 100644
--- a/debugfs/util.c
+++ b/debugfs/util.c
@@ -187,7 +187,7 @@ int check_fs_bitmaps(char *name)
 char *time_to_string(__u32 cl)
 {
static int  do_gmt = -1;
-   time_t  t = (time_t) cl;
+   time_t  t = (time_t)(signed) cl;
char *  tz;
 
if (do_gmt == -1) {
-- 
1.5.2


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fw: [Bugme-new] [Bug 8643] New: Bug with negative timestamps on 64bit machines

2007-06-17 Thread Dmitriy Monakhov
On 23:05 Сбт 16 Июн , Andrew Morton wrote:
> 
> someone ort to fix this
duplicate: http://bugzilla.kernel.org/show_bug.cgi?id=5079
Already fixed by:
commit 4d7bf11d649c72621ca31b8ea12b9c94af380e63
Author: Markus Rechberger <[EMAIL PROTECTED]>
Date:   Tue May 8 00:23:39 2007 -0700
I don't have sufficient permission to change bug status, so
let's someone else do it.
> 
> 
> Begin forwarded message:
> 
> Date: Sat, 16 Jun 2007 17:50:35 -0700 (PDT)
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
> Subject: [Bugme-new] [Bug 8643] New: Bug with negative timestamps on 64bit 
> machines
> 
> 
> http://bugzilla.kernel.org/show_bug.cgi?id=8643
> 
>Summary: Bug with negative timestamps on 64bit machines
>Product: File System
>Version: 2.5
>  KernelVersion: 2.6.21-1.3228.fc7
>   Platform: All
> OS/Version: Linux
>   Tree: Fedora
> Status: NEW
>   Severity: normal
>   Priority: P1
>  Component: ext3
> AssignedTo: [EMAIL PROTECTED]
> ReportedBy: [EMAIL PROTECTED]
> 
> 
> Most recent kernel where this bug did not occur:
> Distribution: Fedora Core 7
> Hardware Environment: AMD Athlon 64
> Software Environment:
> Problem Description:
> 
> Negative timestamps (i.e. before 1970) are incorrectly handled due to some
> obvious 64bit issues.  They show up normally as long as the cache has not been
> purged.
> 
> Steps to reproduce:
> 
> Touch a file in an ext3-file system with a negative timestamp, e.g.:
> 
>   # touch -t 19690101 /opt/foo
> 
> Unless you have been writing a lot of data to this partition after the last
> command, the next one should display the following:
> 
>   #  ls -l /opt/foo
>   -rw-r--r-- 1 root root 0 1969-01-01 00:00 /opt/foo
> 
> This is still correct.  But now we purge the file system cache by unmounting
> and mounting the partition again:
> 
>   # umount /opt/foo
>   # mount /opt/foo
> 
> Now the timestamp will be corrupted:
> 
>   # ls -l /opt/foo 
>   -rw-r--r-- 1 root root 0 2105-02-07 06:28 /opt/foo
> 
> It seems obvious that the upper 32 bits of the 64 bits representing the time
> stamp get lost, which explains the above observation.
> 
> On 32bit architectures there is no such problem.  I haven't managed to
> reproduce this problem with other file systems (e.g. XFS).
> 
> 
> -- 
> Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
> --- You are receiving this mail because: ---
> You are on the CC list for the bug, or are watching someone who is.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: iov_iter_fault_in_readable fix

2007-06-16 Thread Dmitriy Monakhov
On 18:31 Чтв 14 Июн , Christoph Hellwig wrote:
> On Wed, Jun 13, 2007 at 05:57:59PM +0400, Dmitriy Monakhov wrote:
> > Function prerform check for signgle region, with out respect to
> > segment nature of iovec, For example writev no longer works :)
> 
> Btw, could someone please start to collect all sniplets like this in
> a nice simple regression test suite?  If no one wants to start a new
> one we should probably just put it into xfsqa (which should be useable
> for other filesystems aswell despite the name)
I've prepared testcase (testcases/kernel/syscalls/writev/writev06.c) 
and sent it to ltp mailing list.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] deny partial write for loop dev fd

2007-06-16 Thread Dmitriy Monakhov
Partial write can be easily supported by LO_CRYPT_NONE mode, but
it is not easy in LO_CRYPT_CRYPTOAPI case, because of its block nature.
I don't know who still used cryptoapi, but theoretically it is possible.
So let's leave things as they are. Loop device doesn't support partial
write before Nick's "write_begin/write_end" patch set, and let's it 
behave the same way after.

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
---
 drivers/block/loop.c |4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 4bab9b1..de122f3 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -244,10 +244,8 @@ static int do_lo_send_aops(struct loop_device *lo, struct 
bio_vec *bvec,
 
ret = pagecache_write_end(file, mapping, pos, size, copied,
page, fsdata);
-   if (ret < 0)
+   if (ret < 0 || ret != copied)
goto fail;
-   if (ret < copied)
-   copied = ret;
 
if (unlikely(transfer_result))
goto fail;
-- 
1.5.2


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: delayed allocatiou result in Oops

2007-06-16 Thread Dmitriy Monakhov
On 16:16 Птн 15 Июн , Mingming Cao wrote:
> I hit almost the same issue today also, but with different error #, and
> one more kernel oops, when run fsstress on x86_64. 
> 
> EXT4-fs: writeback error = -2
> EXT4-fs: writeback error = -2
This error never happens in writeback in my case, only ENOSPC.
Btw: i've send one more micro fix (see in this thread " PATCH] ext4:fix 
invariant checking in ext4_rebalance_reservation")
> 
> Unable to handle kernel NULL pointer dereference at  RIP: 
>  [] block_read_full_page+0xb5/0x267
> PGD 1f9842067 PUD 1f9843067 PMD 0 
> Oops:  [5] SMP 
> CPU 3 
> Modules linked in:
> Pid: 10900, comm: fsstress Not tainted 2.6.22-rc4-autokern1 #1
> RIP: 0010:[]  [] 
> block_read_full_page+0xb5/0x267
> RSP: :8101f984fa48  EFLAGS: 00010213
> RAX: 0179 RBX:  RCX: 000c
> RDX: 000c RSI: 802e0f7b RDI: 81017ff578c8
> RBP: 81017ff578c8 R08: 8101f984fbe8 R09: 8101f984fbe0
> R10:  R11:  R12: 00e5
> R13:  R14: 1000 R15: 
> FS:  () GS:8101803ec5c0(0063) knlGS:f7dec460
> CS:  0010 DS: 002b ES: 002b CR0: 8005003b
> CR2:  CR3: 0001f9841000 CR4: 06e0
> Process fsstress (pid: 10900, threadinfo 8101f984e000, task 
> 8101f9824280)
> Stack:  1000f9ad4080 0001  0179
>  8100de7c4100 802e0f7b 03363e3761bf 804eac2d
>  8101f984fb48 0082 81017e9bc550 81017e9bc588
> Call Trace:
>  [] ext4_get_block+0x0/0x104
>  [] thread_return+0x0/0xd5
>  [] do_mpage_readpage+0x411/0x430
>  [] io_schedule+0x26/0x32
>  [] __wait_on_bit_lock+0x5f/0x6d
>  [] mpage_readpage+0x42/0x5b
>  [] ext4_get_block+0x0/0x104
>  [] wake_bit_function+0x0/0x23
>  [] file_read_actor+0x89/0xf4
>  [] find_get_page+0x1e/0x4d
>  [] do_generic_mapping_read+0x20e/0x3df
>  [] file_read_actor+0x0/0xf4
>  [] generic_file_aio_read+0x11d/0x154
>  [] do_sync_read+0xc8/0x10b
>  [] permission+0xbb/0xbd
>  [] autoremove_wake_function+0x0/0x2e
>  [] nameidata_to_filp+0x25/0x34
>  [] do_filp_open+0x2d/0x3d
>  [] vfs_getattr+0x2b/0x2f
>  [] vfs_fstat+0x33/0x3a
>  [] vfs_read+0xab/0x12e
>  [] sys_read+0x45/0x6e
>  [] ia32_sysret+0x0/0xa
> 
[skip]
> [ cut here ]
> kernel BUG at fs/ext4/writeback.c:266!
Yepp. I've saw this oops too, But IMHO it may be artefact caused by
previous one.
> invalid opcode:  [6] SMP 
> CPU 3 
> Modules linked in:
> Pid: 10851, comm: fsstress Not tainted 2.6.22-rc4-autokern1 #1
> RIP: 0010:[]  [] 
> ext4_wb_submit_extent+0x1ef/0x3d9
> RSP: :8101e47cfab8  EFLAGS: 00010246
> RAX: 0001182c RBX: 8100c6709ca0 RCX: 000c
> RDX: 0001 RSI:  RDI: 8101e8de5000
> RBP: 8100c6709a48 R08: 8101b1056338 R09: 
> R10: 8101b1056338 R11: 8100c6709a48 R12: 0040
> R13: 81017eaa5b98 R14: 0040 R15: 0001
> FS:  () GS:8101803ec5c0(0063) knlGS:f7dec460
> CS:  0010 DS: 002b ES: 002b CR0: 8005003b
> CR2: f7dcb004 CR3: 0001e47c1000 CR4: 06e0
> Process fsstress (pid: 10851, threadinfo 8101e47ce000, task 
> 8101e47a6b30)
> Stack:  8101cf22c9b8  0001 000c0001
>  8100c6709a48 00018028938e 8101e47cfb68 
>  8101e47cfd28 8100c6709ca0 8100c6709a48 8100c6709990
> Call Trace:
>  [] ext4_wb_handle_extent+0x3b5/0x48c
>  [] ext4_ext_walk_space+0x18a/0x20c
>  [] ext4_wb_handle_extent+0x0/0x48c
>  [] ext4_wb_flush+0x5b/0x153
>  [] ext4_wb_writepages+0x34b/0x398
>  [] do_writepages+0x20/0x2d
>  [] __writeback_single_inode+0x1df/0x3a7
>  [] find_get_pages_tag+0x34/0x89
>  [] pagevec_lookup_tag+0x1a/0x24
>  [] wait_on_page_writeback_range+0xc7/0x10d
>  [] sync_sb_inodes+0x1cb/0x2a0
>  [] sync_inodes_sb+0xa5/0xb9
>  [] __up_read+0x10/0x8a
>  [] __sync_inodes+0x6a/0xb1
>  [] sync_inodes+0x11/0x29
>  [] do_sync+0x2c/0x50
>  [] sys_sync+0xb/0xf
>  [] ia32_sysret+0x0/0xa
> 
> 
> Code: 0f 0b eb fe f0 41 0f ba 75 00 14 48 8b 4c 24 40 01 51 10 48 
> RIP  [] ext4_wb_submit_extent+0x1ef/0x3d9
>  RSP 
> 
> I will try the patch below...Alex, any hint about the second oops?
> 
> Mingming
> Alex please 
> On Fri, 2007-06-15 at 09:14 +0400, Alex Tomas wrote:
> > looks like an error in error handling path (notice -28 (ENOSPC) before)
> > 
> > thanks for the report, Alex
&

[PATCH] ext4:fix invariant checking in ext4_rebalance_reservation

2007-06-16 Thread Dmitriy Monakhov
Variable "free" was declarated as __u64 so conidition (free < 0) always
false, even if free was overflowed during substraction.
---
 fs/ext4/balloc.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 43ae8f8..a9655f1 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -1909,8 +1909,8 @@ void ext4_rebalance_reservation(struct 
ext4_reservation_slot *rs, __u64 free)
chunk = free;
if (rs[i].rs_reserved || i == smp_processor_id()) {
rs[i].rs_reserved = chunk;
+   BUG_ON(free < chunk);
free -= chunk;
-   BUG_ON(free < 0);
}
}
BUG_ON(free);
-- 
1.5.2

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ext4:fix unexpected error from ext4_reserve_global

2007-06-14 Thread Dmitriy Monakhov
I just cant belive my eyes then i saw this at the first time...
simple test: strace dd if=/dev/zero of=/mnt/file

open("/dev/zero", O_RDONLY) = 0
close(1)= 0
open("/mnt/test/file", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 1
read(0, "\0\0\0\0\0\0\0\0\0"..., 512) = 512
write(1, "\0\0\0\0\0\0\0\0"..., 512) = 512
read(0, "\0\0\0\0\0\0\0\0\0"..., 512) = 512
write(1, "\0\0\0\0\0\0\0\0"..., 512) = -1 ENOENT (No such fil
e or directory)

This strange error returned from ext4_reserve_global().
It's just typo because:
a) In fact this is 100% ENOSPC situation
b) simular function ext4_reserve_local() returns -ENOSPC

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
---
 fs/ext4/balloc.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 4d7bfd2..43ae8f8 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -1920,7 +1920,7 @@ int ext4_reserve_global(struct super_block *sb, int 
blocks)
 {
struct ext4_sb_info *sbi = EXT4_SB(sb);
struct ext4_reservation_slot *rs;
-   int i, rc = -ENOENT;
+   int i, rc = -ENOSPC;
__u64 free = 0;
 
rs = sbi->s_reservation_slots;
-- 
1.5.2


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


delayed allocatiou result in Oops

2007-06-14 Thread Dmitriy Monakhov
Simple test failed on ext4 when delayed allocation was used.
#mkfs.ext3 -b4096 /dev/vzvg/test2
#mount -text4dev /dev/vzvg/test2  /mnt/test -odelalloc
#fsstress -d /mnt/test/ -l100  -n10 -p20  -f dwrite=0


EXT4-fs: writeback error = -28
..
EXT4-fs: writeback error = -28
Unable to handle kernel NULL pointer dereference at  RIP: 
 [] block_read_full_page+0xab/0x25f
PGD 44c1067 PUD 44fd067 PMD 0 
Oops:  [2] SMP 
CPU 0 
Modules linked in: ext4dev jbd2
Pid: 4833, comm: fsstress Not tainted 2.6.22-rc4-mm2 #9
RIP: 0010:[]  [] 
block_read_full_page+0xab/0x25f
RSP: 0018:810004df9a58  EFLAGS: 00010203
RAX: 1000 RBX: 8100cf4256f8 RCX: 000c

RDX: 0001 RSI: 000c RDI: 8100cf4256f8
RBP:  R08: 810004df9be8 R09: 810004df9c58
R10:  R11:  R12: 0052
R13: 1000 R14:  R15: 00d3
FS:  2adfe3f7d6f0() GS:8073() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2:  CR3: 04362000 CR4: 06e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process fsstress (pid: 4833, threadinfo 810004df8000, task 810004c867a0)
Stack:  880180f2 8100054a23f0  
 810005dcbb80 81000549bf00  810005def8b0
 88029e60 8025b2be 8100054da540 8100cf496fb0
Call Trace:
 [] :ext4dev:ext4_get_block+0x0/0x109
 [] find_get_page+0x21/0x51
 [] do_mpage_readpage+0x45f/0x480
 [] :ext4dev:ext4_get_block+0x0/0x109
 [] :jbd2:jbd2_journal_dirty_metadata+0x197/0x1be
 [] bit_waitqueue+0x1c/0x99
 [] mpage_readpage+0x4e/0x67
 [] :ext4dev:ext4_get_block+0x0/0x109
 [] do_lookup+0x63/0x1ae
 [] file_read_actor+0x8d/0xf6
 [] find_get_page+0x21/0x51
 [] do_generic_mapping_read+0x23c/0x3da
 [] file_read_actor+0x0/0xf6
 [] generic_file_aio_read+0x119/0x156
 [] do_sync_read+0xc9/0x10c

 [] cp_new_stat+0xe5/0xfd
 [] autoremove_wake_function+0x0/0x2e
 [] vfs_read+0xaa/0x132
 [] sys_read+0x45/0x6e
 [] system_call+0x7e/0x83
Code: 8b 45 00 a8 01 0f 85 e6 00 00 00 8b 45 00 a8 20 0f 85 c9 00 


I've digged this a litle bit with folowig results:

int block_read_full_page(struct page *page, get_block_t *get_block)
{
...
1914:   if (!page_has_buffers(page)) <<< page_has_buffers(page) == true 
create_empty_buffers(page, blocksize, 0);
head = page_buffers(page);   page_buffers(page) == NULL  
<private == NULL
<<< So we have private page without buffers, it is WRONG.

iblock = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
lblock = (i_size_read(inode)+blocksize-1) >> inode->i_blkbits;
bh = head;
nr = 0;
i = 0;

do {
if (buffer_uptodate(bh)) << Null pointer deref here result in 
oops
...
}

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree

2007-06-13 Thread Dmitriy Monakhov
On 13:43 Срд 13 Июн , Nick Piggin wrote:
> On Wed, Jun 13, 2007 at 05:40:05PM +0400, Dmitriy Monakhov wrote:
> > On 14:19 ?? 29 ?? , [EMAIL PROTECTED] wrote:
> > > 
> > > The patch titled
> > >  fs: introduce write_begin, write_end, and perform_write aops
> > > has been added to the -mm tree.  Its filename is
> > >  fs-introduce-write_begin-write_end-and-perform_write-aops.patch
> > > 
> > > *** Remember to use Documentation/SubmitChecklist when testing your code 
> > > ***
> > > 
> > > See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to 
> > > find
> > > out what to do about this
> > > 
> > > --
> > > Subject: fs: introduce write_begin, write_end, and perform_write aops
> > > From: Nick Piggin <[EMAIL PROTECTED]>
> > > 
> > > These are intended to replace prepare_writ eand commit_write with more
> > > flexible alternatives that are also able to avoid the buffered write
> > > deadlock problems efficiently (which prepare_write is unable to do).
> > > 
> > > [EMAIL PROTECTED]: API design contributions, code review and fixes]
> > > Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>
> > > Signed-off-by: Mark Fasheh <[EMAIL PROTECTED]>
> > > Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
> > I've finaly find time to review Nick's "write_begin/write_end aop" patch 
> > set.
> > And i have some fixes and questions. My be i've missed somthing and it was 
> > already disscussed, but i cant find in LKML.
> 
> Thanks, that's really helpful.
> 
>  
> > 1) loop dev:
> > loop.c code itself is not perfect. In fact before Nick's patch
> > partial write was't possible. Assumption what write chunks are
> > always page aligned is realy weird ( see "index++" line).
> > Fixed by "new aop loop fix" patch
> 
> I think you're right, fix looks good.
> 
>  
> > 2)block_write_begin:
> > After we enter to block_write_begin with *pagep == NULL and
> > some page was grabed we remember this page in *pagep
> > And if __block_prepare_write() we have to clear *pagep , as 
> > it was before. Because this may confuse caller.
> > for example caller may have folowing code:
> > ret = block_write_begin(..., pagep,...)
> > if (ret && *pagep != NULL) {
> > unlock_page(*pagep);
> > page_cache_release(*pagep);
> > }
> > Fixed my "new aop block_write_begin fix" patch
> 
> I don't think the caller can rely on that if it returns failure.
> But that is more defensive I guess. Maybe setting it to 1 or
> so would catch abusers.
> 
>  
> > 3) __page_symlink:
> > Nick's patch add folowing code:
> > + err = pagecache_write_begin(NULL, mapping, 0,PAGE_CACHE_SIZE,
> > + AOP_FLAG_UNINTERRUPTIBLE, &page,&fsdata);
> > symlink now consume whole page. I have only one question "WHY???".
> > I don't see any advantages, but where are huge list of
> > dissdvantages:
> > a) fs with blksize == 1k and pagesize == 16k after this patch
> >waste more than 10x times disk space for nothing.
> > b) What happends if we want use fs with blksize == 4k on i386
> >after it was used by ia64 ??? (before this patch it was
> >possible).
One more visiable effect caused by wrong symlink size:
fsstress cause folowing error:

EXT3-fs unexpected failure: !buffer_revoked(bh); 
inconsistent data on disk
ext3_forget: aborting transaction: IO failure in __ext3_journal_revoke
ext3_abort called.

EXT3-fs error (device dm-4): ext3_forget: error -5 when attempting
revoke
Remounting filesystem read-only 
Aborting journal on device dm-4.
journal commit I/O error
journal commit I/O error
journal commit I/O error
journal commit I/O error
journal commit I/O error
journal commit I/O error
EXT3-fs error (device dm-4) in ext3_free_blocks_sb: Journal has aborted

journal commit I/O error
journal commit I/O error
EXT3-fs error (device dm-4) in ext3_reserve_inode_write: Journal has
aborted
EXT3-fs error (device dm-4) in ext3_truncate: IO failure
journal commit I/O error
journal commit I/O error
journal commit I/O error
journal commit I/O error
EXT3-fs error (device dm-4) in ext3_reserve_inode_write: Journal has
aborted
EXT3-fs error (device dm-4) in ext3_orphan_del: Journal has aborted
EXT3-fs error (device dm-4) in ext3_rese

Re: [patch] new aop loop fix

2007-06-13 Thread Dmitriy Monakhov
On 14:36 Срд 13 Июн , Hugh Dickins wrote:
> On Wed, 13 Jun 2007, Dmitriy Monakhov wrote:
> 
> > loop.c code itself is not perfect. In fact before Nick's patch
> > partial write was't possible. Assumption what write chunks are
> > always page aligned is realy weird ( see "index++" line).
> > 
> > Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
> 
> I'm interested, because I'm trying to chase down an -mm bug which
> occasionally leaves me with 1k of zeroes where it shouldn't (in a
> 1k bsize ext2 looped over tmpfs).  The length of time for this to
> happen varies a lot, so bisection has been misleading: maybe the
> problem lies in Nick's patches, maybe it does not.
> 
> But I don't understand your fix below at all.  _If_ you need to
> change the handling of index, then you need to change the handling
> of offset too, don't you?
> 
> But what's wrong with how inded was handled anyway?  Yes, it might
> be being incremented at the bottom of the loop when we haven't
> reached the end of this page, but in that case we're not going
> round the loop again anyway: len will now be 0.  So no problem.
> 
> One of us is missing something: please enlighten me - thanks.
Yepp. You absolutely right, wrong patch was attached :(
Btw: Nick's patches broke  LO_CRYPT_XOR mode, but it is ok because
this feature was absolete and not used by anyone, am i right here?
> 
> Hugh
> 
> > 
> > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > index 4bab9b1..8726da5 100644
> > --- a/drivers/block/loop.c
> > +++ b/drivers/block/loop.c
> > @@ -215,7 +215,6 @@ static int do_lo_send_aops(struct loop_device *lo, 
> > struct bio_vec *bvec,
> > int len, ret;
> >  
> > mutex_lock(&mapping->host->i_mutex);
> > -   index = pos >> PAGE_CACHE_SHIFT;
> > offset = pos & ((pgoff_t)PAGE_CACHE_SIZE - 1);
> > bv_offs = bvec->bv_offset;
> > len = bvec->bv_len;
> > @@ -226,6 +225,7 @@ static int do_lo_send_aops(struct loop_device *lo, 
> > struct bio_vec *bvec,
> > struct page *page;
> > void *fsdata;
> >  
> > +   index = pos >> PAGE_CACHE_SHIFT;
> > IV = ((sector_t)index << (PAGE_CACHE_SHIFT - 9))+(offset >> 9);
> > size = PAGE_CACHE_SIZE - offset;
> > if (size > len)
> > @@ -255,7 +255,6 @@ static int do_lo_send_aops(struct loop_device *lo, 
> > struct bio_vec *bvec,
> > bv_offs += copied;
> > len -= copied;
> > offset = 0;
> > -   index++;
> > pos += copied;
> > }
> > ret = 0;
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


iov_iter_fault_in_readable fix

2007-06-13 Thread Dmitriy Monakhov
Function prerform check for signgle region, with out respect to
segment nature of iovec, For example writev no longer works :)

/* TESTCASE BEGIN */
#include 
#include 
#include 
#include 
#include 
#include 
#define SIZE  (4096 * 2)
int main(int argc, char* argv[])
{   
char* ptr[4];
struct iovec iov[2];
int fd, ret;
ptr[0] = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
ptr[1] = mmap(NULL, SIZE, PROT_NONE,
MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
ptr[2] = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
ptr[3] = mmap(NULL, SIZE, PROT_NONE, 
MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);

iov[0].iov_base = ptr[0] + (SIZE -1);
iov[0].iov_len = 1;
memset(ptr[0], 1, SIZE);

iov[1].iov_base = ptr[2];
iov[1].iov_len = SIZE;
memset(ptr[2], 2, SIZE);

fd = open(argv[1], O_CREAT|O_RDWR|O_TRUNC, 0666);
ret = writev(fd, iov, sizeof(iov) / sizeof(struct iovec));
return 0;
}   
/* TESTCASE END*/
We will get folowing result:
writev(3, [{"\1", 1}, {"\2"..., 8192}], 2) = -1 EFAULT (Bad 
address)

this is hidden bug, and it was invisiable because _fault_in_readable
return value was ignored before. Lets iov_iter_fault_in_readable
perform checks for all segments.

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>

diff --git a/include/linux/fs.h b/include/linux/fs.h
index fef19fc..7e025ea 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -433,7 +433,7 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
 size_t iov_iter_copy_from_user(struct page *page,
struct iov_iter *i, unsigned long offset, size_t bytes);
 void iov_iter_advance(struct iov_iter *i, size_t bytes);
-int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes);
+int iov_iter_fault_in_readable(struct iov_iter *i, size_t *bytes);
 size_t iov_iter_single_seg_count(struct iov_iter *i);
 
 static inline void iov_iter_init(struct iov_iter *i,
diff --git a/mm/filemap.c b/mm/filemap.c
index 8d59ed9..8600c3e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1817,10 +1817,32 @@ void iov_iter_advance(struct iov_iter *i, size_t bytes)
 }
 EXPORT_SYMBOL(iov_iter_advance);
 
-int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
+int iov_iter_fault_in_readable(struct iov_iter *i, size_t* bytes)
 {
-   char __user *buf = i->iov->iov_base + i->iov_offset;
-   return fault_in_pages_readable(buf, bytes);
+   size_t len = *bytes;
+   int ret;
+   if (likely(i->nr_segs == 1)) {
+   ret = fault_in_pages_readable(i->iov->iov_base, len);
+   if (ret)
+   *bytes = 0;
+   } else {
+   const struct iovec *iov = i->iov;
+   size_t base = i->iov_offset;
+   *bytes = 0;
+   while (len) {
+   int copy = min(len, iov->iov_len - base);
+   if ((ret = fault_in_pages_readable(iov->iov_base + 
base, copy)))
+   break;
+   *bytes += copy;
+   len -= copy;
+   base += copy;
+   if (iov->iov_len == base) {
+   iov++;
+   base = 0;
+   }
+   }
+   }
+   return ret; 
 }
 EXPORT_SYMBOL(iov_iter_fault_in_readable);
 
@@ -2110,7 +2132,7 @@ static ssize_t generic_perform_write_2copy(struct file 
*file,
 * to check that the address is actually valid, when atomic
 * usercopies are used, below.
 */
-   if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
+   if (unlikely(iov_iter_fault_in_readable(i, &bytes) && !bytes)) {
status = -EFAULT;
break;
}
@@ -2284,7 +2306,7 @@ again:
 * to check that the address is actually valid, when atomic
 * usercopies are used, below.
 */
-   if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
+   if (unlikely(iov_iter_fault_in_readable(i, &bytes)&& !bytes)) {
status = -EFAULT;
break;
}


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch] new aop loop fix

2007-06-13 Thread Dmitriy Monakhov
loop.c code itself is not perfect. In fact before Nick's patch
partial write was't possible. Assumption what write chunks are
always page aligned is realy weird ( see "index++" line).

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 4bab9b1..8726da5 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -215,7 +215,6 @@ static int do_lo_send_aops(struct loop_device *lo, struct 
bio_vec *bvec,
int len, ret;
 
mutex_lock(&mapping->host->i_mutex);
-   index = pos >> PAGE_CACHE_SHIFT;
offset = pos & ((pgoff_t)PAGE_CACHE_SIZE - 1);
bv_offs = bvec->bv_offset;
len = bvec->bv_len;
@@ -226,6 +225,7 @@ static int do_lo_send_aops(struct loop_device *lo, struct 
bio_vec *bvec,
struct page *page;
void *fsdata;
 
+   index = pos >> PAGE_CACHE_SHIFT;
IV = ((sector_t)index << (PAGE_CACHE_SHIFT - 9))+(offset >> 9);
size = PAGE_CACHE_SIZE - offset;
if (size > len)
@@ -255,7 +255,6 @@ static int do_lo_send_aops(struct loop_device *lo, struct 
bio_vec *bvec,
bv_offs += copied;
len -= copied;
offset = 0;
-   index++;
pos += copied;
}
ret = 0;

> 
> 

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: + fs-introduce-write_begin-write_end-and-perform_write-aops.patch added to -mm tree

2007-06-13 Thread Dmitriy Monakhov
On 14:19 Втр 29 Май , [EMAIL PROTECTED] wrote:
> 
> The patch titled
>  fs: introduce write_begin, write_end, and perform_write aops
> has been added to the -mm tree.  Its filename is
>  fs-introduce-write_begin-write_end-and-perform_write-aops.patch
> 
> *** Remember to use Documentation/SubmitChecklist when testing your code ***
> 
> See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
> out what to do about this
> 
> --
> Subject: fs: introduce write_begin, write_end, and perform_write aops
> From: Nick Piggin <[EMAIL PROTECTED]>
> 
> These are intended to replace prepare_write and commit_write with more
> flexible alternatives that are also able to avoid the buffered write
> deadlock problems efficiently (which prepare_write is unable to do).
> 
> [EMAIL PROTECTED]: API design contributions, code review and fixes]
> Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>
> Signed-off-by: Mark Fasheh <[EMAIL PROTECTED]>
> Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
I've finaly find time to review Nick's "write_begin/write_end aop" patch set.
And i have some fixes and questions. My be i've missed somthing and it was 
already disscussed, but i cant find in LKML.

1) loop dev:
loop.c code itself is not perfect. In fact before Nick's patch
partial write was't possible. Assumption what write chunks are
always page aligned is realy weird ( see "index++" line).
Fixed by "new aop loop fix" patch

2)block_write_begin:
After we enter to block_write_begin with *pagep == NULL and
some page was grabed we remember this page in *pagep
And if __block_prepare_write() we have to clear *pagep , as 
it was before. Because this may confuse caller.
for example caller may have folowing code:
ret = block_write_begin(..., pagep,...)
if (ret && *pagep != NULL) {
unlock_page(*pagep);
page_cache_release(*pagep);
}
Fixed my "new aop block_write_begin fix" patch

3) __page_symlink:
Nick's patch add folowing code:
+ err = pagecache_write_begin(NULL, mapping, 0,PAGE_CACHE_SIZE,
+ AOP_FLAG_UNINTERRUPTIBLE, &page,&fsdata);
symlink now consume whole page. I have only one question "WHY???".
I don't see any advantages, but where are huge list of
dissdvantages:
a) fs with blksize == 1k and pagesize == 16k after this patch
   waste more than 10x times disk space for nothing.
b) What happends if we want use fs with blksize == 4k on i386
   after it was used by ia64 ??? (before this patch it was
   possible).

I dont prepare patch for this because i dont understand issue
witch Nick aimed to fix.

4) iov_iter_fault_in_readable:
Function prerform check for signgle region, with out respect to
segment nature of iovec, For example writev no longer works :) :
writev(3, [{"\1", 1}, {"\2"..., 4096}], 2) = -1 EFAULT (Bad address)
this hidden bug, and it was invisiable because _fault_in_readable
return value was ignored before. Lets iov_iter_fault_in_readable
perform checks for all segments.
Fixed by :"iov_iter_fault_in_readable fix"

5) ext3_write_end:
Before  write_begin/write_end patch set we have folowing locking
order:
stop_journal(handle);
unlock_page(page);
But now order is oposite:
unlock_page(page);
stop_journal(handle);
Can we got any race condition now? I'm not sure is it actual problem,
may be somebody cant describe this.




-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext3: dirindex error pointer issues (b)

2007-03-11 Thread Dmitriy Monakhov
Dmitriy Monakhov <[EMAIL PROTECTED]> writes:

>  - ext3_dx_find_entry() exit with out setting proper error pointer
>  - do_split() exit with out setting proper error pointer
>it is realy painful because many callers contain folowing code:
>de = do_split(handle,dir, &bh, frame, &hinfo, &retval);
>if (!(de))
> return retval;
><<< WOW retval wasn't changed by do_split(), so caller failed
><<< but return SUCCESS :)
>  - Rearrange do_split() error path. Current error path is realy ugly, all
>this up and down jump stuff doesn't make code easy to understand.
Ohh my first patch change error message semantics in do_split(). Initially when 
ext3_append() failed we just exit without printing error. In fact ext3_append()
may fail, it is legal and it's happens qite often (ENOSPC for example). This 
cause annoying fake error message. So restore this semantic as it was before
patch.
Andrew please apply incremental patch what fix it:

Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>

diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
index 1a52586..7edb617 100644
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -1197,8 +1197,8 @@ journal_error:
brelse(*bh);
brelse(bh2);
*bh = NULL;
-errout:
ext3_std_error(dir->i_sb, err);
+errout:
*error = err;
return NULL;
 }
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index f0a6c26..02a75db 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1195,8 +1195,8 @@ journal_error:
brelse(*bh);
brelse(bh2);
*bh = NULL;
-errout:
ext4_std_error(dir->i_sb, err);
+errout:
*error = err;
return NULL;
 }

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext3: dirindex error pointer issues

2007-03-04 Thread Dmitriy Monakhov
Andreas Dilger <[EMAIL PROTECTED]> writes:

> On Mar 04, 2007  17:18 +0300, Dmitriy Monakhov wrote:
>>  - ext3_dx_find_entry() exit with out setting proper error pointer
>>  - do_split() exit with out setting proper error pointer
>>it is realy painful because many callers contain folowing code:
>>de = do_split(handle,dir, &bh, frame, &hinfo, &retval);
>>if (!(de))
>> return retval;
>><<< WOW retval wasn't changed by do_split(), so caller failed
>><<< but return SUCCESS :)
>>  - Rearrange do_split() error path. Current error path is realy ugly, all
>>this up and down jump stuff doesn't make code easy to understand.
>> 
>> Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>
>> ---
>>  fs/ext3/namei.c |   26 +++---
>>  fs/ext4/namei.c |   26 +++---
>>  2 files changed, 30 insertions(+), 22 deletions(-)
>> 
>> diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
>> index 49159f1..1a52586 100644
>> --- a/fs/ext3/namei.c
>> +++ b/fs/ext3/namei.c
>> @@ -969,6 +969,7 @@ static struct buffer_head * ext3_dx_find_entry(struct 
>> dentry *dentry,
>>(block<>+((char *)de - bh->b_data))) {
>>  brelse (bh);
>> +*err = ERR_BAD_DX_DIR;
>>  goto errout;
>>  }
>>  *res_dir = de;
>
> I have no objection to this change (by principle of least surprise) but
> I don't know if it fixes a real problem.  The one caller of this function
> treats ERR_BAD_DX_DIR the same as bh == NULL.
  Execly  ERR_BAD_DX_DIR treats  the same as bh == NULL and let's look at
  callers code: 
  struct buffer_head * ext3_find_entry(..)
  {
  ...
  bh = ext3_dx_find_entry(dentry, res_dir, &err);
/*
 * On success, or if the error was file not found,
 * return.  Otherwise, fall back to doing a search the
 * old fashioned way.
 */
if (bh || (err != ERR_BAD_DX_DIR))
  <<<<< after ext3_dx_find_entry() has failed , but don't set err pointer 
  <<<<< we get  (bh == NULL, err == 0) , and then just return NULL.
return bh;
 ...
 }

>
>> @@ -1134,9 +1135,9 @@ static struct ext3_dir_entry_2 *do_split(handle_t 
>> *handle, struct inode *dir,
>>  char *data1 = (*bh)->b_data, *data2;
>>  unsigned split;
>>  struct ext3_dir_entry_2 *de = NULL, *de2;
>> -int err;
>> +int err = 0;
>>  
>> -bh2 = ext3_append (handle, dir, &newblock, error);
>> +bh2 = ext3_append (handle, dir, &newblock, &err);
>
> Why don't we just remove "err" entirely and use *error to avoid any risk
> of setting err and not returning it in error?  Also reduces stack usage
> that tiny little bit.
I've chosen this approuch becouse of:
1) this approuch was used in block allocation code.
2) this approuch prevent  folowing errors:
  *error = do_some_function(.)
   /* some code*/
  if(error) 
 we do this checking as we do it thousands times before, but forget
 what error it pointer here. And compiler can't warn us here because 
 it is absolutely legal operation. At least it is better to rename 
 error to errorp.

Btw: I've thought about adding assertations to error path witch may check
what errp pointer was properly initialized after error happends.
folowing code is draft only:
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -63,6 +63,7 @@ static struct buffer_head *ext3_append(handle_t *handle,
EXT3_I(inode)->i_disksize = inode->i_size;
ext3_journal_get_write_access(handle,bh);
}
+   assert(bh || *err);
return bh;
 }
 
@@ -433,6 +434,7 @@ fail2:
frame--;
}
 fail:
+   assert(*err != 0);
return NULL;
 }

>
>
> In ext3_dx_add_entry() it appears like we should "goto journal_error"
> to report an error after the failed call to do_split(), instead of just
> "goto cleanup" so that we report an error in the filesystem.  Not 100%
> sure of this.
do_split() already invoked ext3_std_error() on it's "journal_error" path.

>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ext3: dirindex error pointer issues

2007-03-04 Thread Dmitriy Monakhov
 - ext3_dx_find_entry() exit with out setting proper error pointer
 - do_split() exit with out setting proper error pointer
   it is realy painful because many callers contain folowing code:
   de = do_split(handle,dir, &bh, frame, &hinfo, &retval);
   if (!(de))
return retval;
   <<< WOW retval wasn't changed by do_split(), so caller failed
   <<< but return SUCCESS :)
 - Rearrange do_split() error path. Current error path is realy ugly, all
   this up and down jump stuff doesn't make code easy to understand.

Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>
---
 fs/ext3/namei.c |   26 +++---
 fs/ext4/namei.c |   26 +++---
 2 files changed, 30 insertions(+), 22 deletions(-)

diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
index 49159f1..1a52586 100644
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -969,6 +969,7 @@ static struct buffer_head * ext3_dx_find_entry(struct 
dentry *dentry,
  (blockb_data, *data2;
unsigned split;
struct ext3_dir_entry_2 *de = NULL, *de2;
-   int err;
+   int err = 0;
 
-   bh2 = ext3_append (handle, dir, &newblock, error);
+   bh2 = ext3_append (handle, dir, &newblock, &err);
if (!(bh2)) {
brelse(*bh);
*bh = NULL;
@@ -1145,14 +1146,9 @@ static struct ext3_dir_entry_2 *do_split(handle_t 
*handle, struct inode *dir,
 
BUFFER_TRACE(*bh, "get_write_access");
err = ext3_journal_get_write_access(handle, *bh);
-   if (err) {
-   journal_error:
-   brelse(*bh);
-   brelse(bh2);
-   *bh = NULL;
-   ext3_std_error(dir->i_sb, err);
-   goto errout;
-   }
+   if (err)
+   goto journal_error;
+
BUFFER_TRACE(frame->bh, "get_write_access");
err = ext3_journal_get_write_access(handle, frame->bh);
if (err)
@@ -1195,8 +1191,16 @@ static struct ext3_dir_entry_2 *do_split(handle_t 
*handle, struct inode *dir,
goto journal_error;
brelse (bh2);
dxtrace(dx_show_index ("frame", frame->entries));
-errout:
return de;
+
+journal_error:
+   brelse(*bh);
+   brelse(bh2);
+   *bh = NULL;
+errout:
+   ext3_std_error(dir->i_sb, err);
+   *error = err;
+   return NULL;
 }
 #endif
 
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index e7e1d79..682a3d7 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -967,6 +967,7 @@ static struct buffer_head * ext4_dx_find_entry(struct 
dentry *dentry,
  (blockb_data, *data2;
unsigned split;
struct ext4_dir_entry_2 *de = NULL, *de2;
-   int err;
+   int err = 0;
 
-   bh2 = ext4_append (handle, dir, &newblock, error);
+   bh2 = ext4_append (handle, dir, &newblock, &err);
if (!(bh2)) {
brelse(*bh);
*bh = NULL;
@@ -1143,14 +1144,9 @@ static struct ext4_dir_entry_2 *do_split(handle_t 
*handle, struct inode *dir,
 
BUFFER_TRACE(*bh, "get_write_access");
err = ext4_journal_get_write_access(handle, *bh);
-   if (err) {
-   journal_error:
-   brelse(*bh);
-   brelse(bh2);
-   *bh = NULL;
-   ext4_std_error(dir->i_sb, err);
-   goto errout;
-   }
+   if (err)
+   goto journal_error;
+
BUFFER_TRACE(frame->bh, "get_write_access");
err = ext4_journal_get_write_access(handle, frame->bh);
if (err)
@@ -1193,8 +1189,16 @@ static struct ext4_dir_entry_2 *do_split(handle_t 
*handle, struct inode *dir,
goto journal_error;
brelse (bh2);
dxtrace(dx_show_index ("frame", frame->entries));
-errout:
return de;
+
+journal_error:
+   brelse(*bh);
+   brelse(bh2);
+   *bh = NULL;
+errout:
+   ext4_std_error(dir->i_sb, err);
+   *error = err;
+   return NULL;
 }
 #endif
 
-- 
1.5.0.1


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH][RFC] ext3: Handle ext[34]_journal_stop() failure

2007-02-28 Thread Dmitriy Monakhov

Where are many places where _journal_stop() return code wasn't
checked. Off cause _journal_stop() failed very rarely (and usually
with fatal consequences), but this does'n meen it should not be checked.
For example most retry loops looks like follows:
ext3_journal_stop(handle);
if (err == -ENOSPC && ext3_should_retry_alloc(dir->i_sb, &retries))
goto retry;
It is realy insane do "retry" if ext3_journal_stop() has failed, because
this may increase damage in case of real problem.

Signed-off-by: Monakhov Dmitriy <[EMAIL PROTECTED]>
---
 fs/ext3/acl.c   |   12 
 fs/ext3/inode.c |   22 +-
 fs/ext3/ioctl.c |   12 
 fs/ext3/namei.c |   48 
 fs/ext3/xattr.c |4 ++--
 fs/ext4/acl.c   |   12 
 fs/ext4/inode.c |   22 +-
 fs/ext4/ioctl.c |   12 
 fs/ext4/namei.c |   48 
 fs/ext4/xattr.c |4 ++--
 10 files changed, 134 insertions(+), 62 deletions(-)

diff --git a/fs/ext3/acl.c b/fs/ext3/acl.c
index 1e5038d..bbf4d8a 100644
--- a/fs/ext3/acl.c
+++ b/fs/ext3/acl.c
@@ -375,7 +375,7 @@ int
 ext3_acl_chmod(struct inode *inode)
 {
struct posix_acl *acl, *clone;
-int error;
+   int error, error2;
 
if (S_ISLNK(inode->i_mode))
return -EOPNOTSUPP;
@@ -402,7 +402,9 @@ ext3_acl_chmod(struct inode *inode)
goto out;
}
error = ext3_set_acl(handle, inode, ACL_TYPE_ACCESS, clone);
-   ext3_journal_stop(handle);
+   error2 = ext3_journal_stop(handle);
+   if (error2 && (!error || error == -ENOSPC))
+   error = error2;
if (error == -ENOSPC &&
ext3_should_retry_alloc(inode->i_sb, &retries))
goto retry;
@@ -485,7 +487,7 @@ ext3_xattr_set_acl(struct inode *inode, int type, const 
void *value,
 {
handle_t *handle;
struct posix_acl *acl;
-   int error, retries = 0;
+   int error, error2, retries = 0;
 
if (!test_opt(inode->i_sb, POSIX_ACL))
return -EOPNOTSUPP;
@@ -509,7 +511,9 @@ retry:
if (IS_ERR(handle))
return PTR_ERR(handle);
error = ext3_set_acl(handle, inode, type, acl);
-   ext3_journal_stop(handle);
+   error2 = ext3_journal_stop(handle);
+   if (error2 && (!error || error == -ENOSPC))
+   error = error2;
if (error == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
goto retry;
 
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 816f0c5..68d1c71 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -962,7 +962,9 @@ static int ext3_get_block(struct inode *inode, sector_t 
iblock,
 * Huge direct-io writes can hold off commits for long
 * periods of time.  Let this commit run.
 */
-   ext3_journal_stop(handle);
+   ret = ext3_journal_stop(handle);
+   if (ret)
+   goto get_block;
handle = ext3_journal_start(inode, DIO_CREDITS);
if (IS_ERR(handle))
ret = PTR_ERR(handle);
@@ -2998,7 +3000,13 @@ int ext3_setattr(struct dentry *dentry, struct iattr 
*attr)
if (attr->ia_valid & ATTR_GID)
inode->i_gid = attr->ia_gid;
error = ext3_mark_inode_dirty(handle, inode);
-   ext3_journal_stop(handle);
+   if (error) {
+   ext3_journal_stop(handle);
+   goto err_out;
+   }
+   error = ext3_journal_stop(handle);
+   if (error)
+   goto err_out;
}
 
if (S_ISREG(inode->i_mode) &&
@@ -3016,7 +3024,9 @@ int ext3_setattr(struct dentry *dentry, struct iattr 
*attr)
rc = ext3_mark_inode_dirty(handle, inode);
if (!error)
error = rc;
-   ext3_journal_stop(handle);
+   rc = ext3_journal_stop(handle);
+   if (!error)
+   error = rc;
}
 
rc = inode_setattr(inode, attr);
@@ -3231,7 +3241,7 @@ int ext3_change_inode_journal_flag(struct inode *inode, 
int val)
 {
journal_t *journal;
handle_t *handle;
-   int err;
+   int err, err2;
 
/*
 * We have to be very careful here: changing a data block's
@@ -3274,7 +3284,9 @@ int ext3_change_inode_journal_flag(struct inode *inode, 
int val)
 
err = ext3_mark_inode_dirty(handle, inode);
handle->h_sync = 1;
-   ext3_journal_stop(handle);
+   err2 = ext3_journal_stop(handle);
+   if (!err)
+   err = err2;
ext3_std_error(inode->i_sb, err);
 
return err;
diff --git a/fs/ext3/ioctl.c b/fs/ext3

[PATCH 1/1][RFC] EXT34 retry loop issue V(2)

2007-02-12 Thread Dmitriy Monakhov

Patch depends on : "[PATCH 0/1][RFC]  prepare_write positive return value V(2)"

This patch solve ext3/4 retry loop issue.
Issue description:
What we can do if block_prepare_write fail inside ext3_prepare_write ?
a) Stop transaction and do retry if possible, but what happend if 
   reboot comes after journal_force_commit, but before we exhaust
   all retry attempts and generic_file_buffered_write call trancate? 
   We may get allocated blocks outside i_size. Witch is bad.

b) Commit newly allocated blocks. This approach was used after Andrey's patch.
   But if reboot comes, or error happend, user will be surprised that i_size
   increased but blocks are zero filed. This is internal write operation state
   becames visiable to user. Witch is also bad. 
   
c) Just return as match bytes as we can deal with rigth now back to 
   caller, and let's caller handles it and than call commit. In this scenario
   we never stop journal in the midle of some internal fs operation.
   If reboot comes before commit trunsaction was't closed so it will 
   be droped while journal replay.

Only (c) tend to garantie attomic data operation.

Also fix some issues introduced by 
retries-in-ext3_prepare_write-violate-ordering-requirements:
  i_size may increase after error happend.
  possible data corruption caused commiting non uptodate bh.

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
-
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index dba6dd2..4c5e9f7 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1154,6 +1154,18 @@ static int do_journal_get_write_access(handle_t *handle,
  * transaction must include the content of the newly allocated blocks.
  * This content is expected to be set to zeroes by block_prepare_write().
  * 2006/10/14  SAW
+ * What we can do if some blocks was allocated?
+ * a) Stop transaction and do retry if possible, but what happend if
+ *reboot comes after journal_force_commit, but before we exhaust
+ *all retry attempts and caller call trancate?
+ *We may get allocated blocks outside i_size. Witch is bad.
+ * b) Commit newly allocated blocks. And if reboot comes, user will be
+ *surprised that i_size increased but blocks are zero filed. Witch is
+ *also bad.
+ * c) Just return as match bytes as we can deal with rigth now back to
+ *caller, and if reboot comes trunsaction was't closed so it will
+ *be droped while journal replay.
+ * Only (c) tend to garantie attomic data operation.
  */
 static int ext3_prepare_failure(struct file *file, struct page *page,
unsigned from, unsigned to)
@@ -1186,7 +1198,7 @@ skip:
block_start = to;
break;
}
-   if (!buffer_mapped(bh))
+   if (!buffer_mapped(bh) || !buffer_uptodate(bh))
/* prepare_write failed on this bh */
break;
if (ext3_should_journal_data(mapping->host)) {
@@ -1204,8 +1216,8 @@ skip:
if (block_start <= from)
goto skip;
 
-   /* commit allocated and zeroed buffers */
-   return mapping->a_ops->commit_write(file, page, from, block_start);
+   /* return number of bytes till last mapped && uptodate block */
+   return block_start - from;
 }
 
 static int ext3_prepare_write(struct file *file, struct page *page,
@@ -1239,7 +1251,8 @@ retry:
 
 failure:
ret2 = ext3_prepare_failure(file, page, from, to);
-   if (ret2 < 0)
+   if (ret2)
+   /* ready to partial write, or fatal error */
return ret2;
if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
goto retry;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 806eee1..da39f80 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1153,6 +1153,18 @@ static int do_journal_get_write_access(handle_t *handle,
  * transaction must include the content of the newly allocated blocks.
  * This content is expected to be set to zeroes by block_prepare_write().
  * 2006/10/14  SAW
+ * What we can do if some blocks was allocated?
+ * a) Stop transaction and do retry if possible, but what happend if
+ *reboot comes after journal_force_commit, but before we exhaust
+ *all retry attempts and caller call trancate?
+ *We may get allocated blocks outside i_size. Witch is bad.
+ * b) Commit newly allocated blocks. And if reboot comes, user will be
+ *surprised that i_size increased but blocks are zero filed. Witch is
+ *also bad.
+ * c) Just return as match bytes as we can deal with rigth now back to
+ *caller, and if reboot comes trunsaction was't closed so it will
+ *be droped while journal replay.
+ * Only (c) tend to garantie attomic data operation.
  */
 static int ext4_prepare_failure(struct file *file, struct page *page,
unsigned from, un

[PATCH 0/1][RFC] prepare_write positive return value V(2)

2007-02-12 Thread Dmitriy Monakhov
Changes from ver(1)
 - __page_symlink(): In order to be on a safe side add explicit zeroing content
 before fail (just in case).
 -do_lo_send_aops(): If prepare_write can't handle total size of bytes requested
 from loop_dev it is safer to fail.
 -  pipe_to_file():  Limit the size of the copy to size wich prepare_write ready
 to handle. instead of failing.
 - ecryptfs: was't changed in this version, because where is no even 
 AOP_TRUNCATED_PAGE handling code where.

Almost all read/write operation handles data with chunks(segments or pages)
and result has integral behaviour for folowing scenario: 
for_each_chunk() {
 res = op();
 if(IS_ERROR(res))
   return progress ? progress : res;
 progress += res;
}
prepare_write may has integral behaviour in case of blksize < pgsize,
for example we successfully allocated/read some blocks, but not all of them,
and than some error happend. Currently we eliminate this progress by doing
vmtrunate() after prepare_has failed.
It is good to have ability to signal about this progress. Interprete positive
prepare_write() ret code as bytes num that fs ready to handle at this moment.
 
BTH: This approach was used in OpenVZ 2.6.9 kernel in order to make FS with 
delayed allocation more correct, and its works well.
I think not everybody will happy about this,  but let's discuss all advantages
and disadvantages of this change.

fixes not dirrectly connected with proposed prepare_write semantic changes:
 __page_symlink : If find_or_create_page has failed on second retry attempt
  function will exit with wrong error code.
 __generic_cont_expand: Add correct AOP_TRUNCATED_PAGE handling.

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
-
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 6b5b642..dc332dc 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -224,6 +224,7 @@ static int do_lo_send_aops(struct loop_device *lo, struct 
bio_vec *bvec,
sector_t IV;
unsigned size;
int transfer_result;
+   int partial = 0;
 
IV = ((sector_t)index << (PAGE_CACHE_SHIFT - 9))+(offset >> 9);
size = PAGE_CACHE_SIZE - offset;
@@ -239,11 +240,20 @@ static int do_lo_send_aops(struct loop_device *lo, struct 
bio_vec *bvec,
page_cache_release(page);
continue;
}
-   goto unlock;
+   if (ret > 0 && ret < PAGE_CACHE_SIZE) {
+   /* 
+* ->prepare_write() can't handle total size of bytes
+* requested. In fact this is not likely to happen, 
+* because we request one block only.
+*/
+   partial = 1;
+   size = ret;
+   } else
+   goto unlock;
}
transfer_result = lo_do_transfer(lo, WRITE, page, offset,
bvec->bv_page, bv_offs, size, IV);
-   if (unlikely(transfer_result)) {
+   if (unlikely(transfer_result || partial)) {
char *kaddr;
 
/*
@@ -266,7 +276,7 @@ static int do_lo_send_aops(struct loop_device *lo, struct 
bio_vec *bvec,
}
goto unlock;
}
-   if (unlikely(transfer_result))
+   if (unlikely(transfer_result || partial))
goto unlock;
bv_offs += size;
len -= size;
diff --git a/fs/buffer.c b/fs/buffer.c
index 1ad674f..ed06560 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2027,13 +2027,20 @@ static int __generic_cont_expand(struct inode *inode, 
loff_t size,
if (size > inode->i_sb->s_maxbytes)
goto out;
 
+retry:
err = -ENOMEM;
page = grab_cache_page(mapping, index);
if (!page)
goto out;
err = mapping->a_ops->prepare_write(NULL, page, offset, offset);
+   if (err == AOP_TRUNCATED_PAGE) {
+   page_cache_release(page);
+   goto retry;
+   }
if (err) {
/*
+* ->prepare_write() called with from==to and must not 
+* return positive size of bytes in this case.
 * ->prepare_write() may have instantiated a few blocks
 * outside i_size.  Trim these off again.
 */
@@ -2044,6 +2051,10 @@ static int __generic_cont_expand(struct inode *inode, 
loff_t size,
}
 
err = mapping->a_ops->commit_write(NULL, page, offset, offset);
+   if (err == AO

[PATCH] :EXT[3,4] jbd layer function called instead of fs specific one

2007-01-28 Thread Dmitriy Monakhov
jbd function called instead of fs specific one.

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
--
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index beaf25f..8a824f4 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -947,7 +947,7 @@ out:
 static int ext3_get_block(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create)
 {
-   handle_t *handle = journal_current_handle();
+   handle_t *handle = ext3_journal_current_handle();
int ret = 0;
unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
 
@@ -1717,7 +1717,7 @@ static ssize_t ext3_direct_IO(int rw, st
/*
 * Reacquire the handle: ext3_get_block() can restart the transaction
 */
-   handle = journal_current_handle();
+   handle = ext3_journal_current_handle();
 
 out_stop:
if (handle) {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index a127cc0..fbff4b9 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -946,7 +946,7 @@ out:
 static int ext4_get_block(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create)
 {
-   handle_t *handle = journal_current_handle();
+   handle_t *handle = ext4_journal_current_handle();
int ret = 0;
unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
 
@@ -1716,7 +1716,7 @@ static ssize_t ext4_direct_IO(int rw, st
/*
 * Reacquire the handle: ext4_get_block() can restart the transaction
 */
-   handle = journal_current_handle();
+   handle = ext4_journal_current_handle();
 
 out_stop:
if (handle) {


Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race

2007-01-14 Thread Dmitriy Monakhov
Eric Sandeen <[EMAIL PROTECTED]> writes:

> I've been looking at a case where many threads are opening, unlinking, and
> hardlinking files on ext3 .
How many concurent threads do you use and how long does it takes to trigger 
this race? I've tried to reproduce this with two threads, but not succeed.
  
fd = create("src")
close(fd)
unlink("src")

link("src", "dst")
unlink("dst")

Original testcase will be the best answer :).
Thanks.
>  At unmount time I see an oops, because the superblock's
> orphan list points to a freed inode.
>
> I did some tracing of the inodes, and it looks like this:
>
>   ext3_unlink():[/src/linux-2.6.18/fs/ext3/namei.c:2123] adding orphan
>   i_state:0x7 cpu:1 i_count:2 i_nlink:0
>
>   ext3_orphan_add():[/src/linux-2.6.18/fs/ext3/namei.c:1890] ext3_orphan_add
>   i_state:0x7 cpu:1 i_count:2 i_nlink:0
>
>   iput():[/src/linux-2.6.18/fs/inode.c:1139] iput enter
>   i_state:0x7 cpu:1 i_count:2 i_nlink:0
>
>   ext3_link():[/src/linux-2.6.18/fs/ext3/namei.c:2202] ext3_link enter
>   i_state:0x7 cpu:3 i_count:1 i_nlink:0
>
>   ext3_inc_count():[/src/linux-2.6.18/fs/ext3/namei.c:1627] done
>   i_state:0x7 cpu:3 i_count:1 i_nlink:1
>
> The unlink gets there first, finds i_count > 0 (in use) but nlink goes to 0, 
> so
> it puts it on the orphan inode list.  Then link comes along, and bumps the 
> link
> back up to 1.  So now we are on the orphan inode list, but we are not 
> unlinked.
>
> Eventually when count goes to 0, and we still have 1 link, again no action is
> taken to remove the inode from the orphan list, because it is still linked 
> (i.e.
> we don't go through ext3_delete())
>
> When this inode is eventually freed, the sb orphan list gets corrupted, 
> because 
> we have freed it without first removing it from the orphan list.
>
> I think the simple solution is to remove the inode from the orphan list
> when we bump the link back up from 0 to 1.  I put that test in there because
> there are other potential reasons that we might be on the list (truncates,
> direct IO).
>
> Comments?
>
> Thanks,
> -Eric
>
> p.s. ext3_inc_count and ext3_dec_count seem misnamed, have an unused
> arg, and are very infrequently called.  I'll probably submit a patch
> to just put the single line of code into the caller, too.
>
> ---
>
> Remove inode from the orphan list in ext3_link() if we might have
> raced with ext3_unlink(), which potentially put it on the list.
> If we're on the list with nlink > 0, we'll never get cleaned up
> properly and eventually may corrupt the list.
>
> Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>
>
> Index: linux-2.6.19/fs/ext3/namei.c
> ===
> --- linux-2.6.19.orig/fs/ext3/namei.c
> +++ linux-2.6.19/fs/ext3/namei.c
> @@ -2204,6 +2204,9 @@ retry:
>   inode->i_ctime = CURRENT_TIME_SEC;
>   ext3_inc_count(handle, inode);
>   atomic_inc(&inode->i_count);
> + /* did we race w/ unlink? */
> + if (inode->i_nlink == 1)
> + ext3_orphan_del(handle, inode);
>  
>   err = ext3_add_nondir(handle, dentry, inode);
>   ext3_journal_stop(handle);
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html