Re: 2.6.18-mm2: ext3 BUG?

2006-10-23 Thread Jan-Benedict Glaw
On Wed, 2006-10-11 12:42:02 +0200, Jan Kara [EMAIL PROTECTED] wrote:
  On Sun, 2006-10-08 08:33:30 +0200, Jan-Benedict Glaw [EMAIL PROTECTED] 
  wrote:
  While I could reproduce it with a 200MB file, it seems I can't break
  it with a 10MB file.
   Hmm, I was running the test for several ours without any problem...
 The kernel is 2.6.17.6, ext3 in ordered data mode, standard SATA disk. I'm
 now running it again and trying my luck ;). What is your testing environment?

kolbe34-backup:/mnt# uname -a
Linux kolbe34-backup 2.6.17-2-686 #1 SMP Wed Sep 13 16:34:10 UTC 2006 i686 
GNU/Linux
kolbe34-backup:/mnt# cat /proc/cpuinfo 
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 7
model name  : Pentium III (Katmai)
stepping: 3
cpu MHz : 448.674
cache size  : 512 KB
fdiv_bug: no
hlt_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception   : yes
cpuid level : 2
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat 
pse36 mmx fxsr sse up
bogomips: 898.38
kolbe34-backup:/mnt# grep -i preem /boot/config-2.6.17-2-686 
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_BKL is not set
kolbe34-backup:/mnt# lspci 
00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge 
(rev 03)
00:01.0 PCI bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge 
(rev 03)
00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 02)
00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
00:07.2 USB Controller: Intel Corporation 82371AB/EB/MB PIIX4 USB (rev 01)
00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 02)
00:0e.0 Ethernet controller: Advanced Micro Devices [AMD] 79c970 [PCnet32 
LANCE] (rev 16)
01:00.0 VGA compatible controller: ATI Technologies Inc 3D Rage Pro AGP 1X/2X 
(rev 5c)
kolbe34-backup:/mnt# lspci -n
00:00.0 0600: 8086:7190 (rev 03)
00:01.0 0604: 8086:7191 (rev 03)
00:07.0 0601: 8086:7110 (rev 02)
00:07.1 0101: 8086:7111 (rev 01)
00:07.2 0c03: 8086:7112 (rev 01)
00:07.3 0680: 8086:7113 (rev 02)
00:0e.0 0200: 1022:2000 (rev 16)
01:00.0 0300: 1002:4742 (rev 5c)
kolbe34-backup:~# hdparm -i /dev/hdb

/dev/hdb:

 Model=ST3300822A, FwRev=3.AAE, SerialNo=5NF24YCN
 Config={ HardSect NotMFM HdSw15uSec Fixed DTR10Mbs RotSpdTol.5% }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
 BuffType=unknown, BuffSize=8192kB, MaxMultSect=16, MultSect=off
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=268435455
 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4 
 DMA modes:  mdma0 mdma1 mdma2 
 UDMA modes: udma0 udma1 *udma2 udma3 udma4 udma5 
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: Unspecified:  ATA/ATAPI-1 ATA/ATAPI-2 ATA/ATAPI-3 
ATA/ATAPI-4 ATA/ATAPI-5 ATA/ATAPI-6 ATA/ATAPI-7

 * signifies the current active mode


Still running Debian's 2.6.17-2-686, I'm now tracking down the file
size when I start to see this type of corruption. Right now, it seems
I never get it with a 16384 KB (16 MB) large file, but I get it with a
21504 KB (21 MB) file.

Is there something important that changes handling of file contents in
the 16..21 MB range?

dumpe2fs output at http://lug-owl.de/~jbglaw/ext3-dumpe2fs.txt for
that filesystem.  I'll now run with a 18.5 MB file...

MfG, JBG

-- 
  Jan-Benedict Glaw  [EMAIL PROTECTED]  +49-172-7608481
  Signature of:   Wenn ich wach bin, träume ich.
  the second  :


signature.asc
Description: Digital signature


Re: [RFC] Ext3 online defrag

2006-10-23 Thread Theodore Tso
On Mon, Oct 23, 2006 at 02:27:10PM +0200, Jan Kara wrote:
   Hello,
 
   I've written a simple patch implementing ext3 ioctl for file
 relocation. Basically you call ioctl on a file, give it list of blocks
 and it relocates the file into given blocks (provided they are still
 free). The idea is to use it as a kernel part of ext3 online
 defragmenter (or generally disk access optimizer). Now I don't have the
 user space part that finds larger runs of free blocks and so on so that
 it can really be used as a defragmenter. I just send this as a kind of
 proof-of-concept to hear some comments. Attached is also a simple
 program that demonstrates the use of the ioctl.

As a suggestion, I would pass the inode number and inode generation
number into the ext3_file_mode_data array:

struct ext3_file_move_data {
int extents;
struct ext3_reloc_extent __user *ext_array;
};

This will be much more efficient for the userspace relocator, since it
won't need to translate from an inode number to a pathname, and then
try to open the file before relocating it.

I'd also use an explicit 64-bit block numbers type so that we don't
have to worry about the ABI changing when we support 64-bit block
numbers.


The other problem I see with this patch is that there will be cache
coherency problems between the buffer cache and the page cache.  I
think you will want to pull the data blocks of the file into the page
cache, and then write them out from the page cache, and only *then*
update the indirect blocks and commit the transaction.

So what needs to happen is the following:

1) Validate that inode and generation number.  Make sure the new
(destination) blocks passed in are valid and not in use.  Allocate
them to prevent anyone else from using those blocks.

2) Pull the blocks into the page cache (if they are not already
there), and the write them out to the new location on disk.  If any of
the I/O's fail, abort.

3) Update the indirect blocks or extent tree to point at the newly
allocated and copied data blocks.

In the current patch, it looks like you add the inode being relocated
to the orphan list, and then update the direct/indirect blocks first
--- and if you fail the inode gets truncated.  That's bad since we
don't want to lose any data if we crash in the middle of the defrag
operation

Great to see that you're working on this problem!  I'd love to see
this functionality into ext4.

Regards,

- Ted

P.S.  There is also the question of whether we'll be able to get this
interface past the ioctl() police, but the atomicity requirements of
such an interface are a poster child for why we really, REALLY, can't
do this via a sysfs interface.  We might be forced to create a new
filesystem, or create a pseudo inode which we open via a magic
pathname, though.  That in my opinion is uglier than an ioctl, but the
ioctl police really don't like the problem of needing to maintain
32/64 bit translation functions, and this interface would surely cause
problems for the x86_64 and PPC platforms, since they have to support
32-bit and 64-bit system ABI's.


-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Ext3 online defrag

2006-10-23 Thread Jan Kara
  Hello,

I've written a simple patch implementing ext3 ioctl for file
  relocation. Basically you call ioctl on a file, give it list of blocks
  and it relocates the file into given blocks (provided they are still
  free). The idea is to use it as a kernel part of ext3 online
  defragmenter (or generally disk access optimizer). Now I don't have the
  user space part that finds larger runs of free blocks and so on so that
  it can really be used as a defragmenter. I just send this as a kind of
  proof-of-concept to hear some comments. Attached is also a simple
  program that demonstrates the use of the ioctl.
 
 As a suggestion, I would pass the inode number and inode generation
 number into the ext3_file_mode_data array:
 
 struct ext3_file_move_data {
   int extents;
   struct ext3_reloc_extent __user *ext_array;
 };
 
 This will be much more efficient for the userspace relocator, since it
 won't need to translate from an inode number to a pathname, and then
 try to open the file before relocating it.
  Hmm, I was also thinking about it. Probably you're right. It just
seemed elegant to call ioctl on a file and *plop* it's relocated ;).

 I'd also use an explicit 64-bit block numbers type so that we don't
 have to worry about the ABI changing when we support 64-bit block
 numbers.
  Right, will fix.

 The other problem I see with this patch is that there will be cache
 coherency problems between the buffer cache and the page cache.  I
 think you will want to pull the data blocks of the file into the page
 cache, and then write them out from the page cache, and only *then*
 update the indirect blocks and commit the transaction.
  Hmm, I thought I got this right. We build a new tree, copy all data to
it (no writes happen so trees remain consistent), we switch block
pointers from inode. So from now on, any get_block() will correctly
return new block number and block will be read from disk (hmm, probably
I'm missing sync after writing out all the data). Now we call
invalidate_inode_pages2() so all buffers mapped to old blocks are freed
from memory. So there should not be problems with this... OTOH doing the
data copy via page-cache (of the temporarily set-up inode) should not be
a big problem either and we can avoid one sync which should be a win.

 So what needs to happen is the following:
 
 1) Validate that inode and generation number.  Make sure the new
 (destination) blocks passed in are valid and not in use.  Allocate
 them to prevent anyone else from using those blocks.
 
 2) Pull the blocks into the page cache (if they are not already
 there), and the write them out to the new location on disk.  If any of
 the I/O's fail, abort.
 
 3) Update the indirect blocks or extent tree to point at the newly
 allocated and copied data blocks.
 
 In the current patch, it looks like you add the inode being relocated
 to the orphan list, and then update the direct/indirect blocks first
  No, I create temporary inode that holds allocated blocks and that is
added to the orphan list. Hence if we crash in the middle of relocation,
all blocks are correctly freed.

 --- and if you fail the inode gets truncated.  That's bad since we
 don't want to lose any data if we crash in the middle of the defrag
 operation
 
 Great to see that you're working on this problem!  I'd love to see
 this functionality into ext4.
  Thanks for comments.

 P.S.  There is also the question of whether we'll be able to get this
 interface past the ioctl() police, but the atomicity requirements of
 such an interface are a poster child for why we really, REALLY, can't
 do this via a sysfs interface.  We might be forced to create a new
 filesystem, or create a pseudo inode which we open via a magic
 pathname, though.  That in my opinion is uglier than an ioctl, but the
 ioctl police really don't like the problem of needing to maintain
 32/64 bit translation functions, and this interface would surely cause
 problems for the x86_64 and PPC platforms, since they have to support
 32-bit and 64-bit system ABI's.
  Umm, yes. I'm open to suggestions with respect to which interface to
choose. ioctl() was just the easiest to code ;).

Bye
Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Ext3 online defrag

2006-10-23 Thread Andreas Dilger
On Oct 23, 2006  18:31 +0400, Alex Tomas wrote:
 isn't that a kernel responsbility to find/allocate target blocks?
 wouldn't it better to specify desirable target group and minimal
 acceptable chunk of free blocks?

In some cases this is useful (e.g. if file has small fragments after
being written in small pieces or in a fragmented free space).  In other
cases the user tool HAS to be able to specify the new mapping in order
to make progress.

Consider if there are two very large fragmented files and user-space
defrag tool wants to make contiguous free space.  If kernel is left to do
allocation it will always consume the largest chunk of free space first,
even if it is not yet optimal (e.g. large 1MB aligned extent).

I would make this interface optionally allow the target extent to be
specified, but if target block == 0 then the kernel is free to do its
own allocation.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Ext3 online defrag

2006-10-23 Thread Jan Kara
  Theodore Tso (TT) writes:
 
  TT On Mon, Oct 23, 2006 at 02:27:10PM +0200, Jan Kara wrote:
   Hello,
   
   I've written a simple patch implementing ext3 ioctl for file
   relocation. Basically you call ioctl on a file, give it list of blocks
   and it relocates the file into given blocks (provided they are still
   free). The idea is to use it as a kernel part of ext3 online
   defragmenter (or generally disk access optimizer). 
 
 isn't that a kernel responsbility to find/allocate target blocks?
 wouldn't it better to specify desirable target group and minimal
 acceptable chunk of free blocks?
  Kernel definitely allocates those blocks (because it's the only
reasonably race-free way). The problem of finding those blocks is a bit
harder - it may be quite complicated decision where to put the file
(also given, that sometimes you may need to shift away some file to
make space for some other one). Also what I'm aiming for is, that
userspace defragmenter could be fed some access patterns and it
optimizes layout of several files to speedup startup (i.e. blocks of
those several files would be interleaved so that their sequence is close
to the one seen during start-up).

Honza
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Ext3 online defrag

2006-10-23 Thread Jan Kara
 On Oct 23, 2006  18:31 +0400, Alex Tomas wrote:
 I would make this interface optionally allow the target extent to be
 specified, but if target block == 0 then the kernel is free to do its
 own allocation.
  That's a good idea! I'll change the handling so that if block==0 we
just allocate blocks of given extent as we wish...

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Ext3 online defrag

2006-10-23 Thread Eric Sandeen

Alex Tomas wrote:

Theodore Tso (TT) writes:


 TT On Mon, Oct 23, 2006 at 02:27:10PM +0200, Jan Kara wrote:
  Hello,
  
  I've written a simple patch implementing ext3 ioctl for file

  relocation. Basically you call ioctl on a file, give it list of blocks
  and it relocates the file into given blocks (provided they are still
  free). The idea is to use it as a kernel part of ext3 online
  defragmenter (or generally disk access optimizer). 


isn't that a kernel responsbility to find/allocate target blocks?
wouldn't it better to specify desirable target group and minimal
acceptable chunk of free blocks?


XFS does this by allocating new blocks for a temporary file (initiated from 
userspace, implemented in kernelspace of course), then just checks to see if the 
result is better than what we had before; if so, then swap the storage space  
throw away the temporary file (which now has the original, more-fragmented file 
blocks).


see xfs_swapext() in xfs_dfrag.c for the extent swapping part of this.

You probably want to avoid the page cache in all of this too, doing O_DIRECT IO 
if possible, I don't think there's any reason to churn the page cache while the 
defragmenter runs over a filesystem?


-Eric
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Ext3 online defrag

2006-10-23 Thread Andreas Dilger
On Oct 23, 2006  10:16 -0400, Theodore Tso wrote:
 As a suggestion, I would pass the inode number and inode generation
 number into the ext3_file_mode_data array:
 
 struct ext3_file_move_data {
   int extents;
   struct ext3_reloc_extent __user *ext_array;
 };
 
 This will be much more efficient for the userspace relocator, since it
 won't need to translate from an inode number to a pathname, and then
 try to open the file before relocating it.
 
 I'd also use an explicit 64-bit block numbers type so that we don't
 have to worry about the ABI changing when we support 64-bit block
 numbers.

I would in fact go so far as to allow only a single extent to be specified
per call.  This is to avoid the passing of any pointers as part of the
interface (hello ioctl police :-), and also makes the kernel code simpler.
I don't think the syscall/ioctl overhead is significant compared to the
journal and IO overhead.

Also, I would specify both the source extent and the target extent in
the inode.  This first allows defragmenting only part of the file
instead of (it appears) requiring the whole file to be relocated.  That
would be a killer if the file being defragmented is larger than free
space.  It secondly provides a level of insurance that what the kernel
is relocating matches what userspace thinks it is doing.  It would
protect against problems if the kernel ever does block relocation
itself (e.g. merge fragments into a single extent on (re)write, or for
snapshot/COW).

 The other problem I see with this patch is that there will be cache
 coherency problems between the buffer cache and the page cache.  I
 think you will want to pull the data blocks of the file into the page
 cache, and then write them out from the page cache, and only *then*
 update the indirect blocks and commit the transaction.

Alternately (maybe even better) is to treat it as O_DIRECT and ensure
the page cache is flushed.  This also avoids polluting the whole page
cache while running a defragmenter on the filesystem.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Ext3 online defrag

2006-10-23 Thread Jan Kara
 On Oct 23, 2006  10:16 -0400, Theodore Tso wrote:
  As a suggestion, I would pass the inode number and inode generation
  number into the ext3_file_mode_data array:
  
  struct ext3_file_move_data {
  int extents;
  struct ext3_reloc_extent __user *ext_array;
  };
  
  This will be much more efficient for the userspace relocator, since it
  won't need to translate from an inode number to a pathname, and then
  try to open the file before relocating it.
  
  I'd also use an explicit 64-bit block numbers type so that we don't
  have to worry about the ABI changing when we support 64-bit block
  numbers.
 
 I would in fact go so far as to allow only a single extent to be specified
 per call.  This is to avoid the passing of any pointers as part of the
 interface (hello ioctl police :-), and also makes the kernel code simpler.
 I don't think the syscall/ioctl overhead is significant compared to the
 journal and IO overhead.
  I'm not sure it makes the kernel code simplier - if we have to replace
just a part of the file, we have to rewrite references to blocks at
several places inside indiretc tree. If we relocate whole file, we just
replace block pointers from inode. Furthermore it makes it kind of
harder to tell where indirect blocks would go - and it would be
impossible for the defragmenter to force some unusual placement of
indirect blocks... Currently blocks (including indirect ones) are just
being allocated in the DFS order from the given list.

 Also, I would specify both the source extent and the target extent in
 the inode.  This first allows defragmenting only part of the file
 instead of (it appears) requiring the whole file to be relocated.  That
 would be a killer if the file being defragmented is larger than free
 space.  It secondly provides a level of insurance that what the kernel
 is relocating matches what userspace thinks it is doing.  It would
 protect against problems if the kernel ever does block relocation
 itself (e.g. merge fragments into a single extent on (re)write, or for
 snapshot/COW).
  I agree that this is the positive side of your approach :).

  The other problem I see with this patch is that there will be cache
  coherency problems between the buffer cache and the page cache.  I
  think you will want to pull the data blocks of the file into the page
  cache, and then write them out from the page cache, and only *then*
  update the indirect blocks and commit the transaction.
 
 Alternately (maybe even better) is to treat it as O_DIRECT and ensure
 the page cache is flushed.  This also avoids polluting the whole page
 cache while running a defragmenter on the filesystem.
  That's what I'm trying to do (but maybe my code is buggy).

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ext3: bogus i_mode errors with 2.6.18.1

2006-10-23 Thread Andreas Dilger
On Oct 23, 2006  16:45 +0200, Andre Noll wrote:
 stress tests on a 6.3T ext3 filesystem which runs on top of software
 raid 6 revealed the following:
 
 [663594.224641] init_special_inode: bogus i_mode (4412)
 [663596.355652] init_special_inode: bogus i_mode (5123)
 [663596.355832] init_special_inode: bogus i_mode (71562)

This would appear to be inode table corruption.

 [663763.319514] EXT3-fs error (device md0): ext3_new_block: Allocating block 
 in system zone - blocks from 517570560, length 1

This is bitmap corruption.

 [663763.339423] EXT3-fs error (device md0): ext3_free_blocks: Freeing blocks 
 in system zones - Block = 517570560, count = 1

Hmm, this would appear to be a buglet in error handling.  If the block just
allocated above is in the system zone it should be marked in-use in the
bitmap but otherwise ignored.  We definitely should NOT be freeing it on
error.

Yikes!  It seems a patch I submitted to 2.4 that fixed the behaviour
of ext3_new_block() so that if we detect this block shouldn't be
allocated it is skipped instead of corrupting the filesystem if it
is running with errors=continue...

It looks like ext3_free_blocks() needs a similar fix - i.e. report an
error and don't actually free those blocks.

 This system is currently not in production use, so I can use it for tests ATM.

I would suggest that you give this a try on RAID5, just for starters.
I'm not aware of any problems in the existing ext3 code for anything
below 8TB.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Ext3 online defrag

2006-10-23 Thread Jeff Garzik
On Mon, Oct 23, 2006 at 06:31:40PM +0400, Alex Tomas wrote:
 isn't that a kernel responsbility to find/allocate target blocks?
 wouldn't it better to specify desirable target group and minimal
 acceptable chunk of free blocks?

The kernel doesn't have enough knowledge to know whether or not the
defragger prefers one blkdev location over another.

When you are trying to consolidate blocks, you must specify the
destination as well as source blocks.

Certainly, to prevent corruption and other nastiness, you must fail if
the destination isn't available...

(ext2meta did all this...)

Jeff


-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html