Re: fallocate support for bitmap-based files

2007-07-06 Thread Mike Waychison

Valerie Henson wrote:

On Fri, Jun 29, 2007 at 06:07:25PM -0400, Mike Waychison wrote:
Relying on (a tweaked) reservations code is also somewhat limiting at 
this stage given that reservations are lost on close(fd).  Unless we 
change the lifetime of the reservations (maybe for the lifetime of the 
in-core inode?), crank up the reservation sizes and deal with the 
overcommit issues, I can't think of any better way at this time to deal 
with the problem.


While I never ever intended the ext3-to-ext2 reservations port to be
used :), I think you can make some fairly minor tweaks to it and get
something that works for your use case.  Move the reservation drop to
iput() and turn up your inode cache size, or store it in a tree when
the inode is closed and go look for it again when it's reopened.
Changing the reservation size seems fairly easy.  I'm not sure how the
overcommit issues affect your use case; any data you can share on
that?


The overcommit is speculation on my part.  GFS uses a lot of files on 
the disks and we like to keep the disks near full, especially in large 
GFS configurations.  If we end up with a lot of reservations in-core, 
associated with the inode cache, we begin to rely on memory pressure for 
getting the reserved blocks back.  That memory pressure may not exist 
(leading to ENOSPC), unless the code is adapted to cull reservations in 
that case.




In any case, storing the reservation data on-disk seems like not such
a great idea.  It adds complexity, disk traffic, and a new set of
checks for fsck.  I wouldn't want to incur that cost unless absolutely
necessary.


Ya, I wouldn't want the reservations on disk either unless it came in as 
an explicit pre-allocation request (and was accounted for in statfs 
info).  I'm treating that as a completely different beast at the moment.


Mike Waychison
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: fallocate support for bitmap-based files

2007-07-06 Thread Mike Waychison

Badari Pulavarty wrote:

On Sat, 2007-06-30 at 10:13 -0400, Mingming Cao wrote:

On Fri, 2007-06-29 at 13:01 -0700, Andrew Morton wrote:

Guys, Mike and Sreenivasa at google are looking into implementing
fallocate() on ext2.  Of course, any such implementation could and should
also be portable to ext3 and ext4 bitmapped files.

I believe that Sreenivasa will mainly be doing the implementation work.


The basic plan is as follows:

- Create (with tune2fs and mke2fs) a hidden file using one of the
  reserved inode numbers.  That file will be sized to have one bit for each
  block in the partition.  Let's call this the unwritten block file.

  The unwritten block file will be initialised with all-zeroes

- at fallocate()-time, allocate the blocks to the user's file (in some
  yet-to-be-determined fashion) and, for each one which is uninitialised,
  set its bit in the unwritten block file.  The set bit means this block
  is uninitialised and needs to be zeroed out on read.

- truncate() would need to clear out set-bits in the unwritten blocks file.

- When the fs comes to read a block from disk, it will need to consult
  the unwritten blocks file to see if that block should be zeroed by the
  CPU.

- When the unwritten-block is written to, its bit in the unwritten blocks
  file gets zeroed.

- An obvious efficiency concern: if a user file has no unwritten blocks
  in it, we don't need to consult the unwritten blocks file.

  Need to work out how to do this.  An obvious solution would be to have
  a number-of-unwritten-blocks counter in the inode.  But do we have space
  for that?

  (I expect google and others would prefer that the on-disk format be
  compatible with legacy ext2!)

- One concern is the following scenario:

  - Mount fs with new kernel, fallocate() some blocks to a file.

  - Now, mount the fs under old kernel (which doesn't understand the
    unwritten blocks file).

    - This kernel will be able to read uninitialised data from that
      fallocated-to file, which is a security concern.

  - Now, the old kernel writes some data to a fallocated block.  But
    this kernel doesn't know that it needs to clear that block's flag in
    the unwritten blocks file!

  - Now mount that fs under the new kernel and try to read that file.
    The flag for the block is set, so this kernel will still zero out the
    data on a read, thus corrupting the user's data

  So how to fix this?  Perhaps with a per-inode flag indicating this
  inode has unwritten blocks.  But to fix this problem, we'd require that
  the old kernel clear out that flag.

  Can anyone propose a solution to this?

  Ah, I can!  Use the compatibility flags in such a way as to prevent the
  old kernel from mounting this filesystem at all.  To mount this fs
  under an old kernel the user will need to run some tool which will

  - read the unwritten blocks file

  - for each set-bit in the unwritten blocks file, zero out the
    corresponding block

  - zero out the unwritten blocks file

  - rewrite the superblock to indicate that this fs may now be mounted
    by an old kernel.

  Sound sane?

- I'm assuming that there are more reserved inodes available, and that
  the changes to tune2fs and mke2fs will be basically a copy-n-paste job
  from the `tune2fs -j' code.  Correct?

- I haven't thought about what fsck changes would be needed.

  Presumably quite a few.  For example, fsck should check that set-bits
  in the unwritten blocks file do not correspond to freed blocks.  If they
  do, that should be fixed up.

  And fsck can check each inode's number-of-unwritten-blocks counter
  against the unwritten blocks file (if we implement the per-inode
  number-of-unwritten-blocks counter)

  What else should fsck do?

- I haven't thought about the implications of porting this into ext3/4. 
  Probably the commit to the unwritten blocks file will need to be atomic
  with the commit to the user's file's metadata, so the unwritten-blocks
  file will effectively need to be in journalled-data mode.

  Or, more likely, we access the unwritten blocks file via the blockdev
  pagecache (ie: use bmap, like the journal file) and then we're just
  talking direct to the disk's blocks and it becomes just more fs metadata.

- I guess resize2fs will need to be taught about the unwritten blocks
  file: to shrink and grow it appropriately.


That's all I can think of for now - I probably missed something. 


Suggestions and thought are sought, please.



Another approach we have been thinking about is using a backing
inode (per-inode-with-preallocation) to store the preallocated blocks.
When the user asks for preallocation on the base inode, ext2/3 creates a
temporary backing inode and (pre)allocates the
corresponding blocks in the backing inode. 


When a write to the base inode needs block allocation, before doing the
real fs block allocation it will check whether the file has a backing
inode that stores preallocated blocks for the same logical

Re: fallocate support for bitmap-based files

2007-07-06 Thread Badari Pulavarty
On Fri, 2007-07-06 at 14:33 -0700, Mike Waychison wrote:
 Badari Pulavarty wrote:
  On Sat, 2007-06-30 at 10:13 -0400, Mingming Cao wrote:
  On Fri, 2007-06-29 at 13:01 -0700, Andrew Morton wrote:
  Guys, Mike and Sreenivasa at google are looking into implementing
  fallocate() on ext2.  Of course, any such implementation could and should
  also be portable to ext3 and ext4 bitmapped files.
 
  I believe that Sreenivasa will mainly be doing the implementation work.
 
 
  The basic plan is as follows:
 
  - Create (with tune2fs and mke2fs) a hidden file using one of the
reserved inode numbers.  That file will be sized to have one bit for 
  each
block in the partition.  Let's call this the unwritten block file.
 
The unwritten block file will be initialised with all-zeroes
 
  - at fallocate()-time, allocate the blocks to the user's file (in some
yet-to-be-determined fashion) and, for each one which is uninitialised,
set its bit in the unwritten block file.  The set bit means this block
is uninitialised and needs to be zeroed out on read.
 
  - truncate() would need to clear out set-bits in the unwritten blocks 
  file.
 
  - When the fs comes to read a block from disk, it will need to consult
the unwritten blocks file to see if that block should be zeroed by the
CPU.
 
  - When the unwritten-block is written to, its bit in the unwritten blocks
file gets zeroed.
 
  - An obvious efficiency concern: if a user file has no unwritten blocks
in it, we don't need to consult the unwritten blocks file.
 
Need to work out how to do this.  An obvious solution would be to have
a number-of-unwritten-blocks counter in the inode.  But do we have space
for that?
 
(I expect google and others would prefer that the on-disk format be
compatible with legacy ext2!)
 
  - One concern is the following scenario:
 
- Mount fs with new kernel, fallocate() some blocks to a file.
 
- Now, mount the fs under old kernel (which doesn't understand the
  unwritten blocks file).
 
  - This kernel will be able to read uninitialised data from that
fallocated-to file, which is a security concern.
 
- Now, the old kernel writes some data to a fallocated block.  But
  this kernel doesn't know that it needs to clear that block's flag in
  the unwritten blocks file!
 
- Now mount that fs under the new kernel and try to read that file.
   The flag for the block is set, so this kernel will still zero out the
  data on a read, thus corrupting the user's data
 
So how to fix this?  Perhaps with a per-inode flag indicating this
inode has unwritten blocks.  But to fix this problem, we'd require that
the old kernel clear out that flag.
 
Can anyone propose a solution to this?
 
Ah, I can!  Use the compatibility flags in such a way as to prevent the
old kernel from mounting this filesystem at all.  To mount this fs
under an old kernel the user will need to run some tool which will
 
- read the unwritten blocks file
 
- for each set-bit in the unwritten blocks file, zero out the
  corresponding block
 
- zero out the unwritten blocks file
 
- rewrite the superblock to indicate that this fs may now be mounted
  by an old kernel.
 
Sound sane?
 
  - I'm assuming that there are more reserved inodes available, and that
the changes to tune2fs and mke2fs will be basically a copy-n-paste job
from the `tune2fs -j' code.  Correct?
 
  - I haven't thought about what fsck changes would be needed.
 
Presumably quite a few.  For example, fsck should check that set-bits
in the unwritten blocks file do not correspond to freed blocks.  If they
do, that should be fixed up.
 
And fsck can check each inode's number-of-unwritten-blocks counter
against the unwritten blocks file (if we implement the per-inode
number-of-unwritten-blocks counter)
 
What else should fsck do?
 
  - I haven't thought about the implications of porting this into ext3/4. 
Probably the commit to the unwritten blocks file will need to be atomic
with the commit to the user's file's metadata, so the unwritten-blocks
file will effectively need to be in journalled-data mode.
 
Or, more likely, we access the unwritten blocks file via the blockdev
pagecache (ie: use bmap, like the journal file) and then we're just
talking direct to the disk's blocks and it becomes just more fs 
  metadata.
 
  - I guess resize2fs will need to be taught about the unwritten blocks
file: to shrink and grow it appropriately.
 
 
  That's all I can think of for now - I probably missed something. 
 
  Suggestions and thought are sought, please.
 
 
  Another approach we have been thinking about is using a backing
  inode (per-inode-with-preallocation) to store the preallocated blocks.
  When the user asks for preallocation on the base inode, ext2/3 creates a
  temporary backing inode and (pre)allocates the
  

Re: fallocate support for bitmap-based files

2007-07-04 Thread Valerie Henson
On Fri, Jun 29, 2007 at 06:07:25PM -0400, Mike Waychison wrote:
 
 Relying on (a tweaked) reservations code is also somewhat limiting at 
 this stage given that reservations are lost on close(fd).  Unless we 
 change the lifetime of the reservations (maybe for the lifetime of the 
 in-core inode?), crank up the reservation sizes and deal with the 
 overcommit issues, I can't think of any better way at this time to deal 
 with the problem.

While I never ever intended the ext3-to-ext2 reservations port to be
used :), I think you can make some fairly minor tweaks to it and get
something that works for your use case.  Move the reservation drop to
iput() and turn up your inode cache size, or store it in a tree when
the inode is closed and go look for it again when it's reopened.
Changing the reservation size seems fairly easy.  I'm not sure how the
overcommit issues affect your use case; any data you can share on
that?

In any case, storing the reservation data on-disk seems like not such
a great idea.  It adds complexity, disk traffic, and a new set of
checks for fsck.  I wouldn't want to incur that cost unless absolutely
necessary.

-VAL


Re: fallocate support for bitmap-based files

2007-07-02 Thread Badari Pulavarty
On Sat, 2007-06-30 at 10:13 -0400, Mingming Cao wrote:
 On Fri, 2007-06-29 at 13:01 -0700, Andrew Morton wrote:
  Guys, Mike and Sreenivasa at google are looking into implementing
  fallocate() on ext2.  Of course, any such implementation could and should
  also be portable to ext3 and ext4 bitmapped files.
  
  I believe that Sreenivasa will mainly be doing the implementation work.
  
  
  The basic plan is as follows:
  
  - Create (with tune2fs and mke2fs) a hidden file using one of the
reserved inode numbers.  That file will be sized to have one bit for each
block in the partition.  Let's call this the unwritten block file.
  
The unwritten block file will be initialised with all-zeroes
  
  - at fallocate()-time, allocate the blocks to the user's file (in some
yet-to-be-determined fashion) and, for each one which is uninitialised,
set its bit in the unwritten block file.  The set bit means this block
is uninitialised and needs to be zeroed out on read.
  
  - truncate() would need to clear out set-bits in the unwritten blocks file.
  
  - When the fs comes to read a block from disk, it will need to consult
the unwritten blocks file to see if that block should be zeroed by the
CPU.
  
  - When the unwritten-block is written to, its bit in the unwritten blocks
file gets zeroed.
  
  - An obvious efficiency concern: if a user file has no unwritten blocks
in it, we don't need to consult the unwritten blocks file.
  
Need to work out how to do this.  An obvious solution would be to have
a number-of-unwritten-blocks counter in the inode.  But do we have space
for that?
  
(I expect google and others would prefer that the on-disk format be
compatible with legacy ext2!)
  
  - One concern is the following scenario:
  
- Mount fs with new kernel, fallocate() some blocks to a file.
  
- Now, mount the fs under old kernel (which doesn't understand the
  unwritten blocks file).
  
  - This kernel will be able to read uninitialised data from that
fallocated-to file, which is a security concern.
  
- Now, the old kernel writes some data to a fallocated block.  But
  this kernel doesn't know that it needs to clear that block's flag in
  the unwritten blocks file!
  
- Now mount that fs under the new kernel and try to read that file.
   The flag for the block is set, so this kernel will still zero out the
  data on a read, thus corrupting the user's data
  
So how to fix this?  Perhaps with a per-inode flag indicating this
inode has unwritten blocks.  But to fix this problem, we'd require that
the old kernel clear out that flag.
  
Can anyone propose a solution to this?
  
Ah, I can!  Use the compatibility flags in such a way as to prevent the
old kernel from mounting this filesystem at all.  To mount this fs
under an old kernel the user will need to run some tool which will
  
- read the unwritten blocks file
  
- for each set-bit in the unwritten blocks file, zero out the
  corresponding block
  
- zero out the unwritten blocks file
  
- rewrite the superblock to indicate that this fs may now be mounted
  by an old kernel.
  
Sound sane?
  
  - I'm assuming that there are more reserved inodes available, and that
the changes to tune2fs and mke2fs will be basically a copy-n-paste job
from the `tune2fs -j' code.  Correct?
  
  - I haven't thought about what fsck changes would be needed.
  
Presumably quite a few.  For example, fsck should check that set-bits
in the unwritten blocks file do not correspond to freed blocks.  If they
do, that should be fixed up.
  
And fsck can check each inode's number-of-unwritten-blocks counter
against the unwritten blocks file (if we implement the per-inode
number-of-unwritten-blocks counter)
  
What else should fsck do?
  
  - I haven't thought about the implications of porting this into ext3/4. 
Probably the commit to the unwritten blocks file will need to be atomic
with the commit to the user's file's metadata, so the unwritten-blocks
file will effectively need to be in journalled-data mode.
  
Or, more likely, we access the unwritten blocks file via the blockdev
pagecache (ie: use bmap, like the journal file) and then we're just
talking direct to the disk's blocks and it becomes just more fs metadata.
  
  - I guess resize2fs will need to be taught about the unwritten blocks
file: to shrink and grow it appropriately.
  
  
  That's all I can think of for now - I probably missed something. 
  
  Suggestions and thought are sought, please.
  
  
 
 Another approach we have been thinking about is using a backing
 inode (per-inode-with-preallocation) to store the preallocated blocks.
 When the user asks for preallocation on the base inode, ext2/3 creates a
 temporary backing inode and (pre)allocates the
 corresponding blocks in the backing inode. 
 
 When writes to 

Re: fallocate support for bitmap-based files

2007-07-02 Thread Mingming Cao
On Sat, 2007-06-30 at 13:29 -0400, Andreas Dilger wrote:
 On Jun 30, 2007  10:13 -0400, Mingming Cao wrote:
  Another approach we have been thinking about is using a backing
  inode (per-inode-with-preallocation) to store the preallocated blocks.
  When the user asks for preallocation on the base inode, ext2/3 creates a
  temporary backing inode and (pre)allocates the corresponding
  blocks in the backing inode. 
  
  When a write to the base inode needs block allocation, before doing the
  real fs block allocation it will check whether the file has a backing
  inode that stores preallocated blocks for the same logical
  blocks.  If so, it will transfer the preallocated blocks from the backing
  inode to the base inode.
  
  We need to link the two inodes in some way, maybe store the backing
  inode number via an EA in the base inode, and flag the base inode to
  show that it has a backing inode from which to get preallocated blocks.
  
  Since it doesn't change the block mapping on the original file until
  writeout, it doesn't require an incompat feature to protect the
  preallocated contents from being read by an old kernel. There is some
  work that needs to be done in e2fsck to understand the backing inode.
 
 I don't know if you realize, but this is half-way to supporting
 snapshots within the filesystem.  

From your description it seems similar, but I'm not sure if it's half-way
yet. Just to clarify: what's stored in the backing inode (in the
preallocation case) is just metablocks, not data blocks. The transfer
(from the backing inode to the base inode) does not involve any data block
migration.

Another comment: if we are seriously looking at supporting preallocation in
ext2 upstream, I'd like to choose a solution suitable for ext3 as
well. Taking a bit from the block number to flag preallocated blocks means
reducing the ext2/3 fs limit to 8TB, which is probably not a big deal for
ext2, but not so good for ext3.
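
The 8TB figure can be checked with a little arithmetic. The sketch below assumes 32-bit block numbers and a 4 KiB block size (the block size is an assumption; the thread does not state it), and the helper name `max_fs_bytes` is purely illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Largest filesystem size, in bytes, addressable with `bits` bits of
 * block number and blocks of `block_size` bytes.  Illustrative helper,
 * not taken from any kernel source. */
static uint64_t max_fs_bytes(unsigned bits, uint64_t block_size)
{
    assert(bits < 64);  /* keep the shift well defined */
    return ((uint64_t)1 << bits) * block_size;
}
```

With all 32 bits available, `max_fs_bytes(32, 4096)` is 2^44 bytes (16TB); reserving the high bit as a flag leaves 31 bits, and `max_fs_bytes(31, 4096)` is 2^43 bytes (8TB).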

Mingming





Re: fallocate support for bitmap-based files

2007-06-30 Thread Mingming Cao
On Fri, 2007-06-29 at 13:01 -0700, Andrew Morton wrote:
 Guys, Mike and Sreenivasa at google are looking into implementing
 fallocate() on ext2.  Of course, any such implementation could and should
 also be portable to ext3 and ext4 bitmapped files.
 
 I believe that Sreenivasa will mainly be doing the implementation work.
 
 
 The basic plan is as follows:
 
 - Create (with tune2fs and mke2fs) a hidden file using one of the
   reserved inode numbers.  That file will be sized to have one bit for each
   block in the partition.  Let's call this the unwritten block file.
 
   The unwritten block file will be initialised with all-zeroes
 
 - at fallocate()-time, allocate the blocks to the user's file (in some
   yet-to-be-determined fashion) and, for each one which is uninitialised,
   set its bit in the unwritten block file.  The set bit means this block
   is uninitialised and needs to be zeroed out on read.
 
 - truncate() would need to clear out set-bits in the unwritten blocks file.
 
 - When the fs comes to read a block from disk, it will need to consult
   the unwritten blocks file to see if that block should be zeroed by the
   CPU.
 
 - When the unwritten-block is written to, its bit in the unwritten blocks
   file gets zeroed.
 
 - An obvious efficiency concern: if a user file has no unwritten blocks
   in it, we don't need to consult the unwritten blocks file.
 
   Need to work out how to do this.  An obvious solution would be to have
   a number-of-unwritten-blocks counter in the inode.  But do we have space
   for that?
 
   (I expect google and others would prefer that the on-disk format be
   compatible with legacy ext2!)
 
 - One concern is the following scenario:
 
   - Mount fs with new kernel, fallocate() some blocks to a file.
 
   - Now, mount the fs under old kernel (which doesn't understand the
     unwritten blocks file).
 
     - This kernel will be able to read uninitialised data from that
       fallocated-to file, which is a security concern.
 
   - Now, the old kernel writes some data to a fallocated block.  But
     this kernel doesn't know that it needs to clear that block's flag in
     the unwritten blocks file!
 
   - Now mount that fs under the new kernel and try to read that file.
     The flag for the block is set, so this kernel will still zero out the
     data on a read, thus corrupting the user's data
 
   So how to fix this?  Perhaps with a per-inode flag indicating this
   inode has unwritten blocks.  But to fix this problem, we'd require that
   the old kernel clear out that flag.
 
   Can anyone propose a solution to this?
 
   Ah, I can!  Use the compatibility flags in such a way as to prevent the
   old kernel from mounting this filesystem at all.  To mount this fs
   under an old kernel the user will need to run some tool which will
 
   - read the unwritten blocks file
 
   - for each set-bit in the unwritten blocks file, zero out the
     corresponding block
 
   - zero out the unwritten blocks file
 
   - rewrite the superblock to indicate that this fs may now be mounted
     by an old kernel.
 
   Sound sane?
 
 - I'm assuming that there are more reserved inodes available, and that
   the changes to tune2fs and mke2fs will be basically a copy-n-paste job
   from the `tune2fs -j' code.  Correct?
 
 - I haven't thought about what fsck changes would be needed.
 
   Presumably quite a few.  For example, fsck should check that set-bits
   in the unwritten blocks file do not correspond to freed blocks.  If they
   do, that should be fixed up.
 
   And fsck can check each inode's number-of-unwritten-blocks counter
   against the unwritten blocks file (if we implement the per-inode
   number-of-unwritten-blocks counter)
 
   What else should fsck do?
 
 - I haven't thought about the implications of porting this into ext3/4. 
   Probably the commit to the unwritten blocks file will need to be atomic
   with the commit to the user's file's metadata, so the unwritten-blocks
   file will effectively need to be in journalled-data mode.
 
   Or, more likely, we access the unwritten blocks file via the blockdev
   pagecache (ie: use bmap, like the journal file) and then we're just
   talking direct to the disk's blocks and it becomes just more fs metadata.
 
 - I guess resize2fs will need to be taught about the unwritten blocks
   file: to shrink and grow it appropriately.
 
 
 That's all I can think of for now - I probably missed something. 
 
 Suggestions and thought are sought, please.
 
 

Another approach we have been thinking about is using a backing
inode (per-inode-with-preallocation) to store the preallocated blocks.
When the user asks for preallocation on the base inode, ext2/3 creates a
temporary backing inode and (pre)allocates the
corresponding blocks in the backing inode. 

When a write to the base inode needs block allocation, before doing the
real fs block allocation it will check whether the file has a backing
inode that stores some preallocated 

Re: fallocate support for bitmap-based files

2007-06-30 Thread Mingming Cao
On Sat, 2007-06-30 at 01:14 -0400, Andreas Dilger wrote:
 On Jun 29, 2007  18:26 -0400, Mike Waychison wrote:
  Andreas Dilger wrote:
  I don't think ext2 is safe for > 8TB filesystems anyways, so this
  isn't a huge loss.
  
  This is in reference to the idea of overloading the high bit, and not 
  related to the PAGE_SIZE blocks, correct?
 
 Correct - just that the high-bit use wouldn't unduly impact the
 already-existing 8TB limit of ext2.
 

The 8TB limit on mainline ext2 was simply caused by kernel block
variable type bugs. The bug fixes were ported back from ext3 to ext2
when reservation+simple-multiple-balloc were backported from ext3 to
ext2. I believe ext2 in the mm tree is able to address 16TB on the kernel
side.  Not sure if there is remaining work to be done in e2fsck to
handle 16TB ext2, but I assume it's not huge work.

Mingming



fallocate support for bitmap-based files

2007-06-29 Thread Andrew Morton

Guys, Mike and Sreenivasa at google are looking into implementing
fallocate() on ext2.  Of course, any such implementation could and should
also be portable to ext3 and ext4 bitmapped files.

I believe that Sreenivasa will mainly be doing the implementation work.


The basic plan is as follows:

- Create (with tune2fs and mke2fs) a hidden file using one of the
  reserved inode numbers.  That file will be sized to have one bit for each
  block in the partition.  Let's call this the unwritten block file.

  The unwritten block file will be initialised with all-zeroes

- at fallocate()-time, allocate the blocks to the user's file (in some
  yet-to-be-determined fashion) and, for each one which is uninitialised,
  set its bit in the unwritten block file.  The set bit means this block
  is uninitialised and needs to be zeroed out on read.

- truncate() would need to clear out set-bits in the unwritten blocks file.

- When the fs comes to read a block from disk, it will need to consult
  the unwritten blocks file to see if that block should be zeroed by the
  CPU.

- When the unwritten-block is written to, its bit in the unwritten blocks
  file gets zeroed.

- An obvious efficiency concern: if a user file has no unwritten blocks
  in it, we don't need to consult the unwritten blocks file.

  Need to work out how to do this.  An obvious solution would be to have
  a number-of-unwritten-blocks counter in the inode.  But do we have space
  for that?

  (I expect google and others would prefer that the on-disk format be
  compatible with legacy ext2!)

- One concern is the following scenario:

  - Mount fs with new kernel, fallocate() some blocks to a file.

  - Now, mount the fs under old kernel (which doesn't understand the
    unwritten blocks file).

    - This kernel will be able to read uninitialised data from that
      fallocated-to file, which is a security concern.

  - Now, the old kernel writes some data to a fallocated block.  But
    this kernel doesn't know that it needs to clear that block's flag in
    the unwritten blocks file!

  - Now mount that fs under the new kernel and try to read that file.
    The flag for the block is set, so this kernel will still zero out the
    data on a read, thus corrupting the user's data

  So how to fix this?  Perhaps with a per-inode flag indicating this
  inode has unwritten blocks.  But to fix this problem, we'd require that
  the old kernel clear out that flag.

  Can anyone propose a solution to this?

  Ah, I can!  Use the compatibility flags in such a way as to prevent the
  old kernel from mounting this filesystem at all.  To mount this fs
  under an old kernel the user will need to run some tool which will

  - read the unwritten blocks file

  - for each set-bit in the unwritten blocks file, zero out the
    corresponding block

  - zero out the unwritten blocks file

  - rewrite the superblock to indicate that this fs may now be mounted
    by an old kernel.

  Sound sane?

- I'm assuming that there are more reserved inodes available, and that
  the changes to tune2fs and mke2fs will be basically a copy-n-paste job
  from the `tune2fs -j' code.  Correct?

- I haven't thought about what fsck changes would be needed.

  Presumably quite a few.  For example, fsck should check that set-bits
  in the unwritten blocks file do not correspond to freed blocks.  If they
  do, that should be fixed up.

  And fsck can check each inode's number-of-unwritten-blocks counter
  against the unwritten blocks file (if we implement the per-inode
  number-of-unwritten-blocks counter)

  What else should fsck do?

- I haven't thought about the implications of porting this into ext3/4. 
  Probably the commit to the unwritten blocks file will need to be atomic
  with the commit to the user's file's metadata, so the unwritten-blocks
  file will effectively need to be in journalled-data mode.

  Or, more likely, we access the unwritten blocks file via the blockdev
  pagecache (ie: use bmap, like the journal file) and then we're just
  talking direct to the disk's blocks and it becomes just more fs metadata.

- I guess resize2fs will need to be taught about the unwritten blocks
  file: to shrink and grow it appropriately.
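
The bitmap bookkeeping in the plan above (set a bit at fallocate() time, test it on read, clear it on write or truncate) can be sketched as follows. This is a minimal in-memory illustration only: a plain array stands in for the unwritten block file, and the `uw_*` names are hypothetical, not an existing kernel interface.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define UW_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* fallocate() path: mark a freshly allocated block as unwritten. */
static void uw_set(unsigned long *map, uint64_t blk)
{
    assert(map != NULL);
    map[blk / UW_BITS_PER_WORD] |= 1UL << (blk % UW_BITS_PER_WORD);
}

/* write or truncate path: the block no longer needs zeroing on read. */
static void uw_clear(unsigned long *map, uint64_t blk)
{
    assert(map != NULL);
    map[blk / UW_BITS_PER_WORD] &= ~(1UL << (blk % UW_BITS_PER_WORD));
}

/* Returns non-zero if the block is still marked unwritten. */
static int uw_test(const unsigned long *map, uint64_t blk)
{
    assert(map != NULL);
    return (int)((map[blk / UW_BITS_PER_WORD] >> (blk % UW_BITS_PER_WORD)) & 1UL);
}

/* read path: after reading `buf` from disk, zero it if the block is
 * marked unwritten.  Returns 1 if the buffer was zeroed, else 0. */
static int uw_read_fixup(const unsigned long *map, uint64_t blk,
                         void *buf, size_t block_size)
{
    if (!uw_test(map, blk))
        return 0;
    memset(buf, 0, block_size);
    return 1;
}
```

A real implementation would also have to keep the bitmap update atomic with the file's metadata update, as the ext3/4 item above notes; the sketch ignores that entirely.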


That's all I can think of for now - I probably missed something. 

Suggestions and thought are sought, please.
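
The compatibility-flag idea and the downgrade tool proposed above can be sketched on a toy in-memory image. ext2 superblocks do carry an incompat feature mask, and a kernel refuses to mount when it sees a bit it does not understand; everything else here (the `fs_image` struct, the flag value, the function names) is a hypothetical stand-in rather than e2fsprogs code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder bit for illustration only; the thread assigns no value. */
#define FEATURE_INCOMPAT_UNWRITTEN 0x8000u

/* Toy model of a filesystem image (hypothetical layout). */
struct fs_image {
    uint32_t incompat;    /* incompat feature mask from the superblock */
    uint8_t *unwritten;   /* unwritten block file: one bit per block */
    uint8_t *data;        /* nblocks * block_size bytes of block data */
    uint32_t nblocks;
    uint32_t block_size;
};

/* A kernel may mount only if it understands every incompat bit set. */
static int can_mount(uint32_t sb_incompat, uint32_t kernel_supported)
{
    return (sb_incompat & ~kernel_supported) == 0;
}

/* Downgrade steps from the proposal: zero every block whose bit is set,
 * zero the unwritten block file, then drop the incompat flag. */
static void downgrade(struct fs_image *fs)
{
    assert(fs != NULL);
    for (uint32_t b = 0; b < fs->nblocks; b++) {
        if (fs->unwritten[b / 8] & (1u << (b % 8)))
            memset(fs->data + (size_t)b * fs->block_size, 0, fs->block_size);
    }
    memset(fs->unwritten, 0, (fs->nblocks + 7) / 8);
    fs->incompat &= ~FEATURE_INCOMPAT_UNWRITTEN;
}
```

Clearing the flag last matters: if the tool is interrupted part-way, the filesystem still refuses to mount under an old kernel.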




Re: fallocate support for bitmap-based files

2007-06-29 Thread Dave Kleikamp
On Fri, 2007-06-29 at 13:01 -0700, Andrew Morton wrote:
 Guys, Mike and Sreenivasa at google are looking into implementing
 fallocate() on ext2.  Of course, any such implementation could and should
 also be portable to ext3 and ext4 bitmapped files.
 
 I believe that Sreenivasa will mainly be doing the implementation work.
 
 
 The basic plan is as follows:
 
 - Create (with tune2fs and mke2fs) a hidden file using one of the
   reserved inode numbers.  That file will be sized to have one bit for each
   block in the partition.  Let's call this the unwritten block file.
 
   The unwritten block file will be initialised with all-zeroes
 
 - at fallocate()-time, allocate the blocks to the user's file (in some
   yet-to-be-determined fashion) and, for each one which is uninitialised,
   set its bit in the unwritten block file.  The set bit means this block
   is uninitialised and needs to be zeroed out on read.
 
 - truncate() would need to clear out set-bits in the unwritten blocks file.

By truncating the blocks file at the correct byte offset, only needing
to zero some bits of the last byte of the file.

 - When the fs comes to read a block from disk, it will need to consult
   the unwritten blocks file to see if that block should be zeroed by the
   CPU.
 
 - When the unwritten-block is written to, its bit in the unwritten blocks
   file gets zeroed.
 
 - An obvious efficiency concern: if a user file has no unwritten blocks
   in it, we don't need to consult the unwritten blocks file.
 
   Need to work out how to do this.  An obvious solution would be to have
   a number-of-unwritten-blocks counter in the inode.  But do we have space
   for that?

Would it be too expensive to test the blocks-file page each time a bit
is cleared to see if it is all-zero, and then free the page, making it a
hole?  This test would stop if it finds any non-zero word, so it may not
be too bad.  (This could further be done on a block basis if the block
size is less than a page.)

   (I expect google and others would prefer that the on-disk format be
   compatible with legacy ext2!)
 
 - One concern is the following scenario:
 
   - Mount fs with new kernel, fallocate() some blocks to a file.
 
   - Now, mount the fs under old kernel (which doesn't understand the
 unwritten blocks file).
 
 - This kernel will be able to read uninitialised data from that
   fallocated-to file, which is a security concern.
 
   - Now, the old kernel writes some data to a fallocated block.  But
 this kernel doesn't know that it needs to clear that block's flag in
 the unwritten blocks file!
 
   - Now mount that fs under the new kernel and try to read that file.
  The flag for the block is set, so this kernel will still zero out the
 data on a read, thus corrupting the user's data
 
   So how to fix this?  Perhaps with a per-inode flag indicating this
   inode has unwritten blocks.  But to fix this problem, we'd require that
   the old kernel clear out that flag.
 
   Can anyone propose a solution to this?
 
   Ah, I can!  Use the compatibility flags in such a way as to prevent the
   old kernel from mounting this filesystem at all.  To mount this fs
   under an old kernel the user will need to run some tool which will
 
   - read the unwritten blocks file
 
   - for each set-bit in the unwritten blocks file, zero out the
 corresponding block
 
   - zero out the unwritten blocks file
 
   - rewrite the superblock to indicate that this fs may now be mounted
 by an old kernel.
 
   Sound sane?

Yeah.  I think it would have to be done under a compatibility flag.  Is
going back to an older kernel really that important?  I think it's more
important to make sure it can't be mounted by an older kernel if bad
things can happen, and they can.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center



Re: fallocate support for bitmap-based files

2007-06-29 Thread Mike Waychison

Dave Kleikamp wrote:

On Fri, 2007-06-29 at 13:01 -0700, Andrew Morton wrote:


Guys, Mike and Sreenivasa at google are looking into implementing
fallocate() on ext2.  Of course, any such implementation could and should
also be portable to ext3 and ext4 bitmapped files.

I believe that Sreenivasa will mainly be doing the implementation work.


The basic plan is as follows:

- Create (with tune2fs and mke2fs) a hidden file using one of the
 reserved inode numbers.  That file will be sized to have one bit for each
 block in the partition.  Let's call this the unwritten block file.

 The unwritten block file will be initialised with all-zeroes

- at fallocate()-time, allocate the blocks to the user's file (in some
 yet-to-be-determined fashion) and, for each one which is uninitialised,
 set its bit in the unwritten block file.  The set bit means this block
 is uninitialised and needs to be zeroed out on read.

- truncate() would need to clear out set-bits in the unwritten blocks file.



By truncating the blocks file at the correct byte offset, only needing
to zero some bits of the last byte of the file.


We were thinking the unwritten blocks file would be indexed by physical 
block number of the block device.  There wouldn't be a logical to 
physical relationship for the blocks, so we wouldn't be able to get away 
with truncating the blocks file itself.
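
A minimal sketch of a bitmap indexed by physical block number, assuming
one bit per block of the device and a set bit meaning "allocated but
never written".  The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical in-memory view of the unwritten blocks file: one bit per
 * physical block of the device; bit set = allocated but never written. */
struct unwritten_map {
    uint8_t  *bits;
    uint32_t  nblocks;   /* number of blocks in the partition */
};

/* fallocate() path: flag a freshly allocated, uninitialised block. */
void uw_mark(struct unwritten_map *m, uint32_t pblk)
{
    assert(pblk < m->nblocks);
    m->bits[pblk >> 3] |= (uint8_t)(1u << (pblk & 7));
}

/* First write to the block, or truncate() freeing it. */
void uw_clear(struct unwritten_map *m, uint32_t pblk)
{
    assert(pblk < m->nblocks);
    m->bits[pblk >> 3] &= (uint8_t)~(1u << (pblk & 7));
}

/* Read path: non-zero means the caller must return zeroes. */
int uw_test(const struct unwritten_map *m, uint32_t pblk)
{
    assert(pblk < m->nblocks);
    return (m->bits[pblk >> 3] >> (pblk & 7)) & 1;
}
```

Because a file's physical blocks can be scattered anywhere on the
device, truncate() would have to call something like uw_clear() once
per freed block rather than chop the tail off the bitmap.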






- When the fs comes to read a block from disk, it will need to consult
 the unwritten blocks file to see if that block should be zeroed by the
 CPU.

- When the unwritten-block is written to, its bit in the unwritten blocks
 file gets zeroed.

- An obvious efficiency concern: if a user file has no unwritten blocks
 in it, we don't need to consult the unwritten blocks file.

 Need to work out how to do this.  An obvious solution would be to have
 a number-of-unwritten-blocks counter in the inode.  But do we have space
 for that?



Would it be too expensive to test the blocks-file page each time a bit
is cleared to see if it is all-zero, and then free the page, making it a
hole?  This test would stop if it finds any non-zero word, so it may not
be too bad.  (This could further be done on a block basis if the block
size is less than a page.)


When clearing the bits, we'd likely see a large stream of writes to the
unwritten blocks, which could result in an O(n^2) pass of rescanning the
page over and over.  Maybe each block of the unwritten blocks file could
carry a small header with a count that could be cheaply tested?  I.e., the
unwritten blocks file is composed of blocks that each have a small header
that contains a count -- when the count hits zero, we could punch a hole
in the file.
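
A minimal sketch of the counted-header idea, assuming each block of the
unwritten blocks file starts with a count of its set bits.  The layout,
the tiny payload size and the names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define UWB_PAYLOAD 16   /* bytes of bitmap per block; tiny for illustration */

/* Hypothetical layout of one block of the unwritten blocks file. */
struct uwb_block {
    uint32_t nset;                /* number of set bits in bits[] */
    uint8_t  bits[UWB_PAYLOAD];
};

/* Set a bit, keeping the header count in step. */
void uwb_set(struct uwb_block *b, unsigned idx)
{
    assert(idx < UWB_PAYLOAD * 8);
    uint8_t mask = (uint8_t)(1u << (idx & 7));

    if (!(b->bits[idx >> 3] & mask)) {
        b->bits[idx >> 3] |= mask;
        b->nset++;
    }
}

/* Clear a bit.  Returns 1 when the block has no set bits left, so the
 * caller could punch a hole for it without rescanning the bitmap. */
int uwb_clear(struct uwb_block *b, unsigned idx)
{
    assert(idx < UWB_PAYLOAD * 8);
    uint8_t mask = (uint8_t)(1u << (idx & 7));

    if (b->bits[idx >> 3] & mask) {
        b->bits[idx >> 3] &= (uint8_t)~mask;
        b->nset--;
    }
    return b->nset == 0;
}
```

The emptiness test is a single compare per cleared bit, at the cost of
a header that makes the file no longer a pure bitmap.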






 (I expect google and others would prefer that the on-disk format be
 compatible with legacy ext2!)

- One concern is the following scenario:

 - Mount fs with new kernel, fallocate() some blocks to a file.

 - Now, mount the fs under old kernel (which doesn't understand the
   unwritten blocks file).

   - This kernel will be able to read uninitialised data from that
 fallocated-to file, which is a security concern.

 - Now, the old kernel writes some data to a fallocated block.  But
   this kernel doesn't know that it needs to clear that block's flag in
   the unwritten blocks file!

 - Now mount that fs under the new kernel and try to read that file.
The flag for the block is set, so this kernel will still zero out the
   data on a read, thus corrupting the user's data

 So how to fix this?  Perhaps with a per-inode flag indicating this
 inode has unwritten blocks.  But to fix this problem, we'd require that
 the old kernel clear out that flag.

 Can anyone propose a solution to this?

 Ah, I can!  Use the compatibility flags in such a way as to prevent the
 old kernel from mounting this filesystem at all.  To mount this fs
 under an old kernel the user will need to run some tool which will

 - read the unwritten blocks file

 - for each set-bit in the unwritten blocks file, zero out the
   corresponding block

 - zero out the unwritten blocks file

 - rewrite the superblock to indicate that this fs may now be mounted
   by an old kernel.

 Sound sane?



Yeah.  I think it would have to be done under a compatibility flag.  Is
going back to an older kernel really that important?  I think it's more
important to make sure it can't be mounted by an older kernel if bad
things can happen, and they can.



Ya, I too was originally thinking of a compat flag to keep the old 
kernel from mounting the filesystem.  We'd arrange our bootup scripts to 
check for compatibility and call out to tune2fs (or some other tool) to 
down convert (by simply writing out zero blocks for each bit set and 
clearing the bitmap).
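
A minimal sketch of that down-convert pass, run here over an in-memory
image rather than a real device.  The function name and the flat-array
model of the disk are assumptions for illustration; a real tool would
go on to rewrite the superblock feature flag as the last step.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical down-convert pass: `disk` holds nblocks blocks of blksz
 * bytes each, `bitmap` holds one bit per block (set = unwritten).
 * Returns the number of blocks that were zeroed.
 */
uint32_t downconvert(uint8_t *disk, uint32_t nblocks, uint32_t blksz,
                     uint8_t *bitmap)
{
    uint32_t zeroed = 0;

    assert(blksz > 0);
    for (uint32_t blk = 0; blk < nblocks; blk++) {
        if (bitmap[blk >> 3] & (1u << (blk & 7))) {
            memset(disk + (size_t)blk * blksz, 0, blksz);
            zeroed++;
        }
    }
    /* Clear the map only after every flagged block has been zeroed. */
    memset(bitmap, 0, (nblocks + 7) / 8);
    return zeroed;
}
```

The ordering matters: if the tool is interrupted, any bits still set
describe blocks that still need zeroing, so the pass can be rerun.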


Mike Waychison


Re: fallocate support for bitmap-based files

2007-06-29 Thread Theodore Tso
On Fri, Jun 29, 2007 at 01:01:20PM -0700, Andrew Morton wrote:
 
 Guys, Mike and Sreenivasa at google are looking into implementing
 fallocate() on ext2.  Of course, any such implementation could and should
 also be portable to ext3 and ext4 bitmapped files.

What's the eventual goal of this work?  Would it be for mainline use,
or just something that would be used internally at Google?  I'm not
particularly enthused about supporting two ways of doing fallocate();
one for ext4 and one for bitmap-based files in ext2/3/4.  Is the
benefit really worth it?

What I would suggest, which would make this much easier, is to make this
an incompatible extension (which, as you point out, is needed for
security reasons anyway) and then steal the high bit from the block
number field to indicate whether the block has been initialized.
That way you don't end up having to seek to a potentially
distant part of the disk to check out the bitmap.  Also, you don't
have to worry about how to recover if the block initialized bitmap
inode gets smashed.  

The downside is that it reduces the maximum size of the filesystem
supported by ext2 by a factor of two.  But, there are at least two
patch series floating about that promise to allow filesystem block
sizes > PAGE_SIZE, which would allow you to recover the maximum
size supported by the filesystem.
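
A minimal sketch of the high-bit idea and its size cost, assuming
32-bit block number slots.  The macro and function names are
hypothetical, and the size calculation considers only the address
width, not any other ext2 limit.

```c
#include <assert.h>
#include <stdint.h>

#define BLK_UNWRITTEN 0x80000000u   /* hypothetical flag: top bit of the slot */
#define BLK_ADDR_MASK 0x7FFFFFFFu   /* remaining 31 bits hold the block number */

/* Pack a physical block number and its "not yet initialized" flag. */
uint32_t blk_encode(uint32_t pblk, int unwritten)
{
    assert((pblk & BLK_UNWRITTEN) == 0);   /* only 31-bit addresses fit */
    return unwritten ? (pblk | BLK_UNWRITTEN) : pblk;
}

uint32_t blk_addr(uint32_t slot)
{
    return slot & BLK_ADDR_MASK;
}

int blk_is_unwritten(uint32_t slot)
{
    return (slot & BLK_UNWRITTEN) != 0;
}

/* Largest addressable filesystem, in bytes, for a given address width. */
uint64_t max_fs_bytes(unsigned addr_bits, uint32_t blksz)
{
    return ((uint64_t)1 << addr_bits) * blksz;
}
```

With 4K blocks, 32 address bits cover 16 TiB and 31 bits cover 8 TiB;
moving to 8K blocks with 31 bits gets back to 16 TiB, which is the
recovery via larger block sizes described above.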

Furthermore, I suspect (especially after listening to a very fascinating
Usenix Invited Talk by Jeffery Dean, a fellow from Google two weeks
ago) that for many of Google's workloads, using a filesystem blocksize
of 16K or 32K might not be a bad thing in any case.

It would be a lot simpler

- Ted


Re: fallocate support for bitmap-based files

2007-06-29 Thread Dave Kleikamp
On Fri, 2007-06-29 at 16:52 -0400, Mike Waychison wrote:
 Dave Kleikamp wrote:
  
  By truncating the blocks file at the correct byte offset, only needing
  to zero some bits of the last byte of the file.
 
 We were thinking the unwritten blocks file would be indexed by physical 
 block number of the block device.  There wouldn't be a logical to 
 physical relationship for the blocks, so we wouldn't be able to get away 
 with truncating the blocks file itself.

I misunderstood.  I was thinking about a block-file per regular file
(that had preallocated blocks).  Ignore that comment.

 - When the fs comes to read a block from disk, it will need to consult
   the unwritten blocks file to see if that block should be zeroed by the
   CPU.
 
 - When the unwritten-block is written to, its bit in the unwritten blocks
   file gets zeroed.
 
 - An obvious efficiency concern: if a user file has no unwritten blocks
   in it, we don't need to consult the unwritten blocks file.
 
   Need to work out how to do this.  An obvious solution would be to have
   a number-of-unwritten-blocks counter in the inode.  But do we have space
   for that?
  
  
  Would it be too expensive to test the blocks-file page each time a bit
  is cleared to see if it is all-zero, and then free the page, making it a
  hole?  This test would stop if it finds any non-zero word, so it may not
  be too bad.  (This could further be done on a block basis if the block
  size is less than a page.)
 
 When clearing the bits, we'd likely see a large stream of writes to the 
 unwritten blocks, which could result in a O(n^2) pass of rescanning the 
 page over and over.  

If you start checking for zero at the bit that was just zeroed, you'd
likely find a non-zero bit right away, so you wouldn't be looking at too
much of the page in the typical case.
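
A minimal sketch of that scan, assuming the bitmap page is viewed as an
array of machine words.  The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return 1 if all `nwords` words are zero.  The scan starts at the word
 * holding the bit that was just cleared (`start`), wraps around, and
 * stops at the first non-zero word it meets.
 */
int page_all_zero_from(const unsigned long *words, size_t nwords, size_t start)
{
    assert(nwords > 0 && start < nwords);
    for (size_t i = 0; i < nwords; i++) {
        if (words[(start + i) % nwords] != 0)
            return 0;   /* neighbouring set bit found: exit early */
    }
    return 1;
}
```

When writes arrive roughly in order, the word just after the cleared
bit usually still has bits set, so the loop exits after one or two
iterations; the full scan only happens when the page is nearly empty.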

 Maybe each block of the unwritten blocks file could
 carry a small header with a count that could be cheaply tested?  I.e., the
 unwritten blocks file is composed of blocks that each have a small header
 that contains a count -- when the count hits zero, we could punch a hole
 in the file.

Having the data be just a bitmap seems more elegant to me.  It would be
nice to avoid keeping a count in the bitmap page if possible.
-- 
David Kleikamp
IBM Linux Technology Center



Re: fallocate support for bitmap-based files

2007-06-29 Thread Andrew Morton
On Fri, 29 Jun 2007 16:55:25 -0400
Theodore Tso wrote:

 On Fri, Jun 29, 2007 at 01:01:20PM -0700, Andrew Morton wrote:
  
  Guys, Mike and Sreenivasa at google are looking into implementing
  fallocate() on ext2.  Of course, any such implementation could and should
  also be portable to ext3 and ext4 bitmapped files.
 
 What's the eventual goal of this work?  Would it be for mainline use,
 or just something that would be used internally at Google?

Mainline, preferably.

  I'm not
 particularly enthused about supporting two ways of doing fallocate();
 one for ext4 and one for bitmap-based files in ext2/3/4.  Is the
 benefit really worth it?

umm, it's worth it if you don't want to wear the overhead of journalling,
and/or if you don't want to wait on the, err, rather slow progress of ext4.

 What I would suggest, which would make this much easier, is to make this
 an incompatible extension (which, as you point out, is needed for
 security reasons anyway) and then steal the high bit from the block
 number field to indicate whether the block has been initialized.
 That way you don't end up having to seek to a potentially
 distant part of the disk to check out the bitmap.  Also, you don't
 have to worry about how to recover if the block initialized bitmap
 inode gets smashed.  
 
 The downside is that it reduces the maximum size of the filesystem
 supported by ext2 by a factor of two.  But, there are at least two
 patch series floating about that promise to allow filesystem block
 sizes > PAGE_SIZE, which would allow you to recover the maximum
 size supported by the filesystem.
 
 Furthermore, I suspect (especially after listening to a very fascinating
 Usenix Invited Talk by Jeffery Dean, a fellow from Google two weeks
 ago) that for many of Google's workloads, using a filesystem blocksize
 of 16K or 32K might not be a bad thing in any case.
 
 It would be a lot simpler
 

Hadn't thought of that.

Also, it's unclear to me why google is going this way rather than using
(perhaps suitably-tweaked) ext2 reservations code.

Because the stock ext2 block allocator sucks big-time.


Re: fallocate support for bitmap-based files

2007-06-29 Thread Andreas Dilger
On Jun 29, 2007  16:55 -0400, Theodore Tso wrote:
 What's the eventual goal of this work?  Would it be for mainline use,
 or just something that would be used internally at Google?  I'm not
 particularly enthused about supporting two ways of doing fallocate();
 one for ext4 and one for bitmap-based files in ext2/3/4.  Is the
 benefit really worth it?
 
 What I would suggest, which would make this much easier, is to make this
 an incompatible extension (which, as you point out, is needed for
 security reasons anyway) and then steal the high bit from the block
 number field to indicate whether the block has been initialized.
 That way you don't end up having to seek to a potentially
 distant part of the disk to check out the bitmap.  Also, you don't
 have to worry about how to recover if the block initialized bitmap
 inode gets smashed.  
 
 The downside is that it reduces the maximum size of the filesystem
 supported by ext2 by a factor of two.  But, there are at least two
 patch series floating about that promise to allow filesystem block
 sizes > PAGE_SIZE, which would allow you to recover the maximum
 size supported by the filesystem.

I don't think ext2 is safe for > 8TB filesystems anyways, so this
isn't a huge loss.

The other possibility, assuming Google likes ext2 because they
don't care about e2fsck, is to patch ext4 to not use any
journaling (i.e. make all of the ext4_journal*() wrappers be
no-ops).  That way they would get extents, mballoc and other speedups.
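
A minimal sketch of what "wrappers as no-ops" could look like.  The
types and function names here are hypothetical stand-ins, not the real
ext4 or jbd interfaces; the only assumption is that callers already go
through a small set of wrapper functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the real ext4/jbd interfaces. */
struct sk_handle { int credits; };
struct sk_fs     { int has_journal; int journal_ops; struct sk_handle h; };

/* With no journal, hand back NULL; every other wrapper treats a NULL
 * handle as "nothing to do". */
struct sk_handle *sk_journal_start(struct sk_fs *fs, int credits)
{
    assert(credits > 0);
    if (!fs->has_journal)
        return NULL;
    fs->h.credits = credits;
    fs->journal_ops++;
    return &fs->h;
}

int sk_journal_dirty_metadata(struct sk_fs *fs, struct sk_handle *h)
{
    if (h == NULL)
        return 0;      /* no journal: caller just marks the buffer dirty */
    fs->journal_ops++;
    return 0;
}

int sk_journal_stop(struct sk_fs *fs, struct sk_handle *h)
{
    if (h != NULL)
        fs->journal_ops++;
    return 0;
}
```

The rest of the filesystem code stays unchanged in this model: it
starts a handle, dirties metadata and stops the handle exactly as
before, and the journal work simply does not happen.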

That said, what is the reason for not using ext3?  Presumably performance
(which is greatly improved in ext4) or is there something else?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



Re: fallocate support for bitmap-based files

2007-06-29 Thread Mike Waychison

Andrew Morton wrote:

On Fri, 29 Jun 2007 16:55:25 -0400
Theodore Tso wrote:



On Fri, Jun 29, 2007 at 01:01:20PM -0700, Andrew Morton wrote:


Guys, Mike and Sreenivasa at google are looking into implementing
fallocate() on ext2.  Of course, any such implementation could and should
also be portable to ext3 and ext4 bitmapped files.


What's the eventual goal of this work?  Would it be for mainline use,
or just something that would be used internally at Google?



Mainline, preferably.



I'm not
particularly enthused about supporting two ways of doing fallocate();
one for ext4 and one for bitmap-based files in ext2/3/4.  Is the
benefit really worth it?



umm, it's worth it if you don't want to wear the overhead of journalling,
and/or if you don't want to wait on the, err, rather slow progress of ext4.



What I would suggest, which would make this much easier, is to make this
an incompatible extension (which, as you point out, is needed for
security reasons anyway) and then steal the high bit from the block
number field to indicate whether the block has been initialized.
That way you don't end up having to seek to a potentially
distant part of the disk to check out the bitmap.  Also, you don't
have to worry about how to recover if the block initialized bitmap
inode gets smashed.  


The downside is that it reduces the maximum size of the filesystem
supported by ext2 by a factor of two.  But, there are at least two
patch series floating about that promise to allow filesystem block
sizes > PAGE_SIZE, which would allow you to recover the maximum
size supported by the filesystem.

Furthermore, I suspect (especially after listening to a very fascinating
Usenix Invited Talk by Jeffery Dean, a fellow from Google two weeks
ago) that for many of Google's workloads, using a filesystem blocksize
of 16K or 32K might not be a bad thing in any case.

It would be a lot simpler




Hadn't thought of that.

Also, it's unclear to me why google is going this way rather than using
(perhaps suitably-tweaked) ext2 reservations code.

Because the stock ext2 block allocator sucks big-time.


The primary reason this is a problem is that our writers into these 
files aren't necessarily coming from the same hosts in the cluster, so
their arrival times aren't sequential.  It ends up looking to the kernel 
like a random write workload, which in turn ends up causing odd 
fragmentation patterns that aren't very deterministic.  That data is 
often eventually streamed off the disk though, which is when the 
fragmentation hurts.


Currently, our clustered filesystem supports pre-allocation of the
target chunks of files, but this is implemented by effectively writing
zeroes to files, which in turn causes pagecache churn and a double
write-out of the blocks.  Recently, we've changed the code to minimize 
this pagecache churn and double write out by performing an ftruncate to 
extend files, but then we'll be back to square-one in terms of 
fragmentation for the random writes.
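
A minimal userspace sketch contrasting the approaches described above
with an explicit preallocation request.  The helper names are
hypothetical; pwrite(), ftruncate(), fstat() and posix_fallocate() are
the standard POSIX calls.  This only illustrates the three call
patterns and says nothing about how any particular filesystem lays the
blocks out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Approach 1: extend by writing zeroes.  Blocks get allocated, but every
 * byte passes through the pagecache and out to disk.  Returns 0 on success. */
int prealloc_by_zeroing(int fd, off_t len)
{
    static const char zeros[4096];
    off_t done = 0;

    assert(len >= 0);
    while (done < len) {
        size_t chunk = (len - done) < (off_t)sizeof zeros
                     ? (size_t)(len - done) : sizeof zeros;
        ssize_t n = pwrite(fd, zeros, chunk, done);
        if (n <= 0)
            return -1;
        done += n;
    }
    return 0;
}

/* Approach 2: extend with ftruncate().  Cheap, but it typically leaves a
 * hole, so block placement is still decided later, at write time. */
int extend_by_truncate(int fd, off_t len)
{
    return ftruncate(fd, len);
}

/* Approach 3: ask for the blocks up front.  posix_fallocate() returns 0 or
 * an error number; the C library may emulate it by writing when the
 * filesystem has no native support. */
int prealloc_by_fallocate(int fd, off_t len)
{
    return posix_fallocate(fd, 0, len);
}

/* Helper: current file size, or -1 on error. */
off_t file_size(int fd)
{
    struct stat st;
    return fstat(fd, &st) == 0 ? st.st_size : (off_t)-1;
}
```

All three leave the file at the requested size; they differ in whether
blocks are allocated at that point and in how much data is written to
get there.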


Relying on (a tweaked) reservations code is also somewhat limiting at
this stage given that reservations are lost on close(fd).  Unless we 
change the lifetime of the reservations (maybe for the lifetime of the 
in-core inode?), crank up the reservation sizes and deal with the 
overcommit issues, I can't think of any better way at this time to deal 
with the problem.


Mike Waychison


Re: fallocate support for bitmap-based files

2007-06-29 Thread Mike Waychison

Andreas Dilger wrote:

On Jun 29, 2007  16:55 -0400, Theodore Tso wrote:


What's the eventual goal of this work?  Would it be for mainline use,
or just something that would be used internally at Google?  I'm not
particularly enthused about supporting two ways of doing fallocate();
one for ext4 and one for bitmap-based files in ext2/3/4.  Is the
benefit really worth it?

What I would suggest, which would make this much easier, is to make this
an incompatible extension (which, as you point out, is needed for
security reasons anyway) and then steal the high bit from the block
number field to indicate whether the block has been initialized.
That way you don't end up having to seek to a potentially
distant part of the disk to check out the bitmap.  Also, you don't
have to worry about how to recover if the block initialized bitmap
inode gets smashed.  


The downside is that it reduces the maximum size of the filesystem
supported by ext2 by a factor of two.  But, there are at least two
patch series floating about that promise to allow filesystem block
sizes > PAGE_SIZE, which would allow you to recover the maximum
size supported by the filesystem.



I don't think ext2 is safe for > 8TB filesystems anyways, so this
isn't a huge loss.


This is in reference to the idea of overloading the high bit, and not
related to the PAGE_SIZE blocks, correct?




The other possibility, assuming Google likes ext2 because they
don't care about e2fsck, is to patch ext4 to not use any
journaling (i.e. make all of the ext4_journal*() wrappers be
no-ops).  That way they would get extents, mballoc and other speedups.



We do care about the e2fsck problem, though the cost/benefit of e2fsck 
times/memory problems vs the overhead of journalling doesn't weigh in 
journalling's favour for a lot of our per-spindle-latency bound 
applications.  These apps manage to get pretty good disk locality 
guarantees and the journal overheads can induce undesired head movement.


ext4 does look very promising, though I'm not certain it's ready for our 
consumption.


What are people's thoughts on providing ext3 non-journal mode?  We could 
benefit from several of the additions to ext3 that aren't available in 
ext2 and disabling journalling there sounds much more feasible for us 
instead of trying to backport each ext3 component to ext2.


Mike Waychison


That said, what is the reason for not using ext3?  Presumably performance
(which is greatly improved in ext4) or is there something else?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.





Re: fallocate support for bitmap-based files

2007-06-29 Thread Andreas Dilger
On Jun 29, 2007  18:26 -0400, Mike Waychison wrote:
 Andreas Dilger wrote:
 I don't think ext2 is safe for > 8TB filesystems anyways, so this
 isn't a huge loss.
 
 This is in reference to the idea of overloading the high bit, and not
 related to the PAGE_SIZE blocks, correct?

Correct - just that the high-bit use wouldn't unduly impact the
already-existing 8TB limit of ext2.

The other thing to note is that Val Henson already ported the ext3
reservation code to ext2, so this is a pretty straightforward
option for you and also doesn't affect the on-disk format.

 The other possibility, assuming Google likes ext2 because they
 don't care about e2fsck, is to patch ext4 to not use any
 journaling (i.e. make all of the ext4_journal*() wrappers be
 no-ops).  That way they would get extents, mballoc and other speedups.
 
 We do care about the e2fsck problem, though the cost/benefit of e2fsck 
 times/memory problems vs the overhead of journalling doesn't weigh in 
 journalling's favour for a lot of our per-spindle-latency bound 
 applications.  These apps manage to get pretty good disk locality 
 guarantees and the journal overheads can induce undesired head movement.

You could push the journal to a separate spindle, but that may not be
practical.

 ext4 does look very promising, though I'm not certain it's ready for our 
 consumption.

FYI, the extents code (the most complex part of ext4) has been running for
a couple of years on many PB of storage at CFS, so it is by no means new
and untried code.  There are definitely less-well tested changes in ext4
but they are mostly straightforward.  I'm not saying you should jump right
into ext4, but it isn't as far away as you might think.

 What are people's thoughts on providing ext3 non-journal mode?  We could 
 benefit from several of the additions to ext3 that aren't available in 
 ext2 and disabling journalling there sounds much more feasible for us 
 instead of trying to backport each ext3 component to ext2.

This is something we've talked about for a long time, and I'd be happy to
have this possibility.  This would also allow you to take similar advantage
of extents, the improved allocator and other features.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
