Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-26 Thread Amit Gud

Jeff Dike wrote:

On Thu, Apr 26, 2007 at 10:53:16AM -0500, Amit Gud wrote:

Jeff Dike wrote:

How about this case:

Growing file starts in chunk A.
Overflows into chunk B.
Delete file in chunk A.
Growing file overflows chunk B and spots new free space in
chunk A (and nothing anywhere else)
Overflows into chunk A
Delete file in chunk B.
Overflow into chunk B again.

Maybe this is not realistic, but in the absence of a mechanism to pull
data back from an overflow chunk, it seems at least a theoretical
possibility that there could be > 1 continuation inodes per file per
chunk.

Preventive measures are taken to limit only one continuation inode per 
file per chunk. This can be done easily in the chunk allocation 
algorithm for disk space. Although I'm not quite sure what you mean by 
"Delete file in chunk A". If you are referring to same file thats 
growing, then deletion is not possible, because individual parts of any 
file in any chunk cannot be deleted.


No, I'm referring to a different file.  The scenario is that you have
a growing file in a nearly full disk with files being deleted (and
thus space being freed) such that allocations for the growing file
bounce back and forth between chunks.



In such scenario either lot of continuation inodes < number of chunks 
would be created or lot of sparse pieces would be created. But we can 
certainly enforce the constraint of one continuation inode per file per 
chunk, excluding the file's primary chunk in which it started.



AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-26 Thread Amit Gud

Alan Cox wrote:
Preventive measures are taken to limit only one continuation inode per 
file per chunk. This can be done easily in the chunk allocation 
algorithm for disk space. Although I'm not quite sure what you mean by 


How are you handling the allocation in this situation, are you assuming
that a chunk is "out of bounds" because part of a file already lives on
it or simply keeping a single inode per chunk which has multiple sparse
pieces of the file on it ?

ie if I write 0-8MB to chunk A and then 8-16 to chunk B can I write
16-24MB to chunk A producing a single inode of 0-8 16-24, or does it have
to find another chunk to use ?


Hello Alan,

You re-use the same inode with multiple sparse pieces.

This way you avoid hopping around continuation inodes and coming back to 
same chunk with which you started but this time on a different 
continuation inode. This may not be I/O intensive for successive 
traversals if the continuation inodes are pinned in the memory, but it 
certainly is a waste of resource - inodes. Not allowing this would make 
worst case of every file having a continuation inode in every chunk, 
even worse; may be like only single file exist in the file system and 
rest all inodes in all chunks (including file's own chunk) are 
continuation inodes.



AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-26 Thread Amit Gud

Jeff Dike wrote:

How about this case:

Growing file starts in chunk A.
Overflows into chunk B.
Delete file in chunk A.
Growing file overflows chunk B and spots new free space in
chunk A (and nothing anywhere else)
Overflows into chunk A
Delete file in chunk B.
Overflow into chunk B again.

Maybe this is not realistic, but in the absence of a mechanism to pull
data back from an overflow chunk, it seems at least a theoretical
possibility that there could be > 1 continuation inodes per file per
chunk.



Preventive measures are taken to limit only one continuation inode per 
file per chunk. This can be done easily in the chunk allocation 
algorithm for disk space. Although I'm not quite sure what you mean by 
"Delete file in chunk A". If you are referring to same file thats 
growing, then deletion is not possible, because individual parts of any 
file in any chunk cannot be deleted.



AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-25 Thread Amit Gud

Andreas Dilger wrote:

How do you recover if fsfuzzer takes out a cnode in the chain? The
chunk is marked clean, but clearly corrupted and needs fixing and
you don't know what it was pointing at.  Hence you have a pointer to
a trashed cnode *somewhere* that you need to find and fix, and a
bunch of orphaned cnodes that nobody points to *somewhere else* in
the filesystem that you have to find. That's a full scan fsck case,
isn't?


Presumably, the cnodes in the other chunks contain forward and back
references.  Those need to contain at minimum inode + generation + chunk
to avoid problem of pointing to a _different_ inode after such corruption
caused the old inode to be deleted and a new one allocated in its place.

If the cnode in each chunk is more than just a singly-linked list, the
file as a whole could survive multiple chunk corruptions, though there
would now be holes in the file.


It seems that any sort of damage to the underlying storage (e.g.
media error, I/O error or user brain explosion) results in the need
to do a full fsck and hence chunkfs gives you no benefit in this
case.




Yes, what originated from discussions on #linuxfs is that redundancy is 
required for cnodes, in order to avoid checking the entire file system 
in search of a dangling cnode reference or for "parent" of a cnode.


If corruption, due to any reason, occurs in any other part of the file 
system, it would be localized for that chunk. Even if entire fsck is 
needed, chances of which are rare, full fsck of chunked file system is 
no worse than fsck of non-chunked file system. Passes 3, 4, and 5 of 
fsck take only 10-15% of total fsck run time and almost no-I/O Pass 6 
for chunkfs wouldn't add whole lot.



AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-24 Thread Amit Gud

Nikita Danilov wrote:

Maybe I failed to describe the problem presicely.

Suppose that all chunks have been checked. After that, for every inode
I0 having continuations I1, I2, ... In, one has to check that every
logical block is presented in at most one of these inodes. For this one
has to read I0, with all its indirect (double-indirect, triple-indirect)
blocks, then read I1 with all its indirect blocks, etc. And to repeat
this for every inode with continuations.

In the worst case (every inode has a continuation in every chunk) this
obviously is as bad as un-chunked fsck. But even in the average case,
total amount of io necessary for this operation is proportional to the
_total_ file system size, rather than to the chunk size.



Perhaps, I should talk about how continuation inodes are managed / 
located on disk. (This is how it is in my current implementation)


Right now, there is no distinction between an inode and continuation 
inode (also referred to as 'cnode' below), except for the 
EXT2_IS_CONT_FL flag. Every inode holds a list of static number of 
inodes, currently limited to 4.


The structure looks like this:

 -- --
| cnode 0  |-->| cnode 0  |--> to another cnode or NULL
 -- --
| cnode 1  |-  | cnode 1  |-
 -- |   --  |
| cnode 2  |--  |  | cnode 2  |--   |
 --  |  |   --  |   |
| cnode 3  | |  |  | cnode 3  | |   |
 --  |  |   --  |   |
  |  |  ||  |   |

   inodes   inodes or NULL

I.e. only first cnode in the list carries forward the chain if all the 
slots are occupied.


Every cnode# field contains
{
ino_t cnode;
__u32 start;/* starting logical block number */
__u32 end;  /* ending logical block number */
}

(current implementation has just one field: cnode)

I thought of this structure to avoid recursion and / or use of any data 
structure while traversing the continuation inodes.


Additional flag, EXT2_SPARSE_CONT_FL would indicate whether the inode 
has any sparse portions. 'start' and 'end' fields are used to speed-up 
finding a cnode given a logical block number without the need of 
actually reading the inode - this can be done away with, perhaps more 
conveniently by, pinning the cnodes in memory as and when read.


Now going back to the Nikita's question, all the cnodes for an inode 
need to be scanned iff 'start' field or number of blocks or flag 
EXT2_SPARSE_CONT_FL in any of its cnodes is altered.


And yes, the whole attempt is to reduce the number of continuation inodes.

Comments, suggestions welcome.


AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ChunkFS - measuring cross-chunk references

2007-04-23 Thread Amit Gud

On Mon, 23 Apr 2007, Amit Gud wrote:


On Mon, 23 Apr 2007, Arjan van de Ven wrote:



>  The other thing which we should consider is that chunkfs really
>  requires a 64-bit inode number space, which means either we only allow

 does it?
 I'd think it needs a "chunk space" number and a 32 bit local inode
 number ;) (same for blocks)



For inodes, yes, either 64-bit inode or some field for the chunk id in which 
the inode is. But for block numbers, you don't. Because individual chunks 
manage part of the whole file system in an independent way. They have their 
block bitmaps starting at an offset. Inode bitmaps, however, remains same.




In that sense, we also can do away without having chunk identifier encoded 
into inode number and chunkfs would still be fine with it. But we will 
then loose inode uniqueness property, which could well be OK as it is with 
other file systems in which inode number is not sufficient for unique 
identification of an inode.



AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ChunkFS - measuring cross-chunk references

2007-04-23 Thread Amit Gud

On Mon, 23 Apr 2007, Arjan van de Ven wrote:




The other thing which we should consider is that chunkfs really
requires a 64-bit inode number space, which means either we only allow


does it?
I'd think it needs a "chunk space" number and a 32 bit local inode
number ;) (same for blocks)



For inodes, yes, either 64-bit inode or some field for the chunk id in 
which the inode is. But for block numbers, you don't. Because individual 
chunks manage part of the whole file system in an independent way. They 
have their block bitmaps starting at an offset. Inode bitmaps, however, 
remains same.



AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-23 Thread Amit Gud

Suparna Bhattacharya wrote:

Could you send this out as a patch to ext2 codebase, so we can just look
at the changes for chunkfs ? That might also make it small enough
to inline your patch in email for review. 


What kind of results are you planning to gather to evaluate/optimize this ?



Mainly I'm trying to gather following:

- Graph of continuation inodes vs. the file system fragmentation (or 
aging) factor with varying configurations of chunk sizes


- Graph of wall clock time vs. disk size + data on the disk with both 
chunkfs and native ext2, and/or other file systems



AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-23 Thread Amit Gud


This is an initial implementation of ChunkFS technique, briefly discussed
at: http://lwn.net/Articles/190222 and 
http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf


This implementation is done within ext2 driver. Every chunk is an 
independent ext2 file system. The knowledge about chunks is kept within 
ext2 and 'continuation inodes', which are used to allow files and 
directories span across multiple chunks, are managed within ext2.


At mount time, super blocks for all the chunks are created and linked with 
the global super_blocks list maintained by VFS. This allows independent 
behavior or individual chunks and also helps writebacks to happen 
seamlessly.


Apart from this, chunkfs code in ext2 effectively only provides knowledge of:

- what inode's which block number to look for, for a given file's logical 
block number

- in which chunk to allocate next inode / block
- number of inodes to scan when a directory is being read

To maintain the ext2's inode number uniqueness property, 8 msb bits of 
inode number are used to indicate the chunk number in which it resides.


As said, this is a preliminary implementation and lots of changes are 
expected before this code is even sanely usable. Some known issues and 
obvious optimizations are listed in the TODO file in the chunkfs patch.


http://cis.ksu.edu/~gud/patches/chunkfs-v0.0.8.patch
- one big patch
- applies to 2.6.18

Attached - ext2-chunkfs-diff.patch.gz
- since the code is a spin-off of ext2, this patch explains better what
  has changed from the ext2.

git://cislinux.cis.ksu.edu/chunkfs-tools
- mkfs, and fsck for chunkfs.

http://cis.ksu.edu/~gud/patches/config-chunkfs-2.6.18-uml
- config file used; tested mostly on UML with loopback file systems.

NOTE: No xattrs and xips yet, CONFIG_EXT2_FS_XATTR and CONFIG_EXT2_FS_XIP 
should be "no" for clean compile.



Please comment, suggest, criticize. Patches most welcome.


Best,
AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud

ext2-chunkfs-diff.patch.gz
Description: Binary data


Re: ChunkFS - measuring cross-chunk references

2007-04-22 Thread Amit Gud

Karuna sagar K wrote:

Hi,

The attached code contains program to estimate the cross-chunk
references for ChunkFS file system (idea from Valh). Below are the
results:



Nice to see some numbers! But would be really nice to know:

- what the chunk size is
- how the files were created or, more vaguely, how 'aged' the fs is
- what is the chunk allocation algorithm


Best,
AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Remove dead ext2 code

2007-03-25 Thread Amit Gud


AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud

Remove dead code. Unless used for padding (which doesn't seem so), no users of 
s_dir_count.

Signed-off-by: Amit Gud <[EMAIL PROTECTED]>

--- ext2_fs_sb.h.copy   2007-03-25 20:36:45.0 -0500
+++ ext2_fs_sb.h2007-03-25 20:37:55.0 -0500
@@ -47,7 +47,6 @@ struct ext2_sb_info {
int s_first_ino;
spinlock_t s_next_gen_lock;
u32 s_next_generation;
-   unsigned long s_dir_count;
u8 *s_debts;
struct percpu_counter s_freeblocks_counter;
struct percpu_counter s_freeinodes_counter;