Re: e2fsprogs-1.39-tyt3

2007-05-01 Thread Andreas Dilger
On Apr 30, 2007  11:22 -0400, Theodore Ts'o wrote:
> One concern I still have is the fact that we're exposing a lot of
> interfaces in libext2fs.so which are very specifically tied to the
> current 48-bit physical/32-bit logical on-disk extent data structure.
> If/when we add support for the 64/64-bit extent data structure, and some
> kind of compressed/bit-packed extent format, those interfaces will have
> to be extended.

Some other comments:
- it appears you have all of extents.c copied into both "block.c" and
  in "extent.c"
- it appears you are missing a couple of extent sanity checks for headers
  and such that were added more recently, but those might also have been
  removed by you
- do the patches as they stand actually fix the duplicate block test
  cases yet?  It seems all of the BLOCK_CHANGED handling was removed
  from block_iterate_extents().
- it would be easier to read if the patches were generated with "diff -up"
  (can be set in .quiltrc via 'export QUILT_DIFF_OPTS="-up"'.  I also like
  "export QUILT_NO_DIFF_TIMESTAMPS=1" to avoid tons of gratuitous changes
  to patches when they are refreshed.

> Another problem is that while external extent blocks are getting byte
> swapped, there is no byte swapping going on for the extents stored in
> the inode.  I haven't tried it on a big endian system, such as Power
> machine, but I'm pretty sure it's going to blow up spectacularly.

Ah, interesting.  We have 64-bit machines for testing, but no big-endian
ones.

> The patches can be found at:
> 
> ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs-interim/e2fsprogs-1.39-tyt3

What else I think is important to get into this patch series is a change
to "mkfs.ext4" so that the default inode size is increased to 256.  I'm
not sure how/if this can be done via mke2fs.conf.

I also wouldn't object to changing the default inode_ratio on large
filesystems to be a lot fewer than 1/8192 bytes.  At a minimum, allowing
an inode_ratio > 1/4MB should be allowed.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ext4 devel interlock meeting minutes (April 23, 2007)

2007-05-01 Thread Kalpak Shah
On Mon, 2007-04-30 at 16:36 +0530, Aneesh Kumar wrote:
> On 4/24/07, Avantika Mathur <[EMAIL PROTECTED]> wrote:
> > Ext4 Developer Interlock Call: 04/23/2007 Meeting Minutes
> >
> > TESTING
> > - extents testing
> > - Discussed methods for testing extents on highly fragmented
> > filesystems.
> > - Jose will look into possible tests, including perhaps using the
> > 'aged' option in FFSB
> > - Ted suggested creating a mountoption that creates a bad block
> > allocator which it jumps to a new block group every 8 blocks.  This
> > would force a very large number of extents, and may be a good test for
> > extents.
> 
> 
> What i am doing for creating a large number of extents is
> 
> dd if=/dev/zero of=myfile count=10
> seek=20
> while [ 1 ]; do dd if=/dev/zero of=myfile count=10 seek=$seek;
> seek=`expr $seek + 20`; done
> 
> 

I had written a simple tool "bitmap_manip" with which you can actually
manipulate the number of free chunks and their sizes in a filesystem. It
uses libext2fs to set the bits in block bitmaps thereby leaving the
desired free extents. I had written it to test the allocators
performance. 

It can be used as:
 ./bitmap_manip /dev/sda9 1MA 4 16K 1 12K 3 8K 4 4K 6
 
This will leave only 1 16K chunk, 3 12K chunks,  free in the
filesystem. "1MA" 4 will get us 4 1Mb free ALIGNED chunks.

It isn't very beautiful code since it was only used for testing but
maybe it can help.

Thanks,
Kalpak.

> -aneesh
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
/* Manipulate block bitmap directly for mballoc testing */

/* USAGE:
 * ./bitmap_manip /dev/volmballoc/test 16K 1 12K 3 8K 4 4K 6
 * This will leave 1 16K chunk, 3 12K chunks,  in the filesystem specified.
 * Ideally give the inputs in ascending order. 
 * 1MA 4 will get us 4 1Mb ALIGNED chunks.	
 */

#include 
#include 
#include 
#include 
#include 

#define ONE_MB (1024 * 1024)
#define ONE_KB 1024

#define SETTING 0
#define FREEING 1

#define NO_ALIGN 0
#define ALIGN 1

struct chunk_arg {
	int chunk_size;
	int num_chunks;
	int align;
};

int main(int argc, char **argv)
{
	ext2_filsys fs;
	ext2fs_block_bitmap *map = NULL;
	int bg_num = 0, retval, arg_num, multiply, chunk_num;
	int i, start_blk, set_bit, test_bit, j;
	struct chunk_arg chunk[50];
	int free_blocks_req = 0, free_blocks_avail, num_of_chunks_req = 0, group;
	char str[10];
	float orig_avail_req, avail_req;
	int set_till_now, free_till_now, num_blks_to_set, num_blks_to_free, phase;
	int  current, align_flag = 0, align = 0, curr = 0;

	if (argc < 2) {
		printf("Please give name of a filesystem. Exiting...\n");
		return -1;
	}
	
	/* Even from user's perspective */
	if(argc & 0x01) {
		printf("This utility cannot have even number of arguments.\n");
		return -1;
	}

	if ((retval = ext2fs_open(argv[1], EXT2_FLAG_RW, 0, 0, unix_io_manager, &fs))) {
		com_err("ext2fs open:", retval, "while opening %s\n", argv[1]);
		return retval;
	}	

	srand(1234567);
	chunk_num = 0;
	for (arg_num = 2; arg_num < argc; arg_num += 2, chunk_num++) {
		strcpy(str, argv[arg_num]);

		/* Check if we have to align */
		if (toupper(str[strlen(str) - 1 ]) == 'A') {
			chunk[chunk_num].align = ALIGN;
			str[strlen(str) - 1] = '\0';	
			align = 1;
		}
		else
			chunk[chunk_num].align = NO_ALIGN;
		if (toupper(str[strlen(str) - 1]) == 'K')
			multiply = ONE_KB;
		else if(toupper(str[strlen(str) - 1]) == 'M') 
			multiply = ONE_MB;

		str[strlen(str) - 1] = '\0';
		chunk[chunk_num].chunk_size = ((strtod(str, NULL)) * multiply)/ (fs->blocksize);
		chunk[chunk_num].num_chunks = strtod(argv[arg_num + 1], NULL); 
			
		free_blocks_req += chunk[chunk_num].chunk_size * chunk[chunk_num].num_chunks;
		num_of_chunks_req += chunk[chunk_num].num_chunks;
	}

	ext2fs_read_block_bitmap(fs);			
	map = &fs->block_map;	

	start_blk = fs->super->s_first_data_block;
	free_blocks_avail = fs->super->s_free_blocks_count;

	orig_avail_req = free_blocks_avail / free_blocks_req;
	current = 0;
	i = start_blk;
	
	num_blks_to_set = (orig_avail_req / 4) * chunk[current].chunk_size;
	num_blks_to_free = chunk[current].chunk_size;
	phase = SETTING;
	do {
		test_bit = i;
		if (!ext2fs_fast_test_block_bitmap(*map, test_bit)) {
			if (phase == SETTING) {
if (chunk[current].align == ALIGN && chunk[current].num_chunks > 0) {
	if (align_flag == 0) {
		num_blks_to_set = (i / chunk[current].chunk_size + 1) * 
			chunk[current].chunk_size - i;
		align_flag = 1;
	}
	else if (i % chunk[current].chunk_size == 0) {
		num_blks_to_set = 0;
		phase = FREEING;
	}
}
set_bit = i;
		ext2fs_mark_block_bitmap(*map, set_bit); 
group = (set_bit - fs->super->s_first_data_block) / fs->super->s_blocks_per_group;
fs->group_desc[group].bg_free_blocks_count--;
fs->super->s_free_blocks_count--;		
num_blks_to_set--;
if (num_blks_to_set == 

Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-05-01 Thread David Chinner
On Mon, Apr 30, 2007 at 09:39:06PM -0700, Nicholas Miell wrote:
> On Tue, 2007-05-01 at 14:22 +1000, David Chinner wrote:
> > On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote:
> > > This is actually for future use.  Any flags that are added into this
> > > range must be understood by both sides or it should be considered an
> > > error.  Flags outside the FIEMAP_FLAG_INCOMPAT do not necessarily need
> > > to be supported.  If it turns out that 8 bits is too small a range for
> > > INCOMPAT flags, then we can make 0x0100 an incompat flag that means
> > > e.g. 0x00ff are also incompat flags also.
> > 
> > Ah, ok. So it's not really a set of "compatibility" flags, it's more a
> > "compulsory" set. Under those terms, i don't really see why this is
> > necessary - either the filesystem will understand the flags or it will
> > return EINVAL or ignore them...
> > 
> > > I'm assuming that all flags that will be in the original FIEMAP proposal
> > > will be understood by the implementations.  Most filesystems can safely
> > > ignore FLAG_HSM_READ, for example, since they don't support HSM, and for
> > > that matter FLAG_SYNC is probably moot for most filesystems also because
> > > they do block allocation at preprw time.
> > 
> > Exactly my point - so why do we really need to encode a compulsory set of
> > 
> 
> Because flags have meaning, independent of whether or not the filesystem
> understands them. And if the filesystem chooses to ignore critically
> important flags (instead of returning EINVAL), bad things may happen.
> 
> So, either the filesystem will understand the flag or iff the unknown flag
> is in the incompat set, it will return EINVAL or else the unknown flag will
> be safely ignored.

My point was that there is a difference between specification and
implementation - if the specification says something is compulsory,
then they must be implemented in the filesystem. This is easy
enough to ensure by code review - we don't need additional interface
complexity for this

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: e2fsprogs-1.39-tyt3

2007-05-01 Thread Theodore Tso
On Tue, May 01, 2007 at 01:56:41AM -0600, Andreas Dilger wrote:
> On Apr 30, 2007  11:22 -0400, Theodore Ts'o wrote:
> > One concern I still have is the fact that we're exposing a lot of
> > interfaces in libext2fs.so which are very specifically tied to the
> > current 48-bit physical/32-bit logical on-disk extent data structure.
> > If/when we add support for the 64/64-bit extent data structure, and some
> > kind of compressed/bit-packed extent format, those interfaces will have
> > to be extended.
> 
> Some other comments:
> - it appears you have all of extents.c copied into both "block.c" and
>   in "extent.c"
> - it appears you are missing a couple of extent sanity checks for headers
>   and such that were added more recently, but those might also have been
>   removed by you

Yeah, I know.  That's also part of the patches being in process; it's
why they haven't been merged into the source tree yet.  Your newer
code with the extent sanity checks have been moved into block.c, and
mostly marked as static to prevent the function signatures from
leaking out.  That's what's actually being used for most of e2fsprogs
today.

My revamped abstractions are in extent.c, and it's missing the extent
sanity checks, and the ability to modify the on-disk extents given
changes in the abstract (disk-format-independent) extent structure.
It's not actually _used_ for anything other than debugfs's stat()
structure right now.  The goal is to take the extent functions in
block.c and migrate the functionality over into extent.c, but right
now I'm more focused on making sure the interfaces are right, since
once we commit them, we have to live with them forever.

> - do the patches as they stand actually fix the duplicate block test
>   cases yet?  It seems all of the BLOCK_CHANGED handling was removed
>   from block_iterate_extents().

Yes.  With all of the patches applied in my mercurial tree, only three
tests are failing: f_extents_eh_depth, f_extents_shrd_blk,
f_lotsbad.failed, which are the same as with your most recent
e2fsprogs patchset that I've seen.  There are 8 more tests that fail
in e2fsprogs-1.39-tyt-3, but those seem to be because the diff+patch
approach screwed up with the image.gz files, and I didn't think it was
worth it to fix that up for the interim patchset.

> - it would be easier to read if the patches were generated with "diff -up"
>   (can be set in .quiltrc via 'export QUILT_DIFF_OPTS="-up"'.  I also like
>   "export QUILT_NO_DIFF_TIMESTAMPS=1" to avoid tons of gratuitous changes
>   to patches when they are refreshed.

Thanks for the tip.  I'll do that next time.

> > Another problem is that while external extent blocks are getting byte
> > swapped, there is no byte swapping going on for the extents stored in
> > the inode.  I haven't tried it on a big endian system, such as Power
> > machine, but I'm pretty sure it's going to blow up spectacularly.
> 
> Ah, interesting.  We have 64-bit machines for testing, but no big-endian
> ones.

Three are public Linux/Power systems for development purposes where
anyone who is interested in developing Linux applications for Power
can get accounts at the Open Source Labs at Oregon State University:

http://powerdev.osuosl.org/

(Guess who's one of the primary sponsors of the OSLOSU. :-)

I haven't tried to use them since I normally use Power machines
available at Debian, but these should be sufficient at least for
running e2fprogs regression tests.  Testing of the ext4 kernel code on
a Power machine is going to be more difficult; there I think we will
have to depend on the IBM team to do testing under ABAT (which
hopefully should be easier anyway).

> > The patches can be found at:
> > 
> > ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs-interim/e2fsprogs-1.39-tyt3
> 
> What else I think is important to get into this patch series is a change
> to "mkfs.ext4" so that the default inode size is increased to 256.  I'm
> not sure how/if this can be done via mke2fs.conf.

I'd probably add code so that certain defaults could be overriden
based on argv[0] in /etc/mke2fs.conf.  Say,

[progname_defaults]
mkfs.ext4 = {
inode_size = 256
}

I'll put it in my todo list, or patches would be welcome.  (This is
one that I'd just commit into the hg SCM since it's a clean change.)

Speaking of SCM's, how wedded is Clustrefs to mercurial?  I've been
considering migrating e2fsprogs development to git, but one of the
reasons I haven't is because I was concerned that might be incovenient
for you folks.

> I also wouldn't object to changing the default inode_ratio on large
> filesystems to be a lot fewer than 1/8192 bytes.  

Yeah, maybe for a new fs_type "big" which is larger than 100GB, the
inode ratio should be 16384, and change the cutoff for "small" to be
under 2GB, perhaps.  Does that sound reasonable?

> At a minimum, allowing an inode_ratio > 1/4MB should be allowed.

Agreed; an average size of 4MB seemed outlandishly large 10 

Re: Ext2/3 block remapping tool

2007-05-01 Thread Theodore Tso
On Tue, May 01, 2007 at 12:01:42AM -0600, Andreas Dilger wrote:
> Except one other issue with online shrinking is that we need to move
> inodes on occasion and this poses a bunch of other problems over just
> remapping the data blocks.

Well, I did say "necessary", and not "sufficient".  But yes, moving
inodes, especially if the inode is currently open gets interesting.  I
don't think there are that many user space applications that would
notice or care if the st_ino of an open file changed out from under
them, but there are obviously userspace applications, such as tar,
that would most definitely care.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-05-01 Thread Anton Altaparmakov

On 1 May 2007, at 05:22, David Chinner wrote:

On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote:

  The FIBMAP ioctl is for privileged users
  only, and I wonder if FIEMAP should be the same, or at least  
disallow
  mapping files that the user can't access especially with  
FLAG_SYNC and/or

  FLAG_HSM_READ.


I see little reason for restricting FI[BE]MAP to privileged users -
anyone should be able to determine if files they have permission to
access are fragmented.


Allowing anyone to run FI[BE]MAP creates potential for DOS-ing the  
machine.  Perhaps for non-privileged users FIEMAP has to be read- 
only?  As soon as any of the FLAG_* flags come into play you make it  
privileged.  For example fancy any user being able to fill up your  
file system by calling FIEMAP with FLAG_HSM_READ on all files  
recursively?  This should certainly not be simply dismissed as a non- 
issue without thinking about it first...


Best regards,

Anton
--
Anton Altaparmakov  (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-05-01 Thread Anton Altaparmakov

On 1 May 2007, at 15:20, David Chinner wrote:

On Mon, Apr 30, 2007 at 09:39:06PM -0700, Nicholas Miell wrote:

On Tue, 2007-05-01 at 14:22 +1000, David Chinner wrote:

On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote:
This is actually for future use.  Any flags that are added into  
this
range must be understood by both sides or it should be  
considered an
error.  Flags outside the FIEMAP_FLAG_INCOMPAT do not  
necessarily need
to be supported.  If it turns out that 8 bits is too small a  
range for
INCOMPAT flags, then we can make 0x0100 an incompat flag  
that means

e.g. 0x00ff are also incompat flags also.


Ah, ok. So it's not really a set of "compatibility" flags, it's  
more a

"compulsory" set. Under those terms, i don't really see why this is
necessary - either the filesystem will understand the flags or it  
will

return EINVAL or ignore them...

I'm assuming that all flags that will be in the original FIEMAP  
proposal
will be understood by the implementations.  Most filesystems can  
safely
ignore FLAG_HSM_READ, for example, since they don't support HSM,  
and for
that matter FLAG_SYNC is probably moot for most filesystems also  
because

they do block allocation at preprw time.


Exactly my point - so why do we really need to encode a  
compulsory set of


Because flags have meaning, independent of whether or not the  
filesystem

understands them. And if the filesystem chooses to ignore critically
important flags (instead of returning EINVAL), bad things may happen.

So, either the filesystem will understand the flag or iff the  
unknown flag
is in the incompat set, it will return EINVAL or else the unknown  
flag will

be safely ignored.


My point was that there is a difference between specification and
implementation - if the specification says something is compulsory,
then they must be implemented in the filesystem. This is easy
enough to ensure by code review - we don't need additional interface
complexity for this


You are wrong about this because you are missing the point that you  
have no code to review.  The users that will use those flags are  
going to be applications that run in user space.  Chances are you  
will never see their code.  Heck, they might not even be open source  
applications...  And all applications will run against a multitude of  
kernels.  So version X of the application will run on kernel 2.4.*,  
2.6.*, a.b.*, etc...  For future expandability of the interface I  
think it is important to have both compulsory and non-compulsory flags.


For example there is no reason why FIEMAP_HSM_READ needs to be  
compulsory.  Most filesystems do not support HSM so can safely ignore  
it.  And applications that want to read/write the data locations that  
are obtained with the FIEMAP call will likely always supply  
FIEMAP_HSM_READ because they want to ensure the file is brought in if  
it is off line so they definitely want file systems that do not  
support this flag to ignore it.


And vice versa, an application might specify some weird and funky yet  
to be developed feature that it expects the FS to perform and if the  
FS cannot do it (either because it does not support it or because it  
failed to perform the operation) the application expects the FS to  
return an error and not to ignore the flag.  An example could be the  
asked for FIEMAP_XATTR_FORK flag.  If that is implemented, and the FS  
ignores it it will return the extent map for the file data instead of  
the XATTR_FORK!  Not what the application wanted at all.  Ouch!  So  
this is definitely a compulsory flag if I ever saw one.


So as you see you must support both voluntary and compulsory flags...

Also consider what I said above about different kernels.  A new  
feature is implemented in kernel 2.8.13 say that was not there before  
and an application is updated to use that feature.  There will be  
lots of instances where that application will still be run on older  
kernels where this feature does not exist.  Depending on the feature  
it may be quite sensible to simply ignore in the kernel that the  
application set an unknown flag whilst for a different feature it may  
be the opposite.


Best regards,

Anton
--
Anton Altaparmakov  (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: e2fsprogs-1.39-tyt3

2007-05-01 Thread Andreas Dilger
On May 01, 2007  11:05 -0400, Theodore Tso wrote:
> Speaking of SCM's, how wedded is Clustrefs to mercurial?  I've been
> considering migrating e2fsprogs development to git, but one of the
> reasons I haven't is because I was concerned that might be incovenient
> for you folks.

Not at all, we just generate a 1.39 to 1.39-WIP patch, and then use
quilt + CVS to manage the resulting patches.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ext2/3 block remapping tool

2007-05-01 Thread Andreas Dilger
On May 01, 2007  11:28 -0400, Theodore Tso wrote:
> On Tue, May 01, 2007 at 12:01:42AM -0600, Andreas Dilger wrote:
> > Except one other issue with online shrinking is that we need to move
> > inodes on occasion and this poses a bunch of other problems over just
> > remapping the data blocks.
> 
> Well, I did say "necessary", and not "sufficient".  But yes, moving
> inodes, especially if the inode is currently open gets interesting.  I
> don't think there are that many user space applications that would
> notice or care if the st_ino of an open file changed out from under
> them, but there are obviously userspace applications, such as tar,
> that would most definitely care.

I think "rm -r" does a LOT of this kind of operation, like:

stat(.); stat(foo); chdir(foo); stat(.); unlink(*); chdir(..); stat(.)

I think "find" does the same to avoid security problems with malicious
path manipulation.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] add I/O stats to e2fsck -tt

2007-05-01 Thread Jim Garlick
This patch instruments the libext2fs unix I/O manager and adds bytes 
read/written and data rate to e2fsck -tt pass/overall timing output.


Signed-off-by: Jim Garlick <[EMAIL PROTECTED]>

Index: e2fsprogs+chaos/lib/ext2fs/unix_io.c
===
--- e2fsprogs+chaos.orig/lib/ext2fs/unix_io.c
+++ e2fsprogs+chaos/lib/ext2fs/unix_io.c
@@ -70,6 +70,7 @@ struct unix_private_data {
int access_time;
ext2_loff_t offset;
struct unix_cache cache[CACHE_SIZE];
+   struct struct_io_stats io_stats;
 };

 static errcode_t unix_open(const char *name, int flags, io_channel *channel);
@@ -84,6 +85,8 @@ static errcode_t unix_write_byte(io_chan
int size, const void *data);
 static errcode_t unix_set_option(io_channel channel, const char *option,
 const char *arg);
+static errcode_t unix_get_stats(io_channel channel, 
+struct struct_io_stats *stats);


 static void reuse_cache(io_channel channel, struct unix_private_data *data,
 struct unix_cache *cache, unsigned long block);
@@ -110,11 +113,29 @@ static struct struct_io_manager struct_u
 #else
unix_write_byte,
 #endif
-   unix_set_option
+   unix_set_option,
+   unix_get_stats,
 };

 io_manager unix_io_manager = &struct_unix_manager;

+static errcode_t unix_get_stats(io_channel channel, 
+struct struct_io_stats *stats)

+{
+   errcode_t   retval = 0;
+
+   struct unix_private_data *data;
+
+   EXT2_CHECK_MAGIC(channel, EXT2_ET_MAGIC_IO_CHANNEL);
+   data = (struct unix_private_data *) channel->private_data;
+   EXT2_CHECK_MAGIC(data, EXT2_ET_MAGIC_UNIX_IO_CHANNEL);
+
+   if (stats)
+   *stats = data->io_stats;
+
+   return retval;
+}
+
 /*
  * Here are the raw I/O functions
  */
@@ -130,6 +151,7 @@ static errcode_t raw_read_blk(io_channel
int actual = 0;

size = (count < 0) ? -count : count * channel->block_size;
+   data->io_stats.reads += size;
location = ((ext2_loff_t) block * channel->block_size) + data->offset;
if (ext2fs_llseek(data->dev, location, SEEK_SET) != location) {
retval = errno ? errno : EXT2_ET_LLSEEK_FAILED;
@@ -168,6 +190,7 @@ static errcode_t raw_read_blk(io_channel
charsector[BLOCKALIGN];

size = (count < 0) ? -count : count * channel->block_size;
+   data->io_stats.reads += size;
location = ((ext2_loff_t) block * channel->block_size) + data->offset;
 #ifdef DEBUG
printf("count=%d, size=%d, block=%lu, blk_size=%d, location=%llx\n",
@@ -224,6 +247,7 @@ static errcode_t raw_write_blk(io_channe
else
size = count * channel->block_size;
}
+   data->io_stats.writes += size;

location = ((ext2_loff_t) block * channel->block_size) + data->offset;
if (ext2fs_llseek(data->dev, location, SEEK_SET) != location) {
Index: e2fsprogs+chaos/lib/ext2fs/ext2_io.h
===
--- e2fsprogs+chaos.orig/lib/ext2fs/ext2_io.h
+++ e2fsprogs+chaos/lib/ext2fs/ext2_io.h
@@ -55,6 +55,11 @@ struct struct_io_channel {
void*app_data;
 };

+struct struct_io_stats {
+   unsigned long long reads;
+   unsigned long long writes;
+};
+
 struct struct_io_manager {
errcode_t magic;
const char *name;
@@ -70,6 +75,8 @@ struct struct_io_manager {
int count, const void *data);
errcode_t (*set_option)(io_channel channel, const char *option,
const char *arg);
+	errcode_t (*get_stats)(io_channel channel, 
+struct struct_io_stats *io_stats);

int reserved[14];
 };

Index: e2fsprogs+chaos/e2fsck/e2fsck.h
===
--- e2fsprogs+chaos.orig/e2fsck/e2fsck.h
+++ e2fsprogs+chaos/e2fsck/e2fsck.h
@@ -135,6 +135,7 @@ struct resource_track {
struct timeval user_start;
struct timeval system_start;
void*brk_start;
+   struct struct_io_stats io_start;
 };
 #endif

@@ -473,8 +474,10 @@ extern void preenhalt(e2fsck_t ctx);
 extern char *string_copy(e2fsck_t ctx, const char *str, int len);
 #ifdef RESOURCE_TRACK
 extern void print_resource_track(const char *desc,
-struct resource_track *track);
-extern void init_resource_track(struct resource_track *track);
+struct resource_track *track,
+io_channel channel);
+extern void init_resource_track(struct resource_track *track, 
+io_channel channel);

 #endif
 extern int inode_has_valid_blocks(struct ext2_inode *inode);
 extern void e2fsck_read_inode(e2fsck_t ctx, unsigned long ino,
Index: e2fsprogs+chaos/e2fsck/util.c
===
-

Re: Ext2/3 block remapping tool

2007-05-01 Thread Theodore Tso
On Tue, May 01, 2007 at 12:52:49PM -0600, Andreas Dilger wrote:
> I think "rm -r" does a LOT of this kind of operation, like:
> 
> stat(.); stat(foo); chdir(foo); stat(.); unlink(*); chdir(..); stat(.)
> 
> I think "find" does the same to avoid security problems with malicious
> path manipulation.

Yep, so if you're doing an rm -rf (or any other recursive descent)
while we're doing an on-line shrink, it's going to fail.  I suppose we
could have an in-core inode mapping table that would continue to remap
inode numbers until the next reboot.  I'm not sure we would want to
keep the inode remapping indefinitely, although if we don't it could
also end up screwing up NFS as well.  Not sure I care, though.  :-)

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-05-01 Thread Andreas Dilger
On May 01, 2007  14:22 +1000, David Chinner wrote:
> On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote:
> > Hmm, I'd thought "offline" would migrate to EXTENT_UNKNOWN, but I didn't
> 
> I disagree - why would you want to indicate the state is unknown when we know
> very well that it is offline?

If you don't like "UNKNOWN", what about "UNMAPPED"?  I just want a
catch-all flag that indicates "this extent contains data but there is
nothing sensible to be returned for the extent mapping."

> Effectively, when your extent is offline in the HSM, it is inaccessable, and
> you have to bring it back from tape so it becomes accessible again. i.e. some
> action is necessary on behalf of the user to make it accessible. So I think
> that OFFLINE is a good name for this state because it really is inaccessible.

What you are calling OFFLINE I would prefer to call UNMAPPED, since that
can be used by applications as a catch-all for "no mapping".  There can
be further flags that give refinements to UNMAPPED that some applications
might care about them (e.g. HSM_RESIDENT), but many users/apps will not
if they just want the number of fragments in a given file.

> Also, I don't think "secondary" is a good term because most large systems
> have more than one tier of storage. One possibility is "HSM_RESIDENT"
> which indicates the extent is current and resident with a HSM's archive

Sure.

> > Can you propose reasonable flag names for these (I can't think of anything
> > very good) and a clear explanation of what they mean.  I suspect it will
> > only be XFS that uses them initially.  In mke2fs and ext4+mballoc there is
> > the concept of stripe unit and stripe width, but as yet they are not
> > communicated between the two very well.  I'd be much happier if this info
> > could be queried in a standard way from the block layer instead of the
> > user having to specify it and the filesystem having to track it.
> 
> My preference is definitely for a separate ioctl to grab the
> filesystem geometry so this stuff can be calculated in userspace.
> i.e. the way XFS does it right now (XFS_IOC_FSGEOMETRY). I won't
> bother trying to define names until we decide which appraoch we take
> to implement this.

Hmm, previously you wrote "This information could be easily passed up in the
flags fields if the filesystem has geometry information".  So, I _think_
what you are saying is that you want 4 flags to convey this start/end
alignment information, but the exact semantics of what a "stripe unit" and
a "stripe width" is filesystem specific?

I definitely do NOT want to get into any issues of querying the block
device geometry here.  I was just making a passing comment that ext4+mballoc
can already do RAID-specific allocation alignment, but it depends on the
admin to specify this information and it would be nice if there was some
easy way to get this from userspace/kernel interfaces.

Having an API that can request "tell me the number of blocks from this
offset until the next physical disk boundary" or similar would be useful
to any allocator, and the block layer already needs to know this when
submitting IO.

> In XFS, mkfs.xfs does the work of getting this information
> to see in the filesystem superblock. Here's the code for getting
> sunit/swidth from the underlying block device:
> 
> http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libdisk/
> 
> Not much in common there ;)

It looks like this might be just what e2fsprogs needs also.

> > It does make sense to specify zero for the fm_extent_count array and a
> > new FIEMAP_FLAG_NO_EXTENTS to return only the count of extents and not the
> > extent data itself, for the non-verbose mode of filefrag, and for
> > pre-allocating a buffer large enough to hold the file if that is important.
> 
> Rather than rely on implicit behaviour of "pass in extent count of
> zero and a don't try to return any extents" to return the number of
> extents on the file, why not just explicitly define this as a valid
> input flag? i.e. FIEMAP_FLAG_GET_NUMEXTENTS

That's what I said, isn't it?  FIEMAP_FLAG_NO_EXTENTS.  I wonder if my
clever-clever for "return no extents" and "return number of extents"
is wasted :-/.

> > - does XFS return an extent for the metadata parts of the file (e.g. btree)?
> 
> No, but we can return the extent map for the attribute fork (i.e.
> extended attrs) if asked for (XFS_IOC_GETBMAPA).

This seems like it would be a useful addition to the interface also, having
FIEMAP_FLAG_METADATA request the return of metadata allocations too.

> > - does XFS return preallocated extents beyond EOF?
> 
> Yes - they are part of the extent map for the file.

OK.

> > - does XFS allow non-root users to call xfs_bmap on files they don't own, or
> >   use by non-root users at all?
> 
> Users can run xfs_bmap on any file they have permission to
> open(O_RDONLY).
> 
> >   The FIBMAP ioctl is for privileged users
> >   only, and I wonder if FIEMAP should be the same, or at least disall

Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-05-01 Thread Andreas Dilger
On May 02, 2007  00:20 +1000, David Chinner wrote:
> My point was that there is a difference between specification and
> implementation - if the specification says something is compulsory,
> then they must be implemented in the filesystem. This is easy
> enough to ensure by code review - we don't need additional interface
> complexity for this

What you seem to be missing about my proposal is that the FLAG_INCOMPAT
is for future use by that part of the specification we haven't thought
of yet...  Having COMPAT/INCOMPAT flags has been very useful for ext2/3/4,
and is much better than having version numbers for the interface.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-05-01 Thread David Chinner
On Tue, May 01, 2007 at 07:37:20PM +0100, Anton Altaparmakov wrote:
> On 1 May 2007, at 05:22, David Chinner wrote:
> >On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote:
> >>  The FIBMAP ioctl is for privileged users
> >>  only, and I wonder if FIEMAP should be the same, or at least  
> >>disallow
> >>  mapping files that the user can't access especially with  
> >>FLAG_SYNC and/or
> >>  FLAG_HSM_READ.
> >
> >I see little reason for restricting FI[BE]MAP to privileged users -
> >anyone should be able to determine if files they have permission to
> >access are fragmented.
> 
> Allowing anyone to run FI[BE]MAP creates potential for DOS-ing the  
> machine.  Perhaps for non-privileged users FIEMAP has to be read- 
> only?  As soon as any of the FLAG_* flags come into play you make it  
> privileged.  For example fancy any user being able to fill up your  
> file system by calling FIEMAP with FLAG_HSM_READ on all files  
> recursively?

By that reasoning, users should not be allowed to recall any files
without root privileges. HSMs don't work that way, though - any user
is allowed to recall any files they have permission to access either
by manual command or by trying to read the file daata.

If that runs the filesytem out of space, then the HSM either hasn't
been configured properly or it's failed to manage the space
correctly. Either way, that's not the fault of the user for
recalling their own files.

Hence allowing FIEMAP to be executed by the user does not open up
any DOS conditions that don't already exist in normal HSM-managed
filesystem.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-05-01 Thread David Chinner
On Tue, May 01, 2007 at 03:30:40PM -0700, Andreas Dilger wrote:
> On May 01, 2007  14:22 +1000, David Chinner wrote:
> > On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote:
> > > Hmm, I'd thought "offline" would migrate to EXTENT_UNKNOWN, but I didn't
> > 
> > I disagree - why would you want to indicate the state is unknown when we 
> > know
> > very well that it is offline?
> 
> If you don't like "UNKNOWN", what about "UNMAPPED"?  I just want a
> catch-all flag that indicates "this extent contains data but there is
> nothing sensible to be returned for the extent mapping."

Yes, I like that much more. Good suggestion. ;)

> > Effectively, when your extent is offline in the HSM, it is inaccessable, and
> > you have to bring it back from tape so it becomes accessible again. i.e. 
> > some
> > action is necessary on behalf of the user to make it accessible. So I think
> > that OFFLINE is a good name for this state because it really is 
> > inaccessible.
> 
> What you are calling OFFLINE I would prefer to call UNMAPPED, since that
> can be used by applications as a catch-all for "no mapping".  There can
> be further flags that give refinements to UNMAPPED that some applications
> might care about them (e.g. HSM_RESIDENT), but many users/apps will not
> if they just want the number of fragments in a given file.

Agreed - UNMAPPED does make a lot more sense in this case.

> > > Can you propose reasonable flag names for these (I can't think of anything
> > > very good) and a clear explanation of what they mean.  I suspect it will
> > > only be XFS that uses them initially.  In mke2fs and ext4+mballoc there is
> > > the concept of stripe unit and stripe width, but as yet they are not
> > > communicated between the two very well.  I'd be much happier if this info
> > > could be queried in a standard way from the block layer instead of the
> > > user having to specify it and the filesystem having to track it.
> > 
> > My preference is definitely for a separate ioctl to grab the
> > filesystem geometry so this stuff can be calculated in userspace.
> > i.e. the way XFS does it right now (XFS_IOC_FSGEOMETRY). I won't
> > bother trying to define names until we decide which appraoch we take
> > to implement this.
> 
> Hmm, previously you wrote "This information could be easily passed up in the
> flags fields if the filesystem has geometry information".  So, I _think_
> what you are saying is that you want 4 flags to convey this start/end
> alignment information, but the exact semantics of what a "stripe unit" and
> a "stripe width" is filesystem specific?

Right.

> I definitely do NOT want to get into any issues of querying the block
> device geometry here.  I was just making a passing comment that ext4+mballoc
> can already do RAID-specific allocation alignment, but it depends on the
> admin to specify this information and it would be nice if there was some
> easy way to get this from userspace/kernel interfaces.
> 
> Having an API that can request "tell me the number of blocks from this
> offset until the next physical disk boundary" or similar would be useful
> to any allocator, and the block layer already needs to know this when
> submitting IO.

The block layer knows this once you get inside the volume manager. I
think the issue is that there is no common export interface for this
information.

> > In XFS, mkfs.xfs does the work of getting this information
> > to see in the filesystem superblock. Here's the code for getting
> > sunit/swidth from the underlying block device:
> > 
> > http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libdisk/
> > 
> > Not much in common there ;)
> 
> It looks like this might be just what e2fsprogs needs also.

More than likely.

> > > It does make sense to specify zero for the fm_extent_count array and a
> > > new FIEMAP_FLAG_NO_EXTENTS to return only the count of extents and not the
> > > extent data itself, for the non-verbose mode of filefrag, and for
> > > pre-allocating a buffer large enough to hold the file if that is 
> > > important.
> > 
> > Rather than rely on implicit behaviour of "pass in extent count of
> > zero and a don't try to return any extents" to return the number of
> > extents on the file, why not just explicitly define this as a valid
> > input flag? i.e. FIEMAP_FLAG_GET_NUMEXTENTS
> 
> That's what I said, isn't it?  FIEMAP_FLAG_NO_EXTENTS.  I wonder if my
> clever-clever for "return no extents" and "return number of extents"
> is wasted :-/.

Too clever for an API, I think. ;)

My point is mainly that if you are going to use an API for a
specific function (e.g. query the number of extents) I think that
the API should have an obvious method for executing that specific
function. Using a command of "get no extents" to provide the query
of "how many extents in this file" is kind of obscure. When you read
the code it doesn't make a lot of sense, as opposed to seeing a
clear statement of intent from the code itself.

i.e. FIEMAP_FLAG_GET_NUMEXTENT