Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-18 Thread Timothy Shimmin

--On 18 April 2007 6:21:39 PM -0600 Andreas Dilger <[EMAIL PROTECTED]> wrote:


Below is an aggregation of the comments in this thread:

struct fiemap_extent {
__u64 fe_start; /* starting offset in bytes */
__u64 fe_len;   /* length in bytes */
__u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
__u32 fe_lun;   /* logical storage device number in array */
}

struct fiemap {
__u64 fm_start; /* logical start offset of mapping (in/out) */
__u64 fm_len;   /* logical length of mapping (in/out) */
__u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */
__u32 fm_extent_count;  /* number of extents in fm_extents (in/out) */
__u64 fm_unused;
struct fiemap_extent fm_extents[0];
}

/* flags for the fiemap request */
#define FIEMAP_FLAG_SYNC        0x0001  /* flush delalloc data to disk */
#define FIEMAP_FLAG_HSM_READ    0x0002  /* retrieve data from HSM */
#define FIEMAP_FLAG_INCOMPAT    0xff00  /* must understand these flags */

/* flags for the returned extents */
#define FIEMAP_EXTENT_HOLE      0x0001  /* no space allocated */
#define FIEMAP_EXTENT_UNWRITTEN 0x0002  /* uninitialized space */
#define FIEMAP_EXTENT_UNKNOWN   0x0004  /* in use, location unknown */
#define FIEMAP_EXTENT_ERROR     0x0008  /* error mapping space */
#define FIEMAP_EXTENT_NO_DIRECT 0x0010  /* no direct data access */



SUMMARY OF CHANGES
==================
- use fm_* fields directly in request instead of making it a fiemap_extent
  (though they are laid out identically)


I much prefer that - it makes it a lot clearer to me to have fiemap_extent
just for fm_extents (no different meanings now).
(I don't like the word "offset" in the comment without "physical" or some
such, but whatever ;-)
I also prefer the flags as separate fields :)

--Tim
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-18 Thread David Chinner
On Wed, Apr 18, 2007 at 06:21:39PM -0600, Andreas Dilger wrote:
> On Apr 16, 2007  21:22 +1000, David Chinner wrote:
> > On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote:
> > > struct fiemap_extent {
> > >   __u64 fe_start; /* starting offset in bytes */
> > >   __u64 fe_len;   /* length in bytes */
> > > }
> > > 
> > > struct fiemap {
> > >   struct fiemap_extent fm_start;  /* offset, length of desired mapping */
> > >   __u32 fm_extent_count;  /* number of extents in array */
> > >   __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */
> > >   __u64 unused;
> > >   struct fiemap_extent fm_extents[0];
> > > }
> > > 
> > > #define FIEMAP_LEN_MASK   0xff
> > > #define FIEMAP_LEN_HOLE   0x01
> > > #define FIEMAP_LEN_UNWRITTEN  0x02
> > 
> > I'm not sure I like stealing bits from the length to use as flags -
> > I'd prefer an explicit field per fiemap_extent for this.
> 
> Christoph expressed the same concern.  I'm not dead set against having an
> extra 8 bytes per extent (32-bit flags, 32-bit reserved), though it may
> mean the need for 50% more ioctls if the file is large.

I don't think this overhead is a huge problem - just pass in a
larger buffer (e.g. xfs_bmap can ask for thousands of extents in
a single ioctl call as we can extract the number of extents in
an inode via XFS_IOC_FSGETXATTRA).
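
To put rough numbers on the overhead being discussed, here is a minimal
sketch of the arithmetic. It assumes the 32-byte header and the 16-byte
versus 24-byte extent records implied by the structs quoted in this thread;
the helper names extents_per_buffer() and ioctls_needed() are made up for
illustration and are not part of any proposed interface.

```c
#include <stddef.h>

/* Sizes implied by the structs quoted in this thread (assumed, no padding). */
#define FIEMAP_HDR_BYTES     32u /* start, len, flags, extent_count, unused */
#define EXTENT_BYTES_NOFLAGS 16u /* fe_start + fe_len */
#define EXTENT_BYTES_FLAGS   24u /* ... plus fe_flags + fe_lun */

/* Hypothetical helper: how many extent records fit after the header. */
static size_t extents_per_buffer(size_t buf_bytes, size_t extent_bytes)
{
    if (buf_bytes < FIEMAP_HDR_BYTES || extent_bytes == 0)
        return 0;
    return (buf_bytes - FIEMAP_HDR_BYTES) / extent_bytes;
}

/* Hypothetical helper: ioctl calls needed to map nr_extents, rounding up. */
static size_t ioctls_needed(size_t nr_extents, size_t per_call)
{
    if (per_call == 0)
        return 0;
    return (nr_extents + per_call - 1) / per_call;
}
```

With a 4096-byte buffer this gives 254 records per call without the flags
word and 169 with it, so a file with 10000 extents would need 40 calls
versus 60. That is the "50% more ioctls" figure, and it shrinks in absolute
terms as soon as the caller passes a larger buffer.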

> Below is an aggregation of the comments in this thread:
> 
> struct fiemap_extent {
>   __u64 fe_start; /* starting offset in bytes */
>   __u64 fe_len;   /* length in bytes */
>   __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
>   __u32 fe_lun;   /* logical storage device number in array */
> }

Oh, I missed the bit about the fe_lun - I was thinking something like
that might be useful in the future.

> struct fiemap {
>   __u64 fm_start; /* logical start offset of mapping (in/out) */
>   __u64 fm_len;   /* logical length of mapping (in/out) */
>   __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */
>   __u32 fm_extent_count;  /* number of extents in fm_extents (in/out) */
>   __u64 fm_unused;
>   struct fiemap_extent fm_extents[0];
> }
> 
> /* flags for the fiemap request */
> #define FIEMAP_FLAG_SYNC        0x0001  /* flush delalloc data to disk */
> #define FIEMAP_FLAG_HSM_READ    0x0002  /* retrieve data from HSM */
> #define FIEMAP_FLAG_INCOMPAT    0xff00  /* must understand these flags */

No flags in the INCOMPAT range - shouldn't it be 0x3 at this point?

> /* flags for the returned extents */
> #define FIEMAP_EXTENT_HOLE      0x0001  /* no space allocated */
> #define FIEMAP_EXTENT_UNWRITTEN 0x0002  /* uninitialized space */
> #define FIEMAP_EXTENT_UNKNOWN   0x0004  /* in use, location unknown */
> #define FIEMAP_EXTENT_ERROR     0x0008  /* error mapping space */
> #define FIEMAP_EXTENT_NO_DIRECT 0x0010  /* no direct data access */

So, there's an HSM_READ flag above. If we are going to make this interface
useful for filesystems that have HSMs interacting with their extents, the
HSM needs to be able to query whether the extent is online (on disk),
has been migrated offline (on tape), or is in dual-state (i.e. both online
and offline).

> SUMMARY OF CHANGES
> ==================
> - use fm_* fields directly in request instead of making it a fiemap_extent
>   (though they are laid out identically)
> 
> - separate flags word for fm_flags:
>   - FIEMAP_FLAG_SYNC = range should be synced to disk before returning
>     mapping, may return FIEMAP_EXTENT_UNKNOWN for delalloc writes otherwise
>   - FIEMAP_FLAG_HSM_READ = force retrieval + mapping from HSM if specified
>     (this has the opposite meaning of XFS's BMV_IF_NO_DMAPI_READ flag)
>   - FIEMAP_FLAG_XATTR = omitted for now, can address that in the future
>     if there is agreement on whether that is desirable to have or if it is
>     better to call ioctl(FIEMAP) on an XATTR fd.
>   - FIEMAP_FLAG_INCOMPAT = if flags are set in this mask in request, kernel
>     must understand them, or fail ioctl with e.g. EOPNOTSUPP, so that we
>     don't request e.g. FIEMAP_FLAG_XATTR and kernel ignores it
> 
> - __u64 fm_unused does not take up an extra space on all power-of-two buffer
>   sizes (would otherwise be at end of buffer), and may be handy in the future.
> 
> - add separate fe_flags word with flags from various suggestions:
>   - FIEMAP_EXTENT_HOLE = extent has no space allocation
>   - FIEMAP_EXTENT_UNWRITTEN = extent has space allocated but contains no data
>   - FIEMAP_EXTENT_UNKNOWN = extent contains data, but location is unknown
>     (e.g. HSM, delalloc awaiting sync, etc)

I'd like an explicit delalloc flag, not lumping it in with "unknown".
We *know* the extent is delalloc ;)

>   - FIEMAP_EXTENT_ERROR = error mapping extent.  Should fe_lun == e

Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-18 Thread Andreas Dilger
On Apr 16, 2007  21:22 +1000, David Chinner wrote:
> On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote:
> > struct fiemap_extent {
> > __u64 fe_start; /* starting offset in bytes */
> > __u64 fe_len;   /* length in bytes */
> > }
> > 
> > struct fiemap {
> > struct fiemap_extent fm_start;  /* offset, length of desired mapping */
> > __u32 fm_extent_count;  /* number of extents in array */
> > __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */
> > __u64 unused;
> > struct fiemap_extent fm_extents[0];
> > }
> > 
> > #define FIEMAP_LEN_MASK         0xff
> > #define FIEMAP_LEN_HOLE         0x01
> > #define FIEMAP_LEN_UNWRITTEN    0x02
> 
> I'm not sure I like stealing bits from the length to use as flags -
> I'd prefer an explicit field per fiemap_extent for this.

Christoph expressed the same concern.  I'm not dead set against having an
extra 8 bytes per extent (32-bit flags, 32-bit reserved), though it may
mean the need for 50% more ioctls if the file is large.


Below is an aggregation of the comments in this thread:

struct fiemap_extent {
__u64 fe_start; /* starting offset in bytes */
__u64 fe_len;   /* length in bytes */
__u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
__u32 fe_lun;   /* logical storage device number in array */
}

struct fiemap {
__u64 fm_start; /* logical start offset of mapping (in/out) */
__u64 fm_len;   /* logical length of mapping (in/out) */
__u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */
__u32 fm_extent_count;  /* number of extents in fm_extents (in/out) */
__u64 fm_unused;
struct fiemap_extent fm_extents[0];
}

/* flags for the fiemap request */
#define FIEMAP_FLAG_SYNC        0x0001  /* flush delalloc data to disk */
#define FIEMAP_FLAG_HSM_READ    0x0002  /* retrieve data from HSM */
#define FIEMAP_FLAG_INCOMPAT    0xff00  /* must understand these flags */

/* flags for the returned extents */
#define FIEMAP_EXTENT_HOLE  0x0001  /* no space allocated */
#define FIEMAP_EXTENT_UNWRITTEN 0x0002  /* uninitialized space */
#define FIEMAP_EXTENT_UNKNOWN   0x0004  /* in use, location unknown */
#define FIEMAP_EXTENT_ERROR 0x0008  /* error mapping space */
#define FIEMAP_EXTENT_NO_DIRECT 0x0010  /* no direct data access */



SUMMARY OF CHANGES
==================
- use fm_* fields directly in request instead of making it a fiemap_extent
  (though they are laid out identically)

- separate flags word for fm_flags:
  - FIEMAP_FLAG_SYNC = range should be synced to disk before returning
    mapping, may return FIEMAP_EXTENT_UNKNOWN for delalloc writes otherwise
  - FIEMAP_FLAG_HSM_READ = force retrieval + mapping from HSM if specified
    (this has the opposite meaning of XFS's BMV_IF_NO_DMAPI_READ flag)
  - FIEMAP_FLAG_XATTR = omitted for now, can address that in the future
    if there is agreement on whether that is desirable to have or if it is
    better to call ioctl(FIEMAP) on an XATTR fd.
  - FIEMAP_FLAG_INCOMPAT = if flags are set in this mask in request, kernel
    must understand them, or fail ioctl with e.g. EOPNOTSUPP, so that we
    don't request e.g. FIEMAP_FLAG_XATTR and kernel ignores it

- __u64 fm_unused does not take up an extra space on all power-of-two buffer
  sizes (would otherwise be at end of buffer), and may be handy in the future.

- add separate fe_flags word with flags from various suggestions:
  - FIEMAP_EXTENT_HOLE = extent has no space allocation
  - FIEMAP_EXTENT_UNWRITTEN = extent has space allocated but contains no data
  - FIEMAP_EXTENT_UNKNOWN = extent contains data, but location is unknown
    (e.g. HSM, delalloc awaiting sync, etc)
  - FIEMAP_EXTENT_ERROR = error mapping extent.  Should fe_lun == errno?
  - FIEMAP_EXTENT_NO_DIRECT = data cannot be directly accessed (e.g. data
    encrypted, compressed, etc), may want separate flags for these?

- add new fe_lun word per extent for filesystems that manage multiple devices
  (e.g. OCFS, GFS, ZFS, Lustre).  This would otherwise have been unused.
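
The FIEMAP_FLAG_INCOMPAT rule in the summary can be sketched as below. This
is only an illustration of the intended semantics, not kernel code: the flag
values are the ones proposed in this thread, while FIEMAP_FLAGS_KNOWN and
fiemap_check_flags() are hypothetical names invented for the example.

```c
#include <errno.h>
#include <stdint.h>

/* Request flag values as proposed in this thread. */
#define FIEMAP_FLAG_SYNC      0x0001u
#define FIEMAP_FLAG_HSM_READ  0x0002u
#define FIEMAP_FLAG_INCOMPAT  0xff00u

/* Hypothetical: the flags this particular implementation understands. */
#define FIEMAP_FLAGS_KNOWN (FIEMAP_FLAG_SYNC | FIEMAP_FLAG_HSM_READ)

/*
 * Return 0 if the request may proceed, or -EOPNOTSUPP if the caller set a
 * flag inside the INCOMPAT mask that this implementation does not know.
 * Unknown flags outside the mask are tolerated and ignored.
 */
static int fiemap_check_flags(uint32_t fm_flags)
{
    uint32_t unknown = fm_flags & ~FIEMAP_FLAGS_KNOWN;

    if (unknown & FIEMAP_FLAG_INCOMPAT)
        return -EOPNOTSUPP;
    return 0;
}
```

The point of the split is that a future flag such as FIEMAP_FLAG_XATTR can
be assigned a bit inside the mask, so an older kernel fails the call instead
of silently returning a mapping the caller did not ask for.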


> Given that xfs_bmap uses extra information from the filesystem
> (geometry) to display extra (and frequently used) information
> about the alignment of extents. ie:
> 
> chook 681% xfs_bmap -vv fred
> fred:
>  EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET            TOTAL FLAGS
>    0: [0..151]:        288444888..288445039    8 (1696536..1696687)     152 00010
>  FLAG Values:
>     010000 Unwritten preallocated extent
>     001000 Doesn't begin on stripe unit
>     000100 Doesn't end   on stripe unit
>     000010 Doesn't begin on stripe width
>     000001 Doesn't end   on stripe width

Can you clarify the terminology here?  What is a "stripe unit" and what is
a "stripe width"?  Are

Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-18 Thread Andreas Dilger
On Apr 16, 2007  18:01 +1000, Timothy Shimmin wrote:
> --On 12 April 2007 5:05:50 AM -0600 Andreas Dilger <[EMAIL PROTECTED]> wrote:
> >struct fiemap_extent {
> > __u64 fe_start; /* starting offset in bytes */
> > __u64 fe_len;   /* length in bytes */
> >}
> >
> >struct fiemap {
> > struct fiemap_extent fm_start;  /* offset, length of desired mapping */
> > __u32 fm_extent_count;  /* number of extents in array */
> > __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */
> > __u64 unused;
> > struct fiemap_extent fm_extents[0];
> >}
> >
> >#define FIEMAP_LEN_MASK         0xff
> >#define FIEMAP_LEN_HOLE         0x01
> >#define FIEMAP_LEN_UNWRITTEN    0x02
> >
> >All offsets are in bytes to allow cases where filesystems are not doing
> >block-aligned/sized allocations (e.g. tail packing).  The fm_extents array
> >returned contains the packed list of allocation extents for the file,
> >including entries for holes (which have fe_start == 0, and a flag).
> >
> >The ->fm_extents[] array includes all of the holes in addition to
> >allocated extents because this avoids the need to return both the logical
> >and physical address for every extent and does not make processing any
> >harder.
> 
> Well, that's what stood out for me. I was wondering where the "fe_block" 
> field had gone - the "physical address".
> So is your "fe_start; /* starting offset */" actually the disk location
> (not a logical file offset), _except_ in the header (fiemap) where it is
> the desired logical offset?

Correct.  The fm_extent in the request contains the logical start offset
and length in bytes of the requested fiemap region.  In the returned header
it represents the logical start offset of the extent that contained the
requested start offset, and the logical length of all the returned extents.
I haven't decided whether the returned length should be until EOF, or have
the "virtual hole" at the end of the file.  I think EOF makes more sense.

The fe_start + fe_len in the fm_extents represent the physical location on
the block device for that extent.  fm_extent[i].fe_start (per Anton) is
undefined if FIEMAP_LEN_HOLE is set, and .fe_len is the length of the hole.

> Okay, looking at your example use below that's what it looks like.
> And when you refer to fm_start below, you mean fm_start.fe_start?
> Sorry, I realise this is just an approximation but this part confused me.

Right, I'll write up a new RFC based on feedback here, and correcting the
various errors in the original proposal.

> So you get rid of all the logical file offsets in the extents because we
> report holes explicitly (and we know everything is contiguous if you
> include the holes).

Correct.  It saves space in the common case.
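
Because holes are reported explicitly, the logical offset of each extent
can be recovered by summing the lengths that precede it. A minimal sketch of
that bookkeeping follows; it assumes the later layout with a separate
fe_flags word rather than flag bits stolen from fe_len, and the helper
logical_to_physical() is a hypothetical name used only for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define FIEMAP_EXTENT_HOLE 0x0001u /* value from the aggregated proposal */

struct fiemap_extent {
    uint64_t fe_start; /* physical offset in bytes; undefined for holes */
    uint64_t fe_len;   /* length in bytes */
    uint32_t fe_flags; /* FIEMAP_EXTENT_* flags */
    uint32_t fe_lun;   /* device number, unused here */
};

/*
 * Hypothetical helper: translate a logical file offset to a physical one.
 * 'first_logical' is the logical offset of extents[0] (fm_start on return).
 * Returns 0 and fills *physical on success, or -1 if the offset falls in a
 * hole or outside the returned range.
 */
static int logical_to_physical(const struct fiemap_extent *extents,
                               size_t count, uint64_t first_logical,
                               uint64_t offset, uint64_t *physical)
{
    uint64_t logical = first_logical;
    size_t i;

    if (offset < first_logical)
        return -1;

    for (i = 0; i < count; i++) {
        /* Extents are contiguous in logical space, holes included. */
        if (offset - logical < extents[i].fe_len) {
            if (extents[i].fe_flags & FIEMAP_EXTENT_HOLE)
                return -1;
            *physical = extents[i].fe_start + (offset - logical);
            return 0;
        }
        logical += extents[i].fe_len;
    }
    return -1;
}
```

The cost of dropping the per-extent logical offset is that a lookup like
this is a linear walk, which seems acceptable for the dump-style callers
discussed in the thread.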

> >Caller works something like:
> >
> > char buf[4096];
> > struct fiemap *fm = (struct fiemap *)buf;
> > int count = (sizeof(buf) - sizeof(*fm)) / sizeof(fm_extent);
> >
> > fm->fm_start.fe_start = 0;   /* start of file */
> > fm->fm_start.fe_len = -1;    /* end of file */
> > fm->fm_extent_count = count; /* max extents in fm_extents[] array */
> > fm->fm_flags = 0;            /* maybe "no DMAPI", etc like XFS */
> >
> > fd = open(path, O_RDONLY);
> > printf("logical\t\tphysical\t\tbytes\n");
> >
> > /* The last entry will have less extents than the maximum */
> > while (fm->fm_extent_count == count) {
> >     rc = ioctl(fd, FIEMAP, fm);
> >     if (rc)
> >         break;
> >
> >     /* kernel filled in fm_extents[] array, set fm_extent_count
> >      * to be actual number of extents returned, leaves
> >      * fm_start.fe_start alone (unlike XFS_IOC_GETBMAP). */
> >
> >     for (i = 0; i < fm->fm_extent_count; i++) {
> >         __u64 len = fm->fm_extents[i].fe_len & FIEMAP_LEN_MASK;
> >         __u64 fm_next = fm->fm_start.fe_start + len;
> >         int hole = fm->fm_extents[i].fe_len & FIEMAP_LEN_HOLE;
> >         int unwr = fm->fm_extents[i].fe_len & FIEMAP_LEN_UNWRITTEN;
> >
> >         printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n",
> >                fm->fm_start.fe_start, fm_next - 1,
> >                hole ? 0 : fm->fm_extents[i].fe_start,
> >                hole ? 0 : fm->fm_extents[i].fe_start +
> >                           fm->fm_extents[i].fe_len - 1,
> >                len, hole ? "(hole) " : "",
> >                unwr ? "(unwritten) " : "");
> >
> >         /* get ready for printing next extent, or next ioctl */
> >         fm->fm_start.fe_start = fm_next;
> >     }
> > }
> >

Cheers, Andreas

Re: Ext3 behavior on power failure

2007-04-18 Thread Bruno Wolff III
On Wed, Mar 28, 2007 at 09:17:27 -0400,
  "John Anthony Kazos Jr." <[EMAIL PROTECTED]> wrote:
> > If you fsync() your data, you are guaranteed that also your data are
> >safely on disk when fsync returns. So what is the question here?
> 
> Pardon a newbie's intrusion, but I do know this isn't true. There is a 
> window of possible loss because of the multitude of layers of caching, 
> especially within the drive itself. Unless there is a super_duper_fsync() 
> that is able to actually poll the hardware and get a confirmation that the 
> internal buffers are purged?

That is why you need to disable write caching of the drives or use cache
flushes via write barriers (if the stack of block devices all support them)
if the hardware cache isn't battery backed or the device doesn't support
returning the status of particular commands.

Of course nothing is perfectly safe.
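
For reference, the application-side half of this is small. A minimal sketch,
assuming a POSIX system; the helper name write_and_fsync() is made up for
illustration. A successful return only means the kernel has pushed the data
to the device. Whether it has left the drive's volatile write cache still
depends on the barrier and cache settings described above.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Write the whole buffer, then ask the kernel to flush it to the device.
 * Returns 0 on success or a negative errno value on failure. */
static int write_and_fsync(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing: retry */
            return -errno;
        }
        p += n;                 /* short write: advance and keep going */
        len -= (size_t)n;
    }
    if (fsync(fd) < 0)
        return -errno;          /* data may NOT be on stable storage */
    return 0;
}
```

Checking the fsync() return value matters as much as calling it, since an
I/O error during writeback is reported there rather than by write().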


Re: Performance degradation with FFSB between 2.6.20 and 2.6.21-rc7

2007-04-18 Thread Andrew Morton
> On Wed, 18 Apr 2007 15:54:00 +0200 Valerie Clement <[EMAIL PROTECTED]> wrote:
> 
> Running benchmark tests (FFSB) on an ext4 filesystem, I noticed a 
> performance degradation (about 15-20 percent) in sequential write tests 
> between 2.6.19-rc6 and 2.6.21-rc4 kernels.
> 
> I ran the same tests on ext3 and XFS filesystems and I saw the same 
> performance difference between the two kernel versions for these two 
> filesystems.
> 
> I have also reproduced it between 2.6.20.7 and 2.6.21-rc7.
> The FFSB tests run 16 threads, each creating 1GB files. The tests were 
> done on the same x86_64 system, with the same kernel configuration and 
> on the same scsi device. Below are the throughput values given by FFSB.
> 
>   kernel        XFS          ext3
>   ------------------------------------
>   2.6.20.7      48 MB/sec    44 MB/sec
>   2.6.21-rc7    38 MB/sec    37 MB/sec
> 
> Did anyone else run across the problem?
> Is there a known issue?
> 

That's a new discovery, thanks.

It could be due to I/O scheduler changes.  Which one are you using?  CFQ?

Or it could be that there has been some changed behaviour at the VFS/pagecache
layer: the VFS might be submitting little hunks of lots of files, rather than
large hunks of few files.

Or it could be a block-layer thing: perhaps some driver change has caused
us to be placing less data into the queue.  Which device driver is that machine
using?

Being a simple soul, the first thing I'll try when I get near a test box
will be

for i in $(seq 1 16)
do
	time dd if=/dev/zero of=$i bs=1M count=1024 &
done


Re: tiny e2fsprogs fix, bitmaps question

2007-04-18 Thread Theodore Tso
On Wed, Apr 18, 2007 at 12:54:44AM -0600, Andreas Dilger wrote: 

  [ I'm quoting out of order here, and cc'ing the linux-ext4 list with
permission since I think the topics under discussion have a more
general interest.  --Ted]

> Just reading the updated e2fsck.conf.5 man page, and noticed in [scratch 
> files]:
> 
> numdirs_threshold:
> s/numbers of directory/number of directories/

Oops, thanks for catching that.  I implemented the in-core memory
reduction patches in somewhat of a hurry because I had a number of
users who had been using BackupPC or other similar hard-link intensive
backup programs, and they were running into massive memory usage
issues.  So this took priority over the extents refactorization work
(which is currently on the top of my e2fsprogs work queue).

> We are also looking to implement something better than raw bitmaps for
> cases where the set bits are expected to be sparse (e.g. block_dup_map,
> inode_bad_map, inode_bb_map, inode_imagic_map), instead of just going
> wholesale to on-disk storage (which is just going to slow things down).

The other type of bitmap implementation which I had been thinking
about asking people to implement is one which works well in the case
where the set bits are expected to be mostly contiguous --- i.e.,
the block allocation map --- so some kind of extent-based data
structure indexed using an in-memory b-tree would be ideal.

Note that block_dup_map, inode_bad_map, inode_bb_map,
inode_imagic_map, et al. are usually not allocated at all, and
when they are allocated there are usually very few set bits, so
using a tree-indexed, extent-based data structure should work just
fine for this sort of implementation.  Yes, it's not as efficient as
an array of integers, but it's much more efficient than a full bitmap.

> The current API doesn't match that of the normal bitmap routines,
> but I think that is desirable.  What do you think?  The other
> thing that is lacking is an efficient and generic bitmap iterator
> instead of just walking [0 ... nbits], because in the sparse case
> the full-range walking can be grossly inefficient.

I agree it's desirable, and I had been planning a bitmap API revision
anyway.  The big changes I had been thinking for the new interface were:

* 64-bit enabled
* no inline functions
* pluggable back-ends (for multiple implementations: traditional,
  tree-based extents, disk-backed, etc.)
* extent-based set/clear functions (so you can take an extent map
  from an inode and mark the entire extent as allocated in a block
  bitmap)
* To provide backwards ABI compatibility, if the magic number
  in the first word of the bitmap indicates an old-style
  bitmap, dispatch to the old inline-style bitmap operators
An iterator makes a lot of sense and I hadn't thought of it, but we
should definitely add it.  It might also be a good idea to add an
extent-based iterator, as well, since that would be even more CPU
efficient for some callers.
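
To make the extent-based set function and the extent-based iterator a bit
more concrete, here is a rough sketch. It deliberately swaps the b-tree for
a small sorted array to stay short, so it only illustrates the interface
shape. Every name in it (struct extmap, extmap_mark_range(), and so on) is
hypothetical and none of this is existing e2fsprogs API. Overflow of
start + len is not handled.

```c
#include <stddef.h>
#include <stdint.h>

#define EXTMAP_MAX 64 /* arbitrary capacity for this sketch */

struct ext { uint64_t start, end; };    /* half-open range [start, end) */

struct extmap {
    size_t nr;
    struct ext e[EXTMAP_MAX];           /* sorted, disjoint, non-adjacent */
};

/* Mark bits [start, start + len) as set. Returns 0, or -1 if the map is full. */
static int extmap_mark_range(struct extmap *m, uint64_t start, uint64_t len)
{
    uint64_t end = start + len;
    size_t lo = 0, hi, i;

    if (len == 0)
        return 0;

    /* Skip extents that end strictly before the new range begins. */
    while (lo < m->nr && m->e[lo].end < start)
        lo++;

    /* Absorb every extent that overlaps or touches the new range. */
    hi = lo;
    while (hi < m->nr && m->e[hi].start <= end) {
        if (m->e[hi].start < start)
            start = m->e[hi].start;
        if (m->e[hi].end > end)
            end = m->e[hi].end;
        hi++;
    }

    if (hi == lo) {                     /* nothing merged: open a new slot */
        if (m->nr == EXTMAP_MAX)
            return -1;
        for (i = m->nr; i > lo; i--)
            m->e[i] = m->e[i - 1];
        m->nr++;
    } else if (hi - lo > 1) {           /* several merged: close the gap */
        size_t drop = hi - lo - 1;

        for (i = hi; i < m->nr; i++)
            m->e[i - drop] = m->e[i];
        m->nr -= drop;
    }
    m->e[lo].start = start;
    m->e[lo].end = end;
    return 0;
}

/* Test a single bit. */
static int extmap_test(const struct extmap *m, uint64_t bit)
{
    size_t i;

    for (i = 0; i < m->nr && m->e[i].start <= bit; i++)
        if (bit < m->e[i].end)
            return 1;
    return 0;
}

/* Extent iterator: call fn once per run of set bits, in ascending order. */
static void extmap_foreach(const struct extmap *m,
                           void (*fn)(uint64_t start, uint64_t len, void *arg),
                           void *arg)
{
    size_t i;

    for (i = 0; i < m->nr; i++)
        fn(m->e[i].start, m->e[i].end - m->e[i].start, arg);
}

/* Example callback: accumulate the number of set bits into *arg. */
static void extmap_sum_len(uint64_t start, uint64_t len, void *arg)
{
    (void)start;
    *(uint64_t *)arg += len;
}
```

A caller counting set bits pays one callback per extent rather than one test
per bit, which is where the CPU saving for sparse or mostly contiguous maps
would come from. A tree-indexed version would keep the same interface and
replace the linear scans and array shifts.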

> Some things we are targeting with the design:
> - use less RAM for sparsely populated bitmaps
> - be not much worse than bitmaps if they turn out not to be sparse
> - avoid allocating gigantic or many tiny chunks of memory
> - be dynamic in choosing the method for storing "set" bits

Yep, all good things.  I hadn't really considered the requirement of
dynamically choosing a method, but that's because I figured the
tree-indexed extents data structure would hopefully be general purpose
enough to work with a very large range of filesystems, and dynamism
wasn't something I wanted to try to solve the first time around.

My current thinking favors a design like the io_manager, where you can
have one io_manager (test_io) provide services where the actual back
end work is done by another io_manager (i.e., unix_io).  So I could
imagine an "auto" bitmap type which automatically converts bitmap
representations behind the scenes from an in-memory to an on-disk
format, hopefully using a single in-memory format which is generally
applicable to most cases (such as tree-indexed extents), and then once
you go out to disk, it's all about correctness and completing the
task, and only secondarily about performance.  

But part of that is that while your ebitmap implementation has
desirable properties in terms of scaling from in-memory sparse arrays
to full bitmaps, I suspect a tree-indexed extents implementation has a
wider range of applicability, so I was assuming that we wouldn't have
to get to the dynamic switching part of the program for quite some 
time.  (IIRC, all xfs_repair has right now is a simple bit-count
compression scheme, and that seems to have been sufficient for them.)

BTW, If you're interested in implementing an extent-based b-tree,
which will be the next low-hanging fruit in terms of reducing
e2fsprogs's memory usage, note that we already have a red-black tree
implementat

Performance degradation with FFSB between 2.6.20 and 2.6.21-rc7

2007-04-18 Thread Valerie Clement


Running benchmark tests (FFSB) on an ext4 filesystem, I noticed a 
performance degradation (about 15-20 percent) in sequential write tests 
between 2.6.19-rc6 and 2.6.21-rc4 kernels.


I ran the same tests on ext3 and XFS filesystems and I saw the same 
performance difference between the two kernel versions for these two 
filesystems.


I have also reproduced it between 2.6.20.7 and 2.6.21-rc7.
The FFSB tests run 16 threads, each creating 1GB files. The tests were 
done on the same x86_64 system, with the same kernel configuration and 
on the same scsi device. Below are the throughput values given by FFSB.


  kernel        XFS          ext3
  ------------------------------------
  2.6.20.7      48 MB/sec    44 MB/sec
  2.6.21-rc7    38 MB/sec    37 MB/sec

Did anyone else run across the problem?
Is there a known issue?

   Valérie


(in attachment, my kernel configuration file)
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.21-rc7
# Wed Apr 18 11:29:53 2007
#
CONFIG_X86_64=y
CONFIG_64BIT=y
CONFIG_X86=y
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_ZONE_DMA32=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_CMPXCHG=y
CONFIG_EARLY_PRINTK=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_DMI=y
CONFIG_AUDIT_ARCH=y
CONFIG_GENERIC_BUG=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32

#
# General setup
#
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
# CONFIG_IPC_NS is not set
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
# CONFIG_UTS_NS is not set
# CONFIG_AUDIT is not set
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
# CONFIG_CPUSETS is not set
CONFIG_SYSFS_DEPRECATED=y
# CONFIG_RELAY is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SHMEM=y
CONFIG_SLAB=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
# CONFIG_SLOB is not set

#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
# CONFIG_KMOD is not set
CONFIG_STOP_MACHINE=y

#
# Block layer
#
CONFIG_BLOCK=y
# CONFIG_BLK_DEV_IO_TRACE is not set

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
# CONFIG_IOSCHED_AS is not set
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"

#
# Processor type and features
#
CONFIG_X86_PC=y
# CONFIG_X86_VSMP is not set
# CONFIG_MK8 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_L1_CACHE_BYTES=128
CONFIG_X86_L1_CACHE_SHIFT=7
CONFIG_X86_INTERNODE_CACHE_BYTES=128
CONFIG_X86_TSC=y
CONFIG_X86_GOOD_APIC=y
# CONFIG_MICROCODE is not set
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_X86_HT=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_MTRR=y
CONFIG_SMP=y
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_BKL=y
CONFIG_NUMA=y
CONFIG_K8_NUMA=y
CONFIG_NODES_SHIFT=6
CONFIG_X86_64_ACPI_NUMA=y
CONFIG_NUMA_EMU=y
CONFIG_ARCH_DISCONTIGMEM_ENABLE=y
CONFIG_ARCH_DISCONTIGMEM_DEFAULT=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_SELECT_MEMORY_MODEL=y
# CONFIG_FLATMEM_MANUAL is not set
CONFIG_DISCONTIGMEM_MANUAL=y
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_DISCONTIGMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
CONFIG_NEED_MULTIPLE_NODES=y
# CONFIG_SPARSEMEM_STATIC is not set
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_MIGRATION=y
CONFIG_RESOURCES_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID=y
CONFIG_OUT_OF_LINE_PFN_TO_PAGE=y
CONFIG_NR_CPUS=32
CONFIG_HOTPLUG_CPU=y
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_IOMMU=y
# CONFIG_CALGARY_IOMMU is not set
CONFIG_SWIOTLB=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
# CONFIG_KEXEC is not set
# CONFIG_CRASH_DUMP is not set
CONFIG_PHYSICAL_START=0x20
CONFIG_SECCOMP=y
# CONFIG_CC_STACKPROTECTOR is not set
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250
# CONFIG_REORD

Re: Interface for the new fallocate() system call

2007-04-18 Thread Andreas Dilger
On Apr 17, 2007  18:25 +0530, Amit K. Arora wrote:
> On Fri, Mar 30, 2007 at 02:14:17AM -0500, Jakub Jelinek wrote:
> > Wouldn't
> > int fallocate(loff_t offset, loff_t len, int fd, int mode)
> > work on both s390 and ppc/arm?  glibc will certainly wrap it and
> > reorder the arguments as needed, so there is no need to keep fd first.
> 
> I think more people are comfortable with this approach.

Really?  I thought from the last postings that "fd first, wrap on s390"
was better.

> Since glibc
> will wrap the system call and export the "conventional" interface
> (with fd first) to applications, we may not worry about keeping fd first
> in kernel code. I am personally fine with this approach.

It would seem to make more sense to wrap the syscall on those architectures
that can't handle the "conventional" interface (fd first).
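
Whichever side ends up doing the wrapping, the reordering itself is
mechanical. A minimal sketch, with loudly hypothetical names:
raw_fallocate() stands in for an entry point that takes the 64-bit arguments
first, and wrapped_fallocate() is the "fd first" face shown to callers.
Neither is a real kernel or glibc symbol, and the stub only records its
arguments instead of entering the kernel.

```c
#include <stdint.h>

typedef int64_t sketch_loff_t;  /* stand-in for the kernel's loff_t */

/* Records what the stub entry point received, for illustration only. */
struct raw_call {
    sketch_loff_t offset, len;
    int fd, mode;
};
static struct raw_call last_raw_call;

/*
 * Hypothetical raw entry point using the "64-bit arguments first" order
 * suggested in the thread. A real one would trap into the kernel.
 */
static long raw_fallocate(sketch_loff_t offset, sketch_loff_t len,
                          int fd, int mode)
{
    last_raw_call.offset = offset;
    last_raw_call.len = len;
    last_raw_call.fd = fd;
    last_raw_call.mode = mode;
    return 0;
}

/* Library-style wrapper exposing the conventional "fd first" order. */
static int wrapped_fallocate(int fd, int mode,
                             sketch_loff_t offset, sketch_loff_t len)
{
    return (int)raw_fallocate(offset, len, fd, mode);
}
```

The open question in the thread is only where such a shim lives: in glibc
for every architecture, or just on the architectures whose calling
conventions cannot pass the conventional order.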

> Still, if people have major concerns, we can think of getting rid of the
> "mode" argument itself. Anyhow we may, in future, need to have a policy
> based system call (say, for providing the goal block by applications for
> performance reasons). "mode" can then be made part of it.

We need at least mode="unallocate" or a separate funallocate() call to
allow allocated-but-unwritten blocks to be unallocated without actually
punching out written data.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
