Re: [RFC][PATCH] Implement SEEK_HOLE/SEEK_DATA

2007-11-28 Thread Jörn Engel
On Wed, 28 November 2007 20:27:11 -0700, Andreas Dilger wrote:
> On Nov 29, 2007  00:48 +0100, Jörn Engel wrote:
> > I didn't follow the discussion much, since it didn't appear to suit
> > logfs too well.  In a nutshell, logfs is purely block-based, prepends
> > every block with a header, may compress blocks and packs them as tightly
> > as possible (byte alignment).
> 
> The FIEMAP interface in some ways well suited to your needs, because it
> uses byte offsets instead of FIBMAP, which uses block offsets.  I now see
> that one hole in the specification is that your physical extent is not the
> same length as the logical extent that the data represents.
> 
> I'm not sure that is in itself a reason not to use FIEMAP.  There is already
> a provision for logical extents that do not map directly to the physical
> layout (i.e. FIEMAP_EXTENT_NO_DIRECT flag on the extent).  In the case of
> logfs, the extent fe_length would represent the logical length of the data
> that the extent covers, fe_offset is the start of the physical extent, and
> FIEMAP_EXTENT_NO_DIRECT indicates that this extent cannot be directly mapped.
> This is useful for applications like LILO that would otherwise assume the
> physical offset can be used to access the data from the block device.

That could work.  There really isn't a reason for any application to
mess with physical locations when dealing with logfs.

> SEEK_HOLE/SEEK_DATA only provides a fraction of the useful information
> that FIEMAP does.  It won't give users or developers any information about
> the on disk layout (which is quite important for knowing if allocation
> algorithms are good).

It has the advantage of being easy to use.  My completely untested
attempt at a copy loop is just 14 lines.  Add some error handling and it
should still be quite small.

off_t data_ofs, hole_ofs;
long count;

for (data_ofs = hole_ofs = 0; ; ) {
if (data_ofs >= hole_ofs) {
data_ofs = llseek(in_fd, data_ofs, SEEK_DATA);
hole_ofs = llseek(in_fd, data_ofs, SEEK_HOLE);
}
count = splice(in_fd, &data_ofs, out_fd, &data_ofs,
hole_ofs - data_ofs, 0);
if (count == 0)
break;
data_ofs += count;
}

And when trying to optimize away some of the system calls, my personal
preference would be to teach splice about seek_hole and seek_data and
just splice the complete file range in one go.

Jörn

-- 
"Security vulnerabilities are here to stay."
-- Scott Culp, Manager of the Microsoft Security Response Center, 2001
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] Implement SEEK_HOLE/SEEK_DATA

2007-11-28 Thread Andreas Dilger
On Nov 29, 2007  00:48 +0100, Jörn Engel wrote:
> I didn't follow the discussion much, since it didn't appear to suit
> logfs too well.  In a nutshell, logfs is purely block-based, prepends
> every block with a header, may compress blocks and packs them as tightly
> as possible (byte alignment).

The FIEMAP interface in some ways well suited to your needs, because it
uses byte offsets instead of FIBMAP, which uses block offsets.  I now see
that one hole in the specification is that your physical extent is not the
same length as the logical extent that the data represents.

I'm not sure that is in itself a reason not to use FIEMAP.  There is already
a provision for logical extents that do not map directly to the physical
layout (i.e. FIEMAP_EXTENT_NO_DIRECT flag on the extent).  In the case of
logfs, the extent fe_length would represent the logical length of the data
that the extent covers, fe_offset is the start of the physical extent, and
FIEMAP_EXTENT_NO_DIRECT indicates that this extent cannot be directly mapped.
This is useful for applications like LILO that would otherwise assume the
physical offset can be used to access the data from the block device.

It would still provide very useful information about the layout of files on
disk for filefrag, and if tar or cp used FIEMAP they could know to read the
data from the start up to the end of an extent to avoid waiting for a seek
in the middle of the IO, and of course skipping holes during copy.  A very
smart tar might even FIEMAP a whole bunch of files and then read the extents
in physical block offset order to reduce seeking.

I don't think most applications will care about the actual physical location
of an extent, so much as the relative locations of extents within a file and
between files.

> Maybe the "MAP" part fooled me to believe FIEMAP would also expose
> physical location of extends on the medium.  But reading the proposal
> again, I am unsure about that part.  If physical locations are exposed,
> SEEK_HOLE/SEEK_DATA is significantly more elegant for logfs.  If not,
> FIEMAP could be useful.

SEEK_HOLE/SEEK_DATA only provides a fraction of the useful information
that FIEMAP does.  It won't give users or developers any information about
the on disk layout (which is quite important for knowing if allocation
algorithms are good).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] Implement SEEK_HOLE/SEEK_DATA

2007-11-28 Thread Jörn Engel
On Wed, 28 November 2007 16:06:25 -0800, Nicholas Miell wrote:
> 
> I'd have to reread the original proposal, but I remember FIEMAP as being
> a generalized way of getting information about a files extents. I think
> the original proposal only dealt with mapping file offsets to physical
> extents, but IIRC the interface was flexible enough to implement a
> "where are the holes" request.
> 
> Regardless, SEEK_HOLE/SEEK_DATA being a better suited interface for the
> needs of logfs doesn't make it the best interface for that need.

I'm not sure we are looking for the best interface.  A significant
improvement without significant drawbacks would be enough for me.

Having to provide "physical extends" complete with device offsets would
be a significant drawback.  Even if such an interface may have greater
advantages under ideal circumstances.

Jörn

-- 
He who knows others is wise.
He who knows himself is enlightened.
-- Lao Tsu
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] Implement SEEK_HOLE/SEEK_DATA

2007-11-28 Thread Nicholas Miell

On Thu, 2007-11-29 at 00:48 +0100, Jörn Engel wrote:
> On Wed, 28 November 2007 16:39:59 -0700, Andreas Dilger wrote:
> > On Nov 28, 2007  14:56 -0800, Nicholas Miell wrote:
> > > 
> > > type: one of EXTENT_TYPE_HOLE, EXTENT_TYPE_DATA, EXTENT_TYPE_EXTENTS,
> > > EXTENT_TYPE_COMPRESSED, EXTENT_TYPE_UNCOMPRESSED etc.
> > 
> > This is what FIEMAP is supposed to do.  We wrote a spec and implemented
> > a prototype for ext4, but haven't had time to make it generic to move
> > the large part of the code into the VFS.  If someone wanted to take that
> > up, it would be much appreciated.
> > 
> > See "[RFC] add FIEMAP ioctl to efficiently map file allocation" in
> > linux-fsdevel for details on this interface.
> 
> I didn't follow the discussion much, since it didn't appear to suit
> logfs too well.  In a nutshell, logfs is purely block-based, prepends
> every block with a header, may compress blocks and packs them as tightly
> as possible (byte alignment).
> 
> Maybe the "MAP" part fooled me to believe FIEMAP would also expose
> physical location of extends on the medium.  But reading the proposal
> again, I am unsure about that part.  If physical locations are exposed,
> SEEK_HOLE/SEEK_DATA is significantly more elegant for logfs.  If not,
> FIEMAP could be useful.
> 
> Jörn
> 

I'd have to reread the original proposal, but I remember FIEMAP as being
a generalized way of getting information about a files extents. I think
the original proposal only dealt with mapping file offsets to physical
extents, but IIRC the interface was flexible enough to implement a
"where are the holes" request.

Regardless, SEEK_HOLE/SEEK_DATA being a better suited interface for the
needs of logfs doesn't make it the best interface for that need.

-- 
Nicholas Miell <[EMAIL PROTECTED]>

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] Implement SEEK_HOLE/SEEK_DATA

2007-11-28 Thread Jörn Engel
On Wed, 28 November 2007 16:39:59 -0700, Andreas Dilger wrote:
> On Nov 28, 2007  14:56 -0800, Nicholas Miell wrote:
> > 
> > type: one of EXTENT_TYPE_HOLE, EXTENT_TYPE_DATA, EXTENT_TYPE_EXTENTS,
> > EXTENT_TYPE_COMPRESSED, EXTENT_TYPE_UNCOMPRESSED etc.
> 
> This is what FIEMAP is supposed to do.  We wrote a spec and implemented
> a prototype for ext4, but haven't had time to make it generic to move
> the large part of the code into the VFS.  If someone wanted to take that
> up, it would be much appreciated.
> 
> See "[RFC] add FIEMAP ioctl to efficiently map file allocation" in
> linux-fsdevel for details on this interface.

I didn't follow the discussion much, since it didn't appear to suit
logfs too well.  In a nutshell, logfs is purely block-based, prepends
every block with a header, may compress blocks and packs them as tightly
as possible (byte alignment).

Maybe the "MAP" part fooled me to believe FIEMAP would also expose
physical location of extends on the medium.  But reading the proposal
again, I am unsure about that part.  If physical locations are exposed,
SEEK_HOLE/SEEK_DATA is significantly more elegant for logfs.  If not,
FIEMAP could be useful.

Jörn

-- 
Measure. Don't tune for speed until you've measured, and even then
don't unless one part of the code overwhelms the rest.
-- Rob Pike
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] Implement SEEK_HOLE/SEEK_DATA

2007-11-28 Thread Nicholas Miell

On Wed, 2007-11-28 at 18:33 -0500, Josef Bacik wrote:
> On Wed, Nov 28, 2007 at 02:56:54PM -0800, Nicholas Miell wrote:
> > 
> > On Wed, 2007-11-28 at 15:02 -0500, Josef Bacik wrote:
> > > Hello,
> > > 
> > > This is my first pass at implementing SEEK_HOLE/SEEK_DATA.  This has been 
> > > in
> > > solaris for about a year now, and is described here
> > > 
> > > http://docs.sun.com/app/docs/doc/819-2241/lseek-2?l=en&a=view&q=SEEK_HOLE
> > > http://blogs.sun.com/roller/page/bonwick?entry=seek_hole_and_seek_data
> > > 
> > > I've added a file operation to allow filesystems to override the default
> > > seek_hole_data function, which just loops through bmap looking for either 
> > > a hole
> > > or data.  I've tested this and it seems to work well.  I ran my testcase 
> > > on a
> > > solaris box to make sure I got consistent results (I just ran my test 
> > > script on
> > > the solaris box, I haven't looked at any of their code in case thats a 
> > > concern).
> > > All comments welcome.  Thank you,
> > > 
> > > Josef
> > 
> > I stand by my belief that SEEK_HOLE/SEEK_DATA is a lousy interface.
> > 
> > It abuses the seek operation to become a query operation, it requires a
> > total number of system calls proportional to the number holes+data and
> > it isn't general enough for other similar uses (e.g. total number of
> > contiguous extents, compressed extents, offline extents, extents
> > currently shared with other inodes, extents embedded in the inode
> > (tails), etc.)
> > 
> > Something like the following would be much better:
> > 
> > int getfilextents(int fd, off_t offset, int type, size_t *length, struct
> > extent *extents)
> > 
> > with
> > 
> > int fd: open file
> > 
> > offset: offset in file to start reporting extents
> > 
> > type: one of EXTENT_TYPE_HOLE, EXTENT_TYPE_DATA, EXTENT_TYPE_EXTENTS,
> > EXTENT_TYPE_COMPRESSED, EXTENT_TYPE_UNCOMPRESSED etc.
> > 
> > length: in/out parameter, on entry contains length of extents array, on
> > exit contains number of valid entries in the extents array or total
> > number of extents remaining in the file, whichever is larger
> > 
> > extents: array of struct extent { off_t offset; off_t length }, only
> > updated if non-NULL
> > 
> > Making the type parameter a bitmask and adding a type member to struct
> > extent could be useful so that multiple types of extents could be
> > reported at once could be useful, too. (But you end up with weird cases
> > like data extents overlapping with compressed extents.)
> > 
> > Actually, now that I've searched my mailbox, Andreas Dilger's FIEMAP
> > proposal is pretty much what I suggest here and is certainly superior to
> > Sun's SEEK_HOLE/SEEK_DATA.
> >
> 
> Agreed, however in speaking hch and others the consensus was FIEMAP was good,
> however there was no reason why SEEK_HOLE/SEEK_DATA shouldn't also be
> implemented, and then at some point down the road when a generic FIEMAP is in
> place either change the SEEK_HOLE/SEEK_DATA implementation to try to use 
> FIEMAP
> by default and then fall back on bmap if it has to, or some other such
> operation.  I'm cool with passing on this implementation in preference for
> FIEMAP, but given the discussion I had earlier this week with some of the 
> other
> fs people the general thought was go ahead and do this for now.
> 

Well, there's no demand specifically for SEEK_HOLE/SEEK_DATA[1] and the
interface is ugly, so killing it before it spreads beyond Solaris seems
like a good idea to me. OTOH, if you implement SEEK_HOLE/SEEK_DATA,
nobody is going to bother using the good interface if SEEK_HOLE/
SEEK_DATA is the only portable interface.



[1] The only user appears to be Joerg Schilling.
http://www.google.com/codesearch?hl=en&q=+%5CWSEEK_(DATA%7CHOLE)&sa=N&as_case=y

-- 
Nicholas Miell <[EMAIL PROTECTED]>

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] Implement SEEK_HOLE/SEEK_DATA

2007-11-28 Thread Andreas Dilger
On Nov 28, 2007  15:02 -0500, Josef Bacik wrote:
> +static loff_t generic_seek_hole_data(struct file *file, loff_t offset,
> +  int origin)
> +{
> + loff_t retval = -ENXIO;
> + struct inode *inode = file->f_mapping->host;
> + sector_t block, found_block;
> + sector_t last_block = (i_size_read(inode) - 1) >> inode->i_blkbits;
> + int want = (origin == SEEK_HOLE) ? 0 : 1;
> +
> + for (block = offset >> inode->i_blkbits; block <= last_block; block++) {
> + found_block = bmap(inode, block);
> +
> + if (!!found_block == want) {
> + retval = block << inode->i_blkbits;
> + break;
> + }
> + }

The problem with this is that it doesn't know anything about the filesystem,
and might spin for ages in a tight loop for a large sparse or dense file.

Almost any fs would have to implement ->seek_hole_data() for this to be
useful, at which point we might as well just report -EINVAL to the
application or glibc (since this needs to be handled anyways).

Also, what about network filesystems that don't support bmap?  They will
show up as entirely holes, since bmap() returns 0 if ->bmap is not
supported.  That could lead to serious data loss, if e.g. a filesystem
is being backed up with a SEEK_DATA-aware tool.

In light of this, it makes much more sense to only support this
functionality if the filesystem explicitly adds it, instead of pretending
that it works when there are sharp pointy things in the dark corners.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] Implement SEEK_HOLE/SEEK_DATA

2007-11-28 Thread Josef Bacik
On Wed, Nov 28, 2007 at 01:56:49PM -0800, Brad Boyer wrote:
> 
> I have a few comments. Some of them are inline with the code.
> 
> The one generic thing is that the Sun documentation specifically
> says to check pathconf(_PC_MIN_HOLE_SIZE) on the filesystem to see
> if it supports this request. Have you looked into adding this to
> the Linux version of pathconf? I haven't tried to look at the latest
> glibc, but 2.3.2 doesn't appear to have it. If we do have it, this
> request should really go to the kernel to ask the individual
> filesystem.  However, it looks like pathconf is entirely implemented
> in glibc.  Should we add a system call or some other way to pass
> pathconf requests into the kernel? I know pathconf currently returns
> some inaccurate results in some cases due to this limitation. There
> are some existing ones like _PC_LINK_MAX that glibc tries to guess
> the correct result based on statfs and can get wrong in any
> non-standard setup.
>

Since it appears pathconf doesn't actually interface with the kernel I think
that its not really worth chasing down.  I don't think every fs maintainer is
really going to want to chase down a glibc person in order to change what
pathconf returns for their particular fs.
 
> 
> On Wed, Nov 28, 2007 at 03:02:07PM -0500, Josef Bacik wrote:
> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index ea1f94c..cf61e1e 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -31,10 +31,32 @@ const struct file_operations generic_ro_fops = {
> >  
> >  EXPORT_SYMBOL(generic_ro_fops);
> >  
> > +static loff_t generic_seek_hole_data(struct file *file, loff_t offset,
> > +int origin)
> > +{
> > +   loff_t retval = -ENXIO;
> > +   struct inode *inode = file->f_mapping->host;
> > +   sector_t block, found_block;
> > +   sector_t last_block = (i_size_read(inode) - 1) >> inode->i_blkbits;
> > +   int want = (origin == SEEK_HOLE) ? 0 : 1;
> > +
> > +   for (block = offset >> inode->i_blkbits; block <= last_block; block++) {
> > +   found_block = bmap(inode, block);
> > +
> > +   if (!!found_block == want) {
> > +   retval = block << inode->i_blkbits;
> > +   break;
> > +   }
> > +   }
> > +
> > +   return retval;
> > +}
> > +
> 
> It looks like this function can return an offset that is before the
> one passed in as an argument due to losing the lower bits. I think
> it needs to do somthing more like this in the loop initializer:
> 
> block = (offset + (1 << inode->i_blkbits) - 1) >> inode->i_blkbits;
> 
> By adding 1 block if it wasn't already on an even block, this assures
> us that it won't go backwards from the original offset while allowing
> someone to get a block that really does start exactly on the offset.
> 
> >  loff_t generic_file_llseek(struct file *file, loff_t offset, int origin)
> >  {
> > long long retval;
> > struct inode *inode = file->f_mapping->host;
> > +   loff_t (*fn)(struct file *, loff_t, int);
> >  
> > mutex_lock(&inode->i_mutex);
> > switch (origin) {
> > @@ -43,15 +65,24 @@ loff_t generic_file_llseek(struct file *file, loff_t 
> > offset, int origin)
> > break;
> > case SEEK_CUR:
> > offset += file->f_pos;
> > +   break;
> > +   case SEEK_HOLE:
> > +   case SEEK_DATA:
> > +   fn = generic_seek_hole_data;
> > +   if (file->f_op->seek_hole_data)
> > +   fn = file->f_op->seek_hole_data;
> > +   offset = fn(file, offset, origin);
> > }
> > retval = -EINVAL;
> > if (offset>=0 && offset<=inode->i_sb->s_maxbytes) {
> > -   if (offset != file->f_pos) {
> > +   if (offset != file->f_pos && origin != SEEK_HOLE) {
> 
> Why is SEEK_HOLE special in this case? The description of SEEK_HOLE
> and SEEK_DATA in the Sun document would imply to me that they should
> be handled the same.
> 

This was my interpretation of the man page, however Joern looked at what solaris
does and it seems that SEEK_HOLE does update f_pos, so I will change this.

> > file->f_pos = offset;
> > file->f_version = 0;
> > }
> > retval = offset;
> > -   }
> > +   } else if (offset < 0)
> > +   retval = offset;
> > +
> > mutex_unlock(&inode->i_mutex);
> > return retval;

Thanks much for your comments,

Josef
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] Implement SEEK_HOLE/SEEK_DATA

2007-11-28 Thread Andreas Dilger
On Nov 28, 2007  14:56 -0800, Nicholas Miell wrote:
> I stand by my belief that SEEK_HOLE/SEEK_DATA is a lousy interface.
> 
> It abuses the seek operation to become a query operation, it requires a
> total number of system calls proportional to the number holes+data and
> it isn't general enough for other similar uses (e.g. total number of
> contiguous extents, compressed extents, offline extents, extents
> currently shared with other inodes, extents embedded in the inode
> (tails), etc.)
> 
> Something like the following would be much better:
> 
> int getfilextents(int fd, off_t offset, int type, size_t *length, struct
> extent *extents)
> 
> with
> 
> int fd: open file
> 
> offset: offset in file to start reporting extents
> 
> type: one of EXTENT_TYPE_HOLE, EXTENT_TYPE_DATA, EXTENT_TYPE_EXTENTS,
> EXTENT_TYPE_COMPRESSED, EXTENT_TYPE_UNCOMPRESSED etc.

This is what FIEMAP is supposed to do.  We wrote a spec and implemented
a prototype for ext4, but haven't had time to make it generic to move
the large part of the code into the VFS.  If someone wanted to take that
up, it would be much appreciated.

See "[RFC] add FIEMAP ioctl to efficiently map file allocation" in
linux-fsdevel for details on this interface.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] Implement SEEK_HOLE/SEEK_DATA

2007-11-28 Thread Josef Bacik
On Wed, Nov 28, 2007 at 02:56:54PM -0800, Nicholas Miell wrote:
> 
> On Wed, 2007-11-28 at 15:02 -0500, Josef Bacik wrote:
> > Hello,
> > 
> > This is my first pass at implementing SEEK_HOLE/SEEK_DATA.  This has been in
> > solaris for about a year now, and is described here
> > 
> > http://docs.sun.com/app/docs/doc/819-2241/lseek-2?l=en&a=view&q=SEEK_HOLE
> > http://blogs.sun.com/roller/page/bonwick?entry=seek_hole_and_seek_data
> > 
> > I've added a file operation to allow filesystems to override the default
> > seek_hole_data function, which just loops through bmap looking for either a 
> > hole
> > or data.  I've tested this and it seems to work well.  I ran my testcase on 
> > a
> > solaris box to make sure I got consistent results (I just ran my test 
> > script on
> > the solaris box, I haven't looked at any of their code in case thats a 
> > concern).
> > All comments welcome.  Thank you,
> > 
> > Josef
> 
> I stand by my belief that SEEK_HOLE/SEEK_DATA is a lousy interface.
> 
> It abuses the seek operation to become a query operation, it requires a
> total number of system calls proportional to the number holes+data and
> it isn't general enough for other similar uses (e.g. total number of
> contiguous extents, compressed extents, offline extents, extents
> currently shared with other inodes, extents embedded in the inode
> (tails), etc.)
> 
> Something like the following would be much better:
> 
> int getfilextents(int fd, off_t offset, int type, size_t *length, struct
> extent *extents)
> 
> with
> 
> int fd: open file
> 
> offset: offset in file to start reporting extents
> 
> type: one of EXTENT_TYPE_HOLE, EXTENT_TYPE_DATA, EXTENT_TYPE_EXTENTS,
> EXTENT_TYPE_COMPRESSED, EXTENT_TYPE_UNCOMPRESSED etc.
> 
> length: in/out parameter, on entry contains length of extents array, on
> exit contains number of valid entries in the extents array or total
> number of extents remaining in the file, whichever is larger
> 
> extents: array of struct extent { off_t offset; off_t length }, only
> updated if non-NULL
> 
> Making the type parameter a bitmask and adding a type member to struct
> extent could be useful so that multiple types of extents could be
> reported at once could be useful, too. (But you end up with weird cases
> like data extents overlapping with compressed extents.)
> 
> Actually, now that I've searched my mailbox, Andreas Dilger's FIEMAP
> proposal is pretty much what I suggest here and is certainly superior to
> Sun's SEEK_HOLE/SEEK_DATA.
>

Agreed, however in speaking hch and others the consensus was FIEMAP was good,
however there was no reason why SEEK_HOLE/SEEK_DATA shouldn't also be
implemented, and then at some point down the road when a generic FIEMAP is in
place either change the SEEK_HOLE/SEEK_DATA implementation to try to use FIEMAP
by default and then fall back on bmap if it has to, or some other such
operation.  I'm cool with passing on this implementation in preference for
FIEMAP, but given the discussion I had earlier this week with some of the other
fs people the general thought was go ahead and do this for now.

Josef 
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] Implement SEEK_HOLE/SEEK_DATA

2007-11-28 Thread Nicholas Miell

On Wed, 2007-11-28 at 15:02 -0500, Josef Bacik wrote:
> Hello,
> 
> This is my first pass at implementing SEEK_HOLE/SEEK_DATA.  This has been in
> solaris for about a year now, and is described here
> 
> http://docs.sun.com/app/docs/doc/819-2241/lseek-2?l=en&a=view&q=SEEK_HOLE
> http://blogs.sun.com/roller/page/bonwick?entry=seek_hole_and_seek_data
> 
> I've added a file operation to allow filesystems to override the default
> seek_hole_data function, which just loops through bmap looking for either a 
> hole
> or data.  I've tested this and it seems to work well.  I ran my testcase on a
> solaris box to make sure I got consistent results (I just ran my test script 
> on
> the solaris box, I haven't looked at any of their code in case thats a 
> concern).
> All comments welcome.  Thank you,
> 
> Josef

I stand by my belief that SEEK_HOLE/SEEK_DATA is a lousy interface.

It abuses the seek operation to become a query operation, it requires a
total number of system calls proportional to the number holes+data and
it isn't general enough for other similar uses (e.g. total number of
contiguous extents, compressed extents, offline extents, extents
currently shared with other inodes, extents embedded in the inode
(tails), etc.)

Something like the following would be much better:

int getfilextents(int fd, off_t offset, int type, size_t *length, struct
extent *extents)

with

int fd: open file

offset: offset in file to start reporting extents

type: one of EXTENT_TYPE_HOLE, EXTENT_TYPE_DATA, EXTENT_TYPE_EXTENTS,
EXTENT_TYPE_COMPRESSED, EXTENT_TYPE_UNCOMPRESSED etc.

length: in/out parameter, on entry contains length of extents array, on
exit contains number of valid entries in the extents array or total
number of extents remaining in the file, whichever is larger

extents: array of struct extent { off_t offset; off_t length }, only
updated if non-NULL

Making the type parameter a bitmask and adding a type member to struct
extent could be useful so that multiple types of extents could be
reported at once could be useful, too. (But you end up with weird cases
like data extents overlapping with compressed extents.)

Actually, now that I've searched my mailbox, Andreas Dilger's FIEMAP
proposal is pretty much what I suggest here and is certainly superior to
Sun's SEEK_HOLE/SEEK_DATA.

-- 
Nicholas Miell <[EMAIL PROTECTED]>

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] Implement SEEK_HOLE/SEEK_DATA

2007-11-28 Thread Brad Boyer

I have a few comments. Some of them are inline with the code.

The one generic thing is that the Sun documentation specifically
says to check pathconf(_PC_MIN_HOLE_SIZE) on the filesystem to see
if it supports this request. Have you looked into adding this to
the Linux version of pathconf? I haven't tried to look at the latest
glibc, but 2.3.2 doesn't appear to have it. If we do have it, this
request should really go to the kernel to ask the individual
filesystem.  However, it looks like pathconf is entirely implemented
in glibc.  Should we add a system call or some other way to pass
pathconf requests into the kernel? I know pathconf currently returns
some inaccurate results in some cases due to this limitation. There
are some existing ones like _PC_LINK_MAX that glibc tries to guess
the correct result based on statfs and can get wrong in any
non-standard setup.

Brad Boyer
[EMAIL PROTECTED]

On Wed, Nov 28, 2007 at 03:02:07PM -0500, Josef Bacik wrote:
> diff --git a/fs/read_write.c b/fs/read_write.c
> index ea1f94c..cf61e1e 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -31,10 +31,32 @@ const struct file_operations generic_ro_fops = {
>  
>  EXPORT_SYMBOL(generic_ro_fops);
>  
> +static loff_t generic_seek_hole_data(struct file *file, loff_t offset,
> +  int origin)
> +{
> + loff_t retval = -ENXIO;
> + struct inode *inode = file->f_mapping->host;
> + sector_t block, found_block;
> + sector_t last_block = (i_size_read(inode) - 1) >> inode->i_blkbits;
> + int want = (origin == SEEK_HOLE) ? 0 : 1;
> +
> + for (block = offset >> inode->i_blkbits; block <= last_block; block++) {
> + found_block = bmap(inode, block);
> +
> + if (!!found_block == want) {
> + retval = block << inode->i_blkbits;
> + break;
> + }
> + }
> +
> + return retval;
> +}
> +

It looks like this function can return an offset that is before the
one passed in as an argument due to losing the lower bits. I think
it needs to do somthing more like this in the loop initializer:

block = (offset + (1 << inode->i_blkbits) - 1) >> inode->i_blkbits;

By adding 1 block if it wasn't already on an even block, this assures
us that it won't go backwards from the original offset while allowing
someone to get a block that really does start exactly on the offset.

>  loff_t generic_file_llseek(struct file *file, loff_t offset, int origin)
>  {
>   long long retval;
>   struct inode *inode = file->f_mapping->host;
> + loff_t (*fn)(struct file *, loff_t, int);
>  
>   mutex_lock(&inode->i_mutex);
>   switch (origin) {
> @@ -43,15 +65,24 @@ loff_t generic_file_llseek(struct file *file, loff_t 
> offset, int origin)
>   break;
>   case SEEK_CUR:
>   offset += file->f_pos;
> + break;
> + case SEEK_HOLE:
> + case SEEK_DATA:
> + fn = generic_seek_hole_data;
> + if (file->f_op->seek_hole_data)
> + fn = file->f_op->seek_hole_data;
> + offset = fn(file, offset, origin);
>   }
>   retval = -EINVAL;
>   if (offset>=0 && offset<=inode->i_sb->s_maxbytes) {
> - if (offset != file->f_pos) {
> + if (offset != file->f_pos && origin != SEEK_HOLE) {

Why is SEEK_HOLE special in this case? The description of SEEK_HOLE
and SEEK_DATA in the Sun document would imply to me that they should
be handled the same.

>   file->f_pos = offset;
>   file->f_version = 0;
>   }
>   retval = offset;
> - }
> + } else if (offset < 0)
> + retval = offset;
> +
>   mutex_unlock(&inode->i_mutex);
>   return retval;
>  }
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC][PATCH] Implement SEEK_HOLE/SEEK_DATA

2007-11-28 Thread Josef Bacik
Hello,

This is my first pass at implementing SEEK_HOLE/SEEK_DATA.  This has been in
solaris for about a year now, and is described here

http://docs.sun.com/app/docs/doc/819-2241/lseek-2?l=en&a=view&q=SEEK_HOLE
http://blogs.sun.com/roller/page/bonwick?entry=seek_hole_and_seek_data

I've added a file operation to allow filesystems to override the default
seek_hole_data function, which just loops through bmap looking for either a hole
or data.  I've tested this and it seems to work well.  I ran my testcase on a
solaris box to make sure I got consistent results (I just ran my test script on
the solaris box, I haven't looked at any of their code in case thats a concern).
All comments welcome.  Thank you,

Josef

diff --git a/fs/read_write.c b/fs/read_write.c
index ea1f94c..cf61e1e 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -31,10 +31,32 @@ const struct file_operations generic_ro_fops = {
 
 EXPORT_SYMBOL(generic_ro_fops);
 
+static loff_t generic_seek_hole_data(struct file *file, loff_t offset,
+int origin)
+{
+   loff_t retval = -ENXIO;
+   struct inode *inode = file->f_mapping->host;
+   sector_t block, found_block;
+   sector_t last_block = (i_size_read(inode) - 1) >> inode->i_blkbits;
+   int want = (origin == SEEK_HOLE) ? 0 : 1;
+
+   for (block = offset >> inode->i_blkbits; block <= last_block; block++) {
+   found_block = bmap(inode, block);
+
+   if (!!found_block == want) {
+   retval = block << inode->i_blkbits;
+   break;
+   }
+   }
+
+   return retval;
+}
+
 loff_t generic_file_llseek(struct file *file, loff_t offset, int origin)
 {
long long retval;
struct inode *inode = file->f_mapping->host;
+   loff_t (*fn)(struct file *, loff_t, int);
 
mutex_lock(&inode->i_mutex);
switch (origin) {
@@ -43,15 +65,24 @@ loff_t generic_file_llseek(struct file *file, loff_t 
offset, int origin)
break;
case SEEK_CUR:
offset += file->f_pos;
+   break;
+   case SEEK_HOLE:
+   case SEEK_DATA:
+   fn = generic_seek_hole_data;
+   if (file->f_op->seek_hole_data)
+   fn = file->f_op->seek_hole_data;
+   offset = fn(file, offset, origin);
}
retval = -EINVAL;
if (offset>=0 && offset<=inode->i_sb->s_maxbytes) {
-   if (offset != file->f_pos) {
+   if (offset != file->f_pos && origin != SEEK_HOLE) {
file->f_pos = offset;
file->f_version = 0;
}
retval = offset;
-   }
+   } else if (offset < 0)
+   retval = offset;
+
mutex_unlock(&inode->i_mutex);
return retval;
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b3ec4a4..a55d68e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -30,7 +30,9 @@
 #define SEEK_SET   0   /* seek relative to beginning of file */
 #define SEEK_CUR   1   /* seek relative to current file position */
 #define SEEK_END   2   /* seek relative to end of file */
-#define SEEK_MAX   SEEK_END
+#define SEEK_HOLE  3   /* seek relative to the next hole */
+#define SEEK_DATA  4   /* seek relative to the next block with data */
+#define SEEK_MAX   SEEK_DATA
 
 /* And dynamically-tunable limits and defaults: */
 struct files_stat_struct {
@@ -1163,6 +1165,7 @@ typedef int (*read_actor_t)(read_descriptor_t *, struct 
page *, unsigned long, u
 struct file_operations {
struct module *owner;
loff_t (*llseek) (struct file *, loff_t, int);
+   loff_t (*seek_hole_data) (struct file *, loff_t, int);
ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned 
long, loff_t);
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html