On Tue, Mar 08, 2016 at 07:33:26AM -0500, Jeff Cody wrote: > On Tue, Mar 08, 2016 at 05:21:48AM +0100, Niels de Vos wrote: > > On Mon, Mar 07, 2016 at 01:27:38PM -0500, Jeff Cody wrote: > > > On Mon, Mar 07, 2016 at 07:04:15PM +0100, Niels de Vos wrote: > > > > GlusterFS 3.8 contains support for SEEK_DATA and SEEK_HOLE. This makes > > > > it possible to detect sparse areas in files. > > > > > > > > Signed-off-by: Niels de Vos <nde...@redhat.com> > > > > > > > > -- > > > > Tested by compiling and running "qemu-img map gluster://..." with a > > > > build of the current master branch of glusterfs. Using a Fedora > > > > cloud image (in raw format) shows many SEEK procudure calls going back > > > > and forth over the network. The output of "qemu map" matches the output > > > > when run against the image on the local filesystem. > > > > --- > > > > block/gluster.c | 159 > > > > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > > > > configure | 25 +++++++++ > > > > 2 files changed, 184 insertions(+) > > > > > > > > diff --git a/block/gluster.c b/block/gluster.c > > > > index 65077a0..1430010 100644 > > > > --- a/block/gluster.c > > > > +++ b/block/gluster.c > > > > @@ -677,6 +677,153 @@ static int > > > > qemu_gluster_has_zero_init(BlockDriverState *bs) > > > > return 0; > > > > } > > > > > > > > +#ifdef CONFIG_GLUSTERFS_SEEK_DATA > > > > > > Why do we need to make this a compile-time option? Version checking > > > is problematic; for instance, different distributions may have > > > backported bug fixes / features, that are not reflected by the > > > reported version number, etc.. Ideally, we can determine > > > functionality during runtime, and behave accordingly. > > > > This will not get backported to older Gluster versions, it required a > > protocol change. > > > > > If SEEK_DATA and SEEK_HOLE are not supported, > > > qemu_gluster_co_get_block_status can return that sectors are all > > > allocated (which is what happens in block/io.c anyway if the driver > > > doesn't support the function). > > > > Ok, good to know. > > > > > As long as glfs_lseek() will return error (e.g. EINVAL) for an invalid > > > whence value, we can handle it runtime. Does glfs_lseek() behave > > > sanely? > > > > Unfortunately older versions of libgfapi do not return EINVAL when > > SEEK_DATA/HOLE is used. It is something we'll need to fix in the stable > > releases. We can not assume that all users have installed a version of > > the library that handles SEEK_DATA/HOLE correctly (return EINVAL) when > > there is no support in the network protocol or on the server. > > > > To be sure that we don't get some undefined behaviour, the compile time > > check is needed. > > > > That's unfortunate. > > However, we may still be able to work around that with some probing in > the qemu_gluster_open() function. I peeked at the libgfapi code, and > it looks like for an unknown whence, the current offset is just > returned by glfs_lseek() - is that correct?
Yes, that is correct. > With the same offset input, an lseek should always return something > different for a SEEK_DATA and SEEK_HOLE whence (SEEK_HOLE will return > EOF offset if there are no further holes). > > In qemu_gluster_open(), we can then probe for support. if we call > glfs_lseek() twice with the same offset, first with SEEK_DATA and > then with SEEK_HOLE, then we can see if the libgfapi version > supports SEEK_DATA/SEEK_HOLE. If the resulting offsets from the calls > are equal, it doesn't support it. If the offsets differ, then > presumably it does. We can then set and remember that in the > BDRVGlusterState struct (e.g. bool supports_seek_data). Hmm, yes, that should work. And it'll obviously catch the case where glfs_lseek() returns EINVAL too. Thanks for the idea, I'll try to get this done over the next few days. Niels > > > > > +/* > > > > + * Find allocation range in @bs around offset @start. > > > > + * May change underlying file descriptor's file offset. > > > > + * If @start is not in a hole, store @start in @data, and the > > > > + * beginning of the next hole in @hole, and return 0. > > > > + * If @start is in a non-trailing hole, store @start in @hole and the > > > > + * beginning of the next non-hole in @data, and return 0. > > > > + * If @start is in a trailing hole or beyond EOF, return -ENXIO. > > > > + * If we can't find out, return a negative errno other than -ENXIO. > > > > + * > > > > + * (Shamefully copied from raw-posix.c, only miniscule adaptions.) > > > > + */ > > > > +static int find_allocation(BlockDriverState *bs, off_t start, > > > > + off_t *data, off_t *hole) > > > > +{ > > > > + BDRVGlusterState *s = bs->opaque; > > > > + off_t offs; > > > > + > > > > + /* > > > > + * SEEK_DATA cases: > > > > + * D1. offs == start: start is in data > > > > + * D2. offs > start: start is in a hole, next data at offs > > > > + * D3. offs < 0, errno = ENXIO: either start is in a trailing hole > > > > + * or start is beyond EOF > > > > + * If the latter happens, the file has been truncated behind > > > > + * our back since we opened it. All bets are off then. > > > > + * Treating like a trailing hole is simplest. > > > > + * D4. offs < 0, errno != ENXIO: we learned nothing > > > > + */ > > > > + offs = glfs_lseek(s->fd, start, SEEK_DATA); > > > > + if (offs < 0) { > > > > + return -errno; /* D3 or D4 */ > > > > + } > > > > + assert(offs >= start); > > > > + > > > > + if (offs > start) { > > > > + /* D2: in hole, next data at offs */ > > > > + *hole = start; > > > > + *data = offs; > > > > + return 0; > > > > + } > > > > + > > > > + /* D1: in data, end not yet known */ > > > > + > > > > + /* > > > > + * SEEK_HOLE cases: > > > > + * H1. offs == start: start is in a hole > > > > + * If this happens here, a hole has been dug behind our back > > > > + * since the previous lseek(). > > > > + * H2. offs > start: either start is in data, next hole at offs, > > > > + * or start is in trailing hole, EOF at offs > > > > + * Linux treats trailing holes like any other hole: offs == > > > > + * start. Solaris seeks to EOF instead: offs > start (blech). > > > > + * If that happens here, a hole has been dug behind our back > > > > + * since the previous lseek(). > > > > + * H3. offs < 0, errno = ENXIO: start is beyond EOF > > > > + * If this happens, the file has been truncated behind our > > > > + * back since we opened it. Treat it like a trailing hole. > > > > + * H4. offs < 0, errno != ENXIO: we learned nothing > > > > + * Pretend we know nothing at all, i.e. "forget" about D1. > > > > + */ > > > > + offs = glfs_lseek(s->fd, start, SEEK_HOLE); > > > > + if (offs < 0) { > > > > + return -errno; /* D1 and (H3 or H4) */ > > > > + } > > > > + assert(offs >= start); > > > > + > > > > + if (offs > start) { > > > > + /* > > > > + * D1 and H2: either in data, next hole at offs, or it was in > > > > + * data but is now in a trailing hole. In the latter case, > > > > + * all bets are off. Treating it as if it there was data all > > > > + * the way to EOF is safe, so simply do that. > > > > + */ > > > > + *data = start; > > > > + *hole = offs; > > > > + return 0; > > > > + } > > > > + > > > > + /* D1 and H1 */ > > > > + return -EBUSY; > > > > +} > > > > + > > > > +/* > > > > + * Returns the allocation status of the specified sectors. > > > > + * > > > > + * If 'sector_num' is beyond the end of the disk image the return > > > > value is 0 > > > > + * and 'pnum' is set to 0. > > > > + * > > > > + * 'pnum' is set to the number of sectors (including and immediately > > > > following > > > > + * the specified sector) that are known to be in the same > > > > + * allocated/unallocated state. > > > > + * > > > > + * 'nb_sectors' is the max value 'pnum' should be set to. If > > > > nb_sectors goes > > > > + * beyond the end of the disk image it will be clamped. > > > > + * > > > > + * (Based on raw_co_get_block_status() from raw-posix.c.) > > > > + */ > > > > +static int64_t coroutine_fn qemu_gluster_co_get_block_status( > > > > + BlockDriverState *bs, int64_t sector_num, int nb_sectors, int > > > > *pnum) > > > > +{ > > > > + BDRVGlusterState *s = bs->opaque; > > > > + off_t start, data = 0, hole = 0; > > > > + int64_t total_size; > > > > + int ret = -EINVAL; > > > > + > > > > + if (!s->fd) { > > > > + return ret; > > > > + } > > > > + > > > > + start = sector_num * BDRV_SECTOR_SIZE; > > > > + total_size = bdrv_getlength(bs); > > > > + if (total_size < 0) { > > > > + return total_size; > > > > + } else if (start >= total_size) { > > > > + *pnum = 0; > > > > + return 0; > > > > + } else if (start + nb_sectors * BDRV_SECTOR_SIZE > total_size) { > > > > + nb_sectors = DIV_ROUND_UP(total_size - start, > > > > BDRV_SECTOR_SIZE); > > > > + } > > > > + > > > > + ret = find_allocation(bs, start, &data, &hole); > > > > + if (ret == -ENXIO) { > > > > + /* Trailing hole */ > > > > + *pnum = nb_sectors; > > > > + ret = BDRV_BLOCK_ZERO; > > > > + } else if (ret < 0) { > > > > + /* No info available, so pretend there are no holes */ > > > > + *pnum = nb_sectors; > > > > + ret = BDRV_BLOCK_DATA; > > > > + } else if (data == start) { > > > > + /* On a data extent, compute sectors to the end of the extent, > > > > + * possibly including a partial sector at EOF. */ > > > > + *pnum = MIN(nb_sectors, DIV_ROUND_UP(hole - start, > > > > BDRV_SECTOR_SIZE)); > > > > + ret = BDRV_BLOCK_DATA; > > > > + } else { > > > > + /* On a hole, compute sectors to the beginning of the next > > > > extent. */ > > > > + assert(hole == start); > > > > + *pnum = MIN(nb_sectors, (data - start) / BDRV_SECTOR_SIZE); > > > > + ret = BDRV_BLOCK_ZERO; > > > > + } > > > > + return ret | BDRV_BLOCK_OFFSET_VALID | start; > > > > +} > > > > +#endif /* CONFIG_GLUSTERFS_SEEK_DATA */ > > > > + > > > > + > > > > static QemuOptsList qemu_gluster_create_opts = { > > > > .name = "qemu-gluster-create-opts", > > > > .head = QTAILQ_HEAD_INITIALIZER(qemu_gluster_create_opts.head), > > > > @@ -719,6 +866,9 @@ static BlockDriver bdrv_gluster = { > > > > #ifdef CONFIG_GLUSTERFS_ZEROFILL > > > > .bdrv_co_write_zeroes = qemu_gluster_co_write_zeroes, > > > > #endif > > > > +#ifdef CONFIG_GLUSTERFS_SEEK_DATA > > > > + .bdrv_co_get_block_status = qemu_gluster_co_get_block_status, > > > > +#endif > > > > .create_opts = &qemu_gluster_create_opts, > > > > }; > > > > > > > > @@ -746,6 +896,9 @@ static BlockDriver bdrv_gluster_tcp = { > > > > #ifdef CONFIG_GLUSTERFS_ZEROFILL > > > > .bdrv_co_write_zeroes = qemu_gluster_co_write_zeroes, > > > > #endif > > > > +#ifdef CONFIG_GLUSTERFS_SEEK_DATA > > > > + .bdrv_co_get_block_status = qemu_gluster_co_get_block_status, > > > > +#endif > > > > .create_opts = &qemu_gluster_create_opts, > > > > }; > > > > > > > > @@ -773,6 +926,9 @@ static BlockDriver bdrv_gluster_unix = { > > > > #ifdef CONFIG_GLUSTERFS_ZEROFILL > > > > .bdrv_co_write_zeroes = qemu_gluster_co_write_zeroes, > > > > #endif > > > > +#ifdef CONFIG_GLUSTERFS_SEEK_DATA > > > > + .bdrv_co_get_block_status = qemu_gluster_co_get_block_status, > > > > +#endif > > > > .create_opts = &qemu_gluster_create_opts, > > > > }; > > > > > > > > @@ -800,6 +956,9 @@ static BlockDriver bdrv_gluster_rdma = { > > > > #ifdef CONFIG_GLUSTERFS_ZEROFILL > > > > .bdrv_co_write_zeroes = qemu_gluster_co_write_zeroes, > > > > #endif > > > > +#ifdef CONFIG_GLUSTERFS_SEEK_DATA > > > > + .bdrv_co_get_block_status = qemu_gluster_co_get_block_status, > > > > +#endif > > > > .create_opts = &qemu_gluster_create_opts, > > > > }; > > > > > > > > diff --git a/configure b/configure > > > > index 0c0472a..ca3821c 100755 > > > > --- a/configure > > > > +++ b/configure > > > > @@ -3351,6 +3351,9 @@ if test "$glusterfs" != "no" ; then > > > > if $pkg_config --atleast-version=6 glusterfs-api; then > > > > glusterfs_zerofill="yes" > > > > fi > > > > + if $pkg_config --atleast-version=7.3.8 glusterfs-api; then > > > > + glusterfs_seek_data="yes" > > > > + fi > > > > else > > > > if test "$glusterfs" = "yes" ; then > > > > feature_not_found "GlusterFS backend support" \ > > > > @@ -3660,6 +3663,24 @@ if compile_prog "" "" ; then > > > > fiemap=yes > > > > fi > > > > > > > > +# check for SEEK_DATA and SEEK_HOLE > > > > +seek_data=no > > > > +cat > $TMPC << EOF > > > > +#define _GNU_SOURCE > > > > +#include <sys/types.h> > > > > +#include <unistd.h> > > > > + > > > > +int main(void) > > > > +{ > > > > + lseek(0, 0, SEEK_DATA); > > > > + lseek(0, 0, SEEK_HOLE); > > > > + return 0; > > > > +} > > > > +EOF > > > > +if compile_prog "" "" ; then > > > > + seek_data=yes > > > > +fi > > > > + > > > > # check for dup3 > > > > dup3=no > > > > cat > $TMPC << EOF > > > > @@ -5278,6 +5299,10 @@ if test "$glusterfs_zerofill" = "yes" ; then > > > > echo "CONFIG_GLUSTERFS_ZEROFILL=y" >> $config_host_mak > > > > fi > > > > > > > > +if test "$glusterfs_seek_data" = "yes" && test "$seek_data" = "yes" ; > > > > then > > > > + echo "CONFIG_GLUSTERFS_SEEK_DATA=y" >> $config_host_mak > > > > +fi > > > > + > > > > if test "$archipelago" = "yes" ; then > > > > echo "CONFIG_ARCHIPELAGO=m" >> $config_host_mak > > > > echo "ARCHIPELAGO_LIBS=$archipelago_libs" >> $config_host_mak > > > > -- > > > > 2.5.0 > > > > > > > >