On Tue, Mar 08, 2016 at 05:21:48AM +0100, Niels de Vos wrote: > On Mon, Mar 07, 2016 at 01:27:38PM -0500, Jeff Cody wrote: > > On Mon, Mar 07, 2016 at 07:04:15PM +0100, Niels de Vos wrote: > > > GlusterFS 3.8 contains support for SEEK_DATA and SEEK_HOLE. This makes > > > it possible to detect sparse areas in files. > > > > > > Signed-off-by: Niels de Vos <nde...@redhat.com> > > > > > > -- > > > Tested by compiling and running "qemu-img map gluster://..." with a > > > build of the current master branch of glusterfs. Using a Fedora > > > cloud image (in raw format) shows many SEEK procudure calls going back > > > and forth over the network. The output of "qemu map" matches the output > > > when run against the image on the local filesystem. > > > --- > > > block/gluster.c | 159 > > > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > > > configure | 25 +++++++++ > > > 2 files changed, 184 insertions(+) > > > > > > diff --git a/block/gluster.c b/block/gluster.c > > > index 65077a0..1430010 100644 > > > --- a/block/gluster.c > > > +++ b/block/gluster.c > > > @@ -677,6 +677,153 @@ static int > > > qemu_gluster_has_zero_init(BlockDriverState *bs) > > > return 0; > > > } > > > > > > +#ifdef CONFIG_GLUSTERFS_SEEK_DATA > > > > Why do we need to make this a compile-time option? Version checking > > is problematic; for instance, different distributions may have > > backported bug fixes / features, that are not reflected by the > > reported version number, etc.. Ideally, we can determine > > functionality during runtime, and behave accordingly. > > This will not get backported to older Gluster versions, it required a > protocol change. > > > If SEEK_DATA and SEEK_HOLE are not supported, > > qemu_gluster_co_get_block_status can return that sectors are all > > allocated (which is what happens in block/io.c anyway if the driver > > doesn't support the function). > > Ok, good to know. > > > As long as glfs_lseek() will return error (e.g. EINVAL) for an invalid > > whence value, we can handle it runtime. Does glfs_lseek() behave > > sanely? > > Unfortunately older versions of libgfapi do not return EINVAL when > SEEK_DATA/HOLE is used. It is something we'll need to fix in the stable > releases. We can not assume that all users have installed a version of > the library that handles SEEK_DATA/HOLE correctly (return EINVAL) when > there is no support in the network protocol or on the server. > > To be sure that we don't get some undefined behaviour, the compile time > check is needed. >
That's unfortunate. However, we may still be able to work around that with some probing in the qemu_gluster_open() function. I peeked at the libgfapi code, and it looks like for an unknown whence, the current offset is just returned by glfs_lseek() - is that correct? With the same offset input, an lseek should always return something different for a SEEK_DATA and SEEK_HOLE whence (SEEK_HOLE will return EOF offset if there are no further holes). In qemu_gluster_open(), we can then probe for support. if we call glfs_lseek() twice with the same offset, first with SEEK_DATA and then with SEEK_HOLE, then we can see if the libgfapi version supports SEEK_DATA/SEEK_HOLE. If the resulting offsets from the calls are equal, it doesn't support it. If the offsets differ, then presumably it does. We can then set and remember that in the BDRVGlusterState struct (e.g. bool supports_seek_data). > > > +/* > > > + * Find allocation range in @bs around offset @start. > > > + * May change underlying file descriptor's file offset. > > > + * If @start is not in a hole, store @start in @data, and the > > > + * beginning of the next hole in @hole, and return 0. > > > + * If @start is in a non-trailing hole, store @start in @hole and the > > > + * beginning of the next non-hole in @data, and return 0. > > > + * If @start is in a trailing hole or beyond EOF, return -ENXIO. > > > + * If we can't find out, return a negative errno other than -ENXIO. > > > + * > > > + * (Shamefully copied from raw-posix.c, only miniscule adaptions.) > > > + */ > > > +static int find_allocation(BlockDriverState *bs, off_t start, > > > + off_t *data, off_t *hole) > > > +{ > > > + BDRVGlusterState *s = bs->opaque; > > > + off_t offs; > > > + > > > + /* > > > + * SEEK_DATA cases: > > > + * D1. offs == start: start is in data > > > + * D2. offs > start: start is in a hole, next data at offs > > > + * D3. offs < 0, errno = ENXIO: either start is in a trailing hole > > > + * or start is beyond EOF > > > + * If the latter happens, the file has been truncated behind > > > + * our back since we opened it. All bets are off then. > > > + * Treating like a trailing hole is simplest. > > > + * D4. offs < 0, errno != ENXIO: we learned nothing > > > + */ > > > + offs = glfs_lseek(s->fd, start, SEEK_DATA); > > > + if (offs < 0) { > > > + return -errno; /* D3 or D4 */ > > > + } > > > + assert(offs >= start); > > > + > > > + if (offs > start) { > > > + /* D2: in hole, next data at offs */ > > > + *hole = start; > > > + *data = offs; > > > + return 0; > > > + } > > > + > > > + /* D1: in data, end not yet known */ > > > + > > > + /* > > > + * SEEK_HOLE cases: > > > + * H1. offs == start: start is in a hole > > > + * If this happens here, a hole has been dug behind our back > > > + * since the previous lseek(). > > > + * H2. offs > start: either start is in data, next hole at offs, > > > + * or start is in trailing hole, EOF at offs > > > + * Linux treats trailing holes like any other hole: offs == > > > + * start. Solaris seeks to EOF instead: offs > start (blech). > > > + * If that happens here, a hole has been dug behind our back > > > + * since the previous lseek(). > > > + * H3. offs < 0, errno = ENXIO: start is beyond EOF > > > + * If this happens, the file has been truncated behind our > > > + * back since we opened it. Treat it like a trailing hole. > > > + * H4. offs < 0, errno != ENXIO: we learned nothing > > > + * Pretend we know nothing at all, i.e. "forget" about D1. > > > + */ > > > + offs = glfs_lseek(s->fd, start, SEEK_HOLE); > > > + if (offs < 0) { > > > + return -errno; /* D1 and (H3 or H4) */ > > > + } > > > + assert(offs >= start); > > > + > > > + if (offs > start) { > > > + /* > > > + * D1 and H2: either in data, next hole at offs, or it was in > > > + * data but is now in a trailing hole. In the latter case, > > > + * all bets are off. Treating it as if it there was data all > > > + * the way to EOF is safe, so simply do that. > > > + */ > > > + *data = start; > > > + *hole = offs; > > > + return 0; > > > + } > > > + > > > + /* D1 and H1 */ > > > + return -EBUSY; > > > +} > > > + > > > +/* > > > + * Returns the allocation status of the specified sectors. > > > + * > > > + * If 'sector_num' is beyond the end of the disk image the return value > > > is 0 > > > + * and 'pnum' is set to 0. > > > + * > > > + * 'pnum' is set to the number of sectors (including and immediately > > > following > > > + * the specified sector) that are known to be in the same > > > + * allocated/unallocated state. > > > + * > > > + * 'nb_sectors' is the max value 'pnum' should be set to. If nb_sectors > > > goes > > > + * beyond the end of the disk image it will be clamped. > > > + * > > > + * (Based on raw_co_get_block_status() from raw-posix.c.) > > > + */ > > > +static int64_t coroutine_fn qemu_gluster_co_get_block_status( > > > + BlockDriverState *bs, int64_t sector_num, int nb_sectors, int > > > *pnum) > > > +{ > > > + BDRVGlusterState *s = bs->opaque; > > > + off_t start, data = 0, hole = 0; > > > + int64_t total_size; > > > + int ret = -EINVAL; > > > + > > > + if (!s->fd) { > > > + return ret; > > > + } > > > + > > > + start = sector_num * BDRV_SECTOR_SIZE; > > > + total_size = bdrv_getlength(bs); > > > + if (total_size < 0) { > > > + return total_size; > > > + } else if (start >= total_size) { > > > + *pnum = 0; > > > + return 0; > > > + } else if (start + nb_sectors * BDRV_SECTOR_SIZE > total_size) { > > > + nb_sectors = DIV_ROUND_UP(total_size - start, BDRV_SECTOR_SIZE); > > > + } > > > + > > > + ret = find_allocation(bs, start, &data, &hole); > > > + if (ret == -ENXIO) { > > > + /* Trailing hole */ > > > + *pnum = nb_sectors; > > > + ret = BDRV_BLOCK_ZERO; > > > + } else if (ret < 0) { > > > + /* No info available, so pretend there are no holes */ > > > + *pnum = nb_sectors; > > > + ret = BDRV_BLOCK_DATA; > > > + } else if (data == start) { > > > + /* On a data extent, compute sectors to the end of the extent, > > > + * possibly including a partial sector at EOF. */ > > > + *pnum = MIN(nb_sectors, DIV_ROUND_UP(hole - start, > > > BDRV_SECTOR_SIZE)); > > > + ret = BDRV_BLOCK_DATA; > > > + } else { > > > + /* On a hole, compute sectors to the beginning of the next > > > extent. */ > > > + assert(hole == start); > > > + *pnum = MIN(nb_sectors, (data - start) / BDRV_SECTOR_SIZE); > > > + ret = BDRV_BLOCK_ZERO; > > > + } > > > + return ret | BDRV_BLOCK_OFFSET_VALID | start; > > > +} > > > +#endif /* CONFIG_GLUSTERFS_SEEK_DATA */ > > > + > > > + > > > static QemuOptsList qemu_gluster_create_opts = { > > > .name = "qemu-gluster-create-opts", > > > .head = QTAILQ_HEAD_INITIALIZER(qemu_gluster_create_opts.head), > > > @@ -719,6 +866,9 @@ static BlockDriver bdrv_gluster = { > > > #ifdef CONFIG_GLUSTERFS_ZEROFILL > > > .bdrv_co_write_zeroes = qemu_gluster_co_write_zeroes, > > > #endif > > > +#ifdef CONFIG_GLUSTERFS_SEEK_DATA > > > + .bdrv_co_get_block_status = qemu_gluster_co_get_block_status, > > > +#endif > > > .create_opts = &qemu_gluster_create_opts, > > > }; > > > > > > @@ -746,6 +896,9 @@ static BlockDriver bdrv_gluster_tcp = { > > > #ifdef CONFIG_GLUSTERFS_ZEROFILL > > > .bdrv_co_write_zeroes = qemu_gluster_co_write_zeroes, > > > #endif > > > +#ifdef CONFIG_GLUSTERFS_SEEK_DATA > > > + .bdrv_co_get_block_status = qemu_gluster_co_get_block_status, > > > +#endif > > > .create_opts = &qemu_gluster_create_opts, > > > }; > > > > > > @@ -773,6 +926,9 @@ static BlockDriver bdrv_gluster_unix = { > > > #ifdef CONFIG_GLUSTERFS_ZEROFILL > > > .bdrv_co_write_zeroes = qemu_gluster_co_write_zeroes, > > > #endif > > > +#ifdef CONFIG_GLUSTERFS_SEEK_DATA > > > + .bdrv_co_get_block_status = qemu_gluster_co_get_block_status, > > > +#endif > > > .create_opts = &qemu_gluster_create_opts, > > > }; > > > > > > @@ -800,6 +956,9 @@ static BlockDriver bdrv_gluster_rdma = { > > > #ifdef CONFIG_GLUSTERFS_ZEROFILL > > > .bdrv_co_write_zeroes = qemu_gluster_co_write_zeroes, > > > #endif > > > +#ifdef CONFIG_GLUSTERFS_SEEK_DATA > > > + .bdrv_co_get_block_status = qemu_gluster_co_get_block_status, > > > +#endif > > > .create_opts = &qemu_gluster_create_opts, > > > }; > > > > > > diff --git a/configure b/configure > > > index 0c0472a..ca3821c 100755 > > > --- a/configure > > > +++ b/configure > > > @@ -3351,6 +3351,9 @@ if test "$glusterfs" != "no" ; then > > > if $pkg_config --atleast-version=6 glusterfs-api; then > > > glusterfs_zerofill="yes" > > > fi > > > + if $pkg_config --atleast-version=7.3.8 glusterfs-api; then > > > + glusterfs_seek_data="yes" > > > + fi > > > else > > > if test "$glusterfs" = "yes" ; then > > > feature_not_found "GlusterFS backend support" \ > > > @@ -3660,6 +3663,24 @@ if compile_prog "" "" ; then > > > fiemap=yes > > > fi > > > > > > +# check for SEEK_DATA and SEEK_HOLE > > > +seek_data=no > > > +cat > $TMPC << EOF > > > +#define _GNU_SOURCE > > > +#include <sys/types.h> > > > +#include <unistd.h> > > > + > > > +int main(void) > > > +{ > > > + lseek(0, 0, SEEK_DATA); > > > + lseek(0, 0, SEEK_HOLE); > > > + return 0; > > > +} > > > +EOF > > > +if compile_prog "" "" ; then > > > + seek_data=yes > > > +fi > > > + > > > # check for dup3 > > > dup3=no > > > cat > $TMPC << EOF > > > @@ -5278,6 +5299,10 @@ if test "$glusterfs_zerofill" = "yes" ; then > > > echo "CONFIG_GLUSTERFS_ZEROFILL=y" >> $config_host_mak > > > fi > > > > > > +if test "$glusterfs_seek_data" = "yes" && test "$seek_data" = "yes" ; > > > then > > > + echo "CONFIG_GLUSTERFS_SEEK_DATA=y" >> $config_host_mak > > > +fi > > > + > > > if test "$archipelago" = "yes" ; then > > > echo "CONFIG_ARCHIPELAGO=m" >> $config_host_mak > > > echo "ARCHIPELAGO_LIBS=$archipelago_libs" >> $config_host_mak > > > -- > > > 2.5.0 > > > > > >