Re: nfsv41 over AF_VSOCK (nfs-ganesha)

2015-10-27 Thread Stefan Hajnoczi
On Fri, Oct 23, 2015 at 02:27:22PM +0100, John Spray wrote:
> On Mon, Oct 19, 2015 at 5:13 PM, John Spray  wrote:
> >> If you try this, send feedback.
> >>
> 
> OK, got this up and running.
> 
> I've shared the kernel/qemu/nfsutils packages I built here:
> https://copr.fedoraproject.org/coprs/jspray/vsock-nfs/builds/
> 
> (at time of writing the kernel one is still building, and I'm running
> with ganesha out of a source tree)
> 
> Observations:
>  * Running VM as qemu user gives EPERM opening vsock device, even
> after changing permissions on the device node (for which I guess we'll
> want udev rules at some stage) -- is there a particular capability
> that we need to grant the qemu user?  Was looking into this to make it
> convenient to run inside libvirt.

libvirtd runs as root and opens /dev/vhost-*.  It passes file
descriptors to the unprivileged QEMU process.  I think this is how
things work in production (with SELinux enabled too).

On a development machine it is easier to either run QEMU as root or set
uid:gid on /dev/vhost-vsock.

So we don't need to do anything, although libvirt code will need to be
written to support vhost-vsock.  It should be very similar to the
existing vhost-net code in libvirt.
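
For readers unfamiliar with the mechanism: fd passing over a UNIX domain
socket uses SCM_RIGHTS ancillary data.  A minimal sketch of the sending
side (illustrative only, not libvirt's actual code):

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Pass an already-open fd (e.g. /dev/vhost-vsock opened by root)
     * over a connected UNIX domain socket to an unprivileged process. */
    static int send_fd(int unix_sock, int fd)
    {
        char data = 'F';                       /* must send at least 1 byte */
        char ctrl[CMSG_SPACE(sizeof(int))];
        struct iovec iov = { .iov_base = &data, .iov_len = 1 };
        struct msghdr msg = { 0 };
        struct cmsghdr *cmsg;

        memset(ctrl, 0, sizeof(ctrl));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = ctrl;
        msg.msg_controllen = sizeof(ctrl);

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;          /* kernel dups the fd */
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        return sendmsg(unix_sock, &msg, 0) == 1 ? 0 : -1;
    }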

>  * NFS writes from the guest are lagging for about a minute before
> completing; my hunch is that this is something in the NFS client
> recovery stuff (in ganesha) that's not coping with vsock, since the
> operations seem to complete at the point where the server declares
> itself "NOT IN GRACE".
>  * For those (like myself) unaccustomed to running ganesha, do not run
> it straight out of a source tree and expect everything to work: by
> default even VFS exports won't work that way (mounts work but clients
> see an empty tree) because it can't find the built FSAL .so.  You can
> write a config file that works, but it's easier just to run make
> install.
>  * (Anecdotal, seen while messing with other stuff) the client mount
> seems to hang if I kill ganesha and then start it again; not sure if
> this is a ganesha issue or a general vsock issue.

If you experience hangs when the other side closes the connection you
may need:
https://github.com/stefanha/linux/commit/ae3c6c9b1534c1df5213a72f38e377ecd0852e14
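
As background for anyone new to AF_VSOCK: from the guest it behaves like
an ordinary stream socket, addressed by a (CID, port) pair instead of an
IP address.  A minimal guest-side connect sketch (the port number is an
assumption for illustration):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <linux/vm_sockets.h>   /* struct sockaddr_vm, VMADDR_CID_HOST */

    int main(void)
    {
        struct sockaddr_vm svm;
        int fd = socket(AF_VSOCK, SOCK_STREAM, 0);

        if (fd < 0) {
            perror("socket");
            return 1;
        }
        memset(&svm, 0, sizeof(svm));
        svm.svm_family = AF_VSOCK;
        svm.svm_cid = VMADDR_CID_HOST;  /* CID 2: the host side */
        svm.svm_port = 2049;            /* example: NFS, as in this thread */
        if (connect(fd, (struct sockaddr *)&svm, sizeof(svm)) < 0) {
            perror("connect");
            return 1;
        }
        /* ... RPC traffic over fd ... */
        close(fd);
        return 0;
    }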

Stefan


Re: nfsv41 over AF_VSOCK (nfs-ganesha)

2015-10-19 Thread Stefan Hajnoczi
On Fri, Oct 16, 2015 at 05:08:17PM -0400, Matt Benjamin wrote:
> One of Sage's possible plans for Manila integration would use NFS over
> the new Linux vmci sockets transport integration in qemu (below) to
> access CephFS via an nfs-ganesha server running in the host vm.

Excellent job!  Nice to see you were able to add AF_VSOCK support to
nfs-ganesha so quickly.

I'm currently working on kernel nfsd support and will send the patches
to linux-nfs and CC you.

Stefan


Re: [PATCHv5] rbd block driver fix race between aio completion and aio cancel

2012-11-30 Thread Stefan Hajnoczi
On Thu, Nov 29, 2012 at 10:37 PM, Stefan Priebe s.pri...@profihost.ag wrote:
 @@ -568,6 +562,10 @@ static void qemu_rbd_aio_cancel(BlockDriverAIOCB *blockacb)
  {
      RBDAIOCB *acb = (RBDAIOCB *) blockacb;
      acb->cancelled = 1;
 +
 +    while (acb->status == -EINPROGRESS) {
 +        qemu_aio_wait();
 +    }
  }

  static const AIOCBInfo rbd_aiocb_info = {
 @@ -639,6 +637,7 @@ static void rbd_aio_bh_cb(void *opaque)
      acb->common.cb(acb->common.opaque, (acb->ret < 0 ? 0 : acb->ret));
      qemu_bh_delete(acb->bh);
      acb->bh = NULL;
 +    acb->status = 0;

      qemu_aio_release(acb);
  }

We cannot release acb in rbd_aio_bh_cb() when acb->cancelled == 1
because qemu_rbd_aio_cancel() still accesses it.  This was discussed
in an early version of the patch.

Stefan


Re: [PATCHv6] rbd block driver fix race between aio completion and aio cancel

2012-11-30 Thread Stefan Hajnoczi
On Fri, Nov 30, 2012 at 9:55 AM, Stefan Priebe s.pri...@profihost.ag wrote:
 This one fixes a race between cancellation and I/O completion which
 qemu also had in the iscsi block driver.

 qemu_rbd_aio_cancel was not synchronously waiting for the end of
 the command.

 To achieve this it introduces a new status flag which uses
 -EINPROGRESS.

 Changes since PATCHv5:
 - qemu_aio_release has to be done in qemu_rbd_aio_cancel if I/O
   was cancelled

 Changes since PATCHv4:
 - removed unnecessary qemu_vfree of acb->bounce as the BH will always
   run

 Changes since PATCHv3:
 - removed unnecessary if condition in rbd_start_aio as we
   haven't started I/O yet
 - moved acb->status = 0 to rbd_aio_bh_cb so qemu_aio_wait always
   waits until the BH has executed

 Changes since PATCHv2:
 - fixed missing braces
 - added vfree for bounce

 Signed-off-by: Stefan Priebe s.pri...@profihost.ag

 ---
  block/rbd.c |   20 ++++++++++++--------
  1 file changed, 12 insertions(+), 8 deletions(-)

Reviewed-by: Stefan Hajnoczi stefa...@gmail.com


Re: [PATCH] rbd block driver fix race between aio completion and aio cancel

2012-11-29 Thread Stefan Hajnoczi
On Thu, Nov 22, 2012 at 11:00:19AM +0100, Stefan Priebe wrote:
 @@ -406,10 +401,11 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb)
              acb->ret = r;
          }
      }
 +    acb->status = 0;
 +

I suggest doing this in the BH.  The qemu_aio_wait() loop in
qemu_rbd_aio_cancel() needs to wait until the BH has executed.  By
clearing status in the BH we ensure that no matter in which order
qemu_aio_wait() invokes BHs and callbacks, we'll always wait until the
BH has completed before ending the while loop in qemu_rbd_aio_cancel().
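
To make the ordering argument concrete, here is a tiny self-contained
model (plain C, not QEMU code) of the invariant: because status is
cleared as the very last action of the BH, the cancel loop cannot exit
before the BH has fully run:

    #include <stdio.h>
    #include <errno.h>

    static int status = -EINPROGRESS;   /* cleared only by the BH */
    static int bh_pending = 1;

    /* Models rbd_aio_bh_cb(): do the completion work, clear status last. */
    static void fake_bh_cb(void)
    {
        /* ... invoke completion callback, delete the BH ... */
        bh_pending = 0;
        status = 0;                     /* final step: releases the waiter */
    }

    /* Models qemu_aio_wait(): dispatch one pending event, if any. */
    static void fake_aio_wait(void)
    {
        if (bh_pending) {
            fake_bh_cb();
        }
    }

    /* Models the qemu_rbd_aio_cancel() loop: returns only after the BH ran. */
    int main(void)
    {
        while (status == -EINPROGRESS) {
            fake_aio_wait();
        }
        printf("cancel returns; the BH has executed\n");
        return 0;
    }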

 @@ -737,7 +741,8 @@ static BlockDriverAIOCB *rbd_start_aio(BlockDriverState *bs,
  failed:
      g_free(rcb);
      s->qemu_aio_count--;
 -    qemu_aio_release(acb);
 +    if (!acb->cancelled)
 +        qemu_aio_release(acb);
      return NULL;
  }

This scenario is impossible.  We haven't returned the acb to the
caller yet, so they could not have invoked qemu_aio_cancel().

Stefan


Re: [PATCH] overflow of int ret: use ssize_t for ret

2012-11-23 Thread Stefan Hajnoczi
On Thu, Nov 22, 2012 at 10:07 AM, Stefan Priebe s.pri...@profihost.ag wrote:
 diff --git a/block/rbd.c b/block/rbd.c
 index 5a0f79f..0384c6c 100644
 --- a/block/rbd.c
 +++ b/block/rbd.c
 @@ -69,7 +69,7 @@ typedef enum {
  typedef struct RBDAIOCB {
      BlockDriverAIOCB common;
      QEMUBH *bh;
 -    int ret;
 +    ssize_t ret;
      QEMUIOVector *qiov;
      char *bounce;
      RBDAIOCmd cmd;
 @@ -86,7 +86,7 @@ typedef struct RADOSCB {
      int done;
      int64_t size;
      char *buf;
 -    int ret;
 +    ssize_t ret;
  } RADOSCB;

  #define RBD_FD_READ 0

I preferred your previous patch:

ssize_t on 32-bit hosts has sizeof(ssize_t) == 4.  In
qemu_rbd_complete_aio() we may assign acb->ret = rcb->size.  Here the
size field is int64_t, so ssize_t ret would truncate the value.

But BlockDriverCompletionFunc only takes an int argument so we're back
to square one.

The types are busted and changing the type of ret won't fix that :(.
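
A two-line demonstration of the wraparound (hypothetical request size;
on the usual two's-complement ABIs a 3 GB value stored in a 32-bit int
comes out negative, which the block layer then mistakes for -errno):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        int64_t size = 3LL * 1024 * 1024 * 1024;  /* e.g. a 3 GB discard */
        int ret = (int)size;                      /* what an int field keeps */

        printf("%lld truncates to %d\n", (long long)size, ret);  /* negative */
        return 0;
    }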

Stefan


Re: [PATCH] overflow of int ret: use ssize_t for ret

2012-11-23 Thread Stefan Hajnoczi
On Fri, Nov 23, 2012 at 3:15 PM, Peter Maydell peter.mayd...@linaro.org wrote:
 On 23 November 2012 14:11, Stefan Hajnoczi stefa...@gmail.com wrote:
 On Thu, Nov 22, 2012 at 10:07 AM, Stefan Priebe s.pri...@profihost.ag 
 wrote:
 diff --git a/block/rbd.c b/block/rbd.c
 index 5a0f79f..0384c6c 100644
 --- a/block/rbd.c
 +++ b/block/rbd.c
 @@ -69,7 +69,7 @@ typedef enum {
  typedef struct RBDAIOCB {
      BlockDriverAIOCB common;
      QEMUBH *bh;
 -    int ret;
 +    ssize_t ret;
      QEMUIOVector *qiov;
      char *bounce;
      RBDAIOCmd cmd;
 @@ -86,7 +86,7 @@ typedef struct RADOSCB {
      int done;
      int64_t size;
      char *buf;
 -    int ret;
 +    ssize_t ret;
  } RADOSCB;

  #define RBD_FD_READ 0

 I preferred your previous patch:

 ssize_t on 32-bit hosts has sizeof(ssize_t) == 4.  In
 qemu_rbd_complete_aio() we may assign acb->ret = rcb->size.  Here the
 size field is int64_t, so ssize_t ret would truncate the value.

 The rcb size field should be a size_t: it is used for calling
 rbd_aio_write and rbd_aio_read so if we've overflowed 32 bits
 then we've already got a problem there.

You are right that size_t is fine for read/write.

But size_t is not fine for discard:
1. librbd takes uint64_t for discard.
2. Completing a discard request uses size.
3. BlockDriverCompletionFunc only passes an int ret value, which is
broken for large discards.

Stefan Priebe has been discarding large regions (e.g. 4 GB must be
possible on a 32-bit host).

The previous int64_t patch didn't clean up types completely but it
made things better.  AFAICT we need to be able to discard 4 GB so
ssize_t/size_t doesn't cut it on 32-bit hosts.

Stefan


Re: [Qemu-devel] [PATCH] use int64_t for return values from rbd instead of int

2012-11-21 Thread Stefan Hajnoczi
On Wed, Nov 21, 2012 at 08:47:16AM +0100, Stefan Priebe - Profihost AG wrote:
 On 21.11.2012 07:41, Stefan Hajnoczi wrote:
 On Tue, Nov 20, 2012 at 8:16 PM, Stefan Priebe s.pri...@profihost.ag wrote:
 Hi Stefan,
 
  On 20.11.2012 17:29, Stefan Hajnoczi wrote:
 
 On Tue, Nov 20, 2012 at 01:44:55PM +0100, Stefan Priebe wrote:
 
  rbd / rados often returns the length of writes or discarded
  blocks. These values might be bigger than an int.
 
 Signed-off-by: Stefan Priebe s.pri...@profihost.ag
 ---
block/rbd.c |4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
 
 
  Looks good, but I want to check whether this fixes a bug you've hit.
  Please indicate details of the bug and how to reproduce it in the commit
  message.
 
 
  you get various I/O errors in the client, since negative return values
  indicate I/O errors. When a big positive value is returned by librbd,
  block/rbd tries to store it in acb->ret, which is an int. It then
  wraps around and becomes negative. After that block/rbd thinks this is
  an I/O error and reports it to the guest.
 
 It's still not clear whether this is a bug that you can reproduce.
 After all, the ret value would have to be 2^31, which is a 2+ GB
 request!
 Yes and that is the fact.
 
  Look here:
      if (acb->cmd == RBD_AIO_WRITE ||
          acb->cmd == RBD_AIO_DISCARD) {
          if (r < 0) {
              acb->ret = r;
              acb->error = 1;
          } else if (!acb->error) {
              acb->ret = rcb->size;
          }

  It sets acb->ret to rcb->size. But the size from a DISCARD of a whole
  device might be 500 GB or, today, even several TB.

We're going in circles here.  I know the types are wrong in the code and
your patch fixes it, that's why I said it looks good in my first reply.

QEMU is currently in hard freeze and only critical patches should go in.
Providing steps to reproduce the bug helps me decide that this patch
should still be merged for QEMU 1.3-rc1.

Anyway, the patch is straightforward, I have applied it to my block tree
and it will be in QEMU 1.3-rc1:
https://github.com/stefanha/qemu/commits/block

Stefan


Re: [Qemu-devel] [PATCH] use int64_t for return values from rbd instead of int

2012-11-21 Thread Stefan Hajnoczi
On Wed, Nov 21, 2012 at 09:33:08AM +0100, Stefan Priebe - Profihost AG wrote:
 On 21.11.2012 09:26, Stefan Hajnoczi wrote:
 On Wed, Nov 21, 2012 at 08:47:16AM +0100, Stefan Priebe - Profihost AG wrote:
  On 21.11.2012 07:41, Stefan Hajnoczi wrote:
 QEMU is currently in hard freeze and only critical patches should go in.
 Providing steps to reproduce the bug helps me decide that this patch
 should still be merged for QEMU 1.3-rc1.
 
 Anyway, the patch is straightforward, I have applied it to my block tree
 and it will be in QEMU 1.3-rc1:
 https://github.com/stefanha/qemu/commits/block
 
 Thanks!
 
 The steps to reproduce are:
 mkfs.xfs -f a whole device bigger than an int in bytes. mkfs.xfs sends
 a discard. It is important that you use scsi-hd and set
 discard_granularity=512. Otherwise rbd disables discard support.

Excellent, thanks!  I will add it to the commit description.

 Might you have a look at my other rbd fix too? It fixes a race
 between task cancellation and writes. The same race was fixed in
 iscsi this summer.

Yes.

Stefan


Re: [Qemu-devel] [PATCH] use int64_t for return values from rbd instead of int

2012-11-20 Thread Stefan Hajnoczi
On Tue, Nov 20, 2012 at 01:44:55PM +0100, Stefan Priebe wrote:
 rbd / rados often returns the length of writes or discarded
 blocks. These values might be bigger than an int.
 
 Signed-off-by: Stefan Priebe s.pri...@profihost.ag
 ---
  block/rbd.c |4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

Looks good, but I want to check whether this fixes a bug you've hit.
Please indicate details of the bug and how to reproduce it in the commit
message.

Stefan


Re: [Qemu-devel] [PATCH] use int64_t for return values from rbd instead of int

2012-11-20 Thread Stefan Hajnoczi
On Tue, Nov 20, 2012 at 8:16 PM, Stefan Priebe s.pri...@profihost.ag wrote:
 Hi Stefan,

 On 20.11.2012 17:29, Stefan Hajnoczi wrote:

 On Tue, Nov 20, 2012 at 01:44:55PM +0100, Stefan Priebe wrote:

 rbd / rados often returns the length of writes or discarded
 blocks. These values might be bigger than an int.

 Signed-off-by: Stefan Priebe s.pri...@profihost.ag
 ---
   block/rbd.c |4 ++--
   1 file changed, 2 insertions(+), 2 deletions(-)


 Looks good, but I want to check whether this fixes a bug you've hit.
 Please indicate details of the bug and how to reproduce it in the commit
 message.


 you get various I/O errors in the client, since negative return values
 indicate I/O errors. When a big positive value is returned by librbd,
 block/rbd tries to store it in acb->ret, which is an int. It then
 wraps around and becomes negative. After that block/rbd thinks this is
 an I/O error and reports it to the guest.

It's still not clear whether this is a bug that you can reproduce.
After all, the ret value would have to be 2^31, which is a 2+ GB
request!

I'm asking whether this is a critical bug fix that needs to go into QEMU
1.3-rc1 because of a real-world issue.

Stefan


Re: [Qemu-devel] [PATCH]: set up rbd snapshot handling

2012-01-11 Thread Stefan Hajnoczi
On Tue, Jan 10, 2012 at 8:01 PM, Gregory Farnum
gregory.far...@dreamhost.com wrote:
 +static int qemu_rbd_snap_remove(BlockDriverState *bs,
 +                                const char *snapshot_name)
 +{
 +    BDRVRBDState *s = bs->opaque;
 +    int r;
 +
 +    r = rbd_snap_remove(s->image, snapshot_name);
 +    if (r < 0) {
 +        error_report("failed to remove snap: %s", strerror(-r));
 +        return r;

There's no need to report an error message here.  This function should
return -errno and let the caller decide how to show the error to the
user.  If you look at callers in the codebase they already print an
equivalent error message.
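
In other words, the suggested shape is roughly this (a sketch against
the types in the quoted patch, not a tested change):

    static int qemu_rbd_snap_remove(BlockDriverState *bs,
                                    const char *snapshot_name)
    {
        BDRVRBDState *s = bs->opaque;

        /* Propagate -errno; the caller decides how to report the error. */
        return rbd_snap_remove(s->image, snapshot_name);
    }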

Stefan


Re: [Qemu-devel] [PATCH 0/2] improve qemu-img conversion performance

2011-09-08 Thread Stefan Hajnoczi
On Wed, Sep 07, 2011 at 04:06:51PM -0700, Yehuda Sadeh wrote:
 The following set of patches improve the qemu-img conversion process
 performance. When using a higher latency backend, small writes have a
 severe impact on the time it takes to do image conversion. 
 We switch to using async writes, and we avoid splitting writes due to
 holes when the holes are small enough.
 
 Yehuda Sadeh (2):
   qemu-img: async write to block device when converting image
   qemu-img: don't skip writing small holes
 
  qemu-img.c |   34 +++++++++++++++++++++++++++-------
  1 files changed, 27 insertions(+), 7 deletions(-)
 
 -- 
 2.7.5.1

This has nothing to do with the patch itself, but I've been curious
about the existence of both a QEMU and a Linux kernel rbd block driver.

The I/O latency with qemu-img has been an issue for rbd users.  But they
have the option of using the Linux kernel rbd block driver, where
qemu-img can take advantage of the page cache instead of performing
direct I/O.

Does this mean you intend to support both QEMU block/rbd.c and Linux
drivers/block/rbd.c?  As a user I would go with the Linux kernel driver
instead of the QEMU block driver because it offers page cache and host
block device features.  On the other hand a userspace driver is nice
because it does not require privileges.

Stefan


Re: [Qemu-devel] [PATCH 1/3] rbd: allow client id to be specified in config string

2011-09-08 Thread Stefan Hajnoczi
On Wed, Sep 7, 2011 at 5:28 PM, Sage Weil s...@newdream.net wrote:
 Allow the client id to be specified in the config string via 'id=' so that
 users can control who they authenticate as.  Currently they are stuck with
 the default ('admin').  This is necessary for anyone using authentication
 in their environment.

 Signed-off-by: Sage Weil s...@newdream.net
 ---
  block/rbd.c |   52 ++++++++++++++++++++++++++++++++++++++++++++++++--------
  1 files changed, 44 insertions(+), 8 deletions(-)

Reviewed-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com


Re: [Qemu-devel] [PATCH 2/3] rbd: clean up, fix style

2011-09-08 Thread Stefan Hajnoczi
On Wed, Sep 7, 2011 at 5:28 PM, Sage Weil s...@newdream.net wrote:
 No assignment in condition.  Remove duplicate ret < 0 check.

 Signed-off-by: Sage Weil s...@newdream.net
 ---
  block/rbd.c |   17 ++++++++---------
  1 files changed, 8 insertions(+), 9 deletions(-)

Reviewed-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com


Re: [Qemu-devel] [PATCH 3/3] rbd: fix leak in qemu_rbd_open failure paths

2011-09-08 Thread Stefan Hajnoczi
On Wed, Sep 7, 2011 at 5:28 PM, Sage Weil s...@newdream.net wrote:
 Fix leak of s->snap in failure path.  Simplify error paths for the whole
 function.

 Reported-by: Stefan Hajnoczi stefa...@gmail.com
 Signed-off-by: Sage Weil s...@newdream.net
 ---
  block/rbd.c |   28 +++++++++++++---------------
  1 files changed, 13 insertions(+), 15 deletions(-)

Reviewed-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com


Re: [PATCH v2 1/2] rbd: use the higher level librbd instead of just librados

2011-04-12 Thread Stefan Hajnoczi
On Tue, Apr 12, 2011 at 1:18 AM, Josh Durgin josh.dur...@dreamhost.com wrote:
 On 04/08/2011 01:43 AM, Stefan Hajnoczi wrote:

 On Mon, Mar 28, 2011 at 04:15:57PM -0700, Josh Durgin wrote:

 librbd stacks on top of librados to provide access
 to rbd images.

 Using librbd simplifies the qemu code, and allows
 qemu to use new versions of the rbd format
 with few (if any) changes.

  Signed-off-by: Josh Durgin josh.dur...@dreamhost.com
  Signed-off-by: Yehuda Sadeh yeh...@hq.newdream.net
 ---
   block/rbd.c       |  785 +++--
  block/rbd_types.h |   71 -
  configure         |   33 +--
  3 files changed, 221 insertions(+), 668 deletions(-)
  delete mode 100644 block/rbd_types.h

 Hi Josh,
 I have applied your patches onto qemu.git/master and am running
 ceph.git/master.

 Unfortunately qemu-iotests fails for me.


  Test 016 seems to hang in qemu-io -g -c "write -P 66 128M 512"
  rbd:rbd/t.raw.  I can reproduce this consistently.  Here is the
 backtrace of the hung process (not consuming CPU, probably deadlocked):

 This hung because it wasn't checking the return value of rbd_aio_write.
 I've fixed this in the for-qemu branch of
 http://ceph.newdream.net/git/qemu-kvm.git. Also, the existing rbd
 implementation is not 'growable' - writing to a large offset will not expand
 the rbd image correctly. Should we implement bdrv_truncate to support this
 (librbd has a resize operation)? Is bdrv_truncate useful outside of qemu-img
 and qemu-io?

If librbd has a resize operation then it would be nice to wire up
bdrv_truncate() for completeness.  Note that bdrv_truncate() can also
be called online using the block_resize monitor command.
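
For reference, wiring it up would be a thin wrapper over librbd's
resize call, along these lines (an untested sketch, assuming the
.bdrv_truncate hook and rbd_resize()):

    static int qemu_rbd_truncate(BlockDriverState *bs, int64_t offset)
    {
        BDRVRBDState *s = bs->opaque;

        /* librbd exposes resize; map bdrv_truncate onto it. */
        int r = rbd_resize(s->image, offset);
        return r < 0 ? r : 0;
    }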

Since rbd devices are not growable we should fix qemu-iotests to skip
016 for rbd.

 Test 008 failed with an assertion but succeeded when run again.  I think
 this is a race condition:

 This is likely a use-after-free, but I haven't been able to find the race
 condition yet (or reproduce it). Could you get a backtrace from the core
 file?

Unfortunately I have no core file and wasn't able to reproduce it again.

Is qemu-iotests passing for you now?

Stefan


Re: [PATCH v2 1/2] rbd: use the higher level librbd instead of just librados

2011-04-08 Thread Stefan Hajnoczi
On Fri, Apr 8, 2011 at 9:43 AM, Stefan Hajnoczi stefa...@gmail.com wrote:
 On Mon, Mar 28, 2011 at 04:15:57PM -0700, Josh Durgin wrote:
 librbd stacks on top of librados to provide access
 to rbd images.

 Using librbd simplifies the qemu code, and allows
 qemu to use new versions of the rbd format
 with few (if any) changes.

 Signed-off-by: Josh Durgin josh.dur...@dreamhost.com
 Signed-off-by: Yehuda Sadeh yeh...@hq.newdream.net
 ---
   block/rbd.c       |  785 +++--
  block/rbd_types.h |   71 -
  configure         |   33 +--
  3 files changed, 221 insertions(+), 668 deletions(-)
  delete mode 100644 block/rbd_types.h

 Hi Josh,
 I have applied your patches onto qemu.git/master and am running
 ceph.git/master.

 Unfortunately qemu-iotests fails for me.

I forgot to mention that qemu-iotests lives at:

git://git.kernel.org/pub/scm/linux/kernel/git/hch/qemu-iotests.git

Stefan


Re: [PATCH v2 2/2] rbd: allow configuration of rados from the rbd filename

2011-04-07 Thread Stefan Hajnoczi
On Thu, Apr 07, 2011 at 10:14:03AM +0900, Yoshiaki Tamura wrote:
 2011/3/29 Josh Durgin josh.dur...@dreamhost.com:
  The new format is
  rbd:pool/image[@snapshot][:option1=value1[:option2=value2...]]
  Each option is used to configure rados, and may be any Ceph option,
  or "conf". The "conf" option specifies a Ceph configuration file to
  read.
 
  This allows rbd volumes from more than one Ceph cluster to be used by
  specifying different monitor addresses, as well as having different
  logging levels or locations for different volumes.
 
  Signed-off-by: Josh Durgin josh.dur...@dreamhost.com
  ---
   block/rbd.c |  119 
  ++
   1 files changed, 102 insertions(+), 17 deletions(-)
 
  diff --git a/block/rbd.c b/block/rbd.c
  index cb76dd3..bc3323d 100644
  --- a/block/rbd.c
  +++ b/block/rbd.c
  @@ -22,13 +22,17 @@
   /*
    * When specifying the image filename use:
    *
  - * rbd:poolname/devicename
  + * rbd:poolname/devicename[@snapshotname][:option1=value1[:option2=value2...]]
 
 I'm not sure IIUC, but currently this @snapshotname seems to be
 meaningless; it doesn't allow you to boot from a snapshot because it's
 read only.  Am I misunderstanding, or did I test incorrectly?

Read-only block devices are supported by QEMU and can be useful.

Stefan


Re: [PATCH 1/2] rbd: use the higher level librbd instead of just librados

2011-03-28 Thread Stefan Hajnoczi
On Thu, Mar 24, 2011 at 03:51:36PM -0700, Josh Durgin wrote:
You have sent a malformed patch.  Please send patches that follow the
guidelines at http://wiki.qemu.org/Contribute/SubmitAPatch and test that
your mail client is not line wrapping or mangling whitespace.

Stefan


Re: [libvirt] rbd storage pool support for libvirt

2010-11-19 Thread Stefan Hajnoczi
On Fri, Nov 19, 2010 at 9:50 AM, Daniel P. Berrange berra...@redhat.com wrote:
 On Fri, Nov 19, 2010 at 09:27:40AM +, Stefan Hajnoczi wrote:
 On Thu, Nov 18, 2010 at 5:13 PM, Sage Weil s...@newdream.net wrote:
  On Thu, 18 Nov 2010, Daniel P. Berrange wrote:
  On Wed, Nov 17, 2010 at 04:33:07PM -0800, Josh Durgin wrote:
   Hi Daniel,
  
   On 11/08/2010 05:16 AM, Daniel P. Berrange wrote:
   In any case, before someone goes off and implements something, does
   this look like the right general approach to adding rbd support to
   libvirt?
   
   I think this looks reasonable. I'd be inclined to get the storage
   pool stuff working with the kernel RBD driver & UDEV rules for
   stable path names, since that avoids needing to make any changes to
   guest XML format. Support for QEMU with the native librados CEPH
   driver could be added as a second patch.
   
   Okay, that sounds reasonable.  Supporting the QEMU librados driver
   is definitely something we want to target, though, and seems to be
   the route that more users are interested in.  Is defining the XML
   syntax for a guest VM something we can discuss now as well?
   
   (BTW this is biting NBD users too.  Presumably the guest VM XML should
   look similar?)
   
   And also Sheepdog storage volumes. To define a syntax for all these
   we need to determine what configuration metadata is required at a
   per-VM level for each of them. Then try and decide how to represent
   that in the guest XML. It looks like at a VM level we'd need a
   hostname, port number and a volume name (or path).
  
   It looks like that's what Sheepdog needs from the patch that was
   submitted earlier today. For RBD, we would want to allow multiple hosts,
   and specify the pool and image name when the QEMU librados driver is
   used, e.g.:
  
       <disk type='rbd' device='disk'>
         <driver name='qemu' type='raw'/>
         <source vdi='image_name' pool='pool_name'>
           <host name='mon1.example.org' port='6000'/>
           <host name='mon2.example.org' port='6000'/>
           <host name='mon3.example.org' port='6000'/>
         </source>
         <target dev='vda' bus='virtio'/>
       </disk>
  
   Does this seem like a reasonable format for the VM XML? Any suggestions?
 
  I'm basically wondering whether we should be going for separate types
  for each of NBD, RBD & Sheepdog, as per your proposal & the sheepdog
  one earlier today. Or try to merge them into one type 'network' which
  covers any kind of network block device, and list a protocol on the
  source element, e.g.
 
       <disk type='network' device='disk'>
         <driver name='qemu' type='raw'/>
         <source protocol='rbd|sheepdog|nbd' name='...some image identifier...'>
           <host name='mon1.example.org' port='6000'/>
           <host name='mon2.example.org' port='6000'/>
           <host name='mon3.example.org' port='6000'/>
         </source>
         <target dev='vda' bus='virtio'/>
       </disk>
 
  That would work...
 
  One thing that I think should be considered, though, is that both RBD and
  NBD can be used for non-qemu instances by mapping a regular block device
  via the host's kernel.  And in that case, there's some sysfs-fu (at least
  in the rbd case; I'm not familiar with how the nbd client works) required
  to set up/tear down the block device.

 An nbd block device is attached using the nbd-client(1) userspace tool:
 $ nbd-client my-server 1234 /dev/nbd0 # host port nbd-device

 That program will open the socket, grab /dev/nbd0, and poke it with a
 few ioctls so the kernel has the socket and can take it from there.

 We don't need to worry about this for libvirt/QEMU. Since QEMU has native
 NBD client support there's no need to do anything with nbd client tools
 to setup the device for use with a VM.

I agree it's easier to use the built-in NBD support.  Just wanted to
provide the background on how NBD client works when using the kernel
implementation.
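
For the curious, the kernel handoff that nbd-client performs boils down
to a couple of ioctls; a rough sketch (error paths and size/flag setup
omitted):

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/nbd.h>      /* NBD_SET_SOCK, NBD_DO_IT */

    /* Hand a connected NBD socket to the kernel to back /dev/nbd0. */
    static int attach_nbd(int sock)
    {
        int nbd = open("/dev/nbd0", O_RDWR);

        if (nbd < 0) {
            return -1;
        }
        if (ioctl(nbd, NBD_SET_SOCK, sock) < 0) {
            return -1;
        }
        return ioctl(nbd, NBD_DO_IT);   /* blocks while the device is live */
    }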

Stefan


Re: [PATCH] ceph/rbd block driver for qemu-kvm (v8)

2010-11-18 Thread Stefan Hajnoczi
Reviewed-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com


Re: [Qemu-devel] [PATCH] ceph/rbd block driver for qemu-kvm (v6)

2010-10-13 Thread Stefan Hajnoczi
On Wed, Oct 13, 2010 at 12:18 AM, Christian Brunner c...@muc.de wrote:
 +static int rbd_set_snapc(rados_pool_t pool, const char *snap, RbdHeader1 *header)
 +{
 +    uint32_t snap_count = header->snap_count;
 +    rados_snap_t *snaps = NULL;
 +    rados_snap_t seq;
 +    uint32_t i;
 +    uint64_t snap_names_len = header->snap_names_len;
 +    int r;
 +    rados_snap_t snapid = 0;
 +
 +    cpu_to_le32s(&snap_count);
 +    cpu_to_le64s(&snap_names_len);

It is clearer to do byteswapping immediately, rather than having the
variable take on different endianness at different times:
uint32_t snap_count = cpu_to_le32(header->snap_count);
uint64_t snap_names_len = cpu_to_le64(header->snap_names_len);

 +    if (snap_count) {
 +        const char *header_snap = (const char *)&header->snaps[snap_count];
 +        const char *end = header_snap + snap_names_len;

snap_names_len is little-endian.  This won't work on big-endian hosts.
 Did you mean le64_to_cpu() instead of cpu_to_le64()?

 +        snaps = qemu_malloc(sizeof(rados_snap_t) * header->snap_count);

snaps is allocated here...

 +
 +        for (i = 0; i < snap_count; i++) {
 +            snaps[i] = (uint64_t)header->snaps[i].id;
 +            cpu_to_le64s(&snaps[i]);
 +
 +            if (snap && strcmp(snap, header_snap) == 0) {
 +                snapid = snaps[i];
 +            }
 +
 +            header_snap += strlen(header_snap) + 1;
 +            if (header_snap > end) {
 +                error_report("bad header, snapshot list broken");
 +            }
 +        }
 +    }
 +
 +    if (snap && !snapid) {
 +        error_report("snapshot not found");
 +        return -ENOENT;

...but never freed here.
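
That is, the early-return path needs to free the array first; a sketch
of the requested fix:

    if (snap && !snapid) {
        error_report("snapshot not found");
        qemu_free(snaps);   /* snaps was allocated above; free on this path */
        return -ENOENT;
    }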

 +    }
 +    seq = header->snap_seq;
 +    cpu_to_le32s((uint32_t *)&seq);
 +
 +    r = rados_set_snap_context(pool, seq, snaps, snap_count);
 +
 +    rados_set_snap(pool, snapid);
 +
 +    qemu_free(snaps);
 +
 +    return r;
 +}

Stefan


Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-20 Thread Stefan Hajnoczi
On Thu, May 20, 2010 at 11:16 PM, Christian Brunner c...@muc.de wrote:
 2010/5/20 Anthony Liguori anth...@codemonkey.ws:
 Both sheepdog and ceph ultimately transmit I/O over a socket to a central
 daemon, right?  So could we not standardize a protocol for this that both
 sheepdog and ceph could implement?

 There is no central daemon. The concept is that they talk to many
 storage nodes at the same time. Data is distributed and replicated
 over many nodes in the network. The mechanism to do this is quite
 complex. I don't know about sheepdog, but in Ceph this is called RADOS
 (reliable autonomic distributed object store). Sheepdog and Ceph may
 look similar, but this is where they act different. I don't think that
 it would be possible to implement a common protocol.

I believe Sheepdog has a local daemon on each node.  The QEMU storage
backend talks to the daemon on the same node, which then does the real
network communication with the rest of the distributed storage system.
So I think we're not talking about a network protocol here, we're
talking about a common interface that can be used by QEMU and other
programs to take advantage of Ceph, Sheepdog, etc services available
on the local node.

Haven't looked into your patch enough yet, but does librados talk
directly over the network or does it connect to a local daemon/driver?

Stefan