Re: nfsv41 over AF_VSOCK (nfs-ganesha)
On Fri, Oct 23, 2015 at 02:27:22PM +0100, John Spray wrote:
> On Mon, Oct 19, 2015 at 5:13 PM, John Spray wrote:
>> If you try this, send feedback.
>
> OK, got this up and running.
>
> I've shared the kernel/qemu/nfsutils packages I built here:
> https://copr.fedoraproject.org/coprs/jspray/vsock-nfs/builds/
>
> (at time of writing the kernel one is still building, and I'm running
> with ganesha out of a source tree)
>
> Observations:
> * Running VM as qemu user gives EPERM opening vsock device, even
> after changing permissions on the device node (for which I guess we'll
> want udev rules at some stage) -- is there a particular capability
> that we need to grant the qemu user?  Was looking into this to make it
> convenient to run inside libvirt.

libvirtd runs as root and opens /dev/vhost-*.  It passes file
descriptors to the unprivileged QEMU process.  I think this is how
things work in production (with SELinux enabled too).

On a development machine it is easier to either run QEMU as root or set
uid:gid on /dev/vhost-vsock.

So we don't need to do anything, although libvirt code will need to be
written to support vhost-vsock.  It should be very similar to the
existing vhost-net code in libvirt.

> * NFS writes from the guest are lagging for like a minute before
> completing, my hunch is that this is something in the NFS client
> recovery stuff (in ganesha) that's not coping with vsock, the
> operations seem to complete at the point where the server declares
> itself "NOT IN GRACE".
> * For those (like myself) unaccustomed to running ganesha, do not run
> it straight out of a source tree and expect everything to work: by
> default even VFS exports won't work that way (mounts work but clients
> see an empty tree) because it can't find the built FSAL .so.  You can
> write a config file that works, but it's easier just to make install
> it.
> * (Anecdotal, seen while messing with other stuff) client mount seems
> to hang if I kill ganesha and then start it again, not sure if this is
> a ganesha issue or a general vsock issue.

If you experience hangs when the other side closes the connection you
may need:
https://github.com/stefanha/linux/commit/ae3c6c9b1534c1df5213a72f38e377ecd0852e14

Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: nfsv41 over AF_VSOCK (nfs-ganesha)
On Fri, Oct 16, 2015 at 05:08:17PM -0400, Matt Benjamin wrote:
> One of Sage's possible plans for Manila integration would use nfs over
> the new Linux vmci sockets transport integration in qemu (below) to
> access CephFS via an nfs-ganesha server running in the host vm.

Excellent job!  Nice to see you were able to add AF_VSOCK support to
nfs-ganesha so quickly.

I'm currently working on kernel nfsd support and will send the patches
to linux-nfs and CC you.

Stefan
Re: [PATCHv5] rbd block driver fix race between aio completion and aio cancel
On Thu, Nov 29, 2012 at 10:37 PM, Stefan Priebe <s.pri...@profihost.ag> wrote:
> @@ -568,6 +562,10 @@ static void qemu_rbd_aio_cancel(BlockDriverAIOCB *blockacb)
>  {
>      RBDAIOCB *acb = (RBDAIOCB *) blockacb;
>      acb->cancelled = 1;
> +
> +    while (acb->status == -EINPROGRESS) {
> +        qemu_aio_wait();
> +    }
>  }
>
>  static const AIOCBInfo rbd_aiocb_info = {
> @@ -639,6 +637,7 @@ static void rbd_aio_bh_cb(void *opaque)
>      acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
>      qemu_bh_delete(acb->bh);
>      acb->bh = NULL;
> +    acb->status = 0;
>
>      qemu_aio_release(acb);
>  }

We cannot release acb in rbd_aio_bh_cb() when acb->cancelled == 1
because qemu_rbd_aio_cancel() still accesses it.  This was discussed in
an early version of the patch.

Stefan
Re: [PATCHv6] rbd block driver fix race between aio completion and aio cancel
On Fri, Nov 30, 2012 at 9:55 AM, Stefan Priebe <s.pri...@profihost.ag> wrote:
> This one fixes a race which qemu had also in the iscsi block driver
> between cancellation and I/O completion.  qemu_rbd_aio_cancel was not
> synchronously waiting for the end of the command.  To achieve this it
> introduces a new status flag which uses -EINPROGRESS.
>
> Changes since PATCHv5:
> - qemu_aio_release has to be done in qemu_rbd_aio_cancel if I/O was
>   cancelled
>
> Changes since PATCHv4:
> - removed unnecessary qemu_vfree of acb->bounce as BH will always run
>
> Changes since PATCHv3:
> - removed unnecessary if condition in rbd_start_aio as we haven't
>   started I/O yet
> - moved acb->status = 0 to rbd_aio_bh_cb so qemu_aio_wait always waits
>   until BH was executed
>
> Changes since PATCHv2:
> - fixed missing braces
> - added vfree for bounce
>
> Signed-off-by: Stefan Priebe <s.pri...@profihost.ag>
> ---
>  block/rbd.c | 20
>  1 file changed, 12 insertions(+), 8 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefa...@gmail.com>
Re: [PATCH] rbd block driver fix race between aio completion and aio cancel
On Thu, Nov 22, 2012 at 11:00:19AM +0100, Stefan Priebe wrote:
> @@ -406,10 +401,11 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb)
>              acb->ret = r;
>          }
>      }
> +    acb->status = 0;
> +

I suggest doing this in the BH.  The qemu_aio_wait() loop in
qemu_rbd_aio_cancel() needs to wait until the BH has executed.  By
clearing status in the BH we ensure that no matter in which order
qemu_aio_wait() invokes BHs and callbacks, we'll always wait until the
BH has completed before ending the while loop in qemu_rbd_aio_cancel().

> @@ -737,7 +741,8 @@ static BlockDriverAIOCB *rbd_start_aio(BlockDriverState *bs,
>  failed:
>      g_free(rcb);
>      s->qemu_aio_count--;
> -    qemu_aio_release(acb);
> +    if (!acb->cancelled)
> +        qemu_aio_release(acb);
>      return NULL;
>  }

This scenario is impossible.  We haven't returned the acb back to the
caller yet so they could not have invoked qemu_aio_cancel().

Stefan
Re: [PATCH] overflow of int ret: use ssize_t for ret
On Thu, Nov 22, 2012 at 10:07 AM, Stefan Priebe <s.pri...@profihost.ag> wrote:
> diff --git a/block/rbd.c b/block/rbd.c
> index 5a0f79f..0384c6c 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -69,7 +69,7 @@ typedef enum {
>  typedef struct RBDAIOCB {
>      BlockDriverAIOCB common;
>      QEMUBH *bh;
> -    int ret;
> +    ssize_t ret;
>      QEMUIOVector *qiov;
>      char *bounce;
>      RBDAIOCmd cmd;
> @@ -86,7 +86,7 @@ typedef struct RADOSCB {
>      int done;
>      int64_t size;
>      char *buf;
> -    int ret;
> +    ssize_t ret;
>  } RADOSCB;
>
>  #define RBD_FD_READ 0

I preferred your previous patch: ssize_t on 32-bit hosts has
sizeof(ssize_t) == 4.  In qemu_rbd_complete_aio() we may assign
acb->ret = rcb->size.  Here the size field is int64_t, so ssize_t ret
would truncate the value.

But BlockDriverCompletionFunc only takes an int argument so we're back
to square one.  The types are busted and changing the type of ret won't
fix that :(.

Stefan
Re: [PATCH] overflow of int ret: use ssize_t for ret
On Fri, Nov 23, 2012 at 3:15 PM, Peter Maydell <peter.mayd...@linaro.org> wrote:
> On 23 November 2012 14:11, Stefan Hajnoczi <stefa...@gmail.com> wrote:
>> On Thu, Nov 22, 2012 at 10:07 AM, Stefan Priebe <s.pri...@profihost.ag> wrote:
>>> diff --git a/block/rbd.c b/block/rbd.c
>>> index 5a0f79f..0384c6c 100644
>>> --- a/block/rbd.c
>>> +++ b/block/rbd.c
>>> @@ -69,7 +69,7 @@ typedef enum {
>>>  typedef struct RBDAIOCB {
>>>      BlockDriverAIOCB common;
>>>      QEMUBH *bh;
>>> -    int ret;
>>> +    ssize_t ret;
>>>      QEMUIOVector *qiov;
>>>      char *bounce;
>>>      RBDAIOCmd cmd;
>>> @@ -86,7 +86,7 @@ typedef struct RADOSCB {
>>>      int done;
>>>      int64_t size;
>>>      char *buf;
>>> -    int ret;
>>> +    ssize_t ret;
>>>  } RADOSCB;
>>>
>>>  #define RBD_FD_READ 0
>>
>> I preferred your previous patch: ssize_t on 32-bit hosts has
>> sizeof(ssize_t) == 4.  In qemu_rbd_complete_aio() we may assign
>> acb->ret = rcb->size.  Here the size field is int64_t, so ssize_t ret
>> would truncate the value.
>
> The rcb size field should be a size_t: it is used for calling
> rbd_aio_write and rbd_aio_read so if we've overflowed 32 bits then
> we've already got a problem there.

You are right that size_t is fine for read/write.  But size_t is not
fine for discard:

1. librbd takes uint64_t for discard.
2. Completing a discard request uses size.
3. BlockDriverCompletionFunc only passes an int ret value -- broken for
   large discards.

Stefan Priebe has been discarding large regions (e.g. 4 GB must be
possible on a 32-bit host).  The previous int64_t patch didn't clean up
types completely but it made things better.  AFAICT we need to be able
to discard 4 GB so ssize_t/size_t doesn't cut it on 32-bit hosts.

Stefan
Re: [Qemu-devel] [PATCH] use int64_t for return values from rbd instead of int
On Wed, Nov 21, 2012 at 08:47:16AM +0100, Stefan Priebe - Profihost AG wrote:
> On 21.11.2012 07:41, Stefan Hajnoczi wrote:
>> On Tue, Nov 20, 2012 at 8:16 PM, Stefan Priebe <s.pri...@profihost.ag> wrote:
>>> Hi Stefan,
>>>
>>> On 20.11.2012 17:29, Stefan Hajnoczi wrote:
>>>> On Tue, Nov 20, 2012 at 01:44:55PM +0100, Stefan Priebe wrote:
>>>>> rbd / rados tends to return pretty often the length of writes or
>>>>> discarded blocks.  These values might be bigger than int.
>>>>>
>>>>> Signed-off-by: Stefan Priebe <s.pri...@profihost.ag>
>>>>> ---
>>>>>  block/rbd.c | 4 ++--
>>>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> Looks good but I want to check whether this fixes a bug you've hit?
>>>> Please indicate details of the bug and how to reproduce it in the
>>>> commit message.
>>>
>>> you get various I/O errors in the client, as negative return values
>>> indicate I/O errors.  When a big positive value is returned by
>>> librbd, block/rbd tries to store it in acb->ret, which is an int.
>>> Then it wraps around and is negative.  After that block/rbd thinks
>>> this is an I/O error and reports it to the guest.
>>
>> It's still not clear whether this is a bug that you can reproduce.
>> After all, the ret value would have to be > 2^31, which is a 2+ GB
>> request!
>
> Yes and that is the fact.  Look here:
>
>     if (acb->cmd == RBD_AIO_WRITE || acb->cmd == RBD_AIO_DISCARD) {
>         if (r < 0) {
>             acb->ret = r;
>             acb->error = 1;
>         } else if (!acb->error) {
>             acb->ret = rcb->size;
>         }
>
> It sets acb->ret to rcb->size.  But the size of a DISCARD, if you
> discard a whole device, might be 500 GB or today even some TB.

We're going in circles here.  I know the types are wrong in the code and
your patch fixes it, that's why I said it looks good in my first reply.

QEMU is currently in hard freeze and only critical patches should go in.
Providing steps to reproduce the bug helps me decide that this patch
should still be merged for QEMU 1.3-rc1.

Anyway, the patch is straightforward, I have applied it to my block tree
and it will be in QEMU 1.3-rc1:
https://github.com/stefanha/qemu/commits/block

Stefan
Re: [Qemu-devel] [PATCH] use int64_t for return values from rbd instead of int
On Wed, Nov 21, 2012 at 09:33:08AM +0100, Stefan Priebe - Profihost AG wrote:
> On 21.11.2012 09:26, Stefan Hajnoczi wrote:
>> On Wed, Nov 21, 2012 at 08:47:16AM +0100, Stefan Priebe - Profihost AG wrote:
>>> On 21.11.2012 07:41, Stefan Hajnoczi wrote:
>>
>> QEMU is currently in hard freeze and only critical patches should go
>> in.  Providing steps to reproduce the bug helps me decide that this
>> patch should still be merged for QEMU 1.3-rc1.
>>
>> Anyway, the patch is straightforward, I have applied it to my block
>> tree and it will be in QEMU 1.3-rc1:
>> https://github.com/stefanha/qemu/commits/block
>
> Thanks!
>
> The steps to reproduce are: mkfs.xfs -f a whole device bigger than int
> in bytes.  mkfs.xfs sends a discard.  Important is that you use
> scsi-hd and set discard_granularity=512.  Otherwise rbd disables
> discard support.

Excellent, thanks!  I will add it to the commit description.

> Might you have a look at my other rbd fix too?  It fixes a race
> between task cancellation and writes.  The same race was fixed in
> iscsi this summer.

Yes.

Stefan
Re: [Qemu-devel] [PATCH] use int64_t for return values from rbd instead of int
On Tue, Nov 20, 2012 at 01:44:55PM +0100, Stefan Priebe wrote:
> rbd / rados tends to return pretty often the length of writes or
> discarded blocks.  These values might be bigger than int.
>
> Signed-off-by: Stefan Priebe <s.pri...@profihost.ag>
> ---
>  block/rbd.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Looks good but I want to check whether this fixes a bug you've hit?
Please indicate details of the bug and how to reproduce it in the commit
message.

Stefan
Re: [Qemu-devel] [PATCH] use int64_t for return values from rbd instead of int
On Tue, Nov 20, 2012 at 8:16 PM, Stefan Priebe <s.pri...@profihost.ag> wrote:
> Hi Stefan,
>
> On 20.11.2012 17:29, Stefan Hajnoczi wrote:
>> On Tue, Nov 20, 2012 at 01:44:55PM +0100, Stefan Priebe wrote:
>>> rbd / rados tends to return pretty often the length of writes or
>>> discarded blocks.  These values might be bigger than int.
>>>
>>> Signed-off-by: Stefan Priebe <s.pri...@profihost.ag>
>>> ---
>>>  block/rbd.c | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> Looks good but I want to check whether this fixes a bug you've hit?
>> Please indicate details of the bug and how to reproduce it in the
>> commit message.
>
> you get various I/O errors in the client, as negative return values
> indicate I/O errors.  When a big positive value is returned by librbd,
> block/rbd tries to store it in acb->ret, which is an int.  Then it
> wraps around and is negative.  After that block/rbd thinks this is an
> I/O error and reports it to the guest.

It's still not clear whether this is a bug that you can reproduce.
After all, the ret value would have to be > 2^31, which is a 2+ GB
request!

I'm asking if this is a critical bug fix that needs to go into QEMU
1.3-rc1 because of a real-world issue?

Stefan
Re: [Qemu-devel] [PATCH]: set up rbd snapshot handling
On Tue, Jan 10, 2012 at 8:01 PM, Gregory Farnum <gregory.far...@dreamhost.com> wrote:
> +static int qemu_rbd_snap_remove(BlockDriverState *bs,
> +                                const char *snapshot_name)
> +{
> +    BDRVRBDState *s = bs->opaque;
> +    int r;
> +
> +    r = rbd_snap_remove(s->image, snapshot_name);
> +    if (r < 0) {
> +        error_report("failed to remove snap: %s", strerror(-r));
> +        return r;

There's no need to report an error message here.  This function should
return -errno and let the caller decide how to show the error to the
user.  If you look at callers in the codebase they already print an
equivalent error message.

Stefan
Re: [Qemu-devel] [PATCH 0/2] improve qemu-img conversion performance
On Wed, Sep 07, 2011 at 04:06:51PM -0700, Yehuda Sadeh wrote:
> The following set of patches improve the qemu-img conversion process
> performance.  When using a higher latency backend, small writes have a
> severe impact on the time it takes to do image conversion.  We switch
> to using async writes, and we avoid splitting writes due to holes when
> the holes are small enough.
>
> Yehuda Sadeh (2):
>   qemu-img: async write to block device when converting image
>   qemu-img: don't skip writing small holes
>
>  qemu-img.c | 34 +++---
>  1 files changed, 27 insertions(+), 7 deletions(-)

This has nothing to do with the patch itself, but I've been curious
about the existence of both a QEMU and a Linux kernel rbd block driver.

The I/O latency with qemu-img has been an issue for rbd users.  But they
have the option of using the Linux kernel rbd block driver, where
qemu-img can take advantage of the page cache instead of performing
direct I/O.

Does this mean you intend to support both QEMU block/rbd.c and Linux
drivers/block/rbd.c?

As a user I would go with the Linux kernel driver instead of the QEMU
block driver because it offers page cache and host block device
features.  On the other hand a userspace driver is nice because it does
not require privileges.

Stefan
Re: [Qemu-devel] [PATCH 1/3] rbd: allow client id to be specified in config string
On Wed, Sep 7, 2011 at 5:28 PM, Sage Weil <s...@newdream.net> wrote:
> Allow the client id to be specified in the config string via 'id=' so
> that users can control who they authenticate as.  Currently they are
> stuck with the default ('admin').  This is necessary for anyone using
> authentication in their environment.
>
> Signed-off-by: Sage Weil <s...@newdream.net>
> ---
>  block/rbd.c | 52
>  1 files changed, 44 insertions(+), 8 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefa...@linux.vnet.ibm.com>
Re: [Qemu-devel] [PATCH 2/3] rbd: clean up, fix style
On Wed, Sep 7, 2011 at 5:28 PM, Sage Weil <s...@newdream.net> wrote:
> No assignment in condition.  Remove duplicate ret < 0 check.
>
> Signed-off-by: Sage Weil <s...@newdream.net>
> ---
>  block/rbd.c | 17
>  1 files changed, 8 insertions(+), 9 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefa...@linux.vnet.ibm.com>
Re: [Qemu-devel] [PATCH 3/3] rbd: fix leak in qemu_rbd_open failure paths
On Wed, Sep 7, 2011 at 5:28 PM, Sage Weil <s...@newdream.net> wrote:
> Fix leak of s->snap in failure path.  Simplify error paths for the
> whole function.
>
> Reported-by: Stefan Hajnoczi <stefa...@gmail.com>
> Signed-off-by: Sage Weil <s...@newdream.net>
> ---
>  block/rbd.c | 28 +---
>  1 files changed, 13 insertions(+), 15 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefa...@linux.vnet.ibm.com>
Re: [PATCH v2 1/2] rbd: use the higher level librbd instead of just librados
On Tue, Apr 12, 2011 at 1:18 AM, Josh Durgin <josh.dur...@dreamhost.com> wrote:
> On 04/08/2011 01:43 AM, Stefan Hajnoczi wrote:
>> On Mon, Mar 28, 2011 at 04:15:57PM -0700, Josh Durgin wrote:
>>> librbd stacks on top of librados to provide access to rbd images.
>>> Using librbd simplifies the qemu code, and allows qemu to use new
>>> versions of the rbd format with few (if any) changes.
>>>
>>> Signed-off-by: Josh Durgin <josh.dur...@dreamhost.com>
>>> Signed-off-by: Yehuda Sadeh <yeh...@hq.newdream.net>
>>> ---
>>>  block/rbd.c       | 785 +++--
>>>  block/rbd_types.h |  71 -
>>>  configure         |  33 +--
>>>  3 files changed, 221 insertions(+), 668 deletions(-)
>>>  delete mode 100644 block/rbd_types.h
>>
>> Hi Josh,
>> I have applied your patches onto qemu.git/master and am running
>> ceph.git/master.
>>
>> Unfortunately qemu-iotests fails for me.
>>
>> Test 016 seems to hang in qemu-io -g -c "write -P 66 128M 512"
>> rbd:rbd/t.raw.  I can reproduce this consistently.  Here is the
>> backtrace of the hung process (not consuming CPU, probably
>> deadlocked):
>
> This hung because it wasn't checking the return value of
> rbd_aio_write.  I've fixed this in the for-qemu branch of
> http://ceph.newdream.net/git/qemu-kvm.git.
>
> Also, the existing rbd implementation is not 'growable' - writing to a
> large offset will not expand the rbd image correctly.  Should we
> implement bdrv_truncate to support this (librbd has a resize
> operation)?  Is bdrv_truncate useful outside of qemu-img and qemu-io?

If librbd has a resize operation then it would be nice to wire up
bdrv_truncate() for completeness.  Note that bdrv_truncate() can also be
called online using the block_resize monitor command.

Since rbd devices are not growable we should fix qemu-iotests to skip
016 for rbd.

>> Test 008 failed with an assertion but succeeded when run again.  I
>> think this is a race condition:
>
> This is likely a use-after-free, but I haven't been able to find the
> race condition yet (or reproduce it).  Could you get a backtrace from
> the core file?

Unfortunately I have no core file and wasn't able to reproduce it again.

Is qemu-iotests passing for you now?

Stefan
Re: [PATCH v2 1/2] rbd: use the higher level librbd instead of just librados
On Fri, Apr 8, 2011 at 9:43 AM, Stefan Hajnoczi <stefa...@gmail.com> wrote:
> On Mon, Mar 28, 2011 at 04:15:57PM -0700, Josh Durgin wrote:
>> librbd stacks on top of librados to provide access to rbd images.
>> Using librbd simplifies the qemu code, and allows qemu to use new
>> versions of the rbd format with few (if any) changes.
>>
>> Signed-off-by: Josh Durgin <josh.dur...@dreamhost.com>
>> Signed-off-by: Yehuda Sadeh <yeh...@hq.newdream.net>
>> ---
>>  block/rbd.c       | 785 +++--
>>  block/rbd_types.h |  71 -
>>  configure         |  33 +--
>>  3 files changed, 221 insertions(+), 668 deletions(-)
>>  delete mode 100644 block/rbd_types.h
>
> Hi Josh,
> I have applied your patches onto qemu.git/master and am running
> ceph.git/master.
>
> Unfortunately qemu-iotests fails for me.

I forgot to mention that qemu-iotests lives at:

git://git.kernel.org/pub/scm/linux/kernel/git/hch/qemu-iotests.git

Stefan
Re: [PATCH v2 2/2] rbd: allow configuration of rados from the rbd filename
On Thu, Apr 07, 2011 at 10:14:03AM +0900, Yoshiaki Tamura wrote:
> 2011/3/29 Josh Durgin <josh.dur...@dreamhost.com>:
>> The new format is
>> rbd:pool/image[@snapshot][:option1=value1[:option2=value2...]]
>>
>> Each option is used to configure rados, and may be any Ceph option,
>> or "conf".  The "conf" option specifies a Ceph configuration file to
>> read.
>>
>> This allows rbd volumes from more than one Ceph cluster to be used by
>> specifying different monitor addresses, as well as having different
>> logging levels or locations for different volumes.
>>
>> Signed-off-by: Josh Durgin <josh.dur...@dreamhost.com>
>> ---
>>  block/rbd.c | 119 ++
>>  1 files changed, 102 insertions(+), 17 deletions(-)
>>
>> diff --git a/block/rbd.c b/block/rbd.c
>> index cb76dd3..bc3323d 100644
>> --- a/block/rbd.c
>> +++ b/block/rbd.c
>> @@ -22,13 +22,17 @@
>>  /*
>>   * When specifying the image filename use:
>>   *
>> - * rbd:poolname/devicename
>> + * rbd:poolname/devicename[@snapshotname][:option1=value1[:option2=value2...]]
>
> I'm not sure IIUC, but currently this @snapshotname seems to be
> meaningless; it doesn't allow you to boot from a snapshot because it's
> read only.  Am I misunderstanding or tested incorrectly?

Read-only block devices are supported by QEMU and can be useful.

Stefan
Re: [PATCH 1/2] rbd: use the higher level librbd instead of just librados
On Thu, Mar 24, 2011 at 03:51:36PM -0700, Josh Durgin wrote:

You have sent a malformed patch.  Please send patches that follow the
guidelines at http://wiki.qemu.org/Contribute/SubmitAPatch and test that
your mail client is not line wrapping or mangling whitespace.

Stefan
Re: [libvirt] rbd storage pool support for libvirt
On Fri, Nov 19, 2010 at 9:50 AM, Daniel P. Berrange <berra...@redhat.com> wrote:
> On Fri, Nov 19, 2010 at 09:27:40AM +0000, Stefan Hajnoczi wrote:
>> On Thu, Nov 18, 2010 at 5:13 PM, Sage Weil <s...@newdream.net> wrote:
>>> On Thu, 18 Nov 2010, Daniel P. Berrange wrote:
>>>> On Wed, Nov 17, 2010 at 04:33:07PM -0800, Josh Durgin wrote:
>>>>> Hi Daniel,
>>>>>
>>>>> On 11/08/2010 05:16 AM, Daniel P. Berrange wrote:
>>>>>
>>>>> In any case, before someone goes off and implements something,
>>>>> does this look like the right general approach to adding rbd
>>>>> support to libvirt?
>>>>
>>>> I think this looks reasonable.  I'd be inclined to get the storage
>>>> pool stuff working with the kernel RBD driver & UDEV rules for
>>>> stable path names, since that avoids needing to make any changes to
>>>> guest XML format.  Support for QEMU with the native librados CEPH
>>>> driver could be added as a second patch.
>>>
>>> Okay, that sounds reasonable.  Supporting the QEMU librados driver
>>> is definitely something we want to target, though, and seems to be
>>> the route that more users are interested in.  Is defining the XML
>>> syntax for a guest VM something we can discuss now as well?  (BTW
>>> this is biting NBD users too.  Presumably the guest VM XML should
>>> look similar?  And also Sheepdog storage volumes.)
>>
>> To define a syntax for all these we need to determine what
>> configuration metadata is required at a per-VM level for each of
>> them.  Then try and decide how to represent that in the guest XML.
>>
>> It looks like at a VM level we'd need a hostname, port number and a
>> volume name (or path).  It looks like that's what Sheepdog needs from
>> the patch that was submitted earlier today.
>>
>> For RBD, we would want to allow multiple hosts, and specify the pool
>> and image name when the QEMU librados driver is used, e.g.:
>>
>>   <disk type=rbd device=disk>
>>     <driver name=qemu type=raw />
>>     <source vdi=image_name pool=pool_name>
>>       <host name=mon1.example.org port=6000>
>>       <host name=mon2.example.org port=6000>
>>       <host name=mon3.example.org port=6000>
>>     </source>
>>     <target dev=vda bus=virtio />
>>   </disk>
>>
>> Does this seem like a reasonable format for the VM XML?  Any
>> suggestions?
>
> I'm basically wondering whether we should be going for separate types
> for each of NBD, RBD & Sheepdog, as per your proposal & the sheepdog
> one earlier today.  Or to merge them into one type 'network' which
> covers any kind of network block device, and list a protocol on the
> source element, eg
>
>   <disk type=network device=disk>
>     <driver name=qemu type=raw />
>     <source protocol='rbd|sheepdog|nbd' name=...some image identifier...>
>       <host name=mon1.example.org port=6000>
>       <host name=mon2.example.org port=6000>
>       <host name=mon3.example.org port=6000>
>     </source>
>     <target dev=vda bus=virtio />
>   </disk>

>>> That would work...  One thing that I think should be considered,
>>> though, is that both RBD and NBD can be used for non-qemu instances
>>> by mapping a regular block device via the host's kernel.  And in
>>> that case, there's some sysfs-fu (at least in the rbd case; I'm not
>>> familiar with how the nbd client works) required to set up/tear down
>>> the block device.
>>
>> An nbd block device is attached using the nbd-client(1) userspace
>> tool:
>>
>>   $ nbd-client my-server 1234 /dev/nbd0  # host port nbd-device
>>
>> That program will open the socket, grab /dev/nbd0, and poke it with a
>> few ioctls so the kernel has the socket and can take it from there.
>
> We don't need to worry about this for libvirt/QEMU.  Since QEMU has
> native NBD client support there's no need to do anything with nbd
> client tools to setup the device for use with a VM.

I agree it's easier to use the built-in NBD support.  Just wanted to
provide the background on how NBD client works when using the kernel
implementation.

Stefan
Re: [PATCH] ceph/rbd block driver for qemu-kvm (v8)
Reviewed-by: Stefan Hajnoczi <stefa...@linux.vnet.ibm.com>
Re: [Qemu-devel] [PATCH] ceph/rbd block driver for qemu-kvm (v6)
On Wed, Oct 13, 2010 at 12:18 AM, Christian Brunner <c...@muc.de> wrote:
> +static int rbd_set_snapc(rados_pool_t pool, const char *snap, RbdHeader1 *header)
> +{
> +    uint32_t snap_count = header->snap_count;
> +    rados_snap_t *snaps = NULL;
> +    rados_snap_t seq;
> +    uint32_t i;
> +    uint64_t snap_names_len = header->snap_names_len;
> +    int r;
> +    rados_snap_t snapid = 0;
> +
> +    cpu_to_le32s(&snap_count);
> +    cpu_to_le64s(&snap_names_len);

It is clearer to do byteswapping immediately, rather than having the
variable take on different endianness at different times:

uint32_t snap_count = cpu_to_le32(header->snap_count);
uint64_t snap_names_len = cpu_to_le64(header->snap_names_len);

> +    if (snap_count) {
> +        const char *header_snap = (const char *)&header->snaps[snap_count];
> +        const char *end = header_snap + snap_names_len;

snap_names_len is little-endian.  This won't work on big-endian hosts.
Did you mean le64_to_cpu() instead of cpu_to_le64()?

> +        snaps = qemu_malloc(sizeof(rados_snap_t) * header->snap_count);

snaps is allocated here...

> +
> +        for (i = 0; i < snap_count; i++) {
> +            snaps[i] = (uint64_t)header->snaps[i].id;
> +            cpu_to_le64s(&snaps[i]);
> +
> +            if (snap && strcmp(snap, header_snap) == 0) {
> +                snapid = snaps[i];
> +            }
> +
> +            header_snap += strlen(header_snap) + 1;
> +            if (header_snap > end) {
> +                error_report("bad header, snapshot list broken");
> +            }
> +        }
> +    }
> +
> +    if (snap && !snapid) {
> +        error_report("snapshot not found");
> +        return -ENOENT;

...but never freed here.

> +    }
> +    seq = header->snap_seq;
> +    cpu_to_le32s((uint32_t *)&seq);
> +
> +    r = rados_set_snap_context(pool, seq, snaps, snap_count);
> +
> +    rados_set_snap(pool, snapid);
> +
> +    qemu_free(snaps);
> +
> +    return r;
> +}

Stefan
Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
On Thu, May 20, 2010 at 11:16 PM, Christian Brunner <c...@muc.de> wrote:
> 2010/5/20 Anthony Liguori <anth...@codemonkey.ws>:
>> Both sheepdog and ceph ultimately transmit I/O over a socket to a
>> central daemon, right?  So could we not standardize a protocol for
>> this that both sheepdog and ceph could implement?
>
> There is no central daemon.  The concept is that they talk to many
> storage nodes at the same time.  Data is distributed and replicated
> over many nodes in the network.  The mechanism to do this is quite
> complex.  I don't know about sheepdog, but in Ceph this is called
> RADOS (reliable autonomic distributed object store).  Sheepdog and
> Ceph may look similar, but this is where they act differently.
>
> I don't think that it would be possible to implement a common
> protocol.

I believe Sheepdog has a local daemon on each node.  The QEMU storage
backend talks to the daemon on the same node, which then does the real
network communication with the rest of the distributed storage system.
So I think we're not talking about a network protocol here, we're
talking about a common interface that can be used by QEMU and other
programs to take advantage of Ceph, Sheepdog, etc. services available
on the local node.

Haven't looked into your patch enough yet, but does librados talk
directly over the network or does it connect to a local daemon/driver?

Stefan