Re: [Qemu-devel] [ceph-users] qemu-1.4.0 and onwards, linux kernel 3.2.x, ceph-RBD, heavy I/O leads to kernel_hung_tasks_timout_secs message and unresponsive qemu-process, [Bug 1207686]

2013-08-13 Thread Sage Weil
On Mon, 5 Aug 2013, Mike Dawson wrote:
 Josh,
 
 Logs are uploaded to cephdrop with the file name mikedawson-rbd-qemu-deadlock.
 
 - At about 2013-08-05 19:46 or 47, we hit the issue, traffic went to 0
 - At about 2013-08-05 19:53:51, ran a 'virsh screenshot'
 
 
 Environment is:
 
 - Ceph 0.61.7 (client is co-mingled with three OSDs)
 - rbd cache = true and cache=writeback
 - qemu 1.4.0 1.4.0+dfsg-1expubuntu4
 - Ubuntu Raring with 3.8.0-25-generic
 
 This issue is reproducible in my environment, and I'm willing to run any wip
 branch you need. What else can I provide to help?

This looks like a different issue than Oliver's.  I see one anomaly in the 
log, where an rbd io completion is triggered a second time for no apparent 
reason.  I opened a separate bug 

http://tracker.ceph.com/issues/5955

and pushed wip-5955 that will hopefully shine some light on the weird 
behavior I saw.  Can you reproduce with this branch and

 debug objectcacher = 20
 debug ms = 1
 debug rbd = 20
 debug finisher = 20
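
For reference, a minimal sketch of where these might live in the ceph.conf
read by the qemu process (the log file path below is only an example):

  [client]
      debug objectcacher = 20
      debug ms = 1
      debug rbd = 20
      debug finisher = 20
      log file = /var/log/ceph/qemu-rbd.$pid.log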

Thanks!
sage


 
 Thanks,
 Mike Dawson
 
 
 On 8/5/2013 3:48 AM, Stefan Hajnoczi wrote:
  On Sun, Aug 04, 2013 at 03:36:52PM +0200, Oliver Francke wrote:
   Am 02.08.2013 um 23:47 schrieb Mike Dawson mike.daw...@cloudapt.com:
We can un-wedge the guest by opening a NoVNC session or running a
'virsh screenshot' command. After that, the guest resumes and runs as
expected. At that point we can examine the guest. Each time we'll see:
  
  If virsh screenshot works then this confirms that QEMU itself is still
  responding.  Its main loop cannot be blocked since it was able to
  process the screendump command.
  
  This supports Josh's theory that a callback is not being invoked.  The
  virtio-blk I/O request would be left in a pending state.
  
  Now here is where the behavior varies between configurations:
  
  On a Windows guest with 1 vCPU, you may see the symptom that the guest no
  longer responds to ping.
  
  On a Linux guest with multiple vCPUs, you may see the hung task message
  from the guest kernel because other vCPUs are still making progress.
  Just the vCPU that issued the I/O request and whose task is in
  UNINTERRUPTIBLE state would really be stuck.
  
  Basically, the symptoms depend not just on how QEMU is behaving but also
  on the guest kernel and how many vCPUs you have configured.
  
  I think this can explain how both problems you are observing, Oliver and
  Mike, are a result of the same bug.  At least I hope they are :).
  
  Stefan
  
 ___
 ceph-users mailing list
 ceph-us...@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 



Re: [Qemu-devel] [ceph-users] qemu-1.4.0 and onwards, linux kernel 3.2.x, ceph-RBD, heavy I/O leads to kernel_hung_tasks_timout_secs message and unresponsive qemu-process, [Bug 1207686]

2013-08-13 Thread Sage Weil
Hi Oliver,

(Posted this on the bug too, but:)

Your last log revealed a bug in the librados aio flush.  A fix is pushed 
to wip-librados-aio-flush (bobtail) and wip-5919 (master); can you retest 
please (with caching off again)?

Thanks!
sage


On Fri, 9 Aug 2013, Oliver Francke wrote:
 Hi Josh,
 
 just opened
 
 http://tracker.ceph.com/issues/5919
 
 with all collected information incl. debug-log.
 
 Hope it helps,
 
 Oliver.
 
 On 08/08/2013 07:01 PM, Josh Durgin wrote:
  On 08/08/2013 05:40 AM, Oliver Francke wrote:
   Hi Josh,
   
   I have a session logged with:
   
debug_ms=1:debug_rbd=20:debug_objectcacher=30
   
   as you requested from Mike, even if I think we have another story
   here, anyway.
   
   Host-kernel is: 3.10.0-rc7, qemu-client 1.6.0-rc2, client-kernel is
   3.2.0-51-amd...
   
   Do you want me to open a ticket for that stuff? I have about 5MB
   compressed logfile waiting for you ;)
  
  Yes, that'd be great. If you could include the time when you saw the guest
  hang that'd be ideal. I'm not sure if this is one or two bugs,
  but it seems likely it's a bug in rbd and not qemu.
  
  Thanks!
  Josh
  
   Thnx in advance,
   
   Oliver.
   
   On 08/05/2013 09:48 AM, Stefan Hajnoczi wrote:
On Sun, Aug 04, 2013 at 03:36:52PM +0200, Oliver Francke wrote:
 Am 02.08.2013 um 23:47 schrieb Mike Dawson mike.daw...@cloudapt.com:
  We can un-wedge the guest by opening a NoVNC session or running a
  'virsh screenshot' command. After that, the guest resumes and runs
  as expected. At that point we can examine the guest. Each time we'll
  see:
If virsh screenshot works then this confirms that QEMU itself is still
responding.  Its main loop cannot be blocked since it was able to
process the screendump command.

This supports Josh's theory that a callback is not being invoked.  The
virtio-blk I/O request would be left in a pending state.

Now here is where the behavior varies between configurations:

On a Windows guest with 1 vCPU, you may see the symptom that the guest
no
longer responds to ping.

On a Linux guest with multiple vCPUs, you may see the hung task message
from the guest kernel because other vCPUs are still making progress.
Just the vCPU that issued the I/O request and whose task is in
UNINTERRUPTIBLE state would really be stuck.

Basically, the symptoms depend not just on how QEMU is behaving but also
on the guest kernel and how many vCPUs you have configured.

I think this can explain how both problems you are observing, Oliver and
Mike, are a result of the same bug.  At least I hope they are :).

Stefan
   
   
  
 
 
 -- 
 
 Oliver Francke
 
 filoo GmbH
 Moltkestraße 25a
 0 Gütersloh
 HRB4355 AG Gütersloh
 
 Geschäftsführer: J.Rehpöhler | C.Kunz
 
 Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh
 
 ___
 ceph-users mailing list
 ceph-us...@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 



Re: [Qemu-devel] Adding a persistent writeback cache to qemu

2013-06-21 Thread Sage Weil
On Fri, 21 Jun 2013, Alex Bligh wrote:
 Sage,
 
 --On 20 June 2013 08:58:19 -0700 Sage Weil s...@inktank.com wrote:
 
   I'd like to hear from Ceph folks what their position on kernel rbd vs
   librados is.  Which one do they recommend for QEMU guests and what are the
   pros/cons?
  
  I agree that a flashcache/bcache-like persistent cache would be a big win
  for qemu + rbd users.
 
 Great.
 
 I think Stefan was really after testing my received wisdom that
 ceph+librbd will give better performance than ceph+blkdev+kernelrbd
 (even without the persistent cache), and if so why.

Oh, right.  At this point the performance differential is strictly related 
to the cache behavior.  If there were feature parity, I would not expect 
any significant difference.  There may be a net difference of a data copy, 
but I'm not sure it will be significant.

  There are a few important issues with librbd vs kernel rbd:
  
   * librbd tends to get new features more quickly than the kernel rbd
 (although now that layering has landed in 3.10 this will be less
 painful than it was).
  
   * Using kernel rbd means users need bleeding edge kernels, a non-starter
 for many orgs that are still running things like RHEL.  Bug fixes are
 difficult to roll out, etc.
  
   * librbd has an in-memory cache that behaves similarly to an HDD's cache
 (e.g., it forces writeback on flush).  This improves performance
 significantly for many workloads.  Of course, having a bcache-like
 layer mitigates this..
  
  I'm not really sure what the best path forward is.  Putting the
  functionality in qemu would benefit lots of other storage backends,
  putting it in librbd would capture various other librbd users (xen, tgt,
  and future users like hyper-v), and using new kernels works today but
  creates a lot of friction for operations.
 
 To be honest I'd not even thought of putting it in librbd (which might
 be simpler). I suspect it might be easier to get patches into librbd
 than into qemu, and that ensuring cache coherency might be simpler.
 If I get time to look at this, would you be interested in taking patches
 for this?

Certainly!  It will be a bit tricky to integrate in a lightweight way, 
however, so I would be sure to sketch out a design before diving too far 
into the coding.  I suspect the best path forward would be to extend the 
ObjectCacher.  This has the added benefit that ceph-fuse and libcephfs 
could benefit as well.

The dev summit for the next release (emperor) is coming up in a few 
weeks.. this would be a great project to submit a blueprint for so we can 
discuss it then.

sage




Re: [Qemu-devel] Adding a persistent writeback cache to qemu

2013-06-21 Thread Sage Weil
On Fri, 21 Jun 2013, Stefan Hajnoczi wrote:
  but if there's really a case for it with performance profiles then I
  guess it would be necessary.  But we should definitely get feedback from
  the Ceph folks too.
 
 
  The specific problem we are trying to solve (in case that's not
  obvious) is the non-locality of data read/written by ceph. Whilst
  you can use placement to localise data to the rack level, even if
  one of your OSDs is in the machine you end up waiting on network
  traffic. That is apparently hard to solve inside Ceph.
 
 I'm not up to speed on the Ceph architecture; is this because you need to
 visit a metadata server before you access the storage?  Even when the
 data is colocated on the same machine, will you still need to ask the
 metadata server first?

The data location is determined by a fancy hash function, so there is no 
metadata lookup step and the client can directly contact the right server.  
The trade-off is that the client doesn't get to choose where to write--the 
hash deterministically tells us that based on the object name and current 
cluster state.
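
As a toy illustration of the idea (this is not CRUSH, just a sketch of
"placement is a pure function of the object name and cluster state"):

    /* toy_placement.c: hash the object name and map it onto the current
     * set of OSDs; no metadata server is consulted in the I/O path. */
    #include <stdio.h>

    static unsigned toy_hash(const char *name)
    {
        unsigned h = 5381;                      /* djb2-style string hash */
        while (*name) {
            h = h * 33 + (unsigned char)*name++;
        }
        return h;
    }

    static int toy_locate(const char *object_name, int num_osds)
    {
        /* "cluster state" is reduced to a bare OSD count here */
        return toy_hash(object_name) % num_osds;
    }

    int main(void)
    {
        printf("foo -> osd.%d\n", toy_locate("foo", 12));
        printf("bar -> osd.%d\n", toy_locate("bar", 12));
        return 0;
    }

Every client that knows the same cluster state computes the same answer, which
is why there is no lookup round trip, and also why the client cannot pick a
local OSD for its writes.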

In the end this means there is some lower bound on latency because we are 
reading/writing over the network...

sage





Re: [Qemu-devel] [PATCH v3 2/2] rbd: link and load librbd dynamically

2013-06-21 Thread Sage Weil
Hi Anthony, Stefan, Paolo,

[Resurrecting an old thread, here!]

On Wed, 10 Apr 2013 Anthony Liguori wrote:
 Stefan Hajnoczi stefa...@gmail.com writes:
  NACK
 
  I think we're solving the problem at the wrong level.  Writing our own
  dynamic linker and adding boilerplate to juggle function pointers
  every time we use a library dependency is ugly.
 
  There are two related problems here:
 
  1. Packagers do not want to enable niche dependencies since users will
  complain that the package is bloated and pulls in too much stuff.
 
  2. QEMU linked against a newer library version fails to run on hosts
  that have an older library.
 
  Problem #1 has several solutions:
 
  1. Let packagers take care of it.  For example, vim is often shipped
  in several packages that have various amounts of dependencies
  (vim-tiny, vim-gtk, etc).  Packagers create the specialized packages
  for specific groups of users to meet their demands without dragging in
  too many dependencies.
 
  2. Make QEMU modular - host devices should be shared libraries that
  are loaded at runtime.  There should be no stable API so that
  development stays flexible and we discourage binary-only modules.
  This lets packagers easily ship a qemu-rbd package, for example, that
  drops in a .so file that QEMU can load at runtime.
 
  Problem #2 is already solved:
 
  The dynamic linker will refuse to load the program if there are
  missing symbols.  It's not possible to mix and match binaries across
  environments while downgrading their library dependencies.  With
  effort, this could be doable but it's not an interesting use case that
  many users care about - they get their binaries from a distro or build
  them from source with correct dependencies.
 
  Maybe it's time to move block drivers and other components into
  modules?
 
 This is really a build system issue more than anything else.  There are
 no internal API changes needed.
 
 All that's needed is something like (in module.h):
 
 /* This should not be used directly.  Use block_init etc. instead.  */
 #ifdef CONFIG_MODULE
 #define module_init(function, type)  \
 const gchar *g_module_check_init(GModule *module)  \
 {\
 register_module_init(function, type);\
 return NULL; \
 }
 #else
 #define module_init(function, type) \
 static void __attribute__((constructor)) do_qemu_init_ ## function(void) {  \
 register_module_init(function, type);   \
 }
 #endif

Has any progress been made toward a generic dynamic linking solution for 
block drivers?  It is a conceptually simple change, but non-trivial for 
anyone unfamiliar with the build system.

In the meantime, the inability to dynamically link librbd is continuing 
to cause significant pain for distro users.  Would you consider merging 
the rbd-specific linking as an interim solution until something is 
generically available to qemu?

sage



Re: [Qemu-devel] Adding a persistent writeback cache to qemu

2013-06-20 Thread Sage Weil
On Thu, 20 Jun 2013, Stefan Hajnoczi wrote:
  The concrete problem here is that flashcache/dm-cache/bcache don't
  work with the rbd (librbd) driver, as flashcache/dm-cache/bcache
  cache access to block devices (in the host layer), and with rbd
  (for instance) there is no access to a block device at all. block/rbd.c
  simply calls librbd which calls librados etc.
  
  So the context switches etc. I am avoiding are the ones that would
  be introduced by using kernel rbd devices rather than librbd.
 
 I understand the limitations with kernel block devices - their
 setup/teardown is an extra step outside QEMU and privileges need to be
 managed.  That basically means you need to use a management tool like
 libvirt to make it usable.
 
 But I don't understand the performance angle here.  Do you have profiles
 that show kernel rbd is a bottleneck due to context switching?
 
 We use the kernel page cache for -drive file=test.img,cache=writeback
 and no one has suggested reimplementing the page cache inside QEMU for
 better performance.
 
 Also, how do you want to manage QEMU page cache with multiple guests
 running?  They are independent and know nothing about each other.  Their
 process memory consumption will be bloated and the kernel memory
 management will end up having to sort out who gets to stay in physical
 memory.
 
 You can see I'm skeptical of this and think it's premature optimization,
 but if there's really a case for it with performance profiles then I
 guess it would be necessary.  But we should definitely get feedback from
 the Ceph folks too.
 
 I'd like to hear from Ceph folks what their position on kernel rbd vs
 librados is.  Which one do they recommend for QEMU guests and what are the
 pros/cons?

I agree that a flashcache/bcache-like persistent cache would be a big win 
for qemu + rbd users.  

There are a few important issues with librbd vs kernel rbd:

 * librbd tends to get new features more quickly than the kernel rbd 
   (although now that layering has landed in 3.10 this will be less 
   painful than it was).

 * Using kernel rbd means users need bleeding edge kernels, a non-starter 
   for many orgs that are still running things like RHEL.  Bug fixes are 
   difficult to roll out, etc.

 * librbd has an in-memory cache that behaves similarly to an HDD's cache 
   (e.g., it forces writeback on flush).  This improves performance 
   significantly for many workloads.  Of course, having a bcache-like 
   layer mitigates this..
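
For example, enabling that cache from the qemu command line looks roughly like
this (a sketch; pool/image names are made up, and rbd_cache can equally be set
in ceph.conf):

    qemu-system-x86_64 ... \
        -drive file=rbd:rbd/vm1:rbd_cache=true,cache=writeback,if=virtio

This mirrors the combination reported earlier in the thread (rbd cache = true
plus cache=writeback).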

I'm not really sure what the best path forward is.  Putting the 
functionality in qemu would benefit lots of other storage backends, 
putting it in librbd would capture various other librbd users (xen, tgt, 
and future users like hyper-v), and using new kernels works today but 
creates a lot of friction for operations.

sage



[Qemu-devel] [PATCH] qemu-iotests: send 'rbd rm ...' stderr to /dev/null

2013-04-16 Thread Sage Weil
The rbd cli tool now sends progress info to stderr; send that to the bit
bucket too.

Signed-off-by: Sage Weil s...@inktank.com
---
 tests/qemu-iotests/common.rc |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/common.rc b/tests/qemu-iotests/common.rc
index a536bf7..d7b0ad1 100644
--- a/tests/qemu-iotests/common.rc
+++ b/tests/qemu-iotests/common.rc
@@ -152,7 +152,7 @@ _cleanup_test_img()
 ;;
 
 rbd)
-rbd rm $TEST_DIR/t.$IMGFMT > /dev/null
+rbd rm $TEST_DIR/t.$IMGFMT > /dev/null 2> /dev/null
 ;;
 
 sheepdog)
-- 
1.7.9.5




[Qemu-devel] [PATCH] rbd: allow escaping in config string

2011-09-19 Thread Sage Weil
The config string is variously delimited by =, @, and /, depending on the
field.  Allow these characters to be escaped by preceding them with \.

Signed-off-by: Sage Weil s...@newdream.net
---
 block/rbd.c |   29 +++--
 1 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index 98efd2c..3068c82 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -102,8 +102,15 @@ static int qemu_rbd_next_tok(char *dst, int dst_len,
 *p = NULL;
 
 if (delim != '\0') {
-end = strchr(src, delim);
-if (end) {
+for (end = src; *end; ++end) {
+if (*end == delim) {
+break;
+}
+if (*end == '\\' && end[1] != '\0') {
+end++;
+}
+}
+if (*end == delim) {
 *p = end + 1;
 *end = '\0';
 }
@@ -122,6 +129,19 @@ static int qemu_rbd_next_tok(char *dst, int dst_len,
 return 0;
 }
 
+static void qemu_rbd_unescape(char *src)
+{
+char *p;
+
+for (p = src; *src; ++src, ++p) {
+if (*src == '\\' && src[1] != '\0') {
+src++;
+}
+*p = *src;
+}
+*p = '\0';
+}
+
 static int qemu_rbd_parsename(const char *filename,
   char *pool, int pool_len,
   char *snap, int snap_len,
@@ -146,6 +166,7 @@ static int qemu_rbd_parsename(const char *filename,
 ret = -EINVAL;
 goto done;
 }
+qemu_rbd_unescape(pool);
 
 if (strchr(p, '@')) {
 ret = qemu_rbd_next_tok(name, name_len, p, '@', "object name", &p);
 if (ret < 0) {
 goto done;
 }
 ret = qemu_rbd_next_tok(snap, snap_len, p, ':', "snap name", &p);
+qemu_rbd_unescape(snap);
 } else {
 ret = qemu_rbd_next_tok(name, name_len, p, ':', "object name", &p);
 }
+qemu_rbd_unescape(name);
 if (ret < 0 || !p) {
 goto done;
 }
@@ -211,6 +234,7 @@ static int qemu_rbd_set_conf(rados_t cluster, const char 
*conf)
 if (ret < 0) {
 break;
 }
+qemu_rbd_unescape(name);
 
 if (!p) {
 error_report("conf option %s has no value", name);
@@ -223,6 +247,7 @@ static int qemu_rbd_set_conf(rados_t cluster, const char 
*conf)
 if (ret < 0) {
 break;
 }
+qemu_rbd_unescape(value);
 
 if (strcmp(name, "conf") == 0) {
 ret = rados_conf_read_file(cluster, value);
-- 
1.7.4.1




Re: [Qemu-devel] [PATCH 2/4] rbd: allow escaping in config string

2011-09-19 Thread Sage Weil
On Mon, 19 Sep 2011, Kevin Wolf wrote:
 If src ends with a backslash, you read beyond the end of the string.

Right, sending an updated patch.

 Wouldn't it make sense to have the unescape integrated in
 qemu_rbd_next_tok? Or are there any places where you would want to call
 it without doing a qemu_rbd_unescape() afterwards?

The conf string makes two passes through rbd_next_tok(), once to grab 
the whole conf string, and again to pull out each item.  We can't 
strip out escaping the first time through or else e.g. \: will turn into 
the : delimiter.

I thought about adding a flag to enable/disable the escaping, but 
explicitly doing the unescape seemed cleaner.

sage
 



[Qemu-devel] [PATCH 0/4] More RBD updates

2011-09-15 Thread Sage Weil
Hi,

Here are a few more improvements to the qemu rbd support.  The first 
patch makes the configuration file handling cleaner (do not error out if 
/etc/ceph/ceph.conf doesn't exist).  The second allows characters in the conf 
string to be escaped, so you can (for example) specify an ip\:port (':' 
is used as a delimiter).
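
For example (a sketch; the image name and monitor address are made up), the
escaped form lets a monitor with an explicit port be passed straight in the
drive string:

    qemu-system-x86_64 ... \
        -drive 'file=rbd:rbd/vm1:mon_host=192.168.0.1\:6789,if=virtio'

The single quotes keep the shell from swallowing the backslash before librbd
ever sees it.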

The last patch implements flush when rbd_flush() is available.  This lets 
us take advantage of write buffering in newer versions of librbd, which 
improves performance significantly for many workloads (including the 
trivial qemu-img convert).

Thanks!
sage


Sage Weil (4):
  rbd: ignore failures when reading from default conf location
  rbd: allow escaping in config string
  rbd: update comment heading
  rbd: call flush, if available

 block/rbd.c |   83 +++---
 1 files changed, 56 insertions(+), 27 deletions(-)

-- 
1.7.2.5




[Qemu-devel] [PATCH 4/4] rbd: call flush, if available

2011-09-15 Thread Sage Weil
librbd recently added async writeback and flush support.  If the new
rbd_flush() call is available, call it.

Signed-off-by: Sage Weil s...@newdream.net
---
 block/rbd.c |   12 
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index 202e7ec..af65924 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -705,6 +705,17 @@ static BlockDriverAIOCB 
*qemu_rbd_aio_writev(BlockDriverState *bs,
 return rbd_aio_rw_vector(bs, sector_num, qiov, nb_sectors, cb, opaque, 1);
 }
 
+static int qemu_rbd_flush(BlockDriverState *bs)
+{
+#if LIBRBD_VERSION_CODE >= LIBRBD_VERSION(0, 1, 1)
+/* rbd_flush added in 0.1.1 */
+BDRVRBDState *s = bs->opaque;
+return rbd_flush(s->image);
+#else
+return 0;
+#endif
+}
+
 static int qemu_rbd_getinfo(BlockDriverState *bs, BlockDriverInfo *bdi)
 {
 BDRVRBDState *s = bs->opaque;
@@ -840,6 +851,7 @@ static BlockDriver bdrv_rbd = {
 .bdrv_file_open = qemu_rbd_open,
 .bdrv_close = qemu_rbd_close,
 .bdrv_create= qemu_rbd_create,
+.bdrv_flush = qemu_rbd_flush,
 .bdrv_get_info  = qemu_rbd_getinfo,
 .create_options = qemu_rbd_create_options,
 .bdrv_getlength = qemu_rbd_getlength,
-- 
1.7.2.5




[Qemu-devel] [PATCH 1/4] rbd: ignore failures when reading from default conf location

2011-09-15 Thread Sage Weil
If we are reading from the default config location, ignore any failures.
It is perfectly legal for the user to specify exactly the options they need
and to not rely on any config file.

Signed-off-by: Sage Weil s...@newdream.net
---
 block/rbd.c |   14 --
 1 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index 1b78d51..f64b2e0 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -298,11 +298,8 @@ static int qemu_rbd_create(const char *filename, 
QEMUOptionParameter *options)
 }
 
 if (strstr(conf, "conf=") == NULL) {
-if (rados_conf_read_file(cluster, NULL) < 0) {
-error_report("error reading config file");
-rados_shutdown(cluster);
-return -EIO;
-}
+/* try default location, but ignore failure */
+rados_conf_read_file(cluster, NULL);
 }
 
 if (conf[0] != '\0' &&
@@ -441,11 +438,8 @@ static int qemu_rbd_open(BlockDriverState *bs, const char 
*filename, int flags)
 }
 
 if (strstr(conf, "conf=") == NULL) {
-r = rados_conf_read_file(s->cluster, NULL);
-if (r < 0) {
-error_report("error reading config file");
-goto failed_shutdown;
-}
+/* try default location, but ignore failure */
+rados_conf_read_file(s->cluster, NULL);
 }
 
 if (conf[0] != '\0') {
-- 
1.7.2.5




[Qemu-devel] [PATCH 2/4] rbd: allow escaping in config string

2011-09-15 Thread Sage Weil
The config string is variously delimited by =, @, and /, depending on the
field.  Allow these characters to be escaped by preceding them with \.

Signed-off-by: Sage Weil s...@newdream.net
---
 block/rbd.c |   29 +++--
 1 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index f64b2e0..43f0e63 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -104,8 +104,15 @@ static int qemu_rbd_next_tok(char *dst, int dst_len,
 *p = NULL;
 
 if (delim != '\0') {
-end = strchr(src, delim);
-if (end) {
+for (end = src; *end; ++end) {
+if (*end == delim) {
+break;
+}
+if (*end == '\\') {
+end++;
+}
+}
+if (*end == delim) {
 *p = end + 1;
 *end = '\0';
 }
@@ -124,6 +131,19 @@ static int qemu_rbd_next_tok(char *dst, int dst_len,
 return 0;
 }
 
+static void qemu_rbd_unescape(char *src)
+{
+char *p;
+
+for (p = src; *src; ++src, ++p) {
+if (*src == '\\') {
+src++;
+}
+*p = *src;
+}
+*p = '\0';
+}
+
 static int qemu_rbd_parsename(const char *filename,
   char *pool, int pool_len,
   char *snap, int snap_len,
@@ -148,6 +168,7 @@ static int qemu_rbd_parsename(const char *filename,
 ret = -EINVAL;
 goto done;
 }
+qemu_rbd_unescape(pool);
 
 if (strchr(p, '@')) {
 ret = qemu_rbd_next_tok(name, name_len, p, '@', object name, p);
@@ -155,9 +176,11 @@ static int qemu_rbd_parsename(const char *filename,
 goto done;
 }
 ret = qemu_rbd_next_tok(snap, snap_len, p, ':', snap name, p);
+qemu_rbd_unescape(snap);
 } else {
 ret = qemu_rbd_next_tok(name, name_len, p, ':', object name, p);
 }
+qemu_rbd_unescape(name);
 if (ret  0 || !p) {
 goto done;
 }
@@ -213,6 +236,7 @@ static int qemu_rbd_set_conf(rados_t cluster, const char 
*conf)
 if (ret  0) {
 break;
 }
+qemu_rbd_unescape(name);
 
 if (!p) {
 error_report(conf option %s has no value, name);
@@ -225,6 +249,7 @@ static int qemu_rbd_set_conf(rados_t cluster, const char 
*conf)
 if (ret  0) {
 break;
 }
+qemu_rbd_unescape(value);
 
 if (strcmp(name, conf) == 0) {
 ret = rados_conf_read_file(cluster, value);
-- 
1.7.2.5




Re: [Qemu-devel] [PATCH 0/2] improve qemu-img conversion performance

2011-09-11 Thread Sage Weil
On Fri, 9 Sep 2011, Kevin Wolf wrote:
 Am 08.09.2011 18:36, schrieb Sage Weil:
  On Thu, 8 Sep 2011, Kevin Wolf wrote:
  Am 08.09.2011 01:06, schrieb Yehuda Sadeh:
  The following set of patches improve the qemu-img conversion process
  performance. When using a higher latency backend, small writes have a
  severe impact on the time it takes to do image conversion. 
  We switch to using async writes, and we avoid splitting writes due to
  holes when the holes are small enough.
 
  Yehuda Sadeh (2):
qemu-img: async write to block device when converting image
qemu-img: don't skip writing small holes
 
   qemu-img.c |   34 +++---
   1 files changed, 27 insertions(+), 7 deletions(-)
 
 
  This doesn't seem to be against git master or the block tree. Please 
  rebase.
 
  I think that commit a22f123c may obsolete your patch 2/2.
  
  With git.kernel.org down, where should I be looking for the latest 
  upstream?
 
 qemu has never been on kernel.org. The interesting repositories for you are:
 
 * Upstream: git://git.qemu.org/qemu.git master
 * Block development branch: git://repo.or.cz/qemu/kevin.git block

Oh right.  I've been working from qemu-kvm.git.  

I've done some (still minimal) testing, and it looks like the combination 
of a22f123c and the new writeback/flush stuff in librbd gets the same 
result as doing async io explicitly from qemu-img.c.  Want to take a look, 
Yehuda?  It still defaults to off, so you'll need to add 
rbd_writeback_window=800 or similar to the rbd device string.
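
Something along these lines (a sketch; image names are made up, and the window
size is just the value mentioned above):

    qemu-img convert -O raw input.img \
        'rbd:rbd/output:rbd_writeback_window=800'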

Thanks!
sage



Re: [Qemu-devel] [PATCH 0/2] improve qemu-img conversion performance

2011-09-08 Thread Sage Weil
On Thu, 8 Sep 2011, Kevin Wolf wrote:
 Am 08.09.2011 01:06, schrieb Yehuda Sadeh:
  The following set of patches improve the qemu-img conversion process
  performance. When using a higher latency backend, small writes have a
  severe impact on the time it takes to do image conversion. 
  We switch to using async writes, and we avoid splitting writes due to
  holes when the holes are small enough.
  
  Yehuda Sadeh (2):
qemu-img: async write to block device when converting image
qemu-img: don't skip writing small holes
  
   qemu-img.c |   34 +++---
   1 files changed, 27 insertions(+), 7 deletions(-)
  
 
 This doesn't seem to be against git master or the block tree. Please rebase.
 
 I think that commit a22f123c may obsolete your patch 2/2.

With git.kernel.org down, where should I be looking for the latest 
upstream?

Thanks!
sage




Re: [Qemu-devel] [PATCH 0/2] improve qemu-img conversion performance

2011-09-08 Thread Sage Weil
On Thu, 8 Sep 2011, Stefan Hajnoczi wrote:
 On Wed, Sep 07, 2011 at 04:06:51PM -0700, Yehuda Sadeh wrote:
  The following set of patches improve the qemu-img conversion process
  performance. When using a higher latency backend, small writes have a
  severe impact on the time it takes to do image conversion. 
  We switch to using async writes, and we avoid splitting writes due to
  holes when the holes are small enough.
  
  Yehuda Sadeh (2):
qemu-img: async write to block device when converting image
qemu-img: don't skip writing small holes
  
   qemu-img.c |   34 +++---
   1 files changed, 27 insertions(+), 7 deletions(-)
  
  -- 
  2.7.5.1
 
 This has nothing to do with the patch itself, but I've been curious
 about the existence of both a QEMU and a Linux kernel rbd block driver.
 
 The I/O latency with qemu-img has been an issue for rbd users.  But they
 have the option of using the Linux kernel rbd block driver, where
 qemu-img can take advantage of the page cache instead of performing
 direct I/O.

 Does this mean you intend to support both QEMU block/rbd.c and Linux
 drivers/block/rbd.c?  As a user I would go with the Linux kernel driver
 instead of the QEMU block driver because it offers page cache and host
 block device features.  On the other hand a userspace driver is nice
 because it does not require privileges.

We intend to support both drivers, yes.  The native qemu driver is 
generally more convenient because there is no kernel dependency, so we 
want to make qemu-img perform reasonably one way or another.
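
To make the difference concrete, the two paths for something like qemu-img look
roughly like this (a sketch; pool/image names are made up, and the exact
mapping command depends on the rbd tool version):

    # kernel rbd: the image becomes a host block device, so qemu-img goes
    # through the page cache like any other file
    rbd map rbd/vm1                  # exposes it as /dev/rbd0 (naming may vary)
    qemu-img convert -O raw input.img /dev/rbd0

    # librbd: no mapping step or extra privileges, qemu-img writes directly
    qemu-img convert -O raw input.img rbd:rbd/vm1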

There are plans to implement some limited buffering (and flush) in librbd 
to make the device behave a bit more like a disk with a cache.  That will 
mask the sync write latency, but I suspect that doing these writes using 
the aio interface (and ignoring small holes) will help everyone...

sage



[Qemu-devel] [PATCH 1/3] rbd: allow client id to be specified in config string

2011-09-07 Thread Sage Weil
Allow the client id to be specified in the config string via 'id=' so that
users can control who they authenticate as.  Currently they are stuck with
the default ('admin').  This is necessary for anyone using authentication
in their environment.
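
For example (a sketch; the pool, image, and user names are made up), a guest
that should authenticate as client.qemu rather than client.admin would use
something like:

    -drive 'file=rbd:rbd/vm1:id=qemu:conf=/etc/ceph/ceph.conf,if=virtio'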

Signed-off-by: Sage Weil s...@newdream.net
---
 block/rbd.c |   52 
 1 files changed, 44 insertions(+), 8 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index ce0f6ef..6135fc1 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -169,6 +169,34 @@ done:
 return ret;
 }
 
+static char *qemu_rbd_parse_clientname(const char *conf, char *clientname)
+{
+const char *p = conf;
+
+while (*p) {
+int len;
+const char *end = strchr(p, ':');
+
+if (end) {
+len = end - p;
+} else {
+len = strlen(p);
+}
+
+if (strncmp(p, "id=", 3) == 0) {
+len -= 3;
+strncpy(clientname, p + 3, len);
+clientname[len] = '\0';
+return clientname;
+}
+if (end == NULL) {
+break;
+}
+p = end + 1;
+}
+return NULL;
+}
+
 static int qemu_rbd_set_conf(rados_t cluster, const char *conf)
 {
 char *p, *buf;
@@ -198,17 +226,19 @@ static int qemu_rbd_set_conf(rados_t cluster, const char 
*conf)
 break;
 }
 
-if (strcmp(name, "conf")) {
-ret = rados_conf_set(cluster, name, value);
+if (strcmp(name, "conf") == 0) {
+ret = rados_conf_read_file(cluster, value);
 if (ret < 0) {
-error_report("invalid conf option %s", name);
-ret = -EINVAL;
+error_report("error reading conf file %s", value);
 break;
 }
+} else if (strcmp(name, "id") == 0) {
+/* ignore, this is parsed by qemu_rbd_parse_clientname() */
+} else {
-ret = rados_conf_read_file(cluster, value);
+ret = rados_conf_set(cluster, name, value);
 if (ret < 0) {
-error_report("error reading conf file %s", value);
+error_report("invalid conf option %s", name);
+ret = -EINVAL;
 break;
 }
 }
 }
@@ -227,6 +257,8 @@ static int qemu_rbd_create(const char *filename, 
QEMUOptionParameter *options)
 char name[RBD_MAX_IMAGE_NAME_SIZE];
 char snap_buf[RBD_MAX_SNAP_NAME_SIZE];
 char conf[RBD_MAX_CONF_SIZE];
+char clientname_buf[RBD_MAX_CONF_SIZE];
+char *clientname;
 rados_t cluster;
 rados_ioctx_t io_ctx;
 int ret;
@@ -259,7 +291,8 @@ static int qemu_rbd_create(const char *filename, 
QEMUOptionParameter *options)
 options++;
 }
 
-if (rados_create(cluster, NULL)  0) {
+clientname = qemu_rbd_parse_clientname(conf, clientname_buf);
+if (rados_create(cluster, clientname)  0) {
 error_report(error initializing);
 return -EIO;
 }
@@ -385,6 +418,8 @@ static int qemu_rbd_open(BlockDriverState *bs, const char 
*filename, int flags)
 char pool[RBD_MAX_POOL_NAME_SIZE];
 char snap_buf[RBD_MAX_SNAP_NAME_SIZE];
 char conf[RBD_MAX_CONF_SIZE];
+char clientname_buf[RBD_MAX_CONF_SIZE];
+char *clientname;
 int r;
 
 if (qemu_rbd_parsename(filename, pool, sizeof(pool),
@@ -398,7 +433,8 @@ static int qemu_rbd_open(BlockDriverState *bs, const char 
*filename, int flags)
 s-snap = g_strdup(snap_buf);
 }
 
-r = rados_create(s-cluster, NULL);
+clientname = qemu_rbd_parse_clientname(conf, clientname_buf);
+r = rados_create(s-cluster, clientname);
 if (r  0) {
 error_report(error initializing);
 return r;
-- 
1.7.2.5




[Qemu-devel] [PATCH 2/3] rbd: clean up, fix style

2011-09-07 Thread Sage Weil
No assignment in condition.  Remove duplicate ret > 0 check.

Signed-off-by: Sage Weil s...@newdream.net
---
 block/rbd.c |   17 -
 1 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index 6135fc1..2763092 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -391,15 +391,14 @@ static void qemu_rbd_aio_event_reader(void *opaque)
 char *p = (char *)s->event_rcb;
 
 /* now read the rcb pointer that was sent from a non qemu thread */
-if ((ret = read(s->fds[RBD_FD_READ], p + s->event_reader_pos,
-sizeof(s->event_rcb) - s->event_reader_pos)) > 0) {
-if (ret > 0) {
-s->event_reader_pos += ret;
-if (s->event_reader_pos == sizeof(s->event_rcb)) {
-s->event_reader_pos = 0;
-qemu_rbd_complete_aio(s->event_rcb);
-s->qemu_aio_count--;
-}
+ret = read(s->fds[RBD_FD_READ], p + s->event_reader_pos,
+   sizeof(s->event_rcb) - s->event_reader_pos);
+if (ret > 0) {
+s->event_reader_pos += ret;
+if (s->event_reader_pos == sizeof(s->event_rcb)) {
+s->event_reader_pos = 0;
+qemu_rbd_complete_aio(s->event_rcb);
+s->qemu_aio_count--;
 }
 }
 } while (ret < 0 && errno == EINTR);
-- 
1.7.2.5




[Qemu-devel] [PATCH 3/3] rbd: fix leak in qemu_rbd_open failure paths

2011-09-07 Thread Sage Weil
Fix leak of s-snap in failure path.  Simplify error paths for the whole
function.

Reported-by: Stefan Hajnoczi stefa...@gmail.com
Signed-off-by: Sage Weil s...@newdream.net
---
 block/rbd.c |   28 +---
 1 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index 2763092..1b78d51 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -427,10 +427,6 @@ static int qemu_rbd_open(BlockDriverState *bs, const char 
*filename, int flags)
conf, sizeof(conf))  0) {
 return -EINVAL;
 }
-s-snap = NULL;
-if (snap_buf[0] != '\0') {
-s-snap = g_strdup(snap_buf);
-}
 
 clientname = qemu_rbd_parse_clientname(conf, clientname_buf);
 r = rados_create(s-cluster, clientname);
@@ -439,12 +435,16 @@ static int qemu_rbd_open(BlockDriverState *bs, const char 
*filename, int flags)
 return r;
 }
 
+s-snap = NULL;
+if (snap_buf[0] != '\0') {
+s-snap = g_strdup(snap_buf);
+}
+
 if (strstr(conf, conf=) == NULL) {
 r = rados_conf_read_file(s-cluster, NULL);
 if (r  0) {
 error_report(error reading config file);
-rados_shutdown(s-cluster);
-return r;
+goto failed_shutdown;
 }
 }
 
@@ -452,31 +452,26 @@ static int qemu_rbd_open(BlockDriverState *bs, const char 
*filename, int flags)
 r = qemu_rbd_set_conf(s-cluster, conf);
 if (r  0) {
 error_report(error setting config options);
-rados_shutdown(s-cluster);
-return r;
+goto failed_shutdown;
 }
 }
 
 r = rados_connect(s-cluster);
 if (r  0) {
 error_report(error connecting);
-rados_shutdown(s-cluster);
-return r;
+goto failed_shutdown;
 }
 
 r = rados_ioctx_create(s-cluster, pool, s-io_ctx);
 if (r  0) {
 error_report(error opening pool %s, pool);
-rados_shutdown(s-cluster);
-return r;
+goto failed_shutdown;
 }
 
 r = rbd_open(s-io_ctx, s-name, s-image, s-snap);
 if (r  0) {
 error_report(error reading header from %s, s-name);
-rados_ioctx_destroy(s-io_ctx);
-rados_shutdown(s-cluster);
-return r;
+goto failed_open;
 }
 
 bs-read_only = (s-snap != NULL);
@@ -497,8 +492,11 @@ static int qemu_rbd_open(BlockDriverState *bs, const char 
*filename, int flags)
 
 failed:
 rbd_close(s-image);
+failed_open:
 rados_ioctx_destroy(s-io_ctx);
+failed_shutdown:
 rados_shutdown(s-cluster);
+g_free(s-snap);
 return r;
 }
 
-- 
1.7.2.5




Re: [Qemu-devel] [PATCH 1/2] qemu-img: async write to block device when converting image

2011-09-07 Thread Sage Weil
On Wed, 7 Sep 2011, Yehuda Sadeh wrote:
 In order to improve the image conversion process, instead of synchronously
 writing the destination image, we keep a window of async writes.
 
 Signed-off-by: Yehuda Sadeh yeh...@hq.newdream.net

Small fix below:

 ---
  qemu-img.c |   28 +++-
  1 files changed, 23 insertions(+), 5 deletions(-)
 
 diff --git a/qemu-img.c b/qemu-img.c
 index b205e98..0552746 100644
 --- a/qemu-img.c
 +++ b/qemu-img.c
 @@ -622,6 +622,17 @@ static int compare_sectors(const uint8_t *buf1, const 
 uint8_t *buf2, int n,
  }
  
  #define IO_BUF_SIZE (2 * 1024 * 1024)
 +#define IO_WRITE_WINDOW_THRESHOLD (32 * 1024 * 1024)
 +
 +static int write_window = 0;
 +
 +static void img_write_cb(void *opaque, int ret)
 +{
 +QEMUIOVector *qiov = (QEMUIOVector *)opaque;
 +write_window -=  qiov->iov->iov_len / 512;
 +qemu_iovec_destroy(qiov);
 +qemu_free(qiov);
 +}
  
  static int img_convert(int argc, char **argv)
  {
 @@ -980,6 +991,9 @@ static int img_convert(int argc, char **argv)
 should add a specific call to have the info to go faster */
  buf1 = buf;
  while (n  0) {
 +while (write_window > IO_WRITE_WINDOW_THRESHOLD / 512) {
 +qemu_aio_wait();
 +}
  /* If the output image is being created as a copy on write 
 image,
 copy all sectors even the ones containing only NUL bytes,
 because they may differ from the sectors in the base 
 image.
 @@ -989,11 +1003,11 @@ static int img_convert(int argc, char **argv)
 already there is garbage, not 0s. */
  if (!has_zero_init || out_baseimg ||
  is_allocated_sectors(buf1, n, n1)) {
 -ret = bdrv_write(out_bs, sector_num, buf1, n1);
 -if (ret < 0) {
 -error_report("error while writing");
 -goto out;
 -}
 +QEMUIOVector *qiov = qemu_mallocz(sizeof(QEMUIOVector));
 + qemu_iovec_init(qiov, 1);
 + qemu_iovec_add(qiov, (void *)buf1, n1 * 512);
 +bdrv_aio_writev(out_bs, sector_num, qiov, n1, 
 img_write_cb, qiov);
 +write_window += n1;
  }
  sector_num += n1;
  n -= n1;
 @@ -1001,11 +1015,15 @@ static int img_convert(int argc, char **argv)
  }
  qemu_progress_print(local_progress, 100);
  }
 +while (write_window > 0) {
 +qemu_aio_wait();
 +}
  }
  out:
  qemu_progress_end();
  free_option_parameters(create_options);
  free_option_parameters(param);
 +bdrv_flush(out_bs);
  qemu_free(buf);
  if (out_bs) {
  bdrv_delete(out_bs);

The bdrv_flush() needs to go inside the if or else we get a null 
dereference on error (e.g. from a bad image name).
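
I.e. the cleanup path would end up looking something like this (sketch):

    if (out_bs) {
        bdrv_flush(out_bs);   /* out_bs can be NULL if opening the output failed */
        bdrv_delete(out_bs);
    }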

sage


 -- 
 1.7.5.1
 
 
 



Re: [Qemu-devel] [PATCH v2] rbd: fix leak in qemu_rbd_open failure paths

2011-09-04 Thread Sage Weil
On Sun, 4 Sep 2011, Stefan Hajnoczi wrote:
 On Sat, Sep 3, 2011 at 11:04 PM, Sage Weil s...@newdream.net wrote:
  +failed_shutdown:
      rados_shutdown(s-cluster);
  +    qemu_free(s-snap);
 
 Sorry for being a pain here.  This patch is against an old qemu.git
 tree.  All memory allocation is now using glib's g_malloc()/g_free().

Heh, no problem.  We've been working off the last release to avoid 
worrying about transient instability in master.  Resending!

sage

[Qemu-devel] [PATCH v3] rbd: fix leak in qemu_rbd_open failure paths

2011-09-04 Thread Sage Weil
Fix leak of s-snap in failure path.  Simplify error paths for the whole
function.

Reported-by: Stefan Hajnoczi stefa...@gmail.com
Signed-off-by: Sage Weil s...@newdream.net
---
 block/rbd.c |   28 +---
 1 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index 2763092..9f32211 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -427,10 +427,6 @@ static int qemu_rbd_open(BlockDriverState *bs, const char 
*filename, int flags)
conf, sizeof(conf))  0) {
 return -EINVAL;
 }
-s-snap = NULL;
-if (snap_buf[0] != '\0') {
-s-snap = g_strdup(snap_buf);
-}
 
 clientname = qemu_rbd_parse_clientname(conf, clientname_buf);
 r = rados_create(s-cluster, clientname);
@@ -439,12 +435,16 @@ static int qemu_rbd_open(BlockDriverState *bs, const char 
*filename, int flags)
 return r;
 }
 
+s-snap = NULL;
+if (snap_buf[0] != '\0') {
+s-snap = g_strdup(snap_buf);
+}
+
 if (strstr(conf, conf=) == NULL) {
 r = rados_conf_read_file(s-cluster, NULL);
 if (r  0) {
 error_report(error reading config file);
-rados_shutdown(s-cluster);
-return r;
+   goto failed_shutdown;
 }
 }
 
@@ -452,31 +452,26 @@ static int qemu_rbd_open(BlockDriverState *bs, const char 
*filename, int flags)
 r = qemu_rbd_set_conf(s-cluster, conf);
 if (r  0) {
 error_report(error setting config options);
-rados_shutdown(s-cluster);
-return r;
+   goto failed_shutdown;
 }
 }
 
 r = rados_connect(s-cluster);
 if (r  0) {
 error_report(error connecting);
-rados_shutdown(s-cluster);
-return r;
+   goto failed_shutdown;
 }
 
 r = rados_ioctx_create(s-cluster, pool, s-io_ctx);
 if (r  0) {
 error_report(error opening pool %s, pool);
-rados_shutdown(s-cluster);
-return r;
+   goto failed_shutdown;
 }
 
 r = rbd_open(s-io_ctx, s-name, s-image, s-snap);
 if (r  0) {
 error_report(error reading header from %s, s-name);
-rados_ioctx_destroy(s-io_ctx);
-rados_shutdown(s-cluster);
-return r;
+   goto failed_open;
 }
 
 bs-read_only = (s-snap != NULL);
@@ -497,8 +492,11 @@ static int qemu_rbd_open(BlockDriverState *bs, const char 
*filename, int flags)
 
 failed:
 rbd_close(s-image);
+failed_open:
 rados_ioctx_destroy(s-io_ctx);
+failed_shutdown:
 rados_shutdown(s-cluster);
+g_free(s-snap);
 return r;
 }
 
-- 
1.7.2.5




[Qemu-devel] [PATCH v2] rbd: fix leak in qemu_rbd_open failure paths

2011-09-03 Thread Sage Weil
Fix leak of s-snap in failure path.  Simplify error paths for the whole
function.

Reported-by: Stefan Hajnoczi stefa...@gmail.com
Signed-off-by: Sage Weil s...@newdream.net
---
v2: fixes all error paths, not just the first one. 

 block/rbd.c |   28 +---
 1 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index eaf7912..2f0733e 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -427,10 +427,6 @@ static int qemu_rbd_open(BlockDriverState *bs, const char 
*filename, int flags)
conf, sizeof(conf))  0) {
 return -EINVAL;
 }
-s-snap = NULL;
-if (snap_buf[0] != '\0') {
-s-snap = qemu_strdup(snap_buf);
-}
 
 clientname = qemu_rbd_parse_clientname(conf, clientname_buf);
 r = rados_create(s-cluster, clientname);
@@ -439,12 +435,16 @@ static int qemu_rbd_open(BlockDriverState *bs, const char 
*filename, int flags)
 return r;
 }
 
+s-snap = NULL;
+if (snap_buf[0] != '\0') {
+s-snap = qemu_strdup(snap_buf);
+}
+
 if (strstr(conf, conf=) == NULL) {
 r = rados_conf_read_file(s-cluster, NULL);
 if (r  0) {
 error_report(error reading config file);
-rados_shutdown(s-cluster);
-return r;
+   goto failed_shutdown;
 }
 }
 
@@ -452,31 +452,26 @@ static int qemu_rbd_open(BlockDriverState *bs, const char 
*filename, int flags)
 r = qemu_rbd_set_conf(s-cluster, conf);
 if (r  0) {
 error_report(error setting config options);
-rados_shutdown(s-cluster);
-return r;
+   goto failed_shutdown;
 }
 }
 
 r = rados_connect(s-cluster);
 if (r  0) {
 error_report(error connecting);
-rados_shutdown(s-cluster);
-return r;
+   goto failed_shutdown;
 }
 
 r = rados_ioctx_create(s-cluster, pool, s-io_ctx);
 if (r  0) {
 error_report(error opening pool %s, pool);
-rados_shutdown(s-cluster);
-return r;
+   goto failed_shutdown;
 }
 
 r = rbd_open(s-io_ctx, s-name, s-image, s-snap);
 if (r  0) {
 error_report(error reading header from %s, s-name);
-rados_ioctx_destroy(s-io_ctx);
-rados_shutdown(s-cluster);
-return r;
+   goto failed_open;
 }
 
 bs-read_only = (s-snap != NULL);
@@ -497,8 +492,11 @@ static int qemu_rbd_open(BlockDriverState *bs, const char 
*filename, int flags)
 
 failed:
 rbd_close(s-image);
+failed_open:
 rados_ioctx_destroy(s-io_ctx);
+failed_shutdown:
 rados_shutdown(s-cluster);
+qemu_free(s-snap);
 return r;
 }
 
-- 
1.7.2.5




[Qemu-devel] [PATCH 2/3] rbd: allow client id to be specified in config string

2011-08-23 Thread Sage Weil
Allow the client id to be specified in the config string via 'id=' so that
users can control who they authenticate as.  Currently they are stuck with
the default ('admin').  This is necessary for anyone using authentication
in their environment.

Signed-off-by: Sage Weil s...@newdream.net
---
 block/rbd.c |   52 
 1 files changed, 44 insertions(+), 8 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index 52b79fa..e239e04 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -169,6 +169,34 @@ done:
 return ret;
 }
 
+static char *qemu_rbd_parse_clientname(const char *conf, char *clientname)
+{
+const char *p = conf;
+
+while (*p) {
+int len;
+const char *end = strchr(p, ':');
+
+if (end) {
+len = end - p;
+} else {
+len = strlen(p);
+}
+
+if (strncmp(p, id=, 3) == 0) {
+len -= 3;
+strncpy(clientname, p + 3, len);
+clientname[len] = '\0';
+return clientname;
+}
+if (end == NULL) {
+break;
+}
+p = end + 1;
+}
+return NULL;
+}
+
 static int qemu_rbd_set_conf(rados_t cluster, const char *conf)
 {
 char *p, *buf;
@@ -198,17 +226,19 @@ static int qemu_rbd_set_conf(rados_t cluster, const char 
*conf)
 break;
 }
 
-if (strcmp(name, conf)) {
-ret = rados_conf_set(cluster, name, value);
+if (strcmp(name, conf) == 0) {
+ret = rados_conf_read_file(cluster, value);
 if (ret  0) {
-error_report(invalid conf option %s, name);
-ret = -EINVAL;
+error_report(error reading conf file %s, value);
 break;
 }
+} else if (strcmp(name, id) == 0) {
+/* ignore, this is parsed by qemu_rbd_parse_clientname() */
 } else {
-ret = rados_conf_read_file(cluster, value);
+ret = rados_conf_set(cluster, name, value);
 if (ret  0) {
-error_report(error reading conf file %s, value);
+error_report(invalid conf option %s, name);
+ret = -EINVAL;
 break;
 }
 }
@@ -227,6 +257,8 @@ static int qemu_rbd_create(const char *filename, 
QEMUOptionParameter *options)
 char name[RBD_MAX_IMAGE_NAME_SIZE];
 char snap_buf[RBD_MAX_SNAP_NAME_SIZE];
 char conf[RBD_MAX_CONF_SIZE];
+char clientname_buf[RBD_MAX_CONF_SIZE];
+char *clientname;
 rados_t cluster;
 rados_ioctx_t io_ctx;
 int ret;
@@ -259,7 +291,8 @@ static int qemu_rbd_create(const char *filename, 
QEMUOptionParameter *options)
 options++;
 }
 
-if (rados_create(cluster, NULL)  0) {
+clientname = qemu_rbd_parse_clientname(conf, clientname_buf);
+if (rados_create(cluster, clientname)  0) {
 error_report(error initializing);
 return -EIO;
 }
@@ -385,6 +418,8 @@ static int qemu_rbd_open(BlockDriverState *bs, const char 
*filename, int flags)
 char pool[RBD_MAX_POOL_NAME_SIZE];
 char snap_buf[RBD_MAX_SNAP_NAME_SIZE];
 char conf[RBD_MAX_CONF_SIZE];
+char clientname_buf[RBD_MAX_CONF_SIZE];
+char *clientname;
 int r;
 
 if (qemu_rbd_parsename(filename, pool, sizeof(pool),
@@ -394,7 +429,8 @@ static int qemu_rbd_open(BlockDriverState *bs, const char 
*filename, int flags)
 return -EINVAL;
 }
 
-r = rados_create(s-cluster, NULL);
+clientname = qemu_rbd_parse_clientname(conf, clientname_buf);
+r = rados_create(s-cluster, clientname);
 if (r  0) {
 error_report(error initializing);
 return r;
-- 
1.7.0




[Qemu-devel] [PATCH 3/3] rbd: clean up, fix style

2011-08-23 Thread Sage Weil
No assignment in condition.  Remove duplicate ret > 0 check.

Signed-off-by: Sage Weil s...@newdream.net
---
 block/rbd.c |   17 -
 1 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index e239e04..5423a2d 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -391,15 +391,14 @@ static void qemu_rbd_aio_event_reader(void *opaque)
 char *p = (char *)s-event_rcb;
 
 /* now read the rcb pointer that was sent from a non qemu thread */
-if ((ret = read(s-fds[RBD_FD_READ], p + s-event_reader_pos,
-sizeof(s-event_rcb) - s-event_reader_pos))  0) {
-if (ret  0) {
-s-event_reader_pos += ret;
-if (s-event_reader_pos == sizeof(s-event_rcb)) {
-s-event_reader_pos = 0;
-qemu_rbd_complete_aio(s-event_rcb);
-s-qemu_aio_count--;
-}
+ret = read(s-fds[RBD_FD_READ], p + s-event_reader_pos,
+   sizeof(s-event_rcb) - s-event_reader_pos);
+if (ret  0) {
+s-event_reader_pos += ret;
+if (s-event_reader_pos == sizeof(s-event_rcb)) {
+s-event_reader_pos = 0;
+qemu_rbd_complete_aio(s-event_rcb);
+s-qemu_aio_count--;
 }
 }
 } while (ret  0  errno == EINTR);
-- 
1.7.0




[Qemu-devel] [PATCH 1/3] rbd: fix leak in failure path

2011-08-23 Thread Sage Weil
Fix leak of s-snap when rados_create fails.

Reported-by: Stefan Hajnoczi stefa...@gmail.com
Signed-off-by: Sage Weil s...@newdream.net
---
 block/rbd.c |9 +
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index d5659cd..52b79fa 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -393,10 +393,6 @@ static int qemu_rbd_open(BlockDriverState *bs, const char 
*filename, int flags)
conf, sizeof(conf))  0) {
 return -EINVAL;
 }
-s-snap = NULL;
-if (snap_buf[0] != '\0') {
-s-snap = qemu_strdup(snap_buf);
-}
 
 r = rados_create(s-cluster, NULL);
 if (r  0) {
@@ -404,6 +400,11 @@ static int qemu_rbd_open(BlockDriverState *bs, const char 
*filename, int flags)
 return r;
 }
 
+s-snap = NULL;
+if (snap_buf[0] != '\0') {
+s-snap = qemu_strdup(snap_buf);
+}
+
 if (strstr(conf, conf=) == NULL) {
 r = rados_conf_read_file(s-cluster, NULL);
 if (r  0) {
-- 
1.7.0




[Qemu-devel] [PATCH] rbd: allow client id to be specified in config string

2011-08-22 Thread Sage Weil
Allow the client id to be specified in the config string via 'id=' so that
users can control who they authenticate as.  Currently they are stuck with
the default ('admin').  This is necessary for anyone using authentication
in their environment.

Signed-off-by: Sage Weil s...@newdream.net
Reviewed-by: Yehuda Sadeh yeh...@newdream.net
---
 block/rbd.c |   64 ++
 1 files changed, 55 insertions(+), 9 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index d5659cd..a888da0 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -169,6 +169,38 @@ done:
 return ret;
 }
 
+static int qemu_rbd_parse_clientname(const char *conf, char *clientname,
+ size_t maxlen)
+{
+const char *p = conf;
+
+while (*p) {
+   int len;
+   const char *end = strchr(p, ':');
+
+   if (end) {
+   len = end - p;
+   } else {
+   len = strlen(p);
+   }
+
+   if (strncmp(p, id=, 3) == 0) {
+   len -= 3;
+   if (len = maxlen) {
+   return -ERANGE;
+   }
+   strncpy(clientname, p + 3, len);
+   clientname[len] = '\0';
+   return 0;
+   }
+   if (end == NULL) {
+   break;
+   }
+   p = end + 1;
+}
+return 0;
+}
+
 static int qemu_rbd_set_conf(rados_t cluster, const char *conf)
 {
 char *p, *buf;
@@ -198,17 +230,19 @@ static int qemu_rbd_set_conf(rados_t cluster, const char 
*conf)
 break;
 }
 
-if (strcmp(name, conf)) {
-ret = rados_conf_set(cluster, name, value);
+if (strcmp(name, conf) == 0) {
+ret = rados_conf_read_file(cluster, value);
 if (ret  0) {
-error_report(invalid conf option %s, name);
-ret = -EINVAL;
+error_report(error reading conf file %s, value);
 break;
 }
-} else {
-ret = rados_conf_read_file(cluster, value);
+   } else if (strcmp(name, id) == 0) {
+   /* ignore, this is parsed by qemu_rbd_parse_clientname() */
+   } else {
+ret = rados_conf_set(cluster, name, value);
 if (ret  0) {
-error_report(error reading conf file %s, value);
+error_report(invalid conf option %s, name);
+ret = -EINVAL;
 break;
 }
 }
@@ -227,6 +261,7 @@ static int qemu_rbd_create(const char *filename, 
QEMUOptionParameter *options)
 char name[RBD_MAX_IMAGE_NAME_SIZE];
 char snap_buf[RBD_MAX_SNAP_NAME_SIZE];
 char conf[RBD_MAX_CONF_SIZE];
+char clientname[RBD_MAX_CONF_SIZE];
 rados_t cluster;
 rados_ioctx_t io_ctx;
 int ret;
@@ -259,7 +294,12 @@ static int qemu_rbd_create(const char *filename, 
QEMUOptionParameter *options)
 options++;
 }
 
-if (rados_create(cluster, NULL)  0) {
+ret = qemu_rbd_parse_clientname(conf, clientname, sizeof(clientname));
+if (ret  0) {
+   error_report(rbd client name is too long);
+   return ret;
+}
+if (rados_create(cluster, clientname[0] ? clientname : NULL)  0) {
 error_report(error initializing);
 return -EIO;
 }
@@ -385,6 +425,7 @@ static int qemu_rbd_open(BlockDriverState *bs, const char 
*filename, int flags)
 char pool[RBD_MAX_POOL_NAME_SIZE];
 char snap_buf[RBD_MAX_SNAP_NAME_SIZE];
 char conf[RBD_MAX_CONF_SIZE];
+char clientname[RBD_MAX_CONF_SIZE];
 int r;
 
 if (qemu_rbd_parsename(filename, pool, sizeof(pool),
@@ -398,7 +439,12 @@ static int qemu_rbd_open(BlockDriverState *bs, const char 
*filename, int flags)
 s-snap = qemu_strdup(snap_buf);
 }
 
-r = rados_create(s-cluster, NULL);
+r = qemu_rbd_parse_clientname(conf, clientname, sizeof(clientname));
+if (r  0) {
+   error_report(rbd client name is too long);
+   return r;
+}
+r = rados_create(s-cluster, clientname[0] ? clientname : NULL);
 if (r  0) {
 error_report(error initializing);
 return r;
-- 
1.7.0




[Qemu-devel] [PATCH v2] rbd: allow client id to be specified in config string

2011-08-22 Thread Sage Weil
Allow the client id to be specified in the config string via 'id=' so that
users can control who they authenticate as.  Currently they are stuck with
the default ('admin').  This is necessary for anyone using authentication
in their environment.

Signed-off-by: Sage Weil s...@newdream.net
---
v2: whoops, fixed default case where id= is not specified.

 block/rbd.c |   54 +-
 1 files changed, 45 insertions(+), 9 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index d5659cd..b0cd750 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -169,6 +169,34 @@ done:
 return ret;
 }
 
+static char *qemu_rbd_parse_clientname(const char *conf, char *clientname)
+{
+const char *p = conf;
+
+while (*p) {
+   int len;
+   const char *end = strchr(p, ':');
+
+   if (end) {
+   len = end - p;
+   } else {
+   len = strlen(p);
+   }
+
+   if (strncmp(p, id=, 3) == 0) {
+   len -= 3;
+   strncpy(clientname, p + 3, len);
+   clientname[len] = '\0';
+   return clientname;
+   }
+   if (end == NULL) {
+   break;
+   }
+   p = end + 1;
+}
+return NULL;
+}
+
 static int qemu_rbd_set_conf(rados_t cluster, const char *conf)
 {
 char *p, *buf;
@@ -198,17 +226,19 @@ static int qemu_rbd_set_conf(rados_t cluster, const char 
*conf)
 break;
 }
 
-if (strcmp(name, conf)) {
-ret = rados_conf_set(cluster, name, value);
+if (strcmp(name, conf) == 0) {
+ret = rados_conf_read_file(cluster, value);
 if (ret  0) {
-error_report(invalid conf option %s, name);
-ret = -EINVAL;
+error_report(error reading conf file %s, value);
 break;
 }
-} else {
-ret = rados_conf_read_file(cluster, value);
+   } else if (strcmp(name, id) == 0) {
+   /* ignore, this is parsed by qemu_rbd_parse_clientname() */
+   } else {
+ret = rados_conf_set(cluster, name, value);
 if (ret  0) {
-error_report(error reading conf file %s, value);
+error_report(invalid conf option %s, name);
+ret = -EINVAL;
 break;
 }
 }
@@ -227,6 +257,8 @@ static int qemu_rbd_create(const char *filename, 
QEMUOptionParameter *options)
 char name[RBD_MAX_IMAGE_NAME_SIZE];
 char snap_buf[RBD_MAX_SNAP_NAME_SIZE];
 char conf[RBD_MAX_CONF_SIZE];
+char clientname_buf[RBD_MAX_CONF_SIZE];
+char *clientname;
 rados_t cluster;
 rados_ioctx_t io_ctx;
 int ret;
@@ -259,7 +291,8 @@ static int qemu_rbd_create(const char *filename, 
QEMUOptionParameter *options)
 options++;
 }
 
-if (rados_create(cluster, NULL)  0) {
+clientname = qemu_rbd_parse_clientname(conf, clientname_buf);
+if (rados_create(cluster, clientname)  0) {
 error_report(error initializing);
 return -EIO;
 }
@@ -385,6 +418,8 @@ static int qemu_rbd_open(BlockDriverState *bs, const char 
*filename, int flags)
 char pool[RBD_MAX_POOL_NAME_SIZE];
 char snap_buf[RBD_MAX_SNAP_NAME_SIZE];
 char conf[RBD_MAX_CONF_SIZE];
+char clientname_buf[RBD_MAX_CONF_SIZE];
+char *clientname;
 int r;
 
 if (qemu_rbd_parsename(filename, pool, sizeof(pool),
@@ -398,7 +433,8 @@ static int qemu_rbd_open(BlockDriverState *bs, const char 
*filename, int flags)
 s-snap = qemu_strdup(snap_buf);
 }
 
-r = rados_create(s-cluster, NULL);
+clientname = qemu_rbd_parse_clientname(conf, clientname_buf);
+r = rados_create(s-cluster, clientname);
 if (r  0) {
 error_report(error initializing);
 return r;
-- 
1.7.0




Re: [Qemu-devel] Qemu + RBD = ceph::buffer::end_of_buffer

2011-05-06 Thread Sage Weil
On Fri, 6 May 2011, Dyweni - Qemu-Devel wrote:
 Hi Josh/Lists!
 
 463 ::decode(*data_bl, iter);
 (gdb) print r
 $1 = 0
 (gdb) print data_bl
 $2 = (ceph::bufferlist *) 0x7f16f40d6060
 (gdb) print data_bl->_len
 $3 = 0

What about c->bl._len?

sage


 (gdb) print iter->off
 $4 = 20
 
 
 Thanks,
 Dyweni
 
 
 
  CCing the ceph list.
 
  On 05/06/2011 12:23 PM, Dyweni - Qemu-Devel wrote:
  Hi List!
 
  I upgraded Ceph to the latest development version
 Commit: 0edbc75a5fe8c3028faf85546f3264d28653ea3f
 Pulled from: git://ceph.newdream.net/ceph.git
 
  I recompiled the latest GIT version of QEMU-KVM (with Josh Durgin's
  patches) against the latest git version of Ceph.
 
  However, this error is still occurring:
 
  terminate called after throwing an instance of
  'ceph::buffer::end_of_buffer'
 what():  buffer::end_of_buffer
  Aborted (core dumped)
 
 
 
  Here's another backtrace from GDB:
 
  #0  0x7f16ff829495 in raise (sig=value optimized out) at
  ../nptl/sysdeps/unix/sysv/linux/raise.c:64
  #1  0x7f16ff82a81f in abort () at abort.c:92
  #2  0x7f16fed74a25 in __gnu_cxx::__verbose_terminate_handler () at
  /usr/src/debug/sys-devel/gcc-4.4.5/gcc-4.4.5/libstdc++-v3/libsupc++/vterminate.cc:93
  #3  0x7f16fed71c64 in __cxxabiv1::__terminate
  (handler=0x7f16fed74817
  __gnu_cxx::__verbose_terminate_handler())
   at
  /usr/src/debug/sys-devel/gcc-4.4.5/gcc-4.4.5/libstdc++-v3/libsupc++/eh_terminate.cc:38
  #4  0x7f16fed71c8c in std::terminate () at
  /usr/src/debug/sys-devel/gcc-4.4.5/gcc-4.4.5/libstdc++-v3/libsupc++/eh_terminate.cc:48
  #5  0x7f16fed71ea4 in __cxxabiv1::__cxa_throw (obj=0x1346470,
  tinfo=0x7f1701952ce0, dest=0x7f17017403d4
  ceph::buffer::end_of_buffer::~end_of_buffer())
   at
  /usr/src/debug/sys-devel/gcc-4.4.5/gcc-4.4.5/libstdc++-v3/libsupc++/eh_throw.cc:83
  #6  0x7f1701740a7b in ceph::buffer::list::iterator::copy
  (this=0x7f16fd8b1930, len=4, dest=0x7f16fd8b18dc ) at
  include/buffer.h:379
  #7  0x7f1701743328 in decode_raw__le32  (t=@0x7f16fd8b18dc, p=...)
  at
  include/encoding.h:35
  #8  0x7f170174198a in decode (v=@0x7f16fd8b190c, p=...) at
  include/encoding.h:80
  #9  0x7f1701741ade in decode (s=..., p=...) at
  include/encoding.h:189
  #10 0x7f17012e8369 in
  librados::RadosClient::C_aio_sparse_read_Ack::finish
  (this=0x7f16f40d6200,
  r=0) at librados.cc:463
  #11 0x7f170132bb5a in Objecter::handle_osd_op_reply (this=0x13423e0,
  m=0x1346520) at osdc/Objecter.cc:794
  #12 0x7f17012d1444 in librados::RadosClient::_dispatch
  (this=0x133f810, m=0x1346520) at librados.cc:751
  #13 0x7f17012d1244 in librados::RadosClient::ms_dispatch
  (this=0x133f810, m=0x1346520) at librados.cc:717
  #14 0x7f170131b57b in Messenger::ms_deliver_dispatch
  (this=0x1341910,
  m=0x1346520) at msg/Messenger.h:98
  #15 0x7f17013090d3 in SimpleMessenger::dispatch_entry
  (this=0x1341910)
  at msg/SimpleMessenger.cc:352
  #16 0x7f17012e296e in SimpleMessenger::DispatchThread::entry
  (this=0x1341da0) at msg/SimpleMessenger.h:533
  #17 0x7f170131a39b in Thread::_entry_func (arg=0x1341da0) at
  common/Thread.h:41
  #18 0x7f1701f6dac4 in start_thread (arg=value optimized out) at
  pthread_create.c:297
  #19 0x7f16ff8c838d in clone () at
  ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
 
  I haven't seen that error before, but it's probably a bug in the OSD
  where it doesn't set an error code. If you've still got the core file,
  could you go to frame 10 and send us the values of r, bl._len, and
  iter.off?
 
  Thanks,
  Josh
 
 
 
 
 



Re: [Qemu-devel] Qemu + RBD = ceph::buffer::end_of_buffer

2011-05-06 Thread Sage Weil
f 9  (or 8?)
p n
p s

(BTW this might be faster over irc, #ceph on irc.oftc.net)

Thanks!
sage


On Fri, 6 May 2011, Dyweni - Qemu-Devel wrote:

 Hi Sage/Lists!
 
 
 (gdb) print c->bl._len
 $1 = 20
 
 
 And in case this is helpful:
 
 (gdb) print *c
 $2 = {lock = {name = 0x7f1701430f8d AioCompletionImpl lock, id = -1,
 recursive = false, lockdep = true, backtrace = false, _m = {__data =
 {__lock = 1, __count = 0,
 __owner = 25800, __nusers = 1, __kind = 2, __spins = 0, __list =
 {__prev = 0x0, __next = 0x0}},
   __size =
 \001\000\000\000\000\000\000\000\310d\000\000\001\000\000\000\002,
 '\000' repeats 22 times, __align = 1}, nlock = 1}, cond = {
 _vptr.Cond = 0x7f1701952bd0, _c = {__data = {__lock = 0, __futex = 0,
 __total_seq = 0, __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0,
 __nwaiters = 0,
 __broadcast_seq = 0}, __size = '\000' repeats 47 times, __align
 = 0}}, ref = 1, rval = 0, released = true, ack = true, safe =
 false, objver = {version = 0,
 epoch = 0, __pad = 0}, callback_complete = 0x7f170173de33
 librbd::rados_aio_sparse_read_cb(rados_completion_t, void*),
   callback_safe = 0x7f170173d8bd librbd::rados_cb(rados_completion_t,
 void*), callback_arg = 0x7f16f40d6010, bl = {
 _buffers = {std::_List_baseceph::buffer::ptr,
 std::allocatorceph::buffer::ptr  = {
 _M_impl = {std::allocatorstd::_List_nodeceph::buffer::ptr  =
 {__gnu_cxx::new_allocatorstd::_List_nodeceph::buffer::ptr  =
 {No data fields}, No data fields}, _M_node = {_M_next =
 0x1350530, _M_prev = 0x1350530}}}, No data fields}, _len = 20,
 append_buffer = {_raw = 0x0, _off = 0, _len = 0}, last_p = {
   bl = 0x7f16f40d6170, ls = 0x7f16f40d6170, off = 0, p = {_M_node =
 0x7f16f40d6170}, p_off = 0}}, pbl = 0x0, buf = 0x0, maxlen = 0}
 
 
 
 Thanks,
 Dyweni
 
 
 
 
  On Fri, 6 May 2011, Dyweni - Qemu-Devel wrote:
  Hi Josh/Lists!
 
  463 ::decode(*data_bl, iter);
  (gdb) print r
  $1 = 0
  (gdb) print data_bl
  $2 = (ceph::bufferlist *) 0x7f16f40d6060
  (gdb) print data_bl-_len
  $3 = 0
 
  What about c-bl._len?
 
  sage
 
 
  (gdb) print iter-off
  $4 = 20
 
 
  Thanks,
  Dyweni
 
 
 
   CCing the ceph list.
  
   On 05/06/2011 12:23 PM, Dyweni - Qemu-Devel wrote:
   Hi List!
  
   I upgraded Ceph to the latest development version
   Commit: 0edbc75a5fe8c3028faf85546f3264d28653ea3f
   Pulled from: git://ceph.newdream.net/ceph.git
  
   I recompiled the latest GIT version of QEMU-KVM (with Josh Durgin's
   patches) against the latest git version of Ceph.
  
   However, this error is still occurring:
  
   terminate called after throwing an instance of
   'ceph::buffer::end_of_buffer'
  what():  buffer::end_of_buffer
   Aborted (core dumped)
  
  
  
   Here's another backtrace from GDB:
  
   #0  0x7f16ff829495 in raise (sig=value optimized out) at
   ../nptl/sysdeps/unix/sysv/linux/raise.c:64
   #1  0x7f16ff82a81f in abort () at abort.c:92
   #2  0x7f16fed74a25 in __gnu_cxx::__verbose_terminate_handler ()
  at
   /usr/src/debug/sys-devel/gcc-4.4.5/gcc-4.4.5/libstdc++-v3/libsupc++/vterminate.cc:93
   #3  0x7f16fed71c64 in __cxxabiv1::__terminate
   (handler=0x7f16fed74817
   __gnu_cxx::__verbose_terminate_handler())
at
   /usr/src/debug/sys-devel/gcc-4.4.5/gcc-4.4.5/libstdc++-v3/libsupc++/eh_terminate.cc:38
   #4  0x7f16fed71c8c in std::terminate () at
   /usr/src/debug/sys-devel/gcc-4.4.5/gcc-4.4.5/libstdc++-v3/libsupc++/eh_terminate.cc:48
   #5  0x7f16fed71ea4 in __cxxabiv1::__cxa_throw (obj=0x1346470,
   tinfo=0x7f1701952ce0, dest=0x7f17017403d4
   ceph::buffer::end_of_buffer::~end_of_buffer())
at
   /usr/src/debug/sys-devel/gcc-4.4.5/gcc-4.4.5/libstdc++-v3/libsupc++/eh_throw.cc:83
   #6  0x7f1701740a7b in ceph::buffer::list::iterator::copy
   (this=0x7f16fd8b1930, len=4, dest=0x7f16fd8b18dc ) at
   include/buffer.h:379
   #7  0x7f1701743328 in decode_raw__le32  (t=@0x7f16fd8b18dc,
  p=...)
   at
   include/encoding.h:35
   #8  0x7f170174198a in decode (v=@0x7f16fd8b190c, p=...) at
   include/encoding.h:80
   #9  0x7f1701741ade in decode (s=..., p=...) at
   include/encoding.h:189
   #10 0x7f17012e8369 in
   librados::RadosClient::C_aio_sparse_read_Ack::finish
   (this=0x7f16f40d6200,
   r=0) at librados.cc:463
   #11 0x7f170132bb5a in Objecter::handle_osd_op_reply
  (this=0x13423e0,
   m=0x1346520) at osdc/Objecter.cc:794
   #12 0x7f17012d1444 in librados::RadosClient::_dispatch
   (this=0x133f810, m=0x1346520) at librados.cc:751
   #13 0x7f17012d1244 in librados::RadosClient::ms_dispatch
   (this=0x133f810, m=0x1346520) at librados.cc:717
   #14 0x7f170131b57b in Messenger::ms_deliver_dispatch
   (this=0x1341910,
   m=0x1346520) at msg/Messenger.h:98
   #15 0x7f17013090d3 in SimpleMessenger::dispatch_entry
   (this=0x1341910)
   at msg/SimpleMessenger.cc:352
   #16 0x7f17012e296e in SimpleMessenger::DispatchThread::entry
   (this=0x1341da0) at msg/SimpleMessenger.h:533
   

Re: [Qemu-devel] Qemu + RBD = ceph::buffer::end_of_buffer

2011-05-06 Thread Sage Weil
On Fri, 6 May 2011, Dyweni - Qemu-Devel wrote:
 Hi Sage/Lists!
 
 
 (gdb) f 8
 #8  0x7f170174198a in decode (v=@0x7f16fd8b190c, p=...) at
 include/encoding.h:80
 80  WRITE_INTTYPE_ENCODER(uint32_t, le32)
 (gdb) p n
 No symbol n in current context.
 (gdb) p s
 No symbol s in current context.
 
 
 (gdb) f 9
 #9  0x7f1701741ade in decode (s=..., p=...) at include/encoding.h:189
 189   decode(len, p);
 (gdb) p n
 No symbol n in current context.
 (gdb) p s
 $3 = (ceph::bufferlist ) @0x7f16f40d6060: {_buffers =
 {std::_List_baseceph::buffer::ptr, std::allocatorceph::buffer::ptr 
 = {
   _M_impl = {std::allocatorstd::_List_nodeceph::buffer::ptr  =
 {__gnu_cxx::new_allocatorstd::_List_nodeceph::buffer::ptr  =
 {No data fields}, No data fields}, _M_node = {_M_next =
 0x7f16f40d6060, _M_prev = 0x7f16f40d6060}}}, No data fields}, _len
 = 0, append_buffer = {_raw = 0x0, _off = 0, _len = 0}, last_p = {
 bl = 0x7f16f40d6060, ls = 0x7f16f40d6060, off = 0, p = {_M_node =
 0x7f16f40d6060}, p_off = 0}}
 
 
 Sorry, I don't have access to IRC from where I am at.

No worries.

Are your OSDs, by chance, running on 32-bit machines?  This looks like a 
word size encoding thing.

sage


 
 Thanks,
 Dyweni
 
 
 
 
  f 9  (or 8?)
  p n
  p s
 
  (BTW this might be faster over irc, #ceph on irc.oftc.net)
 
  Thanks!
  sage
 
 
  On Fri, 6 May 2011, Dyweni - Qemu-Devel wrote:
 
  Hi Sage/Lists!
 
 
  (gdb) print c->bl._len
  $1 = 20
 
 
  And in case this is helpful:
 
  (gdb) print *c
  $2 = {lock = {name = 0x7f1701430f8d AioCompletionImpl lock, id = -1,
  recursive = false, lockdep = true, backtrace = false, _m = {__data =
  {__lock = 1, __count = 0,
  __owner = 25800, __nusers = 1, __kind = 2, __spins = 0, __list =
  {__prev = 0x0, __next = 0x0}},
__size =
  \001\000\000\000\000\000\000\000\310d\000\000\001\000\000\000\002,
  '\000' repeats 22 times, __align = 1}, nlock = 1}, cond = {
  _vptr.Cond = 0x7f1701952bd0, _c = {__data = {__lock = 0, __futex =
  0,
  __total_seq = 0, __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0,
  __nwaiters = 0,
  __broadcast_seq = 0}, __size = '\000' repeats 47 times,
  __align
  = 0}}, ref = 1, rval = 0, released = true, ack = true, safe =
  false, objver = {version = 0,
  epoch = 0, __pad = 0}, callback_complete = 0x7f170173de33
  librbd::rados_aio_sparse_read_cb(rados_completion_t, void*),
callback_safe = 0x7f170173d8bd librbd::rados_cb(rados_completion_t,
  void*), callback_arg = 0x7f16f40d6010, bl = {
  _buffers = {std::_List_baseceph::buffer::ptr,
  std::allocatorceph::buffer::ptr  = {
  _M_impl = {std::allocatorstd::_List_nodeceph::buffer::ptr 
  =
  {__gnu_cxx::new_allocatorstd::_List_nodeceph::buffer::ptr  =
  {No data fields}, No data fields}, _M_node = {_M_next =
  0x1350530, _M_prev = 0x1350530}}}, No data fields}, _len = 20,
  append_buffer = {_raw = 0x0, _off = 0, _len = 0}, last_p = {
bl = 0x7f16f40d6170, ls = 0x7f16f40d6170, off = 0, p = {_M_node =
  0x7f16f40d6170}, p_off = 0}}, pbl = 0x0, buf = 0x0, maxlen = 0}
 
 
 
  Thanks,
  Dyweni
 
 
 
 
   On Fri, 6 May 2011, Dyweni - Qemu-Devel wrote:
   Hi Josh/Lists!
  
   463 ::decode(*data_bl, iter);
   (gdb) print r
   $1 = 0
   (gdb) print data_bl
   $2 = (ceph::bufferlist *) 0x7f16f40d6060
   (gdb) print data_bl-_len
   $3 = 0
  
   What about c-bl._len?
  
   sage
  
  
   (gdb) print iter-off
   $4 = 20
  
  
   Thanks,
   Dyweni
  
  
  
CCing the ceph list.
   
On 05/06/2011 12:23 PM, Dyweni - Qemu-Devel wrote:
Hi List!
   
I upgraded Ceph to the latest development version
 Commit: 0edbc75a5fe8c3028faf85546f3264d28653ea3f
 Pulled from: git://ceph.newdream.net/ceph.git
   
I recompiled the latest GIT version of QEMU-KVM (with Josh
  Durgin's
patches) against the latest git version of Ceph.
   
However, this error is still occurring:
   
terminate called after throwing an instance of
'ceph::buffer::end_of_buffer'
   what():  buffer::end_of_buffer
Aborted (core dumped)
   
   
   
Here's another backtrace from GDB:
   
#0  0x7f16ff829495 in raise (sig=value optimized out) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x7f16ff82a81f in abort () at abort.c:92
#2  0x7f16fed74a25 in __gnu_cxx::__verbose_terminate_handler
  ()
   at
/usr/src/debug/sys-devel/gcc-4.4.5/gcc-4.4.5/libstdc++-v3/libsupc++/vterminate.cc:93
#3  0x7f16fed71c64 in __cxxabiv1::__terminate
(handler=0x7f16fed74817
__gnu_cxx::__verbose_terminate_handler())
 at
/usr/src/debug/sys-devel/gcc-4.4.5/gcc-4.4.5/libstdc++-v3/libsupc++/eh_terminate.cc:38
#4  0x7f16fed71c8c in std::terminate () at
/usr/src/debug/sys-devel/gcc-4.4.5/gcc-4.4.5/libstdc++-v3/libsupc++/eh_terminate.cc:48
#5  0x7f16fed71ea4 in __cxxabiv1::__cxa_throw (obj=0x1346470,
tinfo=0x7f1701952ce0, dest=0x7f17017403d4
ceph::buffer::end_of_buffer::~end_of_buffer())
   

Re: [Qemu-devel] Qemu + RBD = ceph::buffer::end_of_buffer

2011-05-06 Thread Sage Weil
On Fri, 6 May 2011, Dyweni - Qemu-Devel wrote:
 Hi Sage/Lists!
 
 Yes!  The entire Ceph cluster (1 mon, 1 MDS, 3 OSDs) is 32-bit Linux.
 
 The machine running Qemu is 64bit linux.

Excellent.  This is now fixed by 48d94f6e34da8ace2b5cb128de1bcfb34b3c40b5 
in the stable and master branches of ceph.git.  Or you can apply the patch 
below.

Thanks!
sage



From 48d94f6e34da8ace2b5cb128de1bcfb34b3c40b5 Mon Sep 17 00:00:00 2001
From: Sage Weil s...@newdream.net
Date: Fri, 6 May 2011 13:42:23 -0700
Subject: [PATCH] osd: used fixed size types for fiemap/mapext/sparseread 
encoding

The client expects uint64_t,uint64_t, so this breaks on any 32-bit osd.

Signed-off-by: Sage Weil s...@newdream.net
---
 src/os/FileStore.cc |4 ++--
 src/osd/ReplicatedPG.cc |4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/os/FileStore.cc b/src/os/FileStore.cc
index 81bc91d..ca11435 100644
--- a/src/os/FileStore.cc
+++ b/src/os/FileStore.cc
@@ -2198,7 +2198,7 @@ int FileStore::fiemap(coll_t cid, const sobject_t& oid,
 {
 
   if (!ioctl_fiemap) {
-    map<off_t, size_t> m;
+    map<uint64_t, uint64_t> m;
     m[offset] = len;
     ::encode(m, bl);
     return 0;
@@ -2207,7 +2207,7 @@ int FileStore::fiemap(coll_t cid, const sobject_t& oid,
 
   char fn[PATH_MAX];
   struct fiemap *fiemap = NULL;
-  map<off_t, size_t> extmap;
+  map<uint64_t, uint64_t> extmap;
 
   get_coname(cid, oid, fn, sizeof(fn));
 
diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc
index ba76b05..84399ca 100644
--- a/src/osd/ReplicatedPG.cc
+++ b/src/osd/ReplicatedPG.cc
@@ -1142,10 +1142,10 @@ int ReplicatedPG::do_osd_ops(OpContext *ctx, vector<OSDOp>& ops,
         result = r;
         break;
       }
-      map<off_t, size_t> m;
+      map<uint64_t, uint64_t> m;
       bufferlist::iterator iter = bl.begin();
       ::decode(m, iter);
-      map<off_t, size_t>::iterator miter;
+      map<uint64_t, uint64_t>::iterator miter;
       bufferlist data_bl;
       for (miter = m.begin(); miter != m.end(); ++miter) {
         bufferlist tmpbl;
-- 
1.7.0
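
As a small aside, the word-size issue the patch fixes can be illustrated with a standalone
snippet: on a 32-bit build size_t (and, depending on build flags, off_t) is narrower than
the uint64_t pairs the client decodes, so each encoded extent entry comes up short and the
decoder walks off the end of the buffer -- the end_of_buffer exception seen above.  This is
an illustration only, not Ceph code.

#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>

int main(void)
{
    /* on a 64-bit build all three are 8 bytes; on a 32-bit build size_t is 4,
     * so a map<off_t, size_t> entry encodes to fewer bytes than the decoder reads */
    printf("sizeof(off_t)=%zu sizeof(size_t)=%zu sizeof(uint64_t)=%zu\n",
           sizeof(off_t), sizeof(size_t), sizeof(uint64_t));
    return 0;
}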




Re: [Qemu-devel] [PATCH v2 1/2] rbd: use the higher level librbd instead of just librados

2011-04-12 Thread Sage Weil
On Tue, 12 Apr 2011, Stefan Hajnoczi wrote:
 On Tue, Apr 12, 2011 at 1:18 AM, Josh Durgin josh.dur...@dreamhost.com 
 wrote:
  On 04/08/2011 01:43 AM, Stefan Hajnoczi wrote:
 
  On Mon, Mar 28, 2011 at 04:15:57PM -0700, Josh Durgin wrote:
 
  librbd stacks on top of librados to provide access
  to rbd images.
 
  Using librbd simplifies the qemu code, and allows
  qemu to use new versions of the rbd format
  with few (if any) changes.
 
   Signed-off-by: Josh Durgin <josh.dur...@dreamhost.com>
   Signed-off-by: Yehuda Sadeh <yeh...@hq.newdream.net>
  ---
   block/rbd.c       |  785
  +++--
   block/rbd_types.h |   71 -
   configure         |   33 +--
   3 files changed, 221 insertions(+), 668 deletions(-)
   delete mode 100644 block/rbd_types.h
 
  Hi Josh,
  I have applied your patches onto qemu.git/master and am running
  ceph.git/master.
 
  Unfortunately qemu-iotests fails for me.
 
 
  Test 016 seems to hang in qemu-io -g -c write -P 66 128M 512
  rbd:rbd/t.raw.  I can reproduce this consistently.  Here is the
  backtrace of the hung process (not consuming CPU, probably deadlocked):
 
  This hung because it wasn't checking the return value of rbd_aio_write.
  I've fixed this in the for-qemu branch of
  http://ceph.newdream.net/git/qemu-kvm.git. Also, the existing rbd
  implementation is not 'growable' - writing to a large offset will not expand
  the rbd image correctly. Should we implement bdrv_truncate to support this
  (librbd has a resize operation)? Is bdrv_truncate useful outside of qemu-img
  and qemu-io?
 
 If librbd has a resize operation then it would be nice to wire up
 bdrv_truncate() for completeness.  Note that bdrv_truncate() can also
 be called online using the block_resize monitor command.
 
 Since rbd devices are not growable we should fix qemu-iotests to skip
 016 for rbd.

There is a resize operation, but it's expected that you'll use it for any 
bdev size change (grow or shrink).  Does qemu grow a device by writing to 
the (new) highest offset, or is there another operation that should be 
wired up?  We want to avoid a situation where RBD isn't aware of the qemu 
bdev resize and has to grow a bit each time we write to a larger offset, 
as resize is a somewhat expensive operation...

Thanks!
sage
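
For context, the librbd resize operation mentioned above can be called from C as in the
sketch below; wiring qemu's bdrv_truncate callback to it would be the qemu-side half.  The
pool and image names are made up and error handling is minimal -- this is an illustration,
not the driver code.

#include <stdio.h>
#include <rados/librados.h>
#include <rbd/librbd.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t io;
    rbd_image_t image;

    /* connect using the default ceph.conf search path */
    if (rados_create(&cluster, NULL) < 0 ||
        rados_conf_read_file(cluster, NULL) < 0 ||
        rados_connect(cluster) < 0) {
        fprintf(stderr, "could not connect to the cluster\n");
        return 1;
    }
    if (rados_ioctx_create(cluster, "rbd", &io) < 0) {
        rados_shutdown(cluster);
        return 1;
    }
    if (rbd_open(io, "t.raw", &image, NULL) == 0) {
        /* grow (or shrink) the image to 256 MiB; a bdrv_truncate
         * implementation would make the same call */
        if (rbd_resize(image, 256 * 1024 * 1024) < 0) {
            fprintf(stderr, "resize failed\n");
        }
        rbd_close(image);
    }
    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return 0;
}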

[Qemu-devel] Re: [PATCH] ceph/rbd block driver for qemu-kvm (v7)

2010-11-01 Thread Sage Weil
,
 +.bdrv_getlength = rbd_getlength,
 +.protocol_name  = "rbd",
 +
 +.bdrv_aio_readv = rbd_aio_readv,
 +.bdrv_aio_writev= rbd_aio_writev,
 +
 +.bdrv_snapshot_create = rbd_snap_create,
 +.bdrv_snapshot_list = rbd_snap_list,
 +};
 +
 +static void bdrv_rbd_init(void)
 +{
 +bdrv_register(bdrv_rbd);
 +}
 +
 +block_init(bdrv_rbd_init);
 diff --git a/block/rbd_types.h b/block/rbd_types.h
 new file mode 100644
 index 000..c35d840
 --- /dev/null
 +++ b/block/rbd_types.h
 @@ -0,0 +1,71 @@
 +/*
 + * Ceph - scalable distributed file system
 + *
 + * Copyright (C) 2004-2010 Sage Weil s...@newdream.net
 + *
 + * This is free software; you can redistribute it and/or
 + * modify it under the terms of the GNU Lesser General Public
 + * License version 2.1, as published by the Free Software
 + * Foundation.  See file COPYING.
 + *
 + */
 +
 +#ifndef CEPH_RBD_TYPES_H
 +#define CEPH_RBD_TYPES_H
 +
 +
 +/*
 + * rbd image 'foo' consists of objects
 + *   foo.rbd  - image metadata
 + *   foo.
 + *   foo.0001
 + *   ...  - data
 + */
 +
 +#define RBD_SUFFIX  ".rbd"
 +#define RBD_DIRECTORY   "rbd_directory"
 +#define RBD_INFO    "rbd_info"
 +
 +#define RBD_DEFAULT_OBJ_ORDER   22   /* 4MB */
 +
 +#define RBD_MAX_OBJ_NAME_SIZE   96
 +#define RBD_MAX_BLOCK_NAME_SIZE 24
 +#define RBD_MAX_SEG_NAME_SIZE   128
 +
 +#define RBD_COMP_NONE   0
 +#define RBD_CRYPT_NONE  0
 +
 +#define RBD_HEADER_TEXT  "<<< Rados Block Device Image >>>\n"
 +#define RBD_HEADER_SIGNATURE    "RBD"
 +#define RBD_HEADER_VERSION  "001.005"
 +
 +struct rbd_info {
 +uint64_t max_id;
 +} __attribute__ ((packed));
 +
 +struct rbd_obj_snap_ondisk {
 +uint64_t id;
 +uint64_t image_size;
 +} __attribute__((packed));
 +
 +struct rbd_obj_header_ondisk {
 +char text[40];
 +char block_name[RBD_MAX_BLOCK_NAME_SIZE];
 +char signature[4];
 +char version[8];
 +struct {
 +uint8_t order;
 +uint8_t crypt_type;
 +uint8_t comp_type;
 +uint8_t unused;
 +} __attribute__((packed)) options;
 +uint64_t image_size;
 +uint64_t snap_seq;
 +uint32_t snap_count;
 +uint32_t reserved;
 +uint64_t snap_names_len;
 +struct rbd_obj_snap_ondisk snaps[0];
 +} __attribute__((packed));
 +
 +
 +#endif
 diff --git a/configure b/configure
 index af50607..5d8f620 100755
 --- a/configure
 +++ b/configure
 @@ -325,6 +325,7 @@ cpu_emulation=yes
  check_utests=no
  user_pie=no
  zero_malloc=
 +rbd=
  
  # OS specific
  if check_define __linux__ ; then
 @@ -724,6 +725,10 @@ for opt do
;;
--*dir)
;;
 +  --disable-rbd) rbd=no
 +  ;;
 +  --enable-rbd) rbd=yes
 +  ;;
   *) echo "ERROR: unknown option $opt"; show_help=yes
;;
esac
 @@ -909,6 +914,7 @@ echo "  --enable-docs            enable documentation build"
  echo "  --disable-docs           disable documentation build"
  echo "  --disable-vhost-net      disable vhost-net acceleration support"
  echo "  --enable-vhost-net       enable vhost-net acceleration support"
 +echo "  --enable-rbd             enable building the rados block device (rbd)"
  echo ""
  echo "NOTE: The object files are built at the place where configure is launched"
  exit 1
 @@ -1755,6 +1761,27 @@ if test "$mingw32" != yes -a "$pthread" = no; then
  fi
 
  ##
 +# rbd probe
 +if test "$rbd" != "no" ; then
 +  cat > $TMPC <<EOF
 +#include <stdio.h>
 +#include <rados/librados.h>
 +int main(void) { rados_initialize(0, NULL); return 0; }
 +EOF
 +  rbd_libs="-lrados -lcrypto"
 +  if compile_prog "" "$rbd_libs" ; then
 +    rbd=yes
 +    libs_tools="$rbd_libs $libs_tools"
 +    libs_softmmu="$rbd_libs $libs_softmmu"
 +  else
 +    if test "$rbd" = "yes" ; then
 +      feature_not_found "rados block device"
 +    fi
 +    rbd=no
 +  fi
 +fi
 +
 +##
  # linux-aio probe
  
  if test $linux_aio != no ; then
 @@ -2256,6 +2283,7 @@ echo "preadv support    $preadv"
  echo "fdatasync         $fdatasync"
  echo "uuid support      $uuid"
  echo "vhost-net support $vhost_net"
 +echo "rbd support       $rbd"
 
  if test "$sdl_too_old" = "yes"; then
  echo "-> Your SDL version is too old - please upgrade to have SDL support"
 @@ -2498,6 +2526,9 @@ echo "CONFIG_UNAME_RELEASE=\"$uname_release\"" >> $config_host_mak
  if test "$zero_malloc" = "yes" ; then
    echo "CONFIG_ZERO_MALLOC=y" >> $config_host_mak
  fi
 +if test "$rbd" = "yes" ; then
 +  echo "CONFIG_RBD=y" >> $config_host_mak
 +fi
 +fi
  
  # USB host support
  case $usb in
 -- 
 1.7.1
 
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
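
As an aside, the object layout sketched in the rbd_types.h comment above can be made
concrete with a short standalone snippet: each data object is named by appending the object
number (the image offset shifted right by the object order) to the block name as zero-padded
hex.  The block name and offset below are made up, and the 12-digit padding is the format
used by the version-1 image layout.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const char *block_name = "rb.0.0";       /* hypothetical block name prefix */
    int order = 22;                          /* RBD_DEFAULT_OBJ_ORDER: 4 MiB objects */
    uint64_t offset = 10ULL * 1024 * 1024;   /* 10 MiB into the image */

    /* prints "rb.0.0.000000000002": the third 4 MiB object covers this offset */
    printf("%s.%012llx\n", block_name, (unsigned long long)(offset >> order));
    return 0;
}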
 
 



Re: Fwd: [Qemu-devel] bdrv_flush for qemu block drivers nbd, rbd and sheepdog

2010-10-22 Thread Sage Weil
On Fri, 22 Oct 2010, Kevin Wolf wrote:
 [ Adding qemu-devel to CC again ]
 
 Am 21.10.2010 20:59, schrieb Sage Weil:
  On Thu, 21 Oct 2010, Christian Brunner wrote:
  Hi,
 
  is there a flush operation in librados? - I guess the only way to
  handle this, would be waiting until all aio requests are finished?
 
 That's not the semantics of bdrv_flush, you don't need to wait for
 running requests. You just need to make sure that all completed requests
 are safe on disk so that they would persist even in case of a
 crash/power failure.

Okay, in that case we're fine.  librados doesn't declare a write committed 
until it is safely on disk on multiple backend nodes.  There is a 
mechanism to get an ack sooner, but the qemu storage driver does not use 
it.  

  There is no flush currently.  But librados does no caching, so in this 
  case at least silently upgrading to cache=writethrough should work.
 
 You're making sure that the data can't be cached in the server's page
 cache or volatile disk cache either, e.g. by using O_SYNC for the image
 file? If so, upgrading would be safe.

Right.

  If that's a problem, we can implement a flush.  Just let us know.
 
 Presumably providing a writeback mode with explicit flushes could
 improve performance. Upgrading to writethrough is not a correctness
 problem, though, so it's your decision if you want to implement it.

So is a bdrv_flush generated when e.g. the guest filesystem issues a 
barrier, or would otherwise normally ask a SATA disk to flush its cache?

sage
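
For illustration, the "ack sooner" mechanism mentioned above is visible in the librados aio
API: a completion can be waited on for "complete" (acknowledged, possibly still volatile)
separately from "safe" (committed to disk on the replicas), and the qemu driver relies only
on the latter.  A minimal standalone sketch with made-up pool and object names:

#include <stdio.h>
#include <string.h>
#include <rados/librados.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t io;
    rados_completion_t comp;
    const char buf[] = "hello";

    if (rados_create(&cluster, NULL) < 0 ||
        rados_conf_read_file(cluster, NULL) < 0 ||
        rados_connect(cluster) < 0) {
        fprintf(stderr, "could not connect to the cluster\n");
        return 1;
    }
    if (rados_ioctx_create(cluster, "rbd", &io) < 0) {
        rados_shutdown(cluster);
        return 1;
    }

    /* no callbacks here; we just wait synchronously below */
    rados_aio_create_completion(NULL, NULL, NULL, &comp);
    rados_aio_write(io, "testobj", comp, buf, sizeof(buf), 0);

    rados_aio_wait_for_complete(comp);  /* acked: may still be in OSD memory  */
    rados_aio_wait_for_safe(comp);      /* committed on the backend disks     */

    rados_aio_release(comp);
    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return 0;
}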



 Kevin
 
  -- Forwarded message --
  From: Kevin Wolf kw...@redhat.com
  Date: 2010/10/21
  Subject: [Qemu-devel] bdrv_flush for qemu block drivers nbd, rbd and 
  sheepdog
  To: Christian Brunner c...@muc.de, Laurent Vivier
  laur...@vivier.eu, MORITA Kazutaka morita.kazut...@lab.ntt.co.jp
  Cc: Qemu-devel@nongnu.org
 
 
  Hi all,
 
  I'm currently looking into adding a return value to qemu's bdrv_flush
  function and I noticed that your block drivers (nbd, rbd and sheepdog)
  don't implement bdrv_flush at all. bdrv_flush is going to return
  -ENOTSUP for any block driver not implementing this, effectively
  breaking these three drivers for anything but cache=unsafe.
 
  Is there a specific reason why your drivers don't implement this? I
  think I remember that one of the drivers always provides
  cache=writethough semantics. It would be okay to silently upgrade to
  cache=writethrough, so in this case I'd just need to add an empty
  bdrv_flush implementation.
 
  Otherwise, we really cannot allow any option except cache=unsafe because
  that's the semantics provided by the driver.
 
  In any case, I think it would be a good idea to implement a real
  bdrv_flush function to allow the write-back cache modes cache=off and
  cache=writeback in order to improve performance over writethrough.
 
  Is this possible with your protocols, or can the protocol be changed to
  consider this? Any hints on how to proceed?
 
  Kevin
  --
  To unsubscribe from this list: send the line unsubscribe ceph-devel in
  the body of a message to majord...@vger.kernel.org
  More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 



Re: [Qemu-devel] [PATCH] ceph/rbd block driver for qemu-kvm (v4)

2010-10-07 Thread Sage Weil
On Thu, 7 Oct 2010, Anthony Liguori wrote:

 On 10/07/2010 04:49 PM, Yehuda Sadeh Weinraub wrote:
  On Thu, Oct 7, 2010 at 2:04 PM, Anthony Liguori <anth...@codemonkey.ws>
  wrote:
 
   On 10/07/2010 03:47 PM, Yehuda Sadeh Weinraub wrote:

 How is that possible?  Are the callbacks delivered in the context of a
 different thread?  If so, don't you need locking?
 
  
Not sure I'm completely following you. The callbacks are delivered in
the context of a different thread, but won't run concurrently.
   
   Concurrently to what?  How do you prevent them from running concurrently
   with qemu?

  There are two types of callbacks. The first is for rados aio
  completions, and the second one is the one added later for the fd glue
  layer.
 
 
 This is a bad architecture for something like qemu.  You could create a 
 pipe and use the pipe to signal to qemu.  Same principle as eventfd.  
 Ideally, you would do this in the library itself.

I'm sorry, I'm having a hard time understanding what it is you're 
objecting to, or what you would prefer, as there are two different things 
we're talking about here (callbacks and fd glue/pipes).  (Please bear with 
me as I am not a qemu expert!)

The first is the aio completion.  You said a few messages back:

 It looks like you just use the eventfd to signal aio completion 
 callbacks.  A better way to do this would be to schedule a bottom half.

This is what we're doing.  The librados makes a callback to rbd.c's 
rbd_finish_aiocb(), which updates some internal rbd accounting and then 
calls qemu_bh_schedule().  Is that part right?


The second part is an fd (currently created via eventfd(), but I don't 
think it matters where it comes from) that was later added because 
qemu_aio_flush() wouldn't trigger when our aio's completed (and scheduled 
the bottom halves).  This was proposed by Simone Gotti, who had problems 
with live migration:

http://www.mail-archive.com/qemu-devel@nongnu.org/msg35516.html

Apparently calling the bottom half isn't sufficient to wake up a blocked 
qemu_aio_flush()?  His solution was to create an eventfd() fd, write a 
word to it in the aio completion callback (before we schedule the bh), and 
add the necessary callbacks to make qemu_aio_flush() behave.

Is the problem simply that we should be using pipe(2) instead of 
eventfd(2)?

So far I've heard that we should be scheduling the bottom halves (we are), 
and we should be using a pipe to signal qemu (we're using an fd created by 
eventfd(2)).

Thanks,
sage
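
A minimal, self-contained sketch of the signalling pattern under discussion: a worker thread
(standing in for the librados callback thread) writes a byte to a pipe when an "I/O"
completes, and the main loop (standing in for qemu's aio loop) polls the read end, drains
it, and then runs the completion work.  None of this is qemu or librados code; it only
illustrates why a pipe behaves the same way as the eventfd used in the patch.

#include <poll.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static int fds[2];

static void *completion_thread(void *arg)
{
    char b = 1;
    (void)arg;
    sleep(1);                                  /* pretend the aio just finished */
    if (write(fds[1], &b, sizeof(b)) < 0) {    /* wake the main loop */
        perror("write");
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    struct pollfd pfd;
    char buf[16];

    if (pipe(fds) < 0) {
        return 1;
    }
    pthread_create(&t, NULL, completion_thread, NULL);

    pfd.fd = fds[0];
    pfd.events = POLLIN;
    poll(&pfd, 1, -1);                         /* blocks until the callback signals */
    if (read(fds[0], buf, sizeof(buf)) > 0) {  /* drain, then run the completion */
        printf("completion delivered to the main loop\n");
    }

    pthread_join(t, NULL);
    return 0;
}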




 
 Regards,
 
 Anthony Liguori
 
  The first callback, called by librados whenever aio completes, runs in
  the context of a single librados thread:
  
  +static void rbd_finish_aiocb(rados_completion_t c, RADOSCB *rcb)
  +{
   +RBDAIOCB *acb = rcb->acb;
  rcb is per a single aio. Was created  before and will be destroyed
  here, whereas acb is shared between a few aios, however, it was
  generated before the first aio was created.
  
  +int64_t r;
  +uint64_t buf = 1;
  +int i;
  +
   +acb->aiocnt--;
  
   acb->aiocnt has been set before initiating all the aios, so it's ok to
  touch it now. Same goes to all acb fields.
  
  +r = rados_aio_get_return_value(c);
  +rados_aio_release(c);
   +if (acb->write) {
   +if (r < 0) {
   +acb->ret = r;
   +acb->error = 1;
   +} else if (!acb->error) {
   +acb->ret += rcb->segsize;
   +}
   +} else {
   +if (r == -ENOENT) {
   +memset(rcb->buf, 0, rcb->segsize);
   +if (!acb->error) {
   +acb->ret += rcb->segsize;
   +}
   +} else if (r < 0) {
   +acb->ret = r;
   +acb->error = 1;
   +} else if (r < rcb->segsize) {
   +memset(rcb->buf + r, 0, rcb->segsize - r);
   +if (!acb->error) {
   +acb->ret += rcb->segsize;
   +}
   +} else if (!acb->error) {
   +acb->ret += r;
   +}
   +}
   +if (write(acb->s->efd, &buf, sizeof(buf)) < 0)
  This will wake up the io_read()
  
   +error_report("failed writing to acb->s->efd\n");
  +qemu_free(rcb);
  +i = 0;
   +if (!acb->aiocnt && acb->bh) {
   +qemu_bh_schedule(acb->bh);
  This is the only qemu related call in here, seems safe to call it.
  
  +}
  +}
  
  The scheduled bh function will be called only after all aios that
  relate to this specific aio set are done, so the following seems ok,
  as there's no more acb references.
  +static void rbd_aio_bh_cb(void *opaque)
  +{
  +RBDAIOCB *acb = opaque;
  +uint64_t buf = 1;
  +
   +if (!acb->write) {
   +qemu_iovec_from_buffer(acb->qiov, acb->bounce, acb->qiov->size);
   +}
   +qemu_vfree(acb->bounce);
   +acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
   +qemu_bh_delete(acb->bh);
   +acb->bh = NULL;
  +
   +if (write(acb->s->efd, &buf, sizeof(buf)) < 0)
  +

Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread Sage Weil
On Tue, 25 May 2010, Avi Kivity wrote:
  What's the reason for not having these drivers upstream? Do we gain
  anything by hiding them from our users and requiring them to install the
  drivers separately from somewhere else?
 
 
 Six months.

FWIW, we (Ceph) aren't complaining about the 6 month lag time (and I don't 
think the Sheepdog guys are either).

From our perspective, the current BlockDriver abstraction is ideal, as it 
represents the reality of qemu's interaction with storage.  Any 'external' 
interface will be inferior to that in one way or another.  But either way, 
we are perfectly willing to work with you to all to keep in sync with any 
future BlockDriver API improvements.  It is worth our time investment even 
if the API is less stable.

The ability to dynamically load a shared object using the existing api 
would make development a bit easier, but I'm not convinced it's better 
for users.  I think having ceph and sheepdog upstream with qemu will serve 
end users best, and we at least are willing to spend the time to help 
maintain that code in qemu.git.

sage