Fwd: OSD daemon changes port no
does 'ceph mds dump' list pool 3 in the data_pools line? Yes. It lists the desired pool ids I wanted to put data in. -- Forwarded message -- From: hemant surale hemant.sur...@gmail.com Date: Thu, Nov 29, 2012 at 2:59 PM Subject: Re: OSD daemon changes port no To: Sage Weil s...@inktank.com I used a slightly different version of cephfs: cephfs /home/hemant/a set_layout --pool 3 -c 1 -u 4194304 -s 4194304 and cephfs /home/hemant/b set_layout --pool 5 -c 1 -u 4194304 -s 4194304. Now the command didn't show any error, but when I put data into dirs a and b it should ideally go to different pools; that's not working as of now. Is what I am doing possible (using 2 dirs pointing to 2 different pools for data placement)? - Hemant Surale. On Tue, Nov 27, 2012 at 10:21 PM, Sage Weil s...@inktank.com wrote: On Tue, 27 Nov 2012, hemant surale wrote: I did mkdir a; chmod 777 a. So dir a is /home/hemant/a. Then I used mount.ceph 10.72.148.245:/ /ho root@hemantsec-virtual-machine:/home/hemant# cephfs /home/hemant/a set_layout --pool 3 Error setting layout: Invalid argument does 'ceph mds dump' list pool 3 in the data_pools line?
sage On Mon, Nov 26, 2012 at 9:56 PM, Sage Weil s...@inktank.com wrote: On Mon, 26 Nov 2012, hemant surale wrote: While I was using cephfs the following error was observed - root@hemantsec-virtual-machine:~# cephfs /mnt/ceph/a --pool 3 invalid command Try cephfs /mnt/ceph/a set_layout --pool 3 (set_layout is the command) sage usage: cephfs path command [options]* Commands: show_layout -- view the layout information on a file or dir set_layout -- set the layout on an empty file, or the default layout on a directory show_location -- view the location information on a file Options: Useful for setting layouts: --stripe_unit, -u: set the size of each stripe --stripe_count, -c: set the number of objects to stripe across --object_size, -s: set the size of the objects to stripe across --pool, -p: set the pool to use Useful for getting location data: --offset, -l: the offset to retrieve location data for It may be a silly question but I am unable to figure it out. :( On Wed, Nov 21, 2012 at 8:59 PM, Sage Weil s...@inktank.com wrote: On Wed, 21 Nov 2012, hemant surale wrote: Oh I see. Generally speaking, the only way to guarantee separation is to put them in different pools and distribute the pools across different sets of OSDs. Yeah, that was the correct approach, but I found a problem doing so from the abstract level, i.e. when I put a file inside the mounted dir /home/hemant/cephfs (mounted using the mount.ceph cmd). At that point ceph is anyway going to use the default pool 'data' to store files (here files are striped into objects and then sent to the appropriate osd). So how to tell ceph to use different pools in this case? Goal: separate read and write operations, where reads are served from one group of OSDs and writes go to another group of OSDs. First create the other pool, ceph osd pool create name, and then adjust the CRUSH rule to distribute to a different set of OSDs for that pool.
To allow cephfs to use it, ceph mds add_data_pool poolid and then: cephfs /mnt/ceph/foo set_layout --pool poolid will set the policy on the directory such that new files beneath that point will be stored in a different pool. Hope that helps! sage - Hemant Surale. On Wed, Nov 21, 2012 at 12:33 PM, Sage Weil s...@inktank.com wrote: On Wed, 21 Nov 2012, hemant surale wrote: It's a little confusing question, I believe. Actually there are two files, X and Y. When I am reading X from its primary, I want to make sure a simultaneous write of Y goes to any OSD other than the primary OSD for X (from where my current read is being served). Oh I see. Generally speaking, the only way to guarantee separation is to put them in different pools and distribute the pools across different sets of OSDs. Otherwise, it's all (pseudo)random and you never know. Usually, they will be different, particularly as the cluster size increases, but sometimes they will be the same. sage - Hemant Surale. On Wed, Nov 21, 2012 at 11:50 AM, Sage Weil s...@inktank.com wrote: On Wed, 21 Nov 2012, hemant surale wrote: and one more thing: how can it be possible to read from one osd and simultaneously direct a write to another osd with less/no traffic? I'm not sure I understand the
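Putting the whole thread together, the sequence Sage describes looks roughly like this. The pool name, pool id 3, pg count, monitor address, and mount point below are illustrative placeholders, not values from this thread; get the real pool id from 'ceph osd dump':

```shell
# 1. Create a second pool (and adjust its CRUSH rule to target a
#    different set of OSDs), then let the MDS use it as a data pool.
ceph osd pool create fastpool 128
ceph mds add_data_pool 3        # 3 = id of the new pool

# 2. Confirm the pool id appears before setting any layouts.
ceph mds dump | grep data_pools

# 3. Mount the filesystem and point a directory at the new pool.
#    Only files created *after* set_layout inherit the new layout;
#    existing files stay in their original pool.
mount.ceph 10.72.148.245:/ /mnt/ceph
mkdir /mnt/ceph/a
cephfs /mnt/ceph/a set_layout --pool 3 -c 1 -u 4194304 -s 4194304
```

This matches the fix in the thread: the "Invalid argument" error went away once the pool id was listed in data_pools, and the "invalid command" error once set_layout was given explicitly.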
[PATCHv3] rbd block driver fix race between aio completion and aio cancel
This one fixes a race which qemu also had in the iscsi block driver, between cancellation and io completion. qemu_rbd_aio_cancel was not synchronously waiting for the end of the command. To achieve this it introduces a new status flag which uses -EINPROGRESS. Changes since last PATCH: - fixed missing braces - added vfree for bounce Signed-off-by: Stefan Priebe s.pri...@profihost.ag --- block/rbd.c | 27 ++- 1 file changed, 18 insertions(+), 9 deletions(-) diff --git a/block/rbd.c b/block/rbd.c index 0aaacaf..917c64c 100644 --- a/block/rbd.c +++ b/block/rbd.c @@ -77,6 +77,7 @@ typedef struct RBDAIOCB { int error; struct BDRVRBDState *s; int cancelled; +int status; } RBDAIOCB; typedef struct RADOSCB { @@ -376,12 +377,6 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb) RBDAIOCB *acb = rcb->acb; int64_t r; -if (acb->cancelled) { -qemu_vfree(acb->bounce); -qemu_aio_release(acb); -goto done; -} - r = rcb->ret; if (acb->cmd == RBD_AIO_WRITE || @@ -406,10 +401,11 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb) acb->ret = r; } } +acb->status = 0; + /* Note that acb->bh can be NULL in case where the aio was cancelled */ acb->bh = qemu_bh_new(rbd_aio_bh_cb, acb); qemu_bh_schedule(acb->bh); -done: g_free(rcb); } @@ -568,6 +564,13 @@ static void qemu_rbd_aio_cancel(BlockDriverAIOCB *blockacb) { RBDAIOCB *acb = (RBDAIOCB *) blockacb; acb->cancelled = 1; + +while (acb->status == -EINPROGRESS) { +qemu_aio_wait(); +} + +qemu_vfree(acb->bounce); +qemu_aio_release(acb); } static const AIOCBInfo rbd_aiocb_info = { @@ -640,7 +643,9 @@ static void rbd_aio_bh_cb(void *opaque) qemu_bh_delete(acb->bh); acb->bh = NULL; -qemu_aio_release(acb); +if (!acb->cancelled) { +qemu_aio_release(acb); +} } static int rbd_aio_discard_wrapper(rbd_image_t image, @@ -685,6 +690,7 @@ static BlockDriverAIOCB *rbd_start_aio(BlockDriverState *bs, acb->s = s; acb->cancelled = 0; acb->bh = NULL; +acb->status = -EINPROGRESS; if (cmd == RBD_AIO_WRITE) { qemu_iovec_to_buf(acb->qiov, 0, acb->bounce, qiov->size); @@ -731,7 +737,10 @@
static BlockDriverAIOCB *rbd_start_aio(BlockDriverState *bs, failed: g_free(rcb); s->qemu_aio_count--; -qemu_aio_release(acb); +if (!acb->cancelled) { +qemu_vfree(acb->bounce); +qemu_aio_release(acb); +} return NULL; } -- 1.7.10.4 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: rbd STDIN import does not work / wip-rbd-export-stdout
Hi Josh, On 28.11.2012 19:51, Josh Durgin wrote: No idea how to achieve this with git send-email ;-( But still more important is also the patch for discards... Use git format-patch, edit the patch file (it includes the basic headers already), then send it with git send-email. done / fixed Your return value type change was already merged into master of qemu.git as 08448d5195aeff49bf25fb62b4a6218f079f5284. Oh thanks, I didn't get a reply that it was applied. Stefan
RBD: periodic cephx issue ? CephxAuthorizeHandler::verify_authorizer isvalid=0
Hi, I'm using RBD to store VM images and they're accessed through the kernel client (xen vms). In the client dmesg log, I see periodically: Nov 29 10:46:48 b53-04 kernel: [160055.012206] libceph: osd8 10.208.2.213:6806 socket closed Nov 29 10:46:48 b53-04 kernel: [160055.013635] libceph: osd8 10.208.2.213:6806 socket error on read And in the matching osd log I find: 2012-11-29 10:46:48.130673 7f6018127700 0 -- 192.168.2.213:6806/944 >> 192.168.2.28:0/869804615 pipe(0xcf80600 sd=46 pgs=0 cs=0 l=0).accept peer addr is really 192.168.2.28:0/869804615 (socket is 192.168.2.28:40567/0) 2012-11-29 10:46:48.130902 7f6018127700 0 auth: could not find secret_id=0 2012-11-29 10:46:48.130912 7f6018127700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=0 2012-11-29 10:46:48.130915 7f6018127700 1 CephxAuthorizeHandler::verify_authorizer isvalid=0 2012-11-29 10:46:48.130917 7f6018127700 0 -- 192.168.2.213:6806/944 >> 192.168.2.28:0/869804615 pipe(0xcf80600 sd=46 pgs=0 cs=0 l=1).accept bad authorizer 2012-11-29 10:46:48.131132 7f6018127700 0 auth: could not find secret_id=0 2012-11-29 10:46:48.131146 7f6018127700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=0 2012-11-29 10:46:48.131151 7f6018127700 1 CephxAuthorizeHandler::verify_authorizer isvalid=0 2012-11-29 10:46:48.131154 7f6018127700 0 -- 192.168.2.213:6806/944 >> 192.168.2.28:0/869804615 pipe(0xcf80600 sd=46 pgs=0 cs=0 l=1).accept bad authorizer 2012-11-29 10:46:48.824180 7f6018127700 0 -- 192.168.2.213:6806/944 >> 192.168.2.28:0/869804615 pipe(0xaf5de00 sd=46 pgs=0 cs=0 l=0).accept peer addr is really 192.168.2.28:0/869804615 (socket is 192.168.2.28:40568/0) 2012-11-29 10:46:48.824585 7f6018127700 1 CephxAuthorizeHandler::verify_authorizer isvalid=1 2012-11-29 10:46:48.825013 7f601f484700 0 osd.8 951 pg[3.514( v 950'1138 (223'137,950'1138] n=15 ec=10 les/c 941/948 916/916/916) [8,7] r=0 lpr=916 mlcod 950'1137 active+clean] watch: ctx->obc=0xb72e340 cookie=2 oi.version=1109 ctx->at_version=951'1139 2012-11-29 10:46:48.825024 7f601f484700 0 osd.8 951 pg[3.514( v 950'1138 (223'137,950'1138] n=15 ec=10 les/c 941/948 916/916/916) [8,7] r=0 lpr=916 mlcod 950'1137 active+clean] watch: oi.user_version=755 Note that this doesn't seem to pose any operational issue (i.e. it works ... when it retries it eventually connects). My configuration: the client currently runs on a debian wheezy and uses a custom-built 3.6.8 kernel that contains all the latest ceph rbd patches AFAIK, but the problem was also showing up with earlier kernel versions. The cluster is a 0.48.2 running on Ubuntu 12.04 LTS. Cheers, Sylvain
Re: [PATCH] rbd block driver fix race between aio completion and aio cancel
On Thu, Nov 22, 2012 at 11:00:19AM +0100, Stefan Priebe wrote: @@ -406,10 +401,11 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb) acb->ret = r; } } +acb->status = 0; + I suggest doing this in the BH. The qemu_aio_wait() loop in qemu_rbd_aio_cancel() needs to wait until the BH has executed. By clearing status in the BH we ensure that no matter in which order qemu_aio_wait() invokes BHs and callbacks, we'll always wait until the BH has completed before ending the while loop in qemu_rbd_aio_cancel(). @@ -737,7 +741,8 @@ static BlockDriverAIOCB *rbd_start_aio(BlockDriverState *bs, failed: g_free(rcb); s->qemu_aio_count--; -qemu_aio_release(acb); +if (!acb->cancelled) +qemu_aio_release(acb); return NULL; } This scenario is impossible. We haven't returned the acb back to the caller yet, so they could not have invoked qemu_aio_cancel(). Stefan
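The ordering argument above can be illustrated with a tiny single-threaded model. This is only a sketch of the idea, not qemu code: toy_acb, toy_aio_wait, etc. are made-up names. The request carries status == -EINPROGRESS until the bottom-half callback has run, and cancel simply pumps the event loop until that flag clears, so the acb and its buffers stay valid for the whole wait:

```c
#include <assert.h>
#include <errno.h>

/* Toy model of the -EINPROGRESS completion flag pattern. */
struct toy_acb {
    int status;      /* -EINPROGRESS until the BH has run */
    int cancelled;
    int bh_pending;  /* completion happened, BH scheduled but not run */
};

static void toy_start(struct toy_acb *acb)
{
    acb->status = -EINPROGRESS;
    acb->cancelled = 0;
    acb->bh_pending = 0;
}

/* Called when the backend finishes the request: schedule the BH. */
static void toy_complete(struct toy_acb *acb)
{
    acb->bh_pending = 1;
}

/* One iteration of the event loop: run the pending BH, if any.
 * Clearing status here (not in the completion callback) mirrors the
 * review comment: the loop in cancel can only exit once the BH ran. */
static void toy_aio_wait(struct toy_acb *acb)
{
    if (acb->bh_pending) {
        acb->bh_pending = 0;
        acb->status = 0;
    }
}

/* Synchronous cancel: spin the event loop until the BH has executed.
 * (The real qemu_aio_wait() blocks for events; here completion must
 * already be scheduled, or this toy loop would spin forever.) */
static void toy_cancel(struct toy_acb *acb)
{
    acb->cancelled = 1;
    while (acb->status == -EINPROGRESS)
        toy_aio_wait(acb);
}
```

In this model it is easy to see why freeing the bounce buffer in the completion callback was racy: cancel could observe the freed acb while still waiting.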
[PATCHv4] rbd block driver fix race between aio completion and aio cancel
This one fixes a race which qemu also had in the iscsi block driver, between cancellation and io completion. qemu_rbd_aio_cancel was not synchronously waiting for the end of the command. To achieve this it introduces a new status flag which uses -EINPROGRESS. Changes since PATCHv3: - removed unnecessary if condition in rbd_start_aio as we haven't started io yet - moved acb->status = 0 to rbd_aio_bh_cb so qemu_aio_wait always waits until the BH was executed Changes since PATCHv2: - fixed missing braces - added vfree for bounce Signed-off-by: Stefan Priebe s.pri...@profihost.ag --- block/rbd.c | 16 +--- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/block/rbd.c b/block/rbd.c index f3becc7..28e94ab 100644 --- a/block/rbd.c +++ b/block/rbd.c @@ -77,6 +77,7 @@ typedef struct RBDAIOCB { int error; struct BDRVRBDState *s; int cancelled; +int status; } RBDAIOCB; typedef struct RADOSCB { @@ -376,12 +377,6 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb) RBDAIOCB *acb = rcb->acb; int64_t r; -if (acb->cancelled) { -qemu_vfree(acb->bounce); -qemu_aio_release(acb); -goto done; -} - r = rcb->ret; if (acb->cmd == RBD_AIO_WRITE || @@ -409,7 +404,6 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb) /* Note that acb->bh can be NULL in case where the aio was cancelled */ acb->bh = qemu_bh_new(rbd_aio_bh_cb, acb); qemu_bh_schedule(acb->bh); -done: g_free(rcb); } @@ -568,6 +562,12 @@ static void qemu_rbd_aio_cancel(BlockDriverAIOCB *blockacb) { RBDAIOCB *acb = (RBDAIOCB *) blockacb; acb->cancelled = 1; + +while (acb->status == -EINPROGRESS) { +qemu_aio_wait(); +} + +qemu_vfree(acb->bounce); } static const AIOCBInfo rbd_aiocb_info = { @@ -639,6 +639,7 @@ static void rbd_aio_bh_cb(void *opaque) acb->common.cb(acb->common.opaque, (acb->ret < 0 ?
0 : acb->ret)); qemu_bh_delete(acb->bh); acb->bh = NULL; +acb->status = 0; qemu_aio_release(acb); } @@ -685,6 +686,7 @@ static BlockDriverAIOCB *rbd_start_aio(BlockDriverState *bs, acb->s = s; acb->cancelled = 0; acb->bh = NULL; +acb->status = -EINPROGRESS; if (cmd == RBD_AIO_WRITE) { qemu_iovec_to_buf(acb->qiov, 0, acb->bounce, qiov->size); -- 1.7.10.4
Re: [PATCH] rbd block driver fix race between aio completion and aio cancel
Hi, I hope I've done everything correctly. I've sent a new v4 patch. On 29.11.2012 14:58, Stefan Hajnoczi wrote: On Thu, Nov 22, 2012 at 11:00:19AM +0100, Stefan Priebe wrote: @@ -406,10 +401,11 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb) acb->ret = r; } } +acb->status = 0; + I suggest doing this in the BH. The qemu_aio_wait() loop in qemu_rbd_aio_cancel() needs to wait until the BH has executed. By clearing status in the BH we ensure that no matter in which order qemu_aio_wait() invokes BHs and callbacks, we'll always wait until the BH has completed before ending the while loop in qemu_rbd_aio_cancel(). @@ -737,7 +741,8 @@ static BlockDriverAIOCB *rbd_start_aio(BlockDriverState *bs, failed: g_free(rcb); s->qemu_aio_count--; -qemu_aio_release(acb); +if (!acb->cancelled) +qemu_aio_release(acb); return NULL; } This scenario is impossible. We haven't returned the acb back to the caller yet, so they could not have invoked qemu_aio_cancel(). Greets, Stefan
[PATCH] ceph: don't reference req after put
In __unregister_request(), there is a call to list_del_init() referencing a request that was the subject of a call to ceph_osdc_put_request() on the previous line. This is not safe, because the request structure could have been freed by the time we reach the list_del_init(). Fix this by reversing the order of these lines. Signed-off-by: Alex Elder el...@inktank.com --- net/ceph/osd_client.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index 9b6f0e4..d1177ec 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -797,9 +797,9 @@ static void __unregister_request(struct ceph_osd_client *osdc, req->r_osd = NULL; } + list_del_init(&req->r_req_lru_item); ceph_osdc_put_request(req); - list_del_init(&req->r_req_lru_item); if (osdc->num_requests == 0) { dout("no requests, canceling timeout\n"); __cancel_osd_timeout(osdc); -- 1.7.9.5
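The underlying rule is the usual refcounting one: any access to the object's fields must happen before the final put, because the put may free the object. A minimal standalone illustration of that rule (not the kernel code; struct obj, obj_get, and obj_put are made-up names for this sketch):

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal refcounted object. */
struct obj {
    int refcount;
    int field;
};

static struct obj *obj_get(struct obj *o)
{
    o->refcount++;
    return o;
}

/* After the final put, o may have been freed: no access to any of
 * its fields is allowed past this call. */
static void obj_put(struct obj *o)
{
    if (--o->refcount == 0)
        free(o);
}
```

In __unregister_request() terms, list_del_init() is a field access (it touches the embedded list head), so it must come before ceph_osdc_put_request(), which is exactly the reordering the patch makes.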
Re: [PATCHv4] rbd block driver fix race between aio completion and aio cancel
- Original message - From: Stefan Priebe s.pri...@profihost.ag To: qemu-de...@nongnu.org Cc: stefa...@gmail.com, josh durgin josh.dur...@inktank.com, ceph-devel@vger.kernel.org, pbonz...@redhat.com, Stefan Priebe s.pri...@profihost.ag Sent: Thursday, 29 November 2012 15:28:35 Subject: [PATCHv4] rbd block driver fix race between aio completion and aio cancel This one fixes a race which qemu also had in the iscsi block driver, between cancellation and io completion. qemu_rbd_aio_cancel was not synchronously waiting for the end of the command. To achieve this it introduces a new status flag which uses -EINPROGRESS. Changes since PATCHv3: - removed unnecessary if condition in rbd_start_aio as we haven't started io yet - moved acb->status = 0 to rbd_aio_bh_cb so qemu_aio_wait always waits until the BH was executed Changes since PATCHv2: - fixed missing braces - added vfree for bounce Signed-off-by: Stefan Priebe s.pri...@profihost.ag --- block/rbd.c | 16 +--- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/block/rbd.c b/block/rbd.c index f3becc7..28e94ab 100644 --- a/block/rbd.c +++ b/block/rbd.c @@ -77,6 +77,7 @@ typedef struct RBDAIOCB { int error; struct BDRVRBDState *s; int cancelled; +int status; } RBDAIOCB; typedef struct RADOSCB { @@ -376,12 +377,6 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb) RBDAIOCB *acb = rcb->acb; int64_t r; -if (acb->cancelled) { -qemu_vfree(acb->bounce); -qemu_aio_release(acb); -goto done; -} - r = rcb->ret; if (acb->cmd == RBD_AIO_WRITE || @@ -409,7 +404,6 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb) /* Note that acb->bh can be NULL in case where the aio was cancelled */ acb->bh = qemu_bh_new(rbd_aio_bh_cb, acb); qemu_bh_schedule(acb->bh); -done: g_free(rcb); } @@ -568,6 +562,12 @@ static void qemu_rbd_aio_cancel(BlockDriverAIOCB *blockacb) { RBDAIOCB *acb = (RBDAIOCB *) blockacb; acb->cancelled = 1; + +while (acb->status == -EINPROGRESS) { +qemu_aio_wait(); +} + +qemu_vfree(acb->bounce); This vfree is not needed, since the BH
will run and do the free. Otherwise looks ok. } static const AIOCBInfo rbd_aiocb_info = { @@ -639,6 +639,7 @@ static void rbd_aio_bh_cb(void *opaque) acb->common.cb(acb->common.opaque, (acb->ret < 0 ? 0 : acb->ret)); qemu_bh_delete(acb->bh); acb->bh = NULL; +acb->status = 0; qemu_aio_release(acb); } @@ -685,6 +686,7 @@ static BlockDriverAIOCB *rbd_start_aio(BlockDriverState *bs, acb->s = s; acb->cancelled = 0; acb->bh = NULL; +acb->status = -EINPROGRESS; if (cmd == RBD_AIO_WRITE) { qemu_iovec_to_buf(acb->qiov, 0, acb->bounce, qiov->size); -- 1.7.10.4
Re: [PATCH] rbd block driver fix race between aio completion and aio cancel
@@ -574,6 +570,12 @@ static void qemu_rbd_aio_cancel(BlockDriverAIOCB *blockacb) { RBDAIOCB *acb = (RBDAIOCB *) blockacb; acb->cancelled = 1; + +while (acb->status == -EINPROGRESS) { +qemu_aio_wait(); +} + There should be a qemu_vfree(acb->bounce); here No, because the BH will have run at this point and you'd doubly free the buffer. Paolo +qemu_aio_release(acb); } static AIOPool rbd_aio_pool = { @@ -646,7 +648,8 @@ static void rbd_aio_bh_cb(void *opaque) qemu_bh_delete(acb->bh); acb->bh = NULL; -qemu_aio_release(acb); +if (!acb->cancelled) +qemu_aio_release(acb); } static int rbd_aio_discard_wrapper(rbd_image_t image, @@ -691,6 +694,7 @@ static BlockDriverAIOCB *rbd_start_aio(BlockDriverState *bs, acb->s = s; acb->cancelled = 0; acb->bh = NULL; +acb->status = -EINPROGRESS; if (cmd == RBD_AIO_WRITE) { qemu_iovec_to_buf(acb->qiov, 0, acb->bounce, qiov->size); @@ -737,7 +741,8 @@ static BlockDriverAIOCB *rbd_start_aio(BlockDriverState *bs, failed: g_free(rcb); s->qemu_aio_count--; -qemu_aio_release(acb); +if (!acb->cancelled) qemu_vfree(acb->bounce) should be here as well, although that's a separate bug that's probably never hit. +qemu_aio_release(acb); return NULL; }
Re: parsing in the ceph osd subsystem
On Thu, 29 Nov 2012, Andrey Korolyov wrote: $ ceph osd down - osd.0 is already down $ ceph osd down --- osd.0 is already down The same for ``+'', ``/'', ``%'' and so on - I think that for the osd subsystem the ceph cli should explicitly work only with positive integers plus zero, refusing all other input. Which branch is this? This parsing is cleaned up in the latest next/master.
Re: RBD: periodic cephx issue ? CephxAuthorizeHandler::verify_authorizer isvalid=0
Hi Sylvain, Can you attach/post the whole log somewhere? I'm curious what is leading up to it not having secret_id=0. Ideally with 'debug auth = 20' and 'debug osd = 20' and 'debug ms = 1'. Thanks! sage On Thu, 29 Nov 2012, Sylvain Munaut wrote: Hi, I'm using RBD to store VM images and they're accessed through the kernel client (xen vms). In the client dmesg log, I see periodically: Nov 29 10:46:48 b53-04 kernel: [160055.012206] libceph: osd8 10.208.2.213:6806 socket closed Nov 29 10:46:48 b53-04 kernel: [160055.013635] libceph: osd8 10.208.2.213:6806 socket error on read And in the matching osd log I find: 2012-11-29 10:46:48.130673 7f6018127700 0 -- 192.168.2.213:6806/944 >> 192.168.2.28:0/869804615 pipe(0xcf80600 sd=46 pgs=0 cs=0 l=0).accept peer addr is really 192.168.2.28:0/869804615 (socket is 192.168.2.28:40567/0) 2012-11-29 10:46:48.130902 7f6018127700 0 auth: could not find secret_id=0 2012-11-29 10:46:48.130912 7f6018127700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=0 2012-11-29 10:46:48.130915 7f6018127700 1 CephxAuthorizeHandler::verify_authorizer isvalid=0 2012-11-29 10:46:48.130917 7f6018127700 0 -- 192.168.2.213:6806/944 >> 192.168.2.28:0/869804615 pipe(0xcf80600 sd=46 pgs=0 cs=0 l=1).accept bad authorizer 2012-11-29 10:46:48.131132 7f6018127700 0 auth: could not find secret_id=0 2012-11-29 10:46:48.131146 7f6018127700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=0 2012-11-29 10:46:48.131151 7f6018127700 1 CephxAuthorizeHandler::verify_authorizer isvalid=0 2012-11-29 10:46:48.131154 7f6018127700 0 -- 192.168.2.213:6806/944 >> 192.168.2.28:0/869804615 pipe(0xcf80600 sd=46 pgs=0 cs=0 l=1).accept bad authorizer 2012-11-29 10:46:48.824180 7f6018127700 0 -- 192.168.2.213:6806/944 >> 192.168.2.28:0/869804615 pipe(0xaf5de00 sd=46 pgs=0 cs=0 l=0).accept peer addr is really 192.168.2.28:0/869804615 (socket is 192.168.2.28:40568/0) 2012-11-29 10:46:48.824585 7f6018127700 1 CephxAuthorizeHandler::verify_authorizer isvalid=1 2012-11-29 10:46:48.825013 7f601f484700 0 osd.8 951 pg[3.514( v 950'1138 (223'137,950'1138] n=15 ec=10 les/c 941/948 916/916/916) [8,7] r=0 lpr=916 mlcod 950'1137 active+clean] watch: ctx->obc=0xb72e340 cookie=2 oi.version=1109 ctx->at_version=951'1139 2012-11-29 10:46:48.825024 7f601f484700 0 osd.8 951 pg[3.514( v 950'1138 (223'137,950'1138] n=15 ec=10 les/c 941/948 916/916/916) [8,7] r=0 lpr=916 mlcod 950'1137 active+clean] watch: oi.user_version=755 Note that this doesn't seem to pose any operational issue (i.e. it works ... when it retries it eventually connects). My configuration: the client currently runs on a debian wheezy and uses a custom-built 3.6.8 kernel that contains all the latest ceph rbd patches AFAIK, but the problem was also showing up with earlier kernel versions. The cluster is a 0.48.2 running on Ubuntu 12.04 LTS. Cheers, Sylvain
Re: [PATCH] ceph: don't reference req after put
Reviewed-by: Sage Weil s...@inktank.com On Thu, 29 Nov 2012, Alex Elder wrote: In __unregister_request(), there is a call to list_del_init() referencing a request that was the subject of a call to ceph_osdc_put_request() on the previous line. This is not safe, because the request structure could have been freed by the time we reach the list_del_init(). Fix this by reversing the order of these lines. Signed-off-by: Alex Elder el...@inktank.com --- net/ceph/osd_client.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index 9b6f0e4..d1177ec 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -797,9 +797,9 @@ static void __unregister_request(struct ceph_osd_client *osdc, req->r_osd = NULL; } + list_del_init(&req->r_req_lru_item); ceph_osdc_put_request(req); - list_del_init(&req->r_req_lru_item); if (osdc->num_requests == 0) { dout("no requests, canceling timeout\n"); __cancel_osd_timeout(osdc); -- 1.7.9.5
Re: parsing in the ceph osd subsystem
On Thu, Nov 29, 2012 at 8:34 PM, Sage Weil s...@inktank.com wrote: On Thu, 29 Nov 2012, Andrey Korolyov wrote: $ ceph osd down - osd.0 is already down $ ceph osd down --- osd.0 is already down The same for ``+'', ``/'', ``%'' and so on - I think that for the osd subsystem the ceph cli should explicitly work only with positive integers plus zero, refusing all other input. Which branch is this? This parsing is cleaned up in the latest next/master. It was produced by the 0.54 tag. I built dd3a24a647d0b0f1153cf1b102ed1f51d51be2f2 today and the problem has gone (except parsing ``-0'' as 0 and 0/001 as 0 and 1 correspondingly).
Re: parsing in the ceph osd subsystem
On 11/29/2012 07:01 PM, Andrey Korolyov wrote: On Thu, Nov 29, 2012 at 8:34 PM, Sage Weil s...@inktank.com wrote: On Thu, 29 Nov 2012, Andrey Korolyov wrote: $ ceph osd down - osd.0 is already down $ ceph osd down --- osd.0 is already down The same for ``+'', ``/'', ``%'' and so on - I think that for the osd subsystem the ceph cli should explicitly work only with positive integers plus zero, refusing all other input. Which branch is this? This parsing is cleaned up in the latest next/master. It was produced by the 0.54 tag. I built dd3a24a647d0b0f1153cf1b102ed1f51d51be2f2 today and the problem has gone (except parsing ``-0'' as 0 and 0/001 as 0 and 1 correspondingly). We use strtol() to parse numeric values, and '-0', '0' or '1' are valid numeric values. I suppose we could enforce the argument to be numeric only, hence getting rid of '-0', and enforce stricter checks on the parameters to rule out valid numeric values that look funny, which in the '0*\d' cases should be fairly simple. -Joao
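Joao's suggestion can be sketched as a strict wrapper around strtol(). This is only an illustrative standalone parser (parse_osd_id is a hypothetical name, not a function in the ceph tree) showing how to reject '-0', '+', '%', '001', and trailing junk that a bare strtol() call tolerates:

```c
#include <errno.h>
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Accept only canonical non-negative decimal integers ("0", "12").
 * Rejects signs ("-0", "+1"), leading zeros ("001"), empty input,
 * bare punctuation ("-", "%"), and trailing garbage ("3x"). */
static bool parse_osd_id(const char *s, long *out)
{
    char *end;

    if (s == NULL || *s == '\0')
        return false;
    /* strtol() would happily consume a sign; forbid it outright. */
    if (s[0] == '-' || s[0] == '+')
        return false;
    /* Forbid non-canonical forms like "001" (but allow plain "0"). */
    if (s[0] == '0' && s[1] != '\0')
        return false;

    errno = 0;
    long v = strtol(s, &end, 10);
    /* end == s means nothing parsed; *end != '\0' means trailing junk. */
    if (errno != 0 || end == s || *end != '\0' || v < 0)
        return false;

    *out = v;
    return true;
}
```

With this shape, `ceph osd down -` and `ceph osd down %` would be refused at parse time instead of silently becoming osd.0.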
Client crash on getcwd with non-default root mount
I'm getting the assert failure below with the following test: ceph_mount(cmount, "/otherdir"); ceph_getcwd(cmount); -- client/Inode.h: In function 'Dentry* Inode::get_first_parent()' thread 7fded47c8780 time 2012-11-29 11:49:00.890184 client/Inode.h: 165: FAILED assert(!dn_set.empty()) ceph version 0.54-808-g1ed5a1f (1ed5a1f984d8260d86cc25b1ae95ffedf597e579) 1: (()+0x11ee89) [0x7fded36fae89] 2: (()+0x1429d3) [0x7fded371e9d3] 3: (ceph_getcwd()+0x11) [0x7fded36fdb41] 4: (MountedTest2_XYZ_Test::TestBody()+0x63a) [0x42563a] 5: (testing::Test::Run()+0xaa) [0x45017a] 6: (testing::internal::TestInfoImpl::Run()+0x100) [0x450280] 7: (testing::TestCase::Run()+0xbd) [0x45034d] 8: (testing::internal::UnitTestImpl::RunAllTests()+0x217) [0x4505b7] 9: (main()+0x35) [0x423115] 10: (__libc_start_main()+0xed) [0x7fded2d2876d] 11: /home/nwatkins/projects/ceph/ceph/src/.libs/lt-test_libcephfs() [0x423171] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. terminate called after throwing an instance of 'ceph::FailedAssertion' Aborted (core dumped) Thanks, Noah
Re: Client crash on getcwd with non-default root mount
On 11/29/2012 01:52 PM, Noah Watkins wrote: I'm getting the assert failure below with the following test: ceph_mount(cmount, "/otherdir"); This should fail with ENOENT if you check the return code. -sam ceph_getcwd(cmount); -- client/Inode.h: In function 'Dentry* Inode::get_first_parent()' thread 7fded47c8780 time 2012-11-29 11:49:00.890184 client/Inode.h: 165: FAILED assert(!dn_set.empty()) ceph version 0.54-808-g1ed5a1f (1ed5a1f984d8260d86cc25b1ae95ffedf597e579) 1: (()+0x11ee89) [0x7fded36fae89] 2: (()+0x1429d3) [0x7fded371e9d3] 3: (ceph_getcwd()+0x11) [0x7fded36fdb41] 4: (MountedTest2_XYZ_Test::TestBody()+0x63a) [0x42563a] 5: (testing::Test::Run()+0xaa) [0x45017a] 6: (testing::internal::TestInfoImpl::Run()+0x100) [0x450280] 7: (testing::TestCase::Run()+0xbd) [0x45034d] 8: (testing::internal::UnitTestImpl::RunAllTests()+0x217) [0x4505b7] 9: (main()+0x35) [0x423115] 10: (__libc_start_main()+0xed) [0x7fded2d2876d] 11: /home/nwatkins/projects/ceph/ceph/src/.libs/lt-test_libcephfs() [0x423171] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. terminate called after throwing an instance of 'ceph::FailedAssertion' Aborted (core dumped) Thanks, Noah
Re: Client crash on getcwd with non-default root mount
Oh, let me clarify. /otherdir exists, and the mount succeeds. - Noah On Thu, Nov 29, 2012 at 11:58 AM, Sam Lang sam.l...@inktank.com wrote: On 11/29/2012 01:52 PM, Noah Watkins wrote: I'm getting the assert failure below with the following test: ceph_mount(cmount, "/otherdir"); This should fail with ENOENT if you check the return code. -sam ceph_getcwd(cmount); -- client/Inode.h: In function 'Dentry* Inode::get_first_parent()' thread 7fded47c8780 time 2012-11-29 11:49:00.890184 client/Inode.h: 165: FAILED assert(!dn_set.empty()) ceph version 0.54-808-g1ed5a1f (1ed5a1f984d8260d86cc25b1ae95ffedf597e579) 1: (()+0x11ee89) [0x7fded36fae89] 2: (()+0x1429d3) [0x7fded371e9d3] 3: (ceph_getcwd()+0x11) [0x7fded36fdb41] 4: (MountedTest2_XYZ_Test::TestBody()+0x63a) [0x42563a] 5: (testing::Test::Run()+0xaa) [0x45017a] 6: (testing::internal::TestInfoImpl::Run()+0x100) [0x450280] 7: (testing::TestCase::Run()+0xbd) [0x45034d] 8: (testing::internal::UnitTestImpl::RunAllTests()+0x217) [0x4505b7] 9: (main()+0x35) [0x423115] 10: (__libc_start_main()+0xed) [0x7fded2d2876d] 11: /home/nwatkins/projects/ceph/ceph/src/.libs/lt-test_libcephfs() [0x423171] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. terminate called after throwing an instance of 'ceph::FailedAssertion' Aborted (core dumped) Thanks, Noah
Re: Client crash on getcwd with non-default root mount
Here is the full test case:

TEST(LibCephFS, MountRootChdir) {
  struct ceph_mount_info *cmount;

  /* create mount and new directory */
  ASSERT_EQ(ceph_create(&cmount, NULL), 0);
  ASSERT_EQ(ceph_conf_read_file(cmount, NULL), 0);
  ASSERT_EQ(ceph_mount(cmount, "/"), 0);
  ASSERT_EQ(ceph_mkdir(cmount, "/xyz", 0700), 0);
  ceph_shutdown(cmount);

  /* create mount with non-"/" root */
  ASSERT_EQ(ceph_create(&cmount, NULL), 0);
  ASSERT_EQ(ceph_conf_read_file(cmount, NULL), 0);
  ASSERT_EQ(ceph_mount(cmount, "/xyz"), 0);

  /* should be at root directory, but blows up */
  ASSERT_STREQ(ceph_getcwd(cmount), "/");
}

On Thu, Nov 29, 2012 at 12:02 PM, Noah Watkins jayh...@cs.ucsc.edu wrote: Oh, let me clarify. /otherdir exists, and the mount succeeds. - Noah
Re: rbd map command hangs for 15 minutes during system start up
On 11/22/2012 12:04 PM, Nick Bartos wrote: Here are the ceph log messages (including the libceph kernel debug stuff you asked for) from a node boot with the rbd command hung for a couple of minutes:

Nick, I have put together a branch that includes two fixes that might be helpful. I don't expect these fixes will necessarily *fix* what you're seeing, but one of them pulls a big hunk of processing out of the picture and might help eliminate some potential causes. I had to pull in several other patches as prerequisites in order to get those fixes to apply cleanly. Would you be able to give it a try, and let us know what results you get?

The branch contains:
- Linux 3.5.5
- Plus the first 49 patches you listed
- Plus four patches, which are prerequisites...
    libceph: define ceph_extract_encoded_string()
    rbd: define some new format constants
    rbd: define rbd_dev_image_id()
    rbd: kill create_snap sysfs entry
- ...for these two bug fixes:
    libceph: remove 'osdtimeout' option
    ceph: don't reference req after put

The branch is available in the ceph-client git repository under the name wip-nick and has commit id dd9323aa. https://github.com/ceph/ceph-client/tree/wip-nick

https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt

This full debug output is very helpful. Please supply that again as well. Thanks. -Alex

On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos n...@pistoncloud.com wrote: It's very easy to reproduce now with my automated install script; the most I've seen it succeed with that patch is 2 in a row, hanging on the 3rd, although it hangs on most builds. So it shouldn't take much to get it to do it again. I'll try to get to that tomorrow, when I'm a bit more rested and my brain is working better. Yes, during this the OSDs are probably all syncing up. All the osd and mon daemons have started by the time the rbd commands are run, though.
On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil s...@inktank.com wrote: On Wed, 21 Nov 2012, Nick Bartos wrote: FYI the build which included all 3.5 backports except patch #50 is still going strong after 21 builds. Okay, that one at least makes some sense. I've opened http://tracker.newdream.net/issues/3519 How easy is this to reproduce? If it is something you can trigger with debugging enabled (echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control) that would help tremendously. I'm guessing that during this startup time the OSDs are still in the process of starting? Alex, I bet that a test that does a lot of map/unmap stuff in a loop while thrashing OSDs could hit this. Thanks! sage

On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos n...@pistoncloud.com wrote: With 8 successful installs already done, I'm reasonably confident that it's patch #50. I'm making another build which applies all patches from the 3.5 backport branch, excluding that specific one. I'll let you know if that turns up any unexpected failures. What will the potential fallout be of removing that specific patch?

On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos n...@pistoncloud.com wrote: It's really looking like it's the libceph_resubmit_linger_ops_when_pg_mapping_changes commit. When patches 1-50 (listed below) are applied to 3.5.7, the hang is present. So far I have gone through 4 successful installs with no hang with only 1-49 applied. I'm still leaving my test run to make sure it's not a fluke, but since previously it hung within the first couple of builds, it really looks like this is where the problem originated.
1-libceph_eliminate_connection_state_DEAD.patch
2-libceph_kill_bad_proto_ceph_connection_op.patch
3-libceph_rename_socket_callbacks.patch
4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch
6-libceph_start_separating_connection_flags_from_state.patch
7-libceph_start_tracking_connection_socket_state.patch
8-libceph_provide_osd_number_when_creating_osd.patch
9-libceph_set_CLOSED_state_bit_in_con_init.patch
10-libceph_embed_ceph_connection_structure_in_mon_client.patch
11-libceph_drop_connection_refcounting_for_mon_client.patch
12-libceph_init_monitor_connection_when_opening.patch
13-libceph_fully_initialize_connection_in_con_init.patch
14-libceph_tweak_ceph_alloc_msg.patch
15-libceph_have_messages_point_to_their_connection.patch
16-libceph_have_messages_take_a_connection_reference.patch
17-libceph_make_ceph_con_revoke_a_msg_operation.patch
18-libceph_make_ceph_con_revoke_message_a_msg_op.patch
19-libceph_fix_overflow_in___decode_pool_names.patch
20-libceph_fix_overflow_in_osdmap_decode.patch
21-libceph_fix_overflow_in_osdmap_apply_incremental.patch
22-libceph_transition_socket_state_prior_to_actual_connect.patch
23-libceph_fix_NULL_dereference_in_reset_connection.patch
24-libceph_use_con_get_put_methods.patch
Re: Client crash on getcwd with non-default root mount
On 11/29/2012 02:12 PM, Noah Watkins wrote: Here is the full test case: Sorry - I was assuming it was just an issue with checking the return code. I've pushed a one-line fix to wip-mount-subdir. You can cherry-pick it to your branch if you want. -sam
Re: RBD: periodic cephx issue ? CephxAuthorizeHandler::verify_authorizer isvalid=0
fwding to the list as I forgot to hit reply all ... Can you attach/post the whole log somewhere? I'm curious what is leading up to it not having secret_id=0. Ideally with 'debug auth = 20' and 'debug osd = 20' and 'debug ms = 1'. Well, without the debug options there isn't anything else than that in the log, just this snippet several times. I haven't caught it with the log active ... unfortunately I can't leave it more than a couple of hours with those options, as it fills up the root filesystem with logs way too fast, and it seems not to happen when you have just restarted an OSD ... Usually it shows up in bursts (10 times or so in an hour, then nothing for a day, for example). Cheers, Sylvain
[PATCHv5] rbd block driver fix race between aio completion and aio cancel
This one fixes a race which qemu had also in the iscsi block driver between cancellation and io completion. qemu_rbd_aio_cancel was not synchronously waiting for the end of the command. To achieve this it introduces a new status flag which uses -EINPROGRESS.

Changes since PATCHv4:
- removed unnecessary qemu_vfree of acb->bounce as BH will always run

Changes since PATCHv3:
- removed unnecessary if condition in rbd_start_aio as we haven't started io yet
- moved acb->status = 0 to rbd_aio_bh_cb so qemu_aio_wait always waits until BH was executed

Changes since PATCHv2:
- fixed missing braces
- added vfree for bounce

Signed-off-by: Stefan Priebe s.pri...@profihost.ag
---
 block/rbd.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index f3becc7..3bc9c7a 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -77,6 +77,7 @@ typedef struct RBDAIOCB {
     int error;
     struct BDRVRBDState *s;
     int cancelled;
+    int status;
 } RBDAIOCB;

 typedef struct RADOSCB {
@@ -376,12 +377,6 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb)
     RBDAIOCB *acb = rcb->acb;
     int64_t r;

-    if (acb->cancelled) {
-        qemu_vfree(acb->bounce);
-        qemu_aio_release(acb);
-        goto done;
-    }
-
     r = rcb->ret;

     if (acb->cmd == RBD_AIO_WRITE ||
@@ -409,7 +404,6 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb)
     /* Note that acb->bh can be NULL in case where the aio was cancelled */
     acb->bh = qemu_bh_new(rbd_aio_bh_cb, acb);
     qemu_bh_schedule(acb->bh);
-done:
     g_free(rcb);
 }

@@ -568,6 +562,10 @@ static void qemu_rbd_aio_cancel(BlockDriverAIOCB *blockacb)
 {
     RBDAIOCB *acb = (RBDAIOCB *) blockacb;
     acb->cancelled = 1;
+
+    while (acb->status == -EINPROGRESS) {
+        qemu_aio_wait();
+    }
 }

 static const AIOCBInfo rbd_aiocb_info = {
@@ -639,6 +637,7 @@ static void rbd_aio_bh_cb(void *opaque)
     acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
     qemu_bh_delete(acb->bh);
     acb->bh = NULL;
+    acb->status = 0;
     qemu_aio_release(acb);
 }

@@ -685,6 +684,7 @@ static BlockDriverAIOCB *rbd_start_aio(BlockDriverState *bs,
     acb->s = s;
     acb->cancelled = 0;
     acb->bh = NULL;
+    acb->status = -EINPROGRESS;

     if (cmd == RBD_AIO_WRITE) {
         qemu_iovec_to_buf(acb->qiov, 0, acb->bounce, qiov->size);
--
1.7.10.4
Re: [PATCHv4] rbd block driver fix race between aio completion and aio cancel
Hi Paolo, On 29.11.2012 16:23, Paolo Bonzini wrote: +qemu_vfree(acb->bounce); This vfree is not needed, since the BH will run and do the free. New patch v5 sent. Greets, Stefan
Re: parsing in the ceph osd subsystem
On 11/29/2012 07:01 PM, Andrey Korolyov wrote: On Thu, Nov 29, 2012 at 8:34 PM, Sage Weil s...@inktank.com wrote: On Thu, 29 Nov 2012, Andrey Korolyov wrote:
$ ceph osd down -
osd.0 is already down
$ ceph osd down ---
osd.0 is already down
The same for ``+'', ``/'', ``%'' and so on - I think that for the osd subsystem the ceph cli should explicitly work only with positive integers plus zero, refusing all other input. Which branch is this? This parsing is cleaned up in the latest next/master. It was produced by the 0.54 tag. I have built dd3a24a647d0b0f1153cf1b102ed1f51d51be2f2 today and the problem is gone (except parsing ``-0'' as 0, and 0/001 as 0 and 1 respectively). A fix for the signed parameter has been pushed to next. However, after consideration, when it comes to the '0+\d' parameters, that kind of input was considered valid; Greg put it best on IRC, and I quote: gregaf joao: not sure we want to prevent 01 from parsing as 1, I suspect some people with large clusters will find that useful so they can conflate the name and ID while keeping everything three digits Hope this makes sense to you. -Joao
Re: What is the new command to add osd to the crushmap to enable it to receive data
On 11/30/2012 01:22 AM, Isaac Otsiabah wrote: The command below, which adds a new osd to the crushmap to enable it to receive data, has changed and does not work anymore. ceph osd crush set {id} {name} Please, what is the new command to add a new osd to the crushmap to enable it to receive data? You must specify a weight and a location. For instance: ceph osd crush set 0 osd.0 1.0 root=default Also, you can check the docs for more info. From the docs at [1]: Add the OSD to the CRUSH map so that it can begin receiving data. You may also decompile the CRUSH map, add the OSD to the device list, add the host as a bucket (if it's not already in the CRUSH map), add the device as an item in the host, assign it a weight, recompile it and set it. See Add/Move an OSD for details. ceph osd crush set {id} {name} {weight} pool={pool-name} [{bucket-type}={bucket-name} ...] [1] http://ceph.com/docs/master/rados/operations/add-or-rm-osds/ -Joao
[PATCH 1/5] mds: fix journaling issue regarding rstat accounting
From: Yan, Zheng zheng.z@intel.com

The rename operation can call predirty_journal_parents() twice, so a directory fragment's rstat can be modified twice. But only the first modification is journaled, because EMetaBlob::add_dir() does not update an existing dirlump.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 src/mds/events/EMetaBlob.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/mds/events/EMetaBlob.h b/src/mds/events/EMetaBlob.h
index 9c281e9..116b704 100644
--- a/src/mds/events/EMetaBlob.h
+++ b/src/mds/events/EMetaBlob.h
@@ -635,12 +635,12 @@ private:
 		    dirty, complete, isnew);
   }
   dirlump& add_dir(dirfrag_t df, fnode_t *pf, version_t pv, bool dirty,
 		   bool complete=false, bool isnew=false) {
-    if (lump_map.count(df) == 0) {
+    if (lump_map.count(df) == 0)
       lump_order.push_back(df);
-      lump_map[df].fnode = *pf;
-      lump_map[df].fnode.version = pv;
-    }
+    dirlump& l = lump_map[df];
+    l.fnode = *pf;
+    l.fnode.version = pv;
     if (complete) l.mark_complete();
     if (dirty) l.mark_dirty();
     if (isnew) l.mark_new();
--
1.7.11.7
[PATCH 0/5] mds: fixes for mds
From: Yan, Zheng zheng.z@intel.com Hi, The 1st patch fixes an rstat accounting bug. The 5th patch fixes a journal replay bug; the fix requires a minor disk format change. These patches are also in: git://github.com/ukernel/ceph.git wip-mds Regards, Yan, Zheng
[PATCH 4/5] mds: call eval() after cap is imported
From: Yan, Zheng zheng.z@intel.com

The migrator calls eval() for imported caps after importing a directory tree. We should do the same thing after importing an inode.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 src/mds/Migrator.cc | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/mds/Migrator.cc b/src/mds/Migrator.cc
index 41d97e9..fcc06cd 100644
--- a/src/mds/Migrator.cc
+++ b/src/mds/Migrator.cc
@@ -2613,6 +2613,7 @@ void Migrator::logged_import_caps(CInode *in,
   assert(cap_imports.count(in));
   finish_import_inode_caps(in, from, cap_imports[in]);
+  mds->locker->eval(in, CEPH_CAP_LOCKS);
   mds->send_message_mds(new MExportCapsAck(in->ino()), from);
 }
--
1.7.11.7
[PATCH 2/5] mds: fix handle_client_openc() hang
From: Yan, Zheng zheng.z@intel.com

handle_client_openc() calls handle_client_open() if the linkage isn't null. handle_client_open() calls rdlock_path_pin_ref(), which returns mdr->in[0] directly because mdr->done_locking is true. The problem here is that mdr->in[0] can be NULL if the linkage is remote.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 src/mds/Server.cc | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/src/mds/Server.cc b/src/mds/Server.cc
index 4c66f4a..59d7d3c 100644
--- a/src/mds/Server.cc
+++ b/src/mds/Server.cc
@@ -2637,6 +2637,12 @@ void Server::handle_client_openc(MDRequest *mdr)
       reply_request(mdr, -EEXIST, dnl->get_inode(), dn);
       return;
     }
+
+    mdcache->request_drop_non_rdlocks(mdr);
+
+    // remote link, avoid rdlock_path_pin_ref() returning null
+    if (!mdr->in[0])
+      mdr->done_locking = false;
     handle_client_open(mdr);
     return;
--
1.7.11.7
[PATCH 3/5] mds: allow handle_client_readdir() to fetch a freezing dir.
From: Yan, Zheng zheng.z@intel.com

At that point, the request already auth pins some objects, so CDir::fetch() should ignore the can_auth_pin check.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 src/mds/Server.cc | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/src/mds/Server.cc b/src/mds/Server.cc
index 59d7d3c..2c59f25 100644
--- a/src/mds/Server.cc
+++ b/src/mds/Server.cc
@@ -2741,9 +2741,14 @@ void Server::handle_client_readdir(MDRequest *mdr)
   assert(dir->is_auth());

   if (!dir->is_complete()) {
+    if (dir->is_frozen()) {
+      dout(7) << "dir is frozen " << *dir << dendl;
+      dir->add_waiter(CDir::WAIT_UNFREEZE, new C_MDS_RetryRequest(mdcache, mdr));
+      return;
+    }
     // fetch
     dout(10) << "incomplete dir contents for readdir on " << *dir << ", fetching" << dendl;
-    dir->fetch(new C_MDS_RetryRequest(mdcache, mdr));
+    dir->fetch(new C_MDS_RetryRequest(mdcache, mdr), true);
     return;
   }
--
1.7.11.7
[PATCH 5/5] mds: compare sessionmap version before replaying imported sessions
From: Yan, Zheng zheng.z@intel.com

Otherwise we may wrongly increase mds->sessionmap.version, which will confuse future journal replays that involve the sessionmap.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 src/mds/Server.cc        |  2 ++
 src/mds/events/EUpdate.h |  8 ++++++--
 src/mds/journal.cc       | 24 +++++++++++++++---------
 3 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/src/mds/Server.cc b/src/mds/Server.cc
index 2c59f25..a10c503 100644
--- a/src/mds/Server.cc
+++ b/src/mds/Server.cc
@@ -5425,6 +5425,8 @@ void Server::handle_client_rename(MDRequest *mdr)
   }

   _rename_prepare(mdr, &le->metablob, &le->client_map, srcdn, destdn, straydn);
+  if (le->client_map.length())
+    le->cmapv = mds->sessionmap.projected;

   // -- commit locally --
   C_MDS_rename_finish *fin = new C_MDS_rename_finish(mds, mdr, srcdn, destdn, straydn);

diff --git a/src/mds/events/EUpdate.h b/src/mds/events/EUpdate.h
index 6ce18fe..a302a5a 100644
--- a/src/mds/events/EUpdate.h
+++ b/src/mds/events/EUpdate.h
@@ -23,13 +23,14 @@ public:
   EMetaBlob metablob;
   string type;
   bufferlist client_map;
+  version_t cmapv;
   metareqid_t reqid;
   bool had_slaves;

   EUpdate() : LogEvent(EVENT_UPDATE) { }
   EUpdate(MDLog *mdlog, const char *s) :
     LogEvent(EVENT_UPDATE), metablob(mdlog),
-    type(s), had_slaves(false) { }
+    type(s), cmapv(0), had_slaves(false) { }

   void print(ostream& out) {
     if (type.length())
@@ -38,12 +39,13 @@ public:
   }

   void encode(bufferlist& bl) const {
-    __u8 struct_v = 2;
+    __u8 struct_v = 3;
     ::encode(struct_v, bl);
     ::encode(stamp, bl);
     ::encode(type, bl);
     ::encode(metablob, bl);
     ::encode(client_map, bl);
+    ::encode(cmapv, bl);
     ::encode(reqid, bl);
     ::encode(had_slaves, bl);
   }
@@ -55,6 +57,8 @@ public:
     ::decode(type, bl);
     ::decode(metablob, bl);
     ::decode(client_map, bl);
+    if (struct_v >= 3)
+      ::decode(cmapv, bl);
     ::decode(reqid, bl);
     ::decode(had_slaves, bl);
   }

diff --git a/src/mds/journal.cc b/src/mds/journal.cc
index 04b1a92..3f6d5eb 100644
--- a/src/mds/journal.cc
+++ b/src/mds/journal.cc
@@ -999,14 +999,24 @@ void EUpdate::replay(MDS *mds)
     mds->mdcache->add_uncommitted_master(reqid, _segment, slaves);
   }

-  // open client sessions?
-  map<client_t,entity_inst_t> cm;
-  map<client_t, uint64_t> seqm;
   if (client_map.length()) {
-    bufferlist::iterator blp = client_map.begin();
-    ::decode(cm, blp);
-    mds->server->prepare_force_open_sessions(cm, seqm);
-    mds->server->finish_force_open_sessions(cm, seqm);
+    if (mds->sessionmap.version >= cmapv) {
+      dout(10) << "EUpdate.replay sessionmap v " << cmapv
+	       << " <= table " << mds->sessionmap.version << dendl;
+    } else {
+      dout(10) << "EUpdate.replay sessionmap " << mds->sessionmap.version
+	       << " < " << cmapv << dendl;
+      // open client sessions?
+      map<client_t,entity_inst_t> cm;
+      map<client_t, uint64_t> seqm;
+      bufferlist::iterator blp = client_map.begin();
+      ::decode(cm, blp);
+      mds->server->prepare_force_open_sessions(cm, seqm);
+      mds->server->finish_force_open_sessions(cm, seqm);
+
+      assert(mds->sessionmap.version >= cmapv);
+      mds->sessionmap.projected = mds->sessionmap.version;
+    }
   }
 }
--
1.7.11.7
[PATCH] ceph: re-calculate truncate_size for strip object
From: Yan, Zheng zheng.z@intel.com

Otherwise the osd may truncate the object to a larger size.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 net/ceph/osd_client.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index ccbdfbb..f8b0e56 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -76,8 +76,16 @@ int ceph_calc_raw_layout(struct ceph_osd_client *osdc,
 		   orig_len - *plen, off, *plen);

 	if (op_has_extent(op->op)) {
+		u32 osize = le32_to_cpu(layout->fl_object_size);
 		op->extent.offset = objoff;
 		op->extent.length = objlen;
+		if (op->extent.truncate_size <= off - objoff) {
+			op->extent.truncate_size = 0;
+		} else {
+			op->extent.truncate_size -= off - objoff;
+			if (op->extent.truncate_size > osize)
+				op->extent.truncate_size = osize;
+		}
 	}
 	req->r_num_pages = calc_pages_for(off, *plen);
 	req->r_page_alignment = off & ~PAGE_MASK;
--
1.7.11.7
Re: OSD daemon changes port no
Hi Sage, Community,
I am unable to use 2 directories to direct data to 2 different pools. I did the following experiment: created 2 pools, "host" and "ghost", to separate data placement.

--//crushmap file ---
# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 pool
type 7 ghost

# buckets
host hemantone-mirror-virtual-machine {
	id -6		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.2 weight 1.000
}
host hemantone-virtual-machine {
	id -7		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.1 weight 1.000
}
rack one {
	id -2		# do not change unnecessarily
	# weight 2.000
	alg straw
	hash 0	# rjenkins1
	item hemantone-mirror-virtual-machine weight 1.000
	item hemantone-virtual-machine weight 1.000
}
ghost hemant-virtual-machine {
	id -4		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.0 weight 1.000
}
ghost hemant-mirror-virtual-machine {
	id -5		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.3 weight 1.000
}
rack two {
	id -3		# do not change unnecessarily
	# weight 2.000
	alg straw
	hash 0	# rjenkins1
	item hemant-virtual-machine weight 1.000
	item hemant-mirror-virtual-machine weight 1.000
}
pool default {
	id -1		# do not change unnecessarily
	# weight 4.000
	alg straw
	hash 0	# rjenkins1
	item one weight 2.000
	item two weight 2.000
}

# rules
rule data {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step take one
	step chooseleaf firstn 0 type host
	step emit
}
rule metadata {
	ruleset 1
	type replicated
	min_size 1
	max_size 10
	step take default
	step take one
	step chooseleaf firstn 0 type host
	step emit
}
rule rbd {
	ruleset 2
	type replicated
	min_size 1
	max_size 10
	step take default
	step take one
	step chooseleaf firstn 0 type host
	step emit
}
rule forhost {
	ruleset 3
	type replicated
	min_size 1
	max_size 10
	step take default
	step take one
	step chooseleaf firstn 0 type host
	step emit
}
rule forghost {
	ruleset 4
	type replicated
	min_size 1
	max_size 10
	step take default
	step take two
	step chooseleaf firstn 0 type ghost
	step emit
}
# end crush map

1) Set the replication factor to 2, and the crush rules accordingly (the host pool got crush_ruleset = 3, the ghost pool got crush_ruleset = 4).
2) Now I mounted data dirs using:
   mount.ceph 10.72.148.245:6789:/ /home/hemant/x
   mount.ceph 10.72.148.245:6789:/ /home/hemant/y
3) Then:
   mds add_data_pool 5
   mds add_data_pool 6
   (here the pool ids are host = 5, ghost = 6)
4) cephfs /home/hemant/x set_layout --pool 5 -c 1 -u 4194304 -s 4194304
   cephfs /home/hemant/y set_layout --pool 6 -c 1 -u 4194304 -s 4194304

PROBLEM:
$ cephfs /home/hemant/x show_layout
layout.data_pool: 6
layout.object_size: 4194304
layout.stripe_unit: 4194304
layout.stripe_count: 1
$ cephfs /home/hemant/y show_layout
layout.data_pool: 6
layout.object_size: 4194304
layout.stripe_unit: 4194304
layout.stripe_count: 1

Both dirs are using the same pool to place data, even after I specified separate pools using the cephfs cmd. Please help me figure this out. - Hemant Surale.

On Thu, Nov 29, 2012 at 3:45 PM, hemant surale hemant.sur...@gmail.com wrote: does 'ceph mds dump' list pool 3 in the data_pools line? Yes. It lists the desired pool ids I wanted to put data in. -- Forwarded message -- From: hemant surale hemant.sur...@gmail.com Date: Thu, Nov 29, 2012 at 2:59 PM Subject: Re: OSD daemon changes port no To: Sage Weil s...@inktank.com I used a little different version of cephfs, as: cephfs /home/hemant/a set_layout --pool 3 -c 1 -u 4194304 -s 4194304 and cephfs /home/hemant/b set_layout --pool 5 -c 1 -u 4194304 -s 4194304. Now the cmd didn't show any error, but when I put data to dirs a and b, ideally it should go to different pools, and it's not working as of now. Whatever I am doing, is it possible (to use 2 dirs pointing to 2 different pools for data placement)? - Hemant Surale. On Tue, Nov 27, 2012 at 10:21 PM, Sage Weil