rbd command to display free space in a cluster ?
Hi, I'm looking for a way to retrieve the free space of an rbd cluster with the rbd command. Any hint? (Something like ceph -w status, but without needing to parse the result.)

Regards,
Alexandre
Re: OSD::mkfs: couldn't mount FileStore: error -22
> current/ is a btrfs subvolume.. 'btrfs sub delete current' will remove it.

Ah, that worked, thanks. Unfortunately mkcephfs still fails with the same error.

> The warning in the previous email suggests you're running a fairly old kernel.. there is probably something handled incorrectly during the fs init process. Exactly which kernel are you running?

2.6.32 - apparently the latest in Debian stable. I figured this was workable since ceph.com offers packages for Debian stable.

> In any case, btrfs isn't going to work particularly well on something that old; I suggest running something newer (3.5 or 3.6) or switching to XFS.

Ok, fair enough. I'll have to see how practical it is to get a more recent kernel going, otherwise I'll go down the XFS route.

Thanks again,
Adam.
[PATCH] rbd: zero return code in rbd_dev_image_id()
There is a call in rbd_dev_image_id() to rbd_req_sync_exec() to get the image id for an image. Despite the get_id class method only returning 0 on success, I am getting back a positive value (I think the number of bytes returned with the call). That may or may not be how rbd_req_sync_exec() is supposed to behave, but zeroing the return value if successful makes it moot and makes this path through the code work as desired. Do the same in rbd_dev_v2_object_prefix().

Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c | 2 ++
 1 file changed, 2 insertions(+)

Index: b/drivers/block/rbd.c
===================================================================
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -2207,6 +2207,7 @@ static int rbd_dev_v2_object_prefix(stru
 	dout("%s: rbd_req_sync_exec returned %d\n", __func__, ret);
 	if (ret < 0)
 		goto out;
+	ret = 0;	/* rbd_req_sync_exec() can return positive */
 
 	p = reply_buf;
 	rbd_dev->header.object_prefix = ceph_extract_encoded_string(p,
@@ -2900,6 +2901,7 @@ static int rbd_dev_image_id(struct rbd_d
 	dout("%s: rbd_req_sync_exec returned %d\n", __func__, ret);
 	if (ret < 0)
 		goto out;
+	ret = 0;	/* rbd_req_sync_exec() can return positive */
 
 	p = response;
 	rbd_dev->image_id = ceph_extract_encoded_string(p,
[PATCH] rbd: kill rbd_device-rbd_opts
The rbd_device structure has an embedded rbd_options structure. Such a structure is needed to work with the generic ceph argument parsing code, but there's no need to keep it around once argument parsing is done. Use a local variable to hold the rbd options used in parsing in rbd_get_client(), and just transfer its content (it's just a read_only flag) into the field in the rbd_mapping sub-structure that requires that information.

Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

Index: b/drivers/block/rbd.c
===================================================================
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -181,7 +181,6 @@ struct rbd_device {
 	struct gendisk		*disk;		/* blkdev's gendisk and rq */
 
 	u32			image_format;	/* Either 1 or 2 */
-	struct rbd_options	rbd_opts;
 	struct rbd_client	*rbd_client;
 
 	char			name[DEV_NAME_LEN]; /* blkdev name, e.g. rbd3 */
@@ -453,18 +452,24 @@ static int parse_rbd_opts_token(char *c,
 static int rbd_get_client(struct rbd_device *rbd_dev, const char *mon_addr,
 			  size_t mon_addr_len, char *options)
 {
-	struct rbd_options *rbd_opts = &rbd_dev->rbd_opts;
+	struct rbd_options rbd_opts;
 	struct ceph_options *ceph_opts;
 	struct rbd_client *rbdc;
 
-	rbd_opts->read_only = RBD_READ_ONLY_DEFAULT;
+	/* Initialize all rbd options to the defaults */
+
+	rbd_opts.read_only = RBD_READ_ONLY_DEFAULT;
 
 	ceph_opts = ceph_parse_options(options, mon_addr,
 					mon_addr + mon_addr_len,
-					parse_rbd_opts_token, rbd_opts);
+					parse_rbd_opts_token, &rbd_opts);
 	if (IS_ERR(ceph_opts))
 		return PTR_ERR(ceph_opts);
 
+	/* Record the parsed rbd options */
+
+	rbd_dev->mapping.read_only = rbd_opts.read_only;
+
 	rbdc = rbd_client_find(ceph_opts);
 	if (rbdc) {
 		/* using an existing client */
@@ -672,7 +677,6 @@ static int rbd_dev_set_mapping(struct rb
 		rbd_dev->mapping.size = rbd_dev->header.image_size;
 		rbd_dev->mapping.features = rbd_dev->header.features;
 		rbd_dev->mapping.snap_exists = false;
-		rbd_dev->mapping.read_only = rbd_dev->rbd_opts.read_only;
 		ret = 0;
 	} else {
 		ret = snap_by_name(rbd_dev, snap_name);
Re: rbd command to display free space in a cluster ?
On Mon, 15 Oct 2012, Alexandre DERUMIER wrote:
> Hi, I'm looking for a way to retrieve the free space of an rbd cluster with the rbd command. Any hint? (Something like ceph -w status, but without needing to parse the result.)
> Regards,
> Alexandre

rados df is the closest.

sage
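A note on doing this programmatically: the same cluster-level totals should be reachable through the librados bindings instead of parsing rados df or ceph -s output. A minimal sketch in Python, assuming the python-rados bindings are installed and expose get_cluster_stats() (the method name and the returned keys are the assumption here; check the API of your version):

import rados

# Sketch: read cluster capacity/usage via librados rather than scraping CLI
# output. Assumes python-rados provides get_cluster_stats() returning KB
# totals and an object count.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    stats = cluster.get_cluster_stats()
    print "total:     %d KB" % stats['kb']
    print "used:      %d KB" % stats['kb_used']
    print "available: %d KB" % stats['kb_avail']
    print "objects:   %d" % stats['num_objects']
finally:
    cluster.shutdown()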
Ceph benchmark high wait on journal device
Hi,
inspired by the performance tests Mark did, I tried to put together my own. I have four OSD processes on one node; each process has an Intel 710 SSD for its journal and 4 SAS disks via an LSI 9266-8i in RAID 0. If I test the SSDs with fio they are quite fast and the w_wait time is quite low. But if I run rados bench on the cluster, the w_wait times for the journal devices are quite high (around 20-40ms). I thought the SSDs would do better - any ideas what happened here?

-martin

Logs:

/dev/sd{c,d,e,f} Intel SSD 710 200G
/dev/sd{g,h,i,j} each 4 x SAS on LSI 9266-8i Raid 0

fio -name iops -rw=write -size=10G -iodepth 1 -filename /dev/sdc2 -ioengine libaio -direct 1 -bs 256k

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
- snip -
sdc 0,00 0,00 0,00 809,20 0,00 202,30 512,00 0,96 1,19 0,00 1,19 1,18 95,84
- snap -

rados bench -p rbd 300 write -t 16
2012-10-15 17:53:17.058383 min lat: 0.035382 max lat: 0.469604 avg lat: 0.189553
  sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
  300     16   25329    25313  337.443      324 0.274815 0.189553
Total time run:         300.169843
Total writes made:      25329
Write size:             4194304
Bandwidth (MB/sec):     337.529
Stddev Bandwidth:       25.1568
Max bandwidth (MB/sec): 372
Min bandwidth (MB/sec): 0
Average Latency:        0.189597
Stddev Latency:         0.0641609
Max latency:            0.469604
Min latency:            0.035382

during the rados bench test:

avg-cpu: %user %nice %system %iowait %steal %idle
         20,38  0,00   16,20    8,87   0,00 54,55

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0,00 41,20 0,00 12,40 0,00 0,35 57,42 0,00 0,31 0,00 0,31 0,31 0,38
sdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00
sdc 0,00 0,00 0,00 332,80 0,00 139,67 859,53 7,36 22,09 0,00 22,09 2,12 70,42
sdd 0,00 0,00 0,00 391,60 0,00 175,84 919,62 15,59 39,62 0,00 39,62 2,40 93,80
sde 0,00 0,00 0,00 342,00 0,00 147,39 882,59 8,54 24,89 0,00 24,89 2,18 74,58
sdf 0,00 0,00 0,00 362,20 0,00 162,72 920,05 15,35 42,50 0,00 42,50 2,60 94,20
sdg 0,00 0,00 0,00 522,00 0,00 139,20 546,13 0,28 0,54 0,00 0,54 0,10 5,26
sdh 0,00 0,00 0,00 672,00 0,00 179,20 546,13 9,67 14,42 0,00 14,42 0,61 41,18
sdi 0,00 0,00 0,00 555,00 0,00 148,00 546,13 0,32 0,57 0,00 0,57 0,10 5,46
sdj 0,00 0,00 0,00 582,00 0,00 155,20 546,13 0,51 0,87 0,00 0,87 0,12 6,96

100 seconds later

avg-cpu: %user %nice %system %iowait %steal %idle
         22,92  0,00   19,57    9,25   0,00 48,25

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0,00 40,80 0,00 15,60 0,00 0,36 47,08 0,00 0,22 0,00 0,22 0,22 0,34
sdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00
sdc 0,00 0,00 0,00 386,60 0,00 168,33 891,70 12,11 31,08 0,00 31,08 2,25 86,86
sdd 0,00 0,00 0,00 405,00 0,00 183,06 925,68 15,68 38,70 0,00 38,70 2,34 94,90
sde 0,00 0,00 0,00 411,00 0,00 185,06 922,15 15,58 38,09 0,00 38,09 2,33 95,92
sdf 0,00 0,00 0,00 387,00 0,00 168,33 890,79 12,19 31,48 0,00 31,48 2,26 87,48
sdg 0,00 0,00 0,00 646,20 0,00 171,22 542,64 0,42 0,65 0,00 0,65 0,10 6,70
sdh 0,00 85,60 0,40 797,00 0,01 192,97 495,65 10,95 13,73 32,50 13,72 0,55 44,22
sdi 0,00 0,00 0,00 678,20 0,00 180,01 543,59 0,45 0,67 0,00 0,67 0,10 6,76
sdj 0,00 0,00 0,00 639,00 0,00 169,61 543,61 0,36 0,57 0,00 0,57 0,10 6,32

--admin-daemon /var/run/ceph/ceph-osd.1.asok perf dump
Help...MDS Continuously Segfaulting
Well, both of my MDSs seem to be down right now, and then continually segfault (every time I try to start them) with the following:

ceph-mdsmon-a:~ # ceph-mds -n mds.b -c /etc/ceph/ceph.conf -f
starting mds.b at :/0
*** Caught signal (Segmentation fault) **
 in thread 7fbe0d61d700
 ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)
 1: ceph-mds() [0x7ef83a]
 2: (()+0xfd00) [0x7fbe15a0cd00]
 3: (ESession::replay(MDS*)+0x3ea) [0x4dcfea]
 4: (MDLog::_replay_thread()+0x6b6) [0x6a2446]
 5: (MDLog::ReplayThread::entry()+0xd) [0x4cf5ed]
 6: (()+0x7f05) [0x7fbe15a04f05]
 7: (clone()+0x6d) [0x7fbe14bc410d]
2012-10-15 10:57:35.449161 7fbe0d61d700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fbe0d61d700
 ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)
 1: ceph-mds() [0x7ef83a]
 2: (()+0xfd00) [0x7fbe15a0cd00]
 3: (ESession::replay(MDS*)+0x3ea) [0x4dcfea]
 4: (MDLog::_replay_thread()+0x6b6) [0x6a2446]
 5: (MDLog::ReplayThread::entry()+0xd) [0x4cf5ed]
 6: (()+0x7f05) [0x7fbe15a04f05]
 7: (clone()+0x6d) [0x7fbe14bc410d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
     0> 2012-10-15 10:57:35.449161 7fbe0d61d700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fbe0d61d700
 ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)
 1: ceph-mds() [0x7ef83a]
 2: (()+0xfd00) [0x7fbe15a0cd00]
 3: (ESession::replay(MDS*)+0x3ea) [0x4dcfea]
 4: (MDLog::_replay_thread()+0x6b6) [0x6a2446]
 5: (MDLog::ReplayThread::entry()+0xd) [0x4cf5ed]
 6: (()+0x7f05) [0x7fbe15a04f05]
 7: (clone()+0x6d) [0x7fbe14bc410d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Segmentation fault

Anyone have any hints on recovering? I'm running 0.48.1argonaut - I can attempt to upgrade to 0.48.2 and see if that helps, but I figured I'd ask first whether anyone can offer any insight into how to get the replay to run without segfaulting.
Re: Help...MDS Continuously Segfaulting
Something in the MDS log is bad or is poking at a bug in the code. Can you turn on MDS debugging and restart a daemon and put that log somewhere accessible?

debug mds = 20
debug journaler = 20
debug ms = 1

-Greg

On Mon, Oct 15, 2012 at 10:02 AM, Nick Couchman nick.couch...@seakr.com wrote:
> [...]
Re: Help...MDS Continuously Segfaulting
Anywhere in particular I should make it available? It's a little over a million lines of debug in the file - I can put it on a pastebin, if that works, or perhaps zip it up and throw it somewhere?

-Nick

On 2012/10/15 at 11:26, Gregory Farnum g...@inktank.com wrote:
> [...]
Re: Help...MDS Continuously Segfaulting
Yeah, zip it and post — somebody's going to have to download it and do fun things. :)

-Greg

On Mon, Oct 15, 2012 at 10:43 AM, Nick Couchman nick.couch...@seakr.com wrote:
> [...]
New branch: Python packaging integrated into automake
Hi. While working on the external journal stuff, for a while I thought I needed more Python code than I ended up needing. To support that code, I put in the skeleton of "import ceph.foo" support. While I ultimately didn't need it, I didn't want to throw away the results. If you later need to have more Python stuff in core Ceph, use this branch as the base.

It's intentionally separate from the python-ceph Debian package; that one is about providing Python programmers APIs to use RADOS, RBD etc. (compare to librados-dev etc.); this is about Python code used by core Ceph itself.

https://github.com/ceph/ceph/tree/automake-python
Re: Two questions about client writes update to Ceph
Hi Alex,

1) When a replica goes down, the write won't complete until the replica is detected as down. At that point, the write can complete without the down replica. Shortly thereafter, if the down replica does not come back, a new replica will replace it, bringing the replication count back to what it should be. In the scenario you described, the transaction will be re-replicated to the replicas when they come back up (or to new replicas if they don't).

2) The clients don't have a local, durable journal. The client-side state acts like the volatile cache on a spinning disk: flushing the disk io will force the data to become durable on an OSD (either in the journal or synced to the filesystem).

-Sam

On Mon, Oct 15, 2012 at 1:46 AM, Alex Jiang alex.jiang@gmail.com wrote:
> Hi, all
> I'm very interested in Ceph. Recently I deployed a Ceph cluster in a lab environment, and it works very well. Now I am studying the principle of Ceph RBD, but I am confused about two questions.
> 1) When a client writes an update to Ceph RBD, it contacts the primary OSD directly. The primary OSD then forwards the update to the replicas. If the replicas are down and the update is not committed to the replicas' disks, but the update has been committed to the primary OSD's disk, will Ceph roll back the transaction?
> 2) To ensure availability, do journals exist on the client side to log I/O history? If true, are the journals durable on disks?
> Regards,
> Alex
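Sam's point about flushing maps onto the two acknowledgement levels that librados itself exposes: an asynchronous write is reported "complete" once the up OSDs have it (the ack), and "safe" once it has been committed to the OSD journals or filesystems. A rough sketch with the Python bindings, assuming the python-rados aio_write(object_name, data, offset, oncomplete, onsafe) signature of this era and a default 'rbd' pool (treat it as an illustration, not a reference):

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

def on_ack(completion):
    # The OSDs have the update (readable), but it may still only be in memory.
    print "write acked"

def on_safe(completion):
    # The update is committed to the journal / synced to the filesystem.
    print "write committed"

comp = ioctx.aio_write('test-object', 'some data', 0, on_ack, on_safe)
comp.wait_for_safe()   # block until the write is durable on the OSDs
ioctx.close()
cluster.shutdown()

This is the client-side equivalent of the volatile-cache analogy above: nothing on the client makes the data durable, it just waits for the OSDs to say so.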
Re: Ceph benchmark high wait on journal device
Hi Martin,

I haven't tested the 9266-8i specifically, but it may behave similarly to the 9265-8i. This is just a theory, but I get the impression that the controller itself introduces some latency getting data to disk, and that it may get worse as more data is pushed across the controller. That seems to be the case even if the data is not going to the disk in question.

Are you using a single controller with expanders? On some of our nodes that use a single controller with lots of expanders, I've noticed high IO wait times, especially when doing lots of small writes.

Mark

On 10/15/2012 11:12 AM, Martin Mailand wrote:
Hi, inspired from the performance test Mark did, I tried to compile my own one. I have four OSD processes on one Node, each process has a Intel 710 SSD for its journal and 4 SAS Disk via an Lsi 9266-8i in Raid 0. If I test the SSD with fio they are quite fast and the w_wait time is quite low. But if I run rados bench on the cluster, the w_wait times for the journal devices are quite high (around 20-40ms). I thought the SSD would be better, any ideas what happend here?
-martin

Logs:
/dev/sd{c,d,e,f} Intel SSD 710 200G
/dev/sd{g,h,i,j} each 4 x SAS on LSI 9266-8i Raid 0

fio -name iops -rw=write -size=10G -iodepth 1 -filename /dev/sdc2 -ioengine libaio -direct 1 -bs 256k

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
- snip -
sdc 0,00 0,00 0,00 809,20 0,00 202,30 512,00 0,96 1,19 0,00 1,19 1,18 95,84
- snap -

rados bench -p rbd 300 write -t 16
2012-10-15 17:53:17.058383 min lat: 0.035382 max lat: 0.469604 avg lat: 0.189553
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
300 16 25329 25313 337.443 324 0.274815 0.189553
Total time run: 300.169843
Total writes made: 25329
Write size: 4194304
Bandwidth (MB/sec): 337.529
Stddev Bandwidth: 25.1568
Max bandwidth (MB/sec): 372
Min bandwidth (MB/sec): 0
Average Latency: 0.189597
Stddev Latency: 0.0641609
Max latency: 0.469604
Min latency: 0.035382

during the rados bench test.
avg-cpu: %user %nice %system %iowait %steal %idle 20,38 0,00 16,20 8,87 0,00 54,55 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 41,20 0,00 12,40 0,00 0,35 57,42 0,00 0,31 0,00 0,31 0,31 0,38 sdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 sdc 0,00 0,00 0,00 332,80 0,00 139,67 859,53 7,36 22,09 0,00 22,09 2,12 70,42 sdd 0,00 0,00 0,00 391,60 0,00 175,84 919,62 15,59 39,62 0,00 39,62 2,40 93,80 sde 0,00 0,00 0,00 342,00 0,00 147,39 882,59 8,54 24,89 0,00 24,89 2,18 74,58 sdf 0,00 0,00 0,00 362,20 0,00 162,72 920,05 15,35 42,50 0,00 42,50 2,60 94,20 sdg 0,00 0,00 0,00 522,00 0,00 139,20 546,13 0,28 0,54 0,00 0,54 0,10 5,26 sdh 0,00 0,00 0,00 672,00 0,00 179,20 546,13 9,67 14,42 0,00 14,42 0,61 41,18 sdi 0,00 0,00 0,00 555,00 0,00 148,00 546,13 0,32 0,57 0,00 0,57 0,10 5,46 sdj 0,00 0,00 0,00 582,00 0,00 155,20 546,13 0,51 0,87 0,00 0,87 0,12 6,96 100 seconds later avg-cpu: %user %nice %system %iowait %steal %idle 22,92 0,00 19,57 9,25 0,00 48,25 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,00 40,80 0,00 15,60 0,00 0,36 47,08 0,00 0,22 0,00 0,22 0,22 0,34 sdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 sdc 0,00 0,00 0,00 386,60 0,00 168,33 891,70 12,11 31,08 0,00 31,08 2,25 86,86 sdd 0,00 0,00 0,00 405,00 0,00 183,06 925,68 15,68 38,70 0,00 38,70 2,34 94,90 sde 0,00 0,00 0,00 411,00 0,00 185,06 922,15 15,58 38,09 0,00 38,09 2,33 95,92 sdf 0,00 0,00 0,00 387,00 0,00 168,33 890,79 12,19 31,48 0,00 31,48 2,26 87,48 sdg 0,00 0,00 0,00 646,20 0,00 171,22 542,64 0,42 0,65 0,00 0,65 0,10 6,70 sdh 0,00 85,60 0,40 797,00 0,01 192,97 495,65 10,95 13,73 32,50 13,72 0,55 44,22 sdi 0,00 0,00 0,00 678,20 0,00 180,01 543,59 0,45 0,67 0,00 0,67 0,10 6,76 sdj 0,00 0,00 0,00 639,00 0,00 169,61 543,61 0,36 0,57 0,00 0,57 0,10 6,32 --admin-daemon /var/run/ceph/ceph-osd.1.asok perf dump {filestore:{journal_queue_max_ops:500,journal_queue_ops:0,journal_ops:34653,journal_queue_max_bytes:104857600,journal_queue_bytes:0,journal_bytes:86821481160,journal_latency:{avgcount:34653,sum:3458.68},journal_wr:19372,journal_wr_bytes:{avgcount:19372,sum:87026655232},op_queue_max_ops:500,op_queue_ops:126,ops:34653,op_queue_max_bytes:104857600,op_queue_bytes:167023,bytes:86821143225,apply_latency:{avgcount:34527,sum:605.768},committing:0,commitcycle:19,commitcycle_interval:{avgcount:19,sum:572.674},commitcycle_latency:{avgcount:19,sum:2.62279},journal_full:0},osd:{opq:0,op_wip:4,op:15199,op_in_bytes:36140461079,op_out_bytes:0,op_latency:{avgcount:15199,sum:1811.57},op_r:0,op_r_out_bytes:0,op_r_latency:{avgcount:0,sum:0},op_w:15199,op_w_in_bytes:36140461079,op_w_rlat:{avgcount:15199,sum:177.327},op_w_latency:{avgcount:15199,sum:1811.57},op_rw:0,op_rw_in_bytes:0,op! _rw_out_
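For what it's worth, each latency entry in a dump like the one above is a pair of a running sum and an avgcount, so averages fall out of a simple division (the sums appear to be in seconds). A small helper along these lines, using only counters that show up in the dump quoted above (socket path and counter selection are examples, not a recommendation):

import json
import subprocess

# Sketch: pull average latencies out of an OSD admin-socket perf dump.
out = subprocess.check_output(
    ['ceph', '--admin-daemon', '/var/run/ceph/ceph-osd.1.asok', 'perf', 'dump'])
perf = json.loads(out)

for section, name in [('filestore', 'journal_latency'),
                      ('filestore', 'apply_latency'),
                      ('osd', 'op_w_latency')]:
    c = perf[section][name]
    avg = c['sum'] / c['avgcount'] if c['avgcount'] else 0.0
    print "%s/%s: %.1f ms average over %d ops" % (section, name, avg * 1000.0, c['avgcount'])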
Re: osd crash in ReplicatedPG::add_object_context_to_pg_stat(ReplicatedPG::ObjectContext*, pg_stat_t*)
Do you have a coredump for the crash? Can you reproduce the crash with:

debug filestore = 20
debug osd = 20

and post the logs? As far as the incomplete pg goes, can you post the output of ceph pg <pgid> query, where <pgid> is the pgid of the incomplete pg (e.g. 1.34)?

Thanks
-Sam

On Thu, Oct 11, 2012 at 3:17 PM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
Hello everybody. I'm currently having a problem with one of my OSDs, which crashes with this trace:

ceph version 0.52 (commit:e48859474c4944d4ff201ddc9f5fd400e8898173)
 1: /usr/bin/ceph-osd() [0x737879]
 2: (()+0xf030) [0x7f43f0af0030]
 3: (ReplicatedPG::add_object_context_to_pg_stat(ReplicatedPG::ObjectContext*, pg_stat_t*)+0x292) [0x555262]
 4: (ReplicatedPG::recover_backfill(int)+0x1c1a) [0x55c93a]
 5: (ReplicatedPG::start_recovery_ops(int, PG::RecoveryCtx*)+0x26a) [0x563c1a]
 6: (OSD::do_recovery(PG*)+0x39d) [0x5d3c9d]
 7: (OSD::RecoveryWQ::_process(PG*)+0xd) [0x6119fd]
 8: (ThreadPool::worker()+0x82b) [0x7c176b]
 9: (ThreadPool::WorkThread::entry()+0xd) [0x5f609d]
 10: (()+0x6b50) [0x7f43f0ae7b50]
 11: (clone()+0x6d) [0x7f43ef81b78d]

Restarting gives the same message after some seconds. I've been watching the bug tracker but I don't see anything related.

Some information: the kernel is 3.6.1, with standard Debian packages from ceph.com. My ceph cluster had been running well and stable on 6 OSDs since June (3 datacenters: 2 with 2 nodes, 1 with 4 nodes; a replication of 2, and adjusted weights to try to balance data evenly). I began with the then-up-to-date version, then 0.48, 49, 50, 51... The data store is on XFS.

I'm currently in the process of growing my ceph from 6 nodes to 12 nodes. 11 nodes are currently in ceph, for a 130 TB total. Declaring the new OSDs was OK, and the data moved quite OK (in fact I had some OSD crashes - not definitive, the OSDs restart OK - maybe related to an error in my new nodes' network configuration that I discovered afterwards. More on that later; I can find the traces, but I'm not sure it's related.)

When ceph was finally stable again, with HEALTH_OK, I decided to reweight the OSDs (that was Tuesday). The operation went quite OK, but near the end of the operation (0,085% left), one of my OSDs crashed and won't start again. More problematic, with this OSD down, I have 1 incomplete PG:

ceph -s
   health HEALTH_WARN 86 pgs backfill; 231 pgs degraded; 4 pgs down; 15 pgs incomplete; 4 pgs peering; 134 pgs recovering; 19 pgs stuck inactive; 455 pgs stuck unclean; recovery 2122878/23181946 degraded (9.157%); 2321/11590973 unfound (0.020%); 1 near full osd(s)
   monmap e1: 3 mons at {chichibu=172.20.14.130:6789/0,glenesk=172.20.14.131:6789/0,karuizawa=172.20.14.133:6789/0}, election epoch 20, quorum 0,1,2 chichibu,glenesk,karuizawa
   osdmap e13184: 11 osds: 10 up, 10 in
   pgmap v2399093: 1728 pgs: 165 active, 1270 active+clean, 8 active+recovering+degraded, 41 active+recovering+degraded+remapped+backfill, 4 down+peering, 137 active+degraded, 3 active+clean+scrubbing, 15 incomplete, 40 active+recovering, 45 active+recovering+degraded+backfill; 44119 GB data, 84824 GB used, 37643 GB / 119 TB avail; 2122878/23181946 degraded (9.157%); 2321/11590973 unfound (0.020%)
   mdsmap e321: 1/1/1 up {0=karuizawa=up:active}, 2 up:standby

How is it possible, as I have a replication of 2? Is it a known problem?
Cheers,
Re: rbd command to display free space in a cluster ?
Nothing like that exists at the moment; see http://tracker.newdream.net/issues/3283 for the other side of it.

On 10/15/2012 12:52 AM, Alexandre DERUMIER wrote:
> Hi, I'm looking for a way to retrieve the free space of an rbd cluster with the rbd command. Any hint? (Something like ceph -w status, but without needing to parse the result.)
> Regards,
> Alexandre
Re: Ceph benchmark high wait on journal device
Hi Mark,
I think there are no differences between the 9266-8i and the 9265-8i, except for the cache vault and the angle of the SAS connectors. In the last test, which I posted, the SSDs were connected to the onboard SATA ports.

Further tests showed that if I reduce the object size (the -b option) to 1M, 512k, 256k, the latency almost vanishes. With 256k the w_wait was around 1ms. So my observations are almost the opposite of yours.

I use a single controller with a dual expander backplane. That's the baby. http://85.214.49.87/ceph/testlab/IMAG0018.jpg

btw. Is there a nice way to format the output of ceph --admin-daemon ceph-osd.0.asok perf_dump?

-martin

On 15.10.2012 21:50, Mark Nelson wrote:
[...]
Re: Ceph benchmark high wait on journal device
On Mon, 15 Oct 2012, Travis Rhoden wrote:
> Martin,
>> btw. Is there a nice way to format the output of ceph --admin-daemon ceph-osd.0.asok perf_dump?
>
> I use:
> ceph --admin-daemon /var/run/ceph/ceph-osd.3.asok perf dump | python -mjson.tool

There is also ceph.git/src/script/perf-watch.py -s foo.asok <list of vars or var prefixes>

s

> - Travis
>
> On Mon, Oct 15, 2012 at 4:38 PM, Martin Mailand mar...@tuxadero.com wrote:
> [...]