Re: [ceph-users] ceph v0.61, rbd-fuse issue, rbd_list: error %d Numerical result out of range
Hi Sean:

It looks to me like this is the result of the simple-minded[1] strategy for allocating a return buffer for rbd_list():

	ibuf_len = 1024;
	ibuf = malloc(ibuf_len);
	actual_len = rbd_list(ioctx, ibuf, &ibuf_len);
	if (actual_len < 0) {
		simple_err("rbd_list: error %d\n", actual_len);
		return;
	}

An easy fix would be to catch the actual_len < 0 case and reallocate ibuf with the returned ibuf_len size. I also note that ibuf is never freed, which is not great. In fact that whole enumerate_images() routine is not what you'd call very solid.

Here's a mostly-untested patch you can try if you like (I'll test tomorrow):

diff --git a/src/rbd_fuse/rbd-fuse.c b/src/rbd_fuse/rbd-fuse.c
index 5a4bfe2..5411ff8 100644
--- a/src/rbd_fuse/rbd-fuse.c
+++ b/src/rbd_fuse/rbd-fuse.c
@@ -93,8 +93,13 @@ enumerate_images(struct rbd_image **head)
 	ibuf = malloc(ibuf_len);
 	actual_len = rbd_list(ioctx, ibuf, &ibuf_len);
 	if (actual_len < 0) {
-		simple_err("rbd_list: error %d\n", actual_len);
-		return;
+		/* ibuf_len now set to required length */
+		ibuf = realloc(ibuf, ibuf_len);
+		actual_len = rbd_list(ioctx, ibuf, &ibuf_len);
+		if (actual_len < 0) {
+			/* shouldn't happen */
+			simple_err("rbd_list:", actual_len);
+			return;
+		}
 	}
 
 	fprintf(stderr, "pool %s: ", pool_name);
@@ -102,10 +107,11 @@ enumerate_images(struct rbd_image **head)
 	     ip += strlen(ip) + 1)  {
 		fprintf(stderr, "%s, ", ip);
 		im = malloc(sizeof(*im));
-		im->image_name = ip;
+		im->image_name = strdup(ip);
 		im->next = *head;
 		*head = im;
 	}
+	free(ibuf);
 	fprintf(stderr, "\n");
 	return;
 }

--
[1] it was my simple mind...

On 05/17/2013 02:34 AM, Sean wrote:
> Hi everyone,
>
> The image files don't display in the mount point when using the command
> "rbd-fuse -p poolname -c /etc/ceph/ceph.conf /aa", but other pools can
> display image files with the same command. I also created larger and
> more numerous images in another pool, and they work fine. How can I
> track down the issue?
>
> It reports the errors below after enabling Fuse's debug output:
>
> root@ceph3:/# rbd-fuse -p qa_vol /aa -d
> FUSE library version: 2.8.6
> nullpath_ok: 0
> unique: 1, opcode: INIT (26), nodeid: 0, insize: 56
> INIT: 7.17
> flags=0x047b
> max_readahead=0x0002
> INIT: 7.12
> flags=0x0031
> max_readahead=0x0002
> max_write=0x0002
>    unique: 1, success, outsize: 40
> unique: 2, opcode: GETATTR (3), nodeid: 1, insize: 56
> getattr /
> rbd_list: error %d : Numerical result out of range
>    unique: 2, success, outsize: 120
> unique: 3, opcode: OPENDIR (27), nodeid: 1, insize: 48
> opendir flags: 0x98800 /
> rbd_list: error %d : Numerical result out of range
>    opendir[0] flags: 0x98800 /
>    unique: 3, success, outsize: 32
> unique: 4, opcode: READDIR (28), nodeid: 1, insize: 80
> readdir[0] from 0
>    unique: 4, success, outsize: 80
> unique: 5, opcode: READDIR (28), nodeid: 1, insize: 80
>    unique: 5, success, outsize: 16
> unique: 6, opcode: RELEASEDIR (29), nodeid: 1, insize: 64
> releasedir[0] flags: 0x0
>    unique: 6, success, outsize: 16
>
> thanks.
> Sean Cao

--
Dan Mick, Filesystem Engineering
Inktank Storage, Inc.   http://inktank.com
Ceph docs: http://ceph.com/docs
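[Editor's note: for anyone who wants to work around this before a fixed build shows up, the grow-and-retry idea is small enough to sketch standalone. The helper below is a minimal, untested sketch -- the function name is mine, not rbd-fuse's -- and it leans on rbd_list() returning -ERANGE and updating its size argument to the required length, which is exactly the "Numerical result out of range" in Sean's log:

#include <errno.h>
#include <stdlib.h>
#include <rados/librados.h>
#include <rbd/librbd.h>

/*
 * Fetch the image-name list for a pool, growing the buffer as
 * rbd_list() demands. Returns a malloc'd buffer of NUL-separated
 * names (caller frees) and stores the used length in *len, or
 * returns NULL on error.
 */
static char *list_images(rados_ioctx_t ioctx, size_t *len)
{
	size_t size = 1024;	/* same starting guess as rbd-fuse */
	char *buf = NULL;
	int r;

	do {
		char *newbuf = realloc(buf, size);
		if (!newbuf) {
			free(buf);
			return NULL;
		}
		buf = newbuf;
		/* on -ERANGE, rbd_list() sets 'size' to what it needs */
		r = rbd_list(ioctx, buf, &size);
	} while (r == -ERANGE);

	if (r < 0) {
		free(buf);
		return NULL;
	}
	*len = (size_t)r;	/* total bytes of packed names */
	return buf;
}

A caller would then walk the result the same way enumerate_images() does: for (ip = buf; ip < buf + len; ip += strlen(ip) + 1).]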
Re: [ceph-users] Determining when an 'out' OSD is actually unused
Dan,

On 21 May 2013, at 00:52, Dan Mick wrote:
> On 05/20/2013 01:33 PM, Alex Bligh wrote:
>> If I want to remove an osd, I use 'ceph osd out' before taking it down,
>> i.e. stopping the OSD process and removing the disk. How do I
>> (preferably programmatically) tell when it is safe to stop the OSD
>> process? The documentation says 'ceph -w', which is not especially
>> helpful (a) if I want to do it programmatically, or (b) if there are
>> other problems in the cluster, so ceph was not reporting HEALTH_OK to
>> start with. Is there a better way?
>
> We've had some discussions about this recently, but there's no great
> way of doing this right now.

OK. So would the following conservative rule work for now?

* Don't mark the OSD out until and unless ceph reports HEALTH_OK
* Then mark it out
* Then you are safe to remove it only when ceph returns to HEALTH_OK

The instructions at present say to watch 'ceph -w', but don't say exactly what to watch for.

> We should probably have a query option that returns "number of PGs on
> this OSD" or some such.

That would be very useful.

--
Alex Bligh
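[Editor's note: for the "programmatically" part, here is one possible shape for the polling step. It is a rough, untested C sketch; it assumes a librados build that exposes rados_mon_command() (check your librados.h, as older builds may lack it) and that the plain-text 'health' output begins with the string HEALTH_OK, as the CLI prints:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <rados/librados.h>

/* Return nonzero if the cluster currently reports HEALTH_OK. */
static int health_ok(rados_t cluster)
{
	const char *cmd[] = { "{\"prefix\": \"health\"}" };
	char *outbuf = NULL, *outs = NULL;
	size_t outbuf_len = 0, outs_len = 0;
	int ok = 0;

	if (rados_mon_command(cluster, cmd, 1, NULL, 0,
			      &outbuf, &outbuf_len, &outs, &outs_len) == 0)
		ok = outbuf_len >= 9 && memcmp(outbuf, "HEALTH_OK", 9) == 0;
	rados_buffer_free(outbuf);
	rados_buffer_free(outs);
	return ok;
}

int main(void)
{
	rados_t cluster;

	if (rados_create(&cluster, NULL) < 0 ||
	    rados_conf_read_file(cluster, "/etc/ceph/ceph.conf") < 0 ||
	    rados_connect(cluster) < 0) {
		fprintf(stderr, "can't connect to cluster\n");
		return 1;
	}
	/* mark the OSD out first ('ceph osd out N'), then wait here
	 * until the cluster settles back to HEALTH_OK */
	while (!health_ok(cluster))
		sleep(10);
	printf("HEALTH_OK -- safe to stop the OSD process\n");
	rados_shutdown(cluster);
	return 0;
}

Note the same caveat as the rule above: HEALTH_OK is only meaningful as a "safe to stop" signal if the cluster was healthy before the OSD was marked out.]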
Re: [ceph-users] Determining when an 'out' OSD is actually unused
Yes, with the proviso that you really mean "kill the OSD when clean". Marking it out is step 1.
[ceph-users] Fwd: RGW
---------- Forwarded message ----------
From: Gandalf Corvotempesta <gandalf.corvotempe...@gmail.com>
Date: 2013/5/20
Subject: RGW
To: ceph-users@lists.ceph.com

Hi, I'm receiving an EntityTooLarge error when trying to upload an object of 100MB. I've already set LimitRequestBody to 0 in Apache. Anything else to check?
[ceph-users] mon IO usage
Hi,

I've just added some monitoring of the mon's IO usage (trying to track down that growing-mon issue), and I'm kind of surprised by the amount of IO the monitor process generates. I see a continuous 4 MB/s / 75 iops, with big spikes added at each compaction, every 3 min or so.

Is there a description somewhere of what the monitor does, exactly? I mean, the monmap / pgmap / osdmap / mdsmap / election epoch don't change that often (the pgmap changes about once per second, and that's the fastest-changing by several orders of magnitude). So what exactly does the monitor do with all that IO?

Cheers,

    Sylvain
Re: [ceph-users] mon IO usage
Sylvain,

I can confirm I see a similar traffic pattern. Any time I have lots of writes going to my cluster (like heavy writes from RBD, or remapping/backfilling after losing an OSD), I see all sorts of monitor issues.

If my monitors' leveldb store.db directories grow past some unknown point (maybe ~1GB or so), 'compact on trim' can no longer keep up: the store.db grows faster than compaction can trim the garbage. Past that point, the only hope of reining in the store.db size is to stop the OSDs and let leveldb compact without any ongoing writes.

I sent Sage and Joao a transaction dump of the growth yesterday. Sage looked, but the files are so large it is tough to get useful info out of them.

http://tracker.ceph.com/issues/4895

I believe this issue has existed since 0.48.

- Mike

On 5/21/2013 8:16 AM, Sylvain Munaut wrote:
> So what exactly does the monitor do with all that IO?
Re: [ceph-users] mon IO usage
Hi,

>> So, AFAICT, the bulk of the write would be writing out the pgmap to
>> disk every second or so.
>
> It should be writing out the full map only every N commits... see
> 'paxos stash full interval', which defaults to 25.

But doesn't it also write it in full when there is a new pgmap? I get a new one about every second, and its size * period seemed to match the IO rate pretty well, which is why I thought it was the cause of the IO.

Is it really needed to write it in full? It doesn't change all that much AFAICT, so writing incremental changes with only a periodic flush might be a better option?

> Right. It works this way now only because we haven't fully transitioned
> from the old scheme. The next step is to store the PGMap over lots of
> leveldb keys (one per pg) so that there is no big encode/decode of the
> entire PGMap structure...

Makes sense. I'm not sure of the per-key overhead of leveldb, though, in cases where there are lots (> 10k) of PGs.

Cheers,

    Sylvain
Re: [ceph-users] mon IO usage
On Tue, 21 May 2013, Sylvain Munaut wrote:
>>> So, AFAICT, the bulk of the write would be writing out the pgmap to
>>> disk every second or so.
>>
>> It should be writing out the full map only every N commits... see
>> 'paxos stash full interval', which defaults to 25.
>
> But doesn't it also write it in full when there is a new pgmap? I get a
> new one about every second, and its size * period seemed to match the
> IO rate pretty well, which is why I thought it was the cause of the IO.

Hmm. Can you generate a log with 'debug mon = 20', 'debug paxos = 20', 'debug ms = 1' for a few minutes over which you see a high data rate and send it my way? It sounds like there is something wrong with the stash_full logic. Thanks!

> Makes sense. I'm not sure of the per-key overhead of leveldb, though,
> in cases where there are lots (> 10k) of PGs.

Yeah, it will be larger on-disk, but the IO rate will at least be proportional to the update rate. :)

sage
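[Editor's note: for readers following along, the write pattern being discussed can be caricatured in a few lines. This is a hypothetical illustration, not Ceph's actual code; only the 'paxos stash full interval' option and its default of 25 come from the thread, and the byte counts are made up:

#include <stdio.h>

#define STASH_FULL_INTERVAL 25	/* 'paxos stash full interval' default */

/* Toy model of the commit path: a small incremental on every commit,
 * the big full map only on every Nth commit. */
static void commit(unsigned version, size_t inc_bytes, size_t full_bytes)
{
	printf("v%u: incremental, %zu bytes\n", version, inc_bytes);
	if (version % STASH_FULL_INTERVAL == 0)
		printf("v%u: stash full map, %zu bytes\n", version, full_bytes);
}

int main(void)
{
	/* with ~1 pgmap commit per second, a made-up 4 KB incremental,
	 * and a made-up 4 MB full map, 1-in-25 stashing keeps the write
	 * rate modest; stashing on every commit would not */
	for (unsigned v = 1; v <= 100; v++)
		commit(v, 4 * 1024, 4 * 1024 * 1024);
	return 0;
}

If the stash_full logic misfires and the full map is written on every commit, the IO rate becomes proportional to map size rather than update rate, which is what the requested logs should reveal.]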
Re: [ceph-users] mon IO usage
Hi,

> Hmm. Can you generate a log with 'debug mon = 20', 'debug paxos = 20',
> 'debug ms = 1' for a few minutes over which you see a high data rate
> and send it my way? It sounds like there is something wrong with the
> stash_full logic.

Mm, actually I may have been fooled by the instrumentation... it does a 30 sec average, so looking closer I don't have 4 MB/s constantly; it's more like a burst of 50 MB every 15-20 sec. In any case, that seems like a lot of data being written.

The logs can be downloaded from http://ge.tt/9MOeKHh/v/0

Cheers,

    Sylvain
Re: [ceph-users] Inconsistent PG's, repair ineffective
I can't reproduce this on v0.61-2. Could the disks for osd.13 and osd.22 be unwritable?

In your case it looks like the 3rd replica is probably the bad one, since osd.13 and osd.22 report the same digest. You probably want to manually repair the 3rd replica.

David Zafman
Senior Developer
http://www.inktank.com

On May 21, 2013, at 6:45 AM, John Nielsen <li...@jnielsen.net> wrote:

> Cuttlefish on CentOS 6, ceph-0.61.2-0.el6.x86_64.
>
> On May 21, 2013, at 12:13 AM, David Zafman <david.zaf...@inktank.com> wrote:
>
>> What version of ceph are you running?
>>
>> On May 20, 2013, at 9:14 AM, John Nielsen <li...@jnielsen.net> wrote:
>>
>>> Some scrub errors showed up on our cluster last week. We had some
>>> issues with host stability a couple of weeks ago; my guess is that
>>> errors were introduced at that point and a recent background scrub
>>> detected them. I was able to clear most of them via "ceph pg repair",
>>> but several remain. Based on some other posts, I'm guessing they
>>> won't repair because it is the primary copy that has the error. All
>>> of our pools are set to size 3, so there _ought_ to be a way to
>>> verify and restore the correct data, right?
>>>
>>> Below is some log output about one of the problem PGs. Can anyone
>>> suggest a way to fix the inconsistencies?
>>>
>>> 2013-05-20 10:07:54.529582 osd.13 10.20.192.111:6818/20919 3451 : [ERR] 19.1b osd.13: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 4289025870 != known digest 4190506501
>>> 2013-05-20 10:07:54.529585 osd.13 10.20.192.111:6818/20919 3452 : [ERR] 19.1b osd.22: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 4289025870 != known digest 4190506501
>>> 2013-05-20 10:07:54.606034 osd.13 10.20.192.111:6818/20919 3453 : [ERR] 19.1b repair 0 missing, 1 inconsistent objects
>>> 2013-05-20 10:07:54.606066 osd.13 10.20.192.111:6818/20919 3454 : [ERR] 19.1b repair 2 errors, 2 fixed
>>> 2013-05-20 10:07:55.034221 osd.13 10.20.192.111:6818/20919 3455 : [ERR] 19.1b osd.13: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 4289025870 != known digest 4190506501
>>> 2013-05-20 10:07:55.034224 osd.13 10.20.192.111:6818/20919 3456 : [ERR] 19.1b osd.22: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 4289025870 != known digest 4190506501
>>> 2013-05-20 10:07:55.113230 osd.13 10.20.192.111:6818/20919 3457 : [ERR] 19.1b deep-scrub 0 missing, 1 inconsistent objects
>>> 2013-05-20 10:07:55.113235 osd.13 10.20.192.111:6818/20919 3458 : [ERR] 19.1b deep-scrub 2 errors
>>>
>>> Thanks,
>>>
>>> JN
Re: [ceph-users] Inconsistent PG's, repair ineffective
I've checked; all the disks are fine, and the cluster is healthy except for the inconsistent objects. How would I go about manually repairing the bad replica?

On May 21, 2013, at 3:26 PM, David Zafman <david.zaf...@inktank.com> wrote:

> I can't reproduce this on v0.61-2. Could the disks for osd.13 and
> osd.22 be unwritable?
>
> In your case it looks like the 3rd replica is probably the bad one,
> since osd.13 and osd.22 report the same digest. You probably want to
> manually repair the 3rd replica.
>
> David Zafman
> Senior Developer
> http://www.inktank.com