Re: [ceph-users] ceph v6.1, rbd-fuse issue, rbd_list: error %d Numerical result out of range

2013-05-21 Thread Dan Mick

Hi Sean:

It looks to me like this is the result of the simple-minded[1] strategy 
for allocating a return buffer for rbd_list():


ibuf_len = 1024;
ibuf = malloc(ibuf_len);
actual_len = rbd_list(ioctx, ibuf, &ibuf_len);
if (actual_len < 0) {
	simple_err("rbd_list: error %d\n", actual_len);
	return;
}

An easy fix would be to catch the actual_len < 0 case and reallocate 
ibuf with the returned ibuf_len size.


I also note that ibuf is never freed, which is not great.  In fact that 
whole enumerate_images() routine is not what you'd call very solid. 
Here's a mostly-untested patch you can try if you like (I'll test tomorrow):


diff --git a/src/rbd_fuse/rbd-fuse.c b/src/rbd_fuse/rbd-fuse.c
index 5a4bfe2..5411ff8 100644
--- a/src/rbd_fuse/rbd-fuse.c
+++ b/src/rbd_fuse/rbd-fuse.c
@@ -93,8 +93,13 @@ enumerate_images(struct rbd_image **head)
 	ibuf = malloc(ibuf_len);
 	actual_len = rbd_list(ioctx, ibuf, &ibuf_len);
 	if (actual_len < 0) {
-		simple_err("rbd_list: error %d\n", actual_len);
-		return;
+		/* ibuf_len now set to required length */
+		actual_len = rbd_list(ioctx, ibuf, &ibuf_len);
+		if (actual_len < 0) {
+			/* shouldn't happen */
+			simple_err("rbd_list:", actual_len);
+			return;
+		}
 	}
 
 	fprintf(stderr, "pool %s: ", pool_name);
@@ -102,10 +107,11 @@ enumerate_images(struct rbd_image **head)
 	     ip += strlen(ip) + 1)  {
 		fprintf(stderr, "%s, ", ip);
 		im = malloc(sizeof(*im));
-		im->image_name = ip;
+		im->image_name = strdup(ip);
 		im->next = *head;
 		*head = im;
 	}
+	free(ibuf);
 	fprintf(stderr, "\n");
 	return;
 }
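
For completeness, here is roughly what the retry path looks like as a 
standalone sketch, with the reallocation mentioned above folded in.  Treat it 
as illustrative rather than as the final patch; it only relies on rbd_list() 
returning -ERANGE and updating ibuf_len when the buffer is too small:

/*
 * Sketch only: list image names in a pool, growing the buffer when the
 * initial guess turns out to be too small.
 */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <rados/librados.h>
#include <rbd/librbd.h>

static void list_images(rados_ioctx_t ioctx)
{
	size_t ibuf_len = 1024;
	char *ibuf = malloc(ibuf_len);
	int actual_len;

	actual_len = rbd_list(ioctx, ibuf, &ibuf_len);
	if (actual_len == -ERANGE) {
		/* ibuf_len was updated to the required size; grow and retry */
		char *tmp = realloc(ibuf, ibuf_len);
		if (!tmp) {
			free(ibuf);
			return;
		}
		ibuf = tmp;
		actual_len = rbd_list(ioctx, ibuf, &ibuf_len);
	}
	if (actual_len < 0) {
		fprintf(stderr, "rbd_list: error %d\n", actual_len);
		free(ibuf);
		return;
	}

	/* image names are packed back to back, each NUL-terminated */
	for (char *ip = ibuf; ip < ibuf + actual_len; ip += strlen(ip) + 1)
		printf("%s\n", ip);

	free(ibuf);
}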


--
[1] it was my simple mind...



On 05/17/2013 02:34 AM, Sean wrote:

Hi everyone

The image files don't display in the mount point when using the command
rbd-fuse -p poolname -c /etc/ceph/ceph.conf /aa

but other pools display their image files fine with the same command. I have also
created images that are larger and more numerous than in that pool, and those work fine.

How can I track down the issue?

It reports the errors below after enabling FUSE debug output.

root@ceph3:/# rbd-fuse -p qa_vol /aa -d
FUSE library version: 2.8.6
nullpath_ok: 0
unique: 1, opcode: INIT (26), nodeid: 0, insize: 56
INIT: 7.17
flags=0x047b
max_readahead=0x0002
INIT: 7.12
flags=0x0031
max_readahead=0x0002
max_write=0x0002
unique: 1, success, outsize: 40
unique: 2, opcode: GETATTR (3), nodeid: 1, insize: 56
getattr /
rbd_list: error %d
: Numerical result out of range
unique: 2, success, outsize: 120
unique: 3, opcode: OPENDIR (27), nodeid: 1, insize: 48
opendir flags: 0x98800 /
rbd_list: error %d
: Numerical result out of range
opendir[0] flags: 0x98800 /
unique: 3, success, outsize: 32
unique: 4, opcode: READDIR (28), nodeid: 1, insize: 80
readdir[0] from 0
unique: 4, success, outsize: 80
unique: 5, opcode: READDIR (28), nodeid: 1, insize: 80
unique: 5, success, outsize: 16
unique: 6, opcode: RELEASEDIR (29), nodeid: 1, insize: 64
releasedir[0] flags: 0x0
unique: 6, success, outsize: 16


thanks.
Sean Cao





--
Dan Mick, Filesystem Engineering
Inktank Storage, Inc.   http://inktank.com
Ceph docs: http://ceph.com/docs


Re: [ceph-users] Determining when an 'out' OSD is actually unused

2013-05-21 Thread Alex Bligh
Dan,

On 21 May 2013, at 00:52, Dan Mick wrote:

 On 05/20/2013 01:33 PM, Alex Bligh wrote:
 If I want to remove an OSD, I use 'ceph osd out' before taking it down, i.e. 
 stopping the OSD process and removing the disk.
 
 How do I (preferably programmatically) tell when it is safe to stop the OSD 
 process? The documentation says 'ceph -w', which is not especially helpful, 
 (a) if I want to do it programmatically, or (b) if there are other problems 
 in the cluster so ceph was not reporting HEALTH_OK to start with.
 
 Is there a better way?
 
 
 We've had some discussions about this recently, but there's no great way of 
 doing this right now.

OK. So would the following conservative rule work for now?
* Don't mark the OSD out until and unless the cluster reports HEALTH_OK
* Then mark it out
* Then it is only safe to remove the OSD once the cluster returns to HEALTH_OK

The instructions at present say to watch 'ceph -w', but they don't say exactly 
what to watch for.
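
To make "programmatically" concrete, here is a rough sketch of the kind of
thing I have in mind.  It just shells out to the ceph CLI and looks for
HEALTH_OK in the output of 'ceph health', so treat it as illustrative only
(the 10s/30s delays are arbitrary):

/*
 * Sketch of the conservative rule above: wait for HEALTH_OK, mark the
 * OSD out, then wait for the cluster to return to HEALTH_OK.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static int cluster_healthy(void)
{
	char line[256];
	int ok = 0;
	FILE *p = popen("ceph health", "r");

	if (!p)
		return 0;
	while (fgets(line, sizeof(line), p))
		if (strstr(line, "HEALTH_OK"))
			ok = 1;
	pclose(p);
	return ok;
}

int main(int argc, char **argv)
{
	char cmd[64];
	int osd = (argc > 1) ? atoi(argv[1]) : 0;

	/* step 1: don't start unless the cluster is already HEALTH_OK */
	while (!cluster_healthy())
		sleep(10);

	/* step 2: mark the OSD out */
	snprintf(cmd, sizeof(cmd), "ceph osd out %d", osd);
	if (system(cmd) != 0)
		return 1;

	/* give the map change time to propagate before re-checking */
	sleep(30);

	/* step 3: wait until recovery brings the cluster back to HEALTH_OK */
	while (!cluster_healthy())
		sleep(10);

	printf("osd.%d appears drained; it should be safe to stop it now\n", osd);
	return 0;
}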

 We should probably have a query option that returns number of PGs on this 
 OSD or some such.

That would be very useful.

-- 
Alex Bligh






Re: [ceph-users] Determining when an 'out' OSD is actually unused

2013-05-21 Thread Dan Mick
Yes, with the proviso that you really mean "kill the osd when clean".
Marking out is step 1.


[ceph-users] Fwd: RGW

2013-05-21 Thread Gandalf Corvotempesta
-- Forwarded message --
From: Gandalf Corvotempesta gandalf.corvotempe...@gmail.com
Date: 2013/5/20
Subject: RGW
To: ceph-users@lists.ceph.com ceph-users@lists.ceph.com


Hi,
I'm receiving an EntityTooLarge error when trying to upload an object of 100 MB.

I've already set LimitRequestBody to 0 in Apache. Anything else to check?


[ceph-users] mon IO usage

2013-05-21 Thread Sylvain Munaut
Hi,


I've just added some monitoring to the IO usage of mon (trying to
track down that growing mon issue), and I'm kind of surprised by the
amount of IO generated by the monitor process.

I get a continuous 4 MB/s / 75 IOPS, with big spikes on top at each
compaction, every 3 min or so.

Is there a description somewhere of what the monitor does, exactly? I
mean the monmap / pgmap / osdmap / mdsmap / election epoch don't
change that often (pgmap changes about once per second, and that's the
fastest change by several orders of magnitude). So what exactly does the
monitor do with all that IO?


Cheers,

Sylvain


Re: [ceph-users] mon IO usage

2013-05-21 Thread Mike Dawson

Sylvain,

I can confirm I see a similar traffic pattern.

Any time I have lots of writes going to my cluster (like heavy writes 
from RBD or remapping/backfilling after losing an OSD), I see all sorts 
of monitor issues.


If my monitor leveldb store.db directories grow past some unknown point 
(maybe ~1GB or so), 'compact on trim' can't keep up: the store.db grows 
faster than compaction can trim the garbage. After that point, the only 
hope of reining in the store.db size is to stop the OSDs and let leveldb 
compact without any ongoing writes.


I sent Sage and Joao a transaction dump of the growth yesterday. Sage 
looked, but the files are so large it is tough to get useful info.


http://tracker.ceph.com/issues/4895

I believe this issue has existed since 0.48.

- Mike

On 5/21/2013 8:16 AM, Sylvain Munaut wrote:

Hi,


I've just added some monitoring to the IO usage of mon (trying to
track down that growing mon issue), and I'm kind of surprised by the
amount of IO generated by the monitor process.

I get continuous 4 Mo/s / 75 iops with added big spikes at each
compaction every 3 min or so.

Is there a description somewhere of what the monitor does exactly ?  I
mean the monmap / pgmap / osdmap / mdsmap / election epoch don't
change that often (pgmap is like 1 per second and that's the fastest
change by several orders of magnitude). So what exactly does the
monitor do with all that IO ???


Cheers,

 Sylvain




Re: [ceph-users] mon IO usage

2013-05-21 Thread Sylvain Munaut
Hi,

 So, AFAICT, the bulk of the write would be writing out the pgmap to
 disk every second or so.

 It should be writing out the full map only every N commits... see 'paxos
 stash full interval', which defaults to 25.

But doesn't it also write the map out in full when there is a new pgmap?

I have a new pgmap about every second, and its size * period seemed to
match the IO rate pretty well, which is why I thought it was the reason
for the IO.


 Is it really needed to write it in full ? It doesn't change all that
 much AFAICT, so writing incremental changes with only periodic flush
 might be a better option ?

 Right.  It works this way now only because we haven't fully transitioned
 from the old scheme.  The next step is to store the PGMap over lots of
 leveldb keys (one per pg) so that there is no big encode/decode of the
 entire PGMap structure...

Makes sense. I'm not sure of the per-key overhead of leveldb, though,
in cases where there are lots (> 10k) of PGs.


Cheers,

Sylvain


Re: [ceph-users] mon IO usage

2013-05-21 Thread Sage Weil
On Tue, 21 May 2013, Sylvain Munaut wrote:
 Hi,
 
  So, AFAICT, the bulk of the write would be writing out the pgmap to
  disk every second or so.
 
  It should be writing out the full map only every N commits... see 'paxos
  stash full interval', which defaults to 25.
 
 But doesn't it also write it in full when there is a new pgmap ?
 
 I have a new one about every second and its size * period seemed to
 match the IO rate pretty well which it why I thought it was the reason
 for the IO.

Hmm.  Can you generate a log with 'debug mon = 20', 'debug paxos = 20', 
'debug ms = 1' for a few minutes over which you see a high data rate and 
send it my way?  It sounds like there is something wrong with the 
stash_full logic.

Thanks!

  Is it really needed to write it in full ? It doesn't change all that
  much AFAICT, so writing incremental changes with only periodic flush
  might be a better option ?
 
  Right.  It works this way now only because we haven't fully transitioned
  from the old scheme.  The next step is to store the PGMap over lots of
  leveldb keys (one per pg) so that there is no big encode/decode of the
  entire PGMap structure...
 
 Makes sense. I'm not sure of the per-key overhead of leveldb though,
 in case where there are lots (> 10k) PGs.

Yeah, it will be larger on-disk, but the io rate will at least be 
proportional to the update rate.  :)
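
To make the shape of that concrete, here's a tiny sketch against leveldb's C 
API showing the two layouts side by side.  The key names are invented for 
illustration; this is not ceph's actual on-disk format, and error checks on 
the puts are omitted for brevity:

/*
 * Illustration: one big "stash the whole encoded PGMap" value versus one
 * small value per PG, so only the PGs that changed get rewritten.
 */
#include <stdio.h>
#include <string.h>
#include <leveldb/c.h>

int main(void)
{
	char *err = NULL;
	leveldb_options_t *opts = leveldb_options_create();
	leveldb_writeoptions_t *wopts = leveldb_writeoptions_create();
	leveldb_t *db;

	leveldb_options_set_create_if_missing(opts, 1);
	db = leveldb_open(opts, "/tmp/pgmap-demo", &err);
	if (err) {
		fprintf(stderr, "open: %s\n", err);
		return 1;
	}

	/* old scheme: re-stash the entire encoded map on every change */
	const char *full = "...entire encoded PGMap, possibly many MB...";
	leveldb_put(db, wopts, "pgmap_full", strlen("pgmap_full"),
		    full, strlen(full), &err);

	/* per-pg scheme: a small record per PG, updated independently */
	const char *key = "pgmap/2.1a";
	const char *val = "per-PG stat record";
	leveldb_put(db, wopts, key, strlen(key), val, strlen(val), &err);

	leveldb_close(db);
	leveldb_writeoptions_destroy(wopts);
	leveldb_options_destroy(opts);
	return 0;
}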

sage


Re: [ceph-users] mon IO usage

2013-05-21 Thread Sylvain Munaut
Hi,

 Hmm.  Can you generate a log with 'debug mon = 20', 'debug paxos = 20',
 'debug ms = 1' for a few minutes over which you see a high data rate and
 send it my way?  It sounds like there is something wrong with the
 stash_full logic.

Mm, actually I may have been fooled by the instrumentation ... it uses a
30 sec average, so looking closer I don't have 4 MB/s constantly;
it's more like a burst of 50 MB every 15-20 sec.

In any case, that seems like a lot of data being written.

Logs can be downloaded from http://ge.tt/9MOeKHh/v/0


Cheers,

Sylvain


Re: [ceph-users] Inconsistent PG's, repair ineffective

2013-05-21 Thread David Zafman

I can't reproduce this on v0.61-2.  Could the disks for osd.13 & osd.22 be 
unwritable?

In your case it looks like the 3rd replica is probably the bad one, since 
osd.13 and osd.22 are the same.  You probably want to manually repair the 3rd 
replica.

David Zafman
Senior Developer
http://www.inktank.com




On May 21, 2013, at 6:45 AM, John Nielsen li...@jnielsen.net wrote:

 Cuttlefish on CentOS 6, ceph-0.61.2-0.el6.x86_64.
 
 On May 21, 2013, at 12:13 AM, David Zafman david.zaf...@inktank.com wrote:
 
 
 What version of ceph are you running?
 
 David Zafman
 Senior Developer
 http://www.inktank.com
 
 On May 20, 2013, at 9:14 AM, John Nielsen li...@jnielsen.net wrote:
 
 Some scrub errors showed up on our cluster last week. We had some issues 
 with host stability a couple weeks ago; my guess is that errors were 
 introduced at that point and a recent background scrub detected them. I was 
 able to clear most of them via ceph pg repair, but several remain. Based 
 on some other posts, I'm guessing that they won't repair because it is the 
 primary copy that has the error. All of our pools are set to size 3 so 
 there _ought_ to be a way to verify and restore the correct data, right?
 
 Below is some log output about one of the problem PG's. Can anyone suggest 
 a way to fix the inconsistencies?
 
 2013-05-20 10:07:54.529582 osd.13 10.20.192.111:6818/20919 3451 : [ERR] 
 19.1b osd.13: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 
 4289025870 != known digest 4190506501
 2013-05-20 10:07:54.529585 osd.13 10.20.192.111:6818/20919 3452 : [ERR] 
 19.1b osd.22: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 
 4289025870 != known digest 4190506501
 2013-05-20 10:07:54.606034 osd.13 10.20.192.111:6818/20919 3453 : [ERR] 
 19.1b repair 0 missing, 1 inconsistent objects
 2013-05-20 10:07:54.606066 osd.13 10.20.192.111:6818/20919 3454 : [ERR] 
 19.1b repair 2 errors, 2 fixed
 2013-05-20 10:07:55.034221 osd.13 10.20.192.111:6818/20919 3455 : [ERR] 
 19.1b osd.13: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 
 4289025870 != known digest 4190506501
 2013-05-20 10:07:55.034224 osd.13 10.20.192.111:6818/20919 3456 : [ERR] 
 19.1b osd.22: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 
 4289025870 != known digest 4190506501
 2013-05-20 10:07:55.113230 osd.13 10.20.192.111:6818/20919 3457 : [ERR] 
 19.1b deep-scrub 0 missing, 1 inconsistent objects
 2013-05-20 10:07:55.113235 osd.13 10.20.192.111:6818/20919 3458 : [ERR] 
 19.1b deep-scrub 2 errors
 
 Thanks,
 
 JN
 
 
 
 



Re: [ceph-users] Inconsistent PG's, repair ineffective

2013-05-21 Thread John Nielsen
I've checked; all the disks are fine and the cluster is healthy except for the 
inconsistent objects.

How would I go about manually repairing?

On May 21, 2013, at 3:26 PM, David Zafman david.zaf...@inktank.com wrote:

 
 I can't reproduce this on v0.61-2.  Could the disks for osd.13 & osd.22 be 
 unwritable?
 
 In your case it looks like the 3rd replica is probably the bad one, since 
 osd.13 and osd.22 are the same.  You probably want to manually repair the 3rd 
 replica.
 
 David Zafman
 Senior Developer
 http://www.inktank.com
 
 
 
 
 On May 21, 2013, at 6:45 AM, John Nielsen li...@jnielsen.net wrote:
 
 Cuttlefish on CentOS 6, ceph-0.61.2-0.el6.x86_64.
 
 On May 21, 2013, at 12:13 AM, David Zafman david.zaf...@inktank.com wrote:
 
 
 What version of ceph are you running?
 
 David Zafman
 Senior Developer
 http://www.inktank.com
 
 On May 20, 2013, at 9:14 AM, John Nielsen li...@jnielsen.net wrote:
 
 Some scrub errors showed up on our cluster last week. We had some issues 
 with host stability a couple weeks ago; my guess is that errors were 
 introduced at that point and a recent background scrub detected them. I 
 was able to clear most of them via ceph pg repair, but several remain. 
 Based on some other posts, I'm guessing that they won't repair because it 
 is the primary copy that has the error. All of our pools are set to size 3 
 so there _ought_ to be a way to verify and restore the correct data, right?
 
 Below is some log output about one of the problem PG's. Can anyone suggest 
 a way to fix the inconsistencies?
 
 2013-05-20 10:07:54.529582 osd.13 10.20.192.111:6818/20919 3451 : [ERR] 
 19.1b osd.13: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 
 4289025870 != known digest 4190506501
 2013-05-20 10:07:54.529585 osd.13 10.20.192.111:6818/20919 3452 : [ERR] 
 19.1b osd.22: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 
 4289025870 != known digest 4190506501
 2013-05-20 10:07:54.606034 osd.13 10.20.192.111:6818/20919 3453 : [ERR] 
 19.1b repair 0 missing, 1 inconsistent objects
 2013-05-20 10:07:54.606066 osd.13 10.20.192.111:6818/20919 3454 : [ERR] 
 19.1b repair 2 errors, 2 fixed
 2013-05-20 10:07:55.034221 osd.13 10.20.192.111:6818/20919 3455 : [ERR] 
 19.1b osd.13: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 
 4289025870 != known digest 4190506501
 2013-05-20 10:07:55.034224 osd.13 10.20.192.111:6818/20919 3456 : [ERR] 
 19.1b osd.22: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 
 4289025870 != known digest 4190506501
 2013-05-20 10:07:55.113230 osd.13 10.20.192.111:6818/20919 3457 : [ERR] 
 19.1b deep-scrub 0 missing, 1 inconsistent objects
 2013-05-20 10:07:55.113235 osd.13 10.20.192.111:6818/20919 3458 : [ERR] 
 19.1b deep-scrub 2 errors
 
 Thanks,
 
 JN
 
 
 
 
 
 
