Re: Stable/Master debian/rules not stripping all packages
On Tuesday 03 July 2012, Sage Weil wrote: On Tue, 3 Jul 2012, Amon Ott wrote: I just found out that some Debian binary packages do not get stripped - a 53MB ceph-mds does look a bit weird. I identified the packages ceph-mds and gceph and added these lines: dh_strip -pceph-mds --dbg-package=ceph-mds-dbg dh_strip -pgceph --dbg-package=gceph-dbg I added stripping for ceph-mds, but gceph has been removed... I'm curious which you're looking at that has both ceph-mds and gceph? My fault, found that later when building again. I still had an old gceph-dbg package in my repository; deleted that now. Amon Ott -- Dr. Amon Ott m-privacy GmbH Tel: +49 30 24342334 Am Köllnischen Park 1 Fax: +49 30 24342336 10179 Berlin http://www.m-privacy.de Amtsgericht Charlottenburg, HRB 84946 Geschäftsführer: Dipl.-Kfm. Holger Maczkowsky, Roman Maczkowsky GnuPG-Key-ID: 0x2DD3A649 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: domino-style OSD crash
On 03/07/2012 23:38, Tommi Virtanen wrote: On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont yann.dup...@univ-nantes.fr wrote: In the case I could repair, do you think a crashed FS as it is right now is valuable for you, for future reference, as I saw you can't reproduce the problem? I can make an archive (or a btrfs dump?), but it will be quite big. At this point, it's more about the upstream developers (of btrfs etc) than us; we're on good terms with them but not experts on the on-disk format(s). You might want to send an email to the relevant mailing lists before wiping the disks. Well, I probably wasn't clear enough. I talked about a crashed FS, but I was talking about ceph. The underlying FS (btrfs in that case) of 1 node (and only one) has PROBABLY crashed in the past, causing corruption in the ceph data on this node, and then the subsequent crash of other nodes. RIGHT now btrfs on this node is OK. I can access the filesystem without errors. For the moment, of 8 nodes, 4 refuse to restart. 1 of the 4 nodes was the crashed node; the 3 others didn't have problems with the underlying fs as far as I can tell. So I think the scenario is: One node had a problem with btrfs, leading first to kernel problems, probably corruption (on disk / in memory maybe?), and ultimately to a kernel oops. Before that ultimate kernel oops, bad data was transmitted to other (sane) nodes, leading to ceph-osd crashes on those nodes. If you think this scenario is highly improbable in real life (that is, btrfs will probably be fixed for good, and then corruption can't happen), it's ok. But I wonder if this scenario can be triggered by other problems, and bad data can be transmitted to other sane nodes (power outage, out of memory condition, disk full...
for example) That's why I proposed you a crashed ceph volume image (I shouldn't have talked about a crashed fs, sorry for the confusion). Talking about btrfs, there are a lot of fixes in btrfs between 3.4 and 3.5rc. After the crash, I couldn't mount the btrfs volume. With 3.5rc I can, and there is no sign of problems on it. It doesn't mean the data is safe there, but I think it's a sign that, at least, some bugs have been corrected in the btrfs code. Cheers, -- Yann Dupont - Service IRTS, DSI Université de Nantes Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr
Re: RBD support for primary storage in Apache CloudStack
Hi, On 03-07-12 20:22, Ross Turk wrote: Hey Wido! This is really cool. I think it'd be useful to have a guide that people can follow to stand up CloudStack with Ceph. Even though it's still in active development, I'd like to encourage people to try it out. Would you be willing to work with the Inktank team to create something like that? I think we can do most of the writing, but we'll need help if we get stuck. Yes, that would be great! I think most users will start to use it via the WebGUI, so we should document that first. A fast and short write-up of the steps: 1. Get a Ceph cluster running 2. Have one or multiple hosts with the proper Qemu and libvirt (>= 0.9.13) code 3. Set up CloudStack from the RBD branch (compile by hand) 4. Set up your zones with NFS primary storage 5. Add the RBD primary storage and add the tag 'rbd' 6. Create a disk offering with the storage tag 'rbd' NFS primary storage is still needed for the System VMs inside CloudStack. There is however still one libvirt patch outstanding for when people are not using cephx: https://www.redhat.com/archives/libvir-list/2012-June/msg01119.html I'm also hunting a bug under Ubuntu 12.04 where stored 'secrets' in libvirt get corrupted. The root cause has been found, but it's an external library which is causing the problems. The thread: https://www.redhat.com/archives/libvir-list/2012-July/msg00135.html Wido Cheers, Ross On Friday, June 29, 2012 at 9:01 AM, Wido den Hollander wrote: Hi, I'm cross-posting this to the ceph-devel list since there might be people around here running CloudStack who are interested in this. After a couple of months worth of work I'm happy to announce that the RBD support for primary storage in CloudStack seems to be reaching a point where it's good enough to be reviewed. If you are planning to test RBD, please do read this e-mail carefully since there are still some catches.
Although the code inside CloudStack doesn't seem like a lot of code, I had to modify code outside CloudStack to get RBD support working: 1. RBD storage pool support in libvirt. [0] [1] 2. Fix a couple of bugs in the libvirt-java bindings. [2] With those issues addressed I could implement RBD inside CloudStack. While doing so I ran into multiple issues inside CloudStack which delayed everything a bit. For now, the RBD support for primary storage has some limitations: - It only works with KVM - You are NOT able to snapshot RBD volumes. This is due to CloudStack wanting to back up snapshots to the secondary storage, and it uses 'qemu-img convert' for this. That doesn't work with RBD, but it's also very inefficient. RBD supports native snapshots inside the Ceph cluster. RBD disks also have the potential to reach very large sizes. Disks of 1TB won't be the exception. It would stress your network heavily. I'm thinking about implementing internal snapshots, but that is step #2. For now, no snapshots. - You are able to create a template from an RBD volume, but creating a new instance with RBD storage from a template is still hit-and-miss. Working on that one. Other than these limitations, everything works. You can create instances and attach RBD disks. It also supports cephx authorization, so no problem there! What do you need to run this patch? - A Ceph cluster - libvirt with RBD storage pool support (>= 0.9.12) - Modified libvirt-java bindings (jar is in the patch) - Qemu with RBD support (>= 0.14) - An extra field user_info in the storage pool table, see the SQL change in the patch You can fetch the code on my Github account [3]. Warning: I'll be rebasing against the master branch regularly, so be aware of git pull not always working nicely. I'd like to see this code reviewed while I'm working on the latest stuff and getting all the patches upstream in other projects (mainly the libvirt Java bindings). Any suggestions or comments? Thank you!
Wido [0]: http://libvirt.org/git/?p=libvirt.git;a=commit;h=74951eadef85e2d100c7dc7bd9ae1093fbda722f [1]: http://libvirt.org/git/?p=libvirt.git;a=commit;h=122fa379de44a2fd0a6d5fbcb634535d647ada17 [2]: https://github.com/wido/libvirt-java/commits/cloudstack [3]: https://github.com/wido/CloudStack/commits/rbd -- Ross Turk VP of Community, Inktank @rossturk @inktank @ceph Any sufficiently advanced technology is indistinguishable from magic. -- Arthur C. Clarke
monitor not starting
Hi List, I today upgraded from 0.43 to 0.48 and now I have one monitor which does not want to start up anymore: ceph version 0.48argonaut-125-g4e774fb (commit:4e774fbcb38fd6883232b72352512a5f8e4a66e8) 1: /usr/bin/ceph-mon() [0x52f9c9] 2: (()+0xeff0) [0x7fb08dd11ff0] 3: (gsignal()+0x35) [0x7fb08c4f41b5] 4: (abort()+0x180) [0x7fb08c4f6fc0] 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fb08cd88dc5] 6: (()+0xcb166) [0x7fb08cd87166] 7: (()+0xcb193) [0x7fb08cd87193] 8: (()+0xcb28e) [0x7fb08cd8728e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x940) [0x55b310] 10: /usr/bin/ceph-mon() [0x497317] 11: (Monitor::init()+0xc5a) [0x4857fa] 12: (main()+0x2789) [0x46ac79] 13: (__libc_start_main()+0xfd) [0x7fb08c4e0c8d] 14: /usr/bin/ceph-mon() [0x468309] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. --- end dump of recent events --- How can I find out why it does not start up anymore? osd and mds are running fine.. -- Mit freundlichen Grüßen, Florian Wiessner Smart Weblications GmbH Martinsberger Str. 1 D-95119 Naila fon.: +49 9282 9638 200 fax.: +49 9282 9638 205 24/7: +49 900 144 000 00 - 0,99 EUR/Min* http://www.smart-weblications.de -- Sitz der Gesellschaft: Naila Geschäftsführer: Florian Wiessner HRB-Nr.: HRB 3840 Amtsgericht Hof *aus dem dt. Festnetz, ggf. abweichende Preise aus dem Mobilfunknetz
[PATCH] librados: Bump the version to 0.48
Signed-off-by: Wido den Hollander w...@widodh.nl --- src/include/rados/librados.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/include/rados/librados.h b/src/include/rados/librados.h index 9f723f7..4870b0f 100644 --- a/src/include/rados/librados.h +++ b/src/include/rados/librados.h @@ -23,7 +23,7 @@ extern "C" { #endif #define LIBRADOS_VER_MAJOR 0 -#define LIBRADOS_VER_MINOR 44 +#define LIBRADOS_VER_MINOR 48 #define LIBRADOS_VER_EXTRA 0 #define LIBRADOS_VERSION(maj, min, extra) ((maj << 16) + (min << 8) + extra) -- 1.7.9.5
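As a side note, the arithmetic behind the LIBRADOS_VERSION macro touched by this patch can be sketched in Python. This is only an illustration of the macro's packing scheme, not Ceph code; the function name is made up for the example:

```python
# Sketch of librados.h's LIBRADOS_VERSION(maj, min, extra) macro:
# the three components are packed into a single integer as
# (maj << 16) + (min << 8) + extra.

def librados_version(maj, minor, extra):
    """Pack major/minor/extra the way the C macro does."""
    return (maj << 16) + (minor << 8) + extra

# With this patch applied, the packed version becomes:
print(librados_version(0, 48, 0))  # 12288, i.e. 48 << 8
```

Because the components occupy disjoint bit ranges, comparing two packed versions reduces to a plain integer comparison.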
Re: [PATCH] Generate URL-safe base64 strings for keys.
On Wed, 4 Jul 2012, Wido den Hollander wrote: By using this we prevent scenarios where cephx keys are not accepted in various situations. By replacing the + and / with - and _ we generate URL-safe base64 keys. Signed-off-by: Wido den Hollander w...@widodh.nl Do we already properly decode URL-safe base64 encoding? sage --- src/common/armor.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/src/common/armor.c b/src/common/armor.c index d1d5664..7f73da1 100644 --- a/src/common/armor.c +++ b/src/common/armor.c @@ -9,7 +9,7 @@ * base64 encode/decode. */ -const char *pem_key = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"; +const char *pem_key = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"; static int encode_bits(int c) { @@ -24,9 +24,9 @@ static int decode_bits(char c) return c - 'a' + 26; if (c >= '0' && c <= '9') return c - '0' + 52; - if (c == '+') + if (c == '+' || c == '-') return 62; - if (c == '/') + if (c == '/' || c == '_') return 63; if (c == '=') return 0; /* just non-negative, please */ -- 1.7.9.5
Re: [PATCH] Generate URL-safe base64 strings for keys.
----- Original message ----- On Wed, 4 Jul 2012, Wido den Hollander wrote: By using this we prevent scenarios where cephx keys are not accepted in various situations. By replacing the + and / with - and _ we generate URL-safe base64 keys. Signed-off-by: Wido den Hollander w...@widodh.nl Do we already properly decode URL-safe base64 encoding? Yes, it decodes URL-safe base64 as well. See the if statements for 62 and 63: + and - are treated equally, just like / and _. Wido sage --- src/common/armor.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/src/common/armor.c b/src/common/armor.c index d1d5664..7f73da1 100644 --- a/src/common/armor.c +++ b/src/common/armor.c @@ -9,7 +9,7 @@ * base64 encode/decode. */ -const char *pem_key = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"; +const char *pem_key = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"; static int encode_bits(int c) { @@ -24,9 +24,9 @@ static int decode_bits(char c) return c - 'a' + 26; if (c >= '0' && c <= '9') return c - '0' + 52; - if (c == '+') + if (c == '+' || c == '-') return 62; - if (c == '/') + if (c == '/' || c == '_') return 63; if (c == '=') return 0; /* just non-negative, please */ -- 1.7.9.5
Re: [PATCH] Generate URL-safe base64 strings for keys.
On Wed, 4 Jul 2012, Wido den Hollander wrote: On Wed, 4 Jul 2012, Wido den Hollander wrote: By using this we prevent scenarios where cephx keys are not accepted in various situations. By replacing the + and / with - and _ we generate URL-safe base64 keys. Signed-off-by: Wido den Hollander w...@widodh.nl Do we already properly decode URL-safe base64 encoding? Yes, it decodes URL-safe base64 as well. See the if statements for 62 and 63: + and - are treated equally, just like / and _. Oh, got it. The commit description confused me... I thought this was related to encoding only. I think we should break the encode and decode patches into separate versions, and apply the decode to a stable branch (argonaut) and the encode to master. That should avoid most problems with a rolling/staggered upgrade... sage Wido sage --- src/common/armor.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/src/common/armor.c b/src/common/armor.c index d1d5664..7f73da1 100644 --- a/src/common/armor.c +++ b/src/common/armor.c @@ -9,7 +9,7 @@ * base64 encode/decode.
*/ -const char *pem_key = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"; +const char *pem_key = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"; static int encode_bits(int c) { @@ -24,9 +24,9 @@ static int decode_bits(char c) return c - 'a' + 26; if (c >= '0' && c <= '9') return c - '0' + 52; - if (c == '+') + if (c == '+' || c == '-') return 62; - if (c == '/') + if (c == '/' || c == '_') return 63; if (c == '=') return 0; /* just non-negative, please */ -- 1.7.9.5
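The property this patch relies on (a decoder that accepts both alphabets can read keys produced by either encoder) can be illustrated with Python's stdlib base64 module. This is a sketch for illustration only, not the Ceph armor.c implementation, and the helper names are made up:

```python
import base64

def encode_urlsafe(data):
    # Counterpart of the patched pem_key: '+' becomes '-', '/' becomes '_'.
    return base64.urlsafe_b64encode(data).decode("ascii")

def decode_either(text):
    # Mirrors the patched decode_bits(): '+' and '-' both map to 62,
    # '/' and '_' both map to 63, so keys in either alphabet decode.
    normalized = text.replace("-", "+").replace("_", "/")
    return base64.b64decode(normalized)

secret = bytes(range(251, 256)) * 3   # high byte values force '+' and '/'
standard = base64.b64encode(secret).decode("ascii")
urlsafe = encode_urlsafe(secret)

assert standard != urlsafe                # alphabets differ on values 62/63
assert decode_either(standard) == secret  # old-style keys still decode
assert decode_either(urlsafe) == secret   # URL-safe keys decode too
```

This is also why splitting the patch makes sense: once decoders everywhere accept both alphabets, the encoder can be switched over without breaking existing keys that contain '+' or '/'.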
Re: domino-style OSD crash
On Wednesday, July 4, 2012 at 1:06 AM, Yann Dupont wrote: On 03/07/2012 23:38, Tommi Virtanen wrote: On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont yann.dup...@univ-nantes.fr wrote: In the case I could repair, do you think a crashed FS as it is right now is valuable for you, for future reference, as I saw you can't reproduce the problem? I can make an archive (or a btrfs dump?), but it will be quite big. At this point, it's more about the upstream developers (of btrfs etc) than us; we're on good terms with them but not experts on the on-disk format(s). You might want to send an email to the relevant mailing lists before wiping the disks. Well, I probably wasn't clear enough. I talked about a crashed FS, but I was talking about ceph. The underlying FS (btrfs in that case) of 1 node (and only one) has PROBABLY crashed in the past, causing corruption in the ceph data on this node, and then the subsequent crash of other nodes. RIGHT now btrfs on this node is OK. I can access the filesystem without errors. For the moment, of 8 nodes, 4 refuse to restart. 1 of the 4 nodes was the crashed node; the 3 others didn't have problems with the underlying fs as far as I can tell. So I think the scenario is: One node had a problem with btrfs, leading first to kernel problems, probably corruption (on disk / in memory maybe?), and ultimately to a kernel oops. Before that ultimate kernel oops, bad data was transmitted to other (sane) nodes, leading to ceph-osd crashes on those nodes. I don't think that's actually possible — the OSDs all do quite a lot of interpretation between what they get off the wire and what goes on disk.
What you've got here are 4 corrupted LevelDB databases, and we pretty much can't do that through the interfaces we have. :/ If you think this scenario is highly improbable in real life (that is, btrfs will probably be fixed for good, and then corruption can't happen), it's ok. But I wonder if this scenario can be triggered by other problems, and bad data can be transmitted to other sane nodes (power outage, out of memory condition, disk full... for example) That's why I proposed you a crashed ceph volume image (I shouldn't have talked about a crashed fs, sorry for the confusion) I appreciate the offer, but I don't think this will help much — it's a disk state managed by somebody else, not our logical state, which has broken. If we could figure out how that state got broken that'd be good, but a ceph image won't really help in doing so. I wonder if maybe there's a confounding factor here — are all your nodes similar to each other, or are they running on different kinds of hardware? How did you do your Ceph upgrades? What's ceph -s display when the cluster is running as best it can? -Greg
Re: monitor not starting
On Wednesday, July 4, 2012 at 4:45 AM, Smart Weblications GmbH - Florian Wiessner wrote: Hi List, I today upgraded from 0.43 to 0.48 and now I have one monitor which does not want to start up anymore: ceph version 0.48argonaut-125-g4e774fb (commit:4e774fbcb38fd6883232b72352512a5f8e4a66e8) 1: /usr/bin/ceph-mon() [0x52f9c9] 2: (()+0xeff0) [0x7fb08dd11ff0] 3: (gsignal()+0x35) [0x7fb08c4f41b5] 4: (abort()+0x180) [0x7fb08c4f6fc0] 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fb08cd88dc5] 6: (()+0xcb166) [0x7fb08cd87166] 7: (()+0xcb193) [0x7fb08cd87193] 8: (()+0xcb28e) [0x7fb08cd8728e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x940) [0x55b310] 10: /usr/bin/ceph-mon() [0x497317] 11: (Monitor::init()+0xc5a) [0x4857fa] 12: (main()+0x2789) [0x46ac79] 13: (__libc_start_main()+0xfd) [0x7fb08c4e0c8d] 14: /usr/bin/ceph-mon() [0x468309] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. --- end dump of recent events --- How can I find out why it does not start up anymore? osd and mds are running fine.. Is that all the output you get? There should be a line somewhere which says what the assert is, and what line number it's on. :) And while you're at it, is the rest of the cluster in fact working? I don't think 0.43 to 0.48 is an upgrade path we tested. -Greg
Re: [PATCH] librados: Bump the version to 0.48
Hmmm — we generally try to modify these versions when the API changes, not on every sprint. It looks to me like Sage added one function in 0.45 where we maybe should have bumped it, but that was a long time ago and at this point we should maybe just eat it? -Greg On Wednesday, July 4, 2012 at 6:46 AM, Wido den Hollander wrote: Signed-off-by: Wido den Hollander w...@widodh.nl --- src/include/rados/librados.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/include/rados/librados.h b/src/include/rados/librados.h index 9f723f7..4870b0f 100644 --- a/src/include/rados/librados.h +++ b/src/include/rados/librados.h @@ -23,7 +23,7 @@ extern "C" { #endif #define LIBRADOS_VER_MAJOR 0 -#define LIBRADOS_VER_MINOR 44 +#define LIBRADOS_VER_MINOR 48 #define LIBRADOS_VER_EXTRA 0 #define LIBRADOS_VERSION(maj, min, extra) ((maj << 16) + (min << 8) + extra) -- 1.7.9.5
Re: [PATCH] librados: Bump the version to 0.48
On Wed, 4 Jul 2012, Gregory Farnum wrote: Hmmm — we generally try to modify these versions when the API changes, not on every sprint. It looks to me like Sage added one function in 0.45 where we maybe should have bumped it, but that was a long time ago and at this point we should maybe just eat it? Yeah, I went ahead and applied this to stable (argonaut) since it's as good a reference point as any. Moving forward, we should try to sync this up with API changes as they happen. Hmm, like that assert ObjectOperation that just went into master... sage
Re: domino-style OSD crash
On 04/07/2012 18:21, Gregory Farnum wrote: On Wednesday, July 4, 2012 at 1:06 AM, Yann Dupont wrote: On 03/07/2012 23:38, Tommi Virtanen wrote: On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont yann.dup...@univ-nantes.fr wrote: In the case I could repair, do you think a crashed FS as it is right now is valuable for you, for future reference, as I saw you can't reproduce the problem? I can make an archive (or a btrfs dump?), but it will be quite big. At this point, it's more about the upstream developers (of btrfs etc) than us; we're on good terms with them but not experts on the on-disk format(s). You might want to send an email to the relevant mailing lists before wiping the disks. Well, I probably wasn't clear enough. I talked about a crashed FS, but I was talking about ceph. The underlying FS (btrfs in that case) of 1 node (and only one) has PROBABLY crashed in the past, causing corruption in the ceph data on this node, and then the subsequent crash of other nodes. RIGHT now btrfs on this node is OK. I can access the filesystem without errors. For the moment, of 8 nodes, 4 refuse to restart. 1 of the 4 nodes was the crashed node; the 3 others didn't have problems with the underlying fs as far as I can tell. So I think the scenario is: One node had a problem with btrfs, leading first to kernel problems, probably corruption (on disk / in memory maybe?), and ultimately to a kernel oops. Before that ultimate kernel oops, bad data was transmitted to other (sane) nodes, leading to ceph-osd crashes on those nodes. I don't think that's actually possible — the OSDs all do quite a lot of interpretation between what they get off the wire and what goes on disk.
What you've got here are 4 corrupted LevelDB databases, and we pretty much can't do that through the interfaces we have. :/ ok, so as all nodes were identical, I probably hit a btrfs bug (like an erroneous out-of-space condition) at more or less the same time. And when 1 osd was out, If you think this scenario is highly improbable in real life (that is, btrfs will probably be fixed for good, and then corruption can't happen), it's ok. But I wonder if this scenario can be triggered by other problems, and bad data can be transmitted to other sane nodes (power outage, out of memory condition, disk full... for example) That's why I proposed you a crashed ceph volume image (I shouldn't have talked about a crashed fs, sorry for the confusion) I appreciate the offer, but I don't think this will help much — it's a disk state managed by somebody else, not our logical state, which has broken. If we could figure out how that state got broken that'd be good, but a ceph image won't really help in doing so. ok, no problem. I'll restart from scratch, freshly formatted. I wonder if maybe there's a confounding factor here — are all your nodes similar to each other, Yes. I designed the cluster that way. All nodes are identical hardware (PowerEdge M610, 10G Intel ethernet + Emulex fibre channel attached to storage (1 array for 2 OSD nodes, 1 controller dedicated to each OSD)) or are they running on different kinds of hardware? How did you do your Ceph upgrades? What's ceph -s display when the cluster is running as best it can? Ceph was running 0.47.2 at that time (Debian package for ceph). After the crash I couldn't restart all the nodes. Tried 0.47.3 and now 0.48 without success. Nothing particular for the upgrades, because for the moment ceph is broken, so just apt-get upgrade with the new version.
ceph -s shows this: root@label5:~# ceph -s health HEALTH_WARN 260 pgs degraded; 793 pgs down; 785 pgs peering; 32 pgs recovering; 96 pgs stale; 793 pgs stuck inactive; 96 pgs stuck stale; 1092 pgs stuck unclean; recovery 267286/2491140 degraded (10.729%); 1814/1245570 unfound (0.146%) monmap e1: 3 mons at {chichibu=172.20.14.130:6789/0,glenesk=172.20.14.131:6789/0,karuizawa=172.20.14.133:6789/0}, election epoch 12, quorum 0,1,2 chichibu,glenesk,karuizawa osdmap e2404: 8 osds: 3 up, 3 in pgmap v173701: 1728 pgs: 604 active+clean, 8 down, 5 active+recovering+remapped, 32 active+clean+replay, 11 active+recovering+degraded, 25 active+remapped, 710 down+peering, 222 active+degraded, 7 stale+active+recovering+degraded, 61 stale+down+peering, 20 stale+active+degraded, 6 down+remapped+peering, 8 stale+down+remapped+peering, 9 active+recovering; 4786 GB data, 7495 GB used, 7280 GB / 15360 GB avail; 267286/2491140 degraded (10.729%); 1814/1245570 unfound (0.146%) mdsmap e172: 1/1/1 up {0=karuizawa=up:replay}, 2 up:standby BTW, after the 0.48 upgrade, there was a disk format conversion. 1 of the 4 surviving OSDs didn't
Re: monitor not starting
On Wednesday, July 4, 2012 at 10:02 AM, Smart Weblications GmbH - Florian Wiessner wrote: On 04.07.2012 18:25, Gregory Farnum wrote: On Wednesday, July 4, 2012 at 4:45 AM, Smart Weblications GmbH - Florian Wiessner wrote: Hi List, I today upgraded from 0.43 to 0.48 and now I have one monitor which does not want to start up anymore: ceph version 0.48argonaut-125-g4e774fb (commit:4e774fbcb38fd6883232b72352512a5f8e4a66e8) 1: /usr/bin/ceph-mon() [0x52f9c9] 2: (()+0xeff0) [0x7fb08dd11ff0] 3: (gsignal()+0x35) [0x7fb08c4f41b5] 4: (abort()+0x180) [0x7fb08c4f6fc0] 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fb08cd88dc5] 6: (()+0xcb166) [0x7fb08cd87166] 7: (()+0xcb193) [0x7fb08cd87193] 8: (()+0xcb28e) [0x7fb08cd8728e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x940) [0x55b310] 10: /usr/bin/ceph-mon() [0x497317] 11: (Monitor::init()+0xc5a) [0x4857fa] 12: (main()+0x2789) [0x46ac79] 13: (__libc_start_main()+0xfd) [0x7fb08c4e0c8d] 14: /usr/bin/ceph-mon() [0x468309] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. --- end dump of recent events --- How can I find out why it does not start up anymore? osd and mds are running fine.. Is that all the output you get? There should be a line somewhere which says what the assert is, and what line number it's on. :) Is this what you are looking for: 2012-07-04 11:20:24.448430 7f423d943780 1 mon.3@-1(probing) e1 init fsid 4553d0f6-1b31-4ba5-9d97-edae55bcaab4 2012-07-04 11:20:24.448994 7f423d943780 -1 mon/Paxos.cc: In function 'bool Paxos::is_consistent()' thread 7f423d943780 time 2012-07-04 11:20:24.448637 mon/Paxos.cc: 1031: FAILED assert(consistent || (slurping == 1)) Yep, that line.
This means the monitor's on-disk state is inconsistent, but I can think of a number of scenarios which could have caused this, depending on how you upgraded your cluster (older monitors didn't mark on-disk whenever they deliberately went inconsistent on a catchup, which I bet is what happened here). ceph version 0.48argonaut-125-g4e774fb (commit:4e774fbcb38fd6883232b72352512a5f8e4a66e8) 1: /usr/bin/ceph-mon() [0x497317] 2: (Monitor::init()+0xc5a) [0x4857fa] 3: (main()+0x2789) [0x46ac79] 4: (__libc_start_main()+0xfd) [0x7f423bcfbc8d] 5: /usr/bin/ceph-mon() [0x468309] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. --- begin dump of recent events --- -3 2012-07-04 11:20:24.447613 7f423d943780 1 store(/data/ceph/mon) mount -2 2012-07-04 11:20:24.447722 7f423d943780 0 ceph version 0.48argonaut-125-g4e774fb (commit:4e774fbcb38fd6883232b72352512a5f8e4a66e8), process ceph-mon, pid 7436 -1 2012-07-04 11:20:24.448430 7f423d943780 1 mon.3@-1(probing) e1 init fsid 4553d0f6-1b31-4ba5-9d97-edae55bcaab4 0 2012-07-04 11:20:24.448994 7f423d943780 -1 mon/Paxos.cc: In function 'bool Paxos::is_consistent()' thread 7f423d943780 time 2012-07-04 11:20:24.448637 mon/Paxos.cc: 1031: FAILED assert(consistent || (slurping == 1)) ceph version 0.48argonaut-125-g4e774fb (commit:4e774fbcb38fd6883232b72352512a5f8e4a66e8) 1: /usr/bin/ceph-mon() [0x497317] 2: (Monitor::init()+0xc5a) [0x4857fa] 3: (main()+0x2789) [0x46ac79] 4: (__libc_start_main()+0xfd) [0x7f423bcfbc8d] 5: /usr/bin/ceph-mon() [0x468309] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. --- end dump of recent events --- 2012-07-04 11:20:24.449567 7f423d943780 -1 *** Caught signal (Aborted) ** in thread 7f423d943780 And while you're at it, is the rest of the cluster in fact working? I don't think 0.43 to 0.48 is an upgrade path we tested.
Anyway, I removed the mon and did a ceph-mon --mkfs with the 3 mons that were still working after the upgrade and got it up and running again. Yes, the cluster is still working after the upgrade. Also upgraded to Linux 3.4.4 - it feels like the ceph-fuse and kernel ceph clients are a little less robust than in 0.43... when I start copying from /ceph to another mp, then it seems that for the copy operation, or in general for any operation, /ceph is unusable to other processes, which then makes the client behave very sluggishly... :( Well, it shouldn't have gotten less stable since we haven't made any big changes there…but you aren't the only one reporting that things seem to be a little bit slower. We're going to have to look at that once people are back in the office after Independence Day. i can send you the contents of the monitor directory where it did not work after the upgrade if you want me to.. No, that won't be necessary. Thanks though!
Re: Ceph for email storage
On Wednesday, July 4, 2012 at 11:29 AM, Mitsue Acosta Murakami wrote: Hello, We are examining Ceph for use as email storage. In our current system, several client servers running different services (imap, smtp, etc.) access an NFS storage server. The mailboxes are stored in Maildir format, with many small files. We use Amazon AWS EC2 for the clients and the storage server. In this scenario, we have some questions about Ceph:
1. Is Ceph recommended for heavy write/read of small files?
2. Is there any problem with installing Ceph on Amazon instances?
3. Does Ceph support quotas yet?
4. What file system would you encourage us to use?
Are you interested in using RBD to back your mail servers, or in using the Ceph FS to provide shared storage? Ceph FS isn't considered production-ready at this time, but RBD should be, for appropriate use cases. In general:
1) If you allow your caching layers to do their job, any Ceph system should handle small writes fine. Reads will require normal disk accesses.
2) There shouldn't be.
3) None of the Ceph systems support quotas right now, although CephFS does offer easy usage reports.
4) Assuming you mean for the OSDs, XFS seems to be your best bet right now, but we work to make Ceph perform as well as possible under btrfs and ext4 too.
-Greg -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
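For question 1, small-file write behavior is easy to smoke-test on any candidate mount point before committing to a design. A rough sketch (the helper name and the 4 KB / 200-file parameters are arbitrary choices, and a real Maildir delivery also involves renames and directory fsyncs, so treat this as a lower bound on realism):

```python
import os
import tempfile
import time

def small_file_write_bench(root, n_files=200, size=4096):
    """Write n_files files of `size` bytes under root, fsyncing each one,
    and return elapsed wall-clock seconds (a crude Maildir-ish workload)."""
    payload = b"x" * size
    start = time.monotonic()
    for i in range(n_files):
        with open(os.path.join(root, f"msg.{i}"), "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
    return time.monotonic() - start

# Point `root` at a CephFS or RBD-backed mount to compare against NFS;
# a temporary local directory is used here just so the sketch runs.
with tempfile.TemporaryDirectory() as d:
    elapsed = small_file_write_bench(d)
    print(f"wrote 200 x 4 KB files in {elapsed:.3f}s")
```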
Re: Slow request warnings on 0.48
On 4 Jul 2012, at 19:59, Gregory Farnum wrote: That's odd — there isn't too much that went into the OSD between 0.47 and 0.48 that I can think of, and most of that only impacts OSDs when they go through bootup. What does ceph -s display — are all the PGs healthy? -Greg
Hi Greg, The PGs all seem to be healthy:
root@store1:~# ceph -s
   health HEALTH_OK
   monmap e1: 3 mons at {0=10.0.1.40:6789/0,1=10.0.1.41:6789/0,2=10.0.1.42:6789/0}, election epoch 40, quorum 0,1,2 0,1,2
   osdmap e342: 7 osds: 7 up, 7 in
    pgmap v5403: 1344 pgs: 1344 active+clean; 4620 MB data, 9617 MB used, 1368 GB / 1377 GB avail
   mdsmap e50: 0/0/1 up
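Checking "are all the PGs active+clean?" can also be scripted against that status output. A toy parser for the exact pgmap line format shown above (the plain-text ceph CLI output is not a stable API, so the regex here is an assumption tied to this version's formatting):

```python
import re

def parse_pgmap(line):
    """Toy parser (not a ceph API) for the plain-text 'pgmap' line from
    `ceph -s`; returns (total_pgs, {state: count})."""
    m = re.match(r"pgmap v\d+: (\d+) pgs: (.+?);", line)
    total = int(m.group(1))
    states = {}
    for part in m.group(2).split(","):
        count, state = part.strip().split(" ", 1)
        states[state] = int(count)
    return total, states

line = "pgmap v5403: 1344 pgs: 1344 active+clean; 4620 MB data, ..."
total, states = parse_pgmap(line)
# Healthy means every PG is accounted for by the active+clean state.
print(total == states.get("active+clean"))
```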
What does replica size mean?
Hi, all: Just want to make sure of one thing. If I set the replica size to 2, that means one piece of data with 2 copies, right? Therefore, if I measure rbd performance at 100 MB/s, I imagine the actual I/O throughput on the hard disks is 100 MB/s * 3 = 300 MB/s. Am I correct? Thanks!
Re: What does replica size mean?
On Thu, 5 Jul 2012, eric_yh_c...@wiwynn.com wrote: Hi, all: Just want to make sure of one thing. If I set the replica size to 2, that means one piece of data with 2 copies, right? Therefore, if I measure rbd performance at 100 MB/s, I imagine the actual I/O throughput on the hard disks is 100 MB/s * 3 = 300 MB/s. Am I correct? Right. pool size = pg size = number of OSDs in each PG = number of replicas. So a pool with 'size 3' means 3x replication. sage
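So the multiplier is simply the pool's 'size': a size-2 pool puts roughly 2x the client write throughput on the disks, and only a size-3 pool puts 3x. A trivial sketch of that arithmetic (the helper is illustrative, and it deliberately ignores the OSD journal, which in this era of Ceph roughly doubles writes again on each OSD):

```python
def backend_write_throughput(client_mb_s, pool_size):
    """Aggregate disk write traffic across the cluster for N-way
    replication: every client write lands on `pool_size` OSDs."""
    return client_mb_s * pool_size

print(backend_write_throughput(100, 2))  # -> 200 (MB/s for a size-2 pool)
print(backend_write_throughput(100, 3))  # -> 300 (MB/s for a size-3 pool)
```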
Re: Osd placement rule questions
On Thu, 5 Jul 2012, Mark Kirkwood wrote: Hi, I am experimenting with ceph (rbd only for now), and have a few questions about what is possible via placement rules. For example I am looking at a setup with a local datacenter (datacenter0) and a remote one (datacenter1). I'm using a placement rule:

rule rbd {
        ruleset 2
        type replicated
        min_size 1
        max_size 10
        step take datacenter0
        step chooseleaf firstn -1 type host
        step emit
        step take datacenter1
        step chooseleaf firstn 1 type host
        step emit
}

and I have the rbd pool set to size 3. So I *think* I am saying I want 2 replicas in datacenter0 and one in datacenter1 [1]. That's right! The questions I have are: 1/ I would like to be able to say something like: make 2 copies at datacenter0 and 1 at datacenter1, and wait for the ones at datacenter0 to be written but not the ones at datacenter1 (so asynchronous for the latter). Is this possible, or planned? It is not possible yet, but planned for the future. 2/ Also I would like to be able to say: make my number of copies 3, but if I lose datacenter0 (where 2 copies are), don't try to have 3 copies at datacenter1 (so run degraded in that case). Is that possible? That is what you get now. Doing the opposite (2 copies in DC1, 1 in DC2, but if DC2 is down 3 in DC1) is not currently possible with the crush rules. sage
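In CRUSH, a non-positive firstn count is interpreted relative to the number of replicas requested, so with pool size 3 the `firstn -1` step picks 3 - 1 = 2 hosts from datacenter0 and the `firstn 1` step adds one from datacenter1. A toy sketch of just that counting behavior (not the real CRUSH algorithm, which selects hosts pseudo-randomly by hashing down the hierarchy; the host names are made up):

```python
def toy_crush_place(pool_size, dc0_hosts, dc1_hosts):
    """Toy model of the rule above: 'chooseleaf firstn -1' in
    datacenter0 yields (pool_size - 1) hosts, then 'firstn 1' in
    datacenter1 yields 1 more. Real CRUSH picks pseudo-randomly;
    here we just take hosts in order to show the counts."""
    return dc0_hosts[: pool_size - 1] + dc1_hosts[:1]

# size-3 pool: 2 replicas in datacenter0, 1 in datacenter1
print(toy_crush_place(3, ["h0", "h1", "h2"], ["h3", "h4"]))
```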
Re: Osd placement rule questions
On 05/07/12 15:57, Sage Weil wrote: On Thu, 5 Jul 2012, Mark Kirkwood wrote: 2/ Also I would like to be able to say make my number of copies 3, but if I lose datacenter0 (where 2 copies are), don't try to have 3 copies at datacenter1 (so run degraded in that case). Is that possible? That is what you get now. Doing the opposite (2 copies in DC1, 1 in DC2, but if DC2 is down 3 in DC1) is not currently possible with the crush rules. Ah, right - excellent and thanks for clarifying! I guess I was unconsciously (and incorrectly) thinking that the crush rule would be modified when (say) datacenter0 was not available. Cheers Mark