Re: Stable/Master debian/rules not stripping all packages

2012-07-04 Thread Amon Ott
On Tuesday 03 July 2012, Sage Weil wrote:
 On Tue, 3 Jul 2012, Amon Ott wrote:
  just found out that some Debian binary packages do not get stripped - a
  53MB ceph-mds does look a bit weird.
 
  Identified packages ceph-mds and gceph and added these lines:
  dh_strip -pceph-mds --dbg-package=ceph-mds-dbg
  dh_strip -pgceph --dbg-package=gceph-dbg

 I added stripping for ceph-mds, but gceph has been removed... I'm
 curious what you're looking at that has both ceph-mds and gceph?

My fault; I found that out later when building again. I still had an old gceph-dbg 
package in my repository and have deleted it now.
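
For context, the dh_strip calls quoted above would live in an override_dh_strip target in debian/rules. A minimal sketch; only the ceph-mds line comes from this thread, the other package name is illustrative:

```make
# Sketch of a debian/rules override: strip each binary package and put
# its debug symbols into the matching -dbg package. Recipe lines must
# be indented with a tab. Package names other than ceph-mds are
# illustrative, not from the thread.
override_dh_strip:
	dh_strip -pceph-mds --dbg-package=ceph-mds-dbg
	dh_strip -pceph-osd --dbg-package=ceph-osd-dbg
```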

Amon Ott
-- 
Dr. Amon Ott
m-privacy GmbH   Tel: +49 30 24342334
Am Köllnischen Park 1Fax: +49 30 24342336
10179 Berlin http://www.m-privacy.de

Amtsgericht Charlottenburg, HRB 84946

Managing Directors:
 Dipl.-Kfm. Holger Maczkowsky,
 Roman Maczkowsky

GnuPG-Key-ID: 0x2DD3A649
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: domino-style OSD crash

2012-07-04 Thread Yann Dupont

On 03/07/2012 23:38, Tommi Virtanen wrote:

On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont yann.dup...@univ-nantes.fr wrote:

In the case I could repair, do you think a crashed FS as it is right now is
valuable to you, for future reference, since I saw you can't reproduce the
problem? I can make an archive (or a btrfs dump?), but it will be quite
big.

At this point, it's more about the upstream developers (of btrfs etc)
than us; we're on good terms with them but not experts on the on-disk
format(s). You might want to send an email to the relevant mailing
lists before wiping the disks.


Well, I probably wasn't clear enough. I said crashed FS, but I 
was talking about ceph. The underlying FS (btrfs in this case) of 1 node 
(and only one) has PROBABLY crashed in the past, causing corruption in 
the ceph data on this node, and then the subsequent crash of other nodes.


RIGHT now btrfs on this node is OK. I can access the filesystem without 
errors.


For the moment, of 8 nodes, 4 refuse to restart.
1 of the 4 nodes was the crashed node; the 3 others didn't have problems 
with the underlying fs as far as I can tell.


So I think the scenario is:

One node had a problem with btrfs, leading first to kernel problems, 
probably corruption (on disk / in memory, maybe?), and ultimately to a 
kernel oops. Before that final oops, bad data was 
transmitted to other (sane) nodes, leading to ceph-osd crashes on those 
nodes.


If you think this scenario is highly improbable in real life (that is, 
btrfs will probably be fixed for good, and then corruption can't 
happen), that's OK.


But I wonder if this scenario can be triggered by other problems, with 
bad data transmitted to other sane nodes (a power outage, an out-of-memory 
condition, a full disk, for example).


That's why I offered you a crashed ceph volume image (I shouldn't have 
said a crashed fs, sorry for the confusion).


Talking about btrfs, there are a lot of fixes between 3.4 and 
3.5rc. After the crash, I couldn't mount the btrfs volume. With 3.5rc I 
can, and there is no sign of problems on it. That doesn't mean the data 
there is safe, but I think it's a sign that at least some bugs have been 
corrected in the btrfs code.


Cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr



Re: RBD support for primary storage in Apache CloudStack

2012-07-04 Thread Wido den Hollander

Hi,

On 03-07-12 20:22, Ross Turk wrote:


Hey Wido! This is really cool.

I think it'd be useful to have a guide that people can follow to stand up 
CloudStack with Ceph.  Even though it's still in active development, I'd like to 
encourage people to try it out.  Would you be willing to work with the Inktank team to 
create something like that?  I think we can do most of the writing, but we'll need help 
if we get stuck.



Yes, would be great! I think most users will start to use it via the 
WebGUI, so we should document that first.


A quick write-up of the steps:

1. Get a Ceph cluster running
2. Have one or multiple hosts with the proper Qemu and libvirt (>= 
0.9.13) code

3. Set up CloudStack from the RBD branch (compile by hand)
4. Set up your zones with NFS primary storage
5. Add the RBD primary storage and add the tag 'rbd'
6. Create a disk offering with the storage tag 'rbd'

NFS primary storage is still needed for the System VMs inside CloudStack.

There is, however, still one libvirt patch outstanding for when people are 
not using cephx: 
https://www.redhat.com/archives/libvir-list/2012-June/msg01119.html
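
For reference, the RBD primary storage added in step 5 corresponds to a libvirt RBD storage pool defined on each KVM host. A minimal sketch of such a pool definition; the pool name, monitor address, and secret UUID below are placeholders, not values from this thread:

```xml
<!-- Hypothetical libvirt RBD storage pool definition; the pool name,
     monitor host, and secret UUID are placeholders. The <auth> element
     is only needed when cephx authentication is enabled. -->
<pool type="rbd">
  <name>cloudstack-rbd</name>
  <source>
    <name>rbd</name>
    <host name="mon1.example.com" port="6789"/>
    <auth username="admin" type="ceph">
      <secret uuid="00000000-0000-0000-0000-000000000000"/>
    </auth>
  </source>
</pool>
```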


I'm also hunting a bug under Ubuntu 12.04 where stored 'secrets' in 
libvirt get corrupted. The root cause has been found, but it's an external 
library which is causing the problems.


The thread: 
https://www.redhat.com/archives/libvir-list/2012-July/msg00135.html


Wido


Cheers,
Ross



On Friday, June 29, 2012 at 9:01 AM, Wido den Hollander wrote:

Hi,

I'm cross-posting this to the ceph-devel list since there might be
people around here running CloudStack and are interested in this.

After a couple of months worth of work I'm happy to announce that the
RBD support for primary storage in CloudStack seems to be reaching a
point where it's good enough to be reviewed.

If you are planning to test RBD, please do read this e-mail carefully
since there are still some catches.

Although the code inside CloudStack doesn't seem like a lot of code, I
had to modify code outside CloudStack to get RBD support working:

1. RBD storage pool support in libvirt. [0] [1]
2. Fix a couple of bugs in the libvirt-java bindings. [2]

With those issues addressed I could implement RBD inside CloudStack.

While doing so I ran into multiple issues inside CloudStack which
delayed everything a bit.

Now, the RBD support for primary storage has some known limitations:

- It only works with KVM

- You are NOT able to snapshot RBD volumes. This is because CloudStack
wants to back up snapshots to the secondary storage and uses 'qemu-img
convert' for this. That doesn't work with RBD, and it's also very
inefficient.

RBD supports native snapshots inside the Ceph cluster. RBD disks also
have the potential to reach very large sizes; disks of 1TB won't be the
exception, and copying them to secondary storage would stress your
network heavily. I'm thinking about implementing internal snapshots, but
that is step #2. For now, no snapshots.

- You are able to create a template from an RBD volume, but creating a new
instance with RBD storage from a template is still hit-and-miss.
Working on that one.

Other than these limitations, everything works. You can create instances
and attach RBD disks. It also supports cephx authorization, so no
problem there!

What do you need to run this patch?
- A Ceph cluster
- libvirt with RBD storage pool support (>= 0.9.12)
- Modified libvirt-java bindings (jar is in the patch)
- Qemu with RBD support (>= 0.14)
- An extra field user_info in the storage pool table; see the SQL
change in the patch

You can fetch the code on my Github account [3].

Warning: I'll be rebasing against the master branch regularly, so be
aware that git pull may not always work nicely.

I'd like to see this code reviewed while I'm working on the latest stuff
and getting all the patches upstream in other projects (mainly the
libvirt Java bindings).

Any suggestions or comments?

Thank you!

Wido


[0]:
http://libvirt.org/git/?p=libvirt.git;a=commit;h=74951eadef85e2d100c7dc7bd9ae1093fbda722f
[1]:
http://libvirt.org/git/?p=libvirt.git;a=commit;h=122fa379de44a2fd0a6d5fbcb634535d647ada17
[2]: https://github.com/wido/libvirt-java/commits/cloudstack
[3]: https://github.com/wido/CloudStack/commits/rbd



--
Ross Turk
VP of Community, Inktank
@rossturk @inktank @ceph


Any sufficiently advanced technology is indistinguishable from magic.
-- Arthur C. Clarke







monitor not starting

2012-07-04 Thread Smart Weblications GmbH - Florian Wiessner
Hi List,


I upgraded today from 0.43 to 0.48, and now I have one monitor which does not
want to start up anymore:

 ceph version 0.48argonaut-125-g4e774fb
(commit:4e774fbcb38fd6883232b72352512a5f8e4a66e8)
 1: /usr/bin/ceph-mon() [0x52f9c9]
 2: (()+0xeff0) [0x7fb08dd11ff0]
 3: (gsignal()+0x35) [0x7fb08c4f41b5]
 4: (abort()+0x180) [0x7fb08c4f6fc0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fb08cd88dc5]
 6: (()+0xcb166) [0x7fb08cd87166]
 7: (()+0xcb193) [0x7fb08cd87193]
 8: (()+0xcb28e) [0x7fb08cd8728e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x940)
[0x55b310]
 10: /usr/bin/ceph-mon() [0x497317]
 11: (Monitor::init()+0xc5a) [0x4857fa]
 12: (main()+0x2789) [0x46ac79]
 13: (__libc_start_main()+0xfd) [0x7fb08c4e0c8d]
 14: /usr/bin/ceph-mon() [0x468309]
 NOTE: a copy of the executable, or `objdump -rdS executable` is needed to
interpret this.

--- end dump of recent events ---


How can I find out why it does not start up anymore? The osd and mds are running 
fine...
-- 

Kind regards,

Florian Wiessner

Smart Weblications GmbH
Martinsberger Str. 1
D-95119 Naila

fon.: +49 9282 9638 200
fax.: +49 9282 9638 205
24/7: +49 900 144 000 00 - 0,99 EUR/Min*
http://www.smart-weblications.de

--
Registered office: Naila
Managing Director: Florian Wiessner
HRB-Nr.: HRB 3840 Amtsgericht Hof
*from German landlines; prices from mobile networks may differ



[PATCH] librados: Bump the version to 0.48

2012-07-04 Thread Wido den Hollander

Signed-off-by: Wido den Hollander w...@widodh.nl
---
 src/include/rados/librados.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/include/rados/librados.h b/src/include/rados/librados.h
index 9f723f7..4870b0f 100644
--- a/src/include/rados/librados.h
+++ b/src/include/rados/librados.h
@@ -23,7 +23,7 @@ extern "C" {
 #endif
 
 #define LIBRADOS_VER_MAJOR 0
-#define LIBRADOS_VER_MINOR 44
+#define LIBRADOS_VER_MINOR 48
 #define LIBRADOS_VER_EXTRA 0
 
 #define LIBRADOS_VERSION(maj, min, extra) ((maj << 16) + (min << 8) + extra)
-- 
1.7.9.5



Re: [PATCH] Generate URL-safe base64 strings for keys.

2012-07-04 Thread Sage Weil
On Wed, 4 Jul 2012, Wido den Hollander wrote:
 By using this we prevent scenarios where cephx keys are not accepted
 in various situations.
 
 Replacing the + and / by - and _ we generate URL-safe base64 keys
 
 Signed-off-by: Wido den Hollander w...@widodh.nl

Do we already properly decode URL-safe base64 encoding?

sage

 ---
  src/common/armor.c |6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)
 
 diff --git a/src/common/armor.c b/src/common/armor.c
 index d1d5664..7f73da1 100644
 --- a/src/common/armor.c
 +++ b/src/common/armor.c
 @@ -9,7 +9,7 @@
   * base64 encode/decode.
   */
  
 -const char *pem_key = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
 +const char *pem_key = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";
  
  static int encode_bits(int c)
  {
 @@ -24,9 +24,9 @@ static int decode_bits(char c)
   return c - 'a' + 26;
  if (c >= '0' && c <= '9')
   return c - '0' + 52;
 - if (c == '+')
 + if (c == '+' || c == '-')
   return 62;
 - if (c == '/')
 + if (c == '/' || c == '_')
   return 63;
   if (c == '=')
   return 0; /* just non-negative, please */
 -- 
 1.7.9.5
 


Re: [PATCH] Generate URL-safe base64 strings for keys.

2012-07-04 Thread Wido den Hollander

- Original message -
 On Wed, 4 Jul 2012, Wido den Hollander wrote:
  By using this we prevent scenarios where cephx keys are not accepted
  in various situations.
  
  Replacing the + and / by - and _ we generate URL-safe base64 keys
  
  Signed-off-by: Wido den Hollander w...@widodh.nl
 
 Do we already properly decode URL-safe base64 encoding?
 

Yes, it decodes URL-safe base64 as well.

See the if statements for 62 and 63: + and - are treated equally, just like / 
and _.

Wido


 sage
 
  ---
  src/common/armor.c |       6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)
  
  diff --git a/src/common/armor.c b/src/common/armor.c
  index d1d5664..7f73da1 100644
  --- a/src/common/armor.c
  +++ b/src/common/armor.c
  @@ -9,7 +9,7 @@
  * base64 encode/decode.
  */
  
  -const char *pem_key = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  +const char *pem_key = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";
  
  static int encode_bits(int c)
  {
  @@ -24,9 +24,9 @@ static int decode_bits(char c)
          return c - 'a' + 26;
   if (c >= '0' && c <= '9')
       return c - '0' + 52;
  -    if (c == '+')
  +    if (c == '+' || c == '-')
          return 62;
  -    if (c == '/')
  +    if (c == '/' || c == '_')
          return 63;
      if (c == '=')
          return 0; /* just non-negative, please */
  -- 
  1.7.9.5
  
  --
  To unsubscribe from this list: send the line unsubscribe ceph-devel
  in the body of a message to majord...@vger.kernel.org
  More majordomo info at   http://vger.kernel.org/majordomo-info.html
  
  
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at   http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Generate URL-safe base64 strings for keys.

2012-07-04 Thread Sage Weil
On Wed, 4 Jul 2012, Wido den Hollander wrote:
  On Wed, 4 Jul 2012, Wido den Hollander wrote:
   By using this we prevent scenarios where cephx keys are not accepted
   in various situations.
   
   Replacing the + and / by - and _ we generate URL-safe base64 keys
   
   Signed-off-by: Wido den Hollander w...@widodh.nl
  
  Do we already properly decode URL-safe base64 encoding?
  
 
 Yes, it decodes URL-safe base64 as well.
 
 See the if statements for 62 and 63, + and - are treated equally, just 
 like / and _.

Oh, got it.  The commit description confused me... I thought this was 
related to encoding only.

I think we should break the encode and decode changes into separate 
patches, and apply the decode to a stable branch (argonaut) and the 
encode to master.  That should avoid most problems with a 
rolling/staggered upgrade...

sage


 
 Wido
 
 
  sage
  
   ---
   src/common/armor.c |       6 +++---
   1 file changed, 3 insertions(+), 3 deletions(-)
   
   diff --git a/src/common/armor.c b/src/common/armor.c
   index d1d5664..7f73da1 100644
   --- a/src/common/armor.c
   +++ b/src/common/armor.c
   @@ -9,7 +9,7 @@
   * base64 encode/decode.
   */
   
   -const char *pem_key = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
   +const char *pem_key = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";
   
   static int encode_bits(int c)
   {
   @@ -24,9 +24,9 @@ static int decode_bits(char c)
           return c - 'a' + 26;
    if (c >= '0' && c <= '9')
        return c - '0' + 52;
   -    if (c == '+')
   +    if (c == '+' || c == '-')
           return 62;
   -    if (c == '/')
   +    if (c == '/' || c == '_')
           return 63;
       if (c == '=')
           return 0; /* just non-negative, please */
   -- 
   1.7.9.5
   
 
 

Re: domino-style OSD crash

2012-07-04 Thread Gregory Farnum
On Wednesday, July 4, 2012 at 1:06 AM, Yann Dupont wrote:
 On 03/07/2012 23:38, Tommi Virtanen wrote:
  On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
   In the case I could repair, do you think a crashed FS as it is right now 
   is
   valuable for you, for future reference , as I saw you can't reproduce the
   problem ? I can make an archive (or a btrfs dump ?), but it will be quite
   big.
   
   
  At this point, it's more about the upstream developers (of btrfs etc)
  than us; we're on good terms with them but not experts on the on-disk
  format(s). You might want to send an email to the relevant mailing
  lists before wiping the disks.
  
  
 Well, I probably wasn't clear enough. I talked about crashed FS, but i  
 was talking about ceph. The underlying FS (btrfs in that case) of 1 node  
 (and only one) has PROBABLY crashed in the past, causing corruption in  
 ceph data on this node, and then the subsequent crash of other nodes.
  
 RIGHT now btrfs on this node is OK. I can access the filesystem without  
 errors.
  
 For the moment, on 8 nodes, 4 refuse to restart .
 1 of the 4 nodes was the crashed node , the 3 others didn't had broblem  
 with the underlying fs as far as I can tell.
  
 So I think the scenario is :
  
 One node had problem with btrfs, leading first to kernel problem ,  
 probably corruption (in disk/ in memory maybe ?) ,and ultimately to a  
 kernel oops. Before that ultimate kernel oops, bad data has been  
 transmitted to other (sane) nodes, leading to ceph-osd crash on thoses  
 nodes.

I don't think that's actually possible — the OSDs all do quite a lot of 
interpretation between what they get off the wire and what goes on disk. What 
you've got here are 4 corrupted LevelDB databases, and we pretty much can't do 
that through the interfaces we have. :/
  
  
 If you think this scenario is highly improbable in real life (that is,  
 btrfs will probably be fixed for good, and then, corruption can't  
 happen), it's ok.
  
 But I wonder if this scenario can be triggered with other problem, and  
 bad data can be transmitted to other sane nodes (power outage, out of  
 memory condition, disk full... for example)
  
 That's why I proposed you a crashed ceph volume image (I shouldn't have  
 talked about a crashed fs, sorry for the confusion)

I appreciate the offer, but I don't think this will help much — it's a disk 
state managed by somebody else, not our logical state, which has broken. If we 
could figure out how that state got broken that'd be good, but a ceph image 
won't really help in doing so.

I wonder if maybe there's a confounding factor here — are all your nodes 
similar to each other, or are they running on different kinds of hardware? How 
did you do your Ceph upgrades? What's ceph -s display when the cluster is 
running as best it can?
-Greg



Re: monitor not starting

2012-07-04 Thread Gregory Farnum


On Wednesday, July 4, 2012 at 4:45 AM, Smart Weblications GmbH - Florian 
Wiessner wrote:

 Hi List,
 
 
 i today upgraded from 0.43 to 0.48 and now i have one monitor which does not
 want to start up anymore:
 
 ceph version 0.48argonaut-125-g4e774fb
 (commit:4e774fbcb38fd6883232b72352512a5f8e4a66e8)
 1: /usr/bin/ceph-mon() [0x52f9c9]
 2: (()+0xeff0) [0x7fb08dd11ff0]
 3: (gsignal()+0x35) [0x7fb08c4f41b5]
 4: (abort()+0x180) [0x7fb08c4f6fc0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fb08cd88dc5]
 6: (()+0xcb166) [0x7fb08cd87166]
 7: (()+0xcb193) [0x7fb08cd87193]
 8: (()+0xcb28e) [0x7fb08cd8728e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
 const*)+0x940)
 [0x55b310]
 10: /usr/bin/ceph-mon() [0x497317]
 11: (Monitor::init()+0xc5a) [0x4857fa]
 12: (main()+0x2789) [0x46ac79]
 13: (__libc_start_main()+0xfd) [0x7fb08c4e0c8d]
 14: /usr/bin/ceph-mon() [0x468309]
 NOTE: a copy of the executable, or `objdump -rdS executable` is needed to
 interpret this.
 
 --- end dump of recent events ---
 
 
 How can i find out why it does not startup anymore? osd and mds is running 
 fine..
Is that all the output you get? There should be a line somewhere which says 
what the assert is, and what line number it's on. :)

And while you're at it, is the rest of the cluster in fact working? I don't 
think 0.43 to 0.48 is an upgrade path we tested.

-Greg


Re: [PATCH] librados: Bump the version to 0.48

2012-07-04 Thread Gregory Farnum
Hmmm — we generally try to modify these versions when the API changes, not on 
every sprint. It looks to me like Sage added one function in 0.45 where we 
maybe should have bumped it, but that was a long time ago and at this point we 
should maybe just eat it?
-Greg


On Wednesday, July 4, 2012 at 6:46 AM, Wido den Hollander wrote:

  
 Signed-off-by: Wido den Hollander w...@widodh.nl
 ---
 src/include/rados/librados.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
  
 diff --git a/src/include/rados/librados.h b/src/include/rados/librados.h
 index 9f723f7..4870b0f 100644
 --- a/src/include/rados/librados.h
 +++ b/src/include/rados/librados.h
 @@ -23,7 +23,7 @@ extern "C" {
 #endif
  
 #define LIBRADOS_VER_MAJOR 0
 -#define LIBRADOS_VER_MINOR 44
 +#define LIBRADOS_VER_MINOR 48
 #define LIBRADOS_VER_EXTRA 0
  
 #define LIBRADOS_VERSION(maj, min, extra) ((maj << 16) + (min << 8) + extra)
 --  
 1.7.9.5
  





Re: [PATCH] librados: Bump the version to 0.48

2012-07-04 Thread Sage Weil
On Wed, 4 Jul 2012, Gregory Farnum wrote:
 Hmmm — we generally try to modify these versions when the API changes, 
 not on every sprint. It looks to me like Sage added one function in 0.45 
 where we maybe should have bumped it, but that was a long time ago and 
 at this point we should maybe just eat it?

Yeah, I went ahead and applied this to stable (argonaut) since it's as 
good a reference point as any.  Moving forward, we should try to sync 
this up with API changes as they happen.  Hmm, like that assert 
ObjectOperation that just went into master... 

sage

Re: domino-style OSD crash

2012-07-04 Thread Yann Dupont

On 04/07/2012 18:21, Gregory Farnum wrote:

On Wednesday, July 4, 2012 at 1:06 AM, Yann Dupont wrote:

On 03/07/2012 23:38, Tommi Virtanen wrote:

On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont yann.dup...@univ-nantes.fr wrote:

In the case I could repair, do you think a crashed FS as it is right now is
valuable for you, for future reference , as I saw you can't reproduce the
problem ? I can make an archive (or a btrfs dump ?), but it will be quite
big.
  
  
At this point, it's more about the upstream developers (of btrfs etc)

than us; we're on good terms with them but not experts on the on-disk
format(s). You might want to send an email to the relevant mailing
lists before wiping the disks.
  
  
Well, I probably wasn't clear enough. I talked about crashed FS, but i

was talking about ceph. The underlying FS (btrfs in that case) of 1 node
(and only one) has PROBABLY crashed in the past, causing corruption in
ceph data on this node, and then the subsequent crash of other nodes.
  
RIGHT now btrfs on this node is OK. I can access the filesystem without

errors.
  
For the moment, on 8 nodes, 4 refuse to restart .

1 of the 4 nodes was the crashed node , the 3 others didn't had broblem
with the underlying fs as far as I can tell.
  
So I think the scenario is :
  
One node had problem with btrfs, leading first to kernel problem ,

probably corruption (in disk/ in memory maybe ?) ,and ultimately to a
kernel oops. Before that ultimate kernel oops, bad data has been
transmitted to other (sane) nodes, leading to ceph-osd crash on thoses
nodes.

I don't think that's actually possible — the OSDs all do quite a lot of 
interpretation between what they get off the wire and what goes on disk. What 
you've got here are 4 corrupted LevelDB databases, and we pretty much can't do 
that through the interfaces we have. :/


ok, so as all nodes were identical, I probably hit a btrfs bug 
(like an erroneous out of space) at more or less the same time. And when 
1 osd was out,
   
  
If you think this scenario is highly improbable in real life (that is,

btrfs will probably be fixed for good, and then, corruption can't
happen), it's ok.
  
But I wonder if this scenario can be triggered with other problem, and

bad data can be transmitted to other sane nodes (power outage, out of
memory condition, disk full... for example)
  
That's why I proposed you a crashed ceph volume image (I shouldn't have

talked about a crashed fs, sorry for the confusion)

I appreciate the offer, but I don't think this will help much — it's a disk state managed 
by somebody else, not our logical state, which has broken. If we could figure out how 
that state got broken that'd be good, but a ceph image won't really help in 
doing so.

ok, no problem. I'll restart from scratch, freshly formatted.


I wonder if maybe there's a confounding factor here — are all your nodes 
similar to each other,


Yes. I designed the cluster that way. All nodes are identical hardware 
(PowerEdge M610, 10G Intel ethernet + Emulex fibre channel attached to 
storage; 1 array for 2 OSD nodes, 1 controller dedicated to each OSD).



  or are they running on different kinds of hardware? How did you do your Ceph 
upgrades? What's ceph -s display when the cluster is running as best it can?


Ceph was running 0.47.2 at that time (the Debian package for ceph). After 
the crash I couldn't restart all the nodes. I tried 0.47.3 and now 0.48, 
without success.


Nothing particular for the upgrades; because ceph is broken at the moment, 
it was just apt-get upgrade with the new version.



ceph -s shows this:

root@label5:~# ceph -s
   health HEALTH_WARN 260 pgs degraded; 793 pgs down; 785 pgs peering; 
32 pgs recovering; 96 pgs stale; 793 pgs stuck inactive; 96 pgs stuck 
stale; 1092 pgs stuck unclean; recovery 267286/2491140 degraded 
(10.729%); 1814/1245570 unfound (0.146%)
   monmap e1: 3 mons at 
{chichibu=172.20.14.130:6789/0,glenesk=172.20.14.131:6789/0,karuizawa=172.20.14.133:6789/0}, 
election epoch 12, quorum 0,1,2 chichibu,glenesk,karuizawa

   osdmap e2404: 8 osds: 3 up, 3 in
pgmap v173701: 1728 pgs: 604 active+clean, 8 down, 5 
active+recovering+remapped, 32 active+clean+replay, 11 
active+recovering+degraded, 25 active+remapped, 710 down+peering, 222 
active+degraded, 7 stale+active+recovering+degraded, 61 
stale+down+peering, 20 stale+active+degraded, 6 down+remapped+peering, 8 
stale+down+remapped+peering, 9 active+recovering; 4786 GB data, 7495 GB 
used, 7280 GB / 15360 GB avail; 267286/2491140 degraded (10.729%); 
1814/1245570 unfound (0.146%)

   mdsmap e172: 1/1/1 up {0=karuizawa=up:replay}, 2 up:standby



BTW, after the 0.48 upgrade, there was a disk format conversion. 1 of 
the 4 surviving OSDs didn't 

Re: monitor not starting

2012-07-04 Thread Gregory Farnum
On Wednesday, July 4, 2012 at 10:02 AM, Smart Weblications GmbH - Florian 
Wiessner wrote:
 On 04.07.2012 18:25, Gregory Farnum wrote:
   
   
  On Wednesday, July 4, 2012 at 4:45 AM, Smart Weblications GmbH - Florian 
  Wiessner wrote:
   
   Hi List,


   i today upgraded from 0.43 to 0.48 and now i have one monitor which does 
   not
   want to start up anymore:

   ceph version 0.48argonaut-125-g4e774fb
   (commit:4e774fbcb38fd6883232b72352512a5f8e4a66e8)
   1: /usr/bin/ceph-mon() [0x52f9c9]
   2: (()+0xeff0) [0x7fb08dd11ff0]
   3: (gsignal()+0x35) [0x7fb08c4f41b5]
   4: (abort()+0x180) [0x7fb08c4f6fc0]
   5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fb08cd88dc5]
   6: (()+0xcb166) [0x7fb08cd87166]
   7: (()+0xcb193) [0x7fb08cd87193]
   8: (()+0xcb28e) [0x7fb08cd8728e]
   9: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
   const*)+0x940)
   [0x55b310]
   10: /usr/bin/ceph-mon() [0x497317]
   11: (Monitor::init()+0xc5a) [0x4857fa]
   12: (main()+0x2789) [0x46ac79]
   13: (__libc_start_main()+0xfd) [0x7fb08c4e0c8d]
   14: /usr/bin/ceph-mon() [0x468309]
   NOTE: a copy of the executable, or `objdump -rdS executable` is needed 
   to
   interpret this.

   --- end dump of recent events ---


   How can i find out why it does not startup anymore? osd and mds is 
   running fine..
  Is that all the output you get? There should be a line somewhere which says 
  what the assert is, and what line number it's on. :)
  
  
  
  
 Is this what you are looking for:
 2012-07-04 11:20:24.448430 7f423d943780 1 mon.3@-1(probing) e1 init fsid
 4553d0f6-1b31-4ba5-9d97-edae55bcaab4
 2012-07-04 11:20:24.448994 7f423d943780 -1 mon/Paxos.cc: In function 'bool
 Paxos::is_consistent()' thread 7f423d943780 time 2012-07-04 11:20:24.448637
 mon/Paxos.cc: 1031: FAILED assert(consistent || (slurping == 1))

Yep, that line. This means the monitor's on-disk state is inconsistent, but I 
can think of a number of scenarios which could have caused this, depending on 
how you upgraded your cluster (older monitors didn't mark on-disk whenever they 
deliberately went inconsistent on a catchup, which I bet is what happened here).
  
 ceph version 0.48argonaut-125-g4e774fb
 (commit:4e774fbcb38fd6883232b72352512a5f8e4a66e8)
 1: /usr/bin/ceph-mon() [0x497317]
 2: (Monitor::init()+0xc5a) [0x4857fa]
 3: (main()+0x2789) [0x46ac79]
 4: (__libc_start_main()+0xfd) [0x7f423bcfbc8d]
 5: /usr/bin/ceph-mon() [0x468309]
 NOTE: a copy of the executable, or `objdump -rdS executable` is needed to
 interpret this.
  
 --- begin dump of recent events ---
 -3 2012-07-04 11:20:24.447613 7f423d943780 1 store(/data/ceph/mon) mount
 -2 2012-07-04 11:20:24.447722 7f423d943780 0 ceph version
 0.48argonaut-125-g4e774fb (commit:4e774fbcb38fd6883232b72352512a5f8e4a66e8),
 process ceph-mon, pid 7436
 -1 2012-07-04 11:20:24.448430 7f423d943780 1 mon.3@-1(probing) e1 init
 fsid 4553d0f6-1b31-4ba5-9d97-edae55bcaab4
 0 2012-07-04 11:20:24.448994 7f423d943780 -1 mon/Paxos.cc: In function
 'bool Paxos::is_consistent()' thread 7f423d943780 time 2012-07-04 11:20:24.448637
 mon/Paxos.cc: 1031: FAILED assert(consistent || (slurping == 1))
  
 ceph version 0.48argonaut-125-g4e774fb
 (commit:4e774fbcb38fd6883232b72352512a5f8e4a66e8)
 1: /usr/bin/ceph-mon() [0x497317]
 2: (Monitor::init()+0xc5a) [0x4857fa]
 3: (main()+0x2789) [0x46ac79]
 4: (__libc_start_main()+0xfd) [0x7f423bcfbc8d]
 5: /usr/bin/ceph-mon() [0x468309]
 NOTE: a copy of the executable, or `objdump -rdS executable` is needed to
 interpret this.
  
 --- end dump of recent events ---
 2012-07-04 11:20:24.449567 7f423d943780 -1 *** Caught signal (Aborted) **
 in thread 7f423d943780
  
  
   
  And while you're at it, is the rest of the cluster in fact working? I don't 
  think 0.43 to 0.48 is an upgrade path we tested.
  
 Anyway, i removed the mon and did a ceph-mon --mkfs with the 3 mons that were
 still working after the upgrade and got it up and running again.
  
 Yes, the cluster is still working after the upgrade. Also upgraded to linux
 3.4.4 - it feels like the ceph-fuse and kernel ceph clients are a little less
 robust than in 0.43...
  
 When I start copying from /ceph to another mount point, it seems that for the
 copy operation (or any operation, really) /ceph is unusable to other processes,
 which makes the client behave very sluggishly... :(

Well, it shouldn't have gotten less stable since we haven't made any big 
changes there…but you aren't the only one reporting that things seem to be a 
little bit slower. We're going to have to look at that once people are back in 
the office after Independence Day.
  
  
 i can send you the contents of the monitor directory where it did not work 
 after
 the upgrade if you want me to..

No, that won't be necessary. Thanks though!  

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: Ceph for email storage

2012-07-04 Thread Gregory Farnum
On Wednesday, July 4, 2012 at 11:29 AM, Mitsue Acosta Murakami wrote:
 Hello,
 
 We are examining Ceph for use as email storage. In our current system, 
 several client servers running different services (imap, smtp, etc.) access 
 an NFS storage server. The mailboxes are stored in Maildir format, with 
 many small files. We use Amazon AWS EC2 for both the clients and the storage 
 server. In this scenario, we have some questions about Ceph:
 
 1. Is Ceph recommended for heavy write/read of small files?
 
 2. Is there any problem in installing Ceph on Amazon instances?
 
 3. Does Ceph already support quota?
 
 4. What File System would you encourage us to use?
Are you interested in using RBD to back your mail servers, or in using the Ceph 
FS to provide shared storage? Ceph FS isn't considered production-ready at this 
time, but RBD should be, for appropriate use cases.

In general:
1) If you allow your caching layers to do their job, any Ceph system should 
handle small writes fine. Reads will require normal disk accesses.
2) There shouldn't be.
3) None of the Ceph systems support quotas right now, although CephFS makes 
usage reporting easy.
4) Assuming you mean for the OSDs, XFS seems to be your best bet right now, but 
we work to make Ceph perform as well as possible under btrfs and ext4 too.
-Greg




Re: Slow request warnings on 0.48

2012-07-04 Thread David Blundell
On 4 Jul 2012, at 19:59, Gregory Farnum wrote:

 That's odd — there isn't too much that went into the OSD between 0.47 and 
 0.48 that I can think of, and most of that only impacts OSDs when they go 
 through bootup. What does ceph -s display — are all the PGs healthy?  
 -Greg
 


Hi Greg,

The PGs all seem to be healthy:

root@store1:~# ceph -s
   health HEALTH_OK
   monmap e1: 3 mons at {0=10.0.1.40:6789/0,1=10.0.1.41:6789/0,2=10.0.1.42:6789/0}, election epoch 40, quorum 0,1,2 0,1,2
   osdmap e342: 7 osds: 7 up, 7 in
   pgmap v5403: 1344 pgs: 1344 active+clean; 4620 MB data, 9617 MB used, 1368 GB / 1377 GB avail
   mdsmap e50: 0/0/1 up



What does replica size mean?

2012-07-04 Thread Eric_YH_Chen
Hi, all:

Just want to make sure of one thing. 
If I set the replica size to 2, that means one piece of data with 2 copies, right?
Therefore, if I measure the performance of rbd at 100 MB/s, 
I can imagine the actual I/O throughput on the hard disks is over 100 MB/s * 3 = 
300 MB/s.  
Am I correct? 

Thanks!



Re: What does replica size mean?

2012-07-04 Thread Sage Weil
On Thu, 5 Jul 2012, eric_yh_c...@wiwynn.com wrote:
 Hi, all:
 
 Just want to make sure of one thing. 
 If I set the replica size to 2, that means one piece of data with 2 copies, right?
 Therefore, if I measure the performance of rbd at 100 MB/s, 
 I can imagine the actual I/O throughput on the hard disks is over 100 MB/s * 3 = 
 300 MB/s.  
 Am I correct? 

Right.

pool size = pg size = number of osds in each PG = number of replicas

So a pool with 'size 3' means 3x replication.

sage
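To make the arithmetic concrete: with N-way replication, every byte a client 
writes lands on N OSDs, so a back-of-the-envelope figure for aggregate disk 
bandwidth is just client bandwidth times the replica count. A trivial sketch 
(the journal_factor parameter is an assumption added here to model the 
journal-plus-data double write that a filestore OSD can do; it is not part of 
any Ceph API):

```python
def disk_write_bandwidth(client_mb_s, replica_count, journal_factor=1):
    """Aggregate MB/s hitting the disks for a given client write rate.

    journal_factor=2 would model each OSD writing the data twice
    (once to the journal, once to the data partition).
    """
    return client_mb_s * replica_count * journal_factor

# 100 MB/s of client writes against 3x replication:
print(disk_write_bandwidth(100, 3))  # 300
```

So measured client throughput understates the load on the spindles, and any 
journal double-write amplifies it further.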

Re: Osd placement rule questions

2012-07-04 Thread Sage Weil
On Thu, 5 Jul 2012, Mark Kirkwood wrote:
 Hi,
 
 I am experimenting with ceph (rbd only for now), and have a few questions
 about what is possible via placement rules.
 
 For example I am looking at a setup with a local datacenter (datacenter0) and
 a remote one (datacenter1). I'm using a placement rule:
 
 rule rbd {
 ruleset 2
 type replicated
 min_size 1
 max_size 10
 step take datacenter0
 step chooseleaf firstn -1 type host
 step emit
 step take datacenter1
 step chooseleaf firstn 1 type host
 step emit
 }
 
 and I have the rbd pool set to size 3.
 
 So I *think* I am saying I want 2 replicas in datacenter0 and one in
 datacenter1 [1].

That's right!
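For anyone sanity-checking such rules, the replica counts follow from the CRUSH 
"firstn" convention: a positive {num} selects that many items, 0 selects 
pool-size items, and a negative {num} selects pool-size plus {num}. A minimal 
sketch of that arithmetic (plain Python, not the actual CRUSH implementation):

```python
def firstn_count(step_num, pool_size):
    """Items a CRUSH 'chooseleaf firstn {num}' step selects.

    Per the CRUSH rule convention: num > 0 selects num items,
    num == 0 selects pool_size items, num < 0 selects pool_size + num.
    """
    if step_num > 0:
        return min(step_num, pool_size)
    return pool_size + step_num

# The rule above, with the rbd pool at size 3:
in_dc0 = firstn_count(-1, 3)  # step chooseleaf firstn -1 -> 2 replicas
in_dc1 = firstn_count(1, 3)   # step chooseleaf firstn  1 -> 1 replica
print(in_dc0, in_dc1)  # 2 1
```

With size 3 the two emit blocks therefore account for all three replicas: two 
in datacenter0 and one in datacenter1.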

 The questions I have are:
 
 1/ I would like to be able to have a way to say something like: Make 2 copies
 at datacenter0, 1 at datacenter1 - wait for the ones at datacenter0 to be
 written but not the ones at datacenter1 (so asynchronous for the latter). Is
 this possible, or planned?

It is not possible yet, but planned for the future.

 2/ Also I would like to be able to say make my number of copies 3, but if I
 lose datacenter0 (where 2 copies are), don't try to have 3 copies at
 datacenter1 (so run degraded in that case). Is that possible?

That is what you get now.  Doing the opposite (2 copies in DC1, 1 in DC2, 
but if DC2 is down 3 in DC1) is not currently possible with the crush 
rules.

sage


Re: Osd placement rule questions

2012-07-04 Thread Mark Kirkwood

On 05/07/12 15:57, Sage Weil wrote:

On Thu, 5 Jul 2012, Mark Kirkwood wrote:


2/ Also I would like to be able to say make my number of copies 3, but if I
lose datacenter0 (where 2 copies are), don't try to have 3 copies at
datacenter1 (so run degraded in that case). Is that possible?

That is what you get now.  Doing the opposite (2 copies in DC1, 1 in DC2,
but if DC2 is down 3 in DC1) is not currently possible with the crush
rules.



Ah, right - excellent and thanks for clarifying! I guess I was 
unconsciously (and incorrectly) thinking that the crush rule would be 
modified when (say) datacenter0 was not available.


Cheers

Mark