Strange behavior after upgrading to 0.48

2012-07-05 Thread Xiaopong Tran

Hi,

I put up a small cluster with 3 osds, 2 mds, 3 mons, on 3 machines.
They were running 0.47.2, and this is a test to do rolling upgrade to
0.48.

I shut down, upgraded the software, then restarted, one node at a time.
The first two seemed to be ok. The third one gave me something weird.
While it was doing the conversion and recovering, the command ceph -s
gave output like this:



root@china:/tmp# ceph -s
2012-07-05 14:28:41.069470 7fa3c8443780  2 auth: KeyRing::load: loaded 
key file /etc/ceph/client.admin.keyring

2012-07-05 14:28:41.594229 7fa3c030e700  0 monclient: hunting for new mon
2012-07-05 14:28:41.596313 7fa3c030e700  0 monclient: hunting for new mon
2012-07-05 14:28:41.598949 7fa3c030e700  0 monclient: hunting for new mon
2012-07-05 14:28:41.601158 7fa3c030e700  0 monclient: hunting for new mon
2012-07-05 14:28:41.603069 7fa3c030e700  0 monclient: hunting for new mon
2012-07-05 14:28:41.605020 7fa3c030e700  0 monclient: hunting for new mon
2012-07-05 14:28:41.607436 7fa3c030e700  0 monclient: hunting for new mon
2012-07-05 14:28:41.609304 7fa3c030e700  0 monclient: hunting for new mon
2012-07-05 14:28:41.611047 7fa3c030e700  0 monclient: hunting for new mon
2012-07-05 14:28:41.667980 7fa3c030e700  0 monclient: hunting for new mon
2012-07-05 14:28:41.670283 7fa3c030e700  0 monclient: hunting for new mon
2012-07-05 14:28:41.672274 7fa3c030e700  0 monclient: hunting for new mon


And it never stopped. I was thinking that maybe it just behaved like
that during recovery. But after the recovery was done, I still
got the same thing:

root@china:/tmp# ceph health
2012-07-05 14:28:55.077364 7f8306a0d780  2 auth: KeyRing::load: loaded 
key file /etc/ceph/client.admin.keyring

HEALTH_OK
root@china:/tmp# ceph -s
2012-07-05 14:30:49.688017 7feb6338e780  2 auth: KeyRing::load: loaded 
key file /etc/ceph/client.admin.keyring

2012-07-05 14:30:49.691690 7feb5b259700  0 monclient: hunting for new mon
2012-07-05 14:30:49.694295 7feb5b259700  0 monclient: hunting for new mon
2012-07-05 14:30:49.696487 7feb5b259700  0 monclient: hunting for new mon
2012-07-05 14:30:49.698953 7feb5b259700  0 monclient: hunting for new mon
2012-07-05 14:30:49.700833 7feb5b259700  0 monclient: hunting for new mon


Upgrading the first two nodes caused no such problem. Those first two
nodes all run osd, mds, and mon. The third runs only osd and mon.

The mon log on the 3rd node shows this; I'm not sure if it is helpful:


925291 lease_expire=2012-07-05 02:38:14.149966 has v44 lc 44
2012-07-05 02:38:12.572107 7f7d9381a700  1 mon.a@0(leader).paxos(pgmap 
active c 29531..30031) is_readable now=2012-07-05 02:38:12.572114 
lease_expire=2012-07-05 02:38:15.889056 has v0 lc 30031
2012-07-05 02:38:12.572128 7f7d9381a700  1 mon.a@0(leader).paxos(pgmap 
active c 29531..30031) is_readable now=2012-07-05 02:38:12.572129 
lease_expire=2012-07-05 02:38:15.889056 has v0 lc 30031
2012-07-05 02:38:15.120439 7f7d9401b700  1 mon.a@0(leader).paxos(mdsmap 
active c 1..44) is_readable now=2012-07-05 02:38:15.120446 
lease_expire=2012-07-05 02:38:17.149967 has v44 lc 44
2012-07-05 02:38:15.925349 7f7d9401b700  1 mon.a@0(leader).paxos(mdsmap 
active c 1..44) is_readable now=2012-07-05 02:38:15.925356 
lease_expire=2012-07-05 02:38:20.149971 has v44 lc 44
2012-07-05 02:38:17.572181 7f7d9381a700  1 mon.a@0(leader).paxos(pgmap 
active c 29531..30031) is_readable now=2012-07-05 02:38:17.572189 
lease_expire=2012-07-05 02:38:21.889065 has v0 lc 30031
2012-07-05 02:38:17.572204 7f7d9381a700  1 mon.a@0(leader).paxos(pgmap 
active c 29531..30031) is_readable now=2012-07-05 02:38:17.572205 
lease_expire=2012-07-05 02:38:21.889065 has v0 lc 30031
2012-07-05 02:38:19.120463 7f7d9401b700  1 mon.a@0(leader).paxos(mdsmap 
active c 1..44) is_readable now=2012-07-05 02:38:19.120470 
lease_expire=2012-07-05 02:38:23.149973 has v44 lc 44
2012-07-05 02:38:19.925323 7f7d9401b700  1 mon.a@0(leader).paxos(mdsmap 
active c 1..44) is_readable now=2012-07-05 02:38:19.925330 
lease_expire=2012-07-05 02:38:23.149973 has v44 lc 44


Could someone give a hint on this?

Thanks

Xiaopong
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Strange behavior after upgrading to 0.48

2012-07-05 Thread Xiaopong Tran

When I run the command ceph -s, I see the following information in
the mon log:

2012-07-05 02:44:13.298942 7f7d92b14700  0 can't decode unknown message 
type 54 MSG_AUTH=17
2012-07-05 02:44:13.301588 7f7d9401b700  1 mon.a@0(leader).paxos(auth 
active c 412..432) is_readable now=2012-07-05 02:44:13.301590 
lease_expire=2012-07-05 02:44:17.566529 has v0 lc 432
2012-07-05 02:44:13.302113 7f7d9401b700  1 mon.a@0(leader).paxos(auth 
active c 412..432) is_readable now=2012-07-05 02:44:13.302114 
lease_expire=2012-07-05 02:44:17.566529 has v0 lc 432
2012-07-05 02:44:13.303072 7f7d92b14700  0 can't decode unknown message 
type 54 MSG_AUTH=17
2012-07-05 02:44:13.309450 7f7d9401b700  1 mon.a@0(leader).paxos(auth 
active c 412..432) is_readable now=2012-07-05 02:44:13.309452 
lease_expire=2012-07-05 02:44:17.566529 has v0 lc 432
2012-07-05 02:44:13.309845 7f7d9401b700  1 mon.a@0(leader).paxos(auth 
active c 412..432) is_readable now=2012-07-05 02:44:13.309847 
lease_expire=2012-07-05 02:44:17.566529 has v0 lc 432



I couldn't find any helpful information regarding the "can't decode"
error message, short of digging into the code.

Thanks for any hint.

Xiaopong


On 07/05/2012 02:41 PM, Xiaopong Tran wrote:

Hi,

I put up a small cluster with 3 osds, 2 mds, 3 mons, on 3 machines.
They were running 0.47.2, and this is a test to do rolling upgrade to
0.48.

I shutdown, upgraded the software, then restarted. One node at a time.
The first two seemed to be ok. The third one gave me some weird thing.
While it was doing the conversion and recovering, the command ceph -s
gives things like this:


root@china:/tmp# ceph -s
2012-07-05 14:28:41.069470 7fa3c8443780  2 auth: KeyRing::load: loaded
key file /etc/ceph/client.admin.keyring
2012-07-05 14:28:41.594229 7fa3c030e700  0 monclient: hunting for new mon
2012-07-05 14:28:41.596313 7fa3c030e700  0 monclient: hunting for new mon
2012-07-05 14:28:41.598949 7fa3c030e700  0 monclient: hunting for new mon
2012-07-05 14:28:41.601158 7fa3c030e700  0 monclient: hunting for new mon
2012-07-05 14:28:41.603069 7fa3c030e700  0 monclient: hunting for new mon
2012-07-05 14:28:41.605020 7fa3c030e700  0 monclient: hunting for new mon
2012-07-05 14:28:41.607436 7fa3c030e700  0 monclient: hunting for new mon
2012-07-05 14:28:41.609304 7fa3c030e700  0 monclient: hunting for new mon
2012-07-05 14:28:41.611047 7fa3c030e700  0 monclient: hunting for new mon
2012-07-05 14:28:41.667980 7fa3c030e700  0 monclient: hunting for new mon
2012-07-05 14:28:41.670283 7fa3c030e700  0 monclient: hunting for new mon
2012-07-05 14:28:41.672274 7fa3c030e700  0 monclient: hunting for new mon


And it never stopped. I was thinking, maybe it just behaved like
that during recovery. But after the recovery is done, it still
get the same thing:

root@china:/tmp# ceph health
2012-07-05 14:28:55.077364 7f8306a0d780  2 auth: KeyRing::load: loaded
key file /etc/ceph/client.admin.keyring
HEALTH_OK
root@china:/tmp# ceph -s
2012-07-05 14:30:49.688017 7feb6338e780  2 auth: KeyRing::load: loaded
key file /etc/ceph/client.admin.keyring
2012-07-05 14:30:49.691690 7feb5b259700  0 monclient: hunting for new mon
2012-07-05 14:30:49.694295 7feb5b259700  0 monclient: hunting for new mon
2012-07-05 14:30:49.696487 7feb5b259700  0 monclient: hunting for new mon
2012-07-05 14:30:49.698953 7feb5b259700  0 monclient: hunting for new mon
2012-07-05 14:30:49.700833 7feb5b259700  0 monclient: hunting for new mon


Upgrading the first two nodes have no such problem. This first two
nodes all run osd, mds, and mon. The third only runs osd and mon.

The mon log on the 3rd node shows this, not sure if this is helpful:


925291 lease_expire=2012-07-05 02:38:14.149966 has v44 lc 44
2012-07-05 02:38:12.572107 7f7d9381a700  1 mon.a@0(leader).paxos(pgmap
active c 29531..30031) is_readable now=2012-07-05 02:38:12.572114
lease_expire=2012-07-05 02:38:15.889056 has v0 lc 30031
2012-07-05 02:38:12.572128 7f7d9381a700  1 mon.a@0(leader).paxos(pgmap
active c 29531..30031) is_readable now=2012-07-05 02:38:12.572129
lease_expire=2012-07-05 02:38:15.889056 has v0 lc 30031
2012-07-05 02:38:15.120439 7f7d9401b700  1 mon.a@0(leader).paxos(mdsmap
active c 1..44) is_readable now=2012-07-05 02:38:15.120446
lease_expire=2012-07-05 02:38:17.149967 has v44 lc 44
2012-07-05 02:38:15.925349 7f7d9401b700  1 mon.a@0(leader).paxos(mdsmap
active c 1..44) is_readable now=2012-07-05 02:38:15.925356
lease_expire=2012-07-05 02:38:20.149971 has v44 lc 44
2012-07-05 02:38:17.572181 7f7d9381a700  1 mon.a@0(leader).paxos(pgmap
active c 29531..30031) is_readable now=2012-07-05 02:38:17.572189
lease_expire=2012-07-05 02:38:21.889065 has v0 lc 30031
2012-07-05 02:38:17.572204 7f7d9381a700  1 mon.a@0(leader).paxos(pgmap
active c 29531..30031) is_readable now=2012-07-05 02:38:17.572205
lease_expire=2012-07-05 02:38:21.889065 has v0 lc 30031
2012-07-05 02:38:19.120463 7f7d9401b700  1 mon.a@0(leader).paxos(mdsmap
active c 1..44) is_readable now=2012-07-05 02:38:19.120470

Re: [PATCH] librados: Bump the version to 0.48

2012-07-05 Thread Wido den Hollander



On 07/04/2012 06:33 PM, Sage Weil wrote:

On Wed, 4 Jul 2012, Gregory Farnum wrote:

Hmmm, we generally try to modify these versions when the API changes,
not on every sprint. It looks to me like Sage added one function in 0.45
where we maybe should have bumped it, but that was a long time ago and
at this point we should maybe just eat it?


Yeah, I went ahead and applied this to stable (argonaut) since it's as
good a reference point as any.  Moving forward, we should try to sync
this up with API changes as they happen.  Hmm, like that assert
ObjectOperation that just went into master...


That was my reasoning. I compiled phprados against 0.48 and saw that
librados was reporting 0.44 as its version.


That could confuse users and they might think they still have an old 
library in place.


Imho the version numbering should be decoupled from Ceph's releases if you
only want to bump the version on an API change.


Wido



sage



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Allow URL-safe base64 cephx keys to be decoded.

2012-07-05 Thread Wido den Hollander
In these cases + and / are replaced by - and _ to prevent problems when using
the base64 strings in URLs.

Signed-off-by: Wido den Hollander w...@widodh.nl
---
 src/common/armor.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/common/armor.c b/src/common/armor.c
index d1d5664..e4b8b86 100644
--- a/src/common/armor.c
+++ b/src/common/armor.c
@@ -24,9 +24,9 @@ static int decode_bits(char c)
return c - 'a' + 26;
if (c >= '0' && c <= '9')
return c - '0' + 52;
-   if (c == '+')
+   if (c == '+' || c == '-')
return 62;
-   if (c == '/')
+   if (c == '/' || c == '_')
return 63;
if (c == '=')
return 0; /* just non-negative, please */
-- 
1.7.9.5
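
As a quick sanity check of the decode side, a small standalone program
(not part of the patch; it just mirrors the lookup above) shows that the
standard and URL-safe characters now decode to the same values:

/* decode_demo.c -- standalone illustration of the patched decode_bits().
 * Not armor.c itself; it only mirrors the character lookup. */
#include <stdio.h>

static int decode_bits(char c)
{
	if (c >= 'A' && c <= 'Z')
		return c - 'A';
	if (c >= 'a' && c <= 'z')
		return c - 'a' + 26;
	if (c >= '0' && c <= '9')
		return c - '0' + 52;
	if (c == '+' || c == '-')	/* standard and URL-safe alphabet */
		return 62;
	if (c == '/' || c == '_')
		return 63;
	if (c == '=')
		return 0;		/* just non-negative, please */
	return -1;			/* not a base64 character */
}

int main(void)
{
	printf("+ -> %d, - -> %d\n", decode_bits('+'), decode_bits('-'));
	printf("/ -> %d, _ -> %d\n", decode_bits('/'), decode_bits('_'));
	return 0;
}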

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Generate URL-safe base64 strings for keys.

2012-07-05 Thread Wido den Hollander



On 04-07-12 18:18, Sage Weil wrote:

On Wed, 4 Jul 2012, Wido den Hollander wrote:

On Wed, 4 Jul 2012, Wido den Hollander wrote:

By using this we prevent scenarios where cephx keys are not accepted
in various situations.

Replacing the + and / by - and _ we generate URL-safe base64 keys

Signed-off-by: Wido den Hollander w...@widodh.nl


Do we already properly decode URL-safe base64 encoding?



Yes, it decodes URL-safe base64 as well.

See the if statements for 62 and 63, + and - are treated equally, just
like / and _.


Oh, got it.  The commit description confused me... I thought this was
related encoding only.

I think we should break the encode and decode patches into separate
versions, and apply the decode to a stable branch (argonaut) and the
encode to the master.  That should avoid most problems with a
rolling/staggered upgrade...


I just submitted a patch for decoding only.

During some tests I did I found out that libvirt uses GNUlib and won't 
handle URL-safe base64 encoded keys.


So, as long as Ceph allows them we're good. Users can always replace the 
+ and / in their key knowing it will be accepted by Ceph.


This works for me for now. The exact switch to base64url should be done 
at a later stage I think.


The RFC on this: http://tools.ietf.org/html/rfc4648#page-7

Wido



sage




Wido



sage


---
src/common/armor.c |   6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/common/armor.c b/src/common/armor.c
index d1d5664..7f73da1 100644
--- a/src/common/armor.c
+++ b/src/common/armor.c
@@ -9,7 +9,7 @@
* base64 encode/decode.
*/

-const char *pem_key = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
+const char *pem_key = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

static int encode_bits(int c)
{
@@ -24,9 +24,9 @@ static int decode_bits(char c)
 return c - 'a' + 26;
 if (c >= '0' && c <= '9')
 return c - '0' + 52;
-if (c == '+')
+if (c == '+' || c == '-')
 return 62;
-if (c == '/')
+if (c == '/' || c == '_')
 return 63;
 if (c == '=')
 return 0; /* just non-negative, please */
--
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Qemu fails to open RBD image when auth_supported is not set to 'none'

2012-07-05 Thread Wido den Hollander



On 02-07-12 21:21, Wido den Hollander wrote:



On 06/25/2012 05:45 PM, Wido den Hollander wrote:

On 06/25/2012 05:20 PM, Wido den Hollander wrote:

Hi,

I just tried to start a VM with libvirt with the following disk:

<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <source protocol='rbd' name='rbd/8489c04f-aab8-4796-a22a-ebaa7be247a7'>
    <host name='31.25.XX.XX' port='6789'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>

That fails with: Operation not supported

I tried qemu-img:

qemu-img info
rbd:rbd/8489c04f-aab8-4796-a22a-ebaa7be247a7:mon_host=31.25.XX.XX\\:6789

Same result.

I then tried:

qemu-img info
rbd:rbd/8489c04f-aab8-4796-a22a-ebaa7be247a7:mon_host=31.25.XX.XX\\:6789:auth_supported=none




And that worked :)

This host does not have a local ceph.conf, all the parameters have to
come from the command line.

I know that auth_supported recently started defaulting to cephx, but that
now breaks the libvirt integration, since libvirt doesn't explicitly set
auth_supported to none when no auth section is present.
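
For clients that build the connection themselves without any ceph.conf, the
option has to be supplied explicitly as well; a minimal sketch against the
librados C API (the monitor address below is just a placeholder):

/* connect_noauth.c -- sketch: connect to a cluster with no ceph.conf,
 * explicitly disabling cephx.  The address is a placeholder. */
#include <stdio.h>
#include <rados/librados.h>

int main(void)
{
	rados_t cluster;
	int ret;

	ret = rados_create(&cluster, NULL);	/* no client name */
	if (ret < 0) { fprintf(stderr, "rados_create: %d\n", ret); return 1; }

	/* no config file is read, so every option must be set by hand */
	rados_conf_set(cluster, "mon_host", "31.25.0.1:6789");
	rados_conf_set(cluster, "auth_supported", "none");

	ret = rados_connect(cluster);
	if (ret < 0) { fprintf(stderr, "rados_connect: %d\n", ret); return 1; }

	printf("connected without cephx\n");
	rados_shutdown(cluster);
	return 0;
}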

Should this be something that gets fixed in librados or in libvirt?


Thought about it, this is something in libvirt :)



If it's libvirt, I'll write a patch for it :)


Just did so, very simple patch:
https://www.redhat.com/archives/libvir-list/2012-June/msg01119.html


libvirt 0.9.13 just got out. The good news is that the RBD storage pool
is in this release, but the patch above did not make it in time.


The patch just made it into libvirt: 
http://libvirt.org/git/?p=libvirt.git;a=commit;h=ccb94785007d33365d49dd566e194eb0a022148d


You will need this libvirt patch if you are going to run RBD without
cephx enabled.


Wido



We'll have to wait for 0.9.14 to get that one in.



Wido



Wido
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rados mailbox? (was Re: Ceph for email storage)

2012-07-05 Thread Wido den Hollander

On 04-07-12 22:40, Sage Weil wrote:

Although Ceph fs would technically work for storing mail with maildir,
when you step back from the situation, Maildir + a distributed file system
is a pretty terrible way to approach mail storage.  Maildir was designed
to work around the limited consistency of NFS, and manages that, but
performs pretty horribly on almost any file system.  Mostly this is due to
the message-per-file approach and the fact that file systems' internal
management of inodes and directories means lots and lots of seeks, even to
read message headers.  Ceph's MDS will probably do better than most due to
its embedded inodes, but it's hardly ideal.

However, an idea that has been kicking around here is building a mail
storage system directly on top of RADOS.  In principle, it should be a
relatively straightforward matter of implementing a library and plugging
it into the storage backend for something like Dovecot, or any other mail
system (delivery agent and/or IMAP/POP frontend) with a pluggable backend.
(I think postfix has pluggable delivery agents, but that's about where my
experience in this area runs out.)


When you first told me about the idea a couple of months ago, I took a
look at the Dovecot code, and it's not that trivial to implement.


It seems that mbox and Maildir are pretty hardcoded in Dovecot, but 
there is an advantage:


You can use Dovecot as your LDA/VDA (Local/Virtual Delivery Agent) for 
Postfix, so you'd only have to implement this library in Dovecot and 
you'd be able to handle IMAP, POP3 and Delivery of e-mails to RADOS.


Source: http://wiki.dovecot.org/LDA/Postfix



The basic idea is this:

  - each mail message is a rados object, and immutable.
  - each mailbox is an index of messages, stored in a rados object.
- the index consists of omap records, one for each message.
- the key is some unique id
- the value is a copy of (a useful subset of) the message headers

This has a number of nice properties:

  - you can efficiently list messages in the mailbox using the omap
operations
  - you can (more) efficiently search messages (everything but the message
body) based on the index contents (since it's all stored in one object)
  - you can efficiently grab recent messages with the omap ops (e.g., list
keys > last_seen_msgid)
  - moving messages between folders involves updating the indices only; the
messages objects need not be copied/moved.
  - no metadata bottleneck: mailbox indices are distributed across the
entire cluster, just like the mail.
  - all the scaling benefits of rados for a growing mail system.
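
To make the index idea concrete, a rough sketch of storing one message
could look like the following. It uses the write-op omap calls of the
librados C API; those particular C bindings appeared after 0.48, and the
object naming ("msg.<id>", "mailbox.INBOX") is invented, so treat this
purely as an illustration of the data model:

/* mailbox_sketch.c -- illustration only: one immutable message object
 * plus one omap entry in a mailbox index object. */
#include <stdio.h>
#include <string.h>
#include <rados/librados.h>

int store_message(rados_ioctx_t io, const char *msg_id,
                  const char *headers, const char *body, size_t body_len)
{
	char msg_oid[256];
	const char *keys[1] = { msg_id };
	const char *vals[1] = { headers };
	size_t lens[1] = { strlen(headers) };
	rados_write_op_t op;
	int ret;

	/* 1. the message itself: an immutable object named after its id */
	snprintf(msg_oid, sizeof(msg_oid), "msg.%s", msg_id);
	ret = rados_write_full(io, msg_oid, body, body_len);
	if (ret < 0)
		return ret;

	/* 2. the mailbox index: one omap key per message, value = headers */
	op = rados_create_write_op();
	rados_write_op_omap_set(op, keys, vals, lens, 1);
	ret = rados_write_op_operate(op, io, "mailbox.INBOX", NULL, 0);
	rados_release_write_op(op);
	return ret;
}

Listing a mailbox is then a single omap read on the index object, and
moving a message between folders only touches index entries, as described
in the list above.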

I don't know enough about what exactly the mail storage backends need to
support to know what issues will come up.  Presumably there are several.
E.g., if you delete a message, is the IMAP client expected to discover
that efficiently?  And do the mail storage backends attempt to do it
efficiently?


With IMAP a message gets marked as deleted until you do a PURGE, which
will actually remove the message.


The problem with IMAP clients, however, is that there are a lot of bugs in
them, especially Outlook.


But if you can somehow plug into Dovecot and only handle the calls that
it makes, you should be fine.




This also doesn't solve the problem of efficiently indexing/searching the
bodies of messages, although I suspect that indexing could be efficiently
implemented on top of this scheme.



Nowadays most clients keep a local cache, at least Thunderbird does and 
uses that for local search. Much faster!


Webmail clients like RoundCube have a local cache as well and 
applications like OpenXchange also have local caches.



So, a non-trivial project, but probably one that can be prototyped without
that much pain, and one that would perform and scale drastically better
than existing solutions I'm aware of.


Yes, MUCH better than Maildir over CephFS or NFS.



I'm hoping there are some motivated hackers lurking who understand the
pain that is maildir/mail infrastructure...



Plenty of motivation, not enough time I think.

Wido


sage



On Wed, 4 Jul 2012, Mitsue Acosta Murakami wrote:


Hello,

We are examining Ceph to use as email storage. In our current system, several
client servers with different services (imap, smtp, etc.) access an NFS storage
server. The mailboxes are stored in Maildir format, with many small files. We
use Amazon AWS EC2 for clients and storage server. In this scenario, we have
some questions about Ceph:

1. Is Ceph recommended for heavy write/read of small files?

2. Is there any problem in installing Ceph on Amazon instances?

3. Does Ceph already support quota?

4. What File System would you encourage us to use?


Thanks in advance,

--
Mitsue Acosta Murakami


--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html




Re: OSD doesn't start

2012-07-05 Thread Székelyi Szabolcs
On 2012. July 4. 09:34:04 Gregory Farnum wrote:
 Hrm, it looks like the OSD data directory got a little busted somehow. How
 did you perform your upgrade? (That is, how did you kill your daemons, in
 what order, and when did you bring them back up.)

Since it would be hard and long to describe in text, I've collected the 
relevant log entries, sorted by time at http://pastebin.com/Ev3M4DQ9 . The 
short story is that after seeing that the OSDs won't start, I tried to bring 
down the whole cluster and start it up from scratch. It didn't change 
anything, so I rebooted the two machines (running all three daemons), to see 
if it changes anything. It didn't and I gave up.

My ceph config is available at http://pastebin.com/KKNjmiWM .

Since this is my test cluster, I'm not very concerned about the data on it. 
But the other one, with the same config, is dying I think. ceph-fuse is eating 
around 75% CPU on the sole monitor (cc) node. The monitor about 15%. On the 
other two nodes, the OSD eats around 50%, the MDS 15%, the monitor another 
10%. No Ceph filesystem activity is going on at the moment. Blktrace reports 
about 1kB/s disk traffic on the partition hosting the OSD data dir. The data 
seems to be accessible at the moment, but I'm afraid that my production 
cluster will end up in a similar situation after upgrade, so I don't dare to 
touch it.

Do you have any suggestion what I should check?

Thanks,
-- 
cc

 On Wednesday, July 4, 2012 at 8:31 AM, Székelyi Szabolcs wrote:
  Hi,
  
  after upgrading to 0.48 Argonaut, my OSDs won't start up again. This
  problem might not be related to the upgrade, since the cluster had
  strange behavior before, too: ceph-fuse was spinning the CPU around 70%,
  so did the OSDs. This happened to both of my clusters. Thought that
  upgrading might solve the problem, but it just got worse.
  
  I've copied the log of the OSD run to http://pastebin.com/XYRtfFMU . I've
  rebooted all the nodes, but they still don't work.
  
  What should I do to resurrect my OSDs?
  
  Thanks,
  --
  cc
  
  
  --
  To unsubscribe from this list: send the line unsubscribe ceph-devel in
  the body of a message to majord...@vger.kernel.org
  (mailto:majord...@vger.kernel.org) More majordomo info at
  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Strange behavior after upgrading to 0.48

2012-07-05 Thread Sage Weil
Hi,

On Thu, 5 Jul 2012, Xiaopong Tran wrote:
 Hi,
 
 I put up a small cluster with 3 osds, 2 mds, 3 mons, on 3 machines.
 They were running 0.47.2, and this is a test to do rolling upgrade to
 0.48.
 
 I shutdown, upgraded the software, then restarted. One node at a time.
 The first two seemed to be ok. The third one gave me some weird thing.
 While it was doing the conversion and recovering, the command ceph -s gives
 things like this:
 
 
 root@china:/tmp# ceph -s
 2012-07-05 14:28:41.069470 7fa3c8443780  2 auth: KeyRing::load: loaded key
 file /etc/ceph/client.admin.keyring
 2012-07-05 14:28:41.594229 7fa3c030e700  0 monclient: hunting for new mon
 2012-07-05 14:28:41.596313 7fa3c030e700  0 monclient: hunting for new mon
 2012-07-05 14:28:41.598949 7fa3c030e700  0 monclient: hunting for new mon
 2012-07-05 14:28:41.601158 7fa3c030e700  0 monclient: hunting for new mon
 2012-07-05 14:28:41.603069 7fa3c030e700  0 monclient: hunting for new mon
 2012-07-05 14:28:41.605020 7fa3c030e700  0 monclient: hunting for new mon
 2012-07-05 14:28:41.607436 7fa3c030e700  0 monclient: hunting for new mon
 2012-07-05 14:28:41.609304 7fa3c030e700  0 monclient: hunting for new mon
 2012-07-05 14:28:41.611047 7fa3c030e700  0 monclient: hunting for new mon
 2012-07-05 14:28:41.667980 7fa3c030e700  0 monclient: hunting for new mon
 2012-07-05 14:28:41.670283 7fa3c030e700  0 monclient: hunting for new mon
 2012-07-05 14:28:41.672274 7fa3c030e700  0 monclient: hunting for new mon
 

The problem is that the ceph utility itself is pre-0.48, but the monitors 
are running 0.48.  You need to upgrade the utility as well.  (There was a 
note about this in the release announcement.)

This only affects the -s and -w commands.

sage


 
 And it never stopped. I was thinking, maybe it just behaved like
 that during recovery. But after the recovery is done, it still
 get the same thing:
 
 root@china:/tmp# ceph health
 2012-07-05 14:28:55.077364 7f8306a0d780  2 auth: KeyRing::load: loaded key
 file /etc/ceph/client.admin.keyring
 HEALTH_OK
 root@china:/tmp# ceph -s
 2012-07-05 14:30:49.688017 7feb6338e780  2 auth: KeyRing::load: loaded key
 file /etc/ceph/client.admin.keyring
 2012-07-05 14:30:49.691690 7feb5b259700  0 monclient: hunting for new mon
 2012-07-05 14:30:49.694295 7feb5b259700  0 monclient: hunting for new mon
 2012-07-05 14:30:49.696487 7feb5b259700  0 monclient: hunting for new mon
 2012-07-05 14:30:49.698953 7feb5b259700  0 monclient: hunting for new mon
 2012-07-05 14:30:49.700833 7feb5b259700  0 monclient: hunting for new mon
 
 
 Upgrading the first two nodes have no such problem. This first two
 nodes all run osd, mds, and mon. The third only runs osd and mon.
 
 The mon log on the 3rd node shows this, not sure if this is helpful:
 
 
 925291 lease_expire=2012-07-05 02:38:14.149966 has v44 lc 44
 2012-07-05 02:38:12.572107 7f7d9381a700  1 mon.a@0(leader).paxos(pgmap active
 c 29531..30031) is_readable now=2012-07-05 02:38:12.572114
 lease_expire=2012-07-05 02:38:15.889056 has v0 lc 30031
 2012-07-05 02:38:12.572128 7f7d9381a700  1 mon.a@0(leader).paxos(pgmap active
 c 29531..30031) is_readable now=2012-07-05 02:38:12.572129
 lease_expire=2012-07-05 02:38:15.889056 has v0 lc 30031
 2012-07-05 02:38:15.120439 7f7d9401b700  1 mon.a@0(leader).paxos(mdsmap active
 c 1..44) is_readable now=2012-07-05 02:38:15.120446 lease_expire=2012-07-05
 02:38:17.149967 has v44 lc 44
 2012-07-05 02:38:15.925349 7f7d9401b700  1 mon.a@0(leader).paxos(mdsmap active
 c 1..44) is_readable now=2012-07-05 02:38:15.925356 lease_expire=2012-07-05
 02:38:20.149971 has v44 lc 44
 2012-07-05 02:38:17.572181 7f7d9381a700  1 mon.a@0(leader).paxos(pgmap active
 c 29531..30031) is_readable now=2012-07-05 02:38:17.572189
 lease_expire=2012-07-05 02:38:21.889065 has v0 lc 30031
 2012-07-05 02:38:17.572204 7f7d9381a700  1 mon.a@0(leader).paxos(pgmap active
 c 29531..30031) is_readable now=2012-07-05 02:38:17.572205
 lease_expire=2012-07-05 02:38:21.889065 has v0 lc 30031
 2012-07-05 02:38:19.120463 7f7d9401b700  1 mon.a@0(leader).paxos(mdsmap active
 c 1..44) is_readable now=2012-07-05 02:38:19.120470 lease_expire=2012-07-05
 02:38:23.149973 has v44 lc 44
 2012-07-05 02:38:19.925323 7f7d9401b700  1 mon.a@0(leader).paxos(mdsmap active
 c 1..44) is_readable now=2012-07-05 02:38:19.925330 lease_expire=2012-07-05
 02:38:23.149973 has v44 lc 44
 
 Could someone give a hint on this?
 
 Thanks
 
 Xiaopong
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Generate URL-safe base64 strings for keys.

2012-07-05 Thread Sage Weil
On Thu, 5 Jul 2012, Wido den Hollander wrote:
 On 04-07-12 18:18, Sage Weil wrote:
  On Wed, 4 Jul 2012, Wido den Hollander wrote:
On Wed, 4 Jul 2012, Wido den Hollander wrote:
 By using this we prevent scenarios where cephx keys are not accepted
 in various situations.
 
 Replacing the + and / by - and _ we generate URL-safe base64 keys
 
 Signed-off-by: Wido den Hollander w...@widodh.nl

Do we already properly decode URL-safe base64 encoding?

   
   Yes, it decodes URL-safe base64 as well.
   
   See the if statements for 62 and 63, + and - are treated equally, just
   like / and _.
  
  Oh, got it.  The commit description confused me... I thought this was
  related encoding only.
  
  I think we should break the encode and decode patches into separate
  versions, and apply the decode to a stable branch (argonaut) and the
  encode to the master.  That should avoid most problems with a
  rolling/staggered upgrade...
 
 I just submitted a patch for decoding only.

Applied, thanks!

 During some tests I did I found out that libvirt uses GNUlib and won't handle
 URL-safe base64 encoded keys.
 
 So, as long as Ceph allows them we're good. Users can always replace the + and
 / in their key knowing it will be accepted by Ceph.
 
 This works for me for now. The exact switch to base64url should be done at a
 later stage I think.
 
 The RFC on this: http://tools.ietf.org/html/rfc4648#page-7

We could:
 - submit a patch for gnulib; someday it'll support it
 - kludge the secret generation code in ceph so that it rejects secrets 
   with problematic encoding... :/  (radosgw-admin does something 
   similar with +'s in the s3-style user keys.)

sage



 
 Wido
 
  
  sage
  
  
   
   Wido
   
   
sage

 ---
 src/common/armor.c |   6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
 
 diff --git a/src/common/armor.c b/src/common/armor.c
 index d1d5664..7f73da1 100644
 --- a/src/common/armor.c
 +++ b/src/common/armor.c
 @@ -9,7 +9,7 @@
 * base64 encode/decode.
 */
 
 -const char *pem_key = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
 +const char *pem_key = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";
 
 static int encode_bits(int c)
 {
 @@ -24,9 +24,9 @@ static int decode_bits(char c)
  return c - 'a' + 26;
  if (c >= '0' && c <= '9')
  return c - '0' + 52;
 -if (c == '+')
 +if (c == '+' || c == '-')
  return 62;
 -if (c == '/')
 +if (c == '/' || c == '_')
  return 63;
  if (c == '=')
  return 0; /* just non-negative, please */
 --
 1.7.9.5
 
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Strange behavior after upgrading to 0.48

2012-07-05 Thread Xiaopong Tran


Sage Weil s...@inktank.com wrote:

Hi,

On Thu, 5 Jul 2012, Xiaopong Tran wrote:
 Hi,
 
 I put up a small cluster with 3 osds, 2 mds, 3 mons, on 3 machines.
 They were running 0.47.2, and this is a test to do rolling upgrade to
 0.48.
 
 I shutdown, upgraded the software, then restarted. One node at a
time.
 The first two seemed to be ok. The third one gave me some weird
thing.
 While it was doing the conversion and recovering, the command ceph -s
gives
 things like this:
 
 
 root@china:/tmp# ceph -s
 2012-07-05 14:28:41.069470 7fa3c8443780  2 auth: KeyRing::load:
loaded key
 file /etc/ceph/client.admin.keyring
 2012-07-05 14:28:41.594229 7fa3c030e700  0 monclient: hunting for new
mon
 2012-07-05 14:28:41.596313 7fa3c030e700  0 monclient: hunting for new
mon
 2012-07-05 14:28:41.598949 7fa3c030e700  0 monclient: hunting for new
mon
 2012-07-05 14:28:41.601158 7fa3c030e700  0 monclient: hunting for new
mon
 2012-07-05 14:28:41.603069 7fa3c030e700  0 monclient: hunting for new
mon
 2012-07-05 14:28:41.605020 7fa3c030e700  0 monclient: hunting for new
mon
 2012-07-05 14:28:41.607436 7fa3c030e700  0 monclient: hunting for new
mon
 2012-07-05 14:28:41.609304 7fa3c030e700  0 monclient: hunting for new
mon
 2012-07-05 14:28:41.611047 7fa3c030e700  0 monclient: hunting for new
mon
 2012-07-05 14:28:41.667980 7fa3c030e700  0 monclient: hunting for new
mon
 2012-07-05 14:28:41.670283 7fa3c030e700  0 monclient: hunting for new
mon
 2012-07-05 14:28:41.672274 7fa3c030e700  0 monclient: hunting for new
mon
 

The problem is that the ceph utility itself is pre-0.48, but the
monitors 
are running 0.48.  You need to upgrade the utility as well.  (There was
a 
note about this in the release announcement.)

This only affects the -s and -w commands.

sage

I have read the notes, and upgraded the utility first. There was no problem when
the first two were upgraded and recovering. This only happened when the third
node was upgraded.

The nodes are running debian wheezy, while the client admin node is running 
ubuntu 12.04.

thanks

Xiaopong


 
 And it never stopped. I was thinking, maybe it just behaved like
 that during recovery. But after the recovery is done, it still
 get the same thing:
 
 root@china:/tmp# ceph health
 2012-07-05 14:28:55.077364 7f8306a0d780  2 auth: KeyRing::load:
loaded key
 file /etc/ceph/client.admin.keyring
 HEALTH_OK
 root@china:/tmp# ceph -s
 2012-07-05 14:30:49.688017 7feb6338e780  2 auth: KeyRing::load:
loaded key
 file /etc/ceph/client.admin.keyring
 2012-07-05 14:30:49.691690 7feb5b259700  0 monclient: hunting for new
mon
 2012-07-05 14:30:49.694295 7feb5b259700  0 monclient: hunting for new
mon
 2012-07-05 14:30:49.696487 7feb5b259700  0 monclient: hunting for new
mon
 2012-07-05 14:30:49.698953 7feb5b259700  0 monclient: hunting for new
mon
 2012-07-05 14:30:49.700833 7feb5b259700  0 monclient: hunting for new
mon
 
 
 Upgrading the first two nodes have no such problem. This first two
 nodes all run osd, mds, and mon. The third only runs osd and mon.
 
 The mon log on the 3rd node shows this, not sure if this is helpful:
 
 
 925291 lease_expire=2012-07-05 02:38:14.149966 has v44 lc 44
 2012-07-05 02:38:12.572107 7f7d9381a700  1
mon.a@0(leader).paxos(pgmap active
 c 29531..30031) is_readable now=2012-07-05 02:38:12.572114
 lease_expire=2012-07-05 02:38:15.889056 has v0 lc 30031
 2012-07-05 02:38:12.572128 7f7d9381a700  1
mon.a@0(leader).paxos(pgmap active
 c 29531..30031) is_readable now=2012-07-05 02:38:12.572129
 lease_expire=2012-07-05 02:38:15.889056 has v0 lc 30031
 2012-07-05 02:38:15.120439 7f7d9401b700  1
mon.a@0(leader).paxos(mdsmap active
 c 1..44) is_readable now=2012-07-05 02:38:15.120446
lease_expire=2012-07-05
 02:38:17.149967 has v44 lc 44
 2012-07-05 02:38:15.925349 7f7d9401b700  1
mon.a@0(leader).paxos(mdsmap active
 c 1..44) is_readable now=2012-07-05 02:38:15.925356
lease_expire=2012-07-05
 02:38:20.149971 has v44 lc 44
 2012-07-05 02:38:17.572181 7f7d9381a700  1
mon.a@0(leader).paxos(pgmap active
 c 29531..30031) is_readable now=2012-07-05 02:38:17.572189
 lease_expire=2012-07-05 02:38:21.889065 has v0 lc 30031
 2012-07-05 02:38:17.572204 7f7d9381a700  1
mon.a@0(leader).paxos(pgmap active
 c 29531..30031) is_readable now=2012-07-05 02:38:17.572205
 lease_expire=2012-07-05 02:38:21.889065 has v0 lc 30031
 2012-07-05 02:38:19.120463 7f7d9401b700  1
mon.a@0(leader).paxos(mdsmap active
 c 1..44) is_readable now=2012-07-05 02:38:19.120470
lease_expire=2012-07-05
 02:38:23.149973 has v44 lc 44
 2012-07-05 02:38:19.925323 7f7d9401b700  1
mon.a@0(leader).paxos(mdsmap active
 c 1..44) is_readable now=2012-07-05 02:38:19.925330
lease_expire=2012-07-05
 02:38:23.149973 has v44 lc 44
 
 Could someone give a hint on this?
 
 Thanks
 
 Xiaopong
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel
in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 


Re: [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem

2012-07-05 Thread Sage Weil
On Wed, 4 Jul 2012, Sha Zhengju wrote:
 On 07/02/2012 10:49 PM, Sage Weil wrote:
  On Mon, 2 Jul 2012, Sha Zhengju wrote:
   On 06/29/2012 01:21 PM, Sage Weil wrote:
On Thu, 28 Jun 2012, Sha Zhengju wrote:

 From: Sha Zhengjuhandai@taobao.com
 
 Following we will treat SetPageDirty and dirty page accounting as an
 integrated
 operation. Filesystems had better use vfs interface directly to avoid
 those details.
 
 Signed-off-by: Sha Zhengjuhandai@taobao.com
 ---
fs/buffer.c |2 +-
fs/ceph/addr.c  |   20 ++--
include/linux/buffer_head.h |2 ++
3 files changed, 5 insertions(+), 19 deletions(-)
 
 diff --git a/fs/buffer.c b/fs/buffer.c
 index e8d96b8..55522dd 100644
 --- a/fs/buffer.c
 +++ b/fs/buffer.c
 @@ -610,7 +610,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
 * If warn is true, then emit a warning if the page is not uptodate and
 * has not been truncated.
 */
 -static int __set_page_dirty(struct page *page,
 +int __set_page_dirty(struct page *page,
   struct address_space *mapping, int warn)
{
   if (unlikely(!mapping))
This also needs an EXPORT_SYMBOL(__set_page_dirty) to allow ceph to
continue to build as a module.

With that fixed, the ceph bits are a welcome cleanup!

Acked-by: Sage Weils...@inktank.com
   Further, I checked the path again; could it be reworked as follows to avoid
   the undo?

   __set_page_dirty();                 __set_page_dirty();
   ceph operations;            ==>     if (page->mapping)
   if (page->mapping)                      ceph operations;
       ;
   else
       undo = 1;
   if (undo)
       xxx;
  Yep.  Taking another look at the original code, though, I'm worried that
  one reason the __set_page_dirty() actions were spread out the way they are
  is because we wanted to ensure that the ceph operations were always
  performed when PagePrivate was set.
  
 
 Sorry, I've lost something:
 
 __set_page_dirty();                 __set_page_dirty();
 ceph operations;
 if (page->mapping)          ==>     if (page->mapping) {
     SetPagePrivate;                     SetPagePrivate;
 else                                    ceph operations;
     undo = 1;                       }
 if (undo)
     XXX;
 
 I think this can ensure that ceph operations are performed together with
 SetPagePrivate.

Yeah, that looks right, as long as the ceph accounting operations happen 
before SetPagePrivate.  I think it's no more or less racy than before, at 
least. 

The patch doesn't apply without the previous ones in the series, it looks 
like.  Do you want to prepare a new version or should I?

Thanks!
sage

 
  It looks like invalidatepage won't get called if private isn't set, and
  presumably it handles the truncate race with __set_page_dirty() properly
  (right?).  What about writeback?  Do we need to worry about writepage[s]
  getting called with a NULL page-private?
 
 __set_page_dirty does handle racing conditions with truncate and
 writeback writepage[s] also take page-private into consideration
 which is done inside specific filesystems. I notice that ceph has handled
 this in ceph_writepage().
 Sorry, not vfs expert and maybe I've not caught your point...

 
 
 Thanks,
 Sha
 
  Thanks!
  sage
  
  
  
   
   
   Thanks,
   Sha
   
 diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
 index 8b67304..d028fbe 100644
 --- a/fs/ceph/addr.c
 +++ b/fs/ceph/addr.c
 @@ -5,6 +5,7 @@
  #include <linux/mm.h>
  #include <linux/pagemap.h>
  #include <linux/writeback.h>	/* generic_writepages */
 +#include <linux/buffer_head.h>
  #include <linux/slab.h>
  #include <linux/pagevec.h>
  #include <linux/task_io_accounting_ops.h>
 @@ -73,14 +74,8 @@ static int ceph_set_page_dirty(struct page *page)
  	int undo = 0;
  	struct ceph_snap_context *snapc;
 
 -	if (unlikely(!mapping))
 -		return !TestSetPageDirty(page);
 -
 -	if (TestSetPageDirty(page)) {
 -		dout("%p set_page_dirty %p idx %lu -- already dirty\n",
 -		     mapping->host, page, page->index);
 +	if (!__set_page_dirty(page, mapping, 1))
  		return 0;
 -	}
 
  	inode = mapping->host;
  	ci = ceph_inode(inode);
 @@ -107,14 +102,7 @@ static int ceph_set_page_dirty(struct page *page)
  	     snapc, snapc->seq, snapc->num_snaps);
  	spin_unlock(&ci->i_ceph_lock);
 
 -	/* now adjust page */
 -	spin_lock_irq(&mapping->tree_lock);
  	if (page->mapping) {	/* Race with truncate? */
 -		WARN_ON_ONCE(!PageUptodate(page));
 -		account_page_dirtied(page, page->mapping);
 -

Re: [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem

2012-07-05 Thread Sha Zhengju
On Thu, Jul 5, 2012 at 11:20 PM, Sage Weil s...@inktank.com wrote:
 On Wed, 4 Jul 2012, Sha Zhengju wrote:
 On 07/02/2012 10:49 PM, Sage Weil wrote:
  On Mon, 2 Jul 2012, Sha Zhengju wrote:
   On 06/29/2012 01:21 PM, Sage Weil wrote:
On Thu, 28 Jun 2012, Sha Zhengju wrote:
   
 From: Sha Zhengjuhandai@taobao.com

 Following we will treat SetPageDirty and dirty page accounting as an
 integrated
 operation. Filesystems had better use vfs interface directly to avoid
 those details.

 Signed-off-by: Sha Zhengjuhandai@taobao.com
 ---
fs/buffer.c |2 +-
fs/ceph/addr.c  |   20 ++--
include/linux/buffer_head.h |2 ++
3 files changed, 5 insertions(+), 19 deletions(-)

 diff --git a/fs/buffer.c b/fs/buffer.c
 index e8d96b8..55522dd 100644
 --- a/fs/buffer.c
 +++ b/fs/buffer.c
 @@ -610,7 +610,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
 * If warn is true, then emit a warning if the page is not uptodate and
 * has not been truncated.
 */
 -static int __set_page_dirty(struct page *page,
 +int __set_page_dirty(struct page *page,
   struct address_space *mapping, int warn)
{
   if (unlikely(!mapping))
This also needs an EXPORT_SYMBOL(__set_page_dirty) to allow ceph to
continue to build as a module.
   
With that fixed, the ceph bits are a welcome cleanup!
   
Acked-by: Sage Weils...@inktank.com
   Further, I check the path again and may it be reworked as follows to 
   avoid
   undo?
  
   __set_page_dirty();
   __set_page_dirty();
   ceph operations;== if 
   (page-mapping)
   if (page-mapping)ceph
   operations;
;
   else
undo = 1;
   if (undo)
xxx;
  Yep.  Taking another look at the original code, though, I'm worried that
  one reason the __set_page_dirty() actions were spread out the way they are
  is because we wanted to ensure that the ceph operations were always
  performed when PagePrivate was set.
 

 Sorry, I've lost something:

 __set_page_dirty();__set_page_dirty();
 ceph operations;
 if(page-mapping) ==  if(page-mapping) {
SetPagePrivate;SetPagePrivate;
 else  ceph operations;
 undo = 1;  }

 if (undo)
 XXX;

 I think this can ensure that ceph operations are performed together with
 SetPagePrivate.

 Yeah, that looks right, as long as the ceph accounting operations happen
 before SetPagePrivate.  I think it's no more or less racy than before, at
 least.

 The patch doesn't apply without the previous ones in the series, it looks
 like.  Do you want to prepare a new version or should I?


Good. I'm doing some tests, then I'll send out a new version of the patchset;
please wait a bit. : )


Thanks,
Sha
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Setting a big maxosd kills all mons

2012-07-05 Thread Florian Haas
Hi guys,

Someone I worked with today pointed me to a quick and easy way to
bring down an entire cluster, by making all mons kill themselves in
mass suicide:

ceph osd setmaxosd 2147483647
2012-07-05 16:29:41.893862 b5962b70  0 monclient: hunting for new mon

I don't know what the actual threshold is, but setting your maxosd to
any sufficiently big number should do it. I had hoped 2^31-1 would be
fine, but evidently it's not.

This is what's in the mon log -- the first line is obviously only on
the leader at the time of the command, the others are on all mons.

-1 2012-07-05 16:29:41.829470 b41a1b70  0 mon.daisy@0(leader) e1
handle_command mon_command(osd setmaxosd 2147483647 v 0) v1
 0 2012-07-05 16:29:41.887590 b41a1b70 -1 *** Caught signal (Aborted) **
 in thread b41a1b70

 ceph version 0.48argonaut (commit:c2b20ca74249892c8e5e40c12aa14446a2bf2030)
 1: /usr/bin/ceph-mon() [0x816f461]
 2: [0xb7738400]
 3: [0xb7738424]
 4: (gsignal()+0x51) [0xb731a781]
 5: (abort()+0x182) [0xb731dbb2]
 6: (__gnu_cxx::__verbose_terminate_handler()+0x14f) [0xb753b53f]
 7: (()+0xbd405) [0xb7539405]
 8: (()+0xbd442) [0xb7539442]
 9: (()+0xbd581) [0xb7539581]
 10: (()+0x11dea) [0xb7582dea]
 11: (tc_new()+0x26) [0xb75a1636]
 12: (std::vector<unsigned char, std::allocator<unsigned char>
>::_M_fill_insert(__gnu_cxx::__normal_iterator<unsigned char*,
std::vector<unsigned char, std::allocator<unsigned char> > >, unsigned
int, unsigned char const&)+0x79) [0x8185629]
 13: (OSDMap::set_max_osd(int)+0x497) [0x817c6b7]

From src/mon/OSDMonitor.cc:

  int newmax = atoi(m->cmd[2].c_str());
  if (newmax < osdmap.crush->get_max_devices()) {
    err = -ERANGE;
    ss << "cannot set max_osd to " << newmax
       << " which is < crush max_devices "
       << osdmap.crush->get_max_devices();
    goto out;
  }

I think that counts as unchecked user input, or has cmd[2] been
sanitized at any time before it gets here?

Also, is there a way to recover from this, short of reinitializing all mons?

Cheers,
Florian
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Writes to mounted Ceph FS fail silently if client has no write capability on data pool

2012-07-05 Thread Florian Haas
Hi everyone,

please enlighten me if I'm misinterpreting something, but I think the
Ceph FS layer could handle the following situation better.

How to reproduce (this is on a 3.2.0 kernel):

1. Create a client, mine is named test, with the following capabilities:

client.test
key: key
caps: [mds] allow
caps: [mon] allow r
caps: [osd] allow rw pool=testpool

Note the client only has access to a single pool, testpool.

2. Export the client's secret and mount a Ceph FS.

mount -t ceph -o name=test,secretfile=/etc/ceph/test.secret
daisy,eric,frank:/ /mnt

This succeeds, despite us not even having read access to the data pool.

3. Write something to a file.

root@alice:/mnt# echo hello world > hello.txt
root@alice:/mnt# cat hello.txt

This too succeeds.

4. Sync and clear caches.

root@alice:/mnt# sync
root@alice:/mnt# echo 3 > /proc/sys/vm/drop_caches

5. Check file size and contents.

root@alice:/mnt# ls -la
total 5
drwxr-xr-x  1 root root0 Jul  5 17:15 .
drwxr-xr-x 21 root root 4096 Jun 11 09:03 ..
-rw-r--r--  1 root root   12 Jul  5 17:15 hello.txt
root@alice:/mnt# cat hello.txt
root@alice:/mnt#

Note the reported file size is unchanged, but the file is empty.

Checking the data pool with client.admin credentials obviously shows
that that pool is empty, so objects are never written. Interestingly,
cephfs hello.txt show_location does list an object_name, identifying
an object which doesn't exist.

Is there any way to make the client fail with -EIO, -EPERM,
-EOPNOTSUPP or whatever else is appropriate, rather than pretending to
write when it can't?
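
For comparison, the usual way an application catches asynchronous write
failures is to check the return values of write(), fsync() and close();
whether the 3.2 kernel client propagates this particular failure that way
is exactly the open question, but a minimal check looks like this:

/* write_check.c -- check write()/fsync()/close() return values so that
 * delayed write-back errors are not silently dropped. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	const char msg[] = "hello world\n";
	int fd = open("/mnt/hello.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0 || write(fd, msg, strlen(msg)) != (ssize_t)strlen(msg)) {
		perror("open/write");
		return 1;
	}
	if (fsync(fd) < 0) {		/* async write-back errors surface here */
		perror("fsync");
		return 1;
	}
	if (close(fd) < 0) {
		perror("close");
		return 1;
	}
	return 0;
}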

Also, going down the rabbit hole, how would this behavior change if I
used cephfs to set the default layout on some directory to use a
different pool?

All thoughts appreciated.

Cheers,
Florian
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


cephfs show_location produces kernel divide error: 0000 [#1] when run against a directory that is not the filesystem root

2012-07-05 Thread Florian Haas
And one more issue report for today... :)

Really easy to reproduce on my 3.2.0 Debian squeeze-backports kernel:
mount a Ceph FS, create a directory in it. Then run cephfs dir
show_location.

dmesg stacktrace:

[ 7153.714260] libceph: mon2 192.168.42.116:6789 session established
[ 7308.584193] divide error:  [#1] SMP
[ 7308.584936] Modules linked in: cryptd aes_i586 aes_generic cbc ceph
libceph nfsd lockd nfs_acl auth_rpcgss sunrpc fuse joydev usbhid hid
snd_pcm snd_timer snd processor soundcore snd_page_alloc thermal_sys
button tpm_tis tpm tpm_bios psmouse i2c_piix4 evdev serio_raw i2c_core
virtio_balloon pcspkr ext3 jbd mbcache btrfs zlib_deflate crc32c
libcrc32c sg sr_mod cdrom ata_generic virtio_net virtio_blk ata_piix
uhci_hcd ehci_hcd libata usbcore floppy scsi_mod virtio_pci usb_common
[last unloaded: scsi_wait_scan]
[ 7308.588013]
[ 7308.588013] Pid: 1444, comm: cephfs Not tainted
3.2.0-0.bpo.2-686-pae #1 Bochs Bochs
[ 7308.588013] EIP: 0060:[f848c6c2] EFLAGS: 00010246 CPU: 0
[ 7308.588013] EIP is at ceph_calc_file_object_mapping+0x44/0xe8 [libceph]
[ 7308.588013] EAX:  EBX:  ECX:  EDX: 
[ 7308.588013] ESI:  EDI:  EBP:  ESP: f7495ce4
[ 7308.588013]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 7308.588013] Process cephfs (pid: 1444, ti=f7494000 task=f7266a60
task.ti=f7494000)
[ 7308.588013] Stack:
[ 7308.588013]     0001b053 f5f20624 f5f203f0
f749a800 f5f20420
[ 7308.588013]  f84ca6a7 f7495d40 f7495d58 f7495d50 f7495d38 0001
0246 f5f20420
[ 7308.588013]  f749a90c bff6ff70 c14203a4 fffba978 000a0050 
f79f0298 0001
[ 7308.588013] Call Trace:
[ 7308.588013]  [f84ca6a7] ? ceph_ioctl_get_dataloc+0x9e/0x213 [ceph]
[ 7308.588013]  [c10b6781] ? __do_fault+0x3ee/0x42b
[ 7308.588013]  [c10b75f3] ? handle_pte_fault+0x3aa/0xa67
[ 7308.588013]  [c10e0844] ? path_openat+0x27f/0x294
[ 7308.588013]  [f84cac16] ? ceph_ioctl+0x3fa/0x460 [ceph]
[ 7308.588013]  [c10d9fdb] ? cp_new_stat64+0xee/0x100
[ 7308.588013]  [c10b7ebe] ? handle_mm_fault+0x20e/0x224
[ 7308.588013]  [f84ca81c] ? ceph_ioctl_get_dataloc+0x213/0x213 [ceph]

I unfortunately don't have a more recent kernel to test with, so if
this has been fixed upstream feel free to ignore me. Otherwise,
perhaps something that could go into the 3.5-rc cycle.

Doing show_location on a file, and on the root directory of the fs,
both work fine.
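
From the EIP this dies inside ceph_calc_file_object_mapping(), which
divides by values taken from the file layout, so my unverified guess is
that a directory without its own layout hands it a zero
stripe_unit/object_size. A standalone sketch of the kind of guard that
would avoid the oops (the struct and mapping below are stand-ins, not the
real kernel code):

/* layout_guard.c -- illustration: refuse to map a file offset to an
 * object when the layout parameters are zero, instead of dividing by
 * them.  The struct is a stand-in, not the real ceph_file_layout, and
 * the mapping is deliberately simplified. */
#include <stdio.h>

struct fake_layout {
	unsigned stripe_unit;	/* bytes per stripe unit */
	unsigned stripe_count;	/* objects striped across */
	unsigned object_size;	/* bytes per object */
};

static int calc_object_for_offset(const struct fake_layout *l,
                                  unsigned long long off,
                                  unsigned long long *objno)
{
	if (!l->stripe_unit || !l->stripe_count || !l->object_size)
		return -1;		/* invalid layout: would divide by zero */

	*objno = off / l->object_size;	/* grossly simplified mapping */
	return 0;
}

int main(void)
{
	struct fake_layout dir_layout = { 0, 0, 0 };	/* what a bare dir might hand back */
	unsigned long long objno;

	if (calc_object_for_offset(&dir_layout, 4096, &objno) < 0)
		printf("invalid layout, refusing to map\n");
	return 0;
}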

Cheers,
Florian
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Setting a big maxosd kills all mons

2012-07-05 Thread Gregory Farnum
On Thu, Jul 5, 2012 at 10:39 AM, Florian Haas flor...@hastexo.com wrote:
 Hi guys,

 Someone I worked with today pointed me to a quick and easy way to
 bring down an entire cluster, by making all mons kill themselves in
 mass suicide:

 ceph osd setmaxosd 2147483647
 2012-07-05 16:29:41.893862 b5962b70  0 monclient: hunting for new mon
Ungh. Can you file a bug report? The problem here is that the monitor
is trying to allocate a number of maps and arrays with that many
entries; we probably need to put an artificial cap in place as a
config option.
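
Something along these lines, sketched as a standalone check rather than the
actual OSDMonitor code (mon_max_osd is a hypothetical option name, not an
existing one):

/* setmaxosd_guard.c -- sketch of sanity-checking the user-supplied
 * max_osd value against a configurable cap before acting on it. */
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>

static long conf_mon_max_osd = 10000;	/* pretend this came from the config */

static int parse_and_check_maxosd(const char *arg, long *newmax)
{
	char *end = NULL;

	errno = 0;
	*newmax = strtol(arg, &end, 10);
	if (errno || end == arg || *end != '\0')
		return -EINVAL;			/* not a clean integer */
	if (*newmax <= 0 || *newmax > conf_mon_max_osd)
		return -ERANGE;			/* refuse absurd values */
	return 0;
}

int main(int argc, char **argv)
{
	long newmax;
	int ret = parse_and_check_maxosd(argc > 1 ? argv[1] : "2147483647",
					 &newmax);

	if (ret)
		printf("rejecting max_osd argument: %d\n", ret);
	else
		printf("max_osd %ld accepted\n", newmax);
	return 0;
}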


 I don't know what the actual threshold is, but setting your maxosd to
 any sufficiently big number should do it. I had hoped 2^31-1 would be
 fine, but evidently it's not.

 This is what's in the mon log -- the first line is obviously only on
 the leader at the time of the command, the others are on all mons.

     -1 2012-07-05 16:29:41.829470 b41a1b70  0 mon.daisy@0(leader) e1
 handle_command mon_command(osd setmaxosd 2147483647 v 0) v1
      0 2012-07-05 16:29:41.887590 b41a1b70 -1 *** Caught signal (Aborted) **
  in thread b41a1b70

  ceph version 0.48argonaut (commit:c2b20ca74249892c8e5e40c12aa14446a2bf2030)
  1: /usr/bin/ceph-mon() [0x816f461]
  2: [0xb7738400]
  3: [0xb7738424]
  4: (gsignal()+0x51) [0xb731a781]
  5: (abort()+0x182) [0xb731dbb2]
  6: (__gnu_cxx::__verbose_terminate_handler()+0x14f) [0xb753b53f]
  7: (()+0xbd405) [0xb7539405]
  8: (()+0xbd442) [0xb7539442]
  9: (()+0xbd581) [0xb7539581]
  10: (()+0x11dea) [0xb7582dea]
  11: (tc_new()+0x26) [0xb75a1636]
  12: (std::vectorunsigned char, std::allocatorunsigned char
::_M_fill_insert(__gnu_cxx::__normal_iteratorunsigned char*,
 std::vectorunsigned char, std::allocatorunsigned char  , unsigned
 int, unsigned char const)+0x79) [0x8185629]
  13: (OSDMap::set_max_osd(int)+0x497) [0x817c6b7]

 From src/mon/OSDMonitor.cc:

       int newmax = atoi(m->cmd[2].c_str());
       if (newmax < osdmap.crush->get_max_devices()) {
         err = -ERANGE;
         ss << "cannot set max_osd to " << newmax
            << " which is < crush max_devices "
            << osdmap.crush->get_max_devices();
         goto out;
       }

 I think that counts as unchecked user input, or has cmd[2] been
 sanitized at any time before it gets here?

Yeah, there's all kinds of unsanitized user input in the monitor
command-parsing code.

 Also, is there a way to recover from this, short of reinitializing all mons?
Hmm. We can do it by manipulating the disk format, but there's not any
programmatic way to do so. I *think* that if you turn off all the
monitors, and:
1) delete the latest osdmap and osdmap_full entries,
2) edit the osdmap and osdmap_full last_committed entries to be one
prior to what they are,
3) start the monitors
then you should be okay. But it's possible that the latest entry got
updated, in which case you'd also have to modify that to be an older
map.
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Slow request warnings on 0.48

2012-07-05 Thread Mark Nelson

On 07/04/2012 11:58 AM, Alexandre DERUMIER wrote:

Hi, I see the same messages here after upgrading to 0.48.

with random write benchmark.

I have more lags than before with 0.47 (but disks are at 100% usage, so can't 
tell if it's normal or not)


- Mail original -

De: David Blundelldavid.blund...@100percentit.com
À: ceph-devel@vger.kernel.org
Envoyé: Mercredi 4 Juillet 2012 18:53:02
Objet: Slow request warnings on 0.48

I have three servers running mon and osd using Ubuntu 12.04 that I have been 
testing with RADOS storing RBD KVM instances

0.47.3 worked extremely well (once I got over a few btrfs issues). The same servers 
running 0.48 give a large number of [WRN] slow request messages whenever I 
generate a lot of random IO in the KVM instances using iozone. The slow responses 
eventually lead to disk timeouts on the KVM instances.

I have erased the osds and recreated on new btrfs volumes with the same result.

I have also tried switching to xfs using mkfs.xfs -n size=64k with noatime, 
inode64,delaylog,logbufs=8,logbsize=256k

Xfs gives the same result - the iozone tests run fine until the random IO 
starts and then there are lots of slow request warnings.

Does anyone have any ideas about the best place to start troubleshooting / 
debugging?

Thanks,

David


Hi David and Alexandre,

Does this only happen with random writes or also sequential writes?  If 
it happens with sequential writes as well, does it happen with rados bench?


--
Mark Nelson
Performance Engineer
Inktank
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Slow request warnings on 0.48

2012-07-05 Thread David Blundell
 Hi David and Alexandre,
 
 Does this only happen with random writes or also sequential writes?  If it
 happens with sequential writes as well, does it happen with rados bench?
 
 --
 Mark Nelson
 Performance Engineer
 Inktank

Hi Mark,

I have only ever seen it with random writes.  I'll retry rados bench in a few 
minutes to double check - are there any other tests you would like me to run?

I'm currently waiting for some iozone tests to finish.  When the sequential 
tests are running the logs are fine, as soon as the random writes start, the 
logs start to fill with messages like:

2012-07-05 19:10:00.599250 osd.6 10.0.1.42:6802/2145 1151 : [WRN] slow request 
37.933071 seconds old, received at 2012-07-05 19:09:22.665917: 
osd_op(client.96416.0:91965 rb.0.1.015f [write 4022272~4096] 2.3777e91a 
snapc 11=[11,10]) v4 currently waiting for sub ops
2012-07-05 19:10:00.599258 osd.6 10.0.1.42:6802/2145 1152 : [WRN] slow request 
37.932836 seconds old, received at 2012-07-05 19:09:22.666152: 
osd_op(client.96416.0:91966 rb.0.1.015f [write 4034560~4096] 2.3777e91a 
snapc 11=[11,10]) v4 currently waiting for sub ops
2012-07-05 19:10:03.278141 mon.0 10.0.1.40:6789/0 493 : [INF] pgmap v7564: 1344 
pgs: 1344 active+clean; 5183 MB data, 11066 MB used, 1367 GB / 1377 GB avail
2012-07-05 19:09:55.388448 osd.3 10.0.1.41:6802/2540 160 : [WRN] 6 slow 
requests, 6 included below; oldest blocked for > 32.622016 secs
2012-07-05 19:09:55.388463 osd.3 10.0.1.41:6802/2540 161 : [WRN] slow request 
32.622016 seconds old, received at 2012-07-05 19:09:22.766269: 
osd_op(client.96416.0:92308 rb.0.1.017b [write 4001792~4096] 2.f606a6c6 
snapc 11=[11,10]) v4 currently waiting for sub ops 


Re: Slow request warnings on 0.48

2012-07-05 Thread Alexandre DERUMIER
It was during a random write (fio benchmark).

I can't reproduce it now, I'll try to do tests again this week.

- Original message - 

From: Mark Nelson mark.nel...@inktank.com 
To: Alexandre DERUMIER aderum...@odiso.com 
Cc: David Blundell david.blund...@100percentit.com, 
ceph-devel@vger.kernel.org 
Sent: Thursday 5 July 2012 19:58:27 
Subject: Re: Slow request warnings on 0.48 

On 07/04/2012 11:58 AM, Alexandre DERUMIER wrote: 
 Hi, I see same messages here after upgrade to 0.48. 
 
 with random write benchmark. 
 
 I have more lags than before with 0.47 (but disks are at 100% usage, so can't 
 tell if it's normal or not) 
 
 
 - Original message - 
 
 From: David Blundell david.blund...@100percentit.com 
 To: ceph-devel@vger.kernel.org 
 Sent: Wednesday 4 July 2012 18:53:02 
 Subject: Slow request warnings on 0.48 
 
 I have three servers running mon and osd using Ubuntu 12.04 that I have been 
 testing with RADOS storing RBD KVM instances 
 
 0.47.3 worked extremely well (once I got over a few btrfs issues). The same 
 servers running 0.48 give a large number of [WRN] slow request messages 
 whenever I generate a lot of random IO in the KVM instances using iozone. The 
 slow responses eventually lead to disk timeouts on the KVM instances. 
 
 I have erased the osds and recreated on new btrfs volumes with the same 
 result. 
 
 I have also tried switching to xfs using mkfs.xfs -n size=64k with noatime, 
 inode64,delaylog,logbufs=8,logbsize=256k 
 
 Xfs gives the same result - the iozone tests run fine until the random IO 
 starts and then there are lots of slow request warnings. 
 
 Does anyone have any ideas about the best place to start troubleshooting / 
 debugging? 
 
 Thanks, 
 
 David 

Hi David and Alexandre, 

Does this only happen with random writes or also sequential writes? If 
it happens with sequential writes as well, does it happen with rados bench? 

-- 
Mark Nelson 
Performance Engineer 
Inktank 



-- 

Alexandre Derumier 

Systems and Network Engineer 


Phone: 03 20 68 88 85 

Fax: 03 20 68 90 88 


45 Bvd du Général Leclerc 59100 Roubaix 
12 rue Marivaux 75002 Paris 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Slow request warnings on 0.48

2012-07-05 Thread David Blundell
 Hi David and Alexandre,
 
 Does this only happen with random writes or also sequential writes?  If it
 happens with sequential writes as well, does it happen with rados bench?
 
 --
 Mark Nelson
 Performance Engineer
 Inktank

Hi Mark,

I just ran rados -p data bench 60 write -t 16 and a few dd tests with no 
problems at all so at the moment it looks like only random IO triggers the slow 
writes.

Please do let me know if there are any other tests that I can do to help track 
down the cause.

David


Re: Slow request warnings on 0.48

2012-07-05 Thread Mark Nelson

On 07/05/2012 01:43 PM, David Blundell wrote:

Hi David and Alexandre,

Does this only happen with random writes or also sequential writes?  If it
happens with sequential writes as well, does it happen with rados bench?

--
Mark Nelson
Performance Engineer
Inktank


Hi Mark,

I just ran rados -p data bench 60 write -t 16 and a few dd tests with no 
problems at all so at the moment it looks like only random IO triggers the slow writes.

Please do let me know if there are any other tests that I can do to help track 
down the cause.

David


Thanks David!  We've got some people internally taking a look at this. 
I'll let you guys know if there is anything else we need!


Thanks,
Mark

--
Mark Nelson
Performance Engineer
Inktank
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Writes to mounted Ceph FS fail silently if client has no write capability on data pool

2012-07-05 Thread Gregory Farnum
On Thu, Jul 5, 2012 at 10:40 AM, Florian Haas flor...@hastexo.com wrote:
 Hi everyone,

 please enlighten me if I'm misinterpreting something, but I think the
 Ceph FS layer could handle the following situation better.

 How to reproduce (this is on a 3.2.0 kernel):

 1. Create a client, mine is named test, with the following capabilities:

 client.test
         key: key
         caps: [mds] allow
         caps: [mon] allow r
         caps: [osd] allow rw pool=testpool

 Note the client only has access to a single pool, testpool.

 2. Export the client's secret and mount a Ceph FS.

 mount -t ceph -o name=test,secretfile=/etc/ceph/test.secret
 daisy,eric,frank:/ /mnt

 This succeeds, despite us not even having read access to the data pool.

 3. Write something to a file.

 root@alice:/mnt# echo hello world > hello.txt
 root@alice:/mnt# cat hello.txt

 This too succeeds.

 4. Sync and clear caches.

 root@alice:/mnt# sync
 root@alice:/mnt# echo 3 > /proc/sys/vm/drop_caches

 5. Check file size and contents.

 root@alice:/mnt# ls -la
 total 5
 drwxr-xr-x  1 root root    0 Jul  5 17:15 .
 drwxr-xr-x 21 root root 4096 Jun 11 09:03 ..
 -rw-r--r--  1 root root   12 Jul  5 17:15 hello.txt
 root@alice:/mnt# cat hello.txt
 root@alice:/mnt#

 Note the reported file size is unchanged, but the file is empty.

 Checking the data pool with client.admin credentials obviously shows
 that that pool is empty, so objects are never written. Interestingly,
 cephfs hello.txt show_location does list an object_name, identifying
 an object which doesn't exist.

 Is there any way to make the client fail with -EIO, -EPERM,
 -EOPNOTSUPP or whatever else is appropriate, rather than pretending to
 write when it can't?

There definitely are, but I don't think we're going to fix that until
we get to working seriously on the filesystem. Create a bug! ;)

 Also, going down the rabbit hole, how would this behavior change if I
 used cephfs to set the default layout on some directory to use a
 different pool?

I'm not sure what you're asking here — if you have access to the
metadata server, you can change the pool that new files go into, and I
think you can set the pool to be whatever you like (and we should
probably harden all this, too). So you can fix it if it's a problem,
but you can also turn it into a problem.
Is that what you were after?
-Greg
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: cephfs show_location produces kernel divide error: 0000 [#1] when run against a directory that is not the filesystem root

2012-07-05 Thread Florian Haas
On Thu, Jul 5, 2012 at 10:04 PM, Gregory Farnum g...@inktank.com wrote:
 But I have a few more queries while this is fresh. If you create a
 directory, unmount and remount, and get the location, does that work?

Nope, same error.

 (actually, just flushing caches would probably do it.)

Idem.

 If you create a
 directory on one node, and then go look at it on another node and try
 to get the location from there, does that work?

No.

Cheers,
Florian
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Slow request warnings on 0.48

2012-07-05 Thread Samuel Just
David,

Could you try rados -p data bench 60 write -t 16 -b 4096?

rados bench defaults to 4MB objects; this'll give us results for 4k objects.

If you could give me the latency too, that would help.
-Sam

On Thu, Jul 5, 2012 at 12:49 PM, Mark Nelson mark.nel...@inktank.com wrote:
 On 07/05/2012 01:43 PM, David Blundell wrote:

 Hi David and Alexandre,

 Does this only happen with random writes or also sequential writes?  If
 it
 happens with sequential writes as well, does it happen with rados bench?

 --
 Mark Nelson
 Performance Engineer
 Inktank


 Hi Mark,

 I just ran rados -p data bench 60 write -t 16 and a few dd tests with no
 problems at all so at the moment it looks like only random IO triggers the
 slow writes.

 Please do let me know if there are any other tests that I can do to help
 track down the cause.

 David


 Thanks David!  We've got some people internally taking a look at this. I'll
 let you guys know if there is anything else we need!

 Thanks,
 Mark


 --
 Mark Nelson
 Performance Engineer
 Inktank
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Writes to mounted Ceph FS fail silently if client has no write capability on data pool

2012-07-05 Thread Florian Haas
On Thu, Jul 5, 2012 at 10:01 PM, Gregory Farnum g...@inktank.com wrote:
 Also, going down the rabbit hole, how would this behavior change if I
 used cephfs to set the default layout on some directory to use a
 different pool?

 I'm not sure what you're asking here — if you have access to the
 metadata server, you can change the pool that new files go into, and I
 think you can set the pool to be whatever you like (and we should
 probably harden all this, too). So you can fix it if it's a problem,
 but you can also turn it into a problem.

I am aware that I would be able to do this.

My question was more along the lines of: if the pool that data is
written to can be set on a per-file or per-directory basis, and we can
also set read and write permissions per pool, how would the filesystem
behave properly? Hide files the mounting user doesn't have read access
to? Return -EIO or -EPERM on writes to files stored in pools we can't
write to? Fail a mount if we're missing some permission on any file
or directory in the fs? All of these sound painful in one way or
another, so I'm having trouble envisioning what the correct behavior
would look like.

Florian
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: cephfs show_location produces kernel divide error: 0000 [#1] when run against a directory that is not the filesystem root

2012-07-05 Thread Gregory Farnum
On Thu, Jul 5, 2012 at 1:19 PM, Florian Haas flor...@hastexo.com wrote:
 On Thu, Jul 5, 2012 at 10:04 PM, Gregory Farnum g...@inktank.com wrote:
 But I have a few more queries while this is fresh. If you create a
 directory, unmount and remount, and get the location, does that work?

 Nope, same error.

 (actually, just flushing caches would probably do it.)

 Idem.

 If you create a
 directory on one node, and then go look at it on another node and try
 to get the location from there, does that work?

 No.

 Cheers,
 Florian

Okay, this used to work at least some, so something definitely got
broken in the kernel. :/ Thanks for checking...
-Greg
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Writes to mounted Ceph FS fail silently if client has no write capability on data pool

2012-07-05 Thread Gregory Farnum
On Thu, Jul 5, 2012 at 1:25 PM, Florian Haas flor...@hastexo.com wrote:
 On Thu, Jul 5, 2012 at 10:01 PM, Gregory Farnum g...@inktank.com wrote:
 Also, going down the rabbit hole, how would this behavior change if I
 used cephfs to set the default layout on some directory to use a
 different pool?

 I'm not sure what you're asking here — if you have access to the
 metadata server, you can change the pool that new files go into, and I
 think you can set the pool to be whatever you like (and we should
 probably harden all this, too). So you can fix it if it's a problem,
 but you can also turn it into a problem.

 I am aware that I would be able to do this.

 My question was more along the lines of: if the pool that data is
 written to can be set on a per-file or per-directory basis, and we can
 also set read and write permissions per pool, how would the filesystem
 behave properly? Hide files the mounting user doesn't have read access
 to? Return -EIO or -EPERM on writes to files stored in pools we can't
 write to? Failing a mount if we're missing some permission on any file
 or directory in the fs? All of these sound painful in one way or
 another, so I'm having trouble envisioning what the correct behavior
 would look like.

Ah, yes. My feeling would be that we want to treat it like a local
file they aren't allowed to access — ie, return EPERM. I *think* that
is what will actually happen if they try to read those files, but the
write path works a bit differently (since the writes are flushed out
asynchronously) and so we would need to introduce some smarts into the
client to check the pool permissions and proactively apply them on any
attempted access.
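
To make that concrete, a very rough sketch of such a proactive check is below. The types and helper here (ClientCaps, FileLayout, check_write_allowed) are hypothetical stand-ins for illustration, not the kernel client's actual structures or API:

  // Hypothetical sketch: refuse a write up front if the client's OSD caps
  // don't cover the pool the file's layout points at, instead of buffering
  // the data and silently dropping it later.
  #include <cerrno>
  #include <set>
  #include <string>

  struct ClientCaps {
    std::set<std::string> writable_pools;  // pools this client may write to
  };

  struct FileLayout {
    std::string data_pool;                 // pool the file's objects go to
  };

  // Called before any write is queued, so write() can fail with EPERM
  // immediately rather than pretending to succeed.
  int check_write_allowed(const ClientCaps &caps, const FileLayout &layout)
  {
    if (caps.writable_pools.count(layout.data_pool) == 0)
      return -EPERM;
    return 0;
  }

The point is simply that the pool named in the file's layout gets compared against the caps the client holds before any data is buffered, so the error surfaces at write() time.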
-Greg
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: domino-style OSD crash

2012-07-05 Thread Gregory Farnum
On Wed, Jul 4, 2012 at 10:53 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
 On 04/07/2012 18:21, Gregory Farnum wrote:

 On Wednesday, July 4, 2012 at 1:06 AM, Yann Dupont wrote:

 On 03/07/2012 23:38, Tommi Virtanen wrote:

 On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont yann.dup...@univ-nantes.fr
 (mailto:yann.dup...@univ-nantes.fr) wrote:

 In the case I could repair, do you think a crashed FS as it is right
 now is
 valuable for you, for future reference , as I saw you can't reproduce
 the
 problem ? I can make an archive (or a btrfs dump ?), but it will be
 quite
 big.

     At this point, it's more about the upstream developers (of btrfs
 etc)
 than us; we're on good terms with them but not experts on the on-disk
 format(s). You might want to send an email to the relevant mailing
 lists before wiping the disks.
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 (mailto:majord...@vger.kernel.org)
 More majordomo info at http://vger.kernel.org/majordomo-info.html

     Well, I probably wasn't clear enough. I talked about crashed FS, but
 i
 was talking about ceph. The underlying FS (btrfs in that case) of 1 node
 (and only one) has PROBABLY crashed in the past, causing corruption in
 ceph data on this node, and then the subsequent crash of other nodes.
   RIGHT now btrfs on this node is OK. I can access the filesystem without
 errors.
   For the moment, on 8 nodes, 4 refuse to restart .
 1 of the 4 nodes was the crashed node, the 3 others didn't have problems
 with the underlying fs as far as I can tell.
   So I think the scenario is :
   One node had a problem with btrfs, leading first to kernel problems,
 probably corruption (on disk / in memory maybe?), and ultimately to a
 kernel oops. Before that ultimate kernel oops, bad data had been
 transmitted to other (sane) nodes, leading to ceph-osd crashes on those
 nodes.

 I don't think that's actually possible — the OSDs all do quite a lot of
 interpretation between what they get off the wire and what goes on disk.
 What you've got here are 4 corrupted LevelDB databases, and we pretty much
 can't do that through the interfaces we have. :/


 ok, so as all nodes were identical, I probably hit a btrfs bug (like an
 erroneous out of space) at more or less the same time. And when 1 osd was
 out,



   If you think this scenario is highly improbable in real life (that is,
 btrfs will probably be fixed for good, and then, corruption can't
 happen), it's ok.
   But I wonder if this scenario can be triggered by other problems, and
 bad data can be transmitted to other sane nodes (power outage, out of
 memory condition, disk full... for example)
   That's why I offered you a crashed ceph volume image (I shouldn't have
 talked about a crashed fs, sorry for the confusion)

 I appreciate the offer, but I don't think this will help much — it's a
 disk state managed by somebody else, not our logical state, which has
 broken. If we could figure out how that state got broken that'd be good, but
 a ceph image won't really help in doing so.

 ok, no problem. I'll restart from scratch, freshly formated.


 I wonder if maybe there's a confounding factor here — are all your nodes
 similar to each other,


 Yes. I designed the cluster that way. All nodes are identical hardware
 (powerEdge M610, 10G intel ethernet + emulex fibre channel attached to
 storage (1 Array for 2 OSD nodes, 1 controller dedicated for each OSD)

Oh, interesting. Are the broken nodes all on the same set of arrays?




   or are they running on different kinds of hardware? How did you do your
 Ceph upgrades? What's ceph -s display when the cluster is running as best it
 can?


 Ceph was running 0.47.2 at that time - (debian package for ceph). After the
 crash I couldn't restart all the nodes. Tried 0.47.3 and now 0.48 without
 success.

 Nothing particular for upgrades, because for the moment ceph is broken, so
 just apt-get upgrade with new version.


 ceph -s show that :

 root@label5:~# ceph -s
    health HEALTH_WARN 260 pgs degraded; 793 pgs down; 785 pgs peering; 32
 pgs recovering; 96 pgs stale; 793 pgs stuck inactive; 96 pgs stuck stale;
 1092 pgs stuck unclean; recovery 267286/2491140 degraded (10.729%);
 1814/1245570 unfound (0.146%)
    monmap e1: 3 mons at
 {chichibu=172.20.14.130:6789/0,glenesk=172.20.14.131:6789/0,karuizawa=172.20.14.133:6789/0},
 election epoch 12, quorum 0,1,2 chichibu,glenesk,karuizawa
    osdmap e2404: 8 osds: 3 up, 3 in
     pgmap v173701: 1728 pgs: 604 active+clean, 8 down, 5
 active+recovering+remapped, 32 active+clean+replay, 11
 active+recovering+degraded, 25 active+remapped, 710 down+peering, 222
 active+degraded, 7 stale+active+recovering+degraded, 61 stale+down+peering,
 20 stale+active+degraded, 6 down+remapped+peering, 8
 stale+down+remapped+peering, 9 active+recovering; 4786 GB data, 7495 GB
 used, 7280 GB / 15360 GB avail; 267286/2491140 degraded (10.729%);
 

Re: speedup ceph / scaling / find the bottleneck

2012-07-05 Thread Gregory Farnum
Could you send over the ceph.conf on your KVM host, as well as how
you're configuring KVM to use rbd?

On Tue, Jul 3, 2012 at 11:20 AM, Stefan Priebe s.pri...@profihost.ag wrote:
 I'm sorry, but this is the KVM host machine; there is no ceph running on this
 machine.

 If i change the admin socket to:
 admin_socket=/var/run/ceph_$name.sock

 i don't have any socket at all ;-(

 On 03.07.2012 17:31, Sage Weil wrote:

 On Tue, 3 Jul 2012, Stefan Priebe - Profihost AG wrote:

 Hello,

 On 02.07.2012 22:30, Josh Durgin wrote:

 If you add admin_socket=/path/to/admin_socket for your client running
 qemu (in that client's ceph.conf section or manually in the qemu
 command line) you can check that caching is enabled:

 ceph --admin-daemon /path/to/admin_socket show config | grep rbd_cache

 And see statistics it generates (look for cache) with:

 ceph --admin-daemon /path/to/admin_socket perfcounters_dump


 This doesn't work for me:
 ceph --admin-daemon /var/run/ceph.sock show config
 read only got 0 bytes of 4 expected for response length; invalid
 command?2012-07-03 09:46:57.931821 7fa75d129700 -1 asok(0x8115a0)
 AdminSocket:
 request 'show config' not defined


 Oh, it's 'config show'.  Also, 'help' will list the supported commands.

 Also perfcounters does not show anything:
 # ceph --admin-daemon /var/run/ceph.sock perfcounters_dump
 {}


 There may be another daemon that tried to attach to the same socket file.
 You might want to set 'admin socket = /var/run/ceph/$name.sock' or
 something similar, or whatever else is necessary to make it a unique file.

 ~]# ceph -v
 ceph version 0.48argonaut-2-gb576faa
 (commit:b576faa6f24356f4d3ec7205e298d58659e29c68)


 Out of curiosity, what patches are you applying on top of the release?

 sage


--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Slow request warnings on 0.48

2012-07-05 Thread David Blundell

On 5 Jul 2012, at 21:21, Samuel Just wrote:

 David,
 
 Could you try rados -p data bench 60 write -t 16 -b 4096?
 
 rados bench defaults to 4MB objects, this'll give us results for 4k objects.
 
 If you could give me the latency too, that would help.
 -Sam

Hi Sam,

I first ran this with the standard ceph settings giving 
http://pastebin.com/MWLxEazS

This did not cause any slow request warnings so I set filestore queue max ops 
= 5000 to increase the number of requests in flight.  This resulted in 
http://pastebin.com/yFnALmGW and also a small number of slow request warnings.  
I ran it again with similar results http://pastebin.com/VnKSVmsq

If there's anything else you need, please let me know.

David
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mkcephfs failing on v0.48 argonaut

2012-07-05 Thread Sage Weil
Hi Paul,

On Wed, 4 Jul 2012, Paul Pettigrew wrote:
 Firstly, well done guys on achieving this version milestone. I 
 successfully upgraded to the 0.48 format uneventfully on a live (test) 
 system.
 
 The same system was then going through rebuild testing, to confirm 
 that also worked fine.
 
 
 Unfortunately, the mkcephfs command is failing:
 
 root@dsanb1-coy:~# mkcephfs -c /etc/ceph/ceph.conf --allhosts --mkbtrfs -k 
 /etc/ceph/keyring --crushmapsrc crushfile.txt -v
 temp dir is /tmp/mkcephfs.GaRCZ9i06a
 preparing monmap in /tmp/mkcephfs.GaRCZ9i06a/monmap
 /usr/bin/monmaptool --create --clobber --add alpha 10.32.0.10:6789 --add 
 bravo 10.32.0.25:6789 --add charlie 10.32.0.11:6789 --print 
 /tmp/mkcephfs.GaRCZ9i06a/monmap
 /usr/bin/monmaptool: monmap file /tmp/mkcephfs.GaRCZ9i06a/monmap
 /usr/bin/monmaptool: generated fsid c7202495-468c-4678-b678-115c3ee33402
 epoch 0
 fsid c7202495-468c-4678-b678-115c3ee33402
 last_changed 2012-07-04 15:02:31.732275
 created 2012-07-04 15:02:31.732275
 0: 10.32.0.10:6789/0 mon.alpha
 1: 10.32.0.11:6789/0 mon.charlie
 2: 10.32.0.25:6789/0 mon.bravo
 /usr/bin/monmaptool: writing epoch 0 to /tmp/mkcephfs.GaRCZ9i06a/monmap (3 
 monitors)
 /usr/bin/ceph-conf -c /etc/ceph/ceph.conf -n osd.0 user
 === osd.0 ===
 --- dsanb1-coy# /sbin/mkcephfs -d /tmp/mkcephfs.GaRCZ9i06a --prepare-osdfs 
 osd.0
 umount: /srv/osd.0: not mounted
 umount: /dev/disk/by-wwn/wwn-0x50014ee601246234: not mounted
 
 WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
 WARNING! - see http://btrfs.wiki.kernel.org before using
 
 fs created label (null) on /dev/disk/by-wwn/wwn-0x50014ee601246234
 nodesize 4096 leafsize 4096 sectorsize 4096 size 1.82TB
 Btrfs Btrfs v0.19
 Scanning for Btrfs filesystems
 mount: wrong fs type, bad option, bad superblock on /dev/sdc,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail  or so
 
 failed: '/sbin/mkcephfs -d /tmp/mkcephfs.GaRCZ9i06a --prepare-osdfs osd.0'

Hmm.  Can you try running with -v?  That will tell us exactly which 
command it is running, and hopefully we can work backwards from there.

 dmesg/syslog is spitting out at the time of this failure:
 
 Jul  4 15:02:31 dsanb1-coy kernel: [ 2306.751945] device fsid 
 7de0d192-b710-4629-a201-849df1d9db17 devid 1 transid 27109 /dev/sdp
 Jul  4 15:02:31 dsanb1-coy kernel: [ 2306.751987] device fsid 
 08fc3479-2fa2-4388-8b61-83e2a742a13e devid 1 transid 28699 /dev/sdo
 Jul  4 15:02:31 dsanb1-coy kernel: [ 2306.752023] device fsid 
 8b4a7c43-1a05-4dcb-bbed-de2a5c933996 devid 1 transid 24346 /dev/sdn
 Jul  4 15:02:31 dsanb1-coy kernel: [ 2306.752068] device fsid 
 ba5fb1ca-c642-49b1-8a41-7f56f8e59fbd devid 1 transid 27274 /dev/sdm
 Jul  4 15:02:31 dsanb1-coy kernel: [ 2306.761453] device fsid 
 7fe8c5cf-bf8c-4276-90f2-c3f57f5275fb devid 1 transid 28724 /dev/sdi
 Jul  4 15:02:31 dsanb1-coy kernel: [ 2306.761518] device fsid 
 93fa3631-1202-4d42-8908-e5ef4d3e600d devid 1 transid 25201 /dev/sdh
 Jul  4 15:02:31 dsanb1-coy kernel: [ 2306.761579] device fsid 
 b9a1b5e4-3e5e-4381-a29a-33470f4b870f devid 1 transid 23375 /dev/sdg
 Jul  4 15:02:31 dsanb1-coy kernel: [ 2306.761635] device fsid 
 280ea990-23f8-4c43-9e56-140c82340fdc devid 1 transid 25559 /dev/sdf
 Jul  4 15:02:31 dsanb1-coy kernel: [ 2306.761693] device fsid 
 2f724cde-6de5-4262-b195-1ba3eea2256e devid 1 transid 176 /dev/sde
 Jul  4 15:02:31 dsanb1-coy kernel: [ 2306.761732] device fsid 
 a66f890f-8b08-4393-aab0-f222637ca5a4 devid 1 transid 7 /dev/sdd
 Jul  4 15:02:31 dsanb1-coy kernel: [ 2306.761769] device fsid 
 6c181a94-697c-4e0c-af0d-05eb04d3626c devid 1 transid 7 /dev/sdc
 Jul  4 15:02:31 dsanb1-coy kernel: [ 2306.775931] device fsid 
 6c181a94-697c-4e0c-af0d-05eb04d3626c devid 1 transid 7 /dev/sdc
 Jul  4 15:02:31 dsanb1-coy kernel: [ 2306.779716] btrfs bad fsid on block 
 20971520
 Jul  4 15:02:31 dsanb1-coy kernel: [ 2306.791594] btrfs bad fsid on block 
 20971520
 Jul  4 15:02:31 dsanb1-coy kernel: [ 2306.803608] btrfs bad fsid on block 
 20971520
 Jul  4 15:02:31 dsanb1-coy kernel: [ 2306.815541] btrfs bad fsid on block 
 20971520
 Jul  4 15:02:31 dsanb1-coy kernel: [ 2306.815878] btrfs bad fsid on block 
 20971520
 Jul  4 15:02:32 dsanb1-coy kernel: [ 2306.823554] btrfs bad fsid on block 
 20971520
 Jul  4 15:02:32 dsanb1-coy kernel: [ 2306.823797] btrfs bad fsid on block 
 20971520
 Jul  4 15:02:32 dsanb1-coy kernel: [ 2306.823887] btrfs: failed to read chunk 
 root on sdc
 Jul  4 15:02:32 dsanb1-coy kernel: [ 2306.825622] btrfs: open_ctree failed

Long shot, but is the kernel on that machine recent?

 Also fails if not forcing to use btrfs, eg:
 
 root@dsanb1-coy:~# mkcephfs -c /etc/ceph/ceph.conf --allhosts -k 
 /etc/ceph/keyring --crushmapsrc crushfile.txt -v
 temp dir is /tmp/mkcephfs.ZOh6tBPAH0
 preparing monmap in /tmp/mkcephfs.ZOh6tBPAH0/monmap
 /usr/bin/monmaptool --create --clobber --add alpha 10.32.0.10:6789 --add 
 bravo 10.32.0.25:6789 --add 

Re: Strange behavior after upgrading to 0.48

2012-07-05 Thread Xiaopong Tran

On 07/05/2012 10:38 PM, Sage Weil wrote:

On Thu, 5 Jul 2012, Xiaopong Tran wrote:

The problem is that the ceph utility itself is pre-0.48, but the monitors
are running 0.48.  You need to upgrade the utility as well.  (There was a
note about this in the release announcement.)

This only affects the -s and -w commands.

sage


I have read the notes, and upgraded the utility first. There was no
problem when the first two were upgraded and recovering. This only
happened when the third node is upgraded.

The nodes are running debian wheezy, while the client admin node is
running ubuntu 12.04.


Oooh, maybe the package for wheezy in the repo is wrong.  Can you confirm
which version the ceph utility is with 'ceph -v'?

Thanks!
sage




Thanks for the quick reply, I didn't have the computer with me last
night. But you were right. I checked the version of ceph on ubuntu,
and it's still stuck with 0.47.3, despite upgrading. I redid the
upgrade, and it's still stuck with that version. That's something
I didn't pay attention to.

I had to purge the ceph, ceph-common and other related packages,
and re-install it, then I got 0.48. And now ceph -s works just
as it should.

So, somehow, the upgrade on ubuntu does not work properly.

Thinking about this issue just right now, I think ceph -s
still worked right because there was still an older version
of mon when the first two nodes were being upgraded. When
the last one was upgraded, there was no mon of the same version
anymore.

Sorry, should have checked if apt upgrade was done properly
first :)

Thanks


Xiaopong
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: mkcephfs failing on v0.48 argonaut

2012-07-05 Thread Paul Pettigrew
Hi Sage - thanks so much for the quick response :-)

Firstly (it is a bit hard to see), the command output below was run with 
the -v option. To help isolate what command line in the script is failing, I 
have added in some simple echo output, and the script now looks like:


### prepare-osdfs ###

if [ -n "$prepareosdfs" ]; then
SNIP
modprobe btrfs || true
echo "RUNNING: mkfs.btrfs $btrfs_devs"
mkfs.btrfs $btrfs_devs
btrfs device scan || btrfsctl -a
echo "RUNNING: mount -t btrfs $btrfs_opt $first_dev $btrfs_path"
mount -t btrfs $btrfs_opt $first_dev $btrfs_path
echo "DID I GET HERE - OR CRASH OUT WITH mount ABOVE?"
chown $osd_user $btrfs_path
chmod +w $btrfs_path

exit 0
fi

Per the modified script the above, here is the output displayed when running 
the script:

root@dsanb1-coy:/srv# /sbin/mkcephfs -c /etc/ceph/ceph.conf --allhosts 
--mkbtrfs -k /etc/ceph/keyring --crushmapsrc crushfile.txt -v
temp dir is /tmp/mkcephfs.uelzdJ82ej
preparing monmap in /tmp/mkcephfs.uelzdJ82ej/monmap
/usr/bin/monmaptool --create --clobber --add alpha 10.32.0.10:6789 --add bravo 
10.32.0.25:6789 --add charlie 10.32.0.11:6789 --print 
/tmp/mkcephfs.uelzdJ82ej/monmap
/usr/bin/monmaptool: monmap file /tmp/mkcephfs.uelzdJ82ej/monmap
/usr/bin/monmaptool: generated fsid b254abdd-e036-4186-b6d5-e32b14e53b45
epoch 0
fsid b254abdd-e036-4186-b6d5-e32b14e53b45
last_changed 2012-07-06 12:31:38.416848
created 2012-07-06 12:31:38.416848
0: 10.32.0.10:6789/0 mon.alpha
1: 10.32.0.11:6789/0 mon.charlie
2: 10.32.0.25:6789/0 mon.bravo
/usr/bin/monmaptool: writing epoch 0 to /tmp/mkcephfs.uelzdJ82ej/monmap (3 
monitors)
/usr/bin/ceph-conf -c /etc/ceph/ceph.conf -n osd.0 user
=== osd.0 ===
--- dsanb1-coy# /sbin/mkcephfs -d /tmp/mkcephfs.uelzdJ82ej --prepare-osdfs osd.0
umount: /srv/osd.0: not mounted
umount: /dev/sdc: not mounted
RUNNING: mkfs.btrfs /dev/sdc

WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using

fs created label (null) on /dev/sdc
nodesize 4096 leafsize 4096 sectorsize 4096 size 1.82TB
Btrfs Btrfs v0.19
Scanning for Btrfs filesystems
RUNNING: mount -t btrfs -o noatime /dev/sdc /srv/osd.0
mount: wrong fs type, bad option, bad superblock on /dev/sdc,
   missing codepage or helper program, or other error
   In some cases useful info is found in syslog - try
   dmesg | tail  or so

failed: '/sbin/mkcephfs -d /tmp/mkcephfs.uelzdJ82ej --prepare-osdfs osd.0'


Which clearly isolates the issue to the mount command line.

The trouble is, I can run this precise line on the command line directly 
without error:

root@dsanb1-coy:/srv# mount -t btrfs -o noatime /dev/sdc /srv/osd.0 
root@dsanb1-coy:/srv# mount | grep btrfs
/dev/sdc on /srv/osd.0 type btrfs (rw,noatime)


Therefore, what could possibly be preventing mkcephfs from running a simple 
mount command on the first OSD disk it gets to, when it otherwise works fine from 
the command line?

Many thanks Sage

Paul

PS: changing the "btrfs device scan || btrfsctl -a" line as proposed had no 
effect, and neither did putting in a sleep 10 immediately before the mount 
line.
PPS: zerofilling the /dev/sdc and then re-creating a partition and mounting 
manually, then writing data to it is all fine. Same errors if we substitute any 
of the other HDDs in the server as 1st/osd.0. I.e., we cannot see any issues with 
the hardware.





-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Friday, 6 July 2012 8:18 AM
To: Paul Pettigrew
Cc: ceph-devel@vger.kernel.org
Subject: Re: mkcephfs failing on v0.48 argonaut

Hi Paul,

On Wed, 4 Jul 2012, Paul Pettigrew wrote:
 Firstly, well done guys on achieving this version milestone. I 
 successfully upgraded to the 0.48 format uneventfully on a live (test) 
 system.
 
 The same system was then going through rebuild testing, to confirm 
 that also worked fine.
 
 
 Unfortunately, the mkcephfs command is failing:
 
 root@dsanb1-coy:~# mkcephfs -c /etc/ceph/ceph.conf --allhosts 
 --mkbtrfs -k /etc/ceph/keyring --crushmapsrc crushfile.txt -v temp dir 
 is /tmp/mkcephfs.GaRCZ9i06a preparing monmap in 
 /tmp/mkcephfs.GaRCZ9i06a/monmap /usr/bin/monmaptool --create --clobber 
 --add alpha 10.32.0.10:6789 --add bravo 10.32.0.25:6789 --add charlie 
 10.32.0.11:6789 --print /tmp/mkcephfs.GaRCZ9i06a/monmap
 /usr/bin/monmaptool: monmap file /tmp/mkcephfs.GaRCZ9i06a/monmap
 /usr/bin/monmaptool: generated fsid 
 c7202495-468c-4678-b678-115c3ee33402
 epoch 0
 fsid c7202495-468c-4678-b678-115c3ee33402
 last_changed 2012-07-04 15:02:31.732275 created 2012-07-04 
 15:02:31.732275
 0: 10.32.0.10:6789/0 mon.alpha
 1: 10.32.0.11:6789/0 mon.charlie
 2: 10.32.0.25:6789/0 mon.bravo
 /usr/bin/monmaptool: writing epoch 0 to 
 /tmp/mkcephfs.GaRCZ9i06a/monmap (3 monitors) /usr/bin/ceph-conf -c 
 /etc/ceph/ceph.conf -n osd.0 user
 === osd.0 ===
 --- dsanb1-coy# /sbin/mkcephfs -d /tmp/mkcephfs.GaRCZ9i06a 
 

Re: speedup ceph / scaling / find the bottleneck

2012-07-05 Thread Alexandre DERUMIER
Hi, 
Stefan is on vacation for the moment, I don't know if he can reply to you.

But I can reply for him for the kvm part (as we do the same tests together in 
parallel).

- kvm is 1.1
- rbd 0.48
- drive option 
rbd:pool/volume:auth_supported=cephx;none;keyring=/etc/pve/priv/ceph/ceph.keyring:mon_host=X.X.X.X;
- using writeback

writeback tuning in ceph.conf on the kvm host

rbd_cache_size = 33554432 
rbd_cache_max_age = 2.0 

benchmark use in kvm guest:
fio --filename=$DISK --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 
--runtime=90 --group_reporting --name=file1

results show max 14000 io/s with 1 vm, 7000 io/s per vm with 2 vms, ...
so it doesn't scale

(bench is with directio, so maybe the writeback cache doesn't help)

hardware for ceph , is 3 nodes with 4 intel ssd each. (1 drive can handle 
4io/s randwrite locally)


- Alexandre

- Original message - 

From: Gregory Farnum g...@inktank.com 
To: Stefan Priebe s.pri...@profihost.ag 
Cc: ceph-devel@vger.kernel.org, Sage Weil s...@inktank.com 
Sent: Thursday 5 July 2012 23:33:18 
Subject: Re: speedup ceph / scaling / find the bottleneck 

Could you send over the ceph.conf on your KVM host, as well as how 
you're configuring KVM to use rbd? 

On Tue, Jul 3, 2012 at 11:20 AM, Stefan Priebe s.pri...@profihost.ag wrote: 
 I'm sorry but this is the KVM Host Machine there is no ceph running on this 
 machine. 
 
 If i change the admin socket to: 
 admin_socket=/var/run/ceph_$name.sock 
 
 i don't have any socket at all ;-( 
 
 On 03.07.2012 17:31, Sage Weil wrote: 
 
 On Tue, 3 Jul 2012, Stefan Priebe - Profihost AG wrote: 
 
 Hello, 
 
 On 02.07.2012 22:30, Josh Durgin wrote: 
 
 If you add admin_socket=/path/to/admin_socket for your client running 
 qemu (in that client's ceph.conf section or manually in the qemu 
 command line) you can check that caching is enabled: 
 
 ceph --admin-daemon /path/to/admin_socket show config | grep rbd_cache 
 
 And see statistics it generates (look for cache) with: 
 
 ceph --admin-daemon /path/to/admin_socket perfcounters_dump 
 
 
 This doesn't work for me: 
 ceph --admin-daemon /var/run/ceph.sock show config 
 read only got 0 bytes of 4 expected for response length; invalid 
 command?2012-07-03 09:46:57.931821 7fa75d129700 -1 asok(0x8115a0) 
 AdminSocket: 
 request 'show config' not defined 
 
 
 Oh, it's 'config show'. Also, 'help' will list the supported commands. 
 
 Also perfcounters does not show anything: 
 # ceph --admin-daemon /var/run/ceph.sock perfcounters_dump 
 {} 
 
 
 There may be another daemon that tried to attach to the same socket file. 
 You might want to set 'admin socket = /var/run/ceph/$name.sock' or 
 something similar, or whatever else is necessary to make it a unique file. 
 
 ~]# ceph -v 
 ceph version 0.48argonaut-2-gb576faa 
 (commit:b576faa6f24356f4d3ec7205e298d58659e29c68) 
 
 
 Out of curiosity, what patches are you applying on top of the release? 
 
 sage 
 
 
-- 
To unsubscribe from this list: send the line unsubscribe ceph-devel in 
the body of a message to majord...@vger.kernel.org 
More majordomo info at http://vger.kernel.org/majordomo-info.html 



-- 

Alexandre Derumier 

Systems and Network Engineer 


Phone: 03 20 68 88 85 

Fax: 03 20 68 90 88 


45 Bvd du Général Leclerc 59100 Roubaix 
12 rue Marivaux 75002 Paris 
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Strange behavior after upgrading to 0.48

2012-07-05 Thread Sage Weil
On Fri, 6 Jul 2012, Mark Kirkwood wrote:
 On 06/07/12 14:38, Xiaopong Tran wrote:
  
  Thanks for the quick reply, I didn't have the computer with me last
  night. But you were right. I checked the version of ceph on ubuntu,
  and it's still stuck with 0.47.3, despite upgrading. I redid the
  upgrade, and it's still stuck with that version. That's something
  I didn't pay attention to.
  
  I had to purge the ceph, ceph-common and other related packages,
  and re-install it, then I got 0.48. And now ceph -s works just
  as it should.
  
  So, somehow, the upgrade on ubuntu does not work properly.
  
  Thinking about this issue just right now, I think ceph -s
  still worked right because there was still an older version
  of mon when the first two nodes were being upgraded. When
  the last one was upgraded, there's no mon of the same version
  anymore.
  
  Sorry, should have checked if apt upgrade was done properly
  first :)
  
  
 
 FYI: I ran into this too - you need to do:
 
 apt-get dist-upgrade
 
 for the 0.47-2 packages to be replaced by 0.48 (of course purging 'em and
 reinstalling works too...just a bit more drastic)!

That's strange... anyone know why?

sage
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Strange behavior after upgrading to 0.48

2012-07-05 Thread Mark Kirkwood

On 06/07/12 16:17, Sage Weil wrote:

On Fri, 6 Jul 2012, Mark Kirkwood wrote:


FYI: I ran into this too - you need to do:

apt-get dist-upgrade

for the 0.47-2 packages to be replaced by 0.48 (of course purging 'em and
reinstalling works too...just a bit more drastic)!

That's strange... anyone know why?

sage
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


From the apt-get manual:

   upgrade
   upgrade is used to install the newest versions of all packages
   currently installed on the system from the sources enumerated in
   /etc/apt/sources.list. Packages currently installed with new
   versions available are retrieved and upgraded; under no
   circumstances are currently installed packages removed, or packages
   not already installed retrieved and installed. New versions of
   currently installed packages that cannot be upgraded without
   changing the install status of another package will be left at
   their current version. An update must be performed first so that
   apt-get knows that new versions of packages are available.

   dist-upgrade
   dist-upgrade in addition to performing the function of upgrade,
   also intelligently handles changing dependencies with new versions
   of packages; apt-get has a smart conflict resolution system, and
   it will attempt to upgrade the most important packages at the
   expense of less important ones if necessary. So, dist-upgrade
   command may remove some packages. The /etc/apt/sources.list file
   contains a list of locations from which to retrieve desired package
   files. See also apt_preferences(5) for a mechanism for overriding
   the general settings for individual packages.

Does 0.48 have new dependencies, perhaps?


--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Strange behavior after upgrading to 0.48

2012-07-05 Thread Sage Weil
On Fri, 6 Jul 2012, Mark Kirkwood wrote:
 On 06/07/12 16:17, Sage Weil wrote:
  On Fri, 6 Jul 2012, Mark Kirkwood wrote:
   
   FYI: I ran into this too - you need to do:
   
   apt-get dist-upgrade
   
   for the 0.47-2 packages to be replaced by 0.48 (of course purging 'em and
   reinstalling works too...just a bit more drastic)!
  That's strange... anyone know why?
  
  sage
  --
  To unsubscribe from this list: send the line unsubscribe ceph-devel in
  the body of a message to majord...@vger.kernel.org
  More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 From the apt-get manual:
 
upgrade
upgrade is used to install the newest versions of all packages
currently installed on the system from the sources enumerated in
/etc/apt/sources.list. Packages currently installed with new
versions available are retrieved and upgraded; under no
circumstances are currently installed packages removed, or packages
not already installed retrieved and installed. New versions of
currently installed packages that cannot be upgraded without
changing the install status of another package will be left at
their current version. An update must be performed first so that
apt-get knows that new versions of packages are available.
 
dist-upgrade
dist-upgrade in addition to performing the function of upgrade,
also intelligently handles changing dependencies with new versions
of packages; apt-get has a smart conflict resolution system, and
it will attempt to upgrade the most important packages at the
expense of less important ones if necessary. So, dist-upgrade
command may remove some packages. The /etc/apt/sources.list file
contains a list of locations from which to retrieve desired package
files. See also apt_preferences(5) for a mechanism for overriding
the general settings for individual packages.
 
 Does 0.48 have new dependencies, perhaps?

Oh, yeah.  We switched to libnss from libcrypto++ by default, among other 
things; that would explain it!

Thanks-
sage

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html