Re: Cannot delete some empty dirs and weird sizes

2012-02-03 Thread Amon Ott
On Thursday 02 February 2012, Gregory Farnum wrote:
 On Wed, Feb 1, 2012 at 9:02 AM, Amon Ott a@m-privacy.de wrote:
  ceph should have recovered here. Might also be caused by this setting
  that I tried for a while, it is off now:
  mds standby replay = true
  With this setting, if the active mds gets killed, no mds is able to
  become active, so everything hangs. Had to reboot again.

 Hrm. That setting simply tells the non-active MDSes that they should
 follow the journal of the active MDS(es). They should still go active
 if the MDS they're following fails — although it does slightly
 increase the chances of them running into the same bugs in code and
 dying at the same time.

Well, in this test the following MDS did not become active, it stayed stuck in 
replay mode. Since takeover without that journal following is also fast enough 
for us, I quickly turned the setting off again. Just wanted to mention that 
there might be another bug lurking.

  Then killed the active mds, another takes over and suddenly the missing
  file appears:
.tmp/tiny14/.config/pcmanfm/LXDE/:
insgesamt 1
  drwxr-xr-x 1 32252 users 393  1. Feb 15:19 .
  drwxr-xr-x 1 32252 users   0  1. Feb 17:21 ..
  -rw-r--r-- 1 root  root  393 24. Jan 15:55 pcmanfm.conf

 So were you able to delete that file once it reappeared?

It got deleted correctly by the nightly cleanup cron job.

  Restarted the original mds, it does not appear in ceph mds dump,
  although it is running at 100% cpu. Same happened with other mds
  processes after killing and starting, now I have only one left that is
  working correctly.

 Remind me, you do only have one active MDS, correct?
 Did you look at the logs and see what the MDS was doing with 100% cpu?

Four MDSes are defined, but we are using
max mds = 1
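
A minimal sketch of such a ceph.conf (hostnames are placeholders and the
exact section for "max mds" may vary, so treat this as an assumption rather
than our literal config):

[global]
        max mds = 1

[mds.a]
        host = node-a
[mds.b]
        host = node-b
[mds.c]
        host = node-c
[mds.d]
        host = node-d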

  Will leave the cluster in this state now and have another look tomorrow -
  maybe the spinning mds processes recover by some miracle.

After yet another complete reboot it seems to be stable again now. So far I 
can say that Ceph FS runs pretty stably under the following conditions:
- 4 nodes with 8 to 12 CPU cores each and 12 GB of RAM (CPUs and RAM mostly 
unused by Ceph)
- Ceph 0.41
- 3 mon, 4 mds (max mds 1), 4 osd, all 4 nodes are both server and client
- Kernel 3.1.10 with some hand-integrated patches for Ceph fixes (3.2.2 was 
unstable, need to check)
- OSD storage on a normal hard drive, ext4 without journal (btrfs crashes about 
once per day)
- OSD journals on a separate SSD
- MDS does not get restarted without a reboot (yes, only a reboot helps)

In our experience, it could be a bit faster when writing into many different 
files, which is our major workload. However, the last few versions have already 
brought significant improvements in stability and speed. The main problem we 
see is MDS takeover and recovery, which has given us a lot of trouble so 
far.

Thanks once more for your good work!

Amon Ott
-- 
Dr. Amon Ott
m-privacy GmbH           Tel: +49 30 24342334
Am Köllnischen Park 1    Fax: +49 30 24342336
10179 Berlin             http://www.m-privacy.de

Amtsgericht Charlottenburg, HRB 84946

Geschäftsführer:
 Dipl.-Kfm. Holger Maczkowsky,
 Roman Maczkowsky

GnuPG-Key-ID: 0x2DD3A649


Re: couldn't use rbd

2012-02-03 Thread Masuko Tomoya
Hi Josh,

Thank you for reply !

 This might mean the rbd image list object can't be read for some
 reason, or the rbd tool is doing something weird that the rados tool
 isn't. Can you share the output of 'ceph -s' and
 'rbd ls --log-to-stderr --debug-ms 1 --debug-objecter 20 --debug-monc 20
 --debug-auth 20'?

The output of 'ceph -s' is below.
---
root@ceph01:~# ceph -s
2012-02-03 17:01:47.881960pg v33: 6 pgs: 6 active+clean+degraded;
0 KB data, 1056 KB used, 15357 MB / 15358 MB avail
2012-02-03 17:01:47.882583   mds e9: 1/1/1 up {0=0=up:creating}
2012-02-03 17:01:47.882733   osd e21: 1 osds: 1 up, 1 in
2012-02-03 17:01:47.883042   log 2012-02-03 16:35:11.183897 osd.0
10.68.119.191:6801/2912 12 : [WRN] map e19 wrongly marked me down or
wrong addr
2012-02-03 17:01:47.883144   mon e1: 1 mons at {0=10.68.119.191:6789/0}

The output of 'rbd ls --log-to-stderr --debug-ms 1 --debug-objecter 20
--debug-monc 20 --debug-auth 20' is below.
--
root@ceph01:~# rbd ls --log-to-stderr --debug-ms 1 --debug-objecter 20
--debug-monc 20 --debug-auth 20
2012-02-03 17:02:10.910598 7f88cbb91720 monclient(hunting): build_initial_monmap
2012-02-03 17:02:10.910912 7f88cbb91720 -- :/0 messenger.start
2012-02-03 17:02:10.910966 7f88cbb91720 monclient(hunting): init
2012-02-03 17:02:10.911268 7f88cbb91720 auth: KeyRing::load: loaded
key file /etc/ceph/keyring.bin
2012-02-03 17:02:10.911952 7f88cbb91720 monclient(hunting): supporting
cephx auth protocol
2012-02-03 17:02:10.911984 7f88cbb91720 monclient(hunting): _reopen_session
2012-02-03 17:02:10.912098 7f88cbb91720 monclient(hunting):
_pick_new_mon picked mon.0 con 0x24603e0 addr 10.68.119.191:6789/0
2012-02-03 17:02:10.912124 7f88cbb91720 monclient(hunting):
_send_mon_message to mon.0 at 10.68.119.191:6789/0
2012-02-03 17:02:10.951553 7f88cbb91720 -- 10.68.119.191:0/1003500 --
10.68.119.191:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x24606f0 con
0x24603e0
2012-02-03 17:02:10.951729 7f88cbb8f710 -- 10.68.119.191:0/1003500
learned my addr 10.68.119.191:0/1003500
2012-02-03 17:02:10.952949 7f88cbb91720 monclient(hunting): renew_subs
2012-02-03 17:02:10.953012 7f88cbb91720 monclient(hunting):
authenticate will time out at 2012-02-03 17:02:40.952983
2012-02-03 17:02:10.953392 7f88c9366710 -- 10.68.119.191:0/1003500 ==
mon.0 10.68.119.191:6789/0 1  auth_reply(proto 2 0 Success) v1
 33+0+0 (1661896955 0 0) 0x24608d0 con 0x24603e0
2012-02-03 17:02:10.953444 7f88c9366710 cephx: set_have_need_key no
handler for service mon
2012-02-03 17:02:10.953455 7f88c9366710 cephx: set_have_need_key no
handler for service osd
2012-02-03 17:02:10.953462 7f88c9366710 cephx: set_have_need_key no
handler for service auth
2012-02-03 17:02:10.953504 7f88c9366710 cephx: validate_tickets want
37 have 0 need 37
2012-02-03 17:02:10.953526 7f88c9366710 monclient(hunting): my global_id is 4106
2012-02-03 17:02:10.953546 7f88c9366710 cephx client: handle_response ret = 0
2012-02-03 17:02:10.953557 7f88c9366710 cephx client:  got initial
server challenge 1501581746095987
2012-02-03 17:02:10.953575 7f88c9366710 cephx client: build_request
2012-02-03 17:02:10.953586 7f88c9366710 cephx client:
validate_tickets: want=37 need=37 have=0
2012-02-03 17:02:10.953592 7f88c9366710 cephx: set_have_need_key no
handler for service mon
2012-02-03 17:02:10.953598 7f88c9366710 cephx: set_have_need_key no
handler for service osd
2012-02-03 17:02:10.953604 7f88c9366710 cephx: set_have_need_key no
handler for service auth
2012-02-03 17:02:10.953610 7f88c9366710 cephx: validate_tickets want
37 have 0 need 37
2012-02-03 17:02:10.953617 7f88c9366710 cephx client: want=37 need=37 have=0
2012-02-03 17:02:10.953920 7f88c9366710 cephx client: get auth session
key: client_challenge 394141338555354357
2012-02-03 17:02:10.953953 7f88c9366710 monclient(hunting):
_send_mon_message to mon.0 at 10.68.119.191:6789/0
2012-02-03 17:02:10.953967 7f88c9366710 -- 10.68.119.191:0/1003500 --
10.68.119.191:6789/0 -- auth(proto 2 32 bytes) v1 -- ?+0 0x24608d0 con
0x24603e0
2012-02-03 17:02:10.954517 7f88c9366710 -- 10.68.119.191:0/1003500 ==
mon.0 10.68.119.191:6789/0 2  auth_reply(proto 2 0 Success) v1
 206+0+0 (707160412 0 0) 0x24608d0 con 0x24603e0
2012-02-03 17:02:10.954535 7f88c9366710 cephx client: handle_response ret = 0
2012-02-03 17:02:10.954555 7f88c9366710 cephx client:  get_auth_session_key
2012-02-03 17:02:10.954576 7f88c9366710 cephx:
verify_service_ticket_reply got 1 keys
2012-02-03 17:02:10.954587 7f88c9366710 cephx: got key for service_id auth
2012-02-03 17:02:10.954696 7f88c9366710 cephx:  ticket.secret_id=2
2012-02-03 17:02:10.954737 7f88c9366710 cephx:
verify_service_ticket_reply service auth secret_id 2 session_key
AQCClCtPyL7gOBAAaDicPRG6NHbgUs+4YNntWQ== validity=43200.00
2012-02-03 17:02:10.954756 7f88c9366710 cephx: ticket
expires=2012-02-04 05:02:10.954749 renew_after=2012-02-04
02:02:10.954749
2012-02-03 17:02:10.954764 7f88c9366710 cephx client:  want=37 need=37 have=0
2012-02-03 17:02:10.954771 

avoid crashes from faulty crushmap

2012-02-03 Thread Alexandre Oliva
I messed up a crush map the other day, mixing components of different
types in a single rule.  The crushmap compiler didn't complain, but mons
and osds would crash when applying those rules.  I had to use this patch
to recover the cluster.  Only the second hunk was relevant, but I
figured a BUG_ON that stops you from fixing the problem is best avoided
;-)

---BeginMessage---
It's very hard to recover from an invalid crushmap if mons fail
assertions while processing the map, and osds crash while advancing
past an already-fixed map.  Skip such broken rules instead of
aborting.

Signed-off-by: Alexandre Oliva ol...@lsd.ic.unicamp.br
---
 src/crush/mapper.c |   14 +++++++++++---
 1 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/src/crush/mapper.c b/src/crush/mapper.c
index 1e475b40..6ce4c97 100644
--- a/src/crush/mapper.c
+++ b/src/crush/mapper.c
@@ -354,7 +354,11 @@ static int crush_choose(const struct crush_map *map,
                 item = bucket_perm_choose(in, x, r);
         else
                 item = crush_bucket_choose(in, x, r);
-        BUG_ON(item >= map->max_devices);
+        if (item >= map->max_devices) {
+                dprintk("  bad item %d\n", item);
+                skip_rep = 1;
+                break;
+        }
 
         /* desired type? */
         if (item < 0)
@@ -365,8 +369,12 @@ static int crush_choose(const struct crush_map *map,
 
         /* keep going? */
         if (itemtype != type) {
-                BUG_ON(item >= 0 ||
-                       (-1-item) >= map->max_buckets);
+                if (item >= 0 ||
+                    (-1-item) >= map->max_buckets) {
+                        dprintk("  bad item type %d\n", type);
+                        skip_rep = 1;
+                        break;
+                }
                 in = map->buckets[-1-item];
                 retry_bucket = 1;
                 continue;
-- 
1.7.7.6

---End Message---


-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist  Red Hat Brazil Compiler Engineer


[PATCH RESEND] ceph: avoid panic with mismatched symlink sizes in fill_inode()

2012-02-03 Thread Xi Wang
Return -EINVAL rather than panic if iinfo->symlink_len and
inode->i_size do not match.

Also use kstrndup rather than kmalloc/memcpy.

Signed-off-by: Xi Wang xi.w...@gmail.com
---
 fs/ceph/inode.c |   11 ++++++-----
 1 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 2c48937..6c0205d 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -677,18 +677,19 @@ static int fill_inode(struct inode *inode,
         case S_IFLNK:
                 inode->i_op = &ceph_symlink_iops;
                 if (!ci->i_symlink) {
-                        int symlen = iinfo->symlink_len;
+                        u32 symlen = iinfo->symlink_len;
                         char *sym;
 
-                        BUG_ON(symlen != inode->i_size);
                         spin_unlock(&ci->i_ceph_lock);
 
+                        err = -EINVAL;
+                        if (symlen != inode->i_size)
+                                goto out;
+
                         err = -ENOMEM;
-                        sym = kmalloc(symlen+1, GFP_NOFS);
+                        sym = kstrndup(iinfo->symlink, symlen, GFP_NOFS);
                         if (!sym)
                                 goto out;
-                        memcpy(sym, iinfo->symlink, symlen);
-                        sym[symlen] = 0;
 
                         spin_lock(&ci->i_ceph_lock);
                         if (!ci->i_symlink)
-- 
1.7.5.4



Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-03 Thread Jim Schutt

On 02/02/2012 05:28 PM, Gregory Farnum wrote:

On Thu, Feb 2, 2012 at 12:22 PM, Jim Schuttjasc...@sandia.gov  wrote:

I found 0 instances of waiting for commit in all my OSD logs for my last
run.

So I never waited on the journal?


Looks like it. Interesting.



So far I'm looking at two behaviours I've noticed that seem anomalous to
me.

One is that I instrumented ms_dispatch(), and I see it take
a half-second or more several hundred times, out of several
thousand messages.  Is that expected?



How did you instrument it? If you wrapped the whole function it's
possible that those longer runs are actually chewing through several
messages that had to get waitlisted for some reason previously.
(That's the call to do_waiters().)



Yep, I wrapped the whole function, and also instrumented taking osd_lock
while I was there.  About half the time that ms_dispatch() takes more than
0.5 seconds, taking osd_lock is responsible for the delay.  There's two
dispatch threads, one for ops and one for rep_ops, right?  So one's
waiting on the other?
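
Roughly, the instrumentation is just a wall-clock wrapper around the call --
a simplified sketch, not the actual patch; the pthread mutex stands in for
osd_lock and do_dispatch() is a placeholder for the real dispatch work:

#include <pthread.h>
#include <stdio.h>
#include <time.h>

static double elapsed(const struct timespec *a, const struct timespec *b)
{
        return (double)(b->tv_sec - a->tv_sec) +
               (double)(b->tv_nsec - a->tv_nsec) / 1e9;
}

/* Time the whole dispatch call and, separately, how long it took to
 * acquire osd_lock; print both, like the "ms_dispatch ET ... osd_lock
 * ET ..." lines quoted below. */
void ms_dispatch_timed(pthread_mutex_t *osd_lock, void *msg,
                       void (*do_dispatch)(void *))
{
        struct timespec t0, t1, t2;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_mutex_lock(osd_lock);        /* how long do we wait here? */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        do_dispatch(msg);                    /* the real dispatch work */

        pthread_mutex_unlock(osd_lock);
        clock_gettime(CLOCK_MONOTONIC, &t2);

        fprintf(stderr, "ms_dispatch ET %.6f osd_lock ET %.6f msg %p\n",
                elapsed(&t0, &t2), elapsed(&t0, &t1), msg);
}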


There's just one main dispatcher; no split for the ops and rep_ops .
The reason for that dispatch_running is that if there are requests
waiting then the tick() function will run through them if the
messenger dispatch thread is currently idle.
But it is possible for the Messenger to try and dispatch, and for that
to be blocked while some amount of (usually trivial) work is being
done by a different thread, yes. I don't think we've ever observed it
being a problem for anything other than updating OSD maps, though...


Ah, OK.

I guess I was confused by my log output, e.g.:

osd.0.log:2277569:2012-02-02 09:23:41.666420 7fe5fe65e700 osd.0 31 ms_dispatch 
ET 0.990204 osd_lock ET 0.001438 msg 0xbe19400
osd.0.log:2277697:2012-02-02 09:23:41.669949 7fe5fee5f700 osd.0 31 ms_dispatch 
ET 0.993136 osd_lock ET 0.992708 msg 0x13afd680

I thought 7fe5fe65e700 and 7fe5fee5f700 identified the threads.

I need to go study that code some more





Another is that once a message receive starts, I see ~50 messages
that take tens of seconds to receive, when the nominal receive time is
a half-second or less.  I'm in the process of tooling up to collect
tcpdump data on all my clients to try to catch what is going on with
that.



Again, how are you instrumenting that?



I post-process the logs, looking at the time difference between
"reader got .* policy throttler" and "reader got .* osd_op(client."
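
(The post-processing is roughly the following -- a stripped-down sketch
rather than the actual script; it assumes the input has already been grepped
down to a single reader thread's lines:)

#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Parse the leading "YYYY-MM-DD HH:MM:SS.ffffff" timestamp. */
static double stamp(const char *line)
{
        struct tm tm = {0};
        double frac = 0.0;
        const char *rest = strptime(line, "%Y-%m-%d %H:%M:%S", &tm);

        if (!rest)
                return -1.0;
        if (*rest == '.')
                sscanf(rest + 1, "%lf", &frac);
        return (double)mktime(&tm) + frac / 1e6;   /* 6-digit microseconds */
}

int main(void)
{
        char line[8192];
        double t_throttle = -1.0;

        while (fgets(line, sizeof(line), stdin)) {
                if (strstr(line, "reader got") &&
                    strstr(line, "policy throttler")) {
                        t_throttle = stamp(line);          /* receive started */
                } else if (t_throttle >= 0.0 &&
                           strstr(line, "reader got") &&
                           strstr(line, "osd_op(client.")) {
                        printf("receive took %.6f s\n",    /* message decoded */
                               stamp(line) - t_throttle);
                        t_throttle = -1.0;
                }
        }
        return 0;
}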


I guess the logging output must have changed a bit at some point (or
was that one of your patches?). master has "reader wants" not "reader
got" for the policy throttler. (Just got a little confused when
checking the code.)


Yep, I added an extra message to make post-processing logs easier, sorry.




When I find a candidate message, I grep the log for just that reader thread,
and see, e.g., this:

osd.0.log:1280693:2012-02-02 09:17:57.704508 7fe5c9099700 --
172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
pgs=49 cs=1 l=1).reader got 2670720 from policy throttler 48809510/5000
seq 828/828 waiters 157/149 for src client.4301 tid=247
osd.0.log:1280694:2012-02-02 09:17:57.704525 7fe5c9099700 --
172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
pgs=49 cs=1 l=1).reader wants 2670720 from dispatch throttler
41944358/
osd.0.log:1280701:2012-02-02 09:17:57.704654 7fe5c9099700 --
172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
pgs=49 cs=1 l=1).reader got front 128
osd.0.log:1280705:2012-02-02 09:17:57.704752 7fe5c9099700 --
172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
pgs=49 cs=1 l=1).reader allocating new rx buffer at offset 0
osd.0.log:1280710:2012-02-02 09:17:57.704873 7fe5c9099700 --
172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
pgs=49 cs=1 l=1).reader reading nonblocking into 0x11922000 len 2670592
osd.0.log:1559767:2012-02-02 09:19:40.726589 7fe5c9099700 --
172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6a5cc len 1325620
osd.0.log:1561092:2012-02-02 09:19:40.927559 7fe5c9099700 --
172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6ab74 len 1324172

Note the ~2 minute delay (and ~300,000 lines of logging) between the first
and second reads.

During that time 129 sockets were processed - what makes sd=215 special?


Hrm. Well, you can try turning up the messenger debugging to 30 and
taking advantage of the "reader reading" / "reader read" pair right
around tcp_read_nonblocking.


OK, I'll give that a try as well, thanks.



I've added tracepoints in my client kernel try_write(), and nothing seems
unusual (that's with running the patch to ceph_write_space() I posted
earlier):

 kworker/0:2-1790  [000]  1543.200887: ceph_try_write_msg_done: peer osd0
tid 179 seq 3 sent 4194304
 kworker/0:2-1790  

Re: couldn't use rbd

2012-02-03 Thread Josh Durgin

On 02/03/2012 12:51 AM, Masuko Tomoya wrote:

Hi Josh,

Thank you for reply !


This might mean the rbd image list object can't be read for some
reason, or the rbd tool is doing something weird that the rados tool
isn't. Can you share the output of 'ceph -s' and
'rbd ls --log-to-stderr --debug-ms 1 --debug-objecter 20 --debug-monc 20
--debug-auth 20'?


The output of 'ceph -s' is below.
---
root@ceph01:~# ceph -s
2012-02-03 17:01:47.881960pg v33: 6 pgs: 6 active+clean+degraded;
0 KB data, 1056 KB used, 15357 MB / 15358 MB avail
2012-02-03 17:01:47.882583   mds e9: 1/1/1 up {0=0=up:creating}
2012-02-03 17:01:47.882733   osd e21: 1 osds: 1 up, 1 in
2012-02-03 17:01:47.883042   log 2012-02-03 16:35:11.183897 osd.0
10.68.119.191:6801/2912 12 : [WRN] map e19 wrongly marked me down or
wrong addr
2012-02-03 17:01:47.883144   mon e1: 1 mons at {0=10.68.119.191:6789/0}

The output of 'rbd ls --log-to-stderr --debug-ms 1 --debug-objecter 20
--debug-monc 20 --debug-auth 20' is below.
--
root@ceph01:~# rbd ls --log-to-stderr --debug-ms 1 --debug-objecter 20
--debug-monc 20 --debug-auth 20


snip


2012-02-03 17:02:10.971391 7f88cbb91720 client.4106.objecter op_submit
oid rbd_directory @2 [read 0~0] tid 1 osd.0
2012-02-03 17:02:10.971465 7f88cbb91720 client.4106.objecter send_op 1 to osd.0
2012-02-03 17:02:10.971533 7f88cbb91720 -- 10.68.119.191:0/1003500 --
10.68.119.191:6801/2912 -- osd_op(client.4106.0:1 rbd_directory [read
0~0] 2.30a98c1c) v1 -- ?+0 0x24664c0 con 0x24661b0


Everything above here is normal - the rbd tool connected to the
monitors, got the monmap and osdmap, and sent a request to read the
'rbd_directory' object.

snip


2012-02-03 17:02:25.969338 7f88c7261710 client.4106.objecter  tid 1 on
osd.0 is laggy


This means the osd isn't responding to the read. Check the osd log for
errors. If there's nothing obvious, add this to the osd section of your
ceph.conf and restart the osd:

debug osd = 20
debug ms = 1
debug filestore = 20

Then run 'rbd ls' and look at what happens after
'osd_op.*rbd_directory' appears in the osd log.

rados lspools log



I compared those logs and found there is differences.
'rbd list'
2012-02-03 17:02:10.971770 7f88c9366710 -- 10.68.119.191:0/1003500==
mon.0 10.68.119.191:6789/0 10  osd_map(21..21 src has 1..21) v2
 1284+0+0 (473305567 0 0) 0x24655a0 con 0x24603e0
2012-02-03 17:02:10.971789 7f88c9366710 client.4106.objecter
handle_osd_map ignoring epochs [21,21]= 21
2012-02-03 17:02:10.971801 7f88c9366710 client.4106.objecter
dump_active .. 0 homeless
2012-02-03 17:02:10.971815 7f88c9366710 client.4106.objecter 1
2.30a98c1c  osd.0   rbd_directory   [read 0~0]
--(snip)--

'rados lspools'
2012-02-03 17:11:52.866072 7f9c5764b710 -- 10.68.119.191:0/1003868==
mon.0 10.68.119.191:6789/0 7  osd_map(21..21 src has 1..21) v2
 1284+0+0 (473305567 0 0) 0x770a70 con 0x771440
2012-02-03 17:11:52.866103 7f9c5764b710 client.4107.objecter
handle_osd_map got epochs [21,21]  0
2012-02-03 17:11:52.866111 7f9c5764b710 client.4107.objecter
handle_osd_map decoding full epoch 21
2012-02-03 17:11:52.866272 7f9c5764b710 client.4107.objecter
dump_active .. 0 homeless
data
metadata
rbd
--(snip)--

What do these logs mean ?


The difference is that 'rbd ls' talks to the monitors and osds, while 
'rados lspools' just needs to talk to the monitors. The objecter 
dump_active part is listing in-flight osd requests.





*ceph cluster is configured on single server.
*server is ubuntu 10.10 maverick.
*ceph, librados2 and librbd1 packages are installed.
  (version: 0.38-1maverick)
*apparmor is disabled.



apparmor shouldn't matter if you have libvirt 0.9.9 or newer.


I use libvirt 0.8.3 (latest version for maverick), so I disabled apparmor.



2012/2/2 Josh Durginjosh.dur...@dreamhost.com:

On 02/02/2012 01:49 AM, Masuko Tomoya wrote:


Hi, all.

When I execute rbd command, it is not success.

root@ceph01:~# rbd list
(no response)

/var/log/ceph/mon.0.log
-
2012-02-02 17:58:19.801762 7ff4bbfb1710 -- 10.68.119.191:6789/0==
client.? 10.68.119.191:0/1002580 1  auth(proto 0 30 bytes) v1 
56+0+0 (625540289 0 0) 0x1619a00 con 0x1615a00
2012-02-02 17:58:19.801919 7ff4bbfb1710 -- 10.68.119.191:6789/0 --
10.68.119.191:0/1002580 -- auth_reply(proto 2 0 Success) v1 -- ?+0
0x1619c00 con 0x1615a00
2012-02-02 17:58:19.802505 7ff4bbfb1710 -- 10.68.119.191:6789/0==
client.? 10.68.119.191:0/1002580 2  auth(proto 2 32 bytes) v1 
58+0+0 (346146289 0 0) 0x161fc00 con 0x1615a00
2012-02-02 17:58:19.802673 7ff4bbfb1710 -- 10.68.119.191:6789/0 --
10.68.119.191:0/1002580 -- auth_reply(proto 2 0 Success) v1 -- ?+0
0x1619a00 con 0x1615a00
2012-02-02 17:58:19.803473 7ff4bbfb1710 -- 10.68.119.191:6789/0==
client.? 10.68.119.191:0/1002580 3  auth(proto 2 165 bytes) v1
 191+0+0 (3737796417 0 0) 0x1619600 con 0x1615a00
2012-02-02 17:58:19.803745 7ff4bbfb1710 -- 10.68.119.191:6789/0 --
10.68.119.191:0/1002580 -- auth_reply(proto 2 0 Success) v1 -- ?+0
0x161fc00 con 

Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-03 Thread Gregory Farnum
On Feb 3, 2012, at 8:18 AM, Jim Schutt jasc...@sandia.gov wrote:

 On 02/02/2012 05:28 PM, Gregory Farnum wrote:
 On Thu, Feb 2, 2012 at 12:22 PM, Jim Schuttjasc...@sandia.gov  wrote:
 I found 0 instances of waiting for commit in all my OSD logs for my last
 run.

 So I never waited on the journal?

 Looks like it. Interesting.


 So far I'm looking at two behaviours I've noticed that seem anomalous to
 me.

 One is that I instrumented ms_dispatch(), and I see it take
 a half-second or more several hundred times, out of several
 thousand messages.  Is that expected?


 How did you instrument it? If you wrapped the whole function it's
 possible that those longer runs are actually chewing through several
 messages that had to get waitlisted for some reason previously.
 (That's the call to do_waiters().)


 Yep, I wrapped the whole function, and also instrumented taking osd_lock
 while I was there.  About half the time that ms_dispatch() takes more than
 0.5 seconds, taking osd_lock is responsible for the delay.  There's two
 dispatch threads, one for ops and one for rep_ops, right?  So one's
 waiting on the other?

 There's just one main dispatcher; no split for the ops and rep_ops .
 The reason for that dispatch_running is that if there are requests
 waiting then the tick() function will run through them if the
 messenger dispatch thread is currently idle.
 But it is possible for the Messenger to try and dispatch, and for that
 to be blocked while some amount of (usually trivial) work is being
 done by a different thread, yes. I don't think we've ever observed it
 being a problem for anything other than updating OSD maps, though...

 Ah, OK.

 I guess I was confused by my log output, e.g.:

D'oh. Sorry, you confused me with your reference to repops, which
aren't special-cased or anything. But there are two messengers on the
OSD, each with their own dispatch thread. One of those messengers is
for clients and one is for other OSDs.

And now that you point that out, I wonder if the problem is lack of
Cond signaling in ms_dispatch. I'm on my phone right now but I believe
there's a chunk of commented-out code (why commented instead of
deleted? I don't know) that we want to uncomment for reasons that will
become clear when you look at it. :)
Try that and see how many of your problems disappear?
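
(A generic illustration of the failure mode I mean -- this is not the actual
messenger code, just the usual producer/consumer shape: if the enqueue path
forgets the cond signal, the dispatcher only wakes on its periodic timeout,
which from the outside looks like sporadic long dispatch latencies.)

#include <pthread.h>
#include <stdbool.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool have_work;

void enqueue(void)
{
        pthread_mutex_lock(&lock);
        have_work = true;
        pthread_cond_signal(&cond);   /* without this, work sits until timeout */
        pthread_mutex_unlock(&lock);
}

void *dispatcher(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock);
        for (;;) {
                while (!have_work) {
                        struct timespec ts;
                        clock_gettime(CLOCK_REALTIME, &ts);
                        ts.tv_sec += 1;   /* periodic fallback, like tick() */
                        pthread_cond_timedwait(&cond, &lock, &ts);
                }
                have_work = false;
                pthread_mutex_unlock(&lock);
                /* ... do the actual dispatch work here ... */
                pthread_mutex_lock(&lock);
        }
        return NULL;
}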



 osd.0.log:2277569:2012-02-02 09:23:41.666420 7fe5fe65e700 osd.0 31 
 ms_dispatch ET 0.990204 osd_lock ET 0.001438 msg 0xbe19400
 osd.0.log:2277697:2012-02-02 09:23:41.669949 7fe5fee5f700 osd.0 31 
 ms_dispatch ET 0.993136 osd_lock ET 0.992708 msg 0x13afd680

 I thought 7fe5fe65e700 and 7fe5fee5f700 identified the threads.

 I need to go study that code some more



 Another is that once a message receive starts, I see ~50 messages
 that take tens of seconds to receive, when the nominal receive time is
 a half-second or less.  I'm in the process of tooling up to collect
 tcpdump data on all my clients to try to catch what is going on with
 that.


 Again, how are you instrumenting that?


 I post-process the logs, looking at the time difference between
 reader got .* policy throttler and reader got .* osd_op(client.

 I guess the logging output must have changed a bit at some point (or
 was that one of your patches?). master has reader wants not reader
 got for the policy throttler. (Just got a little confused when
 checking the code.)

 Yep, I added an extra message to make post-processing logs easier, sorry.


 When I find a candidate message, I grep the log for just that reader thread,
 and see, e.g., this:

 osd.0.log:1280693:2012-02-02 09:17:57.704508 7fe5c9099700 --
 172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
 pgs=49 cs=1 l=1).reader got 2670720 from policy throttler 48809510/5000
 seq 828/828 waiters 157/149 for src client.4301 tid=247
 osd.0.log:1280694:2012-02-02 09:17:57.704525 7fe5c9099700 --
 172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
 pgs=49 cs=1 l=1).reader wants 2670720 from dispatch throttler
 41944358/
 osd.0.log:1280701:2012-02-02 09:17:57.704654 7fe5c9099700 --
 172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
 pgs=49 cs=1 l=1).reader got front 128
 osd.0.log:1280705:2012-02-02 09:17:57.704752 7fe5c9099700 --
 172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
 pgs=49 cs=1 l=1).reader allocating new rx buffer at offset 0
 osd.0.log:1280710:2012-02-02 09:17:57.704873 7fe5c9099700 --
 172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11922000 len 2670592
 osd.0.log:1559767:2012-02-02 09:19:40.726589 7fe5c9099700 --
 172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6a5cc len 1325620
 osd.0.log:1561092:2012-02-02 09:19:40.927559 7fe5c9099700 --
 172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
 

Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-03 Thread Sage Weil
On Fri, 3 Feb 2012, Jim Schutt wrote:
 On 02/02/2012 05:28 PM, Gregory Farnum wrote:
  On Thu, Feb 2, 2012 at 12:22 PM, Jim Schuttjasc...@sandia.gov  wrote:
   I found 0 instances of waiting for commit in all my OSD logs for my last
   run.
   
   So I never waited on the journal?
  
  Looks like it. Interesting.
  
  
 So far I'm looking at two behaviours I've noticed that seem anomalous
 to
 me.
 
 One is that I instrumented ms_dispatch(), and I see it take
 a half-second or more several hundred times, out of several
 thousand messages.  Is that expected?


How did you instrument it? If you wrapped the whole function it's
possible that those longer runs are actually chewing through several
messages that had to get waitlisted for some reason previously.
(That's the call to do_waiters().)
   
   
   Yep, I wrapped the whole function, and also instrumented taking osd_lock
   while I was there.  About half the time that ms_dispatch() takes more than
   0.5 seconds, taking osd_lock is responsible for the delay.  There's two
   dispatch threads, one for ops and one for rep_ops, right?  So one's
   waiting on the other?
  
  There's just one main dispatcher; no split for the ops and rep_ops .
  The reason for that dispatch_running is that if there are requests
  waiting then the tick() function will run through them if the
  messenger dispatch thread is currently idle.
  But it is possible for the Messenger to try and dispatch, and for that
  to be blocked while some amount of (usually trivial) work is being
  done by a different thread, yes. I don't think we've ever observed it
  being a problem for anything other than updating OSD maps, though...
 
 Ah, OK.
 
 I guess I was confused by my log output, e.g.:
 
 osd.0.log:2277569:2012-02-02 09:23:41.666420 7fe5fe65e700 osd.0 31 ms_dispatch
 ET 0.990204 osd_lock ET 0.001438 msg 0xbe19400
 osd.0.log:2277697:2012-02-02 09:23:41.669949 7fe5fee5f700 osd.0 31 ms_dispatch
 ET 0.993136 osd_lock ET 0.992708 msg 0x13afd680
 
 I thought 7fe5fe65e700 and 7fe5fee5f700 identified the threads.
 
 I need to go study that code some more

Oh... they are separate threads.  In the OSD's case two different 
messengers (the public and cluster ones) are wired up to the same 
dispatcher (OSD::ms_dispatch).  MOSDOps come in on the public thread, 
MOSDSubOps on the cluster one, but they're both fed to the same function.  
That's why there's some funkiness going on in, say, handle_osd_map().

But like Greg said, I don't think we've seen any significant latencies 
there except from map processing.  If you have a log, that would be 
interesting to look at!

sage


 
  
  
 Another is that once a message receive starts, I see ~50 messages
 that take tens of seconds to receive, when the nominal receive time is
 a half-second or less.  I'm in the process of tooling up to collect
 tcpdump data on all my clients to try to catch what is going on with
 that.


Again, how are you instrumenting that?
   
   
   I post-process the logs, looking at the time difference between
   reader got .* policy throttler and reader got .* osd_op(client.
  
  I guess the logging output must have changed a bit at some point (or
  was that one of your patches?). master has reader wants not reader
  got for the policy throttler. (Just got a little confused when
  checking the code.)
 
 Yep, I added an extra message to make post-processing logs easier, sorry.
 
  
   When I find a candidate message, I grep the log for just that reader
   thread,
   and see, e.g., this:
   
   osd.0.log:1280693:2012-02-02 09:17:57.704508 7fe5c9099700 --
   172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680
   sd=215
   pgs=49 cs=1 l=1).reader got 2670720 from policy throttler
   48809510/5000
   seq 828/828 waiters 157/149 for src client.4301 tid=247
   osd.0.log:1280694:2012-02-02 09:17:57.704525 7fe5c9099700 --
   172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680
   sd=215
   pgs=49 cs=1 l=1).reader wants 2670720 from dispatch throttler
   41944358/
   osd.0.log:1280701:2012-02-02 09:17:57.704654 7fe5c9099700 --
   172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680
   sd=215
   pgs=49 cs=1 l=1).reader got front 128
   osd.0.log:1280705:2012-02-02 09:17:57.704752 7fe5c9099700 --
   172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680
   sd=215
   pgs=49 cs=1 l=1).reader allocating new rx buffer at offset 0
   osd.0.log:1280710:2012-02-02 09:17:57.704873 7fe5c9099700 --
   172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680
   sd=215
   pgs=49 cs=1 l=1).reader reading nonblocking into 0x11922000 len 2670592
   osd.0.log:1559767:2012-02-02 09:19:40.726589 7fe5c9099700 --
   172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680
   sd=215
   pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6a5cc len 1325620
   osd.0.log:1561092:2012-02-02 

Possible RBD inconsistencies with kvm+Windows 7

2012-02-03 Thread Josh Pieper
I have a Windows 7 guest running under kvm/libvirt with RBD as a
backend to a cluster of 3 OSDs.  With this setup, I am seeing behavior
that looks suspiciously like disk corruption in the guest VM executing
some of our workloads.

For instance, in one occurrence, there is a python function that
recursively deletes a large directory tree while the disk is otherwise
loaded.  For us, this occasionally fails because the OS reports that
all the files in the directory were deleted, but then reports that the
directory is not empty when going to remove it.  In another, a simple
test application writes new files to a directory every 50ms, then
after 6s verifies that at least 3 files were written, also while the
disk is under heavy load.

We have never ever seen these failures on bare metal, or on kvm
instances backed by an LVM volume in years of operation, but they
happen every couple of hours with RBD.  Unfortunately, I have been
unsuccessful when attempting to create synthetic test cases to
demonstrate the inconsistent RBD behavior.

Has anyone else seen similar inconsistent RBD behavior, or have ideas
how to diagnose further?

For reference, I am running ceph 0.41, qemu-kvm 1.0 on ubuntu 11.10
amd64.

Regards,
Josh


how to remove laggy mds ?

2012-02-03 Thread Jens Rehpöhler
Hi List,

one of my test mds servers died a few days ago (hardware crash). I will
not buy a new one.

Is there any chance to remove this laggy mds?

2012-02-03 20:38:53.801623   mds e86436: 2/2/1 up
{0=0=up:resolve,1=0=up:resolve(laggy or crashed)}
2012-02-03 20:39:08.943880   mds e86437: 2/2/1 up
{0=0=up:resolve,1=0=up:resolve(laggy or crashed)}
2012-02-03 20:39:24.094926   mds e86438: 2/2/1 up
{0=0=up:resolve,1=0=up:resolve(laggy or crashed)}
2012-02-03 20:39:39.236560   mds e86439: 2/2/1 up
{0=0=up:resolve,1=0=up:resolve(laggy or crashed)}

Thanks !!

Jens


Re: Possible RBD inconsistencies with kvm+Windows 7

2012-02-03 Thread Josh Durgin

On 02/03/2012 10:19 AM, Josh Pieper wrote:

I have a Windows 7 guest running under kvm/libvirt with RBD as a
backend to a cluster of 3 OSDs.  With this setup, I am seeing behavior
that looks suspiciously like disk corruption in the guest VM executing
some of our workloads.

For instance, in one occurance, there is a python function that
recursively deletes a large directory tree while the disk is otherwise
loaded.  For us, this occasionally fails because the OS reported that
all the files in the directory were deleted, but then reports the
directory is not empty when going to remove it.  In another, a simple
test application writes new files to a directory every 50ms, then
after 6s verifies that at least 3 files were written, also while the
disk is under heavy load.

We have never ever seen these failures on bare metal, or on kvm
instances backed by a LVM volume in years of operation, but they
happen every couple of hours with RBD.  Unfortunately, I have been
unsuccessful when attempting to create synthetic test cases to
demonstrate the inconsistent RBD behavior.

Has anyone else seen similar inconsistent RBD behavior, or have ideas
how to diagnose further?


What fs are your osds using? A while ago there was a bug in ext4's
fiemap that sometimes caused incorrect reads - if you set
filestore_fiemap_threshold larger than your object size, you can test
whether fiemap is the problem.
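
For example, something along these lines in the [osd] section (the 8 MB value
here is just an assumption; anything larger than your object size will do):

        filestore_fiemap_threshold = 8388608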

Are you using the rbd_writeback_window option? If so, does the
corruption occur without it?

In any case, a log of this occurring with debug_ms=1 and debug_rbd=20 
from qemu will tell us if there are out-of-order operations happening.




For reference, I am running ceph 0.41, qemu-kvm 1.0 on ubuntu 11.10
amd64.

Regards,
Josh




Re: Possible RBD inconsistencies with kvm+Windows 7

2012-02-03 Thread Josh Pieper
Josh Durgin wrote:
 On 02/03/2012 10:19 AM, Josh Pieper wrote:
 I have a Windows 7 guest running under kvm/libvirt with RBD as a
 backend to a cluster of 3 OSDs.  With this setup, I am seeing behavior
 that looks suspiciously like disk corruption in the guest VM executing
 some of our workloads.
 
 For instance, in one occurance, there is a python function that
 recursively deletes a large directory tree while the disk is otherwise
 loaded.  For us, this occasionally fails because the OS reported that
 all the files in the directory were deleted, but then reports the
 directory is not empty when going to remove it.  In another, a simple
 test application writes new files to a directory every 50ms, then
 after 6s verifies that at least 3 files were written, also while the
 disk is under heavy load.
 
 We have never ever seen these failures on bare metal, or on kvm
 instances backed by a LVM volume in years of operation, but they
 happen every couple of hours with RBD.  Unfortunately, I have been
 unsuccessful when attempting to create synthetic test cases to
 demonstrate the inconsistent RBD behavior.
 
 Has anyone else seen similar inconsistent RBD behavior, or have ideas
 how to diagnose further?
 
 What fs are your osds using? A while ago there was a bug in ext4's
 fiemap that sometimes caused incorrect reads - if you set
 filestore_fiemap_threshold larger than your object size, you can test
 whether fiemap is the problem.

The OSDs are using xfs.  In my testing with 0.40, btrfs had incredible
performance problems after a day or so of operation.  The last I
heard, ext4 could potentially have data loss due to its limited xattr
support.

 Are you using the rbd_writeback_window option? If so, does the
 corruption occur without it?

Yes I was.  In prior tests, performance was abysmal without it.  I
will test without it, but our runs will load the system very
differently when they are going so slowly.

 In any case, a log of this occurring with debug_ms=1 and
 debug_rbd=20 from qemu will tell us if there are out-of-order
 operations happening.

Great, I will attempt to record some.

Regards,
Josh


Re: [PATCH RESEND] ceph: avoid panic with mismatched symlink sizes in fill_inode()

2012-02-03 Thread Alex Elder
On Fri, 2012-02-03 at 09:55 -0500, Xi Wang wrote:
 Return -EINVAL rather than panic if iinfo->symlink_len and
 inode->i_size do not match.
 
 Also use kstrndup rather than kmalloc/memcpy.
 
 Signed-off-by: Xi Wang xi.w...@gmail.com

Looks good, though it might be good to at least call
WARN_ON().  What do you think?

I will commit this (either with or without the
WARN_ON()).

Reviewed-by: Alex Elder el...@dreamhost.com




[PATCH v2] ceph: avoid panic with mismatched symlink sizes in fill_inode()

2012-02-03 Thread Xi Wang
Return -EINVAL rather than panic if iinfo->symlink_len and inode->i_size
do not match.

Also use kstrndup rather than kmalloc/memcpy.

Signed-off-by: Xi Wang xi.w...@gmail.com
---
 fs/ceph/inode.c |   11 ++++++-----
 1 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 2c48937..9fff9f3 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -677,18 +677,19 @@ static int fill_inode(struct inode *inode,
         case S_IFLNK:
                 inode->i_op = &ceph_symlink_iops;
                 if (!ci->i_symlink) {
-                        int symlen = iinfo->symlink_len;
+                        u32 symlen = iinfo->symlink_len;
                         char *sym;
 
-                        BUG_ON(symlen != inode->i_size);
                         spin_unlock(&ci->i_ceph_lock);
 
+                        err = -EINVAL;
+                        if (WARN_ON(symlen != inode->i_size))
+                                goto out;
+
                         err = -ENOMEM;
-                        sym = kmalloc(symlen+1, GFP_NOFS);
+                        sym = kstrndup(iinfo->symlink, symlen, GFP_NOFS);
                         if (!sym)
                                 goto out;
-                        memcpy(sym, iinfo->symlink, symlen);
-                        sym[symlen] = 0;
 
                         spin_lock(&ci->i_ceph_lock);
                         if (!ci->i_symlink)
-- 
1.7.5.4



Re: [PATCH RESEND] ceph: avoid panic with mismatched symlink sizes in fill_inode()

2012-02-03 Thread Alex Elder
On Fri, 2012-02-03 at 15:49 -0500, Xi Wang wrote:
 On Feb 3, 2012, at 3:24 PM, Alex Elder wrote:
  Looks good, though it might good to at least call
  WARN_ON().  What do you think?
 
 Sounds good to me.  I will send a v2.  Thanks.

No need.  I can do it for you.  -Alex

 - xi





Re: how to remove laggy mds ?

2012-02-03 Thread Tommi Virtanen
On Fri, Feb 3, 2012 at 11:48, Jens Rehpöhler j...@shadow.gt.owl.de wrote:
 one of my test mds servers died a few days ago. (hardware crash) I will
 not buy a new one.

 Is there any chance to remove this laggy mds ?

Start the mds daemon somewhere. If you don't want to fix/replace the
hardware, well, start it somewhere else. It has no local state.

Decreasing the number of active mdses is not currently supported.


Re: how to remove laggy mds ?

2012-02-03 Thread Gregory Farnum
On Fri, Feb 3, 2012 at 1:19 PM, Tommi Virtanen
tommi.virta...@dreamhost.com wrote:
 On Fri, Feb 3, 2012 at 11:48, Jens Rehpöhler j...@shadow.gt.owl.de wrote:
 one of my test mds servers died a few days ago. (hardware crash) I will
 not buy a new one.

 Is there any chance to remove this laggy mds ?

 Start the mds daemon somewhere. If you don't want to fix/replace the
 hardware, well, start it somewhere else. It has no local state.

 Decreasing the number of active mdses is not currently supported.

Actually, it should work, if you have a daemon running for that MDS.
Once it's running, execute ceph mds stop 1 and it should export
authority to the remaining MDS and shut down cleanly.
-Greg


Re: couldn't use rbd

2012-02-03 Thread Masuko Tomoya
Hi Josh,

Thank you for your comments.

debug osd = 20
debug ms = 1
debug filestore = 20

I added this to the osd section of ceph.conf and ran /etc/init.d/ceph
stop and then start.

The output of OSD.log when 'rbd list' was executed is below.

-
2012-02-04 04:29:22.457990 7fe0e08fb710 osd.0 32
OSD::ms_verify_authorizer name=client.admin auid=0
2012-02-04 04:29:22.458041 7fe0e08fb710 osd.0 32  new session
0x24f5240 con=0x24d4dc0 addr=10.68.119.191:0/1005110
2012-02-04 04:29:22.458069 7fe0e08fb710 osd.0 32  session 0x24f5240
has caps osdcaps(pools={} default allow= default_deny=)
2012-02-04 04:29:22.458415 7fe0e6c0a710 -- 10.68.119.191:6801/4992 ==
client.4201 10.68.119.191:0/1005110 1  osd_op(client.4201.0:1
rbd_directory [read 0~0] 2.30a98c1c) v3  143+0+0 (3720164172 0 0)
0x24d8900 con 0x24d4dc0
2012-02-04 04:29:22.458442 7fe0e6c0a710 osd.0 32 _dispatch 0x24d8900
osd_op(client.4201.0:1 rbd_directory [read 0~0] 2.30a98c1c) v3
2012-02-04 04:29:22.458463 7fe0e6c0a710 osd.0 32
require_same_or_newer_map 32 (i am 32) 0x24d8900
2012-02-04 04:29:22.458487 7fe0e6c0a710 osd.0 32 _share_map_incoming
client.4201 10.68.119.191:0/1005110 32
2012-02-04 04:29:22.458507 7fe0e6c0a710 osd.0 32 hit non-existent pg
2.0, waiting
2012-02-04 04:29:22.641501 7fe0edc18710 osd.0 32 tick
2012-02-04 04:29:22.641695 7fe0edc18710 osd.0 32 scrub_should_schedule
loadavg 0  max 0.5 = no, randomly backing off
2012-02-04 04:29:22.718517 7fe0e12fe710 osd.0 32 update_osd_stat
osd_stat(1400 KB used, 15357 MB avail, 15358 MB total, peers []/[])
2012-02-04 04:29:22.718606 7fe0e12fe710 osd.0 32 heartbeat:
osd_stat(1400 KB used, 15357 MB avail, 15358 MB total, peers []/[])
2012-02-04 04:29:23.642595 7fe0edc18710 osd.0 32 tick
2012-02-04 04:29:23.642783 7fe0edc18710 osd.0 32 scrub_should_schedule
loadavg 0  max 0.5 = no, randomly backing off
2012-02-04 04:29:23.719789 7fe0e12fe710 osd.0 32 update_osd_stat
osd_stat(1400 KB used, 15357 MB avail, 15358 MB total, peers []/[])
2012-02-04 04:29:23.719861 7fe0e12fe710 osd.0 32 heartbeat:
osd_stat(1400 KB used, 15357 MB avail, 15358 MB total, peers []/[])
2012-02-04 04:29:24.620713 7fe0e12fe710 osd.0 32 update_osd_stat
osd_stat(1400 KB used, 15357 MB avail, 15358 MB total, peers []/[])
2012-02-04 04:29:24.620789 7fe0e12fe710 osd.0 32 heartbeat:
osd_stat(1400 KB used, 15357 MB avail, 15358 MB total, peers []/[])
2012-02-04 04:29:24.643651 7fe0edc18710 osd.0 32 tick
2012-02-04 04:29:24.643830 7fe0edc18710 osd.0 32 scrub_should_schedule
loadavg 0  max 0.5 = yes
2012-02-04 04:29:24.643845 7fe0edc18710 osd.0 32 sched_scrub
2012-02-04 04:29:24.643867 7fe0edc18710 osd.0 32  0.0p0 at 2012-02-03
10:25:51.380648  2012-02-03 04:29:24.643851 (86400 seconds ago)
2012-02-04 04:29:24.643875 7fe0edc18710 osd.0 32 sched_scrub done
2012-02-04 04:29:24.709018 7fe0e6c0a710 -- 10.68.119.191:6801/4992 ==
mds.0 10.68.119.191:6800/4916 444  ping v1  0+0+0 (0 0 0)
0x251b600 con 0x2500640
2012-02-04 04:29:24.709091 7fe0e6c0a710 osd.0 32 _dispatch 0x251b600 ping v1
2012-02-04 04:29:24.709102 7fe0e6c0a710 osd.0 32 ping from mds.0
2012-02-04 04:29:25.321771 7fe0e12fe710 osd.0 32 update_osd_stat
osd_stat(1400 KB used, 15357 MB avail, 15358 MB total, peers []/[])
2012-02-04 04:29:25.321849 7fe0e12fe710 osd.0 32 heartbeat:
osd_stat(1400 KB used, 15357 MB avail, 15358 MB total, peers []/[])
2012-02-04 04:29:25.644596 7fe0edc18710 osd.0 32 tick
2012-02-04 04:29:25.644812 7fe0edc18710 osd.0 32 scrub_should_schedule
loadavg 0  max 0.5 = no, randomly backing off
2012-02-04 04:29:25.644832 7fe0edc18710 osd.0 32 do_mon_report
2012-02-04 04:29:25.644843 7fe0edc18710 osd.0 32 send_alive up_thru
currently 29 want 29
2012-02-04 04:29:25.644853 7fe0edc18710 osd.0 32 send_pg_stats
2012-02-04 04:29:25.965641 7fe0ea411710 filestore(/data/osd0)
sync_entry woke after 5.000199
2012-02-04 04:29:25.965748 7fe0ea411710 filestore(/data/osd0)
sync_entry waiting for max_interval 5.00
2012-02-04 04:29:26.522826 7fe0e12fe710 osd.0 32 update_osd_stat
osd_stat(1400 KB used, 15357 MB avail, 15358 MB total, peers []/[])
2012-02-04 04:29:26.522899 7fe0e12fe710 osd.0 32 heartbeat:
osd_stat(1400 KB used, 15357 MB avail, 15358 MB total, peers []/[])
2012-02-04 04:29:26.645739 7fe0edc18710 osd.0 32 tick
2012-02-04 04:29:26.645943 7fe0edc18710 osd.0 32 scrub_should_schedule
loadavg 0  max 0.5 = no, randomly backing off
2012-02-04 04:29:27.646788 7fe0edc18710 osd.0 32 tick
2012-02-04 04:29:27.646982 7fe0edc18710 osd.0 32 scrub_should_schedule
loadavg 0  max 0.5 = yes
2012-02-04 04:29:27.646997 7fe0edc18710 osd.0 32 sched_scrub
2012-02-04 04:29:27.647018 7fe0edc18710 osd.0 32  0.0p0 at 2012-02-03
10:25:51.380648  2012-02-03 04:29:27.647003 (86400 seconds ago)
2012-02-04 04:29:27.647027 7fe0edc18710 osd.0 32 sched_scrub done
2012-02-04 04:29:27.723893 7fe0e12fe710 osd.0 32 update_osd_stat
osd_stat(1400 KB used, 15357 MB avail, 15358 MB total, peers []/[])
2012-02-04 04:29:27.723978 7fe0e12fe710 osd.0 32 heartbeat:
osd_stat(1400 KB used, 15357 MB 

Re: couldn't use rbd

2012-02-03 Thread Josh Durgin

On 02/03/2012 02:14 PM, Masuko Tomoya wrote:

Hi Josh,

Thank you for your comments.


debug osd = 20
debug ms = 1
debug filestore = 20


I added this to the osd section of ceph.conf and ran /etc/init.d/ceph
stopstart.

The output of OSD.log when 'rbd list' was executed is below.

-
2012-02-04 04:29:22.457990 7fe0e08fb710 osd.0 32
OSD::ms_verify_authorizer name=client.admin auid=0
2012-02-04 04:29:22.458041 7fe0e08fb710 osd.0 32  new session
0x24f5240 con=0x24d4dc0 addr=10.68.119.191:0/1005110
2012-02-04 04:29:22.458069 7fe0e08fb710 osd.0 32  session 0x24f5240
has caps osdcaps(pools={} default allow= default_deny=)
2012-02-04 04:29:22.458415 7fe0e6c0a710 -- 10.68.119.191:6801/4992==
client.4201 10.68.119.191:0/1005110 1  osd_op(client.4201.0:1
rbd_directory [read 0~0] 2.30a98c1c) v3  143+0+0 (3720164172 0 0)
0x24d8900 con 0x24d4dc0
2012-02-04 04:29:22.458442 7fe0e6c0a710 osd.0 32 _dispatch 0x24d8900
osd_op(client.4201.0:1 rbd_directory [read 0~0] 2.30a98c1c) v3
2012-02-04 04:29:22.458463 7fe0e6c0a710 osd.0 32
require_same_or_newer_map 32 (i am 32) 0x24d8900
2012-02-04 04:29:22.458487 7fe0e6c0a710 osd.0 32 _share_map_incoming
client.4201 10.68.119.191:0/1005110 32
2012-02-04 04:29:22.458507 7fe0e6c0a710 osd.0 32 hit non-existent pg
2.0, waiting


The pg should have been created already. What's the output of 'ceph pg 
dump'?





2012/2/3 Josh Durginjosh.dur...@dreamhost.com:

On 02/03/2012 12:51 AM, Masuko Tomoya wrote:


Hi Josh,

Thank you for reply !


This might mean the rbd image list object can't be read for some
reason, or the rbd tool is doing something weird that the rados tool
isn't. Can you share the output of 'ceph -s' and
'rbd ls --log-to-stderr --debug-ms 1 --debug-objecter 20 --debug-monc 20
--debug-auth 20'?



The output of 'ceph -s' is below.
---
root@ceph01:~# ceph -s
2012-02-03 17:01:47.881960pg v33: 6 pgs: 6 active+clean+degraded;
0 KB data, 1056 KB used, 15357 MB / 15358 MB avail
2012-02-03 17:01:47.882583   mds e9: 1/1/1 up {0=0=up:creating}
2012-02-03 17:01:47.882733   osd e21: 1 osds: 1 up, 1 in
2012-02-03 17:01:47.883042   log 2012-02-03 16:35:11.183897 osd.0
10.68.119.191:6801/2912 12 : [WRN] map e19 wrongly marked me down or
wrong addr
2012-02-03 17:01:47.883144   mon e1: 1 mons at {0=10.68.119.191:6789/0}

The output of 'rbd ls --log-to-stderr --debug-ms 1 --debug-objecter 20
--debug-monc 20 --debug-auth 20' is below.
--
root@ceph01:~# rbd ls --log-to-stderr --debug-ms 1 --debug-objecter 20
--debug-monc 20 --debug-auth 20



snip



2012-02-03 17:02:10.971391 7f88cbb91720 client.4106.objecter op_submit
oid rbd_directory @2 [read 0~0] tid 1 osd.0
2012-02-03 17:02:10.971465 7f88cbb91720 client.4106.objecter send_op 1 to
osd.0
2012-02-03 17:02:10.971533 7f88cbb91720 -- 10.68.119.191:0/1003500 --
10.68.119.191:6801/2912 -- osd_op(client.4106.0:1 rbd_directory [read
0~0] 2.30a98c1c) v1 -- ?+0 0x24664c0 con 0x24661b0



Everything above here is normal - the rbd tool connected to the
monitors, got the monmap and osdmap, and sent a request to read the
'rbd_directory' object.

snip



2012-02-03 17:02:25.969338 7f88c7261710 client.4106.objecter  tid 1 on
osd.0 is laggy



This means the osd isn't responding to the read. Check the osd log for
errors. If there's nothing obvious, add this to the osd section of your
ceph.conf and restart the osd:

debug osd = 20
debug ms = 1
debug filestore = 20

Then run 'rbd ls' and look at what happens after
'osd_op.*rbd_directory' appears in the osd log.

rados lspools log




I compared those logs and found there is differences.
'rbd list'
2012-02-03 17:02:10.971770 7f88c9366710 -- 10.68.119.191:0/1003500==
mon.0 10.68.119.191:6789/0 10  osd_map(21..21 src has 1..21) v2
 1284+0+0 (473305567 0 0) 0x24655a0 con 0x24603e0
2012-02-03 17:02:10.971789 7f88c9366710 client.4106.objecter
handle_osd_map ignoring epochs [21,21]= 21
2012-02-03 17:02:10.971801 7f88c9366710 client.4106.objecter
dump_active .. 0 homeless
2012-02-03 17:02:10.971815 7f88c9366710 client.4106.objecter 1
2.30a98c1c  osd.0   rbd_directory   [read 0~0]
--(snip)--

'rados lspools'
2012-02-03 17:11:52.866072 7f9c5764b710 -- 10.68.119.191:0/1003868==
mon.0 10.68.119.191:6789/0 7  osd_map(21..21 src has 1..21) v2
 1284+0+0 (473305567 0 0) 0x770a70 con 0x771440
2012-02-03 17:11:52.866103 7f9c5764b710 client.4107.objecter
handle_osd_map got epochs [21,21]0
2012-02-03 17:11:52.866111 7f9c5764b710 client.4107.objecter
handle_osd_map decoding full epoch 21
2012-02-03 17:11:52.866272 7f9c5764b710 client.4107.objecter
dump_active .. 0 homeless
data
metadata
rbd
--(snip)--

What do these logs mean ?



The difference is that 'rbd ls' talks to the monitors and osds, while 'rados
lspools' just needs to talk to the monitors. The objecter dump_active part
is listing in-flight osd requests.





*ceph cluster is configured on single server.
*server is ubuntu 10.10 maverick.
*ceph, librados2 and librbd1 packages are installed.
  

Re: couldn't use rbd

2012-02-03 Thread Masuko Tomoya
Hi,

The output of 'ceph pg dump' is below.

root@ceph01:~# ceph pg dump
2012-02-04 07:50:15.453151 mon - [pg,dump]
2012-02-04 07:50:15.453734 mon.0 - 'dumped all in format plain' (0)
version 63
last_osdmap_epoch 37
last_pg_scan 1
full_ratio 0.95
nearfull_ratio 0.85
pg_stat objects mip degrunf kb  bytes   log
disklog state   v   reportedup  acting  last_scrub
1.1p0   0   0   0   0   0   0   0   0
 active+clean+degraded   0'0 34'30   [0] [0] 0'0
2012-02-03 10:25:55.383343
0.0p0   0   0   0   0   0   0   0   0
 active+clean+degraded   0'0 34'32   [0] [0] 0'0
2012-02-03 10:25:51.380648
1.0p0   0   0   0   0   0   0   0   0
 active+clean+degraded   0'0 34'30   [0] [0] 0'0
2012-02-03 10:25:53.381291
0.1p0   0   0   0   0   0   0   0   0
 active+clean+degraded   0'0 34'32   [0] [0] 0'0
2012-02-03 10:25:52.380881
2.0p0   0   0   0   0   0   0   0   0
 active+clean+degraded   0'0 34'30   [0] [0] 0'0
2012-02-03 10:25:59.387441
2.1p0   0   0   0   0   0   0   0   0
 active+clean+degraded   0'0 34'30   [0] [0] 0'0
2012-02-03 10:26:04.392778
pool 0  0   0   0   0   0   0   0   0
pool 1  0   0   0   0   0   0   0   0
pool 2  0   0   0   0   0   0   0   0
 sum0   0   0   0   0   0   0   0
osdstat kbused  kbavail kb  hb in   hb out
0   15681572603215727600[]  []
 sum15681572603215727600


2012/2/3 Josh Durgin josh.dur...@dreamhost.com:
 On 02/03/2012 02:14 PM, Masuko Tomoya wrote:

 Hi Josh,

 Thank you for your comments.

    debug osd = 20
    debug ms = 1
    debug filestore = 20


 I added this to the osd section of ceph.conf and ran /etc/init.d/ceph
 stopstart.

 The output of OSD.log when 'rbd list' was executed is below.

 -
 2012-02-04 04:29:22.457990 7fe0e08fb710 osd.0 32
 OSD::ms_verify_authorizer name=client.admin auid=0
 2012-02-04 04:29:22.458041 7fe0e08fb710 osd.0 32  new session
 0x24f5240 con=0x24d4dc0 addr=10.68.119.191:0/1005110
 2012-02-04 04:29:22.458069 7fe0e08fb710 osd.0 32  session 0x24f5240
 has caps osdcaps(pools={} default allow= default_deny=)
 2012-02-04 04:29:22.458415 7fe0e6c0a710 -- 10.68.119.191:6801/4992==
 client.4201 10.68.119.191:0/1005110 1  osd_op(client.4201.0:1
 rbd_directory [read 0~0] 2.30a98c1c) v3  143+0+0 (3720164172 0 0)
 0x24d8900 con 0x24d4dc0
 2012-02-04 04:29:22.458442 7fe0e6c0a710 osd.0 32 _dispatch 0x24d8900
 osd_op(client.4201.0:1 rbd_directory [read 0~0] 2.30a98c1c) v3
 2012-02-04 04:29:22.458463 7fe0e6c0a710 osd.0 32
 require_same_or_newer_map 32 (i am 32) 0x24d8900
 2012-02-04 04:29:22.458487 7fe0e6c0a710 osd.0 32 _share_map_incoming
 client.4201 10.68.119.191:0/1005110 32
 2012-02-04 04:29:22.458507 7fe0e6c0a710 osd.0 32 hit non-existent pg
 2.0, waiting


 The pg should have been created already. What's the output of 'ceph pg
 dump'?




 2012/2/3 Josh Durginjosh.dur...@dreamhost.com:

 On 02/03/2012 12:51 AM, Masuko Tomoya wrote:


 Hi Josh,

 Thank you for reply !

 This might mean the rbd image list object can't be read for some
 reason, or the rbd tool is doing something weird that the rados tool
 isn't. Can you share the output of 'ceph -s' and
 'rbd ls --log-to-stderr --debug-ms 1 --debug-objecter 20 --debug-monc
 20
 --debug-auth 20'?



 The output of 'ceph -s' is below.
 ---
 root@ceph01:~# ceph -s
 2012-02-03 17:01:47.881960    pg v33: 6 pgs: 6 active+clean+degraded;
 0 KB data, 1056 KB used, 15357 MB / 15358 MB avail
 2012-02-03 17:01:47.882583   mds e9: 1/1/1 up {0=0=up:creating}
 2012-02-03 17:01:47.882733   osd e21: 1 osds: 1 up, 1 in
 2012-02-03 17:01:47.883042   log 2012-02-03 16:35:11.183897 osd.0
 10.68.119.191:6801/2912 12 : [WRN] map e19 wrongly marked me down or
 wrong addr
 2012-02-03 17:01:47.883144   mon e1: 1 mons at {0=10.68.119.191:6789/0}

 The output of 'rbd ls --log-to-stderr --debug-ms 1 --debug-objecter 20
 --debug-monc 20 --debug-auth 20' is below.
 --
 root@ceph01:~# rbd ls --log-to-stderr --debug-ms 1 --debug-objecter 20
 --debug-monc 20 --debug-auth 20



 snip


 2012-02-03 17:02:10.971391 7f88cbb91720 client.4106.objecter op_submit
 oid rbd_directory @2 [read 0~0] tid 1 osd.0
 2012-02-03 17:02:10.971465 7f88cbb91720 client.4106.objecter send_op 1
 to
 osd.0
 2012-02-03 17:02:10.971533 7f88cbb91720 -- 10.68.119.191:0/1003500 --
 10.68.119.191:6801/2912 -- osd_op(client.4106.0:1 rbd_directory [read
 0~0] 2.30a98c1c) v1 -- ?+0 0x24664c0 con 0x24661b0



 Everything above here is normal - the rbd tool connected to the
 monitors, got the monmap and osdmap, and sent a request to read the
 'rbd_directory' object.

 snip


 2012-02-03 17:02:25.969338 7f88c7261710 client.4106.objecter  tid 1 on
 osd.0 is laggy


Re: couldn't use rbd

2012-02-03 Thread Josh Durgin

On 02/03/2012 02:54 PM, Masuko Tomoya wrote:

Hi,

The output of 'ceph pg dump' is below.

root@ceph01:~# ceph pg dump
2012-02-04 07:50:15.453151 mon <- [pg,dump]
2012-02-04 07:50:15.453734 mon.0 -> 'dumped all in format plain' (0)
version 63
last_osdmap_epoch 37
last_pg_scan 1
full_ratio 0.95
nearfull_ratio 0.85
pg_stat objects mip     degr    unf     kb      bytes   log
disklog state   v   reportedup  acting  last_scrub
1.1p0   0   0   0   0   0   0   0   0
  active+clean+degraded   0'0 34'30   [0] [0] 0'0
2012-02-03 10:25:55.383343
0.0p0   0   0   0   0   0   0   0   0
  active+clean+degraded   0'0 34'32   [0] [0] 0'0
2012-02-03 10:25:51.380648
1.0p0   0   0   0   0   0   0   0   0
  active+clean+degraded   0'0 34'30   [0] [0] 0'0
2012-02-03 10:25:53.381291
0.1p0   0   0   0   0   0   0   0   0
  active+clean+degraded   0'0 34'32   [0] [0] 0'0
2012-02-03 10:25:52.380881
2.0p0   0   0   0   0   0   0   0   0
  active+clean+degraded   0'0 34'30   [0] [0] 0'0
2012-02-03 10:25:59.387441
2.1p0   0   0   0   0   0   0   0   0
  active+clean+degraded   0'0 34'30   [0] [0] 0'0
2012-02-03 10:26:04.392778
pool 0  0   0   0   0   0   0   0   0
pool 1  0   0   0   0   0   0   0   0
pool 2  0   0   0   0   0   0   0   0
  sum    0       0       0       0       0       0       0       0
osdstat kbused  kbavail kb      hb in   hb out
0       1568    15726032        15727600        []      []
  sum    1568    15726032        15727600



You hit a bug in 0.38 that made the default crushmap for one osd
contain no pgs. This was fixed in 0.39, so I'd suggest upgrading and
re-running mkcephfs.
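
A minimal sketch of that upgrade and rebuild for the single-node setup in
this thread (assuming the Ubuntu packages from the ceph repository; mkcephfs
flags may differ between versions, and re-running it destroys any data
already stored in the cluster):

root@ceph01:~# /etc/init.d/ceph stop
root@ceph01:~# apt-get update && apt-get install ceph    # pull 0.39 or later
root@ceph01:~# mkcephfs -a -c /etc/ceph/ceph.conf -k /etc/ceph/keyring.admin
root@ceph01:~# /etc/init.d/ceph start
root@ceph01:~# ceph -s    # the pgs should now reach active+clean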




Re: couldn't use rbd

2012-02-03 Thread Masuko Tomoya
Hi,

I upgraded ceph to 0.41 and re-ran mkcephfs.
I found that my issue is fixed.

-
root@ceph01:~# rbd list
pool rbd doesn't contain rbd images
root@ceph01:~# rbd create test --size 1024
root@ceph01:~# rbd list
test
-
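
For reference, a quick extra check on the new image, as a minimal sketch
(same 'test' image as above):

root@ceph01:~# rbd info test    # should report a 1024 MB image
root@ceph01:~# rbd rm test      # remove it again once the check is done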

Josh, thank you for your advice.


2012/2/3 Josh Durgin josh.dur...@dreamhost.com:
 On 02/03/2012 02:54 PM, Masuko Tomoya wrote:

 Hi,

 The output of 'ceph pg dump' is below.

 root@ceph01:~# ceph pg dump
 snip


 You hit a bug in 0.38 that made the default crushmap for one osd
 contain no pgs. This was fixed in 0.39, so I'd suggest upgrading and
 re-running mkcephfs.




about attaching rbd volume from instance on KVM

2012-02-03 Thread Masuko Tomoya
Hi, all.

I'm trying to attach an rbd volume to an instance running on KVM,
but I have a problem. Could you help me?

---
I tried to attach an rbd volume on ceph01 to an instance on compute1 with
the virsh command.

root@compute1:~# virsh attach-device test-ub16 /root/testvolume.xml
error: Failed to attach device from /root/testvolume.xml
error: cannot resolve symlink rbd/testvolume: No such file or directory

/var/log/messages
Feb  3 20:14:48 compute1 libvirtd: 20:14:48.717: 3234: error :
qemuMonitorTextAddDevice:2417 : operation failed: adding
virtio-blk-pci,bus=pci.0,addr=0x9,drive=drive-virtio-disk4,id=virtio-disk4
device failed: Device needs media, but drive is empty#015#012Device
'virtio-blk-pci' could not be initialized#015#012
Feb  3 20:14:48 compute1 libvirtd: 20:14:48.717: 3234: warning :
qemuDomainAttachPciDiskDevice:188 : qemuMonitorAddDevice failed on
file=rbd:rbd/testvolume,if=none,id=drive-virtio-disk4,format=raw
(virtio-blk-pci,bus=pci.0,addr=0x9,drive=drive-virtio-disk4,id=virtio-disk4)
Feb  3 20:14:48 compute1 libvirtd: 20:14:48.717: 3234: error :
virSecurityDACRestoreSecurityFileLabel:143 : cannot resolve symlink
rbd/testvolume: No such file or directory
Feb  3 20:14:48 compute1 libvirtd: 20:14:48.717: 3234: warning :
qemuDomainAttachPciDiskDevice:229 : Unable to restore security label
on rbd/testvolume

There is no corresponding entry in /var/log/ceph/mon.0.log on host ceph01.
---


My environment is below.
*There are two servers. All servers are ubuntu 10.10 x86_64.
*ceph01: a single server configured for ceph (version: 0.41-1maverick)
*compute1: kvm hypervisor
 -librados2 and librbd1 packages are installed
 (version: 0.41-1maverick)
 -qemu-kvm is 0.14.0-rc1. I built qemu with rbd enabled.
 the output of 'qemu-img' shows 'rbd' in the supported formats field.
 (I built qemu referring to this page:
 http://ceph.newdream.net/wiki/QEMU-RBD)
 -apparmor is disabled.
 -libvirt is 0.8.8


 -ceph.conf is present on compute1.
root@compute1:~# ls -l /etc/ceph/
total 20
-rw-r--r-- 1 root root 508 2012-02-03 14:38 ceph.conf
-rw------- 1 root root  63 2012-02-03 17:04 keyring.admin
-rw------- 1 root root  63 2012-02-03 14:38 keyring.bin
-rw------- 1 root root  56 2012-02-03 14:38 keyring.mds.0
-rw------- 1 root root  56 2012-02-03 14:38 keyring.osd.0

=
 -the contents of ceph.conf are below.
root@compute1:~# cat /etc/ceph/ceph.conf
[global]
   auth supported = cephx
   keyring = /etc/ceph/keyring.bin
[mon]
   mon data = /data/data/mon$id
   debug ms = 1
[mon.0]
   host = ceph01
   mon addr = 10.68.119.191:6789
[mds]
   keyring = /etc/ceph/keyring.$name
[mds.0]
   host = ceph01
[osd]
   keyring = /etc/ceph/keyring.$name
   osd data = /data/osd$id
   osd journal = /data/osd$id/journal
   osd journal size = 512
   osd class tmp = /var/lib/ceph/tmp
   debug osd = 20
   debug ms = 1
   debug filestore = 20
[osd.0]
   host = ceph01
   btrfs devs = /dev/sdb1

===
*content of keyring.admin is below
root@compute1:~# cat /etc/ceph/keyring.admin
[client.admin]
   key = AQDFeCxPyBlNIRAAxS1DcRHpMXRpcjY/GNMwYg==


===
*output of running 'ceph auth list'
root@ceph01:/etc/ceph# ceph auth list
2012-02-03 20:34:59.507451 mon <- [auth,list]
2012-02-03 20:34:59.508785 mon.0 -> 'installed auth entries:
mon.
   key: AQDFeCxPiK04IxAAslDBNkrOGKWxcbCh2iysqg==
mds.0
   key: AQDFeCxPsJ+LGhAAJ3/rmkAtGXSv/eHh0yXgww==
   caps: [mds] allow
   caps: [mon] allow rwx
   caps: [osd] allow *
osd.0
   key: AQDFeCxPoEK+ExAAecD7+tWgpIRoZx2AT7Jwbg==
   caps: [mon] allow rwx
   caps: [osd] allow *
client.admin
   key: AQDFeCxPyBlNIRAAxS1DcRHpMXRpcjY/GNMwYg==
   caps: [mds] allow
   caps: [mon] allow *
   caps: [osd] allow *
' (0)


*xml file is below.
root@compute1:~# cat /root/testvolume.xml
<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <source protocol='rbd' name='rbd/testvolume'>
    <host name='10.68.119.191' port='6789'/>
  </source>
  <target dev='vde' bus='virtio'/>
</disk>


*testvolume exists in the rados 'rbd' pool.
root@compute1:~# qemu-img info rbd:rbd/testvolume
image: rbd:rbd/testvolume
file format: raw
virtual size: 1.0G (1073741824 bytes)
disk size: unavailable


Waiting for reply,

Tomoya.
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html