Re: Cannot delete some empty dirs and weird sizes
On Thursday 02 February 2012 Gregory Farnum wrote:
On Wed, Feb 1, 2012 at 9:02 AM, Amon Ott <a@m-privacy.de> wrote:

ceph should have recovered here. Might also be caused by this setting that I tried for a while, it is off now:
mds standby replay = true
With this setting, if the active mds gets killed, no mds is able to become active, so everything hangs. Had to reboot again.

Hrm. That setting simply tells the non-active MDSes that they should follow the journal of the active MDS(es). They should still go active if the MDS they're following fails — although it does slightly increase the chances of them running into the same bugs in code and dying at the same time.

Well, in this test the following MDS did not become active, it stayed stuck in replay mode. As waking up without that following is also fast enough for us, I quickly turned it off again. Just wanted to mention that there might be another bug lurking.

Then killed the active mds, another takes over and suddenly the missing file appears:

.tmp/tiny14/.config/pcmanfm/LXDE/:
insgesamt 1
drwxr-xr-x 1 32252 users 393  1. Feb 15:19 .
drwxr-xr-x 1 32252 users   0  1. Feb 17:21 ..
-rw-r--r-- 1 root  root  393 24. Jan 15:55 pcmanfm.conf

So were you able to delete that file once it reappeared?

It got deleted correctly by the nightly cleanup cron job.

Restarted the original mds, it does not appear in ceph mds dump, although it is running at 100% cpu. Same happened with other mds processes after killing and starting, now I have only one left that is working correctly.

Remind me, you do only have one active MDS, correct? Did you look at the logs and see what the MDS was doing with 100% cpu?

Four MDS defined, but using max mds = 1.

Will leave the cluster in this state now and have another look tomorrow - maybe the spinning mds processes recover by some miracle.

After yet another complete reboot it seems to be stable again now.

So far I can say that Ceph FS runs pretty stable under the following conditions:
- 4 nodes, each with 8 to 12 CPU cores and 12 GB of RAM (CPUs and RAM mostly unused by Ceph)
- Ceph 0.41
- 3 mon, 4 mds (max mds 1), 4 osd; all 4 nodes are both server and client
- Kernel 3.1.10 with some hand-integrated patches for Ceph fixes (3.2.2 was unstable, need to check)
- OSD storage on a normal hard drive, ext4 without journal (btrfs crashes about once per day)
- OSD journals on a separate SSD
- MDS does not get restarted without reboot (yes, only reboot helps)

From our experience, it could be a bit faster when writing into many different files, which is our major workload. However, the last few versions already brought significant improvements in stability and speed. The main problem we see is MDS takeover and recovery, which has given us a lot of trouble so far.

Thanks once more for your good work!

Amon Ott
--
Dr. Amon Ott
m-privacy GmbH            Tel: +49 30 24342334
Am Köllnischen Park 1     Fax: +49 30 24342336
10179 Berlin              http://www.m-privacy.de

Amtsgericht Charlottenburg, HRB 84946
Geschäftsführer: Dipl.-Kfm. Holger Maczkowsky, Roman Maczkowsky
GnuPG-Key-ID: 0x2DD3A649
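[For concreteness, the setup Amon describes would look roughly like the ceph.conf fragment below. The paths are invented placeholders, and exactly where 'max mds' was set varied with the tooling of this era (conf file vs. 'ceph mds set_max_mds'), so treat this as a sketch, not his actual config:

; sketch of the described cluster: 3 mon, 4 mds with one active rank,
; 4 osd with data on ext4 and journals on a separate SSD
[mon]
    max mds = 1                       ; placement is an assumption

[mds]
    mds standby replay = false        ; Amon reports hangs when true

[osd]
    osd data = /data/osd$id           ; invented path, ext4 w/o journal
    osd journal = /ssd/osd$id.journal ; invented path, separate SSD
]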
Re: couldn't use rbd
Hi Josh,

Thank you for your reply!

This might mean the rbd image list object can't be read for some reason, or the rbd tool is doing something weird that the rados tool isn't. Can you share the output of 'ceph -s' and 'rbd ls --log-to-stderr --debug-ms 1 --debug-objecter 20 --debug-monc 20 --debug-auth 20'?

The output of 'ceph -s' is below.
---
root@ceph01:~# ceph -s
2012-02-03 17:01:47.881960    pg v33: 6 pgs: 6 active+clean+degraded; 0 KB data, 1056 KB used, 15357 MB / 15358 MB avail
2012-02-03 17:01:47.882583   mds e9: 1/1/1 up {0=0=up:creating}
2012-02-03 17:01:47.882733   osd e21: 1 osds: 1 up, 1 in
2012-02-03 17:01:47.883042   log 2012-02-03 16:35:11.183897 osd.0 10.68.119.191:6801/2912 12 : [WRN] map e19 wrongly marked me down or wrong addr
2012-02-03 17:01:47.883144   mon e1: 1 mons at {0=10.68.119.191:6789/0}

The output of 'rbd ls --log-to-stderr --debug-ms 1 --debug-objecter 20 --debug-monc 20 --debug-auth 20' is below.
--
root@ceph01:~# rbd ls --log-to-stderr --debug-ms 1 --debug-objecter 20 --debug-monc 20 --debug-auth 20
2012-02-03 17:02:10.910598 7f88cbb91720 monclient(hunting): build_initial_monmap
2012-02-03 17:02:10.910912 7f88cbb91720 -- :/0 messenger.start
2012-02-03 17:02:10.910966 7f88cbb91720 monclient(hunting): init
2012-02-03 17:02:10.911268 7f88cbb91720 auth: KeyRing::load: loaded key file /etc/ceph/keyring.bin
2012-02-03 17:02:10.911952 7f88cbb91720 monclient(hunting): supporting cephx auth protocol
2012-02-03 17:02:10.911984 7f88cbb91720 monclient(hunting): _reopen_session
2012-02-03 17:02:10.912098 7f88cbb91720 monclient(hunting): _pick_new_mon picked mon.0 con 0x24603e0 addr 10.68.119.191:6789/0
2012-02-03 17:02:10.912124 7f88cbb91720 monclient(hunting): _send_mon_message to mon.0 at 10.68.119.191:6789/0
2012-02-03 17:02:10.951553 7f88cbb91720 -- 10.68.119.191:0/1003500 --> 10.68.119.191:6789/0 -- auth(proto 0 30 bytes) v1 -- ?+0 0x24606f0 con 0x24603e0
2012-02-03 17:02:10.951729 7f88cbb8f710 -- 10.68.119.191:0/1003500 learned my addr 10.68.119.191:0/1003500
2012-02-03 17:02:10.952949 7f88cbb91720 monclient(hunting): renew_subs
2012-02-03 17:02:10.953012 7f88cbb91720 monclient(hunting): authenticate will time out at 2012-02-03 17:02:40.952983
2012-02-03 17:02:10.953392 7f88c9366710 -- 10.68.119.191:0/1003500 <== mon.0 10.68.119.191:6789/0 1 ==== auth_reply(proto 2 0 Success) v1 ==== 33+0+0 (1661896955 0 0) 0x24608d0 con 0x24603e0
2012-02-03 17:02:10.953444 7f88c9366710 cephx: set_have_need_key no handler for service mon
2012-02-03 17:02:10.953455 7f88c9366710 cephx: set_have_need_key no handler for service osd
2012-02-03 17:02:10.953462 7f88c9366710 cephx: set_have_need_key no handler for service auth
2012-02-03 17:02:10.953504 7f88c9366710 cephx: validate_tickets want 37 have 0 need 37
2012-02-03 17:02:10.953526 7f88c9366710 monclient(hunting): my global_id is 4106
2012-02-03 17:02:10.953546 7f88c9366710 cephx client: handle_response ret = 0
2012-02-03 17:02:10.953557 7f88c9366710 cephx client: got initial server challenge 1501581746095987
2012-02-03 17:02:10.953575 7f88c9366710 cephx client: build_request
2012-02-03 17:02:10.953586 7f88c9366710 cephx client: validate_tickets: want=37 need=37 have=0
2012-02-03 17:02:10.953592 7f88c9366710 cephx: set_have_need_key no handler for service mon
2012-02-03 17:02:10.953598 7f88c9366710 cephx: set_have_need_key no handler for service osd
2012-02-03 17:02:10.953604 7f88c9366710 cephx: set_have_need_key no handler for service auth
2012-02-03 17:02:10.953610 7f88c9366710 cephx: validate_tickets want 37 have 0 need 37
2012-02-03 17:02:10.953617 7f88c9366710 cephx client: want=37 need=37 have=0
2012-02-03 17:02:10.953920 7f88c9366710 cephx client: get auth session key: client_challenge 394141338555354357
2012-02-03 17:02:10.953953 7f88c9366710 monclient(hunting): _send_mon_message to mon.0 at 10.68.119.191:6789/0
2012-02-03 17:02:10.953967 7f88c9366710 -- 10.68.119.191:0/1003500 --> 10.68.119.191:6789/0 -- auth(proto 2 32 bytes) v1 -- ?+0 0x24608d0 con 0x24603e0
2012-02-03 17:02:10.954517 7f88c9366710 -- 10.68.119.191:0/1003500 <== mon.0 10.68.119.191:6789/0 2 ==== auth_reply(proto 2 0 Success) v1 ==== 206+0+0 (707160412 0 0) 0x24608d0 con 0x24603e0
2012-02-03 17:02:10.954535 7f88c9366710 cephx client: handle_response ret = 0
2012-02-03 17:02:10.954555 7f88c9366710 cephx client: get_auth_session_key
2012-02-03 17:02:10.954576 7f88c9366710 cephx: verify_service_ticket_reply got 1 keys
2012-02-03 17:02:10.954587 7f88c9366710 cephx: got key for service_id auth
2012-02-03 17:02:10.954696 7f88c9366710 cephx: ticket.secret_id=2
2012-02-03 17:02:10.954737 7f88c9366710 cephx: verify_service_ticket_reply service auth secret_id 2 session_key AQCClCtPyL7gOBAAaDicPRG6NHbgUs+4YNntWQ== validity=43200.00
2012-02-03 17:02:10.954756 7f88c9366710 cephx: ticket expires=2012-02-04 05:02:10.954749 renew_after=2012-02-04 02:02:10.954749
2012-02-03 17:02:10.954764 7f88c9366710 cephx client: want=37 need=37 have=0
2012-02-03 17:02:10.954771
avoid crashes from faulty crushmap
I messed up a crush map the other day, mixing components of different types in a single rule. The crushmap compiler didn't complain, but mons and osds would crash when applying those rules. I had to use this patch to recover the cluster. Only the second hunk was relevant, but I figured a BUG_ON that stops you from fixing the problem is best avoided ;-)

---BeginMessage---
It's very hard to recover from an invalid crushmap if mons fail assertions while processing the map, and osds crash while advancing past an already-fixed map. Skip such broken rules instead of aborting.

Signed-off-by: Alexandre Oliva <ol...@lsd.ic.unicamp.br>
---
 src/crush/mapper.c |   14 +++++++++++---
 1 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/src/crush/mapper.c b/src/crush/mapper.c
index 1e475b40..6ce4c97 100644
--- a/src/crush/mapper.c
+++ b/src/crush/mapper.c
@@ -354,7 +354,11 @@ static int crush_choose(const struct crush_map *map,
 				item = bucket_perm_choose(in, x, r);
 			else
 				item = crush_bucket_choose(in, x, r);
-			BUG_ON(item >= map->max_devices);
+			if (item >= map->max_devices) {
+				dprintk(" bad item %d\n", item);
+				skip_rep = 1;
+				break;
+			}
 
 			/* desired type? */
 			if (item < 0)
@@ -365,8 +369,12 @@ static int crush_choose(const struct crush_map *map,
 
 			/* keep going? */
 			if (itemtype != type) {
-				BUG_ON(item >= 0 ||
-				       (-1-item) >= map->max_buckets);
+				if (item >= 0 ||
+				    (-1-item) >= map->max_buckets) {
+					dprintk(" bad item type %d\n", type);
+					skip_rep = 1;
+					break;
+				}
 				in = map->buckets[-1-item];
 				retry_bucket = 1;
 				continue;
-- 
1.7.7.6

---End Message---
--
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/      FSF Latin America board member
Free Software Evangelist           Red Hat Brazil Compiler Engineer
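[Independent of the patch, the usual way out of a bad map — assuming the mons are still healthy enough to accept a new one, which was not the case here — is to extract, fix, and re-inject the crushmap. A sketch with placeholder file names:

# extract, decompile, edit, recompile, and re-inject the crushmap
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt      # decompile to editable text
$EDITOR crush.txt                        # remove or fix the broken rule
crushtool -c crush.txt -o crush.fixed    # recompile
ceph osd setcrushmap -i crush.fixed
]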
[PATCH RESEND] ceph: avoid panic with mismatched symlink sizes in fill_inode()
Return -EINVAL rather than panic if iinfo->symlink_len and inode->i_size do not match.

Also use kstrndup rather than kmalloc/memcpy.

Signed-off-by: Xi Wang <xi.w...@gmail.com>
---
 fs/ceph/inode.c |   11 ++++++-----
 1 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 2c48937..6c0205d 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -677,18 +677,19 @@ static int fill_inode(struct inode *inode,
 	case S_IFLNK:
 		inode->i_op = &ceph_symlink_iops;
 		if (!ci->i_symlink) {
-			int symlen = iinfo->symlink_len;
+			u32 symlen = iinfo->symlink_len;
 			char *sym;
 
-			BUG_ON(symlen != inode->i_size);
 			spin_unlock(&ci->i_ceph_lock);
 
+			err = -EINVAL;
+			if (symlen != inode->i_size)
+				goto out;
+
 			err = -ENOMEM;
-			sym = kmalloc(symlen+1, GFP_NOFS);
+			sym = kstrndup(iinfo->symlink, symlen, GFP_NOFS);
 			if (!sym)
 				goto out;
-			memcpy(sym, iinfo->symlink, symlen);
-			sym[symlen] = 0;
 
 			spin_lock(&ci->i_ceph_lock);
 			if (!ci->i_symlink)
-- 
1.7.5.4
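[For readers unfamiliar with the helper: kstrndup() allocates a fresh NUL-terminated copy of at most n bytes of its argument, which is exactly the kmalloc/memcpy/terminate dance the old code did by hand. A userspace sketch of the same logic follows; the real kernel helper additionally takes a gfp_t allocation-flags argument:

#include <stdlib.h>
#include <string.h>

/* userspace analogue of the kernel's kstrndup(): duplicate at most
 * max bytes of s into a newly allocated, NUL-terminated buffer */
char *kstrndup_like(const char *s, size_t max)
{
	size_t len = strnlen(s, max);
	char *buf = malloc(len + 1);

	if (!buf)
		return NULL;
	memcpy(buf, s, len);
	buf[len] = '\0';
	return buf;
}
]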
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On 02/02/2012 05:28 PM, Gregory Farnum wrote:
On Thu, Feb 2, 2012 at 12:22 PM, Jim Schutt <jasc...@sandia.gov> wrote:

I found 0 instances of "waiting for commit" in all my OSD logs for my last run. So I never waited on the journal?

Looks like it. Interesting.

So far I'm looking at two behaviours I've noticed that seem anomalous to me. One is that I instrumented ms_dispatch(), and I see it take a half-second or more several hundred times, out of several thousand messages. Is that expected?

How did you instrument it? If you wrapped the whole function it's possible that those longer runs are actually chewing through several messages that had to get waitlisted for some reason previously. (That's the call to do_waiters().)

Yep, I wrapped the whole function, and also instrumented taking osd_lock while I was there. About half the time that ms_dispatch() takes more than 0.5 seconds, taking osd_lock is responsible for the delay. There's two dispatch threads, one for ops and one for rep_ops, right? So one's waiting on the other?

There's just one main dispatcher; no split for the ops and rep_ops. The reason for that dispatch_running is that if there are requests waiting then the tick() function will run through them if the messenger dispatch thread is currently idle. But it is possible for the Messenger to try and dispatch, and for that to be blocked while some amount of (usually trivial) work is being done by a different thread, yes. I don't think we've ever observed it being a problem for anything other than updating OSD maps, though...

Ah, OK. I guess I was confused by my log output, e.g.:

osd.0.log:2277569:2012-02-02 09:23:41.666420 7fe5fe65e700 osd.0 31 ms_dispatch ET 0.990204 osd_lock ET 0.001438 msg 0xbe19400
osd.0.log:2277697:2012-02-02 09:23:41.669949 7fe5fee5f700 osd.0 31 ms_dispatch ET 0.993136 osd_lock ET 0.992708 msg 0x13afd680

I thought 7fe5fe65e700 and 7fe5fee5f700 identified the threads. I need to go study that code some more.

Another is that once a message receive starts, I see ~50 messages that take tens of seconds to receive, when the nominal receive time is a half-second or less. I'm in the process of tooling up to collect tcpdump data on all my clients to try to catch what is going on with that.

Again, how are you instrumenting that?

I post-process the logs, looking at the time difference between "reader got .* policy throttler" and "reader got .* osd_op(client."

I guess the logging output must have changed a bit at some point (or was that one of your patches?). master has "reader wants", not "reader got", for the policy throttler. (Just got a little confused when checking the code.)

Yep, I added an extra message to make post-processing logs easier, sorry.
When I find a candidate message, I grep the log for just that reader thread, and see, e.g., this:

osd.0.log:1280693:2012-02-02 09:17:57.704508 7fe5c9099700 -- 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader got 2670720 from policy throttler 48809510/5000 seq 828/828 waiters 157/149 for src client.4301 tid=247
osd.0.log:1280694:2012-02-02 09:17:57.704525 7fe5c9099700 -- 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader wants 2670720 from dispatch throttler 41944358/
osd.0.log:1280701:2012-02-02 09:17:57.704654 7fe5c9099700 -- 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader got front 128
osd.0.log:1280705:2012-02-02 09:17:57.704752 7fe5c9099700 -- 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader allocating new rx buffer at offset 0
osd.0.log:1280710:2012-02-02 09:17:57.704873 7fe5c9099700 -- 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11922000 len 2670592
osd.0.log:1559767:2012-02-02 09:19:40.726589 7fe5c9099700 -- 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6a5cc len 1325620
osd.0.log:1561092:2012-02-02 09:19:40.927559 7fe5c9099700 -- 172.17.131.32:6800/14974 >> 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6ab74 len 1324172

Note the ~2 minute delay (and ~300,000 lines of logging) between the first and second reads.

During that time 129 sockets were processed - what makes sd=215 special?

Hrm. Well, you can try turning up the messenger debugging to 30 and taking advantage of the "reader reading"/"reader read" pair right around tcp_read_nonblocking.

OK, I'll give that a try as well, thanks.

I've added tracepoints in my client kernel try_write(), and nothing seems unusual (that's with running the patch to ceph_write_space() I posted earlier):

kworker/0:2-1790 [000] 1543.200887: ceph_try_write_msg_done: peer osd0 tid 179 seq 3 sent 4194304
kworker/0:2-1790
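[For anyone wanting to repeat this kind of analysis, a rough sketch of the post-processing Jim describes is below. The log file name, thread id, and awk field positions are assumptions about the local log format, not something taken from the thread:

#!/bin/sh
# Pull the throttler-grant and completed-receive events for one reader
# thread and print their timestamps; the gap between them is the
# per-message receive latency being discussed.
THREAD=7fe5c9099700        # placeholder reader thread id
grep "$THREAD" osd.0.log \
  | egrep 'reader (wants|got) .* policy throttler|reader got .* osd_op\(client' \
  | awk '{ print $1, $2 }'   # date and time of each matching event
]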
Re: couldn't use rbd
On 02/03/2012 12:51 AM, Masuko Tomoya wrote:

Hi Josh,

Thank you for your reply!

This might mean the rbd image list object can't be read for some reason, or the rbd tool is doing something weird that the rados tool isn't. Can you share the output of 'ceph -s' and 'rbd ls --log-to-stderr --debug-ms 1 --debug-objecter 20 --debug-monc 20 --debug-auth 20'?

The output of 'ceph -s' is below.
---
root@ceph01:~# ceph -s
2012-02-03 17:01:47.881960    pg v33: 6 pgs: 6 active+clean+degraded; 0 KB data, 1056 KB used, 15357 MB / 15358 MB avail
2012-02-03 17:01:47.882583   mds e9: 1/1/1 up {0=0=up:creating}
2012-02-03 17:01:47.882733   osd e21: 1 osds: 1 up, 1 in
2012-02-03 17:01:47.883042   log 2012-02-03 16:35:11.183897 osd.0 10.68.119.191:6801/2912 12 : [WRN] map e19 wrongly marked me down or wrong addr
2012-02-03 17:01:47.883144   mon e1: 1 mons at {0=10.68.119.191:6789/0}

The output of 'rbd ls --log-to-stderr --debug-ms 1 --debug-objecter 20 --debug-monc 20 --debug-auth 20' is below.
--
root@ceph01:~# rbd ls --log-to-stderr --debug-ms 1 --debug-objecter 20 --debug-monc 20 --debug-auth 20

snip

2012-02-03 17:02:10.971391 7f88cbb91720 client.4106.objecter op_submit oid rbd_directory @2 [read 0~0] tid 1 osd.0
2012-02-03 17:02:10.971465 7f88cbb91720 client.4106.objecter send_op 1 to osd.0
2012-02-03 17:02:10.971533 7f88cbb91720 -- 10.68.119.191:0/1003500 --> 10.68.119.191:6801/2912 -- osd_op(client.4106.0:1 rbd_directory [read 0~0] 2.30a98c1c) v1 -- ?+0 0x24664c0 con 0x24661b0

Everything above here is normal - the rbd tool connected to the monitors, got the monmap and osdmap, and sent a request to read the 'rbd_directory' object.

snip

2012-02-03 17:02:25.969338 7f88c7261710 client.4106.objecter tid 1 on osd.0 is laggy

This means the osd isn't responding to the read. Check the osd log for errors. If there's nothing obvious, add this to the osd section of your ceph.conf and restart the osd:

debug osd = 20
debug ms = 1
debug filestore = 20

Then run 'rbd ls' and look at what happens after 'osd_op.*rbd_directory' appears in the osd log.

I compared the 'rbd list' and 'rados lspools' logs and found there are differences.

'rbd list'
2012-02-03 17:02:10.971770 7f88c9366710 -- 10.68.119.191:0/1003500 <== mon.0 10.68.119.191:6789/0 10 ==== osd_map(21..21 src has 1..21) v2 ==== 1284+0+0 (473305567 0 0) 0x24655a0 con 0x24603e0
2012-02-03 17:02:10.971789 7f88c9366710 client.4106.objecter handle_osd_map ignoring epochs [21,21] <= 21
2012-02-03 17:02:10.971801 7f88c9366710 client.4106.objecter dump_active .. 0 homeless
2012-02-03 17:02:10.971815 7f88c9366710 client.4106.objecter 1 2.30a98c1c osd.0 rbd_directory [read 0~0]
--(snip)--

'rados lspools'
2012-02-03 17:11:52.866072 7f9c5764b710 -- 10.68.119.191:0/1003868 <== mon.0 10.68.119.191:6789/0 7 ==== osd_map(21..21 src has 1..21) v2 ==== 1284+0+0 (473305567 0 0) 0x770a70 con 0x771440
2012-02-03 17:11:52.866103 7f9c5764b710 client.4107.objecter handle_osd_map got epochs [21,21] > 0
2012-02-03 17:11:52.866111 7f9c5764b710 client.4107.objecter handle_osd_map decoding full epoch 21
2012-02-03 17:11:52.866272 7f9c5764b710 client.4107.objecter dump_active .. 0 homeless
data
metadata
rbd
--(snip)--

What do these logs mean?

The difference is that 'rbd ls' talks to the monitors and osds, while 'rados lspools' just needs to talk to the monitors. The objecter dump_active part is listing in-flight osd requests.

*ceph cluster is configured on single server.
*server is ubuntu 10.10 maverick.
*ceph, librados2 and librbd1 packages are installed. (version: 0.38-1maverick)
*apparmor is disabled.
apparmor shouldn't matter if you have libvirt 0.9.9 or newer.

I use libvirt 0.8.3 (latest version for maverick), so I disabled apparmor.

2012/2/2 Josh Durgin <josh.dur...@dreamhost.com>:
On 02/02/2012 01:49 AM, Masuko Tomoya wrote:

Hi, all.

When I execute rbd command, it is not success.

root@ceph01:~# rbd list
(no response)

/var/log/ceph/mon.0.log
-
2012-02-02 17:58:19.801762 7ff4bbfb1710 -- 10.68.119.191:6789/0 <== client.? 10.68.119.191:0/1002580 1 ==== auth(proto 0 30 bytes) v1 ==== 56+0+0 (625540289 0 0) 0x1619a00 con 0x1615a00
2012-02-02 17:58:19.801919 7ff4bbfb1710 -- 10.68.119.191:6789/0 --> 10.68.119.191:0/1002580 -- auth_reply(proto 2 0 Success) v1 -- ?+0 0x1619c00 con 0x1615a00
2012-02-02 17:58:19.802505 7ff4bbfb1710 -- 10.68.119.191:6789/0 <== client.? 10.68.119.191:0/1002580 2 ==== auth(proto 2 32 bytes) v1 ==== 58+0+0 (346146289 0 0) 0x161fc00 con 0x1615a00
2012-02-02 17:58:19.802673 7ff4bbfb1710 -- 10.68.119.191:6789/0 --> 10.68.119.191:0/1002580 -- auth_reply(proto 2 0 Success) v1 -- ?+0 0x1619a00 con 0x1615a00
2012-02-02 17:58:19.803473 7ff4bbfb1710 -- 10.68.119.191:6789/0 <== client.? 10.68.119.191:0/1002580 3 ==== auth(proto 2 165 bytes) v1 ==== 191+0+0 (3737796417 0 0) 0x1619600 con 0x1615a00
2012-02-02 17:58:19.803745 7ff4bbfb1710 -- 10.68.119.191:6789/0 --> 10.68.119.191:0/1002580 -- auth_reply(proto 2 0 Success) v1 -- ?+0 0x161fc00 con
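[Written out as a ceph.conf stanza, Josh's debugging suggestion from earlier in this thread looks like the fragment below; the settings are exactly the ones he names, and the osd must be restarted after adding them:

[osd]
    debug osd = 20
    debug ms = 1
    debug filestore = 20
]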
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On Feb 3, 2012, at 8:18 AM, Jim Schutt <jasc...@sandia.gov> wrote:

snip

Yep, I wrapped the whole function, and also instrumented taking osd_lock while I was there. About half the time that ms_dispatch() takes more than 0.5 seconds, taking osd_lock is responsible for the delay. There's two dispatch threads, one for ops and one for rep_ops, right? So one's waiting on the other?

snip

Ah, OK. I guess I was confused by my log output, e.g.:

D'oh. Sorry, you confused me with your reference to repops, which aren't special-cased or anything. But there are two messengers on the OSD, each with their own dispatch thread. One of those messengers is for clients and one is for other OSDs.

And now that you point that out, I wonder if the problem is lack of Cond signaling in ms_dispatch. I'm on my phone right now but I believe there's a chunk of commented-out code (why commented instead of deleted? I don't know) that we want to uncomment for reasons that will become clear when you look at it. :)

Try that and see how many of your problems disappear?

snip
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On Fri, 3 Feb 2012, Jim Schutt wrote:

snip

Ah, OK. I guess I was confused by my log output, e.g.:

osd.0.log:2277569:2012-02-02 09:23:41.666420 7fe5fe65e700 osd.0 31 ms_dispatch ET 0.990204 osd_lock ET 0.001438 msg 0xbe19400
osd.0.log:2277697:2012-02-02 09:23:41.669949 7fe5fee5f700 osd.0 31 ms_dispatch ET 0.993136 osd_lock ET 0.992708 msg 0x13afd680

I thought 7fe5fe65e700 and 7fe5fee5f700 identified the threads. I need to go study that code some more.

Oh... they are separate threads. In the OSD's case two different messengers (the public and cluster ones) are wired up to the same dispatcher (OSD::ms_dispatch). MOSDOps come in on the public thread, MOSDSubOps on the cluster one, but they're both fed to the same function. That's why there's some funkiness going on in, say, handle_osd_map().

But like Greg said, I don't think we've seen any significant latencies there except from map processing. If you have a log, that would be interesting to look at!

sage

snip
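[As an illustration only — simplified names, not the actual Ceph source — the wiring Sage describes looks roughly like this: two messengers, each with its own dispatch thread, share one dispatcher, so both threads funnel into the same ms_dispatch() and contend on osd_lock:

#include <mutex>

struct Message;

class Dispatcher {
public:
  virtual ~Dispatcher() {}
  virtual bool ms_dispatch(Message *m) = 0;
};

class OSD : public Dispatcher {
  std::mutex osd_lock;   // the lock Jim saw ms_dispatch() waiting on
public:
  bool ms_dispatch(Message *m) override {
    std::lock_guard<std::mutex> l(osd_lock); // public and cluster dispatch
    // ... handle MOSDOp / MOSDSubOp ...     // threads serialize here
    return true;
  }
};

// at startup (sketch): both messengers get the same dispatcher, e.g.
//   public_messenger->add_dispatcher_head(&osd);   // client traffic
//   cluster_messenger->add_dispatcher_head(&osd);  // inter-OSD traffic
]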
Possible RBD inconsistencies with kvm+Windows 7
I have a Windows 7 guest running under kvm/libvirt with RBD as a backend to a cluster of 3 OSDs. With this setup, I am seeing behavior that looks suspiciously like disk corruption in the guest VM when executing some of our workloads.

For instance, in one occurrence, a python function recursively deletes a large directory tree while the disk is otherwise loaded. For us, this occasionally fails because the OS reported that all the files in the directory were deleted, but then reports the directory is not empty when going to remove it. In another, a simple test application writes new files to a directory every 50ms, then after 6s verifies that at least 3 files were written, also while the disk is under heavy load.

We have never ever seen these failures on bare metal, or on kvm instances backed by an LVM volume, in years of operation, but they happen every couple of hours with RBD. Unfortunately, I have been unsuccessful when attempting to create synthetic test cases to demonstrate the inconsistent RBD behavior.

Has anyone else seen similar inconsistent RBD behavior, or have ideas how to diagnose further?

For reference, I am running ceph 0.41 and qemu-kvm 1.0 on ubuntu 11.10 amd64.

Regards,
Josh
how to remove laggy mds ?
Hi List,

one of my test mds servers died a few days ago (hardware crash). I will not buy a new one. Is there any chance to remove this laggy mds?

2012-02-03 20:38:53.801623   mds e86436: 2/2/1 up {0=0=up:resolve,1=0=up:resolve(laggy or crashed)}
2012-02-03 20:39:08.943880   mds e86437: 2/2/1 up {0=0=up:resolve,1=0=up:resolve(laggy or crashed)}
2012-02-03 20:39:24.094926   mds e86438: 2/2/1 up {0=0=up:resolve,1=0=up:resolve(laggy or crashed)}
2012-02-03 20:39:39.236560   mds e86439: 2/2/1 up {0=0=up:resolve,1=0=up:resolve(laggy or crashed)}

Thanks !!

Jens
Re: Possible RBD inconsistencies with kvm+Windows 7
On 02/03/2012 10:19 AM, Josh Pieper wrote:

I have a Windows 7 guest running under kvm/libvirt with RBD as a backend to a cluster of 3 OSDs. With this setup, I am seeing behavior that looks suspiciously like disk corruption in the guest VM when executing some of our workloads.

snip

Has anyone else seen similar inconsistent RBD behavior, or have ideas how to diagnose further?

What fs are your osds using? A while ago there was a bug in ext4's fiemap that sometimes caused incorrect reads - if you set filestore_fiemap_threshold larger than your object size, you can test whether fiemap is the problem.

Are you using the rbd_writeback_window option? If so, does the corruption occur without it?

In any case, a log of this occurring with debug_ms=1 and debug_rbd=20 from qemu will tell us if there are out-of-order operations happening.
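[For concreteness, here is roughly how those two knobs were expressed at the time. The values and image name are invented, and the exact qemu option spelling should be double-checked against your qemu-kvm build, so treat this as a sketch:

; ceph.conf, [osd] section: push the fiemap threshold above the object
; size (assumption: the default 4 MB rbd object size) so FIEMAP is
; never used, per Josh's first suggestion
[osd]
    filestore fiemap threshold = 4194305

# qemu drive spec of that era with the writeback window enabled
# (8 MB, invented value); dropping the option tests his second question
-drive format=rbd,file=rbd:rbd/win7-image:rbd_writeback_window=8000000
]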
Re: Possible RBD inconsistencies with kvm+Windows 7
Josh Durgin wrote:

snip

What fs are your osds using? A while ago there was a bug in ext4's fiemap that sometimes caused incorrect reads - if you set filestore_fiemap_threshold larger than your object size, you can test whether fiemap is the problem.

The OSDs are using xfs. In my testing with 0.40, btrfs had incredible performance problems after a day or so of operation. The last I heard, ext4 could potentially have data loss due to its limited xattr support.

Are you using the rbd_writeback_window option? If so, does the corruption occur without it?

Yes I was. In prior tests, performance was abysmal without it. I will test without it, but our runs will load the system very differently when they are going so slowly.

In any case, a log of this occurring with debug_ms=1 and debug_rbd=20 from qemu will tell us if there are out-of-order operations happening.

Great, I will attempt to record some.

Regards,
Josh
Re: [PATCH RESEND] ceph: avoid panic with mismatched symlink sizes in fill_inode()
On Fri, 2012-02-03 at 09:55 -0500, Xi Wang wrote:

Return -EINVAL rather than panic if iinfo->symlink_len and inode->i_size do not match. Also use kstrndup rather than kmalloc/memcpy.

Signed-off-by: Xi Wang <xi.w...@gmail.com>

Looks good, though it might be good to at least call WARN_ON(). What do you think? I will commit this (either with or without the WARN_ON()).

Reviewed-by: Alex Elder <el...@dreamhost.com>
[PATCH v2] ceph: avoid panic with mismatched symlink sizes in fill_inode()
Return -EINVAL rather than panic if iinfo->symlink_len and inode->i_size do not match.

Also use kstrndup rather than kmalloc/memcpy.

Signed-off-by: Xi Wang <xi.w...@gmail.com>
---
 fs/ceph/inode.c |   11 ++++++-----
 1 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 2c48937..9fff9f3 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -677,18 +677,19 @@ static int fill_inode(struct inode *inode,
 	case S_IFLNK:
 		inode->i_op = &ceph_symlink_iops;
 		if (!ci->i_symlink) {
-			int symlen = iinfo->symlink_len;
+			u32 symlen = iinfo->symlink_len;
 			char *sym;
 
-			BUG_ON(symlen != inode->i_size);
 			spin_unlock(&ci->i_ceph_lock);
 
+			err = -EINVAL;
+			if (WARN_ON(symlen != inode->i_size))
+				goto out;
+
 			err = -ENOMEM;
-			sym = kmalloc(symlen+1, GFP_NOFS);
+			sym = kstrndup(iinfo->symlink, symlen, GFP_NOFS);
 			if (!sym)
 				goto out;
-			memcpy(sym, iinfo->symlink, symlen);
-			sym[symlen] = 0;
 
 			spin_lock(&ci->i_ceph_lock);
 			if (!ci->i_symlink)
-- 
1.7.5.4
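[The reason WARN_ON() can sit directly in the if-condition is that the generic kernel macro both reports and returns the condition's truth value; roughly, simplified from include/asm-generic/bug.h:

#define WARN_ON(condition) ({				\
	int __ret_warn_on = !!(condition);		\
	if (unlikely(__ret_warn_on))			\
		__WARN();				\
	unlikely(__ret_warn_on);			\
})

So "if (WARN_ON(symlen != inode->i_size)) goto out;" logs a stack trace on a mismatch and then takes the -EINVAL error path instead of panicking.]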
Re: [PATCH RESEND] ceph: avoid panic with mismatched symlink sizes in fill_inode()
On Fri, 2012-02-03 at 15:49 -0500, Xi Wang wrote:

On Feb 3, 2012, at 3:24 PM, Alex Elder wrote:
Looks good, though it might be good to at least call WARN_ON(). What do you think?

Sounds good to me. I will send a v2. Thanks.

- xi

No need. I can do it for you.

-Alex
Re: how to remove laggy mds ?
On Fri, Feb 3, 2012 at 11:48, Jens Rehpöhler <j...@shadow.gt.owl.de> wrote:

one of my test mds servers died a few days ago. (hardware crash) I will not buy a new one. Is there any chance to remove this laggy mds?

Start the mds daemon somewhere. If you don't want to fix/replace the hardware, well, start it somewhere else. It has no local state.

Decreasing the number of active mdses is not currently supported.
Re: how to remove laggy mds ?
On Fri, Feb 3, 2012 at 1:19 PM, Tommi Virtanen <tommi.virta...@dreamhost.com> wrote:

On Fri, Feb 3, 2012 at 11:48, Jens Rehpöhler <j...@shadow.gt.owl.de> wrote:
one of my test mds servers died a few days ago. (hardware crash) I will not buy a new one. Is there any chance to remove this laggy mds?

Start the mds daemon somewhere. If you don't want to fix/replace the hardware, well, start it somewhere else. It has no local state. Decreasing the number of active mdses is not currently supported.

Actually, it should work, if you have a daemon running for that MDS. Once it's running, execute "ceph mds stop 1" and it should export authority to the remaining MDS and shut down cleanly.

-Greg
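[Putting the two answers together, a recovery sketch; the daemon id, host name, and conf path are placeholders:

# start a replacement daemon for the dead rank on any surviving host;
# the MDS keeps no local state, so nothing needs to be copied over
ssh surviving-host 'ceph-mds -i 1 -c /etc/ceph/ceph.conf'

# once it is up and the rank has recovered, retire it cleanly
ceph mds stop 1
]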
Re: couldn't use rbd
Hi Josh,

Thank you for your comments.

debug osd = 20
debug ms = 1
debug filestore = 20

I added this to the osd section of ceph.conf and ran /etc/init.d/ceph stop and start. The output of the OSD log when 'rbd list' was executed is below.
-
2012-02-04 04:29:22.457990 7fe0e08fb710 osd.0 32 OSD::ms_verify_authorizer name=client.admin auid=0
2012-02-04 04:29:22.458041 7fe0e08fb710 osd.0 32 new session 0x24f5240 con=0x24d4dc0 addr=10.68.119.191:0/1005110
2012-02-04 04:29:22.458069 7fe0e08fb710 osd.0 32 session 0x24f5240 has caps osdcaps(pools={} default allow= default_deny=)
2012-02-04 04:29:22.458415 7fe0e6c0a710 -- 10.68.119.191:6801/4992 <== client.4201 10.68.119.191:0/1005110 1 ==== osd_op(client.4201.0:1 rbd_directory [read 0~0] 2.30a98c1c) v3 ==== 143+0+0 (3720164172 0 0) 0x24d8900 con 0x24d4dc0
2012-02-04 04:29:22.458442 7fe0e6c0a710 osd.0 32 _dispatch 0x24d8900 osd_op(client.4201.0:1 rbd_directory [read 0~0] 2.30a98c1c) v3
2012-02-04 04:29:22.458463 7fe0e6c0a710 osd.0 32 require_same_or_newer_map 32 (i am 32) 0x24d8900
2012-02-04 04:29:22.458487 7fe0e6c0a710 osd.0 32 _share_map_incoming client.4201 10.68.119.191:0/1005110 32
2012-02-04 04:29:22.458507 7fe0e6c0a710 osd.0 32 hit non-existent pg 2.0, waiting
2012-02-04 04:29:22.641501 7fe0edc18710 osd.0 32 tick
2012-02-04 04:29:22.641695 7fe0edc18710 osd.0 32 scrub_should_schedule loadavg 0 max 0.5 = no, randomly backing off
2012-02-04 04:29:22.718517 7fe0e12fe710 osd.0 32 update_osd_stat osd_stat(1400 KB used, 15357 MB avail, 15358 MB total, peers []/[])
2012-02-04 04:29:22.718606 7fe0e12fe710 osd.0 32 heartbeat: osd_stat(1400 KB used, 15357 MB avail, 15358 MB total, peers []/[])
2012-02-04 04:29:23.642595 7fe0edc18710 osd.0 32 tick
2012-02-04 04:29:23.642783 7fe0edc18710 osd.0 32 scrub_should_schedule loadavg 0 max 0.5 = no, randomly backing off
2012-02-04 04:29:23.719789 7fe0e12fe710 osd.0 32 update_osd_stat osd_stat(1400 KB used, 15357 MB avail, 15358 MB total, peers []/[])
2012-02-04 04:29:23.719861 7fe0e12fe710 osd.0 32 heartbeat: osd_stat(1400 KB used, 15357 MB avail, 15358 MB total, peers []/[])
2012-02-04 04:29:24.620713 7fe0e12fe710 osd.0 32 update_osd_stat osd_stat(1400 KB used, 15357 MB avail, 15358 MB total, peers []/[])
2012-02-04 04:29:24.620789 7fe0e12fe710 osd.0 32 heartbeat: osd_stat(1400 KB used, 15357 MB avail, 15358 MB total, peers []/[])
2012-02-04 04:29:24.643651 7fe0edc18710 osd.0 32 tick
2012-02-04 04:29:24.643830 7fe0edc18710 osd.0 32 scrub_should_schedule loadavg 0 max 0.5 = yes
2012-02-04 04:29:24.643845 7fe0edc18710 osd.0 32 sched_scrub
2012-02-04 04:29:24.643867 7fe0edc18710 osd.0 32 0.0p0 at 2012-02-03 10:25:51.380648 2012-02-03 04:29:24.643851 (86400 seconds ago)
2012-02-04 04:29:24.643875 7fe0edc18710 osd.0 32 sched_scrub done
2012-02-04 04:29:24.709018 7fe0e6c0a710 -- 10.68.119.191:6801/4992 <== mds.0 10.68.119.191:6800/4916 444 ==== ping v1 ==== 0+0+0 (0 0 0) 0x251b600 con 0x2500640
2012-02-04 04:29:24.709091 7fe0e6c0a710 osd.0 32 _dispatch 0x251b600 ping v1
2012-02-04 04:29:24.709102 7fe0e6c0a710 osd.0 32 ping from mds.0
2012-02-04 04:29:25.321771 7fe0e12fe710 osd.0 32 update_osd_stat osd_stat(1400 KB used, 15357 MB avail, 15358 MB total, peers []/[])
2012-02-04 04:29:25.321849 7fe0e12fe710 osd.0 32 heartbeat: osd_stat(1400 KB used, 15357 MB avail, 15358 MB total, peers []/[])
2012-02-04 04:29:25.644596 7fe0edc18710 osd.0 32 tick
2012-02-04 04:29:25.644812 7fe0edc18710 osd.0 32 scrub_should_schedule loadavg 0 max 0.5 = no, randomly backing off
2012-02-04 04:29:25.644832 7fe0edc18710 osd.0 32 do_mon_report
2012-02-04 04:29:25.644843 7fe0edc18710 osd.0 32 send_alive up_thru currently 29 want 29
2012-02-04 04:29:25.644853 7fe0edc18710 osd.0 32 send_pg_stats
2012-02-04 04:29:25.965641 7fe0ea411710 filestore(/data/osd0) sync_entry woke after 5.000199
2012-02-04 04:29:25.965748 7fe0ea411710 filestore(/data/osd0) sync_entry waiting for max_interval 5.00
2012-02-04 04:29:26.522826 7fe0e12fe710 osd.0 32 update_osd_stat osd_stat(1400 KB used, 15357 MB avail, 15358 MB total, peers []/[])
2012-02-04 04:29:26.522899 7fe0e12fe710 osd.0 32 heartbeat: osd_stat(1400 KB used, 15357 MB avail, 15358 MB total, peers []/[])
2012-02-04 04:29:26.645739 7fe0edc18710 osd.0 32 tick
2012-02-04 04:29:26.645943 7fe0edc18710 osd.0 32 scrub_should_schedule loadavg 0 max 0.5 = no, randomly backing off
2012-02-04 04:29:27.646788 7fe0edc18710 osd.0 32 tick
2012-02-04 04:29:27.646982 7fe0edc18710 osd.0 32 scrub_should_schedule loadavg 0 max 0.5 = yes
2012-02-04 04:29:27.646997 7fe0edc18710 osd.0 32 sched_scrub
2012-02-04 04:29:27.647018 7fe0edc18710 osd.0 32 0.0p0 at 2012-02-03 10:25:51.380648 2012-02-03 04:29:27.647003 (86400 seconds ago)
2012-02-04 04:29:27.647027 7fe0edc18710 osd.0 32 sched_scrub done
2012-02-04 04:29:27.723893 7fe0e12fe710 osd.0 32 update_osd_stat osd_stat(1400 KB used, 15357 MB avail, 15358 MB total, peers []/[])
2012-02-04 04:29:27.723978 7fe0e12fe710 osd.0 32 heartbeat: osd_stat(1400 KB used, 15357 MB
Re: couldn't use rbd
On 02/03/2012 02:14 PM, Masuko Tomoya wrote:

Hi Josh,

Thank you for your comments.

debug osd = 20
debug ms = 1
debug filestore = 20

I added this to the osd section of ceph.conf and ran /etc/init.d/ceph stop and start. The output of the OSD log when 'rbd list' was executed is below.
-
2012-02-04 04:29:22.457990 7fe0e08fb710 osd.0 32 OSD::ms_verify_authorizer name=client.admin auid=0
2012-02-04 04:29:22.458041 7fe0e08fb710 osd.0 32 new session 0x24f5240 con=0x24d4dc0 addr=10.68.119.191:0/1005110
2012-02-04 04:29:22.458069 7fe0e08fb710 osd.0 32 session 0x24f5240 has caps osdcaps(pools={} default allow= default_deny=)
2012-02-04 04:29:22.458415 7fe0e6c0a710 -- 10.68.119.191:6801/4992 <== client.4201 10.68.119.191:0/1005110 1 ==== osd_op(client.4201.0:1 rbd_directory [read 0~0] 2.30a98c1c) v3 ==== 143+0+0 (3720164172 0 0) 0x24d8900 con 0x24d4dc0
2012-02-04 04:29:22.458442 7fe0e6c0a710 osd.0 32 _dispatch 0x24d8900 osd_op(client.4201.0:1 rbd_directory [read 0~0] 2.30a98c1c) v3
2012-02-04 04:29:22.458463 7fe0e6c0a710 osd.0 32 require_same_or_newer_map 32 (i am 32) 0x24d8900
2012-02-04 04:29:22.458487 7fe0e6c0a710 osd.0 32 _share_map_incoming client.4201 10.68.119.191:0/1005110 32
2012-02-04 04:29:22.458507 7fe0e6c0a710 osd.0 32 hit non-existent pg 2.0, waiting

The pg should have been created already. What's the output of 'ceph pg dump'?
Re: couldn't use rbd
Hi,

The output of 'ceph pg dump' is below.

root@ceph01:~# ceph pg dump
2012-02-04 07:50:15.453151 mon <- [pg,dump]
2012-02-04 07:50:15.453734 mon.0 -> 'dumped all in format plain' (0)
version 63
last_osdmap_epoch 37
last_pg_scan 1
full_ratio 0.95
nearfull_ratio 0.85
pg_stat  objects  mip  degr  unf  kb  bytes  log  disklog  state                  v    reported  up   acting  last_scrub
1.1p0    0        0    0     0    0   0      0    0        active+clean+degraded  0'0  34'30     [0]  [0]     0'0  2012-02-03 10:25:55.383343
0.0p0    0        0    0     0    0   0      0    0        active+clean+degraded  0'0  34'32     [0]  [0]     0'0  2012-02-03 10:25:51.380648
1.0p0    0        0    0     0    0   0      0    0        active+clean+degraded  0'0  34'30     [0]  [0]     0'0  2012-02-03 10:25:53.381291
0.1p0    0        0    0     0    0   0      0    0        active+clean+degraded  0'0  34'32     [0]  [0]     0'0  2012-02-03 10:25:52.380881
2.0p0    0        0    0     0    0   0      0    0        active+clean+degraded  0'0  34'30     [0]  [0]     0'0  2012-02-03 10:25:59.387441
2.1p0    0        0    0     0    0   0      0    0        active+clean+degraded  0'0  34'30     [0]  [0]     0'0  2012-02-03 10:26:04.392778
pool 0   0        0    0     0    0   0      0    0
pool 1   0        0    0     0    0   0      0    0
pool 2   0        0    0     0    0   0      0    0
sum      0        0    0     0    0   0      0    0
osdstat  kbused   kbavail   kb        hb in  hb out
0        1568     15726032  15727600  []     []
sum      1568     15726032  15727600

2012/2/3 Josh Durgin <josh.dur...@dreamhost.com>:
The pg should have been created already. What's the output of 'ceph pg dump'?

snip
Re: couldn't use rbd
On 02/03/2012 02:54 PM, Masuko Tomoya wrote:

Hi,

The output of 'ceph pg dump' is below.

snip

You hit a bug in 0.38 that made the default crushmap for one osd contain no pgs. This was fixed in 0.39, so I'd suggest upgrading and re-running mkcephfs.
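[A sketch of the suggested fix. The package source and init-script flags are assumptions, and re-running mkcephfs recreates the cluster from scratch and destroys any existing data — which is only acceptable here because the pg dump shows the cluster is empty:

/etc/init.d/ceph -a stop
apt-get update && apt-get install ceph    # pick up a release >= 0.39
mkcephfs -a -c /etc/ceph/ceph.conf        # recreate the cluster and maps
/etc/init.d/ceph -a start
]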
[snipped: older quoted messages repeating the 'ceph -s' and 'rbd ls' output shown earlier in this thread]
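If you want to verify this diagnosis before rebuilding, the crushmap can be decompiled and read directly. A rough sketch (assuming crushtool is installed alongside the ceph packages; output format varies by version):

root@ceph01:~# ceph osd getcrushmap -o /tmp/crushmap.bin        # fetch the compiled crushmap from the monitors
root@ceph01:~# crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt   # decompile it to text
root@ceph01:~# cat /tmp/crushmap.txt                            # the rules should 'take' a bucket that actually contains osd.0

If the decompiled rules select an empty bucket, no pg can map to osd.0, which matches the 'hit non-existent pg 2.0' behaviour in the osd log above.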
Re: couldn't use rbd
Hi,

I upgraded ceph to 0.41 and re-ran mkcephfs. I found that my issue is fixed.

---
root@ceph01:~# rbd list
pool rbd doesn't contain rbd images
root@ceph01:~# rbd create test --size 1024
root@ceph01:~# rbd list
test
---

Josh, thank you for your advice.

2012/2/3 Josh Durgin josh.dur...@dreamhost.com:
[snipped: quoted thread history, identical to the messages above]
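For reference, the fix can also be confirmed at the rados level; a short sketch, assuming the old format-1 rbd layout of this era (one header object per image plus the shared rbd_directory object):

root@ceph01:~# rados -p rbd ls   # should list rbd_directory, and test.rbd after the create
root@ceph01:~# rbd info test     # prints the size, order and block name prefix of the image

Being able to read rbd_directory is exactly what the hanging 'rbd ls' earlier in the thread could not do.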
about attaching rbd volume from instance on KVM
Hi, all. I'm trying to attach an rbd volume to an instance on KVM, but I have a problem. Could you help me?

---
I tried to attach the rbd volume on ceph01 to an instance on compute1 with the virsh command.

root@compute1:~# virsh attach-device test-ub16 /root/testvolume.xml
error: Failed to attach device from /root/testvolume.xml
error: cannot resolve symlink rbd/testvolume: No such file or directory

/var/log/messages:
Feb 3 20:14:48 compute1 libvirtd: 20:14:48.717: 3234: error : qemuMonitorTextAddDevice:2417 : operation failed: adding virtio-blk-pci,bus=pci.0,addr=0x9,drive=drive-virtio-disk4,id=virtio-disk4 device failed: Device needs media, but drive is empty#015#012Device 'virtio-blk-pci' could not be initialized#015#012
Feb 3 20:14:48 compute1 libvirtd: 20:14:48.717: 3234: warning : qemuDomainAttachPciDiskDevice:188 : qemuMonitorAddDevice failed on file=rbd:rbd/testvolume,if=none,id=drive-virtio-disk4,format=raw (virtio-blk-pci,bus=pci.0,addr=0x9,drive=drive-virtio-disk4,id=virtio-disk4)
Feb 3 20:14:48 compute1 libvirtd: 20:14:48.717: 3234: error : virSecurityDACRestoreSecurityFileLabel:143 : cannot resolve symlink rbd/testvolume: No such file or directory
Feb 3 20:14:48 compute1 libvirtd: 20:14:48.717: 3234: warning : qemuDomainAttachPciDiskDevice:229 : Unable to restore security label on rbd/testvolume

There is no corresponding log entry in /var/log/ceph/mon.0.log on host ceph01.
---

My environment is below.

*There are two servers. All servers are Ubuntu 10.10 x86_64.
*ceph01: a single server configured with ceph (version: 0.41-1maverick)
*compute1: KVM hypervisor
 - librados2 and librbd1 packages are installed (version: 0.41-1maverick)
 - qemu-kvm is 0.14.0-rc1. I built qemu with rbd enabled; the output of 'qemu-img' shows 'rbd' in the supported formats field. (I built qemu referring to this page: http://ceph.newdream.net/wiki/QEMU-RBD)
 - apparmor is disabled.
 - libvirt is 0.8.8
 - there is a ceph.conf on compute1:

root@compute1:~# ls -l /etc/ceph/
total 20
-rw-r--r-- 1 root root 508 2012-02-03 14:38 ceph.conf
-rw------- 1 root root  63 2012-02-03 17:04 keyring.admin
-rw------- 1 root root  63 2012-02-03 14:38 keyring.bin
-rw------- 1 root root  56 2012-02-03 14:38 keyring.mds.0
-rw------- 1 root root  56 2012-02-03 14:38 keyring.osd.0

 - the contents of ceph.conf:

root@compute1:~# cat /etc/ceph/ceph.conf
[global]
        auth supported = cephx
        keyring = /etc/ceph/keyring.bin
[mon]
        mon data = /data/data/mon$id
        debug ms = 1
[mon.0]
        host = ceph01
        mon addr = 10.68.119.191:6789
[mds]
        keyring = /etc/ceph/keyring.$name
[mds.0]
        host = ceph01
[osd]
        keyring = /etc/ceph/keyring.$name
        osd data = /data/osd$id
        osd journal = /data/osd$id/journal
        osd journal size = 512
        osd class tmp = /var/lib/ceph/tmp
        debug osd = 20
        debug ms = 1
        debug filestore = 20
[osd.0]
        host = ceph01
        btrfs devs = /dev/sdb1

*content of keyring.admin:

root@compute1:~# cat /etc/ceph/keyring.admin
[client.admin]
        key = AQDFeCxPyBlNIRAAxS1DcRHpMXRpcjY/GNMwYg==

*output of 'ceph auth list':

root@ceph01:/etc/ceph# ceph auth list
2012-02-03 20:34:59.507451 mon <- [auth,list]
2012-02-03 20:34:59.508785 mon.0 -> 'installed auth entries:
mon.
        key: AQDFeCxPiK04IxAAslDBNkrOGKWxcbCh2iysqg==
mds.0
        key: AQDFeCxPsJ+LGhAAJ3/rmkAtGXSv/eHh0yXgww==
        caps: [mds] allow
        caps: [mon] allow rwx
        caps: [osd] allow *
osd.0
        key: AQDFeCxPoEK+ExAAecD7+tWgpIRoZx2AT7Jwbg==
        caps: [mon] allow rwx
        caps: [osd] allow *
client.admin
        key: AQDFeCxPyBlNIRAAxS1DcRHpMXRpcjY/GNMwYg==
        caps: [mds] allow
        caps: [mon] allow *
        caps: [osd] allow *
' (0)

*xml file is below.
root@compute1:~# cat /root/testvolume.xml
<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <source protocol='rbd' name='rbd/testvolume'>
    <host name='10.68.119.191' port='6789'/>
  </source>
  <target dev='vde' bus='virtio'/>
</disk>

*testvolume is in the rados pool:

root@compute1:~# qemu-img info rbd:rbd/testvolume
image: rbd:rbd/testvolume
file format: raw
virtual size: 1.0G (1073741824 bytes)
disk size: unavailable

Waiting for a reply,
Tomoya.
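Two hedged observations on these logs, offered as guesses rather than a confirmed diagnosis. First, the 'cannot resolve symlink rbd/testvolume' messages come from libvirt's DAC security driver, which appears to be treating the network-disk source string as a local file path. Second, qemu's 'Device needs media, but drive is empty' usually means the rbd drive could not be opened at all; since auth is cephx and the keyring files above are mode 0600 root, it is worth checking whether the user qemu runs as can read /etc/ceph/ceph.conf and the keyring it points at. A sketch for a test setup only (paths and option names assume a stock Ubuntu libvirt install):

root@compute1:~# grep -E '^(user|group|security_driver)' /etc/libvirt/qemu.conf
# if guests run as a non-root user, give that user read access to the
# keyring named in ceph.conf (test setup only - this weakens security):
root@compute1:~# chmod 644 /etc/ceph/keyring.bin
# as a further test, the DAC security driver can be switched off:
root@compute1:~# echo 'security_driver = "none"' >> /etc/libvirt/qemu.conf
root@compute1:~# /etc/init.d/libvirt-bin restart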