Just some more info -- this also happens when I restart an OSD that *was* working -- it won't start back up.
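Before pasting the mon log below: for reference, these are the checks I've been running after each failed start, in case it helps someone spot what I'm missing (run from an admin node with the client.admin keyring; osd.13 is just the example id):

  ceph osd tree                      # is the OSD still in the crush map, and under which host bucket?
  ceph osd dump | grep osd.13        # does the osdmap still have an entry/uuid for it?
  ceph auth get osd.13               # does the cephx entity osd.13 still exist?
  systemctl status ceph-osd@13       # what systemd thinks the daemon is doing
  journalctl -u ceph-osd@13          # the daemon's own startup errors

If the key has simply gone missing, I assume it could be re-imported from the OSD's local keyring with something like:

  ceph auth add osd.13 osd 'allow *' mon 'allow profile osd' -i /var/lib/ceph/osd/ceph-13/keyring

but I haven't tried that yet.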
In the mon log I have the following entries (which correspond to the OSDs that I've been trying to start). osd.13 was working just now, before I stopped the service and tried to start it again.

2017-07-25 14:42:49.249076 7f2386806700 0 cephx server osd.10: couldn't find entity name: osd.10
2017-07-25 14:43:24.323603 7f2386806700 0 cephx server osd.13: couldn't find entity name: osd.13
2017-07-25 14:43:25.033487 7f2386806700 0 cephx server osd.7: couldn't find entity name: osd.7

Still reading and learning. (For reference, the exact commands behind the migration steps and the checks I plan to run next are at the bottom of this mail.)

On Tue, Jul 25, 2017 at 2:38 PM, Daniel K <satha...@gmail.com> wrote:

> Update to this -- I tried building a new host and a new OSD, new disk, and I am having the same issue.
>
> I set osd debug level to 10 -- the issue looks like it's coming from a mon daemon. Still trying to learn enough about the internals of ceph to understand what's happening here.
>
> Relevant debug logs (I think):
>
> 2017-07-25 14:21:58.889016 7f25a88af700 1 -- 10.0.15.142:6800/16150 <== mon.1 10.0.15.51:6789/0 1 ==== mon_map magic: 0 v1 ==== 541+0+0 (2831459213 0 0) 0x556640ecd900 con 0x556641949800
> 2017-07-25 14:21:58.889109 7f25a88af700 1 -- 10.0.15.142:6800/16150 <== mon.1 10.0.15.51:6789/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 (248727397 0 0) 0x556640ecdb80 con 0x556641949800
> 2017-07-25 14:21:58.889204 7f25a88af700 1 -- 10.0.15.142:6800/16150 --> 10.0.15.51:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- 0x556640ecd400 con 0
> 2017-07-25 14:21:58.889966 7f25a88af700 1 -- 10.0.15.142:6800/16150 <== mon.1 10.0.15.51:6789/0 3 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 206+0+0 (3141870879 0 0) 0x556640ecd400 con 0x556641949800
> 2017-07-25 14:21:58.890066 7f25a88af700 1 -- 10.0.15.142:6800/16150 --> 10.0.15.51:6789/0 -- auth(proto 2 165 bytes epoch 0) v1 -- 0x556640ecdb80 con 0
> 2017-07-25 14:21:58.890759 7f25a88af700 1 -- 10.0.15.142:6800/16150 <== mon.1 10.0.15.51:6789/0 4 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 564+0+0 (1715764650 0 0) 0x556640ecdb80 con 0x556641949800
> 2017-07-25 14:21:58.890871 7f25a88af700 1 -- 10.0.15.142:6800/16150 --> 10.0.15.51:6789/0 -- mon_subscribe({monmap=0+}) v2 -- 0x556640e77680 con 0
> 2017-07-25 14:21:58.890901 7f25a88af700 1 -- 10.0.15.142:6800/16150 --> 10.0.15.51:6789/0 -- auth(proto 2 2 bytes epoch 0) v1 -- 0x556640ecd400 con 0
> 2017-07-25 14:21:58.891494 7f25a88af700 1 -- 10.0.15.142:6800/16150 <== mon.1 10.0.15.51:6789/0 5 ==== mon_map magic: 0 v1 ==== 541+0+0 (2831459213 0 0) 0x556640ecde00 con 0x556641949800
> 2017-07-25 14:21:58.891555 7f25a88af700 1 -- 10.0.15.142:6800/16150 <== mon.1 10.0.15.51:6789/0 6 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 194+0+0 (1036670921 0 0) 0x556640ece080 con 0x556641949800
> 2017-07-25 14:21:58.892003 7f25b5e71c80 10 osd.7 0 mon_cmd_maybe_osd_create cmd: {"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["7"]}
> 2017-07-25 14:21:58.892039 7f25b5e71c80 1 -- 10.0.15.142:6800/16150 --> 10.0.15.51:6789/0 -- mon_command({"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["7"]} v 0) v1 -- 0x556640e78d00 con 0
> *2017-07-25 14:21:58.894596 7f25a88af700 1 -- 10.0.15.142:6800/16150 <== mon.1 10.0.15.51:6789/0 7 ==== mon_command_ack([{"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["7"]}]=-2 (2) No such file or directory v10406) v1 ==== 133+0+0 (3400959855 0 0) 0x556640ece300 con 0x556641949800*
> 2017-07-25 14:21:58.894797 7f25b5e71c80 1 -- 10.0.15.142:6800/16150 --> 10.0.15.51:6789/0 -- mon_command({"prefix": "osd create", "id": 7, "uuid": "92445e4f-850e-453b-b5ab-569d1414f72d"} v 0) v1 -- 0x556640e79180 con 0
> 2017-07-25 14:21:58.896301 7f25a88af700 1 -- 10.0.15.142:6800/16150 <== mon.1 10.0.15.51:6789/0 8 ==== mon_command_ack([{"prefix": "osd create", "id": 7, "uuid": "92445e4f-850e-453b-b5ab-569d1414f72d"}]=0 v10406) v1 ==== 115+0+2 (2540205126 0 1371665406) 0x556640ece580 con 0x556641949800
> 2017-07-25 14:21:58.896473 7f25b5e71c80 10 osd.7 0 mon_cmd_maybe_osd_create cmd: {"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["7"]}
> 2017-07-25 14:21:58.896516 7f25b5e71c80 1 -- 10.0.15.142:6800/16150 --> 10.0.15.51:6789/0 -- mon_command({"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["7"]} v 0) v1 -- 0x556640e793c0 con 0
> *2017-07-25 14:21:58.898180 7f25a88af700 1 -- 10.0.15.142:6800/16150 <== mon.1 10.0.15.51:6789/0 9 ==== mon_command_ack([{"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["7"]}]=-2 (2) No such file or directory v10406) v1 ==== 133+0+0 (3400959855 0 0) 0x556640ecd900 con 0x556641949800*
> *2017-07-25 14:21:58.898276 7f25b5e71c80 -1 osd.7 0 mon_cmd_maybe_osd_create fail: '(2) No such file or directory': (2) No such file or directory*
> 2017-07-25 14:21:58.898380 7f25b5e71c80 1 -- 10.0.15.142:6800/16150 >> 10.0.15.51:6789/0 conn(0x556641949800 :-1 s=STATE_OPEN pgs=367879 cs=1 l=1).mark_down
>
> On Mon, Jul 24, 2017 at 1:33 PM, Daniel K <satha...@gmail.com> wrote:
>
>> List --
>>
>> I have a 4-node cluster running on baremetal and have a need to use the kernel client on 2 nodes. As I read you should not run the kernel client on a node that runs an OSD daemon, I decided to move the OSD daemons into a VM on the same device.
>>
>> Original host is stor-vm2 (bare metal), new host is stor-vm2a (virtual).
>>
>> All went well -- I did these steps (for each OSD, 5 total per host):
>>
>> - setup the VM
>> - install the OS
>> - installed ceph (using ceph-deploy)
>> - set noout
>> - stopped ceph osd on bare metal host
>> - unmount /dev/sdb1 from /var/lib/ceph/osd/ceph-0
>> - add /dev/sdb to the VM
>> - ceph detected the osd and started automatically
>> - moved VM host to the same bucket as physical host in crushmap
>>
>> I did this for each OSD, and despite some recovery IO because of the updated crushmap, all OSDs were up.
>>
>> I rebooted the physical host, which rebooted the VM, and now the OSDs are refusing to start.
>>
>> I've tried moving them back to the bare metal host with the same results.
>>
>> Any ideas?
>>
>> Here are what seem to be the relevant osd log lines:
>>
>> 2017-07-24 13:21:53.561265 7faf1752fc80 0 osd.10 8854 crush map has features 2200130813952, adjusting msgr requires for clients
>> 2017-07-24 13:21:53.561284 7faf1752fc80 0 osd.10 8854 crush map has features 2200130813952 was 8705, adjusting msgr requires for mons
>> 2017-07-24 13:21:53.561298 7faf1752fc80 0 osd.10 8854 crush map has features 720578140510109696, adjusting msgr requires for osds
>> 2017-07-24 13:21:55.626834 7faf1752fc80 0 osd.10 8854 load_pgs
>> 2017-07-24 13:22:20.970222 7faf1752fc80 0 osd.10 8854 load_pgs opened 536 pgs
>> 2017-07-24 13:22:20.972659 7faf1752fc80 0 osd.10 8854 using weightedpriority op queue with priority op cut off at 64.
>> 2017-07-24 13:22:20.976861 7faf1752fc80 -1 osd.10 8854 log_to_monitors {default=true}
>> 2017-07-24 13:22:20.998233 7faf1752fc80 -1 osd.10 8854 mon_cmd_maybe_osd_create fail: '(2) No such file or directory': (2) No such file or directory
>> 2017-07-24 13:22:20.999165 7faf1752fc80 1 bluestore(/var/lib/ceph/osd/ceph-10) umount
>> 2017-07-24 13:22:21.016146 7faf1752fc80 1 freelist shutdown
>> 2017-07-24 13:22:21.016243 7faf1752fc80 4 rocksdb: [/build/ceph-12.1.1/src/rocksdb/db/db_impl.cc:217] Shutdown: canceling all background work
>> 2017-07-24 13:22:21.020440 7faf1752fc80 4 rocksdb: [/build/ceph-12.1.1/src/rocksdb/db/db_impl.cc:343] Shutdown complete
>> 2017-07-24 13:22:21.274481 7faf1752fc80 1 bluefs umount
>> 2017-07-24 13:22:21.275822 7faf1752fc80 1 bdev(0x558bb1f82d80 /var/lib/ceph/osd/ceph-10/block) close
>> 2017-07-24 13:22:21.485226 7faf1752fc80 1 bdev(0x558bb1f82b40 /var/lib/ceph/osd/ceph-10/block) close
>> 2017-07-24 13:22:21.551009 7faf1752fc80 -1 ** ERROR: osd init failed: (2) No such file or directory
>> 2017-07-24 13:22:21.563567 7faf1752fc80 -1 /build/ceph-12.1.1/src/common/HeartbeatMap.cc: In function 'ceph::HeartbeatMap::~HeartbeatMap()' thread 7faf1752fc80 time 2017-07-24 13:22:21.558275
>> /build/ceph-12.1.1/src/common/HeartbeatMap.cc: 39: FAILED assert(m_workers.empty())
>>
>> ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x558ba6ba6b72]
>> 2: (()+0xb81cf1) [0x558ba6cc0cf1]
>> 3: (CephContext::~CephContext()+0x4d9) [0x558ba6ca77b9]
>> 4: (CephContext::put()+0xe6) [0x558ba6ca7ab6]
>> 5: (main()+0x563) [0x558ba650df73]
>> 6: (__libc_start_main()+0xf0) [0x7faf14999830]
>> 7: (_start()+0x29) [0x558ba6597cf9]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>> --- begin dump of recent events ---
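For completeness, the commands behind the migration steps quoted above were roughly the following, sketched from memory (osd.0 / /dev/sdb as the example, stor-vm2 / stor-vm2a as the host buckets; the disk attach itself was done through the hypervisor, so that line is only a placeholder, and root=default is my guess at the bucket the physical host sat in):

  ceph osd set noout                          # on an admin node, before touching anything
  systemctl stop ceph-osd@0                   # on the bare-metal host stor-vm2
  umount /var/lib/ceph/osd/ceph-0             # release the data partition
  # attach /dev/sdb to the stor-vm2a VM via the hypervisor; once the VM saw the disk,
  # the ceph-disk/udev activation mounted it and started the OSD on its own
  ceph osd crush move stor-vm2a root=default  # put the VM host bucket alongside the physical host
  ceph osd unset noout                        # once everything was back up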
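Given that the failing step in the debug log is the "osd crush set-device-class" mon command coming back with ENOENT, this is what I'm planning to look at next (again just a sketch of the checks, not a fix; osd.7 and /var/lib/ceph/osd/ceph-7 are the example id/path from the log above):

  ceph osd ls                                # which osd ids the cluster currently knows
  ceph osd dump | grep osd.7                 # the uuid and state recorded for osd.7, if any
  ceph auth get osd.7                        # lines up with the "couldn't find entity name" mon errors
  ceph osd crush class ls                    # device classes present in the crush map (new in luminous)
  ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-7/block
                                             # fsid / whoami / osd uuid stored on the disk itself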