Just some more info -- this also happens when I restart an OSD that *was*
working -- it won't start back up.

In the mon log I have the following entries (which correspond to the OSDs
that I've been trying to start). osd.13 was working just now, before I
stopped the service and tried to start it again.

2017-07-25 14:42:49.249076 7f2386806700  0 cephx server osd.10: couldn't
find entity name: osd.10
2017-07-25 14:43:24.323603 7f2386806700  0 cephx server osd.13: couldn't
find entity name: osd.13
2017-07-25 14:43:25.033487 7f2386806700  0 cephx server osd.7: couldn't
find entity name: osd.7
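
If I'm reading this right, the mon no longer has auth entries for those
OSDs at all. I plan to check (and, if they're gone, re-import the keys)
with something like the following -- default keyring path and the usual
OSD caps, so adjust if yours differ:

# does the mon still know this entity?
ceph auth get osd.13
# if not, re-import the key from the OSD's own keyring
ceph auth add osd.13 osd 'allow *' mon 'allow profile osd' \
    -i /var/lib/ceph/osd/ceph-13/keyring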



Still reading and learning.


On Tue, Jul 25, 2017 at 2:38 PM, Daniel K <satha...@gmail.com> wrote:

> Update to this -- I tried building a new host and a new OSD with a new
> disk, and I am having the same issue.
>
>
>
> I set the OSD debug level to 10 -- the issue looks like it's coming from a
> mon daemon. I'm still trying to learn enough about the internals of Ceph to
> understand what's happening here.
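>
> For reference, the debug level was bumped roughly like this in ceph.conf
> on the OSD host, followed by a daemon restart (exact section/placement
> from memory, so treat it as a sketch):
>
> [osd]
>     debug osd = 10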
>
> Relevant debug logs (I think):
>
>
> 2017-07-25 14:21:58.889016 7f25a88af700  1 -- 10.0.15.142:6800/16150 <==
> mon.1 10.0.15.51:6789/0 1 ==== mon_map magic: 0 v1 ==== 541+0+0
> (2831459213 0 0) 0x556640ecd900 con 0x556641949800
> 2017-07-25 14:21:58.889109 7f25a88af700  1 -- 10.0.15.142:6800/16150 <==
> mon.1 10.0.15.51:6789/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ====
> 33+0+0 (248727397 0 0) 0x556640ecdb80 con 0x556641949800
> 2017-07-25 14:21:58.889204 7f25a88af700  1 -- 10.0.15.142:6800/16150 -->
> 10.0.15.51:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- 0x556640ecd400
> con 0
> 2017-07-25 14:21:58.889966 7f25a88af700  1 -- 10.0.15.142:6800/16150 <==
> mon.1 10.0.15.51:6789/0 3 ==== auth_reply(proto 2 0 (0) Success) v1 ====
> 206+0+0 (3141870879 0 0) 0x556640ecd400 con 0x556641949800
> 2017-07-25 14:21:58.890066 7f25a88af700  1 -- 10.0.15.142:6800/16150 -->
> 10.0.15.51:6789/0 -- auth(proto 2 165 bytes epoch 0) v1 -- 0x556640ecdb80
> con 0
> 2017-07-25 14:21:58.890759 7f25a88af700  1 -- 10.0.15.142:6800/16150 <==
> mon.1 10.0.15.51:6789/0 4 ==== auth_reply(proto 2 0 (0) Success) v1 ====
> 564+0+0 (1715764650 0 0) 0x556640ecdb80 con 0x556641949800
> 2017-07-25 14:21:58.890871 7f25a88af700  1 -- 10.0.15.142:6800/16150 -->
> 10.0.15.51:6789/0 -- mon_subscribe({monmap=0+}) v2 -- 0x556640e77680 con 0
> 2017-07-25 14:21:58.890901 7f25a88af700  1 -- 10.0.15.142:6800/16150 -->
> 10.0.15.51:6789/0 -- auth(proto 2 2 bytes epoch 0) v1 -- 0x556640ecd400
> con 0
> 2017-07-25 14:21:58.891494 7f25a88af700  1 -- 10.0.15.142:6800/16150 <==
> mon.1 10.0.15.51:6789/0 5 ==== mon_map magic: 0 v1 ==== 541+0+0
> (2831459213 0 0) 0x556640ecde00 con 0x556641949800
> 2017-07-25 14:21:58.891555 7f25a88af700  1 -- 10.0.15.142:6800/16150 <==
> mon.1 10.0.15.51:6789/0 6 ==== auth_reply(proto 2 0 (0) Success) v1 ====
> 194+0+0 (1036670921 0 0) 0x556640ece080 con 0x556641949800
> 2017-07-25 14:21:58.892003 7f25b5e71c80 10 osd.7 0
> mon_cmd_maybe_osd_create cmd: {"prefix": "osd crush set-device-class",
> "class": "hdd", "ids": ["7"]}
> 2017-07-25 14:21:58.892039 7f25b5e71c80  1 -- 10.0.15.142:6800/16150 -->
> 10.0.15.51:6789/0 -- mon_command({"prefix": "osd crush set-device-class",
> "class": "hdd", "ids": ["7"]} v 0) v1 -- 0x556640e78d00 con 0
> *2017-07-25 14:21:58.894596 7f25a88af700  1 -- 10.0.15.142:6800/16150 <==
> mon.1 10.0.15.51:6789/0 7 ==== mon_command_ack([{"prefix": "osd crush
> set-device-class", "class": "hdd", "ids": ["7"]}]=-2 (2) No such file or
> directory v10406) v1 ==== 133+0+0 (3400959855 0 0) 0x556640ece300 con
> 0x556641949800*
> 2017-07-25 14:21:58.894797 7f25b5e71c80  1 -- 10.0.15.142:6800/16150 -->
> 10.0.15.51:6789/0 -- mon_command({"prefix": "osd create", "id": 7,
> "uuid": "92445e4f-850e-453b-b5ab-569d1414f72d"} v 0) v1 -- 0x556640e79180
> con 0
> 2017-07-25 14:21:58.896301 7f25a88af700  1 -- 10.0.15.142:6800/16150 <==
> mon.1 10.0.15.51:6789/0 8 ==== mon_command_ack([{"prefix": "osd create",
> "id": 7, "uuid": "92445e4f-850e-453b-b5ab-569d1414f72d"}]=0  v10406) v1
> ==== 115+0+2 (2540205126 0 1371665406) 0x556640ece580 con 0x556641949800
> 2017-07-25 14:21:58.896473 7f25b5e71c80 10 osd.7 0
> mon_cmd_maybe_osd_create cmd: {"prefix": "osd crush set-device-class",
> "class": "hdd", "ids": ["7"]}
> 2017-07-25 14:21:58.896516 7f25b5e71c80  1 -- 10.0.15.142:6800/16150 -->
> 10.0.15.51:6789/0 -- mon_command({"prefix": "osd crush set-device-class",
> "class": "hdd", "ids": ["7"]} v 0) v1 -- 0x556640e793c0 con 0
> *2017-07-25 14:21:58.898180 7f25a88af700  1 -- 10.0.15.142:6800/16150 <==
> mon.1 10.0.15.51:6789/0 9 ==== mon_command_ack([{"prefix": "osd crush
> set-device-class", "class": "hdd", "ids": ["7"]}]=-2 (2) No such file or
> directory v10406) v1 ==== 133+0+0 (3400959855 0 0) 0x556640ecd900 con
> 0x556641949800*
> *2017-07-25 14:21:58.898276 7f25b5e71c80 -1 osd.7 0
> mon_cmd_maybe_osd_create fail: '(2) No such file or directory': (2) No such
> file or directory*
> 2017-07-25 14:21:58.898380 7f25b5e71c80  1 -- 10.0.15.142:6800/16150 >>
> 10.0.15.51:6789/0 conn(0x556641949800 :-1 s=STATE_OPEN pgs=367879 cs=1
> l=1).mark_down
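>
> For what it's worth, those mon_command calls look like the equivalent of
> running the following by hand (syntax reconstructed from the JSON above,
> so treat it as approximate); the set-device-class one is the call that
> comes back with ENOENT:
>
> ceph osd crush set-device-class hdd 7
> ceph osd create 92445e4f-850e-453b-b5ab-569d1414f72d 7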
>
>
>
>
> On Mon, Jul 24, 2017 at 1:33 PM, Daniel K <satha...@gmail.com> wrote:
>
>> List --
>>
>> I have a 4-node cluster running on bare metal and need to use the kernel
>> client on 2 nodes. Since I've read that you should not run the kernel client
>> on a node that runs an OSD daemon, I decided to move the OSD daemons into a
>> VM on the same host.
>>
>> The original host is stor-vm2 (bare metal); the new host is stor-vm2a
>> (virtual).
>>
>> All went well -- I did these steps for each OSD, 5 total per host (rough
>> commands after the list):
>>
>> - set up the VM
>> - installed the OS
>> - installed ceph (using ceph-deploy)
>> - set noout
>> - stopped the ceph OSD on the bare metal host
>> - unmounted /dev/sdb1 from /var/lib/ceph/osd/ceph-0
>> - added /dev/sdb to the VM
>> - ceph detected the OSD and started it automatically
>> - moved the VM host to the same bucket as the physical host in the crushmap
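>>
>> Roughly, per OSD, that translates to the commands below (device names and
>> the OSD/unit number are from the first OSD, and the crush location is
>> just an example):
>>
>> ceph osd set noout
>> systemctl stop ceph-osd@0            # on the bare metal host
>> umount /var/lib/ceph/osd/ceph-0
>> # attach /dev/sdb to the VM -- ceph detected the OSD and started it there
>> ceph osd crush move stor-vm2a root=default   # same parent bucket as stor-vm2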
>>
>> I did this for each OSD, and despite some recovery IO because of the
>> updated crushmap, all OSDs were up.
>>
>> I rebooted the physical host, which rebooted the VM, and now the OSDs are
>> refusing to start.
>>
>> I've tried moving them back to the bare metal host with the same results.
>>
>> Any ideas?
>>
>> Here are what seem to be the relevant osd log lines:
>>
>> 2017-07-24 13:21:53.561265 7faf1752fc80  0 osd.10 8854 crush map has
>> features 2200130813952, adjusting msgr requires for clients
>> 2017-07-24 13:21:53.561284 7faf1752fc80  0 osd.10 8854 crush map has
>> features 2200130813952 was 8705, adjusting msgr requires for mons
>> 2017-07-24 13:21:53.561298 7faf1752fc80  0 osd.10 8854 crush map has
>> features 720578140510109696, adjusting msgr requires for osds
>> 2017-07-24 13:21:55.626834 7faf1752fc80  0 osd.10 8854 load_pgs
>> 2017-07-24 13:22:20.970222 7faf1752fc80  0 osd.10 8854 load_pgs opened
>> 536 pgs
>> 2017-07-24 13:22:20.972659 7faf1752fc80  0 osd.10 8854 using
>> weightedpriority op queue with priority op cut off at 64.
>> 2017-07-24 13:22:20.976861 7faf1752fc80 -1 osd.10 8854 log_to_monitors
>> {default=true}
>> 2017-07-24 13:22:20.998233 7faf1752fc80 -1 osd.10 8854
>> mon_cmd_maybe_osd_create fail: '(2) No such file or directory': (2) No such
>> file or directory
>> 2017-07-24 13:22:20.999165 7faf1752fc80  1 
>> bluestore(/var/lib/ceph/osd/ceph-10)
>> umount
>> 2017-07-24 13:22:21.016146 7faf1752fc80  1 freelist shutdown
>> 2017-07-24 13:22:21.016243 7faf1752fc80  4 rocksdb:
>> [/build/ceph-12.1.1/src/rocksdb/db/db_impl.cc:217] Shutdown: canceling
>> all background work
>> 2017-07-24 13:22:21.020440 7faf1752fc80  4 rocksdb:
>> [/build/ceph-12.1.1/src/rocksdb/db/db_impl.cc:343] Shutdown complete
>> 2017-07-24 13:22:21.274481 7faf1752fc80  1 bluefs umount
>> 2017-07-24 13:22:21.275822 7faf1752fc80  1 bdev(0x558bb1f82d80
>> /var/lib/ceph/osd/ceph-10/block) close
>> 2017-07-24 13:22:21.485226 7faf1752fc80  1 bdev(0x558bb1f82b40
>> /var/lib/ceph/osd/ceph-10/block) close
>> 2017-07-24 13:22:21.551009 7faf1752fc80 -1  ** ERROR: osd init failed:
>> (2) No such file or directory
>> 2017-07-24 13:22:21.563567 7faf1752fc80 -1 
>> /build/ceph-12.1.1/src/common/HeartbeatMap.cc:
>> In function 'ceph::HeartbeatMap::~HeartbeatMap()' thread 7faf1752fc80
>> time 2017-07-24 13:22:21.558275
>> /build/ceph-12.1.1/src/common/HeartbeatMap.cc: 39: FAILED
>> assert(m_workers.empty())
>>
>>  ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous
>> (rc)
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x102) [0x558ba6ba6b72]
>>  2: (()+0xb81cf1) [0x558ba6cc0cf1]
>>  3: (CephContext::~CephContext()+0x4d9) [0x558ba6ca77b9]
>>  4: (CephContext::put()+0xe6) [0x558ba6ca7ab6]
>>  5: (main()+0x563) [0x558ba650df73]
>>  6: (__libc_start_main()+0xf0) [0x7faf14999830]
>>  7: (_start()+0x29) [0x558ba6597cf9]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>> to interpret this.
>>
>> --- begin dump of recent events ---
>>
>>
>>
>>
>
