Good, please send me the mon and osd.0 logs. Is the cluster still unrecovered?
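For reference, a minimal sketch of how those two logs could be gathered, assuming the default cluster name "ceph", the stock log paths, and that cn1 is still the leader mon (host and daemon names are taken from the thread):

    # on the leader mon host, raise mon logging via the admin socket
    ceph daemon mon.cn1 config set debug_mon 20/20
    # on the OSD host, restart the down OSD so its boot attempt gets logged
    systemctl restart ceph-osd@0.service
    # files to send afterwards:
    #   /var/log/ceph/ceph-mon.cn1.log
    #   /var/log/ceph/ceph-osd.0.log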
On Sun, Nov 10, 2019 at 1:24 PM nokia ceph <nokiacephus...@gmail.com> wrote:
>
> Hi Huang,
>
> Yes, the node 10.50.10.45 is the fifth node, which was replaced. Yes, I have set
> debug_mon to 20 and it is still running with that value. If you want,
> I will send you the mon logs once again after restarting osd.0.
>
> On Sun, Nov 10, 2019 at 10:17 AM huang jun <hjwsm1...@gmail.com> wrote:
>>
>> The mon log shows that all the mismatched-fsid OSDs are from node 10.50.11.45;
>> maybe that is the fifth node?
>> BTW, I didn't find the osd.0 boot message in ceph-mon.log.
>> Did you set debug_mon=20 first and then restart the osd.0 process, and make
>> sure that osd.0 was actually restarted?
>>
>> On Sun, Nov 10, 2019 at 12:31 PM nokia ceph <nokiacephus...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > Please find the ceph osd tree output in the pastebin
>> > https://pastebin.com/Gn93rE6w
>> >
>> > On Fri, Nov 8, 2019 at 7:58 PM huang jun <hjwsm1...@gmail.com> wrote:
>> >>
>> >> Can you post your 'ceph osd tree' in a pastebin?
>> >> Do you mean the OSDs that report the fsid mismatch are from the old, removed nodes?
>> >>
>> >> On Fri, Nov 8, 2019 at 10:21 PM nokia ceph <nokiacephus...@gmail.com> wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> > The fifth node in the cluster was affected by a hardware failure and
>> >> > was therefore replaced in the ceph cluster. But we were not able to
>> >> > replace it properly, so we uninstalled ceph on all the nodes, deleted
>> >> > the pools, zapped the OSDs and recreated them as a new ceph cluster.
>> >> > We are still not sure where the reference to the old (failed) fifth
>> >> > node's OSD fsids is coming from. Is this creating the problem? I am
>> >> > seeing that the OSDs on the fifth node show up in the ceph status,
>> >> > whereas the OSDs on the other nodes show down.
>> >> >
>> >> > On Fri, Nov 8, 2019 at 7:25 PM huang jun <hjwsm1...@gmail.com> wrote:
>> >> >>
>> >> >> I saw many lines like this:
>> >> >>
>> >> >> mon.cn1@0(leader).osd e1805 preprocess_boot from osd.112
>> >> >> v2:10.50.11.45:6822/158344 clashes with existing osd: different fsid
>> >> >> (ours: 85908622-31bd-4728-9be3-f1f6ca44ed98 ; theirs:
>> >> >> 127fdc44-c17e-42ee-bcd4-d577c0ef4479)
>> >> >>
>> >> >> The OSD boot will be ignored if the fsid mismatches.
>> >> >> What did you do before this happened?
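For reference, a minimal sketch of how the two fsids could be compared, assuming the default cluster name "ceph", the stock /var/lib/ceph layout, and using osd.112 from the log line above as the example id:

    # fsid the monitors have recorded for the OSD (the uuid at the end of the osd line)
    ceph osd dump | grep '^osd.112 '
    # fsid of the OSD actually deployed on the node (run on that node, data dir must be mounted)
    cat /var/lib/ceph/osd/ceph-112/fsid

If the two values differ, that matches the "clashes with existing osd: different fsid" message above.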
>> >> >>
>> >> >> On Fri, Nov 8, 2019 at 8:29 PM nokia ceph <nokiacephus...@gmail.com> wrote:
>> >> >> >
>> >> >> > Hi,
>> >> >> >
>> >> >> > Please find osd.0, which was restarted after debug_mon was increased to 20.
>> >> >> >
>> >> >> > cn1.chn8be1c1.cdn ~# date;systemctl restart ceph-osd@0.service
>> >> >> > Fri Nov 8 12:25:05 UTC 2019
>> >> >> >
>> >> >> > cn1.chn8be1c1.cdn ~# systemctl status ceph-osd@0.service -l
>> >> >> > ● ceph-osd@0.service - Ceph object storage daemon osd.0
>> >> >> >    Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: disabled)
>> >> >> >   Drop-In: /etc/systemd/system/ceph-osd@.service.d
>> >> >> >            └─90-ExecStart_NUMA.conf
>> >> >> >    Active: active (running) since Fri 2019-11-08 12:25:06 UTC; 29s ago
>> >> >> >   Process: 298505 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
>> >> >> >  Main PID: 298512 (ceph-osd)
>> >> >> >    CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@0.service
>> >> >> >            └─298512 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
>> >> >> >
>> >> >> > Nov 08 12:25:06 cn1.chn8be1c1.cdn systemd[1]: Starting Ceph object storage daemon osd.0...
>> >> >> > Nov 08 12:25:06 cn1.chn8be1c1.cdn systemd[1]: Started Ceph object storage daemon osd.0.
>> >> >> > Nov 08 12:25:11 cn1.chn8be1c1.cdn numactl[298512]: 2019-11-08 12:25:11.538 7f8515323d80 -1 osd.0 1795 log_to_monitors {default=true}
>> >> >> > Nov 08 12:25:11 cn1.chn8be1c1.cdn numactl[298512]: 2019-11-08 12:25:11.689 7f850792e700 -1 osd.0 1795 set_numa_affinity unable to identify public interface 'dss-client' numa node: (2) No such file or directory
>> >> >> >
>> >> >> > On Fri, Nov 8, 2019 at 4:48 PM huang jun <hjwsm1...@gmail.com> wrote:
>> >> >> >>
>> >> >> >> Is osd.0 still in the down state after the restart? If so, maybe the
>> >> >> >> problem is in the mon.
>> >> >> >> Can you set the leader mon's debug_mon=20, restart one of the down-state
>> >> >> >> OSDs, and then attach the mon log file?
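As a quick aside, a minimal way to confirm which monitor is currently the leader before raising its debug_mon (standard ceph CLI, assuming a working admin key):

    ceph quorum_status -f json-pretty | grep quorum_leader_name
    # or, more briefly
    ceph mon stat

In the log excerpts quoted above it is mon.cn1 that reports itself as "(leader)".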
>> >> >> >>
>> >> >> >> On Fri, Nov 8, 2019 at 6:38 PM nokia ceph <nokiacephus...@gmail.com> wrote:
>> >> >> >> >
>> >> >> >> > Hi,
>> >> >> >> >
>> >> >> >> > Below is the status of the OSD after the restart.
>> >> >> >> >
>> >> >> >> > # systemctl status ceph-osd@0.service
>> >> >> >> > ● ceph-osd@0.service - Ceph object storage daemon osd.0
>> >> >> >> >    Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: disabled)
>> >> >> >> >   Drop-In: /etc/systemd/system/ceph-osd@.service.d
>> >> >> >> >            └─90-ExecStart_NUMA.conf
>> >> >> >> >    Active: active (running) since Fri 2019-11-08 10:32:51 UTC; 1min 1s ago
>> >> >> >> >   Process: 219213 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
>> >> >> >> >  Main PID: 219218 (ceph-osd)
>> >> >> >> >    CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@0.service
>> >> >> >> >            └─219218 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
>> >> >> >> >
>> >> >> >> > Nov 08 10:32:51 cn1.chn8be1c1.cdn systemd[1]: Starting Ceph object storage daemon osd.0...
>> >> >> >> > Nov 08 10:32:51 cn1.chn8be1c1.cdn systemd[1]: Started Ceph object storage daemon osd.0.
>> >> >> >> > Nov 08 10:33:03 cn1.chn8be1c1.cdn numactl[219218]: 2019-11-08 10:33:03.785 7f9adeed4d80 -1 osd.0 1795 log_to_monitors {default=true}
>> >> >> >> > Nov 08 10:33:05 cn1.chn8be1c1.cdn numactl[219218]: 2019-11-08 10:33:05.474 7f9ad14df700 -1 osd.0 1795 set_numa_affinity unable to identify public interface 'dss-client' numa n...r directory
>> >> >> >> > Hint: Some lines were ellipsized, use -l to show in full.
>> >> >> >> >
>> >> >> >> > And I have attached to this mail the logs captured while this restart was initiated.
>> >> >> >> >
>> >> >> >> > On Fri, Nov 8, 2019 at 3:59 PM huang jun <hjwsm1...@gmail.com> wrote:
>> >> >> >> >>
>> >> >> >> >> Try to restart some of the OSDs that are down in 'ceph osd tree' and see
>> >> >> >> >> what happens.
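A minimal sketch of that step, assuming systemd-managed OSD units as in the output above; <id> is a placeholder for an OSD id picked from the down list, and the restart has to run on the host that owns that OSD:

    # list only the OSDs currently marked down
    ceph osd tree down
    # on the owning host, restart one of them and check that the daemon actually stays up
    systemctl restart ceph-osd@<id>.service
    journalctl -u ceph-osd@<id> -n 50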
>> >> >> >> >>
>> >> >> >> >> On Fri, Nov 8, 2019 at 6:24 PM nokia ceph <nokiacephus...@gmail.com> wrote:
>> >> >> >> >> >
>> >> >> >> >> > Adding my official mail id
>> >> >> >> >> >
>> >> >> >> >> > ---------- Forwarded message ---------
>> >> >> >> >> > From: nokia ceph <nokiacephus...@gmail.com>
>> >> >> >> >> > Date: Fri, Nov 8, 2019 at 3:57 PM
>> >> >> >> >> > Subject: OSD's not coming up in Nautilus
>> >> >> >> >> > To: Ceph Users <ceph-users@lists.ceph.com>
>> >> >> >> >> >
>> >> >> >> >> > Hi Team,
>> >> >> >> >> >
>> >> >> >> >> > There is a 5-node ceph cluster which we upgraded from Luminous to
>> >> >> >> >> > Nautilus, and everything was going well until yesterday, when we
>> >> >> >> >> > noticed that the ceph OSDs are marked down and not recognized by
>> >> >> >> >> > the monitors as running, even though the osd processes are running.
>> >> >> >> >> >
>> >> >> >> >> > We noticed that the admin keyring and the mon keyring are missing
>> >> >> >> >> > on the nodes, so we recreated them with the commands below:
>> >> >> >> >> >
>> >> >> >> >> > ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring --gen-key -n client.admin --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow'
>> >> >> >> >> >
>> >> >> >> >> > ceph-authtool --create-keyring /etc/ceph/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *'
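One thing worth noting about the keyring commands above: --gen-key creates a brand-new secret, and the monitors will only accept it once they know about it. A minimal sketch of the two usual ways to reconcile the admin keyring, assuming at least one still-working identity (for example the mon. key) can authenticate to the monitors:

    # preferred: pull the key the monitors already hold, instead of generating a new one
    ceph auth get client.admin -o /etc/ceph/ceph.client.admin.keyring
    # alternative: make the monitors learn the freshly generated key
    ceph auth import -i /etc/ceph/ceph.client.admin.keyring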
>> >> >> >> >> >
>> >> >> >> >> > In the logs we find the lines below:
>> >> >> >> >> >
>> >> >> >> >> > 2019-11-08 09:01:50.525 7ff61722b700 0 log_channel(audit) log [DBG] : from='client.? 10.50.11.44:0/2398064782' entity='client.admin' cmd=[{"prefix": "df", "format": "json"}]: dispatch
>> >> >> >> >> > 2019-11-08 09:02:37.686 7ff61722b700 0 log_channel(cluster) log [INF] : mon.cn1 calling monitor election
>> >> >> >> >> > 2019-11-08 09:02:37.686 7ff61722b700 1 mon.cn1@0(electing).elector(31157) init, last seen epoch 31157, mid-election, bumping
>> >> >> >> >> > 2019-11-08 09:02:37.688 7ff61722b700 -1 mon.cn1@0(electing) e3 failed to get devid for : udev_device_new_from_subsystem_sysname failed on ''
>> >> >> >> >> > 2019-11-08 09:02:37.770 7ff61722b700 0 log_channel(cluster) log [INF] : mon.cn1 is new leader, mons cn1,cn2,cn3,cn4,cn5 in quorum (ranks 0,1,2,3,4)
>> >> >> >> >> > 2019-11-08 09:02:37.857 7ff613a24700 0 log_channel(cluster) log [DBG] : monmap e3: 5 mons at {cn1=[v2:10.50.11.41:3300/0,v1:10.50.11.41:6789/0],cn2=[v2:10.50.11.42:3300/0,v1:10.50.11.42:6789/0],cn3=[v2:10.50.11.43:3300/0,v1:10.50.11.43:6789/0],cn4=[v2:10.50.11.44:3300/0,v1:10.50.11.44:6789/0],cn5=[v2:10.50.11.45:3300/0,v1:10.50.11.45:6789/0]}
>> >> >> >> >> >
>> >> >> >> >> > # ceph mon dump
>> >> >> >> >> > dumped monmap epoch 3
>> >> >> >> >> > epoch 3
>> >> >> >> >> > fsid 9dbf207a-561c-48ba-892d-3e79b86be12f
>> >> >> >> >> > last_changed 2019-09-03 07:53:39.031174
>> >> >> >> >> > created 2019-08-23 18:30:55.970279
>> >> >> >> >> > min_mon_release 14 (nautilus)
>> >> >> >> >> > 0: [v2:10.50.11.41:3300/0,v1:10.50.11.41:6789/0] mon.cn1
>> >> >> >> >> > 1: [v2:10.50.11.42:3300/0,v1:10.50.11.42:6789/0] mon.cn2
>> >> >> >> >> > 2: [v2:10.50.11.43:3300/0,v1:10.50.11.43:6789/0] mon.cn3
>> >> >> >> >> > 3: [v2:10.50.11.44:3300/0,v1:10.50.11.44:6789/0] mon.cn4
>> >> >> >> >> > 4: [v2:10.50.11.45:3300/0,v1:10.50.11.45:6789/0] mon.cn5
>> >> >> >> >> >
>> >> >> >> >> > # ceph -s
>> >> >> >> >> >   cluster:
>> >> >> >> >> >     id:     9dbf207a-561c-48ba-892d-3e79b86be12f
>> >> >> >> >> >     health: HEALTH_WARN
>> >> >> >> >> >             85 osds down
>> >> >> >> >> >             3 hosts (72 osds) down
>> >> >> >> >> >             1 nearfull osd(s)
>> >> >> >> >> >             1 pool(s) nearfull
>> >> >> >> >> >             Reduced data availability: 2048 pgs inactive
>> >> >> >> >> >             too few PGs per OSD (17 < min 30)
>> >> >> >> >> >             1/5 mons down, quorum cn2,cn3,cn4,cn5
>> >> >> >> >> >
>> >> >> >> >> >   services:
>> >> >> >> >> >     mon: 5 daemons, quorum cn2,cn3,cn4,cn5 (age 57s), out of quorum: cn1
>> >> >> >> >> >     mgr: cn1(active, since 73m), standbys: cn2, cn3, cn4, cn5
>> >> >> >> >> >     osd: 120 osds: 35 up, 120 in; 909 remapped pgs
>> >> >> >> >> >
>> >> >> >> >> >   data:
>> >> >> >> >> >     pools:   1 pools, 2048 pgs
>> >> >> >> >> >     objects: 0 objects, 0 B
>> >> >> >> >> >     usage:   176 TiB used, 260 TiB / 437 TiB avail
>> >> >> >> >> >     pgs:     100.000% pgs unknown
>> >> >> >> >> >              2048 unknown
>> >> >> >> >> >
>> >> >> >> >> > The osd logs show the lines below:
>> >> >> >> >> >
>> >> >> >> >> > 2019-11-08 09:05:33.332 7fd1a36eed80 0 _get_class not permitted to load kvs
>> >> >> >> >> > 2019-11-08 09:05:33.332 7fd1a36eed80 0 _get_class not permitted to load lua
>> >> >> >> >> > 2019-11-08 09:05:33.337 7fd1a36eed80 0 _get_class not permitted to load sdk
>> >> >> >> >> > 2019-11-08 09:05:33.337 7fd1a36eed80 0 osd.0 1795 crush map has features 432629308056666112, adjusting msgr requires for clients
>> >> >> >> >> > 2019-11-08 09:05:33.337 7fd1a36eed80 0 osd.0 1795 crush map has features 432629308056666112 was 8705, adjusting msgr requires for mons
>> >> >> >> >> > 2019-11-08 09:05:33.337 7fd1a36eed80 0 osd.0 1795 crush map has features 1009090060360105984, adjusting msgr requires for osds
>> >> >> >> >> >
>> >> >> >> >> > Please let us know what might be the issue. There seem to be no
>> >> >> >> >> > network issues on any of the servers' public and private interfaces.
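Finally, on the stale fsid references left behind by the replaced fifth node: a rough sketch of how the leftover OSD registrations could be inspected and, only if the data on those ids is definitely no longer needed, cleared; <id> is a placeholder, and whether purging is appropriate here depends on whether the mismatching ids really belong to the old hardware rather than the freshly deployed OSDs:

    # the uuid at the end of each line shows which registered fsid an OSD id carries
    ceph osd dump | grep '^osd.'
    # remove a stale id from the CRUSH map, the osdmap and auth
    ceph osd purge <id> --yes-i-really-mean-it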