Re: [ceph-users] Fwd: OSD's not coming up in Nautilus

huang jun Fri, 08 Nov 2019 06:29:18 -0800

Can you post the output of 'ceph osd tree' on pastebin?
Do you mean that the OSDs reporting the fsid mismatch belong to the old, removed node?
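
One way to check whether stale fsids are involved is to compare the uuid the monitors hold for each OSD id with the fsid the OSD has on disk. The following is only a rough Python sketch under stated assumptions: it assumes that `ceph osd dump --format json` lists each OSD with `osd` and `uuid` fields, and that the on-disk value was read from the `fsid` file in the OSD data directory. `find_fsid_mismatches` is a hypothetical helper name, not a Ceph tool.

```python
import json


def find_fsid_mismatches(osd_dump_json, local_fsids):
    """Return {osd_id: (map_uuid, local_fsid)} for OSDs whose fsids differ.

    osd_dump_json: JSON text, assumed to look like {"osds": [{"osd": 0, "uuid": "..."}]}
    local_fsids:   {osd_id: fsid string read from the OSD data directory}
    """
    dump = json.loads(osd_dump_json)
    # uuid recorded in the osdmap for every OSD id the monitors know about
    map_uuids = {int(o["osd"]): o["uuid"] for o in dump.get("osds", [])}

    mismatches = {}
    for osd_id, local in local_fsids.items():
        known = map_uuids.get(osd_id)
        local = local.strip()  # the fsid file usually ends with a newline
        # OSD ids unknown to the monitors are skipped, not reported
        if known is not None and known.lower() != local.lower():
            mismatches[osd_id] = (known, local)
    return mismatches
```

Any OSD id returned here would be expected to hit the "different fsid" clash when it tries to boot.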


nokia ceph <[email protected]> wrote on Fri, Nov 8, 2019 at 10:21 PM:
>
> Hi,
>
> The fifth node in the cluster suffered a hardware failure, so the node was 
> replaced. We were not able to replace it properly, so we uninstalled Ceph 
> on all the nodes, deleted the pools, zapped the OSDs and recreated 
> everything as a new Ceph cluster. We are not sure where the references to 
> the fsids of the old fifth node's (the failed node's) OSDs are still coming 
> from. Is this what is causing the problem? I ask because the OSDs on the 
> fifth node show as up in the ceph status, whereas the OSDs on the other 
> nodes show as down.
>
> On Fri, Nov 8, 2019 at 7:25 PM huang jun <[email protected]> wrote:
>>
>> I saw many lines like this:
>>
>> mon.cn1@0(leader).osd e1805 preprocess_boot from osd.112
>> v2:10.50.11.45:6822/158344 clashes with existing osd: different fsid
>> (ours: 85908622-31bd-4728-9be3-f1f6ca44ed98 ; theirs:
>> 127fdc44-c17e-42ee-bcd4-d577c0ef4479)
>> The OSD boot is ignored if the fsid does not match.
>> What did you do before this happened?
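
To see how many OSDs are affected and which fsids are involved, the clash messages can be pulled out of the mon log with a short script. This is a minimal Python sketch, not part of the original exchange: `find_fsid_clashes` and `FsidClash` are hypothetical names, and the pattern is inferred only from the single log line quoted above, so other Ceph releases may word the message differently.

```python
import re
from typing import List, NamedTuple


class FsidClash(NamedTuple):
    osd_id: int
    addr: str
    ours: str    # fsid the monitor's osdmap holds for this OSD id
    theirs: str  # fsid presented by the OSD daemon that tried to boot


_UUID = r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}"

# Pattern modelled on the one mon log line shown in this thread.
CLASH_RE = re.compile(
    r"preprocess_boot from osd\.(?P<osd>\d+)\s+(?P<addr>\S+)\s+"
    r"clashes with existing osd: different fsid\s+"
    r"\(ours:\s*(?P<ours>" + _UUID + r")\s*;\s*theirs:\s*(?P<theirs>" + _UUID + r")\)"
)


def find_fsid_clashes(log_text: str) -> List[FsidClash]:
    """Return one FsidClash per 'different fsid' message found in log_text."""
    return [
        FsidClash(int(m.group("osd")), m.group("addr"),
                  m.group("ours"), m.group("theirs"))
        for m in CLASH_RE.finditer(log_text)
    ]
```

Collecting the distinct `osd_id` values from the result gives the set of OSDs whose boot messages the monitor is ignoring.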
>>
>> nokia ceph <[email protected]> wrote on Fri, Nov 8, 2019 at 8:29 PM:
>> >
>> > Hi,
>> >
>> > Please find below the status of osd.0, which was restarted after debug_mon 
>> > was increased to 20.
>> >
>> > cn1.chn8be1c1.cdn ~# date;systemctl restart ceph-osd@0.service
>> > Fri Nov  8 12:25:05 UTC 2019
>> >
>> > cn1.chn8be1c1.cdn ~# systemctl status ceph-osd@0.service -l
>> > ● ceph-osd@0.service - Ceph object storage daemon osd.0
>> >    Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; 
>> > enabled-runtime; vendor preset: disabled)
>> >   Drop-In: /etc/systemd/system/ceph-osd@.service.d
>> >            └─90-ExecStart_NUMA.conf
>> >    Active: active (running) since Fri 2019-11-08 12:25:06 UTC; 29s ago
>> >   Process: 298505 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh 
>> > --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
>> >  Main PID: 298512 (ceph-osd)
>> >    CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@0.service
>> >            └─298512 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser 
>> > ceph --setgroup ceph
>> >
>> > Nov 08 12:25:06 cn1.chn8be1c1.cdn systemd[1]: Starting Ceph object storage 
>> > daemon osd.0...
>> > Nov 08 12:25:06 cn1.chn8be1c1.cdn systemd[1]: Started Ceph object storage 
>> > daemon osd.0.
>> > Nov 08 12:25:11 cn1.chn8be1c1.cdn numactl[298512]: 2019-11-08 12:25:11.538 
>> > 7f8515323d80 -1 osd.0 1795 log_to_monitors {default=true}
>> > Nov 08 12:25:11 cn1.chn8be1c1.cdn numactl[298512]: 2019-11-08 12:25:11.689 
>> > 7f850792e700 -1 osd.0 1795 set_numa_affinity unable to identify public 
>> > interface 'dss-client' numa node: (2) No such file or directory
>> >
>> > On Fri, Nov 8, 2019 at 4:48 PM huang jun <[email protected]> wrote:
>> >>
>> >> Is osd.0 still in the down state after the restart? If so, the
>> >> problem may be in the mon.
>> >> Can you set debug_mon=20 on the leader mon, restart one of the down
>> >> OSDs,
>> >> and then attach the mon log file?
>> >>
>> >> nokia ceph <[email protected]> wrote on Fri, Nov 8, 2019 at 6:38 PM:
>> >> >
>> >> > Hi,
>> >> >
>> >> >
>> >> > Below is the status of the OSD after restart.
>> >> >
>> >> > # systemctl status ceph-osd@0.service
>> >> > ● ceph-osd@0.service - Ceph object storage daemon osd.0
>> >> >    Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; 
>> >> > enabled-runtime; vendor preset: disabled)
>> >> >   Drop-In: /etc/systemd/system/ceph-osd@.service.d
>> >> >            └─90-ExecStart_NUMA.conf
>> >> >    Active: active (running) since Fri 2019-11-08 10:32:51 UTC; 1min 1s 
>> >> > ago
>> >> >   Process: 219213 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh 
>> >> > --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
>> >> >  Main PID: 219218 (ceph-osd)
>> >> >    CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@0.service
>> >> >            └─219218 /usr/bin/ceph-osd -f --cluster ceph --id 0 
>> >> > --setuser ceph --setgroup ceph
>> >> >
>> >> > Nov 08 10:32:51 cn1.chn8be1c1.cdn systemd[1]: Starting Ceph object 
>> >> > storage daemon osd.0...
>> >> > Nov 08 10:32:51 cn1.chn8be1c1.cdn systemd[1]: Started Ceph object 
>> >> > storage daemon osd.0.
>> >> > Nov 08 10:33:03 cn1.chn8be1c1.cdn numactl[219218]: 2019-11-08 
>> >> > 10:33:03.785 7f9adeed4d80 -1 osd.0 1795 log_to_monitors {default=true}
>> >> > Nov 08 10:33:05 cn1.chn8be1c1.cdn numactl[219218]: 2019-11-08 
>> >> > 10:33:05.474 7f9ad14df700 -1 osd.0 1795 set_numa_affinity unable to 
>> >> > identify public interface 'dss-client' numa n...r directory
>> >> >
>> >> > Hint: Some lines were ellipsized, use -l to show in full.
>> >> >
>> >> > I have attached to this mail the logs captured while this restart was 
>> >> > initiated.
>> >> >
>> >> > On Fri, Nov 8, 2019 at 3:59 PM huang jun <[email protected]> wrote:
>> >> >>
>> >> >> Try restarting some of the down OSDs shown in 'ceph osd tree' and see
>> >> >> what happens.
>> >> >>
>> >> >> nokia ceph <[email protected]> wrote on Fri, Nov 8, 2019 at 6:24 PM:
>> >> >> >
>> >> >> > Adding my official mail id
>> >> >> >
>> >> >> > ---------- Forwarded message ---------
>> >> >> > From: nokia ceph <[email protected]>
>> >> >> > Date: Fri, Nov 8, 2019 at 3:57 PM
>> >> >> > Subject: OSD's not coming up in Nautilus
>> >> >> > To: Ceph Users <[email protected]>
>> >> >> >
>> >> >> >
>> >> >> > Hi Team,
>> >> >> >
>> >> >> > We have a 5-node Ceph cluster that we upgraded from 
>> >> >> > Luminous to Nautilus. Everything was going well until yesterday, 
>> >> >> > when we noticed that the Ceph OSDs are marked down and not 
>> >> >> > recognized by the monitors as running, even though the OSD processes 
>> >> >> > are running.
>> >> >> >
>> >> >> > We noticed that the admin.keyring and the mon.keyring were missing on 
>> >> >> > the nodes, so we recreated them with the commands below.
>> >> >> >
>> >> >> > ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring 
>> >> >> > --gen-key -n client.admin --cap mon 'allow *' --cap osd 'allow *' 
>> >> >> > --cap mds allow
>> >> >> >
>> >> >> > ceph-authtool --create_keyring /etc/ceph/ceph.mon.keyring --gen-key 
>> >> >> > -n mon. --cap mon 'allow *'
>> >> >> >
>> >> >> > In the logs we find the lines below.
>> >> >> >
>> >> >> > 2019-11-08 09:01:50.525 7ff61722b700  0 log_channel(audit) log [DBG] 
>> >> >> > : from='client.? 10.50.11.44:0/2398064782' entity='client.admin' 
>> >> >> > cmd=[{"prefix": "df", "format": "json"}]: dispatch
>> >> >> > 2019-11-08 09:02:37.686 7ff61722b700  0 log_channel(cluster) log 
>> >> >> > [INF] : mon.cn1 calling monitor election
>> >> >> > 2019-11-08 09:02:37.686 7ff61722b700  1 
>> >> >> > mon.cn1@0(electing).elector(31157) init, last seen epoch 31157, 
>> >> >> > mid-election, bumping
>> >> >> > 2019-11-08 09:02:37.688 7ff61722b700 -1 mon.cn1@0(electing) e3 
>> >> >> > failed to get devid for : udev_device_new_from_subsystem_sysname 
>> >> >> > failed on ''
>> >> >> > 2019-11-08 09:02:37.770 7ff61722b700  0 log_channel(cluster) log 
>> >> >> > [INF] : mon.cn1 is new leader, mons cn1,cn2,cn3,cn4,cn5 in quorum 
>> >> >> > (ranks 0,1,2,3,4)
>> >> >> > 2019-11-08 09:02:37.857 7ff613a24700  0 log_channel(cluster) log 
>> >> >> > [DBG] : monmap e3: 5 mons at 
>> >> >> > {cn1=[v2:10.50.11.41:3300/0,v1:10.50.11.41:6789/0],cn2=[v2:10.50.11.42:3300/0,v1:10.50.11.42:6789/0],cn3=[v2:10.50.11.43:3300/0,v1:10.50.11.43:6789/0],cn4=[v2:10.50.11.44:3300/0,v1:10.50.11.44:6789/0],cn5=[v2:10.50.11.45:3300/0,v1:10.50.11.45:6789/0]}
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > # ceph mon dump
>> >> >> > dumped monmap epoch 3
>> >> >> > epoch 3
>> >> >> > fsid 9dbf207a-561c-48ba-892d-3e79b86be12f
>> >> >> > last_changed 2019-09-03 07:53:39.031174
>> >> >> > created 2019-08-23 18:30:55.970279
>> >> >> > min_mon_release 14 (nautilus)
>> >> >> > 0: [v2:10.50.11.41:3300/0,v1:10.50.11.41:6789/0] mon.cn1
>> >> >> > 1: [v2:10.50.11.42:3300/0,v1:10.50.11.42:6789/0] mon.cn2
>> >> >> > 2: [v2:10.50.11.43:3300/0,v1:10.50.11.43:6789/0] mon.cn3
>> >> >> > 3: [v2:10.50.11.44:3300/0,v1:10.50.11.44:6789/0] mon.cn4
>> >> >> > 4: [v2:10.50.11.45:3300/0,v1:10.50.11.45:6789/0] mon.cn5
>> >> >> >
>> >> >> >
>> >> >> > # ceph -s
>> >> >> >   cluster:
>> >> >> >     id:     9dbf207a-561c-48ba-892d-3e79b86be12f
>> >> >> >     health: HEALTH_WARN
>> >> >> >             85 osds down
>> >> >> >             3 hosts (72 osds) down
>> >> >> >             1 nearfull osd(s)
>> >> >> >             1 pool(s) nearfull
>> >> >> >             Reduced data availability: 2048 pgs inactive
>> >> >> >             too few PGs per OSD (17 < min 30)
>> >> >> >             1/5 mons down, quorum cn2,cn3,cn4,cn5
>> >> >> >
>> >> >> >   services:
>> >> >> >     mon: 5 daemons, quorum cn2,cn3,cn4,cn5 (age 57s), out of quorum: 
>> >> >> > cn1
>> >> >> >     mgr: cn1(active, since 73m), standbys: cn2, cn3, cn4, cn5
>> >> >> >     osd: 120 osds: 35 up, 120 in; 909 remapped pgs
>> >> >> >
>> >> >> >   data:
>> >> >> >     pools:   1 pools, 2048 pgs
>> >> >> >     objects: 0 objects, 0 B
>> >> >> >     usage:   176 TiB used, 260 TiB / 437 TiB avail
>> >> >> >     pgs:     100.000% pgs unknown
>> >> >> >              2048 unknown
>> >> >> >
>> >> >> >
>> >> >> > The OSD logs show the lines below.
>> >> >> >
>> >> >> > 2019-11-08 09:05:33.332 7fd1a36eed80  0 _get_class not permitted to 
>> >> >> > load kvs
>> >> >> > 2019-11-08 09:05:33.332 7fd1a36eed80  0 _get_class not permitted to 
>> >> >> > load lua
>> >> >> > 2019-11-08 09:05:33.337 7fd1a36eed80  0 _get_class not permitted to 
>> >> >> > load sdk
>> >> >> > 2019-11-08 09:05:33.337 7fd1a36eed80  0 osd.0 1795 crush map has 
>> >> >> > features 432629308056666112, adjusting msgr requires for clients
>> >> >> > 2019-11-08 09:05:33.337 7fd1a36eed80  0 osd.0 1795 crush map has 
>> >> >> > features 432629308056666112 was 8705, adjusting msgr requires for 
>> >> >> > mons
>> >> >> > 2019-11-08 09:05:33.337 7fd1a36eed80  0 osd.0 1795 crush map has 
>> >> >> > features 1009090060360105984, adjusting msgr requires for osds
>> >> >> >
>> >> >> > Please let us know what might be the issue. There seem to be no 
>> >> >> > network issues on any of the servers' public or private interfaces.
>> >> >> >
>> >> >> > _______________________________________________
>> >> >> > ceph-users mailing list
>> >> >> > [email protected]
>> >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
