Hi,

Yes, the cluster is still unrecovered. We have not been able to bring up
even osd.0 yet.

OSD logs: https://pastebin.com/4WrpgrH5

Mon logs: https://drive.google.com/open?id=1_HqK2d52Cgaps203WnZ0mCfvxdcjcBoE

# ceph daemon /var/run/ceph/ceph-mon.cn1.asok config show|grep debug_mon
    "debug_mon": "20/20",
    "debug_monc": "0/0",


# date; systemctl restart ceph-osd@0.service;date
Sun Nov 10 05:25:54 UTC 2019
Sun Nov 10 05:25:55 UTC 2019
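
(To catch the startup messages while the restart runs, the OSD log can be
tailed in another shell, assuming the default log path:)

# tail -f /var/log/ceph/ceph-osd.0.log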


cn1.chn8be1c1.cdn ~# systemctl status ceph-osd@0.service
● ceph-osd@0.service - Ceph object storage daemon osd.0
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service;
enabled-runtime; vendor preset: disabled)
  Drop-In: /etc/systemd/system/ceph-osd@.service.d
           └─90-ExecStart_NUMA.conf
   Active: active (running) since Sun 2019-11-10 05:25:55 UTC; 8s ago
  Process: 2022026 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh
--cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
 Main PID: 2022032 (ceph-osd)
   CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@0.service
           └─2022032 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser
ceph --setgroup ceph

Nov 10 05:25:55 cn1.chn8be1c1.cdn systemd[1]: Starting Ceph object storage
daemon osd.0...
Nov 10 05:25:55 cn1.chn8be1c1.cdn systemd[1]: Started Ceph object storage
daemon osd.0.
Nov 10 05:26:03 cn1.chn8be1c1.cdn numactl[2022032]: 2019-11-10 05:26:03.131
7fbef7bb5d80 -1 osd.0 1795 log_to_monitors {default=true}
Nov 10 05:26:03 cn1.chn8be1c1.cdn numactl[2022032]: 2019-11-10 05:26:03.372
7fbeea1c0700 -1 osd.0 1795 set_numa_affinity unable to identify public
interface 'dss-client' numa node: (2) No such file or directory
Hint: Some lines were ellipsized, use -l to show in full.


# ceph tell mon.cn1 injectargs '--debug-mon 1/5'
injectargs:

cn1.chn8be1c1.cdn ~# ceph daemon /var/run/ceph/ceph-mon.cn1.asok config
show|grep debug_mon
    "debug_mon": "1/5",
    "debug_monc": "0/0",




On Sun, Nov 10, 2019 at 11:05 AM huang jun <hjwsm1...@gmail.com> wrote:

> Good, please send me the mon and osd.0 logs.
> Is the cluster still unrecovered?
>
> nokia ceph <nokiacephus...@gmail.com> wrote on Sun, Nov 10, 2019 at 1:24 PM:
> >
> > Hi Huang,
> >
> > Yes, the node 10.50.11.45 is the fifth node, which was replaced. Yes, I
> > have set debug_mon to 20 and it is still running with that value. If you
> > want, I will send you the mon logs once again after restarting osd.0.
> >
> > On Sun, Nov 10, 2019 at 10:17 AM huang jun <hjwsm1...@gmail.com> wrote:
> >>
> >> The mon log shows that all the mismatched-fsid OSDs are from node
> >> 10.50.11.45; maybe that is the fifth node?
> >> BTW, I can't find the osd.0 boot message in ceph-mon.log.
> >> Did you set debug_mon=20 first and then restart the osd.0 process?
> >> Please make sure osd.0 was actually restarted.
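> >>
> >> (One way to make sure the restart took effect is to check that the
> >> daemon's Main PID changed, e.g.:)
> >>
> >> # systemctl show -p MainPID ceph-osd@0.service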
> >>
> >>
> >> nokia ceph <nokiacephus...@gmail.com> wrote on Sun, Nov 10, 2019 at 12:31 PM:
> >>
> >> >
> >> > Hi,
> >> >
> >> > Please find the ceph osd tree output in the pastebin
> https://pastebin.com/Gn93rE6w
> >> >
> >> > On Fri, Nov 8, 2019 at 7:58 PM huang jun <hjwsm1...@gmail.com> wrote:
> >> >>
> >> >> Can you post your 'ceph osd tree' output in a pastebin?
> >> >> Do you mean the OSDs reporting the fsid mismatch are from the old
> >> >> removed node?
> >> >>
> >> >> nokia ceph <nokiacephus...@gmail.com> wrote on Fri, Nov 8, 2019 at 10:21 PM:
> >> >> >
> >> >> > Hi,
> >> >> >
> >> >> > The fifth node in the cluster was affected by a hardware failure
> >> >> > and was therefore replaced in the ceph cluster. But we were not
> >> >> > able to replace it properly, so we uninstalled ceph on all the
> >> >> > nodes, deleted the pools, zapped the OSDs, and recreated everything
> >> >> > as a new ceph cluster. We are not sure where the references to the
> >> >> > old (failed) fifth node's OSD fsids are still coming from. Is this
> >> >> > causing the problem? I am seeing that the OSDs on the fifth node
> >> >> > show up in the ceph status, whereas the other nodes' OSDs show
> >> >> > down.
> >> >> >
> >> >> > On Fri, Nov 8, 2019 at 7:25 PM huang jun <hjwsm1...@gmail.com>
> wrote:
> >> >> >>
> >> >> >> I saw many lines like this:
> >> >> >>
> >> >> >> mon.cn1@0(leader).osd e1805 preprocess_boot from osd.112
> >> >> >> v2:10.50.11.45:6822/158344 clashes with existing osd: different
> fsid
> >> >> >> (ours: 85908622-31bd-4728-9be3-f1f6ca44ed98 ; theirs:
> >> >> >> 127fdc44-c17e-42ee-bcd4-d577c0ef4479)
> >> >> >> The OSD boot will be ignored if the fsids mismatch.
> >> >> >> What did you do before this happened?
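> >> >> >>
> >> >> >> (You can compare the two fsids yourself; a sketch, assuming the
> >> >> >> default OSD data path:)
> >> >> >>
> >> >> >> # ceph osd dump | grep '^osd.112'      # uuid the mon recorded
> >> >> >> # cat /var/lib/ceph/osd/ceph-112/fsid  # fsid on the OSD's disk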
> >> >> >>
> >> >> >> nokia ceph <nokiacephus...@gmail.com> wrote on Fri, Nov 8, 2019 at 8:29 PM:
> >> >> >> >
> >> >> >> > Hi,
> >> >> >> >
> >> >> >> > Please find below the osd.0 restart output, captured after
> >> >> >> > debug_mon was increased to 20.
> >> >> >> >
> >> >> >> > cn1.chn8be1c1.cdn ~# date;systemctl restart ceph-osd@0.service
> >> >> >> > Fri Nov  8 12:25:05 UTC 2019
> >> >> >> >
> >> >> >> > cn1.chn8be1c1.cdn ~# systemctl status ceph-osd@0.service -l
> >> >> >> > ● ceph-osd@0.service - Ceph object storage daemon osd.0
> >> >> >> >    Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service;
> enabled-runtime; vendor preset: disabled)
> >> >> >> >   Drop-In: /etc/systemd/system/ceph-osd@.service.d
> >> >> >> >            └─90-ExecStart_NUMA.conf
> >> >> >> >    Active: active (running) since Fri 2019-11-08 12:25:06 UTC;
> 29s ago
> >> >> >> >   Process: 298505
> ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id
> %i (code=exited, status=0/SUCCESS)
> >> >> >> >  Main PID: 298512 (ceph-osd)
> >> >> >> >    CGroup:
> /system.slice/system-ceph\x2dosd.slice/ceph-osd@0.service
> >> >> >> >            └─298512 /usr/bin/ceph-osd -f --cluster ceph --id 0
> --setuser ceph --setgroup ceph
> >> >> >> >
> >> >> >> > Nov 08 12:25:06 cn1.chn8be1c1.cdn systemd[1]: Starting Ceph
> object storage daemon osd.0...
> >> >> >> > Nov 08 12:25:06 cn1.chn8be1c1.cdn systemd[1]: Started Ceph
> object storage daemon osd.0.
> >> >> >> > Nov 08 12:25:11 cn1.chn8be1c1.cdn numactl[298512]: 2019-11-08
> 12:25:11.538 7f8515323d80 -1 osd.0 1795 log_to_monitors {default=true}
> >> >> >> > Nov 08 12:25:11 cn1.chn8be1c1.cdn numactl[298512]: 2019-11-08
> 12:25:11.689 7f850792e700 -1 osd.0 1795 set_numa_affinity unable to
> identify public interface 'dss-client' numa node: (2) No such file or
> directory
> >> >> >> >
> >> >> >> > On Fri, Nov 8, 2019 at 4:48 PM huang jun <hjwsm1...@gmail.com>
> wrote:
> >> >> >> >>
> >> >> >> >> Is osd.0 still in the down state after the restart? If so,
> >> >> >> >> maybe the problem is in the mon.
> >> >> >> >> Can you set the leader mon's debug_mon=20, restart one of the
> >> >> >> >> down-state OSDs, and then attach the mon log file?
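> >> >> >> >>
> >> >> >> >> (If unsure which mon is the leader, it is reported by, e.g.:)
> >> >> >> >>
> >> >> >> >> # ceph quorum_status -f json-pretty | grep leader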
> >> >> >> >>
> >> >> >> >> nokia ceph <nokiacephus...@gmail.com> wrote on Fri, Nov 8, 2019 at 6:38 PM:
> >> >> >> >> >
> >> >> >> >> > Hi,
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > Below is the status of the OSD after restart.
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > # systemctl status ceph-osd@0.service
> >> >> >> >> >
> >> >> >> >> > ● ceph-osd@0.service - Ceph object storage daemon osd.0
> >> >> >> >> >
> >> >> >> >> >    Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service;
> enabled-runtime; vendor preset: disabled)
> >> >> >> >> >
> >> >> >> >> >   Drop-In: /etc/systemd/system/ceph-osd@.service.d
> >> >> >> >> >
> >> >> >> >> >            └─90-ExecStart_NUMA.conf
> >> >> >> >> >
> >> >> >> >> >    Active: active (running) since Fri 2019-11-08 10:32:51
> UTC; 1min 1s ago
> >> >> >> >> >
> >> >> >> >> >   Process: 219213
> ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id
> %i (code=exited, status=0/SUCCESS)  Main PID: 219218 (ceph-osd)
> >> >> >> >> >
> >> >> >> >> >    CGroup:
> /system.slice/system-ceph\x2dosd.slice/ceph-osd@0.service
> >> >> >> >> >
> >> >> >> >> >            └─219218 /usr/bin/ceph-osd -f --cluster ceph --id
> 0 --setuser ceph --setgroup ceph
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > Nov 08 10:32:51 cn1.chn8be1c1.cdn systemd[1]: Starting Ceph
> object storage daemon osd.0...
> >> >> >> >> >
> >> >> >> >> > Nov 08 10:32:51 cn1.chn8be1c1.cdn systemd[1]: Started Ceph
> object storage daemon osd.0.
> >> >> >> >> >
> >> >> >> >> > Nov 08 10:33:03 cn1.chn8be1c1.cdn numactl[219218]:
> 2019-11-08 10:33:03.785 7f9adeed4d80 -1 osd.0 1795 log_to_monitors
> {default=true} Nov 08 10:33:05 cn1.chn8be1c1.cdn numactl[219218]:
> 2019-11-08 10:33:05.474 7f9ad14df700 -1 osd.0 1795 set_numa_affinity unable
> to identify public interface 'dss-client' numa n...r directory
> >> >> >> >> >
> >> >> >> >> > Hint: Some lines were ellipsized, use -l to show in full.
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > And I have attached the logs in the file in this mail while
> this restart was initiated.
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > On Fri, Nov 8, 2019 at 3:59 PM huang jun <
> hjwsm1...@gmail.com> wrote:
> >> >> >> >> >>
> >> >> >> >> >> Try restarting some of the OSDs that are down in 'ceph osd
> >> >> >> >> >> tree' and see what happens.
> >> >> >> >> >>
> >> >> >> >> >> nokia ceph <nokiacephus...@gmail.com> wrote on Fri, Nov 8, 2019 at 6:24 PM:
> >> >> >> >> >> >
> >> >> >> >> >> > Adding my official mail id
> >> >> >> >> >> >
> >> >> >> >> >> > ---------- Forwarded message ---------
> >> >> >> >> >> > From: nokia ceph <nokiacephus...@gmail.com>
> >> >> >> >> >> > Date: Fri, Nov 8, 2019 at 3:57 PM
> >> >> >> >> >> > Subject: OSD's not coming up in Nautilus
> >> >> >> >> >> > To: Ceph Users <ceph-users@lists.ceph.com>
> >> >> >> >> >> >
> >> >> >> >> >> >
> >> >> >> >> >> > Hi Team,
> >> >> >> >> >> >
> >> >> >> >> >> > We have a 5-node ceph cluster that we upgraded from
> >> >> >> >> >> > Luminous to Nautilus, and everything was going well until
> >> >> >> >> >> > yesterday, when we noticed that the OSDs are marked down
> >> >> >> >> >> > and not recognized as running by the monitors, even though
> >> >> >> >> >> > the OSD processes are running.
> >> >> >> >> >> >
> >> >> >> >> >> > We noticed that the admin.keyring and the mon.keyring were
> >> >> >> >> >> > missing on the nodes, so we recreated them with the
> >> >> >> >> >> > commands below.
> >> >> >> >> >> >
> >> >> >> >> >> > ceph-authtool --create-keyring
> /etc/ceph/ceph.client.admin.keyring --gen-key -n client.admin --cap mon
> 'allow *' --cap osd 'allow *' --cap mds allow
> >> >> >> >> >> >
> >> >> >> >> >> > ceph-authtool --create-keyring /etc/ceph/ceph.mon.keyring
> --gen-key -n mon. --cap mon 'allow *'
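> >> >> >> >> >> >
> >> >> >> >> >> > (The regenerated admin key can be checked against the one
> >> >> >> >> >> > the monitors hold with, e.g.:)
> >> >> >> >> >> >
> >> >> >> >> >> > # ceph auth get client.admin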
> >> >> >> >> >> >
> >> >> >> >> >> > In the logs we find the lines below.
> >> >> >> >> >> >
> >> >> >> >> >> > 2019-11-08 09:01:50.525 7ff61722b700  0
> log_channel(audit) log [DBG] : from='client.? 10.50.11.44:0/2398064782'
> entity='client.admin' cmd=[{"prefix": "df", "format": "json"}]: dispatch
> >> >> >> >> >> > 2019-11-08 09:02:37.686 7ff61722b700  0
> log_channel(cluster) log [INF] : mon.cn1 calling monitor election
> >> >> >> >> >> > 2019-11-08 09:02:37.686 7ff61722b700  1 
> >> >> >> >> >> > mon.cn1@0(electing).elector(31157)
> init, last seen epoch 31157, mid-election, bumping
> >> >> >> >> >> > 2019-11-08 09:02:37.688 7ff61722b700 -1 mon.cn1@0(electing)
> e3 failed to get devid for : udev_device_new_from_subsystem_sysname failed
> on ''
> >> >> >> >> >> > 2019-11-08 09:02:37.770 7ff61722b700  0
> log_channel(cluster) log [INF] : mon.cn1 is new leader, mons
> cn1,cn2,cn3,cn4,cn5 in quorum (ranks 0,1,2,3,4)
> >> >> >> >> >> > 2019-11-08 09:02:37.857 7ff613a24700  0
> log_channel(cluster) log [DBG] : monmap e3: 5 mons at {cn1=[v2:
> 10.50.11.41:3300/0,v1:10.50.11.41:6789/0],cn2=[v2:
> 10.50.11.42:3300/0,v1:10.50.11.42:6789/0],cn3=[v2:
> 10.50.11.43:3300/0,v1:10.50.11.43:6789/0],cn4=[v2:
> 10.50.11.44:3300/0,v1:10.50.11.44:6789/0],cn5=[v2:
> 10.50.11.45:3300/0,v1:10.50.11.45:6789/0]}
> >> >> >> >> >> >
> >> >> >> >> >> >
> >> >> >> >> >> >
> >> >> >> >> >> > # ceph mon dump
> >> >> >> >> >> > dumped monmap epoch 3
> >> >> >> >> >> > epoch 3
> >> >> >> >> >> > fsid 9dbf207a-561c-48ba-892d-3e79b86be12f
> >> >> >> >> >> > last_changed 2019-09-03 07:53:39.031174
> >> >> >> >> >> > created 2019-08-23 18:30:55.970279
> >> >> >> >> >> > min_mon_release 14 (nautilus)
> >> >> >> >> >> > 0: [v2:10.50.11.41:3300/0,v1:10.50.11.41:6789/0] mon.cn1
> >> >> >> >> >> > 1: [v2:10.50.11.42:3300/0,v1:10.50.11.42:6789/0] mon.cn2
> >> >> >> >> >> > 2: [v2:10.50.11.43:3300/0,v1:10.50.11.43:6789/0] mon.cn3
> >> >> >> >> >> > 3: [v2:10.50.11.44:3300/0,v1:10.50.11.44:6789/0] mon.cn4
> >> >> >> >> >> > 4: [v2:10.50.11.45:3300/0,v1:10.50.11.45:6789/0] mon.cn5
> >> >> >> >> >> >
> >> >> >> >> >> >
> >> >> >> >> >> > # ceph -s
> >> >> >> >> >> >   cluster:
> >> >> >> >> >> >     id:     9dbf207a-561c-48ba-892d-3e79b86be12f
> >> >> >> >> >> >     health: HEALTH_WARN
> >> >> >> >> >> >             85 osds down
> >> >> >> >> >> >             3 hosts (72 osds) down
> >> >> >> >> >> >             1 nearfull osd(s)
> >> >> >> >> >> >             1 pool(s) nearfull
> >> >> >> >> >> >             Reduced data availability: 2048 pgs inactive
> >> >> >> >> >> >             too few PGs per OSD (17 < min 30)
> >> >> >> >> >> >             1/5 mons down, quorum cn2,cn3,cn4,cn5
> >> >> >> >> >> >
> >> >> >> >> >> >   services:
> >> >> >> >> >> >     mon: 5 daemons, quorum cn2,cn3,cn4,cn5 (age 57s), out
> of quorum: cn1
> >> >> >> >> >> >     mgr: cn1(active, since 73m), standbys: cn2, cn3, cn4,
> cn5
> >> >> >> >> >> >     osd: 120 osds: 35 up, 120 in; 909 remapped pgs
> >> >> >> >> >> >
> >> >> >> >> >> >   data:
> >> >> >> >> >> >     pools:   1 pools, 2048 pgs
> >> >> >> >> >> >     objects: 0 objects, 0 B
> >> >> >> >> >> >     usage:   176 TiB used, 260 TiB / 437 TiB avail
> >> >> >> >> >> >     pgs:     100.000% pgs unknown
> >> >> >> >> >> >              2048 unknown
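> >> >> >> >> >> >
> >> >> >> >> >> > (The inactive PGs can be examined further with, e.g.:)
> >> >> >> >> >> >
> >> >> >> >> >> > # ceph health detail
> >> >> >> >> >> > # ceph pg dump_stuck inactive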
> >> >> >> >> >> >
> >> >> >> >> >> >
> >> >> >> >> >> > The OSD logs show the lines below.
> >> >> >> >> >> >
> >> >> >> >> >> > 2019-11-08 09:05:33.332 7fd1a36eed80  0 _get_class not
> permitted to load kvs
> >> >> >> >> >> > 2019-11-08 09:05:33.332 7fd1a36eed80  0 _get_class not
> permitted to load lua
> >> >> >> >> >> > 2019-11-08 09:05:33.337 7fd1a36eed80  0 _get_class not
> permitted to load sdk
> >> >> >> >> >> > 2019-11-08 09:05:33.337 7fd1a36eed80  0 osd.0 1795 crush
> map has features 432629308056666112, adjusting msgr requires for clients
> >> >> >> >> >> > 2019-11-08 09:05:33.337 7fd1a36eed80  0 osd.0 1795 crush
> map has features 432629308056666112 was 8705, adjusting msgr requires for
> mons
> >> >> >> >> >> > 2019-11-08 09:05:33.337 7fd1a36eed80  0 osd.0 1795 crush
> map has features 1009090060360105984, adjusting msgr requires for osds
> >> >> >> >> >> >
> >> >> >> >> >> > Please let us know what the issue might be. There seem to
> >> >> >> >> >> > be no network issues on any of the servers' public or
> >> >> >> >> >> > private interfaces.
> >> >> >> >> >> >
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
