Never mind, they just came back. It looks like I had some other issues, such as manually enabled ceph-osd@#.service files left in the systemd config for OSDs that had been moved to different nodes.
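For what it's worth, the "OSDs updating their own CRUSH location on start" behavior is governed by the `osd crush update on start` option. A hedged sketch of pinning locations instead (the option name is real; whether you want this depends on how you manage your CRUSH map):

```ini
[osd]
# Stop OSDs (and the prestart hook) from updating their own CRUSH
# location when they start. Locations must then be managed manually,
# e.g. with 'ceph osd crush set' / 'ceph osd crush move'.
osd crush update on start = false
```

With this set, a stray data directory on the wrong host can still fail to start, but it can no longer drag a healthy OSD to the wrong node in the CRUSH map.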
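A minimal sketch of spotting those leftovers on a host: any `/var/lib/ceph/osd/ceph-*` directory without a `superblock` file is likely a stale data dir from an OSD that now lives elsewhere, and its `ceph-osd@<id>.service` unit is a candidate for disabling. (The directory layout matches what Ben shows below; the helper name is mine.)

```shell
# Report OSD data dirs that lack a superblock, i.e. likely leftovers
# from OSDs that were moved to another node. The root is a parameter
# so the check can be exercised outside a real cluster.
check_stale_osds() {
    root=$1
    for dir in "$root"/ceph-*; do
        [ -d "$dir" ] || continue
        id=${dir##*/ceph-}
        if [ ! -e "$dir/superblock" ]; then
            # On a real host, follow up with:
            #   systemctl disable "ceph-osd@$id.service"
            echo "osd.$id"
        fi
    done
}

check_stale_osds "${OSD_ROOT:-/var/lib/ceph/osd}"
```

In Ben's listing below, this would flag ceph-42 and ceph-43 (empty, root-owned directories from July 2015) while leaving the live OSDs alone.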
The root problem is clearly that ceph-osd-prestart updates the CRUSH map before the OSD has successfully started at all. If there are duplicate IDs, for example due to leftover files or some such, then a working OSD on another node may be forcibly moved in the CRUSH map to a node where it doesn't exist. I would expect OSDs to update their own location in CRUSH, rather than having this be a prestart step.

-Ben

On Wed, May 4, 2016 at 10:27 PM, Ben Hines <bhi...@gmail.com> wrote:
> Centos 7.2.
>
> .. and I think I just figured it out. One node had directories from former
> OSDs in /var/lib/ceph/osd. When restarting other OSDs on this host, ceph
> apparently added those to the crush map, too.
>
> [root@sm-cld-mtl-013 osd]# ls -la /var/lib/ceph/osd/
> total 128
> drwxr-x--- 8 ceph ceph  90 Feb 24 14:44 .
> drwxr-x--- 9 ceph ceph 106 Feb 24 14:44 ..
> drwxr-xr-x 2 root root   6 Jul  2  2015 ceph-42
> drwxr-xr-x 2 root root   6 Jul  2  2015 ceph-43
> drwxr-xr-x 1 root root 278 May  4 22:21 ceph-44
> drwxr-xr-x 1 root root 278 May  4 22:21 ceph-45
> drwxr-xr-x 1 root root 278 May  4 22:25 ceph-67
> drwxr-xr-x 1 root root 304 May  4 22:25 ceph-86
>
> (42 and 43 are on a different host..
> yet when 'systemctl start ceph.target' is used, the osd prestart adds them
> to the crush map anyway:
>
> May  4 22:13:26 sm-cld-mtl-013 ceph-osd: starting osd.67 at :/0 osd_data
> /var/lib/ceph/osd/ceph-67 /var/lib/ceph/osd/ceph-67/journal
> May  4 22:13:26 sm-cld-mtl-013 ceph-osd: starting osd.45 at :/0 osd_data
> /var/lib/ceph/osd/ceph-45 /var/lib/ceph/osd/ceph-45/journal
> May  4 22:13:26 sm-cld-mtl-013 ceph-osd: WARNING: will not setuid/gid:
> /var/lib/ceph/osd/ceph-42 owned by 0:0 and not requested 167:167
> May  4 22:13:26 sm-cld-mtl-013 ceph-osd: 2016-05-04 22:13:26.529176
> 7f00cca7c900 -1  ** ERROR: unable to open OSD superblock on
> /var/lib/ceph/osd/ceph-43: (2) No such file or directory
> May  4 22:13:26 sm-cld-mtl-013 ceph-osd: 2016-05-04 22:13:26.534657
> 7fb55c17e900 -1  ** ERROR: unable to open OSD superblock on
> /var/lib/ceph/osd/ceph-42: (2) No such file or directory
> May  4 22:13:26 sm-cld-mtl-013 systemd: ceph-osd@43.service: main process
> exited, code=exited, status=1/FAILURE
> May  4 22:13:26 sm-cld-mtl-013 systemd: Unit ceph-osd@43.service entered
> failed state.
> May  4 22:13:26 sm-cld-mtl-013 systemd: ceph-osd@43.service failed.
> May  4 22:13:26 sm-cld-mtl-013 systemd: ceph-osd@42.service: main process
> exited, code=exited, status=1/FAILURE
> May  4 22:13:26 sm-cld-mtl-013 systemd: Unit ceph-osd@42.service entered
> failed state.
> May  4 22:13:26 sm-cld-mtl-013 systemd: ceph-osd@42.service failed.
>
> -Ben
>
> On Tue, May 3, 2016 at 7:16 PM, Wade Holler <wade.hol...@gmail.com> wrote:
>> Hi Ben,
>>
>> What OS+Version ?
>>
>> Best Regards,
>> Wade
>>
>> On Tue, May 3, 2016 at 2:44 PM Ben Hines <bhi...@gmail.com> wrote:
>>> My crush map keeps putting some OSDs on the wrong node. Restarting them
>>> fixes it temporarily, but they eventually hop back to the other node
>>> that they aren't really on.
>>>
>>> Is there anything that can cause this to look for?
>>>
>>> Ceph 9.2.1
>>>
>>> -Ben
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com