Never mind, they just came back. It turns out I had some other issues as
well, such as manually enabled ceph-osd@#.service files in the systemd
configuration for OSDs that had been moved to different nodes.
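
In case it helps anyone else, this is roughly how I found and cleared the
stray unit instances (a sketch; the OSD IDs are just the ones from this
thread):

    # list every ceph-osd unit instance systemd knows about, dead or alive
    systemctl list-units 'ceph-osd@*' --all

    # check whether a stray instance is still enabled
    systemctl is-enabled ceph-osd@42.service

    # disable instances for OSDs that now live on another node
    systemctl disable ceph-osd@42.service ceph-osd@43.service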

The root problem is clearly that ceph-osd-prestart updates the CRUSH map
before the OSD successfully starts at all. If there are duplicate IDs, for
example due to leftover files or something similar, a working OSD on
another node can be forcibly moved in the CRUSH map to a node where it
doesn't actually exist. I would expect OSDs to update their own location in
CRUSH, rather than having this happen as a prestart step.
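
If you want to stop the startup path from rewriting CRUSH entirely, there
is a documented option for exactly this. I haven't tried it in this
scenario yet, so treat it as a sketch:

    # in ceph.conf on the OSD nodes
    [osd]
    # don't run "ceph osd crush create-or-move" automatically at OSD start
    osd crush update on start = false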

-Ben


On Wed, May 4, 2016 at 10:27 PM, Ben Hines <bhi...@gmail.com> wrote:

> CentOS 7.2.
>
> ... and I think I just figured it out. One node had directories left over
> from former OSDs in /var/lib/ceph/osd. When restarting other OSDs on this
> host, Ceph apparently added those stale OSDs to the CRUSH map, too.
>
> [root@sm-cld-mtl-013 osd]# ls -la /var/lib/ceph/osd/
> total 128
> drwxr-x--- 8 ceph ceph  90 Feb 24 14:44 .
> drwxr-x--- 9 ceph ceph 106 Feb 24 14:44 ..
> drwxr-xr-x 2 root root   6 Jul  2  2015 ceph-42
> drwxr-xr-x 2 root root   6 Jul  2  2015 ceph-43
> drwxr-xr-x 1 root root 278 May  4 22:21 ceph-44
> drwxr-xr-x 1 root root 278 May  4 22:21 ceph-45
> drwxr-xr-x 1 root root 278 May  4 22:25 ceph-67
> drwxr-xr-x 1 root root 304 May  4 22:25 ceph-86
>
>
> (42 and 43 are on a different host, yet when 'systemctl start
> ceph.target' is used, the OSD prestart adds them to the CRUSH map anyway;
> the failures show up in the log below, and after the log is how I
> verified where those OSDs really live before cleaning up:
>
>
> May  4 22:13:26 sm-cld-mtl-013 ceph-osd: starting osd.67 at :/0 osd_data
> /var/lib/ceph/osd/ceph-67 /var/lib/ceph/osd/ceph-67/journal
> May  4 22:13:26 sm-cld-mtl-013 ceph-osd: starting osd.45 at :/0 osd_data
> /var/lib/ceph/osd/ceph-45 /var/lib/ceph/osd/ceph-45/journal
> May  4 22:13:26 sm-cld-mtl-013 ceph-osd: WARNING: will not setuid/gid:
> /var/lib/ceph/osd/ceph-42 owned by 0:0 and not requested 167:167
> May  4 22:13:26 sm-cld-mtl-013 ceph-osd: 2016-05-04 22:13:26.529176
> 7f00cca7c900 -1 ** ERROR: unable to open OSD superblock on
> /var/lib/ceph/osd/ceph-43: (2) No such file or directory
> May  4 22:13:26 sm-cld-mtl-013 ceph-osd: 2016-05-04 22:13:26.534657
> 7fb55c17e900 -1 ** ERROR: unable to open OSD superblock on
> /var/lib/ceph/osd/ceph-42: (2) No such file or directory
> May  4 22:13:26 sm-cld-mtl-013 systemd: ceph-osd@43.service: main process
> exited, code=exited, status=1/FAILURE
> May  4 22:13:26 sm-cld-mtl-013 systemd: Unit ceph-osd@43.service entered
> failed state.
> May  4 22:13:26 sm-cld-mtl-013 systemd: ceph-osd@43.service failed.
> May  4 22:13:26 sm-cld-mtl-013 systemd: ceph-osd@42.service: main process
> exited, code=exited, status=1/FAILURE
> May  4 22:13:26 sm-cld-mtl-013 systemd: Unit ceph-osd@42.service entered
> failed state.
> May  4 22:13:26 sm-cld-mtl-013 systemd: ceph-osd@42.service failed.
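>
> To be safe, I confirmed where 42 and 43 actually live before touching the
> leftover directories. Roughly this (adjust the IDs and paths to match
> your own listing):
>
>     # ask the cluster which host osd.42 is really on
>     ceph osd find 42
>     ceph osd metadata 42 | grep hostname
>
>     # only after confirming it's up and healthy on the other node:
>     systemctl disable ceph-osd@42.service
>     rm -rf /var/lib/ceph/osd/ceph-42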
>
>
>
> -Ben
>
> On Tue, May 3, 2016 at 7:16 PM, Wade Holler <wade.hol...@gmail.com> wrote:
>
>> Hi Ben,
>>
>> What OS and version?
>>
>> Best Regards,
>> Wade
>>
>>
>> On Tue, May 3, 2016 at 2:44 PM Ben Hines <bhi...@gmail.com> wrote:
>>
>>> My CRUSH map keeps putting some OSDs on the wrong node. Restarting them
>>> fixes it temporarily, but they eventually hop back to the other node
>>> that they aren't really on.
>>>
>>> Is there anything that could cause this that I should look for?
>>>
>>> Ceph 9.2.1
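>>>
>>> For reference, this is how I'm watching the placement flip back and
>>> forth (standard commands, nothing specific to my setup):
>>>
>>>     # shows which host bucket each OSD currently hangs under
>>>     ceph osd tree
>>>
>>>     # or query a single OSD's recorded location
>>>     ceph osd find <id>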
>>>
>>> -Ben
>>>
>>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
