What happens if you remove nodown? I'd be interested to see which OSDs it thinks are down. My next thought would be tcpdump on the private interface, to see if the OSDs are actually managing to connect to each other.
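Something along these lines would show both (the interface name and address here are just placeholders for your cluster network):

    ceph osd unset nodown
    ceph health detail                      # which OSDs get marked down now?
    tcpdump -i eth1 -n host 192.168.1.11    # watch for heartbeat traffic on the cluster NIC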

For comparison, when I bring up a cluster of 3 OSDs it goes to HEALTH_OK nearly instantly (definitely under a minute!), so it's probably not just taking a while.

Does 'ceph osd dump' show the proper public and private IPs?
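For example (the addresses below are made up), each OSD line in the dump should show the public address followed by the cluster address:

    ceph osd dump | grep '^osd'
    # osd.0 up in ... 209.243.160.51:6800/1482 192.168.1.51:6801/1482 ...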

On 8/1/2014 6:13 PM, Bruce McFarland wrote:

MDS: I assumed that I'd need to bring up a ceph-mds for my cluster at initial bringup. We also intended to modify the CRUSH map so that its pool is resident on SSD(s). It's one of the areas where the online docs don't seem to have a lot of info, and I haven't spent a lot of time researching it. I'll stop it.

OSD connectivity: The connectivity is good for both 1GE and 10GE. I thought moving to 10GE with nothing else on that net might help with placement group peering etc. and bring the PGs up quicker. I've checked 'tcpdump' output on all boxes.

Firewall: Thanks for that one -- it's the "basic" I overlooked in my ceph learning curve. One of the OSDs had selinux=enforcing -- all the others were disabled. After changing that box, the 10 PGs in my demo-pool (I kept the PG count very small for sanity) are now 'active+clean'. The PGs for the default pools -- data, metadata, rbd -- are still stuck in creating+peering or creating+incomplete. I did have to manually set 'osd pool default min size = 1' from its default of 2 for these 3 pools to eliminate a bunch of warnings in the 'ceph health detail' output.
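Note that 'osd pool default min size' only applies to pools created after the change, so for the three existing default pools I had to set it per pool, something like:

    ceph osd pool set data min_size 1
    ceph osd pool set metadata min_size 1
    ceph osd pool set rbd min_size 1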

I'm adding the [mon] setting you suggested below and stopping ceph-mds and bringing everything up now.

[root@essperf3 Ceph]# ceph -s
    cluster 4b3ffe60-73f4-4512-b7da-b04e4775dd73
     health HEALTH_WARN 96 pgs incomplete; 96 pgs peering; 192 pgs stuck inactive; 192 pgs stuck unclean; 28 requests are blocked > 32 sec; nodown,noscrub flag(s) set
     monmap e1: 1 mons at {essperf3=209.243.160.35:6789/0}, election epoch 1, quorum 0 essperf3
     mdsmap e43: 1/1/1 up {0=essperf3=up:creating}
     osdmap e752: 3 osds: 3 up, 3 in
            flags nodown,noscrub
      pgmap v1483: 202 pgs, 4 pools, 0 bytes data, 0 objects
            134 MB used, 1158 GB / 1158 GB avail
                  96 creating+peering
                  10 active+clean    <<<<<<<<!!!!!!!!
                  96 creating+incomplete
[root@essperf3 Ceph]#

*From:*Brian Rak [mailto:b...@gameservers.com]
*Sent:* Friday, August 01, 2014 2:54 PM
*To:* Bruce McFarland; ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] Firefly OSDs stuck in creating state forever

Why do you have an MDS active? I'd suggest getting rid of that, at least until you have everything else working.

I see you've set nodown on the OSDs, did you have problems with the OSDs flapping? Do the OSDs have broken connectivity between themselves? Do you have some kind of firewall interfering here?
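A quick way to rule that out on each OSD host (port 6789 is the monitor, 6800 and up are the OSDs):

    getenforce                      # SELinux mode
    iptables -L -n                  # look for rules blocking 6789 or the 6800+ range
    netstat -tlnp | grep ceph       # confirm the ceph daemons are actually listening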

I've seen odd issues when the OSDs have broken private networking: you'll get one OSD marking all the other ones down. Adding this to my config helped:

[mon]
mon osd min down reporters = 2
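You can also inject it into the running monitor without a restart, something like:

    ceph tell mon.* injectargs '--mon-osd-min-down-reporters 2'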

On 8/1/2014 5:41 PM, Bruce McFarland wrote:

    Hello,

    I've run out of ideas and assume I've overlooked something very
    basic. I've created 2 ceph clusters in the last 2 weeks with
    different OSD HW and private network fabrics -- 1GE and 10GE. I
    have never been able to get the OSDs to come up to the
    'active+clean' state. I have followed the online documentation,
    and at this point the only thing I haven't done is modify the
    CRUSH map (although I have been looking into that). These are new
    clusters with no data and only 1 HDD and 1 SSD per OSD (24 2.5GHz
    cores with 64GB RAM).

    Since the disks are being recycled, is there something I need to
    flag to let ceph just create its mappings but not scrub the old
    data? I've tried setting the noscrub flag to no effect.

    I also have constant OSD flapping. I've set nodown, but assume
    that is just masking a problem that is still occurring.

    Besides never reaching the 'active+clean' state, ceph-mon always
    crashes after being left running overnight. The OSDs all
    eventually fill /root with ceph logs, so I regularly have to
    bring everything down, delete the logs, and restart.
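    (To keep /root from filling while I debug, I'm planning to pin
    the log path explicitly in ceph.conf -- this is the stock default
    location, assuming nothing else overrides it:

        [global]
        log file = /var/log/ceph/$name.log

    )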

    I have all sorts of output saved: the ceph.conf; OSD boot output
    with 'debug osd = 20' and 'debug ms = 1'; 'ceph -w' output; and
    pretty much all of the debug/monitoring suggestions from the
    online docs and 2 weeks of Google searches through blogs, mailing
    lists, etc.
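    The debug settings I used in ceph.conf, for reference:

        [osd]
        debug osd = 20
        debug ms = 1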

    [root@essperf3 Ceph]# ceph -v
    ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)

    [root@essperf3 Ceph]# ceph -s
        cluster 4b3ffe60-73f4-4512-b7da-b04e4775dd73
         health HEALTH_WARN 96 pgs incomplete; 106 pgs peering; 202 pgs stuck inactive; 202 pgs stuck unclean; nodown,noscrub flag(s) set
         monmap e1: 1 mons at {essperf3=209.243.160.35:6789/0}, election epoch 1, quorum 0 essperf3
         mdsmap e43: 1/1/1 up {0=essperf3=up:creating}
         osdmap e752: 3 osds: 3 up, 3 in
                flags nodown,noscrub
          pgmap v1476: 202 pgs, 4 pools, 0 bytes data, 0 objects
                134 MB used, 1158 GB / 1158 GB avail
                     106 creating+peering
                      96 creating+incomplete
    [root@essperf3 Ceph]#

    Suggestions?

    Thanks,

    Bruce






_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
