On 10/15/2017 03:13 AM, Denes Dolhay wrote:

Hello,

Could you include the monitors and the OSDs in your clock-skew test as well?

How did you create the OSDs? ceph-deploy osd create osd1:/dev/sdX osd2:/dev/sdY osd3:/dev/sdZ ?

A log from one of the OSDs would be great!
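For example, assuming the three hosts from the status output below (10.247.103.139-141) are reachable over ssh, a sub-second comparison across all mon/OSD hosts could look like this sketch (adjust the addresses to whatever ssh actually resolves):

# compare wall clocks with fractional seconds on every mon/OSD host
for i in 139 140 141; do
    ssh 10.247.103.$i 'hostname; date +%s.%N'
done
# if ntpd or chrony is running, its own offset view helps too, e.g.:
#   ssh 10.247.103.$i ntpq -p        # or: chronyc tracking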


Kind regards,

Denes.


On 10/14/2017 07:39 PM, dE wrote:
On 10/14/2017 08:18 PM, David Turner wrote:

What are the ownership permissions on your OSD folders? Clock skew cares about partial seconds.

It isn't a networking issue, because your cluster isn't stuck peering. I'm not sure whether the creating state happens on disk or in the cluster.

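For reference, checking (and fixing) ownership on the OSD data directory and journal could look like the sketch below -- the paths are taken from the ceph.conf quoted further down, and ceph:ceph is assumed to be the user/group the OSDs run as:

# show numeric owner/group of the OSD data dir and its journal
ls -ldn /srv/ceph/osd /srv/ceph/osd/osd_journal

# if they are not owned by ceph:ceph, fix them before starting the OSD
chown -R ceph:ceph /srv/ceph/osd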

On Sat, Oct 14, 2017, 10:01 AM dE . <de.tec...@gmail.com> wrote:

    I attached a 1 TB disk to each OSD.

    cluster 8161c90e-dbd2-4491-acf8-74449bef916a
         health HEALTH_ERR
                clock skew detected on mon.1, mon.2

                64 pgs are stuck inactive for more than 300 seconds
                64 pgs stuck inactive
                too few PGs per OSD (21 < min 30)
                Monitor clock skew detected
         monmap e1: 3 mons at
     {0=10.247.103.139:8567/0,1=10.247.103.140:8567/0,2=10.247.103.141:8567/0}
                election epoch 12, quorum 0,1,2 0,1,2
         osdmap e10: 3 osds: 3 up, 3 in
                flags sortbitwise,require_jewel_osds
          pgmap v38: 64 pgs, 1 pools, 0 bytes data, 0 objects
                33963 MB used, 3037 GB / 3070 GB avail
                      64 creating

    I don't seem to have any clock skew --
    for i in {139..141}; do ssh $i date +%s; done
    1507989554
    1507989554
    1507989554


    On Sat, Oct 14, 2017 at 6:41 PM, David Turner <drakonst...@gmail.com> wrote:

        What is the output of your `ceph status`?


        On Fri, Oct 13, 2017, 10:09 PM dE <de.tec...@gmail.com> wrote:

            On 10/14/2017 12:53 AM, David Turner wrote:
            What does your environment look like?  Someone recently
            on the mailing list had PGs stuck creating because of a
            networking issue.

            On Fri, Oct 13, 2017 at 2:03 PM Ronny Aasen <ronny+ceph-us...@aasen.cx> wrote:

                Strange that no OSD is acting for your PGs.
                Can you show the output from:
                ceph osd tree
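
                Just as a sketch of what to look for: the commands
                below dump the CRUSH tree, the stuck PGs and the
                CRUSH rules. If the OSDs sit outside any host bucket
                or carry weight 0, CRUSH cannot choose an acting set
                for the PGs.

                ceph osd tree
                ceph pg dump_stuck inactive
                ceph osd crush rule dump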


                Kind regards,
                Ronny Aasen



                On 13.10.2017 18:53, dE wrote:
                > Hi,
                >
                > I'm running ceph 10.2.5 on Debian (official package).
                >
                > It cant seem to create any functional pools --
                >
                > ceph health detail
                > HEALTH_ERR 64 pgs are stuck inactive for more than 300 seconds; 64 pgs stuck inactive; too few PGs per OSD (21 < min 30)
                > pg 0.39 is stuck inactive for 652.741684, current state creating, last acting []
                > pg 0.38 is stuck inactive for 652.741688, current state creating, last acting []
                > pg 0.37 is stuck inactive for 652.741690, current state creating, last acting []
                > pg 0.36 is stuck inactive for 652.741692, current state creating, last acting []
                > pg 0.35 is stuck inactive for 652.741694, current state creating, last acting []
                > pg 0.34 is stuck inactive for 652.741696, current state creating, last acting []
                > pg 0.33 is stuck inactive for 652.741698, current state creating, last acting []
                > pg 0.32 is stuck inactive for 652.741701, current state creating, last acting []
                > pg 0.3 is stuck inactive for 652.741762, current state creating, last acting []
                > pg 0.2e is stuck inactive for 652.741715, current state creating, last acting []
                > pg 0.2d is stuck inactive for 652.741719, current state creating, last acting []
                > pg 0.2c is stuck inactive for 652.741721, current state creating, last acting []
                > pg 0.2b is stuck inactive for 652.741723, current state creating, last acting []
                > pg 0.2a is stuck inactive for 652.741725, current state creating, last acting []
                > pg 0.29 is stuck inactive for 652.741727, current state creating, last acting []
                > pg 0.28 is stuck inactive for 652.741730, current state creating, last acting []
                > pg 0.27 is stuck inactive for 652.741732, current state creating, last acting []
                > pg 0.26 is stuck inactive for 652.741734, current state creating, last acting []
                > pg 0.3e is stuck inactive for 652.741707, current state creating, last acting []
                > pg 0.f is stuck inactive for 652.741761, current state creating, last acting []
                > pg 0.3f is stuck inactive for 652.741708, current state creating, last acting []
                > pg 0.10 is stuck inactive for 652.741763, current state creating, last acting []
                > pg 0.4 is stuck inactive for 652.741773, current state creating, last acting []
                > pg 0.5 is stuck inactive for 652.741774, current state creating, last acting []
                > pg 0.3a is stuck inactive for 652.741717, current state creating, last acting []
                > pg 0.b is stuck inactive for 652.741771, current state creating, last acting []
                > pg 0.c is stuck inactive for 652.741772, current state creating, last acting []
                > pg 0.3b is stuck inactive for 652.741721, current state creating, last acting []
                > pg 0.d is stuck inactive for 652.741774, current state creating, last acting []
                > pg 0.3c is stuck inactive for 652.741722, current state creating, last acting []
                > pg 0.e is stuck inactive for 652.741776, current state creating, last acting []
                > pg 0.3d is stuck inactive for 652.741724, current state creating, last acting []
                > pg 0.22 is stuck inactive for 652.741756, current state creating, last acting []
                > pg 0.21 is stuck inactive for 652.741758, current state creating, last acting []
                > pg 0.a is stuck inactive for 652.741783, current state creating, last acting []
                > pg 0.20 is stuck inactive for 652.741761, current state creating, last acting []
                > pg 0.9 is stuck inactive for 652.741787, current state creating, last acting []
                > pg 0.1f is stuck inactive for 652.741764, current state creating, last acting []
                > pg 0.8 is stuck inactive for 652.741790, current state creating, last acting []
                > pg 0.7 is stuck inactive for 652.741792, current state creating, last acting []
                > pg 0.6 is stuck inactive for 652.741794, current state creating, last acting []
                > pg 0.1e is stuck inactive for 652.741770, current state creating, last acting []
                > pg 0.1d is stuck inactive for 652.741772, current state creating, last acting []
                > pg 0.1c is stuck inactive for 652.741774, current state creating, last acting []
                > pg 0.1b is stuck inactive for 652.741777, current state creating, last acting []
                > pg 0.1a is stuck inactive for 652.741784, current state creating, last acting []
                > pg 0.2 is stuck inactive for 652.741812, current state creating, last acting []
                > pg 0.31 is stuck inactive for 652.741762, current state creating, last acting []
                > pg 0.19 is stuck inactive for 652.741789, current state creating, last acting []
                > pg 0.11 is stuck inactive for 652.741797, current state creating, last acting []
                > pg 0.18 is stuck inactive for 652.741793, current state creating, last acting []
                > pg 0.1 is stuck inactive for 652.741820, current state creating, last acting []
                > pg 0.30 is stuck inactive for 652.741769, current state creating, last acting []
                > pg 0.17 is stuck inactive for 652.741797, current state creating, last acting []
                > pg 0.0 is stuck inactive for 652.741829, current state creating, last acting []
                > pg 0.2f is stuck inactive for 652.741774, current state creating, last acting []
                > pg 0.16 is stuck inactive for 652.741802, current state creating, last acting []
                > pg 0.12 is stuck inactive for 652.741807, current state creating, last acting []
                > pg 0.13 is stuck inactive for 652.741807, current state creating, last acting []
                > pg 0.14 is stuck inactive for 652.741807, current state creating, last acting []
                > pg 0.15 is stuck inactive for 652.741808, current state creating, last acting []
                > pg 0.23 is stuck inactive for 652.741792, current state creating, last acting []
                > pg 0.24 is stuck inactive for 652.741793, current state creating, last acting []
                > pg 0.25 is stuck inactive for 652.741793, current state creating, last acting []
                >
                > I got 3 OSDs --
                >
                > ceph osd stat
                >      osdmap e8: 3 osds: 3 up, 3 in
                >             flags sortbitwise,require_jewel_osds
                >
                > ceph osd pool ls detail
                > pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
                >
                > The inactive state seems odd for a brand-new pool with no data.
                >
                > This is my ceph.conf --
                >
                > [global]
                > fsid = 8161c91e-dbd2-4491-adf8-74446bef916a
                > auth cluster required = cephx
                > auth service required = cephx
                > auth client required = cephx
                > debug = 10/10
                > mon host = 10.242.103.139:8567,10.242.103.140:8567,10.242.103.141:8567
                > [mon]
                > ms bind ipv6 = false
                > mon data = /srv/ceph/mon
                > mon addr = 0.0.0.0:8567
                > mon warn on legacy crush tunables = true
                > mon crush min required version = jewel
                > mon initial members = 0,1,2
                > keyring = /etc/ceph/mon_keyring
                > log file = /var/log/ceph/mon.log
                > [osd]
                > osd data = /srv/ceph/osd
                > osd journal = /srv/ceph/osd/osd_journal
                > osd journal size = 10240
                > osd recovery delay start = 10
                > osd recovery thread timeout = 60
                > osd recovery max active = 1
                > osd recovery max chunk = 10485760
                > osd max backfills = 2
                > osd backfill retry interval = 60
                > osd backfill scan min = 100
                > osd backfill scan max = 1000
                > keyring = /etc/ceph/osd_keyring
                >
                > The monitors run on the same hosts as the OSDs.
                >
                > Any help will be highly appreciated!

            These are VMs with a Linux bridge for connectivity.

            VLANs have been created over teamed interfaces for the primary interface.

            The OSDs can be seen as up and in, and there's a quorum, so it's not a connectivity issue.


The ownership is ceph:root. I tried ceph:ceph, and also ran ceph-osd as root.




The monitors and OSDs run on the same hosts.

The output of one of the OSDs (run directly in the terminal) --

ceph-osd -i 0 -f -d --setuser ceph --setgroup ceph
starting osd.0 at :/0 osd_data /srv/ceph/osd /srv/ceph/osd/osd_journal
2017-10-15 09:03:20.234260 7f49bdb00900  0 set uid:gid to 64045:64045 (ceph:ceph)
2017-10-15 09:03:20.234269 7f49bdb00900  0 ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367), process ceph-osd, pid 1068
2017-10-15 09:03:20.234636 7f49bdb00900  0 pidfile_write: ignore empty --pid-file
2017-10-15 09:03:20.247340 7f49bdb00900  0 filestore(/srv/ceph/osd) backend xfs (magic 0x58465342)
2017-10-15 09:03:20.247940 7f49bdb00900  0 genericfilestorebackend(/srv/ceph/osd) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2017-10-15 09:03:20.247959 7f49bdb00900  0 genericfilestorebackend(/srv/ceph/osd) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2017-10-15 09:03:20.247982 7f49bdb00900  0 genericfilestorebackend(/srv/ceph/osd) detect_features: splice is supported
2017-10-15 09:03:20.248777 7f49bdb00900  0 genericfilestorebackend(/srv/ceph/osd) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2017-10-15 09:03:20.248820 7f49bdb00900  0 xfsfilestorebackend(/srv/ceph/osd) detect_feature: extsize is disabled by conf
2017-10-15 09:03:20.249386 7f49bdb00900  1 leveldb: Recovering log #5
2017-10-15 09:03:20.249420 7f49bdb00900  1 leveldb: Level-0 table #7: started
2017-10-15 09:03:20.250334 7f49bdb00900  1 leveldb: Level-0 table #7: 146 bytes OK
2017-10-15 09:03:20.252409 7f49bdb00900  1 leveldb: Delete type=0 #5

2017-10-15 09:03:20.252449 7f49bdb00900  1 leveldb: Delete type=3 #4

2017-10-15 09:03:20.252552 7f49bdb00900  0 filestore(/srv/ceph/osd) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2017-10-15 09:03:20.252708 7f49bdb00900 -1 journal FileJournal::_open: disabling aio for non-block journal.  Use journal_force_aio to force use of aio anyway
2017-10-15 09:03:20.252714 7f49bdb00900  1 journal _open /srv/ceph/osd/osd_journal fd 17: 10737418240 bytes, block size 4096 bytes, directio = 1, aio = 0
2017-10-15 09:03:20.253053 7f49bdb00900  1 journal _open /srv/ceph/osd/osd_journal fd 17: 10737418240 bytes, block size 4096 bytes, directio = 1, aio = 0
2017-10-15 09:03:20.255212 7f49bdb00900  1 filestore(/srv/ceph/osd) upgrade
2017-10-15 09:03:20.258680 7f49bdb00900  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
2017-10-15 09:03:20.259598 7f49bdb00900  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
2017-10-15 09:03:20.327155 7f49bdb00900  0 osd.0 0 crush map has features 2199057072128, adjusting msgr requires for clients
2017-10-15 09:03:20.327167 7f49bdb00900  0 osd.0 0 crush map has features 2199057072128 was 8705, adjusting msgr requires for mons
2017-10-15 09:03:20.327171 7f49bdb00900  0 osd.0 0 crush map has features 2199057072128, adjusting msgr requires for osds
2017-10-15 09:03:20.327199 7f49bdb00900  0 osd.0 0 load_pgs
2017-10-15 09:03:20.327210 7f49bdb00900  0 osd.0 0 load_pgs opened 0 pgs
2017-10-15 09:03:20.327216 7f49bdb00900  0 osd.0 0 using 0 op queue with priority op cut off at 64.
2017-10-15 09:03:20.331681 7f49bdb00900 -1 osd.0 0 log_to_monitors {default=true}
2017-10-15 09:03:20.339963 7f49bdb00900  0 osd.0 0 done with init, starting boot process
sh: 1: lsb_release: not found
2017-10-15 09:03:20.344114 7f49a25d3700 -1 lsb_release_parse - pclose failed: (13) Permission denied
2017-10-15 09:03:20.420408 7f49ae759700  0 osd.0 6 crush map has features 288232576282525696, adjusting msgr requires for clients
2017-10-15 09:03:20.420587 7f49ae759700  0 osd.0 6 crush map has features 288232576282525696 was 2199057080833, adjusting msgr requires for mons
2017-10-15 09:03:20.420596 7f49ae759700  0 osd.0 6 crush map has features 288232576282525696, adjusting msgr requires for osds

The cluster was created from scratch. Steps for creating OSDs --

ceph osd crush tunables jewel

ceph osd create f0960666-ad75-11e7-abc4-cec278b6b50a 0
ceph osd create 0e6295bc-adab-11e7-abc4-cec278b6b50a 1
ceph osd create 0e629828-adab-11e7-abc4-cec278b6b50a 2

ceph-osd -i 0/1/2 --mkfs --osd-uuid f0960666-ad75-11e7-abc4-cec278b6b50a/0e6295bc-adab-11e7-abc4-cec278b6b50a/0e629828-adab-11e7-abc4-cec278b6b50a -f -d (for each OSD)

chown -R ceph /srv/ceph/osd/

Ceph was started with --

ceph-osd -i 0/1/2 -f -d --setuser ceph --setgroup ceph
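
These steps don't add the OSDs to the CRUSH map explicitly. If ceph osd tree shows them with weight 0 or outside any host bucket, they have to be registered by hand -- a sketch, with illustrative bucket names and weight (roughly 1.0 for a 1 TB disk):

# create a host bucket and attach it to the default root (names are examples)
ceph osd crush add-bucket ceph-node1 host
ceph osd crush move ceph-node1 root=default

# register the OSD under its host with a non-zero weight
ceph osd crush add osd.0 1.0 host=ceph-node1

# repeat for osd.1 and osd.2 on their hosts, then verify with: ceph osd tree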

I skipped the authentication part, since the same problem occurs without cephx (set to none).

In the meantime, Luminous works great with the same setup.

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
