Yes, the PG *should* get remapped, but that is not always the case. For discussion on this, check out the tracker below. Your particular circumstances may be a little different, but the idea is the same.

http://tracker.ceph.com/issues/3806
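If it turns out to be the same CRUSH behaviour discussed in that tracker, the usual first step is to query one of the stuck PGs and check which tunables profile the cluster is running. A sketch only (pg 7.a is taken from the health output below; switching tunables triggers data movement and assumes every client is new enough to support the newer profile):

    ceph pg 7.a query                 # compare the "up" and "acting" sets
    ceph osd crush show-tunables      # show the tunables currently in effect
    ceph osd crush tunables optimal   # let CRUSH retry more times when mapping PGs

On small clusters running legacy tunables, CRUSH can simply fail to find a replacement OSD for some PGs after a failure, which leaves them undersized exactly as described below.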
On Tue, May 3, 2016 at 9:16 AM, Gaurav Bafna <baf...@gmail.com> wrote:
> Thanks, Tupper, for replying.
>
> Shouldn't the PGs be remapped to other OSDs?
>
> Yes, removing the OSD from the cluster results in a full recovery,
> but that should not be needed, right?
>
> On Tue, May 3, 2016 at 6:31 PM, Tupper Cole <tc...@redhat.com> wrote:
> > The degraded PGs are mapped to the down OSD and have not mapped to a
> > new OSD. Removing the OSD would likely result in a full recovery.
> >
> > As a note, having two monitors (or any even number of monitors) is not
> > recommended. If either monitor goes down you will lose quorum. The
> > recommended number of monitors for any cluster is at least three.
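For reference, the removal mentioned above is normally this short sequence (a sketch only; osd.12 is a placeholder, substitute the ID of the OSD that was killed):

    ceph osd out 12                # mark the OSD out so its data remaps
    ceph osd crush remove osd.12   # remove it from the CRUSH map
    ceph auth del osd.12           # delete its cephx key
    ceph osd rm 12                 # remove it from the OSD map

Marking the OSD out normally triggers remapping on its own; when it does not, as here, removing the OSD from the CRUSH map changes the map enough to force new placements, which is why full removal leads to full recovery.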
> > On Tue, May 3, 2016 at 8:42 AM, Gaurav Bafna <baf...@gmail.com> wrote:
> >> Hi Cephers,
> >>
> >> I am running a very small cluster of 3 storage and 2 monitor nodes.
> >>
> >> After I kill one OSD daemon, the cluster never fully recovers: 9 PGs
> >> remain undersized for no apparent reason.
> >>
> >> After I restart that one OSD daemon, the cluster recovers in no time.
> >>
> >> The size of all pools is 3 and min_size is 2.
> >>
> >> Can anybody please help?
> >>
> >> Output of "ceph -s":
> >>
> >>     cluster fac04d85-db48-4564-b821-deebda046261
> >>      health HEALTH_WARN
> >>             9 pgs degraded
> >>             9 pgs stuck degraded
> >>             9 pgs stuck unclean
> >>             9 pgs stuck undersized
> >>             9 pgs undersized
> >>             recovery 3327/195138 objects degraded (1.705%)
> >>             pool .users pg_num 512 > pgp_num 8
> >>      monmap e2: 2 mons at {dssmon2=10.140.13.13:6789/0,dssmonleader1=10.140.13.11:6789/0}
> >>             election epoch 1038, quorum 0,1 dssmonleader1,dssmon2
> >>      osdmap e857: 69 osds: 68 up, 68 in
> >>       pgmap v106601: 896 pgs, 9 pools, 435 MB data, 65047 objects
> >>             279 GB used, 247 TB / 247 TB avail
> >>             3327/195138 objects degraded (1.705%)
> >>                  887 active+clean
> >>                    9 active+undersized+degraded
> >>   client io 395 B/s rd, 0 B/s wr, 0 op/s
> >>
> >> "ceph health detail" output:
> >>
> >> HEALTH_WARN 9 pgs degraded; 9 pgs stuck degraded; 9 pgs stuck unclean;
> >> 9 pgs stuck undersized; 9 pgs undersized; recovery 3327/195138 objects
> >> degraded (1.705%); pool .users pg_num 512 > pgp_num 8
> >> pg 7.a is stuck unclean for 322742.938959, current state active+undersized+degraded, last acting [38,2]
> >> pg 5.27 is stuck unclean for 322754.823455, current state active+undersized+degraded, last acting [26,19]
> >> pg 5.32 is stuck unclean for 322750.685684, current state active+undersized+degraded, last acting [39,19]
> >> pg 6.13 is stuck unclean for 322732.665345, current state active+undersized+degraded, last acting [30,16]
> >> pg 5.4e is stuck unclean for 331869.103538, current state active+undersized+degraded, last acting [16,38]
> >> pg 5.72 is stuck unclean for 331871.208948, current state active+undersized+degraded, last acting [16,49]
> >> pg 4.17 is stuck unclean for 331822.771240, current state active+undersized+degraded, last acting [47,20]
> >> pg 5.2c is stuck unclean for 323021.274535, current state active+undersized+degraded, last acting [47,18]
> >> pg 5.37 is stuck unclean for 323007.574395, current state active+undersized+degraded, last acting [43,1]
> >> pg 7.a is stuck undersized for 322487.284302, current state active+undersized+degraded, last acting [38,2]
> >> pg 5.27 is stuck undersized for 322487.287164, current state active+undersized+degraded, last acting [26,19]
> >> pg 5.32 is stuck undersized for 322487.285566, current state active+undersized+degraded, last acting [39,19]
> >> pg 6.13 is stuck undersized for 322487.287168, current state active+undersized+degraded, last acting [30,16]
> >> pg 5.4e is stuck undersized for 331351.476170, current state active+undersized+degraded, last acting [16,38]
> >> pg 5.72 is stuck undersized for 331351.475707, current state active+undersized+degraded, last acting [16,49]
> >> pg 4.17 is stuck undersized for 322487.280309, current state active+undersized+degraded, last acting [47,20]
> >> pg 5.2c is stuck undersized for 322487.286347, current state active+undersized+degraded, last acting [47,18]
> >> pg 5.37 is stuck undersized for 322487.280027, current state active+undersized+degraded, last acting [43,1]
> >> pg 7.a is stuck degraded for 322487.284340, current state active+undersized+degraded, last acting [38,2]
> >> pg 5.27 is stuck degraded for 322487.287202, current state active+undersized+degraded, last acting [26,19]
> >> pg 5.32 is stuck degraded for 322487.285604, current state active+undersized+degraded, last acting [39,19]
> >> pg 6.13 is stuck degraded for 322487.287207, current state active+undersized+degraded, last acting [30,16]
> >> pg 5.4e is stuck degraded for 331351.476209, current state active+undersized+degraded, last acting [16,38]
> >> pg 5.72 is stuck degraded for 331351.475746, current state active+undersized+degraded, last acting [16,49]
> >> pg 4.17 is stuck degraded for 322487.280348, current state active+undersized+degraded, last acting [47,20]
> >> pg 5.2c is stuck degraded for 322487.286386, current state active+undersized+degraded, last acting [47,18]
> >> pg 5.37 is stuck degraded for 322487.280066, current state active+undersized+degraded, last acting [43,1]
> >> pg 5.72 is active+undersized+degraded, acting [16,49]
> >> pg 5.4e is active+undersized+degraded, acting [16,38]
> >> pg 5.32 is active+undersized+degraded, acting [39,19]
> >> pg 5.37 is active+undersized+degraded, acting [43,1]
> >> pg 5.2c is active+undersized+degraded, acting [47,18]
> >> pg 5.27 is active+undersized+degraded, acting [26,19]
> >> pg 6.13 is active+undersized+degraded, acting [30,16]
> >> pg 4.17 is active+undersized+degraded, acting [47,20]
> >> pg 7.a is active+undersized+degraded, acting [38,2]
> >> recovery 3327/195138 objects degraded (1.705%)
> >> pool .users pg_num 512 > pgp_num 8
> >>
> >> My crush map is default.
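One thing worth fixing regardless: the warning "pool .users pg_num 512 > pgp_num 8" in the output above means the pool has 512 placement groups, but placement is still being computed as if it had only 8, so those 512 PGs are concentrated on far fewer OSDs than intended. Raising pgp_num to match is a one-liner (it will trigger some data movement):

    ceph osd pool set .users pgp_num 512

That alone may not clear the nine stuck PGs, but it removes the warning and lets the pool spread out properly.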
> >> Ceph.conf is:
> >>
> >> [osd]
> >> osd mkfs type=xfs
> >> osd recovery threads=2
> >> osd disk thread ioprio class=idle
> >> osd disk thread ioprio priority=7
> >> osd journal=/var/lib/ceph/osd/ceph-$id/journal
> >> filestore flusher=False
> >> osd op num shards=3
> >> debug osd=5
> >> osd disk threads=2
> >> osd data=/var/lib/ceph/osd/ceph-$id
> >> osd op num threads per shard=5
> >> osd op threads=4
> >> keyring=/var/lib/ceph/osd/ceph-$id/keyring
> >> osd journal size=4096
> >>
> >> [global]
> >> filestore max sync interval=10
> >> auth cluster required=cephx
> >> osd pool default min size=3
> >> osd pool default size=3
> >> public network=10.140.13.0/26
> >> objecter inflight op_bytes=1073741824
> >> auth service required=cephx
> >> filestore min sync interval=1
> >> fsid=fac04d85-db48-4564-b821-deebda046261
> >> keyring=/etc/ceph/keyring
> >> cluster network=10.140.13.0/26
> >> auth client required=cephx
> >> filestore xattr use omap=True
> >> max open files=65536
> >> objecter inflight ops=2048
> >> osd pool default pg num=512
> >> log to syslog = true
> >> #err to syslog = true
> >>
> >> --
> >> Gaurav Bafna
> >> 9540631400

--
Thanks,
Tupper Cole
Senior Storage Consultant
Global Storage Consulting, Red Hat
tc...@redhat.com
phone: + 01 919-720-2612
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com