What were the settings for your pool? What was the size? It looks like the
size was 2 and that the PGs only existed on osds 2 and 6. If that's the
case, it's like having a 4-disk RAID 1+0, removing 2 disks of the same
mirror, and complaining that the other mirror didn't pick up the data...
Don't delete all copies of your data. If your replica size is 2, you
cannot lose 2 disks at the same time.
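
If the pool really was created with size 2, then osd.2 and osd.6 held the
only two copies of those 16 PGs, so stopping both removed every copy at
once. As a rough sketch (assuming the pool is still named exp-volumes and
that CRUSH can place a third copy on the remaining OSDs), you could check
and raise the replica settings before repeating the test:

 # ceph osd pool get exp-volumes size
 # ceph osd pool get exp-volumes min_size
 # ceph osd pool set exp-volumes size 3
 # ceph osd pool set exp-volumes min_size 2

With size 3, each PG keeps a copy on a third OSD, so losing any two OSDs
still leaves one surviving copy that Ceph can recover from, instead of
stale PGs with no copy at all.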

On Fri, Aug 18, 2017, 1:28 AM Hyun Ha <hfamil...@gmail.com> wrote:

> Hi, Cephers!
>
> I'm currently testing double-failure scenarios on a Ceph cluster, but I've
> run into PGs that stay in the stale state forever.
>
> reproduce steps (the commands for steps 1-5 are sketched after the status output below):
> 0. ceph version : jewel 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
> 1. Pool create : exp-volumes (size = 2, min_size = 1)
> 2. rbd create : testvol01
> 3. rbd map and mkfs.xfs
> 4. mount and create file
> 5. list rados object
> 6. check osd map of each object
>  # ceph osd map exp-volumes rbd_data.4a41f238e1f29.000000000000017a
>    osdmap e199 pool 'exp-volumes' (2) object
> 'rbd_data.4a41f238e1f29.000000000000017a' -> pg 2.3f04d6e2 (2.62) -> up
> ([2,6], p2) acting ([2,6], p2)
> 7. stop primary osd.2 and secondary osd.6 of the above object at the same time
> 8. check ceph status
> health HEALTH_ERR
>             16 pgs are stuck inactive for more than 300 seconds
>             16 pgs stale
>             16 pgs stuck stale
>      monmap e11: 3 mons at {10.105.176.85=
> 10.105.176.85:6789/0,10.110.248.154=10.110.248.154:6789/0,10.110.249.153=10.110.249.153:6789/0
> }
>             election epoch 84, quorum 0,1,2
> 10.105.176.85,10.110.248.154,10.110.249.153
>      osdmap e248: 6 osds: 4 up, 4 in; 16 remapped pgs
>             flags sortbitwise,require_jewel_osds
>       pgmap v112095: 128 pgs, 1 pools, 14659 kB data, 17 objects
>             165 MB used, 159 GB / 160 GB avail
>                  112 active+clean
>                   16 stale+active+clean
>
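> Roughly the commands behind steps 1-5 (pg_num 128 matches the pgmap above;
> the image size, mount point, and test file shown are representative
> examples, not necessarily the exact ones used):
>
>  # ceph osd pool create exp-volumes 128 128
>  # ceph osd pool set exp-volumes size 2
>  # ceph osd pool set exp-volumes min_size 1
>  # rbd create exp-volumes/testvol01 --size 1024
>  # rbd map exp-volumes/testvol01        --> prints a device such as /dev/rbd0
>  # mkfs.xfs /dev/rbd0
>  # mkdir -p /mnt/testvol01
>  # mount /dev/rbd0 /mnt/testvol01
>  # dd if=/dev/zero of=/mnt/testvol01/testfile bs=1M count=10
>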
> # ceph health detail
> HEALTH_ERR 16 pgs are stuck inactive for more than 300 seconds; 16 pgs
> stale; 16 pgs stuck stale
> pg 2.67 is stuck stale for 689.171742, current state stale+active+clean,
> last acting [2,6]
> pg 2.5a is stuck stale for 689.171748, current state stale+active+clean,
> last acting [6,2]
> pg 2.52 is stuck stale for 689.171753, current state stale+active+clean,
> last acting [2,6]
> pg 2.4d is stuck stale for 689.171757, current state stale+active+clean,
> last acting [2,6]
> pg 2.56 is stuck stale for 689.171755, current state stale+active+clean,
> last acting [6,2]
> pg 2.d is stuck stale for 689.171811, current state stale+active+clean,
> last acting [6,2]
> pg 2.79 is stuck stale for 689.171808, current state stale+active+clean,
> last acting [2,6]
> pg 2.1f is stuck stale for 689.171782, current state stale+active+clean,
> last acting [6,2]
> pg 2.76 is stuck stale for 689.171809, current state stale+active+clean,
> last acting [6,2]
> pg 2.17 is stuck stale for 689.171794, current state stale+active+clean,
> last acting [6,2]
> pg 2.63 is stuck stale for 689.171794, current state stale+active+clean,
> last acting [2,6]
> pg 2.77 is stuck stale for 689.171816, current state stale+active+clean,
> last acting [2,6]
> pg 2.1b is stuck stale for 689.171793, current state stale+active+clean,
> last acting [6,2]
> pg 2.62 is stuck stale for 689.171765, current state stale+active+clean,
> last acting [2,6]
> pg 2.30 is stuck stale for 689.171799, current state stale+active+clean,
> last acting [2,6]
> pg 2.19 is stuck stale for 689.171798, current state stale+active+clean,
> last acting [6,2]
>
>  # ceph pg dump_stuck stale
> ok
> pg_stat state   up      up_primary      acting  acting_primary
> 2.67    stale+active+clean      [2,6]   2       [2,6]   2
> 2.5a    stale+active+clean      [6,2]   6       [6,2]   6
> 2.52    stale+active+clean      [2,6]   2       [2,6]   2
> 2.4d    stale+active+clean      [2,6]   2       [2,6]   2
> 2.56    stale+active+clean      [6,2]   6       [6,2]   6
> 2.d     stale+active+clean      [6,2]   6       [6,2]   6
> 2.79    stale+active+clean      [2,6]   2       [2,6]   2
> 2.1f    stale+active+clean      [6,2]   6       [6,2]   6
> 2.76    stale+active+clean      [6,2]   6       [6,2]   6
> 2.17    stale+active+clean      [6,2]   6       [6,2]   6
> 2.63    stale+active+clean      [2,6]   2       [2,6]   2
> 2.77    stale+active+clean      [2,6]   2       [2,6]   2
> 2.1b    stale+active+clean      [6,2]   6       [6,2]   6
> 2.62    stale+active+clean      [2,6]   2       [2,6]   2
> 2.30    stale+active+clean      [2,6]   2       [2,6]   2
> 2.19    stale+active+clean      [6,2]   6       [6,2]   6
>
> # ceph pg 2.62 query
> Error ENOENT: i don't have pgid 2.62
>
>  # rados ls -p exp-volumes
> rbd_data.4a41f238e1f29.000000000000003f
> ^C --> hang
>
> I understand that this is a natural result, because the above PGs have no
> primary or secondary OSD. But this situation can happen, so I want to
> recover the Ceph cluster and the RBD images.
>
> First, I want to know how to bring the cluster back to a clean state.
> I read the documentation and tried to solve this, but nothing helped,
> including the commands below:
>  - ceph pg force_create_pg 2.6
>  - ceph osd lost 2 --yes-i-really-mean-it
>  - ceph osd lost 6 --yes-i-really-mean-it
>  - ceph osd crush rm osd.2
>  - ceph osd crush rm osd.6
>  - ceph osd rm osd.2
>  - ceph osd rm osd.6
>
> Is there any command to force-delete these PGs or otherwise make the cluster clean?
> Thank you in advance.
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
