Hey folks,

I'm staring at a problem I haven't found a solution for, and it's causing 
major issues.
We've had a PG go down: the first 3 OSDs all crash, come back up, and then 
crash again with the following error in their logs:

    -1> 2017-08-22 17:27:50.961633 7f4af4057700 -1 osd.1290 pg_epoch: 72946 pg[1.138s0( v 72946'430011 (62760'421568,72946'430011] local-les=72945 n=22918 ec=764 les/c/f 72945/72881/0 72942/72944/72944) [1290,927,672,456,177,1094,194,1513,236,302,1326]/[1290,927,672,456,177,1094,194,2147483647,236,302,1326] r=0 lpr=72944 pi=72880-72943/24 bft=1513(7) crt=72946'430011 lcod 72889'430010 mlcod 72889'430010 active+undersized+degraded+remapped+backfilling] recover_replicas: object added to missing set for backfill, but is not in recovering, error!
     0> 2017-08-22 17:27:50.965861 7f4af4057700 -1 *** Caught signal (Aborted) ** in thread 7f4af4057700 thread_name:tp_osd_tp
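For what it's worth, here's how we read the up/acting sets out of that log line. This is just a quick Python sketch (the sample line is abbreviated from the log above); my understanding is that 2147483647 (0x7fffffff) is Ceph's placeholder for "no OSD assigned", so shard 7 of the acting set has no OSD:

```python
import re

# Abbreviated pg log line from the OSD log above: up set / acting set.
line = ("pg[1.138s0( ... ) "
        "[1290,927,672,456,177,1094,194,1513,236,302,1326]"
        "/[1290,927,672,456,177,1094,194,2147483647,236,302,1326] r=0")

NONE = 2147483647  # Ceph's "no OSD" placeholder in an acting set

def parse_sets(line):
    """Return (up, acting) OSD id lists from a pg log line."""
    m = re.search(r"\[([\d,]+)\]/\[([\d,]+)\]", line)
    up, acting = (list(map(int, g.split(","))) for g in m.groups())
    return up, acting

up, acting = parse_sets(line)
missing = [i for i, osd in enumerate(acting) if osd == NONE]
print("up:     ", up)
print("acting: ", acting)
print("missing shards:", missing)  # shard 7 has no OSD assigned
```

So the cluster wants to backfill shard 7 onto osd.1513 (the bft=1513(7) field), which matches where the crash occurs.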

This has been going on since the weekend; before upgrading from 11.2.0 to 
11.2.1 we were seeing a different error message.
The pool is running EC 8+3.

The OSDs crash with that error, get restarted by systemd, and fail again in 
exactly the same way. Eventually systemd gives up, the 
mon_osd_down_out_interval expires, and the PG just stays down+remapped while 
the others recover and go active+clean.

Can anybody help with this type of problem?


Best regards,

George Vasilakakos
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
