Hi,

You can use 'ceph pg query' to check what's going on with the PGs that have
a problem, and "ceph-objectstore-tool" to recover those PGs.
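For the mark-complete route discussed below, a minimal sketch of the usual sequence. The OSD id, PG id, and store paths are assumptions taken from this thread, and the script only echoes each command through a dry-run wrapper so the ordering is visible; remove the wrapper to run it for real:

```shell
# Dry-run sketch only -- OSD id, PG id, and paths are assumptions from this thread.
OSD_ID=166
PGID=3.1683
DATA=/var/lib/ceph/osd/ceph-$OSD_ID
JOURNAL=$DATA/journal

# Print each step instead of executing it; drop this wrapper to run for real.
run() { echo "+ $*"; }

# ceph-objectstore-tool needs exclusive access to the store: stop the OSD first.
run service ceph stop osd.$OSD_ID

# 'info' reports the PG's last_update and log bounds; compare these across all
# replicas to find the most up-to-date copy instead of relying on file mtimes.
run ceph-objectstore-tool --data-path "$DATA" --journal-path "$JOURNAL" \
    --pgid "$PGID" --op info

# Always export the PG as a backup before any destructive operation.
run ceph-objectstore-tool --data-path "$DATA" --journal-path "$JOURNAL" \
    --pgid "$PGID" --op export --file "/root/pg-$PGID.export"

# Last resort: mark the PG complete on the replica chosen above.
run ceph-objectstore-tool --data-path "$DATA" --journal-path "$JOURNAL" \
    --pgid "$PGID" --op mark-complete

run service ceph start osd.$OSD_ID
```

Note that the mark-complete op is not present in every hammer build of the tool, so check `ceph-objectstore-tool --help` on your version first.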
2016-06-21 19:09 GMT+08:00 Paweł Sadowski <c...@sadziu.pl>:
> Already restarted those OSDs and then the whole cluster (rack by rack;
> failure domain is rack in this setup).
> We would like to try the *ceph-objectstore-tool mark-complete* operation.
> Is there any way (other than checking mtime on files and querying PGs) to
> determine which replica has the most up-to-date data?
>
> On 06/21/2016 12:37 PM, M Ranga Swami Reddy wrote:
> > Try to restart OSDs 109 and 166? Check if that helps.
> >
> > On Tue, Jun 21, 2016 at 4:05 PM, Paweł Sadowski <c...@sadziu.pl> wrote:
> >> Thanks for the response.
> >>
> >> All OSDs seem to be OK; they have been restarted and rejoined the
> >> cluster after that, and there is nothing weird in the logs.
> >>
> >> # ceph pg dump_stuck stale
> >> ok
> >>
> >> # ceph pg dump_stuck inactive
> >> ok
> >> pg_stat  state       up             up_primary  acting         acting_primary
> >> 3.2929   incomplete  [109,272,83]   109         [109,272,83]   109
> >> 3.1683   incomplete  [166,329,281]  166         [166,329,281]  166
> >>
> >> # ceph pg dump_stuck unclean
> >> ok
> >> pg_stat  state       up             up_primary  acting         acting_primary
> >> 3.2929   incomplete  [109,272,83]   109         [109,272,83]   109
> >> 3.1683   incomplete  [166,329,281]  166         [166,329,281]  166
> >>
> >> On OSD 166 there are 100 blocked ops (on 109 too); they all end on
> >> "event": "reached_pg".
> >>
> >> # ceph --admin-daemon /var/run/ceph/ceph-osd.166.asok dump_ops_in_flight
> >> ...
> >> {
> >>     "description": "osd_op(client.958764031.0:18137113
> >> rbd_data.392585982ae8944a.0000000000000ad4 [set-alloc-hint object_size
> >> 4194304 write_size 4194304,write 2641920~8192] 3.d6195683 RETRY=15
> >> ack+ondisk+retry+write+known_if_redirected e613241)",
> >>     "initiated_at": "2016-06-21 10:19:59.894393",
> >>     "age": 828.025527,
> >>     "duration": 600.020809,
> >>     "type_data": [
> >>         "reached pg",
> >>         {
> >>             "client": "client.958764031",
> >>             "tid": 18137113
> >>         },
> >>         [
> >>             {
> >>                 "time": "2016-06-21 10:19:59.894393",
> >>                 "event": "initiated"
> >>             },
> >>             {
> >>                 "time": "2016-06-21 10:29:59.915202",
> >>                 "event": "reached_pg"
> >>             }
> >>         ]
> >>     ]
> >> }
> >> ],
> >> "num_ops": 100
> >> }
> >>
> >> On 06/21/2016 12:27 PM, M Ranga Swami Reddy wrote:
> >>> You can use the commands below:
> >>> ===
> >>> ceph pg dump_stuck stale
> >>> ceph pg dump_stuck inactive
> >>> ceph pg dump_stuck unclean
> >>> ===
> >>>
> >>> Then query the PGs which are in the unclean or stale state, and check
> >>> for any issue with a specific OSD.
> >>>
> >>> Thanks
> >>> Swami
> >>>
> >>> On Tue, Jun 21, 2016 at 3:02 PM, Paweł Sadowski <c...@sadziu.pl> wrote:
> >>>> Hello,
> >>>>
> >>>> We have an issue on one of our clusters. One node with 9 OSDs was
> >>>> down for more than 12 hours. During that time the cluster recovered
> >>>> without problems. When the host came back to the cluster we got two
> >>>> PGs in incomplete state. We decided to mark the OSDs on this host as
> >>>> out, but the two PGs are still incomplete. Trying to query those PGs
> >>>> hangs forever. We already tried restarting the OSDs. Is there any way
> >>>> to solve this issue without losing data?
> >>>> Any help appreciated :)
> >>>>
> >>>> # ceph health detail | grep incomplete
> >>>> HEALTH_WARN 2 pgs incomplete; 2 pgs stuck inactive; 2 pgs stuck
> >>>> unclean; 200 requests are blocked > 32 sec; 2 osds have slow
> >>>> requests; noscrub,nodeep-scrub flag(s) set
> >>>> pg 3.2929 is stuck inactive since forever, current state incomplete,
> >>>> last acting [109,272,83]
> >>>> pg 3.1683 is stuck inactive since forever, current state incomplete,
> >>>> last acting [166,329,281]
> >>>> pg 3.2929 is stuck unclean since forever, current state incomplete,
> >>>> last acting [109,272,83]
> >>>> pg 3.1683 is stuck unclean since forever, current state incomplete,
> >>>> last acting [166,329,281]
> >>>> pg 3.1683 is incomplete, acting [166,329,281] (reducing pool vms
> >>>> min_size from 2 may help; search ceph.com/docs for 'incomplete')
> >>>> pg 3.2929 is incomplete, acting [109,272,83] (reducing pool vms
> >>>> min_size from 2 may help; search ceph.com/docs for 'incomplete')
> >>>>
> >>>> The directory for PG 3.1683 is present on OSD 166 and contains ~8 GB.
> >>>>
> >>>> We didn't try setting min_size to 1 yet (we treat it as a last resort).
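The blocked-op dump quoted above can be triaged from the shell without reading the whole JSON; a sketch, assuming the admin-socket output was first saved to a file (the file path and the sample below are mine; the "reached_pg" field name comes from the 0.94 output in this thread):

```shell
# Count ops whose event history contains "reached_pg" in a saved dump, e.g. from:
#   ceph --admin-daemon /var/run/ceph/ceph-osd.166.asok dump_ops_in_flight > /tmp/ops.json
count_reached_pg() {
    grep -c '"event": "reached_pg"' "$1"
}

# Tiny sample shaped like the quoted dump, just to demonstrate usage:
cat > /tmp/ops.json <<'EOF'
{ "ops": [
    { "description": "osd_op(client.958764031.0:18137113 ...)",
      "type_data": [ "reached pg",
        { "client": "client.958764031", "tid": 18137113 },
        [ { "event": "initiated" },
          { "event": "reached_pg" } ] ] }
  ],
  "num_ops": 1 }
EOF

count_reached_pg /tmp/ops.json   # -> 1 (one op stuck at reached_pg)
```

Ops whose last event is reached_pg entered the PG's queue and never progressed, which is consistent with the PG itself being incomplete rather than with a slow disk.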
> >>>>
> >>>> Some cluster info:
> >>>>
> >>>> # ceph --version
> >>>> ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
> >>>>
> >>>> # ceph -s
> >>>>     health HEALTH_WARN
> >>>>            2 pgs incomplete
> >>>>            2 pgs stuck inactive
> >>>>            2 pgs stuck unclean
> >>>>            200 requests are blocked > 32 sec
> >>>>            noscrub,nodeep-scrub flag(s) set
> >>>>     monmap e7: 5 mons at
> >>>> {mon-03=*.2:6789/0,mon-04=*.36:6789/0,mon-05=*.81:6789/0,mon-06=*.0:6789/0,mon-07=*.40:6789/0}
> >>>>            election epoch 3250, quorum 0,1,2,3,4
> >>>> mon-06,mon-07,mon-04,mon-03,mon-05
> >>>>     osdmap e613040: 346 osds: 346 up, 337 in
> >>>>            flags noscrub,nodeep-scrub
> >>>>     pgmap v27163053: 18624 pgs, 6 pools, 138 TB data, 39062 kobjects
> >>>>            415 TB used, 186 TB / 601 TB avail
> >>>>                18622 active+clean
> >>>>                    2 incomplete
> >>>>     client io 9992 kB/s rd, 64867 kB/s wr, 8458 op/s
> >>>>
> >>>> # ceph osd pool get vms pg_num
> >>>> pg_num: 16384
> >>>>
> >>>> # ceph osd pool get vms size
> >>>> size: 3
> >>>>
> >>>> # ceph osd pool get vms min_size
> >>>> min_size: 2
> >>
> >> --
> >> PS

--
Best regards,
施柏安 Desmond Shih
Technical Development (技術研發部)
inwinstack (迎棧科技股份有限公司) │ http://www.inwinstack.com/
desmond.s@inwinstack.com │ 886-975-857-982 │ 886-2-7738-2858 #7725
Rm. C, 5F., No. 3, Yuandong Rd., Banqiao Dist., New Taipei City 220, Taiwan (R.O.C.)
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com