Re: [ceph-users] stopped backfilling process
I hope it will help.

crush: https://www.dropbox.com/s/inrmq3t40om26vf/crush.txt
ceph osd dump: https://www.dropbox.com/s/jsbt7iypyfnnbqm/ceph_osd_dump.txt

--
Regards
Dominik

2013/11/6 yy-nm yxdyours...@gmail.com:
> On 2013/11/5 22:02, Dominik Mostowiec wrote:
>> Hi,
>> After removing (ceph osd out X) the OSDs from one server (11 OSDs),
>> ceph started the data migration process. It stopped at:
>> 32424 pgs: 30635 active+clean, 191 active+remapped, 1596
>> active+degraded, 2 active+clean+scrubbing; degraded (1.718%)
>> All OSDs with reweight == 1 are UP.
>>
>> ceph -v
>> ceph version 0.56.7 (14f23ab86b0058a8651895b3dc972a29459f3a33)
>>
>> health details: https://www.dropbox.com/s/149zvee2ump1418/health_details.txt
>> pg active+degraded query: https://www.dropbox.com/s/46emswxd7s8xce1/pg_11.39_query.txt
>> pg active+remapped query: https://www.dropbox.com/s/wij4uqh8qoz60fd/pg_16.2172_query.txt
>>
>> Please help - how can we fix it?
>
> Can you show your decoded crushmap and the output of ceph osd dump?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] stopped backfilling process
On Tue, Nov 5, 2013 at 3:02 PM, Dominik Mostowiec dominikmostow...@gmail.com wrote:
> After removing (ceph osd out X) the OSDs from one server (11 OSDs),
> ceph started the data migration process. It stopped at:
> 32424 pgs: 30635 active+clean, 191 active+remapped, 1596
> active+degraded, 2 active+clean+scrubbing; degraded (1.718%)
> All OSDs with reweight == 1 are UP.
> ceph -v
> ceph version 0.56.7 (14f23ab86b0058a8651895b3dc972a29459f3a33)

Hi,

Below I'm pasting some more information on this issue. The cluster
status hasn't changed for more than 24 hours:

# ceph health
HEALTH_WARN 1596 pgs degraded; 1787 pgs stuck unclean; recovery
2142704/123949567 degraded (1.729%)

I parsed the output of ceph pg dump and can see three types of pg
states there:

1. *Two* OSDs are up and *two* acting:
16.11   [42, 92]        [42, 92]        active+degraded
17.10   [42, 92]        [42, 92]        active+degraded

2. *Three* OSDs are up and *three* acting:
12.d    [114, 138, 5]   [114, 138, 5]   active+clean
15.e    [13, 130, 142]  [13, 130, 142]  active+clean

3. *Two* OSDs are up and *three* acting:
16.2256 [63, 109]       [63, 109, 40]   active+remapped
16.220b [129, 22]       [129, 22, 47]   active+remapped

A part of the crush map:

rack rack1 {
        id -5           # do not change unnecessarily
        # weight 60.000
        alg straw
        hash 0          # rjenkins1
        item storinodfs1 weight 12.000
        item storinodfs11 weight 12.000
        item storinodfs6 weight 12.000
        item storinodfs9 weight 12.000
        item storinodfs8 weight 12.000
}
rack rack2 {
        id -7           # do not change unnecessarily
        # weight 48.000
        alg straw
        hash 0          # rjenkins1
        item storinodfs3 weight 12.000
        item storinodfs4 weight 12.000
        item storinodfs2 weight 12.000
        item storinodfs10 weight 12.000
}
rack rack3 {
        id -10          # do not change unnecessarily
        # weight 36.000
        alg straw
        hash 0          # rjenkins1
        item storinodfs5 weight 12.000  # === all osds on this node have been disabled by ceph osd out
        item storinodfs7 weight 12.000
        item storinodfs12 weight 12.000
}
rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack
        step emit
}

The command ceph osd out has been invoked on all OSDs on storinodfs5,
and I can see them all down when listing with ceph osd tree:

-11     12              host storinodfs5
48      1                       osd.48  down    0
49      1                       osd.49  down    0
50      1                       osd.50  down    0
51      1                       osd.51  down    0
52      1                       osd.52  down    0
53      1                       osd.53  down    0
54      1                       osd.54  down    0
55      1                       osd.55  down    0
56      1                       osd.56  down    0
57      1                       osd.57  down    0
58      1                       osd.58  down    0
59      1                       osd.59  down    0

I wonder if the current cluster state might be related to the fact
that the crush map still records storinodfs5 with weight 12. We're
unable to make ceph recover from this faulty state. Any hints are
very much appreciated.

--
Regards,
Bohdan Sydor
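The crush map above still credits rack3 with storinodfs5's full item weight even though every OSD on that host is out. A small Python sketch (illustrative only, using the numbers quoted above rather than live cluster data) of the gap between the rack weight CRUSH sees and the capacity that can actually take data:

```python
# Numbers taken from the rack3 entry of the quoted crush map.
# Hosts in out_hosts are those whose OSDs were marked out with
# `ceph osd out` (reweight 0); their item weight still counts
# inside the crush map itself.
rack3_items = {"storinodfs5": 12.0, "storinodfs7": 12.0, "storinodfs12": 12.0}
out_hosts = {"storinodfs5"}

nominal_weight = sum(rack3_items.values())
in_weight = sum(w for host, w in rack3_items.items() if host not in out_hosts)

print(nominal_weight)  # 36.0 - what the crush map advertises for rack3
print(in_weight)       # 24.0 - weight of hosts still taking data
```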
[ceph-users] stopped backfilling process
Hi,
After removing (ceph osd out X) the OSDs from one server (11 OSDs),
ceph started the data migration process. It stopped at:
32424 pgs: 30635 active+clean, 191 active+remapped, 1596
active+degraded, 2 active+clean+scrubbing; degraded (1.718%)
All OSDs with reweight == 1 are UP.

ceph -v
ceph version 0.56.7 (14f23ab86b0058a8651895b3dc972a29459f3a33)

health details: https://www.dropbox.com/s/149zvee2ump1418/health_details.txt
pg active+degraded query: https://www.dropbox.com/s/46emswxd7s8xce1/pg_11.39_query.txt
pg active+remapped query: https://www.dropbox.com/s/wij4uqh8qoz60fd/pg_16.2172_query.txt

Please help - how can we fix it?

--
Regards
Dominik
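As a quick arithmetic check (mine, not ceph output), the per-state counts in the status line above do add up to the reported pg total:

```python
# Per-state pg counts copied from the status line quoted above.
pg_states = {
    "active+clean": 30635,
    "active+remapped": 191,
    "active+degraded": 1596,
    "active+clean+scrubbing": 2,
}

total_pgs = sum(pg_states.values())
print(total_pgs)  # 32424, matching the reported "32424 pgs"
```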
Re: [ceph-users] stopped backfilling process
Hi,
This is an s3/ceph cluster; .rgw.buckets keeps 3 copies of data. Many
PGs are on only 2 OSDs and are marked as 'degraded'.
Can scrubbing fix the degraded objects?
I don't have the tunables set in the crush map - maybe setting them
can help (is this safe)?

--
Regards
Dominik

2013/11/5 Dominik Mostowiec dominikmostow...@gmail.com:
> Hi,
> After removing (ceph osd out X) the OSDs from one server (11 OSDs),
> ceph started the data migration process. It stopped at:
> 32424 pgs: 30635 active+clean, 191 active+remapped, 1596
> active+degraded, 2 active+clean+scrubbing; degraded (1.718%)
> All OSDs with reweight == 1 are UP.
> ceph -v
> ceph version 0.56.7 (14f23ab86b0058a8651895b3dc972a29459f3a33)
> health details: https://www.dropbox.com/s/149zvee2ump1418/health_details.txt
> pg active+degraded query: https://www.dropbox.com/s/46emswxd7s8xce1/pg_11.39_query.txt
> pg active+remapped query: https://www.dropbox.com/s/wij4uqh8qoz60fd/pg_16.2172_query.txt
> Please help - how can we fix it?
> --
> Regards
> Dominik
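For anyone who wants to pull the affected pgs out of a `ceph pg dump` listing mechanically, here is a small sketch (not an official Ceph tool; it assumes a replication size of 3, as for .rgw.buckets in this thread) of the up/acting classification discussed above: up == acting with fewer replicas than the pool size is degraded, a full-size match is clean, and an up set shorter than the acting set means the pg is remapped.

```python
def classify(up, acting, pool_size=3):
    """Rough pg-state classification from the up and acting OSD sets."""
    if up == acting and len(acting) == pool_size:
        return "active+clean"
    if up == acting and len(acting) < pool_size:
        return "active+degraded"
    if len(up) < len(acting):
        return "active+remapped"
    return "other"

# Examples taken from the pg dump excerpts earlier in this thread:
print(classify([42, 92], [42, 92]))            # active+degraded
print(classify([114, 138, 5], [114, 138, 5]))  # active+clean
print(classify([63, 109], [63, 109, 40]))      # active+remapped
```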