I have searched Google and it seems there is no official procedure.

On Wed, 29 Jun 2016 at 09:43, Mario Giammarco <
mgiamma...@gmail.com> wrote:

> I have read the post "Incomplete PGs, oh my!" many times,
> but I think my case is different:
> the broken disk is completely broken.
> So how can I simply mark the incomplete PGs as complete?
> Should I stop Ceph first?
>
>
> On Wed, 29 Jun 2016 at 09:36, Tomasz Kuzemko <
> tomasz.kuze...@corp.ovh.com> wrote:
>
>> Hi,
>> if you need fast access to your remaining data, you can use
>> ceph-objectstore-tool to mark those PGs as complete; however, this will
>> irreversibly lose the missing data.
>>
>> If you understand the risks, the procedure is explained fairly well here:
>> http://ceph.com/community/incomplete-pgs-oh-my/
>>
>> Since that article was written, ceph-objectstore-tool has gained a
>> feature that was not available at the time: "--op mark-complete". I
>> think that in your case you will need to call --op mark-complete after
>> you import the PG into the temporary OSD (between steps 12 and 13).
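>>
>> A rough sketch of that extra step. The PG id (0.2a), OSD numbers, and
>> file paths below are placeholders, not values from this thread; adjust
>> everything to your cluster, and stop the OSD daemons involved before
>> touching their data directories:

```shell
# Sketch only: PG id 0.2a, OSD numbers, and paths are placeholders.
# Run with the affected OSDs stopped.

# 1. Export the PG from the broken/partial OSD's data directory.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-5 \
  --journal-path /var/lib/ceph/osd/ceph-5/journal \
  --pgid 0.2a --op export --file /root/0.2a.export

# 2. Import it into the temporary OSD (as in the article's steps).
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-10 \
  --journal-path /var/lib/ceph/osd/ceph-10/journal \
  --pgid 0.2a --op import --file /root/0.2a.export

# 3. The new step: mark the PG complete on the temporary OSD. This is
#    irreversible -- any objects missing from the export are lost for good.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-10 \
  --journal-path /var/lib/ceph/osd/ceph-10/journal \
  --pgid 0.2a --op mark-complete

# 4. Start the temporary OSD and let the cluster recover/backfill.
```

>> Then continue with the rest of the article's procedure as written.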
>>
>> On 29.06.2016 09:09, Mario Giammarco wrote:
>> > Now I have also discovered that, by mistake, someone put production
>> > data on a virtual machine in this cluster. I need Ceph to resume I/O
>> > so I can boot that virtual machine.
>> > Can I mark the incomplete PGs as valid?
>> > If needed, where can I buy some paid support?
>> > Thanks again,
>> > Mario
>> >
>> > On Wed, 29 Jun 2016 at 08:02, Mario Giammarco
>> > <mgiamma...@gmail.com> wrote:
>> >
>> >     pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0
>> >     object_hash rjenkins pg_num 512 pgp_num 512 last_change 9313 flags
>> >     hashpspool stripe_width 0
>> >            removed_snaps [1~3]
>> >     pool 1 'rbd2' replicated size 2 min_size 1 crush_ruleset 0
>> >     object_hash rjenkins pg_num 512 pgp_num 512 last_change 9314 flags
>> >     hashpspool stripe_width 0
>> >            removed_snaps [1~3]
>> >     pool 2 'rbd3' replicated size 2 min_size 1 crush_ruleset 0
>> >     object_hash rjenkins pg_num 512 pgp_num 512 last_change 10537 flags
>> >     hashpspool stripe_width 0
>> >            removed_snaps [1~3]
>> >
>> >
>> >     ID WEIGHT  REWEIGHT SIZE   USE   AVAIL %USE  VAR
>> >     5 1.81000  1.00000  1857G  984G  872G 53.00 0.86
>> >     6 1.81000  1.00000  1857G 1202G  655G 64.73 1.05
>> >     2 1.81000  1.00000  1857G 1158G  698G 62.38 1.01
>> >     3 1.35999  1.00000  1391G  906G  485G 65.12 1.06
>> >     4 0.89999  1.00000   926G  702G  223G 75.88 1.23
>> >     7 1.81000  1.00000  1857G 1063G  793G 57.27 0.93
>> >     8 1.81000  1.00000  1857G 1011G  846G 54.44 0.88
>> >     9 0.89999  1.00000   926G  573G  352G 61.91 1.01
>> >     0 1.81000  1.00000  1857G 1227G  629G 66.10 1.07
>> >     13 0.45000  1.00000   460G  307G  153G 66.74 1.08
>> >                  TOTAL 14846G 9136G 5710G 61.54
>> >     MIN/MAX VAR: 0.86/1.23  STDDEV: 6.47
>> >
>> >
>> >
>> >     ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
>> >
>> >     http://pastebin.com/SvGfcSHb
>> >     http://pastebin.com/gYFatsNS
>> >     http://pastebin.com/VZD7j2vN
>> >
>> >     I do not understand why I/O on the ENTIRE cluster is blocked when
>> >     only a few PGs are incomplete.
>> >
>> >     Many thanks,
>> >     Mario
>> >
>> >
>> >     On Tue, 28 Jun 2016 at 19:34, Stefan Priebe - Profihost
>> >     AG <s.pri...@profihost.ag> wrote:
>> >
>> >         And ceph health detail
>> >
>> >         Stefan
>> >
>> >         Excuse my typos; sent from my mobile phone.
>> >
>> >         On 28.06.2016 at 19:28, Oliver Dzombic
>> >         <i...@ip-interactive.de> wrote:
>> >
>> >>         Hi Mario,
>> >>
>> >>         please give some more details:
>> >>
>> >>         Please the output of:
>> >>
>> >>         ceph osd pool ls detail
>> >>         ceph osd df
>> >>         ceph --version
>> >>
>> >>         ceph -w for 10 seconds ( use http://pastebin.com/ please )
>> >>
>> >>         ceph osd crush dump ( also pastebin pls )
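>> >>
>> >>         For the incomplete PGs specifically, it can also help to run
>> >>         the following (the PG id below is a placeholder, not one from
>> >>         this cluster):

```shell
# Show which PGs are stuck and why the cluster flags them.
ceph health detail
ceph pg dump_stuck inactive

# Query one incomplete PG (placeholder id) to see its peering state
# and which OSDs it is probing or missing.
ceph pg 0.2a query
```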
>> >>
>> >>         --
>> >>         Mit freundlichen Gruessen / Best regards
>> >>
>> >>         Oliver Dzombic
>> >>         IP-Interactive
>> >>
>> >>         mailto:i...@ip-interactive.de
>> >>
>> >>         Address:
>> >>
>> >>         IP Interactive UG ( haftungsbeschraenkt )
>> >>         Zum Sonnenberg 1-3
>> >>         63571 Gelnhausen
>> >>
>> >>         HRB 93402, Amtsgericht Hanau
>> >>         Managing director: Oliver Dzombic
>> >>
>> >>         Tax no.: 35 236 3622 1
>> >>         VAT ID: DE274086107
>> >>
>> >>
>> >>         On 28.06.2016 at 18:59, Mario Giammarco wrote:
>> >>>         Hello,
>> >>>         this is the second time this has happened to me; I hope
>> >>>         someone can explain what I can do.
>> >>>         Proxmox Ceph cluster with 8 servers, 11 HDDs; min_size=1,
>> >>>         size=2.
>> >>>
>> >>>         One HDD went down due to bad sectors.
>> >>>         Ceph recovered, but ended up with:
>> >>>
>> >>>         cluster f2a8dd7d-949a-4a29-acab-11d4900249f4
>> >>>             health HEALTH_WARN
>> >>>                    3 pgs down
>> >>>                    19 pgs incomplete
>> >>>                    19 pgs stuck inactive
>> >>>                    19 pgs stuck unclean
>> >>>                    7 requests are blocked > 32 sec
>> >>>             monmap e11: 7 mons at
>> >>>         {0=192.168.0.204:6789/0,1=192.168.0.201:6789/0,2=192.168.0.203:6789/0,3=192.168.0.205:6789/0,4=192.168.0.202:6789/0,5=192.168.0.206:6789/0,6=192.168.0.207:6789/0}
>> >>>                    election epoch 722, quorum
>> >>>         0,1,2,3,4,5,6 1,4,2,0,3,5,6
>> >>>             osdmap e10182: 10 osds: 10 up, 10 in
>> >>>              pgmap v3295880: 1024 pgs, 2 pools, 4563 GB data, 1143
>> >>>         kobjects
>> >>>                    9136 GB used, 5710 GB / 14846 GB avail
>> >>>                        1005 active+clean
>> >>>                          16 incomplete
>> >>>                           3 down+incomplete
>> >>>
>> >>>         Unfortunately "7 requests are blocked" means that no
>> >>>         virtual machine can boot, because Ceph has stopped I/O.
>> >>>
>> >>>         I can accept losing some data, but not ALL of it!
>> >>>         Can you help me please?
>> >>>         Thanks,
>> >>>         Mario
>> >>>
>> >>>         _______________________________________________
>> >>>         ceph-users mailing list
>> >>>         ceph-users@lists.ceph.com
>> >>>         http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>>
>> >
>> >
>> >
>> >
>>
>> --
>> Tomasz Kuzemko
>> tomasz.kuze...@corp.ovh.com
>>
>>
>
