Hi Mario
Perhaps it's covered under Proxmox support. Do you have support for your
Proxmox install from the people at Proxmox?
Otherwise you can always buy support from Red Hat:
https://www.redhat.com/en/technologies/storage/ceph
On Thu, Jun 30, 2016 at 7:37 AM, Mario Giammarco wrote:
Last two questions:
1) I have used other systems in the past. In case of split brain or serious
problems they let me choose which copy was "good" and then work
again. Is there a way to tell ceph that all is OK? This morning I again
have 19 incomplete pgs after recovery.
2) Where can I find
This time, at the end of the recovery procedure you described, it was mostly
pgs active+clean, with 20 pgs incomplete.
After that, when trying to use the cluster, I got "request blocked more
than" and no VM can start.
I know that something happened after the broken disk, probably a server
reboot. I am
Hi,
On 29/06/2016 12:00, Mario Giammarco wrote:
> Now the problem is that ceph has marked two disks out because scrub has
> failed (I think it is not a disk fault but is due to the mark-complete)
There is something odd going on. I've only seen deep-scrub failing (i.e.
detect one inconsistency and
Hi,
it does not.
But in your case, you have 10 OSDs, and 7 of them have incomplete PGs.
So since your Proxmox VPSs are not on single PGs but are spread across
many PGs, you have a good chance that at least some data of every VPS is
on one of the defective PGs.
--
Mit freundlichen Gruessen / Best
Just one question: why, when ceph has some incomplete pgs, does it refuse
to do I/O on good pgs?
On Wed, Jun 29, 2016 at 12:55, Oliver Dzombic <i...@ip-interactive.de> wrote:
Hi,
again:
You >must< check all your logs (as awful as it is, for sure).
That means on the ceph nodes, in /var/log/ceph/*.
And go back to the time when things went downhill.
There must be something else going on, beyond a normal OSD crash.
And your manual pg repair / pg remove / pg set complete
Thanks,
I can put the OSDs in, but they do not stay in, and I am pretty sure they
are not broken.
On Wed, Jun 29, 2016 at 12:07, Oliver Dzombic <i...@ip-interactive.de> wrote:
hi,
ceph osd set noscrub
ceph osd set nodeep-scrub
ceph osd in <osd-id>
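Spelled out as a fuller sequence, for reference; the OSD ids 5 and 7 are
hypothetical stand-ins for the two disks that were marked out:
ceph osd set noscrub          # no new scrubs get scheduled
ceph osd set nodeep-scrub     # same for deep scrubs
ceph osd in 5                 # mark the out'ed OSDs back in
ceph osd in 7
# once recovery finishes, re-enable scrubbing:
ceph osd unset noscrub
ceph osd unset nodeep-scrub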
--
Mit freundlichen Gruessen / Best regards
Oliver Dzombic
IP-Interactive
mailto:i...@ip-interactive.de
Anschrift:
IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen
HRB 93402 beim Amtsgericht
Now the problem is that ceph has marked two disks out because scrub has
failed (I think it is not a disk fault but is due to the mark-complete)
How can I:
- disable scrubbing
- put the two disks back in
In any case I will wait for the end of recovery to be sure it really works again.
On Wed, Jun 29, 2016 at
In fact I am worried because:
1) ceph runs under Proxmox, and Proxmox may decide to reboot a server if it
is not responding
2) probably a server was rebooted while ceph was reconstructing
3) even using max=3 does not help
Anyway, this is the "unofficial" procedure that I am using, much simpler
than
Hi,
removing ONE disk while your replication is 2 is no problem.
You don't need to wait a single second to replace or remove it. It is not
used anyway and is out/down, so from ceph's point of view it does not exist.
But as Christian told you already, what we see now fits a
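For context, a minimal sketch of the usual sequence for retiring a dead
disk, assuming a Jewel-era cluster on systemd; osd.3 is a hypothetical id:
ceph osd out 3                 # no-op if the OSD is already out
systemctl stop ceph-osd@3      # in case the daemon is still running
ceph osd crush remove osd.3    # remove it from the CRUSH map
ceph auth del osd.3            # delete its authentication key
ceph osd rm 3                  # remove the OSD id itself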
Yes, I have removed it from CRUSH because it was broken. I waited 24
hours to see if ceph would heal itself. Then I removed the disk
completely (it was broken...) and waited another 24 hours. Then I started
getting worried.
Are you saying to me that I should not remove a broken disk
Just losing one disk doesn't automagically delete it from CRUSH, but in the
output you had 10 disks listed, so there must be something else going on - did
you delete the disk from the crush map as well?
Ceph waits by default 300 secs AFAIK to mark an OSD out; after that it will
start to recover.
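The interval mentioned here is the mon_osd_down_out_interval option; a
minimal sketch of checking and raising it at runtime (mon.a is a
hypothetical monitor name):
ceph daemon mon.a config show | grep mon_osd_down_out_interval
ceph tell mon.* injectargs '--mon-osd-down-out-interval 600'   # e.g. ten minutes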
Thank you for your reply, so I can add my experience:
1) the other time this thing happened to me I had a cluster with min_size=2
and size=3 and the problem was the same. That time I set min_size=1 to
recover the pool but it did not help. So I do not understand what the
advantage is of putting
Hello,
On Wed, 29 Jun 2016 06:02:59 +0000, Mario Giammarco wrote:
> pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash
That "size 2 min_size 1" is the root cause of all your woes.
The default replication size is 3 for a reason, and while I do run pools
with
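For reference, a minimal sketch of moving the pool from the quoted output
to the recommended redundancy; raising size triggers a backfill of the
third replica:
ceph osd pool set rbd size 3      # three copies of every object
ceph osd pool set rbd min_size 2  # still serve I/O with one copy missing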
Hi Mario,
in my opinion you should
1. fix the "too many PGs per OSD (307 > max 300)" warning
2. stop scrubbing / deep scrubbing
(a sketch for both follows below)
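A minimal sketch for both points, assuming a Jewel-era cluster; since
pg_num cannot be decreased, the PG warning is usually silenced by raising
the threshold (or properly fixed by adding OSDs):
ceph tell mon.* injectargs '--mon-pg-warn-max-per-osd 400'   # quiet the 307 > 300 warning
ceph osd set noscrub                                         # point 2: stop scrubs
ceph osd set nodeep-scrub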
--
What does your current
ceph osd tree
look like?
--
Mit freundlichen Gruessen / Best regards
Oliver Dzombic
IP-Interactive
mailto:i...@ip-interactive.de
Anschrift:
As far as I know there isn't, which is a shame. We have rehearsed a
situation like this in our dev environment to be ready for it in
production, and it worked; however, be aware that the data Ceph
believes is missing will be lost once you mark a PG complete.
In your situation I would find the OSD
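For the "find the OSD" step, a minimal sketch of how the incomplete PGs
and the OSDs holding them can be located; the PG id 0.2f is a hypothetical
example:
ceph health detail | grep incomplete   # lists each incomplete PG
ceph pg map 0.2f                       # shows the up/acting OSD set for that PG
ceph pg 0.2f query                     # peering details, e.g. which OSDs are probed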
I have searched Google and I see that there is no official procedure.
On Wed, Jun 29, 2016 at 09:43, Mario Giammarco <mgiamma...@gmail.com> wrote:
> I have read the post "incomplete pgs, oh my" many times.
> I think my case is different.
> The broken disk is completely broken.
>
Hi,
if you need fast access to your remaining data you can use
ceph-objectstore-tool to mark those PGs as complete; however, this will
irreversibly lose the missing data.
If you understand the risks, the procedure is explained pretty well here:
http://ceph.com/community/incomplete-pgs-oh-my/
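To make the linked approach concrete, a minimal sketch of the mark-complete
step on a FileStore OSD of that era; osd.5 and PG 0.2f are hypothetical,
and the operation is destructive, so follow the article rather than this
outline:
ceph osd set noout                        # keep the cluster from rebalancing meanwhile
systemctl stop ceph-osd@5                 # the tool needs the OSD offline
ceph-objectstore-tool \
  --data-path /var/lib/ceph/osd/ceph-5 \
  --journal-path /var/lib/ceph/osd/ceph-5/journal \
  --pgid 0.2f --op mark-complete
systemctl start ceph-osd@5
ceph osd unset noout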
Now I have also discovered that, by mistake, someone has put production
data on a virtual machine of the cluster. I need ceph to resume I/O so I
can boot that virtual machine.
Can I mark the incomplete pgs as valid?
If needed, where can I buy some paid support?
Thanks again,
Mario
On Wed,
pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash
rjenkins pg_num 512 pgp_num 512 last_change 9313 flags hashpspool
stripe_width 0
removed_snaps [1~3]
pool 1 'rbd2' replicated size 2 min_size 1 crush_ruleset 0 object_hash
rjenkins pg_num 512 pgp_num 512 last_change 9314
And also ceph health detail, please.
Stefan
Excuse my typos; sent from my mobile phone.
On 28.06.2016 at 19:28, Oliver Dzombic wrote:
Hi Mario,
please give some more details.
Please post the output of:
ceph osd pool ls detail
ceph osd df
ceph --version
ceph -w for 10 seconds ( use http://pastebin.com/ please )
ceph osd crush dump ( also pastebin pls )
--
Mit freundlichen Gruessen / Best regards
Oliver Dzombic
IP-Interactive
Hello,
this is the second time this has happened to me; I hope that someone can
explain what I can do.
Proxmox ceph cluster with 8 servers, 11 hdds. min_size=1, size=2.
One hdd goes down due to bad sectors.
Ceph recovers but it ends with:
cluster f2a8dd7d-949a-4a29-acab-11d4900249f4
health