Re: [ceph-users] Another cluster completely hang

2016-06-30 Thread Brian ::
Hi Mario, perhaps it's covered under Proxmox support. Do you have support on your Proxmox install from the guys in Proxmox? Otherwise you can always buy from Red Hat: https://www.redhat.com/en/technologies/storage/ceph On Thu, Jun 30, 2016 at 7:37 AM, Mario Giammarco

Re: [ceph-users] Another cluster completely hang

2016-06-30 Thread Mario Giammarco
Last two questions: 1) I have used other systems in the past. In case of split brain or serious problems they let me choose which copy is "good" and then keep working. Is there a way to tell Ceph that all is ok? This morning I again have 19 incomplete pgs after recovery. 2) Where can I find

Re: [ceph-users] Another cluster completely hang

2016-06-29 Thread Mario Giammarco
This time, at the end of the recovery procedure you described, it ended with most pgs active+clean and 20 pgs incomplete. After that, when trying to use the cluster, I got "request blocked more than" and no VM can start. I know that something happened after the broken disk, probably a server reboot. I am
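
Diagnosing "request blocked more than ..." messages usually starts with something like the following sketch; osd.3 is only a placeholder id.
  ceph health detail                      # lists incomplete PGs and the OSDs reporting blocked requests
  ceph daemon osd.3 dump_ops_in_flight    # run on that OSD's host to see the operations that are stuck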

Re: [ceph-users] Another cluster completely hang

2016-06-29 Thread Lionel Bouton
Hi, on 29/06/2016 12:00, Mario Giammarco wrote: > Now the problem is that ceph has put out two disks because scrub has > failed (I think it is not a disk fault but due to mark-complete) There is something odd going on. I've only seen deep-scrub failing (i.e. detect one inconsistency and

Re: [ceph-users] Another cluster completely hang

2016-06-29 Thread Oliver Dzombic
Hi, it does not. But in your case, you have 10 OSDs, and 7 of them have incomplete PGs. Since your Proxmox VPSs are not on single PGs but spread across many PGs, there is a good chance that at least some data of every VPS is on one of the defective PGs. -- Mit freundlichen Gruessen / Best

Re: [ceph-users] Another cluster completely hang

2016-06-29 Thread Mario Giammarco
Just one question: why, when Ceph has some incomplete pgs, does it refuse to do I/O on the good pgs? On Wed, 29 Jun 2016 at 12:55, Oliver Dzombic < i...@ip-interactive.de> wrote: > Hi, > > again: > > You >must< check all your logs (as tedious as that surely is). > > That means on the ceph nodes

Re: [ceph-users] Another cluster completely hang

2016-06-29 Thread Oliver Dzombic
Hi, again: You >must< check all your logs (as tedious as that surely is). That means on the ceph nodes, in /var/log/ceph/*. And go back to the time when things went downhill. There must be something else going on, beyond a normal OSD crash. And your manual pg repair / pg remove / pg set complete
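
A minimal sketch of the log check being asked for, assuming the default log locations on each node:
  # look for errors around the time things went downhill
  grep -iE 'error|fail|slow request' /var/log/ceph/ceph-osd.*.log
  grep -iE 'error|fail|elect' /var/log/ceph/ceph-mon.*.log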

Re: [ceph-users] Another cluster completely hang

2016-06-29 Thread Mario Giammarco
Thanks, I can put the OSDs in but they do not stay in, and I am pretty sure they are not broken. On Wed, 29 Jun 2016 at 12:07, Oliver Dzombic < i...@ip-interactive.de> wrote: > hi, > > ceph osd set noscrub > ceph osd set nodeep-scrub > > ceph osd in > > > -- > Mit freundlichen

Re: [ceph-users] Another cluster completely hang

2016-06-29 Thread Oliver Dzombic
hi, ceph osd set noscrub ceph osd set nodeep-scrub ceph osd in -- Mit freundlichen Gruessen / Best regards Oliver Dzombic IP-Interactive mailto:i...@ip-interactive.de Anschrift: IP Interactive UG ( haftungsbeschraenkt ) Zum Sonnenberg 1-3 63571 Gelnhausen HRB 93402 beim Amtsgericht
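
Spelled out with example OSD ids (the original mail leaves the id blank), the sequence is roughly:
  ceph osd set noscrub            # stop regular scrubbing cluster-wide
  ceph osd set nodeep-scrub       # stop deep scrubbing cluster-wide
  ceph osd in osd.5 osd.6         # mark the two out OSDs back in (ids are placeholders)
  # once the cluster is healthy again, re-enable scrubbing:
  ceph osd unset noscrub
  ceph osd unset nodeep-scrub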

Re: [ceph-users] Another cluster completely hang

2016-06-29 Thread Mario Giammarco
Now the problem is that ceph has put out two disks because scrub has failed (I think it is not a disk fault but due to mark-complete). How can I: - disable scrub - put the two disks back in? I will in any case wait for the end of recovery to be sure it really works again. On Wed, 29 Jun 2016 at

Re: [ceph-users] Another cluster completely hang

2016-06-29 Thread Mario Giammarco
In fact I am worried because: 1) ceph is under Proxmox, and Proxmox may decide to reboot a server if it is not responding 2) probably a server was rebooted while ceph was reconstructing 3) even using max=3 does not help Anyway, this is the "unofficial" procedure that I am using, much simpler than

Re: [ceph-users] Another cluster completely hang

2016-06-29 Thread Oliver Dzombic
Hi, removing ONE disk while your replication is 2 is no problem. You don't need to wait a single second to replace or remove it. It's not used anyway and is out/down, so from Ceph's point of view it doesn't exist. But as Christian told you already, what we see now fits a

Re: [ceph-users] Another cluster completely hang

2016-06-29 Thread Mario Giammarco
Yes, I removed it from crush because it was broken. I waited 24 hours to see if Ceph would heal itself. Then I removed the disk completely (it was broken...) and I waited 24 hours again. Then I started getting worried. Are you saying that I should not remove a broken disk
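
For comparison, the usual sequence for retiring a dead disk looks roughly like this; osd.N stands for the broken OSD's id.
  ceph osd out osd.N              # let data re-replicate away from it
  # wait for recovery to finish, then remove it for good:
  ceph osd crush remove osd.N     # drop it from the CRUSH map
  ceph auth del osd.N             # delete its authentication key
  ceph osd rm osd.N               # remove it from the OSD map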

Re: [ceph-users] Another cluster completely hang

2016-06-29 Thread Zoltan Arnold Nagy
Just losing one disk doesn’t automagically delete it from CRUSH, but in the output you had 10 disks listed, so there must be something else going on - did you delete the disk from the crush map as well? Ceph waits 300 secs by default AFAIK to mark an OSD out, after which it will start to recover. >
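
The 300-second delay referred to here is the mon_osd_down_out_interval option; as a sketch, it can be inspected or raised like this (600 is just an example value):
  ceph daemon mon.<id> config get mon_osd_down_out_interval     # run on a monitor host
  ceph tell mon.* injectargs '--mon-osd-down-out-interval 600'  # runtime change, reverts on restart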

Re: [ceph-users] Another cluster completely hang

2016-06-29 Thread Mario Giammarco
Thank you for your reply, so I can add my experience: 1) the other time this happened to me I had a cluster with min_size=2 and size=3 and the problem was the same. That time I set min_size=1 to recover the pool but it did not help. So I do not understand what the advantage is of putting

Re: [ceph-users] Another cluster completely hang

2016-06-29 Thread Christian Balzer
Hello, On Wed, 29 Jun 2016 06:02:59 + Mario Giammarco wrote: > pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash ^ And that's the root cause of all your woes. The default replication size is 3 for a reason and while I do run pools with
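
As an illustration of the fix Christian is pointing at, raising replication on the pool named in the dump later in the thread would look roughly like this (expect heavy data movement):
  ceph osd pool set rbd size 3
  ceph osd pool set rbd min_size 2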

Re: [ceph-users] Another cluster completely hang

2016-06-29 Thread Oliver Dzombic
Hi Mario, in my opinion you should 1. fix "too many PGs per OSD (307 > max 300)" 2. stop scrubbing / deep scrubbing -- What does your current ceph osd tree look like? -- Mit freundlichen Gruessen / Best regards Oliver Dzombic IP-Interactive mailto:i...@ip-interactive.de Anschrift:
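
The 307 figure follows from the usual back-of-the-envelope formula; assuming three pools of 512 PGs each at size 2 (two of them appear in the pool dump quoted later in the thread) spread over 10 OSDs:
  PGs per OSD ≈ sum over pools of (pg_num × size) / number of OSDs
              ≈ (3 × 512 × 2) / 10 ≈ 307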

Re: [ceph-users] Another cluster completely hang

2016-06-29 Thread Tomasz Kuzemko
As far as I know there isn't, which is a shame. We have covered a situation like this in our dev environment to be ready for it in production and it worked, however be aware that the data that Ceph believes is missing will be lost after you mark a PG complete. In your situation I would find OSD
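
A rough sketch of how one would look for the OSD holding the most recent copy of an incomplete PG; the PG id 2.4a is only an example.
  ceph pg dump_stuck inactive     # list the stuck/incomplete PGs and their acting OSDs
  ceph pg 2.4a query              # inspect peer_info / last_update to see which OSD copy is most advanced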

Re: [ceph-users] Another cluster completely hang

2016-06-29 Thread Mario Giammarco
I have searched Google and I see that there is no official procedure. On Wed, 29 Jun 2016 at 09:43, Mario Giammarco < mgiamma...@gmail.com> wrote: > I have read the post "incomplete pgs, oh my" many times > I think my case is different. > The broken disk is completely broken. >

Re: [ceph-users] Another cluster completely hang

2016-06-29 Thread Tomasz Kuzemko
Hi, if you need fast access to your remaining data you can use ceph-objectstore-tool to mark those PGs as complete; however, this will irreversibly lose the missing data. If you understand the risks, this procedure is explained pretty well here: http://ceph.com/community/incomplete-pgs-oh-my/
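
The mark-complete step Tomasz describes boils down to something like the sketch below, run with the chosen OSD stopped; the OSD number, paths and PG id are placeholders, and the linked article remains the reference for the full procedure.
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-N \
      --journal-path /var/lib/ceph/osd/ceph-N/journal \
      --pgid 2.4a --op mark-complete
  # restart the OSD afterwards and let peering and recovery proceed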

Re: [ceph-users] Another cluster completely hang

2016-06-29 Thread Mario Giammarco
Now I have also discovered that, by mistake, someone has put production data on a virtual machine in the cluster. I need ceph to resume I/O so I can boot that virtual machine. Can I mark the incomplete pgs as valid? If needed, where can I buy some paid support? Thanks again, Mario On Wed,

Re: [ceph-users] Another cluster completely hang

2016-06-29 Thread Mario Giammarco
pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 9313 flags hashpspool stripe_width 0 removed_snaps [1~3] pool 1 'rbd2' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 9314

Re: [ceph-users] Another cluster completely hang

2016-06-28 Thread Stefan Priebe - Profihost AG
And ceph health detail Stefan Excuse my typos, sent from my mobile phone. > On 28.06.2016 at 19:28, Oliver Dzombic wrote: > > Hi Mario, > > please give some more details: > > Please post the output of: > > ceph osd pool ls detail > ceph osd df > ceph --version > > ceph

Re: [ceph-users] Another cluster completely hang

2016-06-28 Thread Oliver Dzombic
Hi Mario, please give some more details. Please post the output of: ceph osd pool ls detail ceph osd df ceph --version ceph -w for 10 seconds (use http://pastebin.com/ please) ceph osd crush dump (also pastebin please) -- Mit freundlichen Gruessen / Best regards Oliver Dzombic IP-Interactive

[ceph-users] Another cluster completely hang

2016-06-28 Thread Mario Giammarco
Hello, this is the second time this has happened to me; I hope that someone can explain what I can do. Proxmox ceph cluster with 8 servers, 11 HDDs. min_size=1, size=2. One HDD goes down due to bad sectors. Ceph recovers but it ends with: cluster f2a8dd7d-949a-4a29-acab-11d4900249f4 health
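
The state described here is typically inspected with commands along these lines:
  ceph -s              # overall cluster status
  ceph health detail   # which PGs are incomplete and which requests are blocked
  ceph osd tree        # which OSDs are up/down and in/out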