[ceph-users] Re: PG down, due to 3 OSD failing

2022-04-04 Thread Dan van der Ster
BTW -- I've created https://tracker.ceph.com/issues/55169 to ask that we add some input validation. Injecting such a crush map would ideally not be possible. -- dan On Mon, Apr 4, 2022 at 11:02 AM Dan van der Ster wrote: > > Excellent news! > After everything is back to active+clean, don't forge

[ceph-users] Re: PG down, due to 3 OSD failing

2022-04-04 Thread Dan van der Ster
Excellent news! After everything is back to active+clean, don't forget to set min_size to 4 :) have a nice day On Mon, Apr 4, 2022 at 10:59 AM Fulvio Galeazzi wrote: > > Yesss! Fixing the choose/chooseleaf thing did make the magic. :-) > >Thanks a lot for your support Dan. Lots of lessons l
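For reference, a minimal sketch of restoring min_size once recovery is complete (the pool name below is a placeholder, not taken from the thread):

   # set min_size back to 4 on the affected EC pool, as suggested above
   ceph osd pool set <ec-pool-name> min_size 4
   # confirm the new value
   ceph osd pool get <ec-pool-name> min_size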

[ceph-users] Re: PG down, due to 3 OSD failing

2022-04-04 Thread Fulvio Galeazzi
Yesss! Fixing the choose/chooseleaf thing did make the magic. :-) Thanks a lot for your support Dan. Lots of lessons learned from my side, I'm really grateful. All PGs are now active, will let Ceph rebalance. Ciao ciao Fulvio On 4/4/22 10:50, Dan van der Ster

[ceph-users] Re: PG down, due to 3 OSD failing

2022-04-04 Thread Dan van der Ster
Could you share the output of `ceph pg 85.25 query`. Then increase the crush weights of those three osds to 0.1, then check if the PG goes active. (It is possible that the OSDs are not registering as active while they have weight zero). -- dan On Mon, Apr 4, 2022 at 10:01 AM Fulvio Galeazzi wr
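A sketch of the checks suggested above (the OSD ids are placeholders for the three OSDs that currently have weight zero):

   ceph pg 85.25 query
   # give each of the three OSDs a small non-zero crush weight
   ceph osd crush reweight osd.<id1> 0.1
   ceph osd crush reweight osd.<id2> 0.1
   ceph osd crush reweight osd.<id3> 0.1
   # then re-check whether the PG has gone active
   ceph pg 85.25 query | grep '"state"'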

[ceph-users] Re: PG down, due to 3 OSD failing

2022-04-04 Thread Dan van der Ster
Hi Fulvio, Yes -- that choose/chooseleaf thing is definitely a problem. Good catch! I suggest fixing it, injecting the new crush map, and seeing how it goes. Next, in your crush map for the storage type, you have an error: # types type 0 osd type 1 host type 2 chassis type 3 rack type 4 row type 5
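The usual decompile/edit/recompile cycle for fixing and injecting a crush map looks roughly like this (file names are arbitrary):

   ceph osd getcrushmap -o crushmap.bin
   crushtool -d crushmap.bin -o crushmap.txt
   # edit crushmap.txt: fix the choose/chooseleaf step and the type numbering
   crushtool -c crushmap.txt -o crushmap-new.bin
   # optional sanity check before injecting
   crushtool -i crushmap-new.bin --test --show-bad-mappings
   ceph osd setcrushmap -i crushmap-new.bin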

[ceph-users] Re: PG down, due to 3 OSD failing

2022-04-04 Thread Fulvio Galeazzi
Hi again Dan! Things are improving, all OSDs are up, but still that one PG is down. More info below. On 4/1/22 19:26, Dan van der Ster wrote: Here is the output of "pg 85.12 query": https://pastebin.ubuntu.com/p/ww3JdwDXVd/ and its status (also showing the other 85.XX, for refere
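A quick way to see the state of all PGs in pool 85 at once (assuming pool id 85, as in the thread) is something like:

   ceph pg dump pgs_brief | grep '^85\.'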

[ceph-users] Re: PG down, due to 3 OSD failing

2022-04-01 Thread Dan van der Ster
We're on the right track! On Fri, Apr 1, 2022 at 6:57 PM Fulvio Galeazzi wrote: > > Ciao Dan, thanks for your messages! > > On 4/1/22 11:25, Dan van der Ster wrote: > > The PGs are stale, down, inactive *because* the OSDs don't start. > > Your main efforts should be to bring OSDs up, without purg

[ceph-users] Re: PG down, due to 3 OSD failing

2022-04-01 Thread Fulvio Galeazzi
Ciao Dan, thanks for your messages! On 4/1/22 11:25, Dan van der Ster wrote: The PGs are stale, down, inactive *because* the OSDs don't start. Your main efforts should be to bring OSDs up, without purging or zapping or anything like that. (Currently your cluster is down, but there are hopes to re

[ceph-users] Re: PG down, due to 3 OSD failing

2022-04-01 Thread Dan van der Ster
The PGs are stale, down, inactive *because* the OSDs don't start. Your main efforts should be to bring OSDs up, without purging or zapping or anything like that. (Currently your cluster is down, but there are hopes to recover. If you start purging things, that can result in permanent data loss.) Mo
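A rough sketch of the "bring OSDs up" loop, assuming a systemd-based, non-containerized deployment (unit names depend on how the cluster was deployed):

   systemctl start ceph-osd@<id>
   # watch the OSD log to see why it crashes, if it does
   journalctl -u ceph-osd@<id> -f
   # check the overall up/in count
   ceph osd stat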

[ceph-users] Re: PG down, due to 3 OSD failing

2022-04-01 Thread Dan van der Ster
Don't purge anything! On Fri, Apr 1, 2022 at 9:38 AM Fulvio Galeazzi wrote: > > Ciao Dan, > thanks for your time! > > So you are suggesting that my problems with PG 85.25 may somehow resolve > if I manage to bring up the three OSDs currently "down" (possibly due to > PG 85.12, and other PGs)

[ceph-users] Re: PG down, due to 3 OSD failing

2022-04-01 Thread Fulvio Galeazzi
Ciao Dan, thanks for your time! So you are suggesting that my problems with PG 85.25 may somehow resolve if I manage to bring up the three OSDs currently "down" (possibly due to PG 85.12, and other PGs)? Looking for the string 'start interval does not contain the required bound' I found

[ceph-users] Re: PG down, due to 3 OSD failing

2022-03-30 Thread Dan van der Ster
Hi Fulvio, I'm not sure why that PG doesn't register. But let's look into your log. The relevant lines are: -635> 2022-03-30 14:49:57.810 7ff904970700 -1 log_channel(cluster) log [ERR] : 85.12s0 past_intervals [616435,616454) start interval does not contain the required bound [605868,616454) st
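To find which other OSDs hit the same assert, one could grep the OSD logs for that message (the log path below is the default and is an assumption):

   grep 'start interval does not contain the required bound' /var/log/ceph/ceph-osd.*.log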

[ceph-users] Re: PG down, due to 3 OSD failing

2022-03-30 Thread Fulvio Galeazzi
Ciao Dan, this is what I did with chunk s3, copying it from osd.121 to osd.176 (which is managed by the same host). But still pg 85.25 is stuck stale for 85029.707069, current state stale+down+remapped, last acting [2147483647,2147483647,96,2147483647,2147483647] So "health detail" appa
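(Aside: 2147483647 is how Ceph prints CRUSH_ITEM_NONE, i.e. no OSD is currently mapped for that shard.) The PG's current mapping and health can be re-checked with:

   ceph health detail | grep 85.25
   ceph pg map 85.25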

[ceph-users] Re: PG down, due to 3 OSD failing

2022-03-29 Thread Dan van der Ster
Hi Fulvio, I don't think upmap will help -- that is used to remap where data should be "up", but your problem is more that the PG chunks are not going active due to the bug. What happens if you export one of the PG chunks then import it to another OSD -- does that chunk become active? -- dan
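A sketch of the export/import suggested above, using ceph-objectstore-tool with both OSDs stopped (the data paths and pgid follow the naming used in the thread, but should be adjusted to the actual source and destination OSDs):

   # on the source OSD (stopped)
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/cephpa1-<src> --no-mon-config \
     --pgid 85.25s3 --op export --file /root/85.25s3.export
   # on the destination OSD (stopped)
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/cephpa1-<dst> --no-mon-config \
     --op import --file /root/85.25s3.export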

[ceph-users] Re: PG down, due to 3 OSD failing

2022-03-29 Thread Fulvio Galeazzi
Hello again Dan, I am afraid I'd need a little more help, please... Current status is as follows. This is where I moved the chunk which was on osd.121: ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/cephpa1-176 --no-mon-config --op list-pgs | grep ^85\.25 85.25s3 while other chunks

[ceph-users] Re: PG down, due to 3 OSD failing

2022-03-29 Thread Fulvio Galeazzi
Thanks a lot, Dan! > The EC pgs have a naming convention like 85.25s1 etc.. for the various > k/m EC shards. That was the bit of information I was missing... I was looking for the wrong object. I can now go on and export/import that one PGid chunk. Thanks again! Ful

[ceph-users] Re: PG down, due to 3 OSD failing

2022-03-28 Thread Dan van der Ster
Hi Fulvio, You can check (offline) which PGs are on an OSD with the list-pgs op, e.g. ceph-objectstore-tool --data-path /var/lib/ceph/osd/cephpa1-158/ --op list-pgs The EC pgs have a naming convention like 85.25s1 etc.. for the various k/m EC shards. -- dan On Mon, Mar 28, 2022 at 2:29 PM F