[ceph-users] Re: PGs stuck down

2022-11-30 Thread Eugen Block

Hi,

thanks for the clarification, I missed the cable-cut part; I should
read more carefully before responding. ;-) I don't really know which
tests were performed because we joined that project at a later phase,
but your point about the two sub-clusters makes sense.


Thanks,
Eugen

Zitat von Frank Schilder :


Hi Eugen,

power outage is one thing, a cable cut is another. With power
outages you will have OSDs down and only one sub-cluster up at a
time. OSDs will peer locally within a single DC and things move on.


With a cable cut you have a split brain. Have you actually tested
your setup with everything up except the network connection between
OSDs in your 2 DCs? My bet is that it comes to a standstill just
like in Dale's case, because OSDs are up on both sides and,
therefore, PGs will want to peer cross-site but can't.


I don't think it is possible to do a 2-DC stretched setup unless you
have 2 or more physically separated network routes that will never
go down simultaneously. If just the network link between the OSDs
on both sides goes down, access will be down.


Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: 30 November 2022 10:07:13
To: ceph-users@ceph.io
Subject: [ceph-users] Re: PGs stuck down

Hi,

while I basically agree with Frank's response (e.g. min_size = 2), I
disagree that it won't work without stretch mode. We have a
customer with a similar setup: two data centers and a third mon in a
different location, and that setup has proven the resiliency of
Ceph multiple times. Due to hardware issues in the power supplies
they experienced two or three power outages in one DC without data
loss. They use an erasure-coded pool stretched across these two DCs;
the third mon is reachable from both DCs, of course. This works
quite well, and they were very happy with Ceph's resiliency. The
cluster is still running on Nautilus.

Regards,
Eugen

Zitat von Frank Schilder :


Hi Dale,


we thought we had set it up to prevent.. and with size = 4 and
min_size set = 1


I'm afraid this is exactly what you didn't. Firstly, min_size=1 is
always a bad idea. Secondly, if you have 2 data centres, the only
way to get this to work is to use stretch mode. Even if you had
min_size=2 (which, by the way, you should have in any case), without
stretch mode you would not be guaranteed to have all PGs
active+clean after one DC goes down (or a cable gets cut). There is
a quite long and very detailed explanation of why this is the case,
and with min_size=1 you are very likely to hit one of these cases or
even lose data.

What you could check in your situation are these two:

mon_osd_min_up_ratio
mon_osd_min_in_ratio

My guess is that these prevented the mons from marking sufficiently
many OSDs as out, and therefore they got stuck peering (maybe even
nothing was marked down?). The other thing is that you almost
certainly had exactly the split-brain situation that stretch mode is
there to prevent. You probably ended up with 2 sub-clusters with 2
mons each, and now what? If the third mon could still see the other
2, I don't think you get a meaningful quorum. Stretch mode will
actually change the crush rule, based on a decision by the
tie-breaking monitor, to re-configure the pool to use only OSDs in
one of the 2 DCs so that no cross-site peering happens.

Maybe if you explicitly shut down one of the DC mons you can get
things working in one of the DCs?

Without stretch mode you need 3 DCs and a geo-replicated 3(2) pool.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Wolfpaw - Dale Corse 
Sent: 29 November 2022 07:20:20
To: 'ceph-users'
Subject: [ceph-users] PGs stuck down

Hi All,



We had a fiber cut tonight between 2 data centers, and a ceph cluster didn't
do very well :( We ended up with 98% of PGs as down.



This setup has 2 data centers defined, with 4 copies across both and a
minimum size of 1. We have 1 mon/mgr in each DC, with one in a 3rd data
center connected to each of the other 2 by VPN.



When I did a pg query on the PGs that were stuck, it said they were blocked
from coming up because they couldn't contact 2 of the OSDs (located in the
other data center, which it was unable to reach)... but the other 2 were fine.



I'm at a loss because this was exactly the thing we thought we had set it up
to prevent... and with size = 4 and min_size set to 1, I understood that it
would continue without a problem? :(



Crush map is below .. if anyone has any ideas? I would sincerely appreciate
it :)



Thanks!

Dale



# begin crush map

tunable choose_local_tries 0

tunable choose_local_fallback_tries 0

tunable choose_total_tries 50

tunable chooseleaf_descend_once 1

tunable chooseleaf_vary_r 1

tunable straw_calc_version 1



# devices

device 0 osd.0 class ssd

device 1 osd.1 class ssd

device 2 osd.2 class ssd

[ceph-users] Re: PGs stuck down

2022-11-30 Thread Dan van der Ster
Hi all,

It's difficult to say exactly what happened here without cluster logs.
Dale, would you be able to share the ceph.log showing the start of the
incident?

Cheers, dan
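
For anyone digging this out later: on a package-based install the cluster log is
typically /var/log/ceph/ceph.log on the mon hosts (cephadm keeps it under
/var/log/ceph/<fsid>/), and recent releases can also dump the tail of it from
the mons directly. A rough sketch, file names and the line count being
placeholders:

  # on a mon host (adjust the path for containerized deployments)
  less /var/log/ceph/ceph.log
  # or, on recent releases, ask the mons for the latest cluster log entries
  ceph log last 1000 > ceph-incident.log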

On Wed, Nov 30, 2022 at 10:30 AM Frank Schilder  wrote:
>
> Hi Eugen,
>
> power outage is one thing, a cable cut is another. With power outages you
> will have OSDs down and only one sub-cluster up at a time. OSDs will peer
> locally within a single DC and things move on.
>
> With a cable cut you have a split brain. Have you actually tested your setup
> with everything up except the network connection between OSDs in your 2 DCs?
> My bet is that it comes to a standstill just like in Dale's case, because
> OSDs are up on both sides and, therefore, PGs will want to peer cross-site
> but can't.
>
> I don't think it is possible to do a 2-DC stretched setup unless you have 2
> or more physically separated network routes that will never go down
> simultaneously. If just the network link between the OSDs on both sides goes
> down, access will be down.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Eugen Block 
> Sent: 30 November 2022 10:07:13
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: PGs stuck down
>
> Hi,
>
> while I basically agree with Frank's response (e.g. min_size = 2), I
> disagree that it won't work without stretch mode. We have a
> customer with a similar setup: two data centers and a third mon in a
> different location, and that setup has proven the resiliency of
> Ceph multiple times. Due to hardware issues in the power supplies
> they experienced two or three power outages in one DC without data
> loss. They use an erasure-coded pool stretched across these two DCs;
> the third mon is reachable from both DCs, of course. This works
> quite well, and they were very happy with Ceph's resiliency. The
> cluster is still running on Nautilus.
>
> Regards,
> Eugen
>
> Zitat von Frank Schilder :
>
> > Hi Dale,
> >
> >> we thought we had set it up to prevent.. and with size = 4 and
> >> min_size set = 1
> >
> > I'm afraid this is exactly what you didn't. Firstly, min_size=1 is
> > always a bad idea. Secondly, if you have 2 data centres, the only
> > way to get this to work is to use stretch mode. Even if you had
> > min_size=2 (which, by the way, you should have in any case), without
> > stretch mode you would not be guaranteed to have all PGs
> > active+clean after one DC goes down (or a cable gets cut). There is
> > a quite long and very detailed explanation of why this is the case,
> > and with min_size=1 you are very likely to hit one of these cases
> > or even lose data.
> >
> > What you could check in your situation are these two:
> >
> > mon_osd_min_up_ratio
> > mon_osd_min_in_ratio
> >
> > My guess is that these prevented the mons from marking sufficiently
> > many OSDs as out, and therefore they got stuck peering (maybe even
> > nothing was marked down?). The other thing is that you almost
> > certainly had exactly the split-brain situation that stretch mode is
> > there to prevent. You probably ended up with 2 sub-clusters with 2
> > mons each, and now what? If the third mon could still see the other
> > 2, I don't think you get a meaningful quorum. Stretch mode will
> > actually change the crush rule, based on a decision by the
> > tie-breaking monitor, to re-configure the pool to use only OSDs in
> > one of the 2 DCs so that no cross-site peering happens.
> >
> > Maybe if you explicitly shut down one of the DC mons you can get
> > things working in one of the DCs?
> >
> > Without stretch mode you need 3 DCs and a geo-replicated 3(2) pool.
> >
> > Best regards,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > 
> > From: Wolfpaw - Dale Corse 
> > Sent: 29 November 2022 07:20:20
> > To: 'ceph-users'
> > Subject: [ceph-users] PGs stuck down
> >
> > Hi All,
> >
> >
> >
> > We had a fiber cut tonight between 2 data centers, and a ceph cluster didn't
> > do very well :( We ended up with 98% of PGs as down.
> >
> >
> >
> > This setup has 2 data centers defined, with 4 copies across both and a
> > minimum size of 1. We have 1 mon/mgr in each DC, with one in a 3rd data
> > center connected to each of the other 2 by VPN.
> >
> >
> >
> > When I did a pg query on the PG's that were stuck 

[ceph-users] Re: PGs stuck down

2022-11-30 Thread Frank Schilder
Hi Eugen,

power outage is one thing, a cable cut is another. With power outages you will
have OSDs down and only one sub-cluster up at a time. OSDs will peer locally
within a single DC and things move on.

With a cable cut you have a split brain. Have you actually tested your setup
with everything up except the network connection between OSDs in your 2 DCs? My
bet is that it comes to a standstill just like in Dale's case, because OSDs are
up on both sides and, therefore, PGs will want to peer cross-site but can't.

I don't think it is possible to do a 2-DC stretched setup unless you have 2 or
more physically separated network routes that will never go down
simultaneously. If just the network link between the OSDs on both sides goes
down, access will be down.
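
One way to rehearse exactly that failure without touching power is to block only
the inter-DC OSD traffic with packet filters on the OSD hosts of one site and
watch what the PGs do. A rough sketch, assuming the other site's OSD network is
10.2.0.0/16 (a placeholder; adjust to your public/cluster networks and leave the
mon and VPN paths alone):

  # on each OSD host in DC1: drop traffic to/from DC2's OSD network
  iptables -A INPUT  -s 10.2.0.0/16 -j DROP
  iptables -A OUTPUT -d 10.2.0.0/16 -j DROP
  # watch the effect with: ceph status / ceph health detail / ceph pg dump_stuck
  # then remove the rules again
  iptables -D INPUT  -s 10.2.0.0/16 -j DROP
  iptables -D OUTPUT -d 10.2.0.0/16 -j DROP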

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: 30 November 2022 10:07:13
To: ceph-users@ceph.io
Subject: [ceph-users] Re: PGs stuck down

Hi,

while I basically agree with Frank's response (e.g. min_size = 2), I
disagree that it won't work without stretch mode. We have a
customer with a similar setup: two data centers and a third mon in a
different location, and that setup has proven the resiliency of
Ceph multiple times. Due to hardware issues in the power supplies
they experienced two or three power outages in one DC without data
loss. They use an erasure-coded pool stretched across these two DCs;
the third mon is reachable from both DCs, of course. This works
quite well, and they were very happy with Ceph's resiliency. The
cluster is still running on Nautilus.

Regards,
Eugen

Zitat von Frank Schilder :

> Hi Dale,
>
>> we thought we had set it up to prevent.. and with size = 4 and
>> min_size set = 1
>
> I'm afraid this is exactly what you didn't. Firstly, min_size=1 is
> always a bad idea. Secondly, if you have 2 data centres, the only
> way to get this to work is to use stretch mode. Even if you had
> min_size=2 (which, by the way, you should have in any case), without
> stretch mode you would not be guaranteed to have all PGs
> active+clean after one DC goes down (or a cable gets cut). There is
> a quite long and very detailed explanation of why this is the case,
> and with min_size=1 you are very likely to hit one of these cases
> or even lose data.
>
> What you could check in your situation are these two:
>
> mon_osd_min_up_ratio
> mon_osd_min_in_ratio
>
> My guess is that these prevented the mons from marking sufficiently
> many OSDs as out, and therefore they got stuck peering (maybe even
> nothing was marked down?). The other thing is that you almost
> certainly had exactly the split-brain situation that stretch mode is
> there to prevent. You probably ended up with 2 sub-clusters with 2
> mons each, and now what? If the third mon could still see the other
> 2, I don't think you get a meaningful quorum. Stretch mode will
> actually change the crush rule, based on a decision by the
> tie-breaking monitor, to re-configure the pool to use only OSDs in
> one of the 2 DCs so that no cross-site peering happens.
>
> Maybe if you explicitly shut down one of the DC mons you can get
> things working in one of the DCs?
>
> Without stretch mode you need 3 DCs and a geo-replicated 3(2) pool.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Wolfpaw - Dale Corse 
> Sent: 29 November 2022 07:20:20
> To: 'ceph-users'
> Subject: [ceph-users] PGs stuck down
>
> Hi All,
>
>
>
> We had a fiber cut tonight between 2 data centers, and a ceph cluster didn't
> do very well :( We ended up with 98% of PGs as down.
>
>
>
> This setup has 2 data centers defined, with 4 copies across both and a
> minimum size of 1. We have 1 mon/mgr in each DC, with one in a 3rd data
> center connected to each of the other 2 by VPN.
>
>
>
> When I did a pg query on the PGs that were stuck, it said they were blocked
> from coming up because they couldn't contact 2 of the OSDs (located in the
> other data center, which it was unable to reach)... but the other 2 were fine.
>
>
>
> I'm at a loss because this was exactly the thing we thought we had set it up
> to prevent... and with size = 4 and min_size set to 1, I understood that it
> would continue without a problem? :(
>
>
>
> Crush map is below .. if anyone has any ideas? I would sincerely appreciate
> it :)
>
>
>
> Thanks!
>
> Dale
>
>
>
> # begin crush map
>
> tunable choose_local_tries 0
>
> tunable choose_local_fallback_tries 0
>
> tunable choose_total_tries 50
>
> tunable chooseleaf_descend_once 1
>
> tunable choos

[ceph-users] Re: PGs stuck down

2022-11-30 Thread Eugen Block

Hi,

while I basically agree with Frank's response (e.g. min_size = 2), I
disagree that it won't work without stretch mode. We have a
customer with a similar setup: two data centers and a third mon in a
different location, and that setup has proven the resiliency of
Ceph multiple times. Due to hardware issues in the power supplies
they experienced two or three power outages in one DC without data
loss. They use an erasure-coded pool stretched across these two DCs;
the third mon is reachable from both DCs, of course. This works
quite well, and they were very happy with Ceph's resiliency. The
cluster is still running on Nautilus.


Regards,
Eugen

Zitat von Frank Schilder :


Hi Dale,

we thought we had set it up to prevent.. and with size = 4 and  
min_size set = 1


I'm afraid this is exactly what you didn't. Firstly, min_size=1 is
always a bad idea. Secondly, if you have 2 data centres, the only
way to get this to work is to use stretch mode. Even if you had
min_size=2 (which, by the way, you should have in any case), without
stretch mode you would not be guaranteed to have all PGs
active+clean after one DC goes down (or a cable gets cut). There is
a quite long and very detailed explanation of why this is the case,
and with min_size=1 you are very likely to hit one of these cases or
even lose data.


What you could check in your situation are these two:

mon_osd_min_up_ratio
mon_osd_min_in_ratio

My guess is that these prevented the mons from marking sufficiently
many OSDs as out, and therefore they got stuck peering (maybe even
nothing was marked down?). The other thing is that you almost
certainly had exactly the split-brain situation that stretch mode is
there to prevent. You probably ended up with 2 sub-clusters with 2
mons each, and now what? If the third mon could still see the other
2, I don't think you get a meaningful quorum. Stretch mode will
actually change the crush rule, based on a decision by the
tie-breaking monitor, to re-configure the pool to use only OSDs in
one of the 2 DCs so that no cross-site peering happens.


Maybe if you explicitly shut down one of the DC mons you can get
things working in one of the DCs?


Without stretch mode you need 3 DCs and a geo-replicated 3(2) pool.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Wolfpaw - Dale Corse 
Sent: 29 November 2022 07:20:20
To: 'ceph-users'
Subject: [ceph-users] PGs stuck down

Hi All,



We had a fiber cut tonight between 2 data centers, and a ceph cluster didn't
do very well :( We ended up with 98% of PGs as down.



This setup has 2 data centers defined, with 4 copies across both and a
minimum size of 1. We have 1 mon/mgr in each DC, with one in a 3rd data
center connected to each of the other 2 by VPN.



When I did a pg query on the PGs that were stuck, it said they were blocked
from coming up because they couldn't contact 2 of the OSDs (located in the
other data center, which it was unable to reach)... but the other 2 were fine.



I'm at a loss because this was exactly the thing we thought we had set it up
to prevent... and with size = 4 and min_size set to 1, I understood that it
would continue without a problem? :(



Crush map is below .. if anyone has any ideas? I would sincerely appreciate
it :)



Thanks!

Dale



# begin crush map

tunable choose_local_tries 0

tunable choose_local_fallback_tries 0

tunable choose_total_tries 50

tunable chooseleaf_descend_once 1

tunable chooseleaf_vary_r 1

tunable straw_calc_version 1



# devices

device 0 osd.0 class ssd

device 1 osd.1 class ssd

device 2 osd.2 class ssd

device 3 osd.3 class ssd

device 4 osd.4 class ssd

device 5 osd.5 class ssd

device 6 osd.6 class ssd

device 7 osd.7 class ssd

device 8 osd.8 class ssd

device 9 osd.9 class ssd

device 10 osd.10 class ssd

device 11 osd.11 class ssd

device 12 osd.12 class ssd

device 13 osd.13 class ssd

device 14 osd.14 class ssd

device 15 osd.15 class ssd

device 16 osd.16 class ssd

device 17 osd.17 class ssd

device 18 osd.18 class ssd

device 19 osd.19 class ssd

device 20 osd.20 class ssd

device 21 osd.21 class ssd

device 22 osd.22 class ssd

device 23 osd.23 class ssd

device 24 osd.24 class ssd

device 25 osd.25 class ssd

device 26 osd.26 class ssd

device 27 osd.27 class ssd

device 28 osd.28 class ssd

device 29 osd.29 class ssd

device 30 osd.30 class ssd

device 31 osd.31 class ssd

device 32 osd.32 class ssd

device 33 osd.33 class ssd

device 34 osd.34 class ssd

device 35 osd.35 class ssd

device 36 osd.36 class ssd

device 37 osd.37 class ssd

device 38 osd.38 class ssd

device 39 osd.39 class ssd

device 40 osd.40 class ssd

device 41 osd.41 class ssd

device 42 osd.42 class ssd

device 43 osd.43 class ssd

device 44 osd.44 class ssd

device 45 osd.45 class ssd

device 46 osd.46 class ssd

device 47 osd.47 class ssd

device 49 osd.49 class ssd



# 

[ceph-users] Re: PGs stuck down

2022-11-29 Thread Wolfpaw - Dale Corse
Thanks! Appreciate everyone who responded :) 

After reading up on stretch mode, it appears some of the exact things it
was created to prevent happened here, so this looks like the solution!

Cheers,
D.
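
For the archive, a rough sketch of what enabling stretch mode involves, going by
the Ceph documentation (it needs Pacific or newer; the names dc1/dc2/dc3, the
mon IDs and the rule below are placeholders, and the docs' example uses two mons
per site plus the tiebreaker rather than the three mons described here):

  # tag each mon with its location; mon.e is the tiebreaker in the third site
  ceph mon set_location a datacenter=dc1
  ceph mon set_location b datacenter=dc1
  ceph mon set_location c datacenter=dc2
  ceph mon set_location d datacenter=dc2
  ceph mon set_location e datacenter=dc3
  ceph mon set election_strategy connectivity

  # CRUSH rule that puts 2 copies in each data center (4 in total)
  rule stretch_rule {
          id 1
          type replicated
          step take default
          step choose firstn 0 type datacenter
          step chooseleaf firstn 2 type host
          step emit
  }

  # after injecting the rule into the CRUSH map:
  ceph mon enable_stretch_mode e stretch_rule datacenter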

-Original Message-
From: Frank Schilder [mailto:fr...@dtu.dk] 
Sent: Tuesday, November 29, 2022 1:49 AM
To: Wolfpaw - Dale Corse ; 'ceph-users'

Subject: [ceph-users] Re: PGs stuck down

Hi Dale,

> we thought we had set it up to prevent.. and with size = 4 and 
> min_size set = 1

I'm afraid this is exactly what you didn't. Firstly, min_size=1 is always a
bad idea. Secondly, if you have 2 data centres, the only way to get this to
work is to use stretch mode. Even if you had min_size=2 (which, by the way,
you should have in any case), without stretch mode you would not be guaranteed
to have all PGs active+clean after one DC goes down (or a cable gets cut).
There is a quite long and very detailed explanation of why this is the case,
and with min_size=1 you are very likely to hit one of these cases or even
lose data.

What you could check in your situation are these two:

mon_osd_min_up_ratio
mon_osd_min_in_ratio

My guess is that these prevented the mons from marking sufficiently many
OSDs as out, and therefore they got stuck peering (maybe even nothing was
marked down?). The other thing is that you almost certainly had exactly the
split-brain situation that stretch mode is there to prevent. You probably
ended up with 2 sub-clusters with 2 mons each, and now what? If the third mon
could still see the other 2, I don't think you get a meaningful quorum.
Stretch mode will actually change the crush rule, based on a decision by the
tie-breaking monitor, to re-configure the pool to use only OSDs in one of the
2 DCs so that no cross-site peering happens.

Maybe if you explicitly shut down one of the DC mons you can get things
working in one of the DCs?

Without stretch mode you need 3 DCs and a geo-replicated 3(2) pool.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Wolfpaw - Dale Corse 
Sent: 29 November 2022 07:20:20
To: 'ceph-users'
Subject: [ceph-users] PGs stuck down

Hi All,



We had a fiber cut tonight between 2 data centers, and a ceph cluster didn't
do very well :( We ended up with 98% of PGs as down.



This setup has 2 data centers defined, with 4 copies across both and a
minimum size of 1. We have 1 mon/mgr in each DC, with one in a 3rd data
center connected to each of the other 2 by VPN.



When I did a pg query on the PGs that were stuck, it said they were blocked
from coming up because they couldn't contact 2 of the OSDs (located in the
other data center, which it was unable to reach)... but the other 2 were fine.



I'm at a loss because this was exactly the thing we thought we had set it up
to prevent... and with size = 4 and min_size set to 1, I understood that it
would continue without a problem? :(



Crush map is below .. if anyone has any ideas? I would sincerely appreciate
it :)



Thanks!

Dale



# begin crush map

tunable choose_local_tries 0

tunable choose_local_fallback_tries 0

tunable choose_total_tries 50

tunable chooseleaf_descend_once 1

tunable chooseleaf_vary_r 1

tunable straw_calc_version 1



# devices

device 0 osd.0 class ssd

device 1 osd.1 class ssd

device 2 osd.2 class ssd

device 3 osd.3 class ssd

device 4 osd.4 class ssd

device 5 osd.5 class ssd

device 6 osd.6 class ssd

device 7 osd.7 class ssd

device 8 osd.8 class ssd

device 9 osd.9 class ssd

device 10 osd.10 class ssd

device 11 osd.11 class ssd

device 12 osd.12 class ssd

device 13 osd.13 class ssd

device 14 osd.14 class ssd

device 15 osd.15 class ssd

device 16 osd.16 class ssd

device 17 osd.17 class ssd

device 18 osd.18 class ssd

device 19 osd.19 class ssd

device 20 osd.20 class ssd

device 21 osd.21 class ssd

device 22 osd.22 class ssd

device 23 osd.23 class ssd

device 24 osd.24 class ssd

device 25 osd.25 class ssd

device 26 osd.26 class ssd

device 27 osd.27 class ssd

device 28 osd.28 class ssd

device 29 osd.29 class ssd

device 30 osd.30 class ssd

device 31 osd.31 class ssd

device 32 osd.32 class ssd

device 33 osd.33 class ssd

device 34 osd.34 class ssd

device 35 osd.35 class ssd

device 36 osd.36 class ssd

device 37 osd.37 class ssd

device 38 osd.38 class ssd

device 39 osd.39 class ssd

device 40 osd.40 class ssd

device 41 osd.41 class ssd

device 42 osd.42 class ssd

device 43 osd.43 class ssd

device 44 osd.44 class ssd

device 45 osd.45 class ssd

device 46 osd.46 class ssd

device 47 osd.47 class ssd

device 49 osd.49 class ssd



# types

type 0 osd

type 1 host

type 2 chassis

type 3 rack

type 4 row

type 5 pdu

type 6 pod

type 7 room

type 8 datacenter

type 9 region

type 10 root



# buckets

host Pnode01 {

id -8   # do not change unnecessarily

id -9 class ssd # do not change unnecessarily

# weight 0.000

alg straw2


[ceph-users] Re: PGs stuck down

2022-11-29 Thread Frank Schilder
Hi Dale,

> we thought we had set it up to prevent.. and with size = 4 and min_size set = 
> 1

I'm afraid this is exactly what you didn't. Firstly, min_size=1 is always a
bad idea. Secondly, if you have 2 data centres, the only way to get this to
work is to use stretch mode. Even if you had min_size=2 (which, by the way,
you should have in any case), without stretch mode you would not be guaranteed
to have all PGs active+clean after one DC goes down (or a cable gets cut).
There is a quite long and very detailed explanation of why this is the case,
and with min_size=1 you are very likely to hit one of these cases or even
lose data.
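
For reference, checking and fixing that is one command per pool (the pool name
is a placeholder; on a size=4 pool spanning 2 DCs, min_size=2 is the usual
non-stretch choice, and stretch mode adjusts these values itself once enabled,
per the documentation):

  ceph osd pool get <pool> size
  ceph osd pool get <pool> min_size
  ceph osd pool set <pool> min_size 2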

What you could check in your situation are these two:

mon_osd_min_up_ratio
mon_osd_min_in_ratio
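
To see what the mons are actually running with (the defaults are 0.3 for the up
ratio and 0.75 for the in ratio, but double-check for your release):

  ceph config get mon mon_osd_min_up_ratio
  ceph config get mon mon_osd_min_in_ratio
  # or, on releases without a config database / for the running daemon:
  ceph daemon mon.<id> config show | grep mon_osd_min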

My guess is that these prevented the mons from marking sufficiently many
OSDs as out, and therefore they got stuck peering (maybe even nothing was
marked down?). The other thing is that you almost certainly had exactly the
split-brain situation that stretch mode is there to prevent. You probably
ended up with 2 sub-clusters with 2 mons each, and now what? If the third mon
could still see the other 2, I don't think you get a meaningful quorum.
Stretch mode will actually change the crush rule, based on a decision by the
tie-breaking monitor, to re-configure the pool to use only OSDs in one of the
2 DCs so that no cross-site peering happens.

Maybe if you explicitly shut down one of the DC mons you can get things
working in one of the DCs?

Without stretch mode you need 3 DCs and a geo-replicated 3(2) pool.
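
In CRUSH terms that variant is straightforward; a sketch only, assuming three
datacenter buckets exist under the default root and <pool> is a placeholder:

  rule three_dc {
          id 2
          type replicated
          step take default
          step choose firstn 0 type datacenter
          step chooseleaf firstn 1 type host
          step emit
  }
  # then: ceph osd pool set <pool> crush_rule three_dc
  #       ceph osd pool set <pool> size 3
  #       ceph osd pool set <pool> min_size 2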

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Wolfpaw - Dale Corse 
Sent: 29 November 2022 07:20:20
To: 'ceph-users'
Subject: [ceph-users] PGs stuck down

Hi All,



We had a fiber cut tonight between 2 data centers, and a ceph cluster didn't
do very well :( We ended up with 98% of PGs as down.



This setup has 2 data centers defined, with 4 copies across both and a
minimum size of 1. We have 1 mon/mgr in each DC, with one in a 3rd data
center connected to each of the other 2 by VPN.



When I did a pg query on the PGs that were stuck, it said they were blocked
from coming up because they couldn't contact 2 of the OSDs (located in the
other data center, which it was unable to reach)... but the other 2 were fine.



I'm at a loss because this was exactly the thing we thought we had set it up
to prevent... and with size = 4 and min_size set to 1, I understood that it
would continue without a problem? :(
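
For the record, those peering details can be pulled straight out of a query on
one of the stuck PGs (the PG id below is a placeholder):

  ceph pg 2.1f query | grep -E -A5 'peering_blocked_by|down_osds_we_would_probe|blocked'
  # summary view of everything that is stuck:
  ceph pg dump_stuck inactive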



Crush map is below .. if anyone has any ideas? I would sincerely appreciate
it :)



Thanks!

Dale



# begin crush map

tunable choose_local_tries 0

tunable choose_local_fallback_tries 0

tunable choose_total_tries 50

tunable chooseleaf_descend_once 1

tunable chooseleaf_vary_r 1

tunable straw_calc_version 1



# devices

device 0 osd.0 class ssd

device 1 osd.1 class ssd

device 2 osd.2 class ssd

device 3 osd.3 class ssd

device 4 osd.4 class ssd

device 5 osd.5 class ssd

device 6 osd.6 class ssd

device 7 osd.7 class ssd

device 8 osd.8 class ssd

device 9 osd.9 class ssd

device 10 osd.10 class ssd

device 11 osd.11 class ssd

device 12 osd.12 class ssd

device 13 osd.13 class ssd

device 14 osd.14 class ssd

device 15 osd.15 class ssd

device 16 osd.16 class ssd

device 17 osd.17 class ssd

device 18 osd.18 class ssd

device 19 osd.19 class ssd

device 20 osd.20 class ssd

device 21 osd.21 class ssd

device 22 osd.22 class ssd

device 23 osd.23 class ssd

device 24 osd.24 class ssd

device 25 osd.25 class ssd

device 26 osd.26 class ssd

device 27 osd.27 class ssd

device 28 osd.28 class ssd

device 29 osd.29 class ssd

device 30 osd.30 class ssd

device 31 osd.31 class ssd

device 32 osd.32 class ssd

device 33 osd.33 class ssd

device 34 osd.34 class ssd

device 35 osd.35 class ssd

device 36 osd.36 class ssd

device 37 osd.37 class ssd

device 38 osd.38 class ssd

device 39 osd.39 class ssd

device 40 osd.40 class ssd

device 41 osd.41 class ssd

device 42 osd.42 class ssd

device 43 osd.43 class ssd

device 44 osd.44 class ssd

device 45 osd.45 class ssd

device 46 osd.46 class ssd

device 47 osd.47 class ssd

device 49 osd.49 class ssd



# types

type 0 osd

type 1 host

type 2 chassis

type 3 rack

type 4 row

type 5 pdu

type 6 pod

type 7 room

type 8 datacenter

type 9 region

type 10 root



# buckets

host Pnode01 {

id -8   # do not change unnecessarily

id -9 class ssd # do not change unnecessarily

# weight 0.000

alg straw2

hash 0  # rjenkins1

}

host node01 {

id -2   # do not change unnecessarily

id -15 class ssd# do not change unnecessarily

# weight 14.537

alg straw2

hash 0  # rjenkins1

item osd.4 weight 1.817

item osd.1 weight 1.817

item osd.3 weight 1.817

item osd.2 weight 1.817

item 

[ceph-users] Re: PGs stuck down

2022-11-28 Thread Yanko Davila
Hi Dale

Can you please post the ceph status? I'm no expert, but I would make sure that
the datacenter you intend to keep operating (while the connection gets
re-established) has two active monitors. Thanks.

Yanko.
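
For completeness, the quick checks for that would be along these lines; note
that with only 3 mons in total (one per DC plus the VPN-connected one), a
surviving DC has just one local mon and always needs the third site to reach
quorum:

  ceph -s
  ceph mon stat
  ceph quorum_status --format json-pretty | grep -E 'quorum_names|leader'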


> On Nov 29, 2022, at 7:20 AM, Wolfpaw - Dale Corse  wrote:
> 
> Hi All,
> 
> 
> 
> We had a fiber cut tonight between 2 data centers, and a ceph cluster didn't
> do very well :( We ended up with 98% of PGs as down.
> 
> 
> 
> This setup has 2 data centers defined, with 4 copies across both and a
> minimum size of 1. We have 1 mon/mgr in each DC, with one in a 3rd data
> center connected to each of the other 2 by VPN.
>
>
>
> When I did a pg query on the PGs that were stuck, it said they were blocked
> from coming up because they couldn't contact 2 of the OSDs (located in the
> other data center, which it was unable to reach)... but the other 2 were fine.
>
>
>
> I'm at a loss because this was exactly the thing we thought we had set it up
> to prevent... and with size = 4 and min_size set to 1, I understood that it
> would continue without a problem? :(
> 
> 
> 
> Crush map is below .. if anyone has any ideas? I would sincerely appreciate
> it :)
> 
> 
> 
> Thanks!
> 
> Dale
> 
> 
> 
> # begin crush map
> 
> tunable choose_local_tries 0
> 
> tunable choose_local_fallback_tries 0
> 
> tunable choose_total_tries 50
> 
> tunable chooseleaf_descend_once 1
> 
> tunable chooseleaf_vary_r 1
> 
> tunable straw_calc_version 1
> 
> 
> 
> # devices
> 
> device 0 osd.0 class ssd
> 
> device 1 osd.1 class ssd
> 
> device 2 osd.2 class ssd
> 
> device 3 osd.3 class ssd
> 
> device 4 osd.4 class ssd
> 
> device 5 osd.5 class ssd
> 
> device 6 osd.6 class ssd
> 
> device 7 osd.7 class ssd
> 
> device 8 osd.8 class ssd
> 
> device 9 osd.9 class ssd
> 
> device 10 osd.10 class ssd
> 
> device 11 osd.11 class ssd
> 
> device 12 osd.12 class ssd
> 
> device 13 osd.13 class ssd
> 
> device 14 osd.14 class ssd
> 
> device 15 osd.15 class ssd
> 
> device 16 osd.16 class ssd
> 
> device 17 osd.17 class ssd
> 
> device 18 osd.18 class ssd
> 
> device 19 osd.19 class ssd
> 
> device 20 osd.20 class ssd
> 
> device 21 osd.21 class ssd
> 
> device 22 osd.22 class ssd
> 
> device 23 osd.23 class ssd
> 
> device 24 osd.24 class ssd
> 
> device 25 osd.25 class ssd
> 
> device 26 osd.26 class ssd
> 
> device 27 osd.27 class ssd
> 
> device 28 osd.28 class ssd
> 
> device 29 osd.29 class ssd
> 
> device 30 osd.30 class ssd
> 
> device 31 osd.31 class ssd
> 
> device 32 osd.32 class ssd
> 
> device 33 osd.33 class ssd
> 
> device 34 osd.34 class ssd
> 
> device 35 osd.35 class ssd
> 
> device 36 osd.36 class ssd
> 
> device 37 osd.37 class ssd
> 
> device 38 osd.38 class ssd
> 
> device 39 osd.39 class ssd
> 
> device 40 osd.40 class ssd
> 
> device 41 osd.41 class ssd
> 
> device 42 osd.42 class ssd
> 
> device 43 osd.43 class ssd
> 
> device 44 osd.44 class ssd
> 
> device 45 osd.45 class ssd
> 
> device 46 osd.46 class ssd
> 
> device 47 osd.47 class ssd
> 
> device 49 osd.49 class ssd
> 
> 
> 
> # types
> 
> type 0 osd
> 
> type 1 host
> 
> type 2 chassis
> 
> type 3 rack
> 
> type 4 row
> 
> type 5 pdu
> 
> type 6 pod
> 
> type 7 room
> 
> type 8 datacenter
> 
> type 9 region
> 
> type 10 root
> 
> 
> 
> # buckets
> 
> host Pnode01 {
> 
>id -8   # do not change unnecessarily
> 
>id -9 class ssd # do not change unnecessarily
> 
># weight 0.000
> 
>alg straw2
> 
>hash 0  # rjenkins1
> 
> }
> 
> host node01 {
> 
>id -2   # do not change unnecessarily
> 
>id -15 class ssd# do not change unnecessarily
> 
># weight 14.537
> 
>alg straw2
> 
>hash 0  # rjenkins1
> 
>item osd.4 weight 1.817
> 
>item osd.1 weight 1.817
> 
>item osd.3 weight 1.817
> 
>item osd.2 weight 1.817
> 
>item osd.6 weight 1.817
> 
>item osd.9 weight 1.817
> 
>item osd.5 weight 1.817
> 
>item osd.0 weight 1.818
> 
> }
> 
> host node02 {
> 
>id -3   # do not change unnecessarily
> 
>id -16 class ssd# do not change unnecessarily
> 
># weight 14.536
> 
>alg straw2
> 
>hash 0  # rjenkins1
> 
>item osd.10 weight 1.817
> 
>item osd.11 weight 1.817
> 
>item osd.12 weight 1.817
> 
>item osd.13 weight 1.817
> 
>item osd.14 weight 1.817
> 
>item osd.15 weight 1.817
> 
>item osd.16 weight 1.817
> 
>item osd.19 weight 1.817
> 
> }
> 
> host node03 {
> 
>id -4   # do not change unnecessarily
> 
>id -17 class ssd# do not change unnecessarily
> 
># weight 14.536
> 
>alg straw2
> 
>hash 0  # rjenkins1
> 
>item osd.20 weight 1.817
> 
>item osd.21 weight 1.817
> 
>item osd.22 weight 1.817
> 
>item osd.23 weight 1.817
> 
>item osd.25 weight 1.817
> 
>item