Re: [ceph-users] resolve split brain situation in ceph cluster

2016-10-17 Thread Gregory Farnum
On Mon, Oct 17, 2016 at 4:58 AM, Manuel Lausch  wrote:
> Hi Gregory,
>
> each datacenter has its own IP subnet which is routed. We simultaneously
> created iptables rules on each host which drop all packets coming in from
> and going out to the other datacenter. After this our application wrote to
> DC A, where 3 of the 5 monitor nodes are.
> Then we modified the monmap in B (removed all mon nodes from DC A, so there
> are now 2 of 2 mons active). The monmap in A is untouched. The cluster in B
> was now active as well and the applications in B could now write to it. So
> we definitely wrote data in both cluster parts.
> After this we shut down the mon nodes in A. The part in A was now
> unavailable.
>
> Some hours later we removed the iptables rules and tried to rejoin the two
> parts.
> We rejoined the three mon nodes from A as new nodes; the old mon data on
> these nodes was destroyed.
>
>
> Do you need further information?

Oh, so you actually forced both data centers to go active on purpose.
Yeah, there's no realistic recovery from that besides throwing out one
side and then adding it back to the cluster in the other DC. Sorry.
-Greg
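
In practice, "throwing out one side" means purging each OSD that belonged to
the discarded DC and then redeploying it empty. A rough hammer-era sketch,
with osd.12 and /dev/sdX as placeholders (not commands Greg gave):

    ceph osd out 12                  # stop mapping data to the OSD
    ceph osd crush remove osd.12     # remove it from the crush map
    ceph auth del osd.12             # delete its auth key
    ceph osd rm 12                   # remove it from the osdmap
    ceph-disk zap /dev/sdX           # wipe the disk, destroys all data on it
    # then recreate the OSD with your deployment tooling so it rejoins as a
    # new, empty OSD and backfills from the surviving side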
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] resolve split brain situation in ceph cluster

2016-10-17 Thread Manuel Lausch

Hi Gregory,

each datacenter has its own IP subnet which is routed. We simultaneously 
created iptables rules on each host which drop all packets coming in from 
and going out to the other datacenter. After this our application wrote 
to DC A, where 3 of the 5 monitor nodes are.
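Roughly, the blocking rules were of this shape (a minimal sketch; 10.2.0.0/16
stands in for the other datacenter's subnet, the real rules may differ):

    iptables -A INPUT  -s 10.2.0.0/16 -j DROP    # drop everything arriving from the other DC
    iptables -A OUTPUT -d 10.2.0.0/16 -j DROP    # drop everything leaving towards the other DC
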
Then we modified the monmap in B (removed all mon nodes from DC A, so 
there are now 2 of 2 mons active). The monmap in A is untouched. The 
cluster in B was now active as well and the applications in B could now 
write to it. So we definitely wrote data in both cluster parts.
After this we shut down the mon nodes in A. The part in A was now 
unavailable.
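
The offline edit in B followed the usual monmaptool procedure, roughly (a
sketch with hypothetical monitor names mon-a1..mon-a3 and mon-b1; the mon
daemon has to be stopped while extracting and injecting):

    ceph-mon -i mon-b1 --extract-monmap /tmp/monmap              # dump the map from a stopped mon in B
    monmaptool --print /tmp/monmap                                # inspect the current members
    monmaptool --rm mon-a1 --rm mon-a2 --rm mon-a3 /tmp/monmap    # drop the DC A monitors
    ceph-mon -i mon-b1 --inject-monmap /tmp/monmap                # inject the edited map (repeat per mon in B)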


Some hours later we removed the iptables rules and tried to rejoin the 
two parts.
We rejoined the three mon nodes from A as new nodes; the old mon data 
on these nodes was destroyed.
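
Rejoining a wiped monitor as a new node was the standard add-a-monitor
procedure, roughly (a sketch with hypothetical names and paths):

    rm -rf /var/lib/ceph/mon/ceph-mon-a1                  # destroy the old, diverged mon data
    ceph mon getmap -o /tmp/monmap                        # fetch the map from the running quorum
    ceph auth get mon. -o /tmp/mon.keyring                # and the monitor keyring
    ceph-mon -i mon-a1 --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
    ceph mon add mon-a1 <ip>:6789                         # register it, then start the ceph-mon daemon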



Do you need further information?

Regards,
Manuel


On 14.10.2016 at 17:58, Gregory Farnum wrote:

On Fri, Oct 14, 2016 at 7:27 AM, Manuel Lausch  wrote:

Hi,

I need some help to fix a broken cluster. I think we broke the cluster, but
I want to know your opinion and if you see a possibility to recover it.

Let me explain what happend.

We have a cluster (version 0.94.9) in two datacenters (A and B), each with 12
nodes of 60 OSDs. In A we have 3 monitor nodes and in B 2. The crush rule and
replication factor force two replicas in each datacenter.

We write objects into the cluster via librados. The objects are immutable, so
they are either present or absent.

In this cluster we tested what happens if datacenter A fails and we need
to bring up the cluster in B by creating a monitor quorum in B. We did this
by cutting off the network connection between the two datacenters. The OSDs
from DC B went down as expected. Then we removed the mon nodes from the monmap
in B (by extracting it offline and editing it). Our clients then wrote data to
both independent cluster parts before we stopped the mons in A. (YES, I know.
This is a really bad thing.)

This story line seems to be missing some points. How did you cut off
the network connection? What leads you to believe the OSDs accepted
writes on both sides of the split? Did you edit the monmap in both
data centers, or just DC A (that you wanted to remain alive)? What
monitor counts do you have in each DC?
-Greg


Now we are trying to join the two sides again, but so far without success.

Only the OSDs in B are running. The OSDs in A were started but stay
down. In the mon log we see a lot of "...(leader).pg v3513957 ignoring stats
from non-active osd" messages.

We see that the current osdmap epoch in the running cluster is "28873". On
the OSDs in A the epoch is "29003". We assume that this is the reason why
the OSDs won't join.


BTW: This is only a test cluster, so no important data is harmed.


Regards
Manuel


--
Manuel Lausch

Systemadministrator
Cloud Services

1&1 Mail & Media Development & Technology GmbH | Brauerstraße 48 | 76135
Karlsruhe | Germany
Phone: +49 721 91374-1847
E-Mail: manuel.lau...@1und1.de | Web: www.1und1.de



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Manuel Lausch

Systemadministrator
Cloud Services

1&1 Mail & Media Development & Technology GmbH | Brauerstraße 48 | 76135 
Karlsruhe | Germany
Phone: +49 721 91374-1847
E-Mail: manuel.lau...@1und1.de | Web: www.1und1.de


Re: [ceph-users] resolve split brain situation in ceph cluster

2016-10-14 Thread Gregory Farnum
On Fri, Oct 14, 2016 at 7:27 AM, Manuel Lausch  wrote:
> Hi,
>
> I need some help to fix a broken cluster. I think we broke the cluster, but
> I want to know your opinion and if you see a possibility to recover it.
>
> Let me explain what happend.
>
> We have a cluster (version 0.94.9) in two datacenters (A and B), each with 12
> nodes of 60 OSDs. In A we have 3 monitor nodes and in B 2. The crush rule and
> replication factor force two replicas in each datacenter.
>
> We write objects into the cluster via librados. The objects are immutable, so
> they are either present or absent.
>
> In this cluster we tested what happens if datacenter A fails and we need
> to bring up the cluster in B by creating a monitor quorum in B. We did this
> by cutting off the network connection between the two datacenters. The OSDs
> from DC B went down as expected. Then we removed the mon nodes from the monmap
> in B (by extracting it offline and editing it). Our clients then wrote data to
> both independent cluster parts before we stopped the mons in A. (YES, I know.
> This is a really bad thing.)

This story line seems to be missing some points. How did you cut off
the network connection? What leads you to believe the OSDs accepted
writes on both sides of the split? Did you edit the monmap in both
data centers, or just DC A (that you wanted to remain alive)? What
monitor counts do you have in each DC?
-Greg

>
> Now we are trying to join the two sides again, but so far without success.
>
> Only the OSDs in B are running. The OSDs in A were started but stay
> down. In the mon log we see a lot of "...(leader).pg v3513957 ignoring stats
> from non-active osd" messages.
>
> We see that the current osdmap epoch in the running cluster is "28873". On
> the OSDs in A the epoch is "29003". We assume that this is the reason why
> the OSDs won't join.
>
>
> BTW: This is only a test cluster, so no important data is harmed.
>
>
> Regards
> Manuel
>
>
> --
> Manuel Lausch
>
> Systemadministrator
> Cloud Services
>
> 1&1 Mail & Media Development & Technology GmbH | Brauerstraße 48 | 76135
> Karlsruhe | Germany
> Phone: +49 721 91374-1847
> E-Mail: manuel.lau...@1und1.de | Web: www.1und1.de
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] resolve split brain situation in ceph cluster

2016-10-14 Thread Manuel Lausch

Hi,

I need some help to fix a broken cluster. I think we broke the cluster, 
but I want to know your opinion and if you see a possibility to recover it.


Let me explain what happend.

We have a cluster (version 0.94.9) in two datacenters (A and B), each with 
12 nodes of 60 OSDs. In A we have 3 monitor nodes and in B 2. The 
crush rule and replication factor force two replicas in each datacenter.


We write objects into the cluster via librados. The objects are immutable, 
so they are either present or absent.
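
In practice that access pattern is just a write-once plus existence checks,
e.g. with the rados CLI (pool and object names are made up here):

    rados -p objects put obj-0001 ./payload.bin    # write the immutable object exactly once
    rados -p objects stat obj-0001                 # later we only need to know whether it exists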


In this cluster we tested what happens if datacenter A fails and we 
need to bring up the cluster in B by creating a monitor quorum in B. We 
did this by cutting off the network connection between the two datacenters. 
The OSDs from DC B went down as expected. Then we removed the mon nodes 
from the monmap in B (by extracting it offline and editing it). Our clients 
then wrote data to both independent cluster parts before we stopped the 
mons in A. (YES, I know. This is a really bad thing.)


Now we are trying to join the two sides again, but so far without success.

Only the OSDs in B are running. The OSDs in A were started but stay 
down. In the mon log we see a lot of "...(leader).pg v3513957 ignoring 
stats from non-active osd" messages.
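
Which OSDs the monitors consider down can be checked with, for example:

    ceph health detail     # lists down OSDs and the stuck/degraded PGs
    ceph osd tree          # shows the up/down state of every OSD in the crush tree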


We see that the current osdmap epoch in the running cluster is "28873". 
On the OSDs in A the epoch is "29003". We assume that this is the reason 
why the OSDs won't join.
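
We compared the epochs roughly like this (osd.0 stands for any OSD daemon in
A, queried over its local admin socket):

    ceph osd dump | grep ^epoch     # osdmap epoch the monitors currently serve (28873)
    ceph daemon osd.0 status        # "newest_map" shows the epoch this OSD has already seen (29003)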



BTW: This is only a test cluster, so no important data is harmed.


Regards
Manuel


--
Manuel Lausch

Systemadministrator
Cloud Services

1&1 Mail & Media Development & Technology GmbH | Brauerstraße 48 | 76135 
Karlsruhe | Germany
Phone: +49 721 91374-1847
E-Mail: manuel.lau...@1und1.de | Web: www.1und1.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com