This is Geode.
After I set enable-network-partition-detection=true, I ran into the
following problem:
The cluster (10 nodes) was working under normal production load. One node
went down. All the other nodes started getting the exception below.
The line I'm getting the exception on is: region.size()
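For context, the call is made from a periodic task. A simplified sketch of
what that task does (the class, method, region parameter, schedule, and
logging are illustrative, not the actual ImageServer code):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import com.gemstone.gemfire.cache.Region;

    class SizeCheckTask {
        // schedules a periodic region.size() call, like the one in the trace
        static void schedule(final Region<?, ?> region) {
            ScheduledExecutorService scheduler =
                    Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleAtFixedRate(new Runnable() {
                public void run() {
                    // this call now throws DistributedSystemDisconnectedException
                    // once the member has been forced out of the distributed system
                    int size = region.size();
                    System.out.println("region size: " + size); // illustrative
                }
            }, 0, 30, TimeUnit.SECONDS);
        }
    }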
I had hoped that if a node goes down, the system would keep functioning
normally; it would just lose a portion of the data, which is understood, but
the rest would continue to work.
Is there anything that can be done here to avoid the exception?
Thanks,
Eugene
com.gemstone.gemfire.distributed.DistributedSystemDisconnectedException:
GemFire on 10.132.49.101(3787)<ec><v6>:1024 started at Tue May 03 17:06:13
EDT 2016: Message distribution has terminated
at com.gemstone.gemfire.distributed.internal.
DistributionManager$Stopper.generateCancelledException(
DistributionManager.java:745)
at com.gemstone.gemfire.distributed.internal.InternalDistributedSystem$
Stopper.generateCancelledException(InternalDistributedSystem.java:861)
at com.gemstone.gemfire.internal.cache.GemFireCacheImpl$Stopper.
generateCancelledException(GemFireCacheImpl.java:1453)
at com.gemstone.gemfire.CancelCriterion.checkCancelInProgress(
CancelCriterion.java:91)
at com.gemstone.gemfire.internal.cache.LocalRegion.checkRegionDestroyed(
LocalRegion.java:8118)
at com.gemstone.gemfire.internal.cache.LocalRegion.
checkReadiness(LocalRegion.java:2994)
at com.gemstone.gemfire.internal.cache.LocalRegion.size(
LocalRegion.java:9668)
at ccio.image.ImageServer$2.run(ImageServer.java:135)
at java.util.concurrent.Executors$RunnableAdapter.
call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$
ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$
ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(
ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(
ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.gemstone.gemfire.ForcedDisconnectException: Member isn't
responding to heartbeat requests
at com.gemstone.gemfire.distributed.internal.membership.gms.mgr.
GMSMembershipManager.forceDisconnect(GMSMembershipManager.java:2571)
at com.gemstone.gemfire.distributed.internal.membership.gms.membership.
GMSJoinLeave.forceDisconnect(GMSJoinLeave.java:811)
at com.gemstone.gemfire.distributed.internal.membership.gms.membership.
GMSJoinLeave.processRemoveRequest(GMSJoinLeave.java:519)
at com.gemstone.gemfire.distributed.internal.membership.gms.membership.
GMSJoinLeave.processMessage(GMSJoinLeave.java:1459)
at com.gemstone.gemfire.distributed.internal.membership.gms.messenger.
JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1051)
at org.jgroups.JChannel.invokeCallback(JChannel.java:817)
at org.jgroups.JChannel.up(JChannel.java:741)
at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1029)
at org.jgroups.protocols.FRAG2.up(FRAG2.java:165)
at org.jgroups.protocols.FlowControl.up(FlowControl.java:394)
at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1064)
at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:779)
at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:426)
at com.gemstone.gemfire.distributed.internal.membership.gms.messenger.
StatRecorder.up(StatRecorder.java:72)
at com.gemstone.gemfire.distributed.internal.membership.gms.messenger.
AddressManager.up(AddressManager.java:76)
at org.jgroups.protocols.TP.passMessageUp(TP.java:1577)
at org.jgroups.protocols.TP$MyHandler.run(TP.java:1796)
at org.jgroups.util.DirectExecutor.execute(DirectExecutor.java:10)
at org.jgroups.protocols.TP.handleSingleMessage(TP.java:1693)
at org.jgroups.protocols.TP.receive(TP.java:1630)
at com.gemstone.gemfire.distributed.internal.membership.gms.messenger.
Transport.receive(Transport.java:165)
at org.jgroups.protocols.UDP$PacketReceiver.run(UDP.java:691)
... 1 common frames omitted
On Tue, May 3, 2016 at 8:10 PM, Bruce Schuchardt <[email protected]>
wrote:
> Is this using Geode or GemFire? Either way, if you continue to have
> problems, you can PM Udo and me directly. Send us a zip with the log files
> and we'll help you figure it out.
> On 5/3/2016 at 2:13 PM, Eugene Strokin wrote:
>
> Udo, thanks for the hint. The property was indeed missing.
> I've put it into my gemfire.properties file, and now the cluster waits for
> all nodes to start before proceeding with any activity.
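> For reference, this is the line I added to gemfire.properties:
>
>     enable-network-partition-detection=true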
> Eugene
>
> On Tue, May 3, 2016 at 4:28 PM, Udo Kohlmeyer <[email protected]>
> wrote:
>
>> Hi there Eugene,
>>
>> Can you check whether the enable-network-partition-detection property is
>> set, as per the documentation?
>> Handling Network Partitioning
>> <http://geode.docs.pivotal.io/docs/managing/network_partitioning/handling_network_partitioning.html>
>>
>> --Udo
>>
>>
>> On 4/05/2016 6:22 am, Eugene Strokin wrote:
>>
>> I'm testing my 10-node cluster under production load and with production
>> data.
>> I was using an automated tool which created the nodes (VMs), configured
>> everything, and restarted all of them.
>> Everything worked, meaning I was getting the data I expected, but when I
>> checked the stats I noticed that I was running 10 one-node clusters. My
>> nodes didn't see each other; each node had its own duplicated set of data.
>> I've stopped all the nodes, cleaned all logs/storage files, and restarted
>> the nodes again.
>> This time I got one cluster of 7 nodes, while the other 3 nodes stayed
>> separate.
>> I've stopped the 3 nodes, cleaned them up, and started them up one by one,
>> and they successfully joined the cluster. In the end I had all 10 nodes
>> working as a single cluster.
>> But I'm afraid that if the nodes get restarted or the network has problems,
>> I could end up with a split cluster again.
>> I use the API to start the Cache with locators, and all the locator IPs
>> are provided in the config. From the documentation I had the impression
>> that Geode would wait until N/2+1 nodes had started before forming the
>> cluster, since the number of locators is preset. But it looks like that is
>> not the case.
>> Is there a setting I should use to force that behavior?
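>> Roughly how I start the cache from the API (a simplified sketch; the
>> locator addresses and region name below are placeholders, not my real
>> config):
>>
>>     import java.util.Properties;
>>     import com.gemstone.gemfire.cache.Cache;
>>     import com.gemstone.gemfire.cache.CacheFactory;
>>     import com.gemstone.gemfire.cache.Region;
>>
>>     Properties props = new Properties();
>>     // all locator addresses, host[port], comma-separated (placeholders)
>>     props.setProperty("locators", "10.0.0.1[10334],10.0.0.2[10334]");
>>     Cache cache = new CacheFactory(props).create();
>>     Region<String, byte[]> region = cache.getRegion("images"); // placeholder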
>>
>> Thank you,
>> Eugene
>>
>>
>>
>
>