Thanks Aravind - the inclusion of dead servers in the persistent view is hitting yet another code path & the problem is still there. I've reopened the ticket.

On 6/9/17 5:10 AM, Aravind Musigumpula wrote:

Hi,

1. I added the fix from GEODE-3052 <https://issues.apache.org/jira/browse/GEODE-3052> to the Geode 1.1.1 release code, but I am still able to reproduce the problem if we restart the second locator less than a second after the restart of the first locator. Is there any dependency for this fix?

I have attached the logs:

Locator1 log : klyazma-locator-0-21290.log

Locator2 log : vyazma-locator-0-21290.log

2. I also added some logger prints to the new fix I applied from GEODE-3052 <https://issues.apache.org/jira/browse/GEODE-3052>.

Thanks,

Aravind Musigumpula

*From:*Anton Mironenko
*Sent:* Friday, June 09, 2017 3:45 PM
*To:* [email protected]
*Subject:* RE: What functionality do we lose, if we delete locator*view.dat before starting a locator

Hello,

Thank you very much for the commit

https://issues.apache.org/jira/browse/GEODE-3052?focusedCommentId=16043697&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16043697

I’m looking forward to including it into 1.2.

Unfortunately, we cannot wait. We have to deliver our corporate release earlier than Geode 1.2 release.

So we need to come up with some workaround.

Once the Geode 1.2 release is available, we will remove the workaround.

As I understood, there are two workarounds for the issue GEODE-3003 <https://issues.apache.org/jira/browse/GEODE-3003>:

1. Put a pause of more than 2 seconds between the locator starts

2. Remove the locator*.dat file before locator start
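For reference, workaround 2 can be scripted. This is only a hedged sketch: the working-directory name, port, and gfsh invocation in the comments are illustrative assumptions, not project specifics.

```shell
#!/bin/sh
# Sketch of workaround 2: remove a locator's view file(s) from its working
# directory before (re)starting it. Directory name and gfsh flags below are
# illustrative assumptions.
clear_locator_view_files() {
  # delete only locator*view.dat, leaving any other locator state in place
  find "$1" -maxdepth 1 -name 'locator*view.dat' -print -delete
}

# Workaround 1 is purely operational: after starting the first locator, wait
#   sleep 3    # i.e. more than 2 seconds
# before starting the second one.
#
# Workaround 2, before restarting a locator (illustrative):
#   clear_locator_view_files ./locator1
#   gfsh start locator --name=locator1 --port=10334
```

Per the warning quoted below, this is only safe when the whole cluster has been shut down.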

“I would really like people to not get into the habit of deleting these files. If you accidentally delete them while servers are still up you will be forced to do a complete shutdown.”

I’ve played a bit with the artifacts from the issue GEODE-3003 <https://issues.apache.org/jira/browse/GEODE-3003> and found out that the side effect of deleting locator*.dat is the following:

If we start a locator whose locator*.dat file has been deleted while no other locators are running, the servers disappear from the membership view, so a restart of the servers is needed.

In more details:

1) Start locator on host1, start locator on host2,

2) Start server on host1, start server on host2,

3) Stop locator on host1: kill [host1-locator-PID],

4) Remove the locator*.dat file from host1,

5) Stop locator on host2: kill [host2-locator-PID],

6) Start locator on host1, start locator on host2,

7) Via gfsh we see that the cluster consists only of locators; the servers are gone. To rejoin the servers to the cluster, we need to restart them!

So if we skip step 4), we won’t run into the side effect in step 7).

If what I’ve described is the only side effect, we are OK to go with this temporary workaround of removing the locator*.dat file.

Anton Mironenko

*From:*Bruce Schuchardt [mailto:[email protected]]
*Sent:* Thursday, June 08, 2017 20:06
*To:* [email protected] <mailto:[email protected]>
*Subject:* Re: What functionality do we lose, if we delete locator*view.dat before starting a locator

The split-brain issue is easily reproduced in LocatorDUnitTest.testStartTwoLocators by duplicating the last line in the method:

public void testStartTwoLocators() throws Exception {
  disconnectAllFromDS();
  Host host = Host.getHost(0);
  VM loc1 = host.getVM(1);
  VM loc2 = host.getVM(2);
  int ports[] = AvailablePortHelper.getRandomAvailableTCPPorts(2);
  final int port1 = ports[0];
  this.port1 = port1;
  final int port2 = ports[1];
  this.port2 = port2; // for cleanup in tearDown2
  // DistributedTestUtils.deleteLocatorStateFile(port1);
  DistributedTestUtils.deleteLocatorStateFile(port2);
  final String host0 = NetworkUtils.getServerHostName(host);
  final String locators = host0 + "[" + port1 + "]," + host0 + "[" + port2 + "]";
  final Properties properties = new Properties();
  properties.put(MCAST_PORT, "0");
  properties.put(LOCATORS, locators);
  properties.put(ENABLE_NETWORK_PARTITION_DETECTION, "false");
  properties.put(DISABLE_AUTO_RECONNECT, "true");
  properties.put(MEMBER_TIMEOUT, "2000");
  properties.put(LOG_LEVEL, LogWriterUtils.getDUnitLogLevel());
  properties.put(ENABLE_CLUSTER_CONFIGURATION, "false");
  addDSProps(properties);
  startVerifyAndStopLocator(loc1, loc2, port1, port2, properties);
  startVerifyAndStopLocator(loc1, loc2, port1, port2, properties); // duplicated line
}


It fails every time on the second startVerifyAndStopLocator invocation. The fix for this is pretty simple and I'll try to get it in the upcoming 1.2 release. Then you won't have to delete the locator view.dat files or stagger startup anymore.

I would really like people to not get into the habit of deleting these files. If you accidentally delete them while servers are still up you will be forced to do a complete shutdown.
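One way to honor that advice in scripts is to refuse deletion while any locator process is still up. A minimal sketch, where the pgrep pattern ('gemfire.*Locator') is a guess rather than anything Geode-specific:

```shell
#!/bin/sh
# Guarded cleanup sketch: only delete locator view files when no locator
# process appears to be running. The pgrep pattern below is an assumption;
# adapt it to how your locators are actually launched.
remove_view_files_if_safe() {
  dir="$1"
  if command -v pgrep >/dev/null 2>&1 \
      && pgrep -f 'gemfire.*Locator' >/dev/null 2>&1; then
    echo "a locator still appears to be running; do a full shutdown first" >&2
    return 1
  fi
  find "$dir" -maxdepth 1 -name 'locator*view.dat' -print -delete
}
```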

On 6/8/17 9:45 AM, Udo Kohlmeyer wrote:

    Dharam,

    Thank you for testing this out as well. Using Anton's guidance,
    I've managed to reproduce the issue by restarting the 2 locators
    within 1s (try for sub-second if possible).

    Anton did describe that he did not see this behavior when the gap
    between restarting the two locators was more than 2s.

    --Udo

    On 6/8/17 09:29, Dharam Thacker wrote:

        Hi Anton,

        I also tried to reproduce your scenario in my local ubuntu
        machine with Geode 1.1.1, but I was able to restart cluster
        safely as explained below.

        host1> start locator1

        host2> start locator2

        host1> start server1

        host2> start server2

        host1> stop server1

        host2> stop server2

        host1> stop locator1

        host2> stop locator2

        verify all members shutdown well...

        host1> start locator2 [Even though I terminated this last as
        per the above sequence, I am starting it as the first member]

        host2> start locator1

        Of course, the start of locator2 gave me the same warning as I
        have highlighted below. Then I waited more than 10s before
        starting the second locator (locator1 was stopped earlier than
        locator2 in the past).

        But as soon as locator1 started, locator2 detected it and
        started up the cluster configuration service. The cluster was
        re-formed after that.

        Logs below for verification:

        [info 2017/06/08 21:47:10.911 IST locator2 <Pooled Message
        Processor 1> tid=0x2d] Region /_ConfigurationRegion has
        potentially stale data. It is waiting for another member to
        recover the latest data.
          My persistent id:

            DiskStore ID: a267d876-40c8-4c85-848a-5a397adb5e5b
            Name: locator2
            Location:
        
/192.168.1.12:/home/dharam/Downloads/apache-geode/locator2/ConfigDiskDir_locator2

          Members with potentially new data:
          [
            DiskStore ID: 39d28da8-6b2c-414c-9608-3550219b624d
            Name: locator1
            Location:
        
/192.168.1.12:/home/dharam/Downloads/apache-geode/locator1/ConfigDiskDir_locator1
          ]
          Use the "gfsh show missing-disk-stores" command to see all
        disk stores that are being waited on by other members.

        [warning 2017/06/08 21:47:45.606 IST locator2 <WAN Locator
        Discovery Thread> tid=0x2f] Locator discovery task could not
        exchange locator information 192.168.1.12[10335] with
        localhost[10334] after 6 retry attempts. Retrying in 10,000 ms.
        [info 2017/06/08 21:48:02.886 IST locator2 <unicast
        receiver,dharam-ThinkPad-Edge-E431-1183> tid=0x1c] received
        join request from 192.168.1.12(locator1:10969:locator)<ec>:1025

        [info 2017/06/08 21:48:03.187 IST locator2 <Geode Membership
        View Creator> tid=0x22] View Creator is processing 1 requests
        for the next membership view

        [info 2017/06/08 21:48:03.188 IST locator2 <Geode Membership
        View Creator> tid=0x22] preparing new view
        View[192.168.1.12(locator2:10853:locator)<ec><v0>:1024|1]
        members: [192.168.1.12(locator2:10853:locator)<ec><v0>:1024,
        192.168.1.12(locator1:10969:locator)<ec><v1>:1025]
          failure detection ports: 14001 42428

        [info 2017/06/08 21:48:03.221 IST locator2 <Geode Membership
        View Creator> tid=0x22] finished waiting for responses to view
        preparation

        [info 2017/06/08 21:48:03.221 IST locator2 <Geode Membership
        View Creator> tid=0x22] received new view:
        View[192.168.1.12(locator2:10853:locator)<ec><v0>:1024|1]
        members: [192.168.1.12(locator2:10853:locator)<ec><v0>:1024,
        192.168.1.12(locator1:10969:locator)<ec><v1>:1025]
          old view is:
        View[192.168.1.12(locator2:10853:locator)<ec><v0>:1024|0]
        members: [192.168.1.12(locator2:10853:locator)<ec><v0>:1024]

        [info 2017/06/08 21:48:03.222 IST locator2 <Geode Membership
        View Creator> tid=0x22] Peer locator received new membership
        view:
        View[192.168.1.12(locator2:10853:locator)<ec><v0>:1024|1]
        members: [192.168.1.12(locator2:10853:locator)<ec><v0>:1024,
        192.168.1.12(locator1:10969:locator)<ec><v1>:1025]

        [info 2017/06/08 21:48:03.228 IST locator2 <Geode Membership
        View Creator> tid=0x22] sending new view
        View[192.168.1.12(locator2:10853:locator)<ec><v0>:1024|1]
        members: [192.168.1.12(locator2:10853:locator)<ec><v0>:1024,
        192.168.1.12(locator1:10969:locator)<ec><v1>:1025]
          failure detection ports: 14001 42428

        [info 2017/06/08 21:48:03.232 IST locator2 <View Message
        Processor> tid=0x48] Membership: Processing addition <
        192.168.1.12(locator1:10969:locator)<ec><v1>:1025 >

        [info 2017/06/08 21:48:03.233 IST locator2 <View Message
        Processor> tid=0x48] Admitting member
        <192.168.1.12(locator1:10969:locator)<ec><v1>:1025>. Now there
        are 2 non-admin member(s).

        [info 2017/06/08 21:48:03.242 IST locator2 <pool-3-thread-1>
        tid=0x4a] Initializing region
        _monitoringRegion_192.168.1.12<v1>1025

        [info 2017/06/08 21:48:03.275 IST locator2 <Pooled High
        Priority Message Processor 1> tid=0x4e] Member
        192.168.1.12(locator1:10969:locator)<ec><v1>:1025 is
        equivalent or in the same redundancy zone.

        [info 2017/06/08 21:48:03.326 IST locator2 <pool-3-thread-1>
        tid=0x4a] Initialization of region
        _monitoringRegion_192.168.1.12<v1>1025 completed

        [info 2017/06/08 21:48:03.336 IST locator2 <pool-3-thread-1>
        tid=0x4a] Initializing region
        _notificationRegion_192.168.1.12<v1>1025

        [info 2017/06/08 21:48:03.338 IST locator2 <pool-3-thread-1>
        tid=0x4a] Initialization of region
        _notificationRegion_192.168.1.12<v1>1025 completed

        [info 2017/06/08 21:48:04.611 IST locator2 <Pooled Message
        Processor 1> tid=0x2d] Region _ConfigurationRegion requesting
        initial image from
        192.168.1.12(locator1:10969:locator)<ec><v1>:1025

        [info 2017/06/08 21:48:04.615 IST locator2 <Pooled Message
        Processor 1> tid=0x2d] _ConfigurationRegion is done getting
        image from 192.168.1.12(locator1:10969:locator)<ec><v1>:1025.
        isDeltaGII is true

        [info 2017/06/08 21:48:04.616 IST locator2 <Pooled Message
        Processor 1> tid=0x2d] Region _ConfigurationRegion initialized
        persistent id:
        
/192.168.1.12:/home/dharam/Downloads/apache-geode/locator2/ConfigDiskDir_locator2
        created at timestamp 1496938615755 version 0 diskStoreId
        a267d87640c84c85-848a5a397adb5e5b name locator2 with data from
        192.168.1.12(locator1:10969:locator)<ec><v1>:1025.

        [info 2017/06/08 21:48:04.617 IST locator2 <Pooled Message
        Processor 1> tid=0x2d] Initialization of region
        _ConfigurationRegion completed

        [info 2017/06/08 21:48:04.637 IST locator2 <Pooled Message
        Processor 1> tid=0x2d] ConfigRequestHandler installed

        [info 2017/06/08 21:48:04.637 IST locator2 <Pooled Message
        Processor 1> tid=0x2d] Cluster configuration service start up
        completed successfully and is now running ....

        [info 2017/06/08 21:48:05.692 IST locator2 <WAN Locator
        Discovery Thread> tid=0x2f] Locator discovery task exchanged
        locator information 192.168.1.12[10335] with localhost[10334]:
        {-1=[192.168.1.12[10335], 192.168.1.12[10334]]}.

        Thanks,

        Dharam


        - Dharam Thacker

        On Thu, Jun 8, 2017 at 9:25 PM, Bruce Schuchardt
        <[email protected] <mailto:[email protected]>> wrote:

            The locator view file exists to allow locators to be
            bounced without shutting down the rest of the cluster.  On
            startup a locator will try to find the current membership
            coordinator of the cluster from an existing locator and
            join the system using that information.  If there is no
            existing locator that knows who the coordinator might be
            then the new locator will try to find the coordinator
            using the membership "view" that is stored in the view
            file.  If there is no view file the locator will not be
            able to join the existing cluster.

            If you've done a full shutdown of the cluster it is safe
            to delete the locator*view.dat files.

            When there is no .dat file the locators will use a
            concurrent-startup algorithm to form a unified system.

            On 6/8/17 7:48 AM, Anton Mironenko wrote:

                Hello,

                We found out that deleting “locator*view.dat” before
                starting a locator fixes the first part of the issue

                https://issues.apache.org/jira/browse/GEODE-3003

                “Geode doesn't start after cluster restart when using
                cluster-configuration”

                “The second start goes wrong: the locator on the first
                host always doesn't join the rest of the cluster with
                the error in the locator log:
                "Region /_ConfigurationRegion has potentially stale
                data. It is waiting for another member to recover the
                latest data."”

                What is a side effect of deleting the file
                "locator0/locator*view.dat"? What functionality do we
                lose?

                A use case with some example would be great.

                Anton Mironenko

                This message and the information contained herein is
                proprietary and confidential and subject to the Amdocs
                policy statement,

                you may review at
                https://www.amdocs.com/about/email-disclaimer

