[ https://issues.apache.org/jira/browse/IGNITE-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16580970#comment-16580970 ]
Dmitry Sherstobitov edited comment on IGNITE-7165 at 8/15/18 12:04 PM:
-----------------------------------------------------------------------

[~Mmuzaf] Config:
{code:xml}
<property name="clientMode" value="false"/>
<property name="activeOnStart" value="false"/>
<property name="consistentId" value="${CONSISTENT_ID}"/>
<property name="cacheConfiguration" ref="caches"/>
<property name="communicationSpi">
    <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
        <property name="sharedMemoryPort" value="-1"/>
    </bean>
</property>
<property name="binaryConfiguration">
    <bean class="org.apache.ignite.configuration.BinaryConfiguration">
        <property name="compactFooter" value="false"/>
    </bean>
</property>
<property name="dataStorageConfiguration">
    <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
        <property name="defaultDataRegionConfiguration">
            <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                <property name="persistenceEnabled" value="true"/>
            </bean>
        </property>
        <property name="walSegments" value="5"/>
        <property name="walSegmentSize" value="1000000"/>
    </bean>
</property>
{code}
Test code:
{code:python}
def test_blinking_clients_clean_lfs(self):
    """
    IGNITE-7165
    """
    self.wait_for_running_clients_num(client_num=0, timeout=120)
    self.start_grid()  # start 4 nodes

    for _ in range(0, 10):
        log_print("Iteration %s" % str(_))

        # Check that no nodes left the grid because of the failure handler.
        self.assert_nodes_alive()

        self.ignite.kill_node(2)
        self._cleanup_lfs(2)
        self.ignite.start_node(2)

        # Run Ignition.start() with the client config and do nothing, 3 times.
        with PiClient(self.ignite, self.get_client_config()):
            pass
        with PiClient(self.ignite, self.get_client_config()):
            pass
        with PiClient(self.ignite, self.get_client_config()):
            pass

        # Check the LocalNodeMovingPartitionsCount metric for all cache groups
        # in the cluster: wait until it reaches 0 for every group.
        self.wait_for_finish_rebalance()
{code}
self.start_grid() starts a real grid on distributed servers using the ignite.sh scripts. Each with PiClient block starts a JVM and runs Ignition.start() with the client config (the major difference from the server config is clientMode=true); a sketch of what each block effectively does is shown below.
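For readers without our test framework: each empty PiClient block is roughly equivalent to the standalone snippet below. This is a minimal sketch; the class name and the discovery address list are my own placeholders, and the only setting that matters here is clientMode=true.
{code:java}
import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

public class BlinkingClient {
    public static void main(String[] args) {
        // Join and leave the topology three times, same as the three
        // empty `with PiClient(...)` blocks in the test above.
        for (int i = 0; i < 3; i++) {
            IgniteConfiguration cfg = new IgniteConfiguration()
                // The only relevant difference from the server config.
                .setClientMode(true)
                .setDiscoverySpi(new TcpDiscoverySpi()
                    .setIpFinder(new TcpDiscoveryVmIpFinder()
                        .setAddresses(Collections.singleton("127.0.0.1:47500..47509"))));

            // try-with-resources: the client node stops on block exit,
            // which produces exactly the join/leave "blink".
            try (Ignite client = Ignition.start(cfg)) {
                // Do nothing.
            }
        }
    }
}
{code}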
Log file of this test contains the following information: the metric does not change its state for 240 seconds on current master (I've recently checked this on the 15 Aug nightly build).
{code}
Current metric state for cache cache_group_1_028 on node 2: 19
[14:44:58][:568 :617] Wait rebalance to finish 7/240
Current metric state for cache cache_group_1_028 on node 2: 19
[14:45:04][:568 :617] Wait rebalance to finish 13/240
Current metric state for cache cache_group_1_028 on node 2: 19
....
[14:48:47][:568 :617] Wait rebalance to finish 236/240
Current metric state for cache cache_group_1_028 on node 2: 19
{code}
Config of the cache that fails:
{code:xml}
<bean class="org.apache.ignite.configuration.CacheConfiguration">
    <property name="name" value="cache_group_1_028"/>
    <property name="atomicityMode" value="ATOMIC"/>
    <property name="backups" value="1"/>
    <property name="cacheMode" value="PARTITIONED"/>
    <property name="writeSynchronizationMode" value="FULL_SYNC"/>
    <property name="evictionPolicy">
        <bean class="org.apache.ignite.cache.eviction.fifo.FifoEvictionPolicy">
            <property name="maxSize" value="1000"/>
        </bean>
    </property>
    <property name="onheapCacheEnabled" value="true"/>
    <property name="affinity">
        <bean class="org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction">
            <constructor-arg value="false"/>
            <constructor-arg value="32"/>
        </bean>
    </property>
</bean>
{code}
I'm afraid this is all the information I can provide for now. I've attached a jstack from node 2: [^node-2-jstack.log]
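For completeness, the check that wait_for_finish_rebalance() performs boils down to polling LocalNodeMovingPartitionsCount for every cache group until it drops to 0. Here is a rough sketch over plain JMX; the service URL/port and the exact ObjectName layout of Ignite's cache-group MBeans are assumptions on my part, so adjust the pattern to what the node actually registers.
{code:java}
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class WaitForRebalance {
    public static void main(String[] args) throws Exception {
        // Hypothetical JMX endpoint of node 2.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://127.0.0.1:49112/jmxrmi");

        try (JMXConnector jmx = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = jmx.getMBeanServerConnection();

            // Assumed pattern for the per-cache-group MBeans that expose
            // the cache group metrics (one MBean per cache group).
            ObjectName pattern = new ObjectName("org.apache.*:group=\"Cache groups\",*");

            long deadline = System.currentTimeMillis() + 240_000; // 240 s, as in the test

            while (System.currentTimeMillis() < deadline) {
                int moving = 0;

                // Sum the moving-partitions counter over all cache groups.
                for (ObjectName name : mbs.queryNames(pattern, null))
                    moving += (Integer)mbs.getAttribute(name, "LocalNodeMovingPartitionsCount");

                if (moving == 0)
                    return; // Rebalance finished for all cache groups.

                Thread.sleep(1000);
            }

            throw new IllegalStateException("Rebalance did not finish in 240 seconds");
        }
    }
}
{code}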
> Re-balancing is cancelled if client node joins
> ----------------------------------------------
>
>                 Key: IGNITE-7165
>                 URL: https://issues.apache.org/jira/browse/IGNITE-7165
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Mikhail Cherkasov
>            Assignee: Maxim Muzafarov
>            Priority: Critical
>              Labels: rebalance
>             Fix For: 2.7
>
>         Attachments: node-2-jstack.log, node-NO_REBALANCE-7165.log
>
>
> Re-balancing is cancelled if a client node joins. Re-balancing can take hours, and each time a client node joins it starts again:
> [15:10:05,700][INFO][disco-event-worker-#61%statement_grid%][GridDiscoveryManager] Added new node to topology: TcpDiscoveryNode [id=979cf868-1c37-424a-9ad1-12db501f32ef, addrs=[0:0:0:0:0:0:0:1, 127.0.0.1, 172.31.16.213], sockAddrs=[/0:0:0:0:0:0:0:1:0, /127.0.0.1:0, /172.31.16.213:0], discPort=0, order=36, intOrder=24, lastExchangeTime=1512907805688, loc=false, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=true]
> [15:10:05,701][INFO][disco-event-worker-#61%statement_grid%][GridDiscoveryManager] Topology snapshot [ver=36, servers=7, clients=5, CPUs=128, heap=160.0GB]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][time] Started exchange init [topVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], crd=false, evt=NODE_JOINED, evtNode=979cf868-1c37-424a-9ad1-12db501f32ef, customEvt=null, allowMerge=true]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], err=null]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][time] Finished exchange init [topVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], crd=false]
> [15:10:05,703][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager] Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=36, minorTopVer=0], evt=NODE_JOINED, node=979cf868-1c37-424a-9ad1-12db501f32ef]
> [15:10:08,706][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander] Cancelled rebalancing from all nodes [topology=AffinityTopologyVersion [topVer=35, minorTopVer=0]]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager] Rebalancing scheduled [order=[statementp]]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager] Rebalancing started [top=null, evt=NODE_JOINED, node=a8be3c14-9add-48c3-b099-3fd304cfdbf4]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander] Starting rebalancing [mode=ASYNC, fromNode=2f6bde48-ffb5-4815-bd32-df4e57dc13e0, partitionsCount=18, topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], updateSeq=-1754630006]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander] Starting rebalancing [mode=ASYNC, fromNode=35d01141-4dce-47dd-adf6-a4f3b2bb9da9, partitionsCount=15, topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], updateSeq=-1754630006]
> [15:10:08,708][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander] Starting rebalancing [mode=ASYNC, fromNode=b3a8be53-e61f-4023-a906-a265923837ba, partitionsCount=15, topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], updateSeq=-1754630006]
> [15:10:08,708][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander] Starting rebalancing [mode=ASYNC, fromNode=f825cb4e-7dcc-405f-a40d-c1dc1a3ade5a, partitionsCount=12, topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], updateSeq=-1754630006]
> [15:10:08,708][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander] Starting rebalancing [mode=ASYNC, fromNode=4ae1db91-8b88-4180-a84b-127a303959e9, partitionsCount=11, topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], updateSeq=-1754630006]
> [15:10:08,708][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander] Starting rebalancing [mode=ASYNC, fromNode=7c286481-7638-49e4-8c68-fa6aa65d8b76, partitionsCount=18, topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], updateSeq=-1754630006]
>
> So in clusters with a big amount of data and frequent client leave/join events, this means that a new server will never receive its partitions.