[ https://issues.apache.org/jira/browse/IGNITE-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16580970#comment-16580970 ]
Dmitry Sherstobitov edited comment on IGNITE-7165 at 8/15/18 12:04 PM:
-----------------------------------------------------------------------

[~Mmuzaf] Config:
{code:xml}
<property name="clientMode" value="false"/>
<property name="activeOnStart" value="false"/>
<property name="consistentId" value="${CONSISTENT_ID}"/>
<property name="cacheConfiguration" ref="caches"/>
<property name="communicationSpi">
    <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
        <property name="sharedMemoryPort" value="-1"/>
    </bean>
</property>
<property name="binaryConfiguration">
    <bean class="org.apache.ignite.configuration.BinaryConfiguration">
        <property name="compactFooter" value="false"/>
    </bean>
</property>
<property name="dataStorageConfiguration">
    <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
        <property name="defaultDataRegionConfiguration">
            <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                <property name="persistenceEnabled" value="true"/>
            </bean>
        </property>
        <property name="walSegments" value="5"/>
        <property name="walSegmentSize" value="1000000"/>
    </bean>
</property>
{code}
Test code:
{code:python}
def test_blinking_clients_clean_lfs(self):
    """
    IGNITE-7165
    """
    self.wait_for_running_clients_num(client_num=0, timeout=120)
    self.start_grid()  # start 4 nodes

    for _ in range(0, 10):
        log_print("Iteration %s" % str(_))

        # Check that no nodes left the grid because of the failure handler.
        self.assert_nodes_alive()

        self.ignite.kill_node(2)
        self._cleanup_lfs(2)
        self.ignite.start_node(2)

        # Run Ignition.start() with the client config and do nothing, 3 times.
        with PiClient(self.ignite, self.get_client_config()):
            pass
        with PiClient(self.ignite, self.get_client_config()):
            pass
        with PiClient(self.ignite, self.get_client_config()):
            pass

        # Check the LocalNodeMovingPartitionsCount metric for all cache groups
        # in the cluster: wait until it reaches 0 for every group.
        self.wait_for_finish_rebalance()
{code}
self.start_grid() starts a real grid on distributed servers using the ignite.sh scripts. Each with PiClient block starts a JVM and runs Ignition.start() with the client config (the major difference from the server config is clientMode=true); a sketch of what each block effectively does is shown below.
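For readers without our test framework: each empty PiClient block is roughly equivalent to the standalone snippet below. This is a minimal sketch; the class name and the discovery address list are my own placeholders, and the only setting that matters here is clientMode=true.
{code:java}
import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

public class BlinkingClient {
    public static void main(String[] args) {
        // Join and leave the topology three times, same as the three
        // empty `with PiClient(...)` blocks in the test above.
        for (int i = 0; i < 3; i++) {
            IgniteConfiguration cfg = new IgniteConfiguration()
                // The only relevant difference from the server config.
                .setClientMode(true)
                .setDiscoverySpi(new TcpDiscoverySpi()
                    .setIpFinder(new TcpDiscoveryVmIpFinder()
                        .setAddresses(Collections.singleton("127.0.0.1:47500..47509"))));

            // try-with-resources: the client node stops on block exit,
            // which produces exactly the join/leave "blink".
            try (Ignite client = Ignition.start(cfg)) {
                // Do nothing.
            }
        }
    }
}
{code}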
Log file of this test contains the following information: the metric does not change its state for 240 seconds on current master (I've recently checked this on the 15 Aug nightly build).
{code}
Current metric state for cache cache_group_1_028 on node 2: 19
[14:44:58][:568 :617] Wait rebalance to finish 7/240
Current metric state for cache cache_group_1_028 on node 2: 19
[14:45:04][:568 :617] Wait rebalance to finish 13/240
Current metric state for cache cache_group_1_028 on node 2: 19
....
[14:48:47][:568 :617] Wait rebalance to finish 236/240
Current metric state for cache cache_group_1_028 on node 2: 19
{code}
Config of the cache that fails:
{code:xml}
<bean class="org.apache.ignite.configuration.CacheConfiguration">
    <property name="name" value="cache_group_1_028"/>
    <property name="atomicityMode" value="ATOMIC"/>
    <property name="backups" value="1"/>
    <property name="cacheMode" value="PARTITIONED"/>
    <property name="writeSynchronizationMode" value="FULL_SYNC"/>
    <property name="evictionPolicy">
        <bean class="org.apache.ignite.cache.eviction.fifo.FifoEvictionPolicy">
            <property name="maxSize" value="1000"/>
        </bean>
    </property>
    <property name="onheapCacheEnabled" value="true"/>
    <property name="affinity">
        <bean class="org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction">
            <constructor-arg value="false"/>
            <constructor-arg value="32"/>
        </bean>
    </property>
</bean>
{code}
I'm afraid this is all the information I can provide for now. I've attached a jstack from node 2: [^node-2-jstack.log]
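For completeness, the check that wait_for_finish_rebalance() performs boils down to polling LocalNodeMovingPartitionsCount for every cache group until it drops to 0. Here is a rough sketch over plain JMX; the service URL/port and the exact ObjectName layout of Ignite's cache-group MBeans are assumptions on my part, so adjust the pattern to what the node actually registers.
{code:java}
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class WaitForRebalance {
    public static void main(String[] args) throws Exception {
        // Hypothetical JMX endpoint of node 2.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://127.0.0.1:49112/jmxrmi");

        try (JMXConnector jmx = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = jmx.getMBeanServerConnection();

            // Assumed pattern for the per-cache-group MBeans that expose
            // the cache group metrics (one MBean per cache group).
            ObjectName pattern = new ObjectName("org.apache.*:group=\"Cache groups\",*");

            long deadline = System.currentTimeMillis() + 240_000; // 240 s, as in the test

            while (System.currentTimeMillis() < deadline) {
                int moving = 0;

                // Sum the moving-partitions counter over all cache groups.
                for (ObjectName name : mbs.queryNames(pattern, null))
                    moving += (Integer)mbs.getAttribute(name, "LocalNodeMovingPartitionsCount");

                if (moving == 0)
                    return; // Rebalance finished for all cache groups.

                Thread.sleep(1000);
            }

            throw new IllegalStateException("Rebalance did not finish in 240 seconds");
        }
    }
}
{code}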
> Re-balancing is cancelled if client node joins
> ----------------------------------------------
>
>                 Key: IGNITE-7165
>                 URL: https://issues.apache.org/jira/browse/IGNITE-7165
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Mikhail Cherkasov
>            Assignee: Maxim Muzafarov
>            Priority: Critical
>              Labels: rebalance
>             Fix For: 2.7
>
>         Attachments: node-2-jstack.log, node-NO_REBALANCE-7165.log
>
>
> Re-balancing is cancelled if a client node joins. Re-balancing can take hours, and each time a client node joins it starts again:
> [15:10:05,700][INFO][disco-event-worker-#61%statement_grid%][GridDiscoveryManager] Added new node to topology: TcpDiscoveryNode [id=979cf868-1c37-424a-9ad1-12db501f32ef, addrs=[0:0:0:0:0:0:0:1, 127.0.0.1, 172.31.16.213], sockAddrs=[/0:0:0:0:0:0:0:1:0, /127.0.0.1:0, /172.31.16.213:0], discPort=0, order=36, intOrder=24, lastExchangeTime=1512907805688, loc=false, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=true]
> [15:10:05,701][INFO][disco-event-worker-#61%statement_grid%][GridDiscoveryManager] Topology snapshot [ver=36, servers=7, clients=5, CPUs=128, heap=160.0GB]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][time] Started exchange init [topVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], crd=false, evt=NODE_JOINED, evtNode=979cf868-1c37-424a-9ad1-12db501f32ef, customEvt=null, allowMerge=true]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], err=null]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][time] Finished exchange init [topVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], crd=false]
> [15:10:05,703][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager] Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=36, minorTopVer=0], evt=NODE_JOINED, node=979cf868-1c37-424a-9ad1-12db501f32ef]
> [15:10:08,706][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander] Cancelled rebalancing from all nodes [topology=AffinityTopologyVersion [topVer=35, minorTopVer=0]]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager] Rebalancing scheduled [order=[statementp]]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager] Rebalancing started [top=null, evt=NODE_JOINED, node=a8be3c14-9add-48c3-b099-3fd304cfdbf4]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander] Starting rebalancing [mode=ASYNC, fromNode=2f6bde48-ffb5-4815-bd32-df4e57dc13e0, partitionsCount=18, topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], updateSeq=-1754630006]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander] Starting rebalancing [mode=ASYNC, fromNode=35d01141-4dce-47dd-adf6-a4f3b2bb9da9, partitionsCount=15, topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], updateSeq=-1754630006]
> [15:10:08,708][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander] Starting rebalancing [mode=ASYNC, fromNode=b3a8be53-e61f-4023-a906-a265923837ba, partitionsCount=15, topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], updateSeq=-1754630006]
> [15:10:08,708][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander] Starting rebalancing [mode=ASYNC, fromNode=f825cb4e-7dcc-405f-a40d-c1dc1a3ade5a, partitionsCount=12, topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], updateSeq=-1754630006]
> [15:10:08,708][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander] Starting rebalancing [mode=ASYNC, fromNode=4ae1db91-8b88-4180-a84b-127a303959e9, partitionsCount=11, topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], updateSeq=-1754630006]
> [15:10:08,708][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander] Starting rebalancing [mode=ASYNC, fromNode=7c286481-7638-49e4-8c68-fa6aa65d8b76, partitionsCount=18, topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], updateSeq=-1754630006]
>
> So in clusters with a big amount of data and frequent client leave/join events, this means that a new server will never receive its partitions.