[jira] [Created] (IGNITE-14452) Add checking of the iptables settings applied.
Vladimir Steshin created IGNITE-14452: - Summary: Add checking of the iptables settings applied. Key: IGNITE-14452 URL: https://issues.apache.org/jira/browse/IGNITE-14452 Project: Ignite Issue Type: Sub-task Reporter: Vladimir Steshin Assignee: Vladimir Steshin Sometimes the iptables settings are missing for an unknown reason. Let's monitor this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-14437) Adjust test params: exclude input net failures with disabled connRecovery
Vladimir Steshin created IGNITE-14437: - Summary: Adjust test params: exclude input net failures with disabled connRecovery Key: IGNITE-14437 URL: https://issues.apache.org/jira/browse/IGNITE-14437 Project: Ignite Issue Type: Sub-task Reporter: Vladimir Steshin Assignee: Vladimir Steshin -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-14378) Remove delay from node ping.
Vladimir Steshin created IGNITE-14378: - Summary: Remove delay from node ping. Key: IGNITE-14378 URL: https://issues.apache.org/jira/browse/IGNITE-14378 Project: Ignite Issue Type: Bug Reporter: Vladimir Steshin Assignee: Vladimir Steshin Remove U.sleep(200) from the node ping. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-14377) Enhance log of node ping failure.
Vladimir Steshin created IGNITE-14377: - Summary: Enhance log of node ping failure. Key: IGNITE-14377 URL: https://issues.apache.org/jira/browse/IGNITE-14377 Project: Ignite Issue Type: Sub-task Reporter: Vladimir Steshin Assignee: Vladimir Steshin The log of an unsuccessful ping during joining is insufficient: no failure reason is logged.
[jira] [Created] (IGNITE-14096) Try to bring randomization in node waiting with TcpDiscoverySpi.reconnectDelay.
Vladimir Steshin created IGNITE-14096: - Summary: Try to bring randomization in node waiting with TcpDiscoverySpi.reconnectDelay. Key: IGNITE-14096 URL: https://issues.apache.org/jira/browse/IGNITE-14096 Project: Ignite Issue Type: Improvement Reporter: Vladimir Steshin Assignee: Vladimir Steshin To speed up cluster start slightly, try to bring randomization into node waiting with TcpDiscoverySpi.reconnectDelay. Check with the ducktape integration tests.
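One possible way to randomize the wait could look like the sketch below. It is only an illustration: the class `JitteredDelay`, its method `next`, and the choice of the range [base / 2, base] are assumptions of this example, not part of TcpDiscoverySpi.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper (not an Ignite class): spreads node waits around a base delay. */
final class JitteredDelay {
    private JitteredDelay() {
    }

    /**
     * Returns a delay picked uniformly from [baseMs / 2, baseMs].
     * The lower bound of half the base delay is an arbitrary choice for this sketch.
     */
    static long next(long baseMs) {
        if (baseMs <= 0)
            return 0;

        long min = baseMs / 2;

        // nextLong(origin, bound) excludes the bound, hence the + 1.
        return ThreadLocalRandom.current().nextLong(min, baseMs + 1);
    }
}
```

With a base delay of 2000 ms, nodes that start simultaneously would wait between 1000 and 2000 ms instead of all retrying at the same moment.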
[jira] [Created] (IGNITE-14095) Try to speed up cluster start in the ducktests by decreasing 'spi.reconnectDelay'
Vladimir Steshin created IGNITE-14095: - Summary: Try to speed up cluster start in the ducktests by decreasing 'spi.reconnectDelay' Key: IGNITE-14095 URL: https://issues.apache.org/jira/browse/IGNITE-14095 Project: Ignite Issue Type: Improvement Reporter: Vladimir Steshin Assignee: Vladimir Steshin
[jira] [Created] (IGNITE-14068) Infinite node persistence in the ring while outgoing connections are lost
Vladimir Steshin created IGNITE-14068: - Summary: Infinite node persistence in the ring while outgoing connections are lost Key: IGNITE-14068 URL: https://issues.apache.org/jira/browse/IGNITE-14068 Project: Ignite Issue Type: Bug Reporter: Vladimir Steshin Assignee: Vladimir Steshin
[jira] [Created] (IGNITE-14054) Improve discovery ducktest: add partial network drop.
Vladimir Steshin created IGNITE-14054: - Summary: Improve discovery ducktest: add partial network drop. Key: IGNITE-14054 URL: https://issues.apache.org/jira/browse/IGNITE-14054 Project: Ignite Issue Type: Sub-task Reporter: Vladimir Steshin Assignee: Vladimir Steshin -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-14053) Remove status check message altogether.
Vladimir Steshin created IGNITE-14053: - Summary: Remove status check message altogether. Key: IGNITE-14053 URL: https://issues.apache.org/jira/browse/IGNITE-14053 Project: Ignite Issue Type: Sub-task Reporter: Vladimir Steshin Assignee: Vladimir Steshin
[jira] [Created] (IGNITE-14038) Separate JVM settings in the ducktests.
Vladimir Steshin created IGNITE-14038: - Summary: Separate JVM settings in the ducktests. Key: IGNITE-14038 URL: https://issues.apache.org/jira/browse/IGNITE-14038 Project: Ignite Issue Type: Sub-task Reporter: Vladimir Steshin Assignee: Vladimir Steshin -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-14037) Separate JVM settings in the ducktests.
Vladimir Steshin created IGNITE-14037: - Summary: Separate JVM settings in the ducktests. Key: IGNITE-14037 URL: https://issues.apache.org/jira/browse/IGNITE-14037 Project: Ignite Issue Type: Improvement Reporter: Vladimir Steshin Assignee: Vladimir Steshin -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-13980) Remove duplicated ping: status check message.
Vladimir Steshin created IGNITE-13980: - Summary: Remove duplicated ping: status check message. Key: IGNITE-13980 URL: https://issues.apache.org/jira/browse/IGNITE-13980 Project: Ignite Issue Type: Improvement Reporter: Vladimir Steshin Assignee: Vladimir Steshin -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-13835) Improve discovery ducktape test to research small timeouts and behavior on large cluster.
Vladimir Steshin created IGNITE-13835: - Summary: Improve discovery ducktape test to research small timeouts and behavior on large cluster. Key: IGNITE-13835 URL: https://issues.apache.org/jira/browse/IGNITE-13835 Project: Ignite Issue Type: Improvement Reporter: Vladimir Steshin Assignee: Vladimir Steshin Improve discovery ducktape test to research the cluster behavior with bigger node number and smaller timeouts. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-13705) Fix failure of the middle node when its next and previous nodes fail.
Vladimir Steshin created IGNITE-13705: - Summary: Fix failure of the middle node when its next and previous nodes fail. Key: IGNITE-13705 URL: https://issues.apache.org/jira/browse/IGNITE-13705 Project: Ignite Issue Type: Bug Reporter: Vladimir Steshin Assignee: Vladimir Steshin
The discovery ducktape test has detected failure of a third node located between 2 simultaneously failed nodes. Initial research points to a problem in backward connection checking: the next node has checked itself:
[2020-11-13 14:50:44,463][INFO ][tcp-disco-sock-reader-[47cc6f70 10.53.125.224:35381]-#7-#79][org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi1] Connection check done [liveAddr=tkles-pprb00188.vm.esrt.cloud.sbrf.ru/10.53.125.160:47500, previousNode=TcpDiscoveryNode [id=8331a61c-ea93-4bf5-bc8c-b24c032068d0, consistentId=tkles-pprb00188.vm.esrt.cloud.sbrf.ru, addrs=ArrayList [10.53.125.160], sockAddrs=HashSet [tkles-pprb00188.vm.esrt.cloud.sbrf.ru/10.53.125.160:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1605268203598, loc=false, ver=2.10.0#20201113-sha1:, isClient=false], addressesToCheck=[tkles-pprb00188.vm.esrt.cloud.sbrf.ru/10.53.125.160:47500], connectingNodeId=47cc6f70-9fe4-437d-b183-826f2687aac8]
[jira] [Created] (IGNITE-13704) Try failureDetectionTimeout==500 in ducktape integration test.
Vladimir Steshin created IGNITE-13704: - Summary: Try failureDetectionTimeout==500 in ducktape integration test. Key: IGNITE-13704 URL: https://issues.apache.org/jira/browse/IGNITE-13704 Project: Ignite Issue Type: Sub-task Reporter: Vladimir Steshin Assignee: Vladimir Steshin Try failureDetectionTimeout==500 in ducktape integration test.
[jira] [Created] (IGNITE-13702) Fix description of soLinger for DiscoveryTcpSpi.
Vladimir Steshin created IGNITE-13702: - Summary: Fix description of soLinger for DiscoveryTcpSpi. Key: IGNITE-13702 URL: https://issues.apache.org/jira/browse/IGNITE-13702 Project: Ignite Issue Type: Improvement Components: documentation Affects Versions: 2.10 Reporter: Vladimir Steshin Assignee: Vladimir Steshin Fix For: 2.10 Fix description of soLinger for DiscoveryTcpSpi.
[jira] [Created] (IGNITE-13695) Move javadoc on the effect of several addresses on failure detection.
Vladimir Steshin created IGNITE-13695: - Summary: Move javadoc on the effect of several addresses on failure detection. Key: IGNITE-13695 URL: https://issues.apache.org/jira/browse/IGNITE-13695 Project: Ignite Issue Type: Bug Reporter: Vladimir Steshin Assignee: Vladimir Steshin The current javadoc describing how several node addresses affect failure detection is located under `TcpDiscoverySpi.setIpFinder()`. The correct place is `TcpDiscoverySpi.setLocalAddress()`. Perhaps, the test might be slightly changed.
[jira] [Created] (IGNITE-13666) Disable socket linger in discovery ducktape test.
Vladimir Steshin created IGNITE-13666: - Summary: Disable socket linger in discovery ducktape test. Key: IGNITE-13666 URL: https://issues.apache.org/jira/browse/IGNITE-13666 Project: Ignite Issue Type: Improvement Reporter: Vladimir Steshin Assignee: Vladimir Steshin soLinger might be disabled to speed up the discovery tests. Additionally, we could reduce failureDetectionTimeout.
[jira] [Created] (IGNITE-13663) Represent in the documentation the effect of several node addresses on failure detection v2.
Vladimir Steshin created IGNITE-13663: - Summary: Represent in the documentation the effect of several node addresses on failure detection v2. Key: IGNITE-13663 URL: https://issues.apache.org/jira/browse/IGNITE-13663 Project: Ignite Issue Type: Improvement Components: documentation Affects Versions: 2.9 Reporter: Vladimir Steshin Assignee: Vladimir Steshin Fix For: 2.10
[jira] [Created] (IGNITE-13662) Describe soLinger setting in TCP Discovery and SSL issues.
Vladimir Steshin created IGNITE-13662: - Summary: Describe soLinger setting in TCP Discovery and SSL issues. Key: IGNITE-13662 URL: https://issues.apache.org/jira/browse/IGNITE-13662 Project: Ignite Issue Type: Improvement Reporter: Vladimir Steshin Assignee: Vladimir Steshin Describe soLinger setting in TCP Discovery and SSL issues.
[jira] [Created] (IGNITE-13646) Discovery ducktape test might have setting for socket linger.
Vladimir Steshin created IGNITE-13646: - Summary: Discovery ducktape test might have setting for socket linger. Key: IGNITE-13646 URL: https://issues.apache.org/jira/browse/IGNITE-13646 Project: Ignite Issue Type: Bug Reporter: Vladimir Steshin Assignee: Vladimir Steshin Since IGNITE-13643, discovery ducktape test might have additional setting for socket linger. This could unveil new issues with the linger and start fixing or redeeming tcp discovery settings. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-13645) Discovery ducktape test should detect failed nodes by asking the cluster.
Vladimir Steshin created IGNITE-13645: - Summary: Discovery ducktape test should detect failed nodes by asking the cluster. Key: IGNITE-13645 URL: https://issues.apache.org/jira/browse/IGNITE-13645 Project: Ignite Issue Type: Bug Reporter: Vladimir Steshin Assignee: Vladimir Steshin The discovery ducktape test should measure the detection time of failed nodes by asking the whole rest of the cluster. Currently, we measure it by asking only one watching node.
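A minimal sketch of the proposed measurement, assuming each surviving node reports the time at which it noticed the failure: the cluster-wide detection time is then the slowest of those reports. The class `ClusterDetectionTime` and the map of node names to milliseconds are hypothetical and only illustrate the idea; the actual ducktape test is written differently.

```java
import java.util.Collections;
import java.util.Map;

/** Hypothetical helper: combines per-node detection times into one cluster-wide figure. */
final class ClusterDetectionTime {
    private ClusterDetectionTime() {
    }

    /**
     * @param detectionMsByNode Detection delay in milliseconds reported by each surviving node.
     * @return The largest delay, i.e. the moment the whole rest of the cluster knew about the failure.
     */
    static long clusterWideMs(Map<String, Long> detectionMsByNode) {
        if (detectionMsByNode.isEmpty())
            throw new IllegalArgumentException("No surviving nodes reported a detection time");

        return Collections.max(detectionMsByNode.values());
    }
}
```

Asking a single watching node corresponds to reading one entry of this map, which can understate the delay when another node notices the failure later.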
[jira] [Created] (IGNITE-13644) Close socket bravely.
Vladimir Steshin created IGNITE-13644: - Summary: Close socket bravely. Key: IGNITE-13644 URL: https://issues.apache.org/jira/browse/IGNITE-13644 Project: Ignite Issue Type: Bug Reporter: Vladimir Steshin Assignee: Vladimir Steshin We should not wait for socket closing once we have finished the logical connection and data exchange. Waiting can violate the configured timeouts and detection guarantees.
[jira] [Created] (IGNITE-13643) Fix long closing of the socket in ServerImpl (TcpDiscoverySpi)
Vladimir Steshin created IGNITE-13643: - Summary: Fix long closing of the socket in ServerImpl (TcpDiscoverySpi) Key: IGNITE-13643 URL: https://issues.apache.org/jira/browse/IGNITE-13643 Project: Ignite Issue Type: Bug Reporter: Vladimir Steshin Assignee: Vladimir Steshin
Current IgniteUtils.closeQuiet(@Nullable Socket sock) takes about 5 sec to close a socket. Probably this is the default soTimeout. This violates node failure detection: although we set failureDetectionTimeout == 1000, node failure is detected within 6.5 secs on average. Logging shows the delay on socket closing in IgniteUtils.closeQuiet(@Nullable Socket sock).
Suggestion: use forced closing, set soLinger=0, do not wait for the rest of the socket IO. We close the socket in TcpDiscoverySpi when we have already waited for the target timeouts and consider the connection lost or invalid. We do not need to wait for any traffic on the socket any more.
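The suggested forced closing can be sketched with the standard `java.net.Socket` API as below. The class `ForcedClose` and its method `closeNow` are hypothetical names for this illustration and do not show how IgniteUtils.closeQuiet() is actually implemented.

```java
import java.io.IOException;
import java.net.Socket;

/** Hypothetical helper: closes a socket without waiting for pending IO. */
final class ForcedClose {
    private ForcedClose() {
    }

    /** Closes the socket immediately; errors are swallowed, as in a "quiet" close. */
    static void closeNow(Socket sock) {
        if (sock == null)
            return;

        try {
            // SO_LINGER enabled with a zero timeout: close() does not wait for unsent data,
            // the connection is aborted instead of being shut down gracefully.
            sock.setSoLinger(true, 0);
        }
        catch (IOException ignored) {
            // The socket may already be closed or broken; closing is still attempted below.
        }

        try {
            sock.close();
        }
        catch (IOException ignored) {
            // Nothing more can be done for a connection that is already considered lost.
        }
    }
}
```

The trade-off is that unsent data is discarded, which matches the reasoning above: the socket is closed only after the connection is already considered lost or invalid.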
[jira] [Created] (IGNITE-13641) More logs for debugging DiscoveryTcpSpi
Vladimir Steshin created IGNITE-13641: - Summary: More logs for debugging DiscoveryTcpSpi Key: IGNITE-13641 URL: https://issues.apache.org/jira/browse/IGNITE-13641 Project: Ignite Issue Type: Improvement Reporter: Vladimir Steshin Assignee: Vladimir Steshin Logs in DiscoveryTcp (ServerImpl) are insufficient. We do not see the actual timeouts passed to sockets. It's difficult to understand why the timeouts and waits that happened are what they are.
[jira] [Created] (IGNITE-13638) Bring log config to ducktape tests
Vladimir Steshin created IGNITE-13638: - Summary: Bring log config to ducktape tests Key: IGNITE-13638 URL: https://issues.apache.org/jira/browse/IGNITE-13638 Project: Ignite Issue Type: Improvement Reporter: Vladimir Steshin Assignee: Vladimir Steshin -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-13625) Make network timeout rely on failureDetectionTimeout in TcpDiscovery
Vladimir Steshin created IGNITE-13625: - Summary: Make network timeout rely on failureDetectionTimeout in TcpDiscovery Key: IGNITE-13625 URL: https://issues.apache.org/jira/browse/IGNITE-13625 Project: Ignite Issue Type: Improvement Reporter: Vladimir Steshin Assignee: Vladimir Steshin -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-13620) Bind ignite node to 1 address in the ducktests
Vladimir Steshin created IGNITE-13620: - Summary: Bind ignite node to 1 address in the ducktests Key: IGNITE-13620 URL: https://issues.apache.org/jira/browse/IGNITE-13620 Project: Ignite Issue Type: Improvement Reporter: Vladimir Steshin Assignee: Vladimir Steshin -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-13603) TcpDiscoverySpi seems not to drop network recovery state and its timer.
Vladimir Steshin created IGNITE-13603: - Summary: TcpDiscoverySpi seems not to drop network recovery state and its timer. Key: IGNITE-13603 URL: https://issues.apache.org/jira/browse/IGNITE-13603 Project: Ignite Issue Type: Improvement Reporter: Vladimir Steshin Assignee: Vladimir Steshin ServerImpl keeps sndState (CrossRingMessageSendState) in its message send cycle. Once created with a failure recovery timer, it is not cleared or refreshed any more. This may cause an instant timeout on the next send failure.
[jira] [Created] (IGNITE-13602) Create discovery node failure test based on network malfunction emulation.
Vladimir Steshin created IGNITE-13602: - Summary: Create discovery node failure test based on network malfunction emulation. Key: IGNITE-13602 URL: https://issues.apache.org/jira/browse/IGNITE-13602 Project: Ignite Issue Type: Task Reporter: Vladimir Steshin Assignee: Vladimir Steshin -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-13282) Fix TcpDiscoveryCoordinatorFailureTest.testClusterFailedNewCoordinatorInitialized()
Vladimir Steshin created IGNITE-13282: - Summary: Fix TcpDiscoveryCoordinatorFailureTest.testClusterFailedNewCoordinatorInitialized() Key: IGNITE-13282 URL: https://issues.apache.org/jira/browse/IGNITE-13282 Project: Ignite Issue Type: Bug Reporter: Vladimir Steshin Assignee: Vladimir Steshin -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-13208) Refactoring of IgniteSpiOperationTimeoutHelper
Vladimir Steshin created IGNITE-13208: - Summary: Refactoring of IgniteSpiOperationTimeoutHelper Key: IGNITE-13208 URL: https://issues.apache.org/jira/browse/IGNITE-13208 Project: Ignite Issue Type: Task Reporter: Vladimir Steshin Assignee: Vladimir Steshin IgniteSpiOperationTimeoutHelper has many timeout fields. It looks like it could be simplified.
[jira] [Created] (IGNITE-13206) Represent in the doc the effect of several node addresses on failure detection.
Vladimir Steshin created IGNITE-13206: - Summary: Represent in the doc the effect of several node addresses on failure detection. Key: IGNITE-13206 URL: https://issues.apache.org/jira/browse/IGNITE-13206 Project: Ignite Issue Type: Improvement Reporter: Vladimir Steshin Assignee: Vladimir Steshin
[jira] [Created] (IGNITE-13205) Represent in logs, javadoc the effect of several node addresses on failure detection.
Vladimir Steshin created IGNITE-13205: - Summary: Represent in logs, javadoc the effect of several node addresses on failure detection. Key: IGNITE-13205 URL: https://issues.apache.org/jira/browse/IGNITE-13205 Project: Ignite Issue Type: Improvement Reporter: Vladimir Steshin Assignee: Vladimir Steshin Current TcpDiscoverySpi can prolong failure detection for a node which has several IP addresses. This happens because most of the timeouts, like failureDetectionTimeout, sockTimeout and ackTimeout, work per address, and the node addresses are tried one after another. This effect on failure detection should be noted in logs and javadocs.
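The arithmetic behind this effect can be sketched as follows. This is a simplified upper-bound model that assumes every address consumes the full per-address timeout; the class `MultiAddressDetection` is hypothetical and does not reproduce the real TcpDiscoverySpi logic.

```java
/** Hypothetical model: timeouts applied per address, addresses tried serially. */
final class MultiAddressDetection {
    private MultiAddressDetection() {
    }

    /**
     * @param addrCnt Number of addresses the failed node is known by.
     * @param perAddrTimeoutMs Timeout spent on each address before moving to the next one.
     * @return Upper bound of the detection delay in milliseconds under this model.
     */
    static long worstCaseMs(int addrCnt, long perAddrTimeoutMs) {
        if (addrCnt < 1 || perAddrTimeoutMs < 0)
            throw new IllegalArgumentException("addrCnt must be >= 1 and the timeout non-negative");

        return Math.multiplyExact((long)addrCnt, perAddrTimeoutMs);
    }
}
```

Under this model a node with 3 addresses and a 1000 ms per-address timeout could take up to 3000 ms to be detected as failed, rather than the 1000 ms a user might expect.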
[jira] [Created] (IGNITE-13194) Fix testNodeWithIncompatibleMetadataIsProhibitedToJoinTheCluster()
Vladimir Steshin created IGNITE-13194: - Summary: Fix testNodeWithIncompatibleMetadataIsProhibitedToJoinTheCluster() Key: IGNITE-13194 URL: https://issues.apache.org/jira/browse/IGNITE-13194 Project: Ignite Issue Type: Bug Reporter: Vladimir Steshin Assignee: Vladimir Steshin -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-13134) Fix connection recovery timeout.
Vladimir Steshin created IGNITE-13134: - Summary: Fix connection recovery timeout. Key: IGNITE-13134 URL: https://issues.apache.org/jira/browse/IGNITE-13134 Project: Ignite Issue Type: Improvement Affects Versions: 2.8.1 Reporter: Vladimir Steshin Assignee: Vladimir Steshin If a node experiences connection issues, it must establish a new connection or fail within failureDetectionTimeout + connectionRecoveryTimeout.
[jira] [Created] (IGNITE-13111) Simplify backward checking of node connection.
Vladimir Steshin created IGNITE-13111: - Summary: Simplify backward checking of node connection. Key: IGNITE-13111 URL: https://issues.apache.org/jira/browse/IGNITE-13111 Project: Ignite Issue Type: Improvement Reporter: Vladimir Steshin Assignee: Vladimir Steshin
We should fix several drawbacks in the backward checking of a failed node. They prolong node failure detection up to: ServerImpl.CON_CHECK_INTERVAL + 2 * IgniteConfiguration.failureDetectionTimeout + 300ms. See:
* '_NodeFailureResearch.patch_'. It creates the test 'FailureDetectionResearch', which emulates long answers from a failed node and measures failure detection delays.
* '_FailureDetectionResearch.txt_' - results of the test.
* '_FailureDetectionResearch_fixed.txt_' - results of the test after this fix.
* '_WostCaseStepByStep.txt_' - description of how the worst case happens.
*Suggestion:*
1) We can simplify backward connection checking as we implement IGNITE-13012. Once we get a robust, predictable connection ping, we don't need to check the previous node, because we can see whether it sent a ping to the current node within the failure detection timeout. If not, the previous node can be considered lost. Instead of:
{code:java}
// Node cannot connect to its next (for local node it's previous).
// Need to check connectivity to it.
long rcvdTime = lastRingMsgReceivedTime;
long now = U.currentTimeMillis();
// We got message from previous in less than double connection check interval.
boolean ok = rcvdTime + effectiveExchangeTimeout() >= now;
TcpDiscoveryNode previous = null;
if (ok) {
    // Check case when previous node suddenly died. This will speed up
    // node failing.
    // Checking connection to previous node
}
{code}
2) Then, it seems we can remove:
{code:java}
ServerImpl.SocketReader.isConnectionRefused(SocketAddress addr);
{code}
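The simplified check from suggestion 1 might be sketched like this: the previous node is considered alive only while its last message is no older than the failure detection timeout. The class `PreviousNodeLiveness` is hypothetical; timestamps are passed in explicitly (a caller would supply `System.nanoTime()`), which is a choice made here to keep the sketch testable, not something taken from ServerImpl.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical tracker: decides liveness of the previous node from its last received message. */
final class PreviousNodeLiveness {
    private final long failureDetectionTimeoutNanos;

    private volatile long lastMsgNanos;

    PreviousNodeLiveness(long failureDetectionTimeoutMs, long nowNanos) {
        failureDetectionTimeoutNanos = TimeUnit.MILLISECONDS.toNanos(failureDetectionTimeoutMs);
        lastMsgNanos = nowNanos;
    }

    /** Must be called for every message (including pings) received from the previous node. */
    void onMessageReceived(long nowNanos) {
        lastMsgNanos = nowNanos;
    }

    /** @return {@code true} if the previous node was heard from within the failure detection timeout. */
    boolean isAlive(long nowNanos) {
        return nowNanos - lastMsgNanos <= failureDetectionTimeoutNanos;
    }
}
```

No extra connection attempt to the previous node is made in this model; the decision relies only on traffic that the ring already produces.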
[jira] [Created] (IGNITE-13090) Add parameter of connection check period to TcpDiscoverySpi
Vladimir Steshin created IGNITE-13090: - Summary: Add parameter of connection check period to TcpDiscoverySpi Key: IGNITE-13090 URL: https://issues.apache.org/jira/browse/IGNITE-13090 Project: Ignite Issue Type: Improvement Reporter: Vladimir Steshin Assignee: Vladimir Steshin We should add parameter of connection check period to TcpDiscoverySpi. If it isn't automatically set by IgniteConfiguration.setFailureDetectionTimeout(), user should be able to adjust it. Similar params: {code:java} TcpDiscoverySpi.setReconnectCount() TcpDiscoverySpi.setAckTimeout() TcpDiscoverySpi.setSocketTimeout() {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-13040) Remove unused parameter from TcpDiscoverySpi.writeToSocket()
Vladimir Steshin created IGNITE-13040: - Summary: Remove unused parameter from TcpDiscoverySpi.writeToSocket() Key: IGNITE-13040 URL: https://issues.apache.org/jira/browse/IGNITE-13040 Project: Ignite Issue Type: Task Environment: Unused parameter {code:java}TcpDiscoveryAbstractMessage msg{code} should be removed from {code:java} TcpDiscovery.writeToSocket(Socket sock, TcpDiscoveryAbstractMessage msg, byte[] data, long timeout){code} This method seems to send raw data, not a message. Reporter: Vladimir Steshin -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-13018) Get rid of duplicated checking of failed node.
Vladimir Steshin created IGNITE-13018: - Summary: Get rid of duplicated checking of failed node. Key: IGNITE-13018 URL: https://issues.apache.org/jira/browse/IGNITE-13018 Project: Ignite Issue Type: Sub-task Reporter: Vladimir Steshin Assignee: Vladimir Steshin Failed node checking should be simplified to one step: ping node (send a message) from previous one in the ring and wait for response within IgniteConfiguration.failureDetectionTimeout. If node doesn't respond, we should consider it failed. Extra steps of connection checking may seriously delay failure detection, bring confusion and weird behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-13017) Remove delay of 200ms from re-marking failed node as alive.
Vladimir Steshin created IGNITE-13017: - Summary: Remove delay of 200ms from re-marking failed node as alive. Key: IGNITE-13017 URL: https://issues.apache.org/jira/browse/IGNITE-13017 Project: Ignite Issue Type: Sub-task Reporter: Vladimir Steshin Assignee: Vladimir Steshin
We should remove hardcoded timeout from:
{code:java}
boolean ServerImpl.CrossRingMessageSendState.markLastFailedNodeAlive() {
    if (state == RingMessageSendState.FORWARD_PASS || state == RingMessageSendState.BACKWARD_PASS) {
        ...
        if (--failedNodes <= 0) {
            ...
            state = RingMessageSendState.STARTING_POINT;
            try {
                Thread.sleep(200);
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return true;
    }
    return false;
}
{code}
This can bring additional 200ms to duration of failed node detection.
[jira] [Created] (IGNITE-13016) Remove hardcoded values/timeouts from backward checking of failed node.
Vladimir Steshin created IGNITE-13016: - Summary: Remove hardcoded values/timeouts from backward checking of failed node. Key: IGNITE-13016 URL: https://issues.apache.org/jira/browse/IGNITE-13016 Project: Ignite Issue Type: Sub-task Reporter: Vladimir Steshin Assignee: Vladimir Steshin
Backward checking of a failed node relies on a hardcoded timeout of 100ms:
{code:java}
private boolean ServerImpl.isConnectionRefused(SocketAddress addr) {
    try (Socket sock = new Socket()) {
        sock.connect(addr, 100);
    }
    catch (ConnectException e) {
        return true;
    }
    catch (IOException e) {
        return false;
    }
    return false;
}
{code}
We should make it bound to configurable params like IgniteConfiguration.failureDetectionTimeout.
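A standalone variant of the check with the timeout passed in rather than hardcoded could look like the sketch below. The class `ConnectionProbe` and the `timeoutMs` parameter are assumptions of this example; how the value would be derived from IgniteConfiguration.failureDetectionTimeout is left open.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.Socket;
import java.net.SocketAddress;

/** Hypothetical standalone version of the "connection refused" probe with a configurable timeout. */
final class ConnectionProbe {
    private ConnectionProbe() {
    }

    /**
     * @param addr Address to probe.
     * @param timeoutMs Connect timeout in milliseconds (0 means wait indefinitely).
     * @return {@code true} only if the remote side actively refused the connection.
     */
    static boolean isConnectionRefused(SocketAddress addr, int timeoutMs) {
        try (Socket sock = new Socket()) {
            sock.connect(addr, timeoutMs);

            return false;
        }
        catch (ConnectException e) {
            return true;
        }
        catch (IOException e) {
            // As in the original snippet, any other failure (including a timeout) is not a refusal.
            return false;
        }
    }
}
```

Note that a connect timeout surfaces as `SocketTimeoutException`, which is not a `ConnectException`, so a silent node is still reported as "not refused" here, exactly as in the original code.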
[jira] [Created] (IGNITE-13015) Use nano time instead of currentTimeMillis() in node failure detection.
Vladimir Steshin created IGNITE-13015: - Summary: Use nano time instead of currentTimeMillis() in node failure detection. Key: IGNITE-13015 URL: https://issues.apache.org/jira/browse/IGNITE-13015 Project: Ignite Issue Type: Improvement Reporter: Vladimir Steshin Assignee: Vladimir Steshin
Make sure that node failure detection does not use:
{code:java}
System.currentTimeMillis() and IgniteUtils.currentTimeMillis()
{code}
Disadvantages:
1) Current system time has no guarantee of strictly forward movement. System time can be adjusted, for example synchronized by NTP. This can lead to incorrect and negative delays.
2) IgniteUtils.currentTimeMillis() has a granularity of 10ms.
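A minimal sketch of timeout tracking on top of `System.nanoTime()`, which is meant for measuring elapsed time and is unaffected by wall-clock adjustments. The class `MonotonicDeadline` is a hypothetical name used only for this illustration.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical timeout tracker based on the monotonic clock. */
final class MonotonicDeadline {
    private final long startNanos;

    private final long timeoutNanos;

    /** Starts the countdown now. */
    MonotonicDeadline(long timeoutMs) {
        this(timeoutMs, System.nanoTime());
    }

    /** Starts the countdown at the given {@code System.nanoTime()} reading. */
    MonotonicDeadline(long timeoutMs, long startNanos) {
        this.timeoutNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        this.startNanos = startNanos;
    }

    /** @return {@code true} if the timeout has elapsed by now. */
    boolean expired() {
        return expired(System.nanoTime());
    }

    /** @return {@code true} if the timeout has elapsed at the given reading. */
    boolean expired(long nowNanos) {
        // Compare the difference, never the raw values: nanoTime readings may wrap around.
        return nowNanos - startNanos >= timeoutNanos;
    }
}
```

Comparing `nowNanos - startNanos` against the timeout, rather than `nowNanos >= startNanos + timeoutNanos`, keeps the check correct even if the counter overflows between the two readings.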
[jira] [Created] (IGNITE-13014) Remove long, double checking of node availability. Fix hardcoded values.
Vladimir Steshin created IGNITE-13014: - Summary: Remove long, double checking of node availability. Fix hardcoded values. Key: IGNITE-13014 URL: https://issues.apache.org/jira/browse/IGNITE-13014 Project: Ignite Issue Type: Improvement Reporter: Vladimir Steshin Assignee: Vladimir Steshin
At present, we have duplicated checking of node availability. This prolongs node failure detection and gives no additional benefits. There are a mess and hardcoded values in this routine.
Let's imagine node 2 doesn't answer any more. Node 1 becomes unable to ping node 2 and asks node 3 to establish a permanent connection instead of node 2. Although node 2 has already been pinged within the configured timeouts, node 3 tries to connect to node 2 too.
Disadvantages:
1) Possible long detection of node failure, up to ServerImpl.CON_CHECK_INTERVAL + 2 * IgniteConfiguration.failureDetectionTimeout + 300ms. See 'WostCase.txt'.
2) Unexpected, non-configurable decision to check availability of the previous node based on '2 * ServerImpl.CON_CHECK_INTERVAL':
    // We got message from previous in less than double connection check interval.
    boolean ok = rcvdTime + CON_CHECK_INTERVAL * 2 >= now;
If 'ok == true', node 3 checks node 2.
3) Double node checking brings several non-configurable hardcoded delays. Node 3 checks node 2 with a hardcoded timeout of 100ms:
    ServerImpl.isConnectionRefused(): sock.connect(addr, 100);
Checking availability of the previous node treats any exception other than ConnectException (connection refused) as an existing connection. Even a timeout. See ServerImpl.isConnectionRefused().
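To make the worst-case bound from item 1 concrete, the formula can be evaluated as in the sketch below. The class `WorstCaseDetection` is hypothetical; the value 500 for CON_CHECK_INTERVAL is the one quoted in IGNITE-13012, and 300 ms is the constant term of the formula as stated in the issue.

```java
/** Hypothetical calculator for the worst-case bound quoted in the issue. */
final class WorstCaseDetection {
    /** Value of ServerImpl.CON_CHECK_INTERVAL as quoted in IGNITE-13012. */
    static final long CON_CHECK_INTERVAL_MS = 500;

    /** Constant term of the formula given in the issue description. */
    static final long HARDCODED_EXTRA_MS = 300;

    private WorstCaseDetection() {
    }

    /** @return CON_CHECK_INTERVAL + 2 * failureDetectionTimeout + 300ms, in milliseconds. */
    static long worstCaseMs(long failureDetectionTimeoutMs) {
        return CON_CHECK_INTERVAL_MS + 2 * failureDetectionTimeoutMs + HARDCODED_EXTRA_MS;
    }
}
```

For failureDetectionTimeout == 1000 the bound is 500 + 2 * 1000 + 300 = 2800 ms, almost three times the configured timeout.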
[jira] [Created] (IGNITE-13012) Make node connection checking rely on the configuration. Simplify node ping routine.
Vladimir Steshin created IGNITE-13012: - Summary: Make node connection checking rely on the configuration. Simplify node ping routine. Key: IGNITE-13012 URL: https://issues.apache.org/jira/browse/IGNITE-13012 Project: Ignite Issue Type: Improvement Reporter: Vladimir Steshin Assignee: Vladimir Steshin
Current node-to-node connection checking has several drawbacks:
1) The minimal connection checking interval is not bound to failure detection parameters: static int ServerImpl.CON_CHECK_INTERVAL = 500;
2) Connection checking is implemented as periodical message sending (TcpDiscoveryConnectionCheckMessage). It is bound to its own time (ServerImpl.RingMessageWorker.lastTimeConnCheckMsgSent), not to the common time of the last sent message. This is weird because any discovery message actually checks the connection, and TcpDiscoveryConnectionCheckMessage is just an addition for when the message queue is empty for a long time.
3) The period of node-to-node connection checking can sometimes be shortened for a strange reason: if no sent or received message appears within failureDetectionTimeout. Here, although we have a minimal period of connection checking (ServerImpl.CON_CHECK_INTERVAL), we can also send TcpDiscoveryConnectionCheckMessage before this period is exhausted. Moreover, this premature node ping also relies on the time of the last received message. Imagine: if node 2 receives no message from node 1 within some time, it decides to do an extra ping of node 3 without waiting for the regular ping interval. Such behavior causes confusion and gives no additional guarantees.
4) If #3 happens, the node writes to the log on INFO: "Local node seems to be disconnected from topology ..." whereas it is not actually disconnected. A user can see this message if they set failureDetectionTimeout < 500ms. I wouldn't like to see INFO in a log saying a node might be disconnected. This sounds like some trouble has arisen in the network, not like everything is OK.
Suggestions:
1) Make the connection check interval be based on failureDetectionTimeout or similar params.
2) Make the connection check interval rely on the common time of the last sent message, not on a dedicated time.
3) Remove additional, random, quickened connection checking.
4) Do not worry the user with "Node disconnected" when everything is OK.
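Suggestions 1 and 2 might be sketched as follows: the check interval is derived from failureDetectionTimeout, and a dedicated check message is due only when no message of any kind was sent during that interval. The class `ConnectionCheckSchedule` is hypothetical, and the divisor 4 is an arbitrary assumption of this sketch, not a value proposed in the issue.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical scheduler for connection check messages. */
final class ConnectionCheckSchedule {
    private final long intervalNanos;

    private long lastSentNanos;

    ConnectionCheckSchedule(long failureDetectionTimeoutMs, long nowNanos) {
        // Assumption of this sketch: check four times per failure detection timeout, at least every 1 ms.
        long intervalMs = Math.max(1, failureDetectionTimeoutMs / 4);

        intervalNanos = TimeUnit.MILLISECONDS.toNanos(intervalMs);
        lastSentNanos = nowNanos;
    }

    /** Must be called after any discovery message is sent, not only after check messages. */
    void onAnyMessageSent(long nowNanos) {
        lastSentNanos = nowNanos;
    }

    /** @return {@code true} if nothing was sent for a whole interval and a check message is due. */
    boolean shouldSendCheck(long nowNanos) {
        return nowNanos - lastSentNanos >= intervalNanos;
    }
}
```

Because every sent message resets the timer, regular discovery traffic suppresses the dedicated check messages, and no time of the last received message is involved in the decision.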
[jira] [Created] (IGNITE-12779) Split Ignite and IgniteMXBean, make different behavior of the active(boolean)
Vladimir Steshin created IGNITE-12779: - Summary: Split Ignite and IgniteMXBean, make different behavior of the active(boolean) Key: IGNITE-12779 URL: https://issues.apache.org/jira/browse/IGNITE-12779 Project: Ignite Issue Type: Sub-task Reporter: Vladimir Steshin Assignee: Vladimir Steshin
To allow cluster deactivation through JMX without sudden erasure of in-memory data, we should:
1) Add _IgniteMXBean#state(String state, boolean force)_.
2) Let _IgniteMXBean#state(String state)_ and _IgniteMXBean#active(boolean active)_ fail when deactivating a cluster with in-memory data.
3) Separate the implementations of _Ignite_ and _IgniteMXBean_ from _IgniteKernal_. They have the same method _void active(boolean active)_, which requires different behavior. _Ignite#active(boolean active)_ should not fail when deactivating a cluster with in-memory data.
[jira] [Created] (IGNITE-12773) Reduce number of cluster deactivation methods in internal API.
Vladimir Steshin created IGNITE-12773: - Summary: Reduce number of cluster deactivation methods in internal API. Key: IGNITE-12773 URL: https://issues.apache.org/jira/browse/IGNITE-12773 Project: Ignite Issue Type: Improvement Reporter: Vladimir Steshin Assignee: Vladimir Steshin
To reduce number of cluster deactivation methods in internal API we might:
1. Remove GridClientClusterState#active()
2. Remove GridClientClusterState#active(boolean active)
3. Remove IGridClusterStateProcessor#changeGlobalState( boolean activate, Collection baselineNodes, boolean forceChangeBaselineTopology )
4. Remove GridClusterStateProcessor#changeGlobalState( final boolean activate, Collection baselineNodes, boolean forceChangeBaselineTopology, boolean isAutoAdjust )
5. Remove GridClusterStateProcessor#changeGlobalState( final boolean activate, Collection baselineNodes, boolean forceChangeBaselineTopology )
6. Remove GridClusterStateProcessor#changeGlobalState( ClusterState state, boolean forceDeactivation, Collection baselineNodes, boolean forceChangeBaselineTopology )
7. Add boolean isAutoAdjust to IGridClusterStateProcessor#changeGlobalState( ClusterState state, boolean forceDeactivation, Collection baselineNodes, boolean forceChangeBaselineTopology, /* here */ boolean isAutoAdjust /* here */ )
8. Add @Override to /* here */ @Override /* here */ GridClusterStateProcessor#changeGlobalState( ClusterState state, boolean forceDeactivation, Collection baselineNodes, boolean forceChangeBaselineTopology )
[jira] [Created] (IGNITE-12704) Fail of recognition of default scheme in SQL queries.
Vladimir Steshin created IGNITE-12704: - Summary: Fail of recognition of default scheme in SQL queries. Key: IGNITE-12704 URL: https://issues.apache.org/jira/browse/IGNITE-12704 Project: Ignite Issue Type: Bug Reporter: Vladimir Steshin Given a connection:
Connection conn = ...;
// execute() is just a helper function: it creates a prepared statement, passes the params, and so on.
// Get all the tables.
List<List<?>> lst = execute(conn, "select SCHEMA_NAME, TABLE_NAME from SYS.TABLES");
for (List<?> row : lst) {
    String schemaName = (String)row.get(0);
    String tableName = (String)row.get(1);
    // Shows: "schema: default, table: PERSON"
    System.out.println("schema: " + schemaName + ", table: " + tableName);
    // Fails with: java.sql.SQLException: Failed to parse query. Schema "DEFAULT" not found
    execute(conn, "drop table " + schemaName + "." + tableName);
}
I think this case should fail with an error like "only cache created tables can be removed with drop table", not with "schema not found". The SQL engine is supposed to accept and understand the values it returns itself. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-12701) Disallow silent deactivation in CLI and REST.
Vladimir Steshin created IGNITE-12701: - Summary: Disallow silent deactivation in CLI and REST. Key: IGNITE-12701 URL: https://issues.apache.org/jira/browse/IGNITE-12701 Project: Ignite Issue Type: Sub-task Reporter: Vladimir Steshin Assignee: Vladimir Steshin Disallow silent deactivation through the CLI and REST. Skip JMX call {code:java} void IgniteMXBean#active(boolean active) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-12614) Disallow silent deactivation of cluster to prevent in-mem data loss.
Vladimir Steshin created IGNITE-12614: - Summary: Disallow silent deactivation of cluster to prevent in-mem data loss. Key: IGNITE-12614 URL: https://issues.apache.org/jira/browse/IGNITE-12614 Project: Ignite Issue Type: Bug Reporter: Vladimir Steshin Currently, anyone can deactivate a cluster with the command-line utility (control.sh), and probably through JMX too. This leads to data loss when persistence is off, because in-memory data is erased during deactivation. Users are unlikely to expect this behavior. Suggestions: 1) Disallow silent deactivation of a cluster that holds in-memory caches. Show a warning like “Your cluster has in-memory cache configured. During deactivation all data from these caches will be cleared!” 2) Add a ‘--force’ parameter that skips the warning message. -- This message was sent by Atlassian Jira (v8.3.4#803005)
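The two suggestions in IGNITE-12614 amount to a simple guard in front of the deactivation command. The following is a minimal sketch under assumptions: DeactivateCommandSketch and its methods are hypothetical and do not reflect how control.sh actually parses arguments; only the warning text and the '--force' flag name come from the issue.

```java
import java.util.List;

/** Hypothetical sketch of a deactivation guard; not the real control.sh command handler. */
class DeactivateCommandSketch {
    /** Warning text as suggested in the issue. */
    static final String WARNING =
        "Your cluster has in-memory cache configured. " +
        "During deactivation all data from these caches will be cleared!";

    /** Deactivation may proceed if nothing can be lost or the user passed --force. */
    static boolean mayDeactivate(List<String> args, boolean hasInMemoryCaches) {
        return !hasInMemoryCaches || args.contains("--force");
    }

    /** Returns the message the command would print; the actual deactivation is omitted. */
    static String run(List<String> args, boolean hasInMemoryCaches) {
        if (!mayDeactivate(args, hasInMemoryCaches))
            return WARNING + " Re-run with --force to proceed.";

        return "Cluster deactivated.";
    }
}
```

For example, `DeactivateCommandSketch.run(List.of("--deactivate"), true)` returns the warning, while adding "--force" to the argument list lets the command proceed.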
[jira] [Created] (IGNITE-12606) Parametrize IgniteTxStoreExceptionAbstractSelfTest
Vladimir Steshin created IGNITE-12606: - Summary: Parametrize IgniteTxStoreExceptionAbstractSelfTest Key: IGNITE-12606 URL: https://issues.apache.org/jira/browse/IGNITE-12606 Project: Ignite Issue Type: Sub-task Reporter: Vladimir Steshin Assignee: Vladimir Steshin IgniteTxStoreExceptionAbstractSelfTest seems well suited to parametrization. It has only a single level of sub-tests, which are all used together in one place. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-12597) IgniteTxStoreExceptionAbstractSelfTest
Vladimir Steshin created IGNITE-12597: - Summary: IgniteTxStoreExceptionAbstractSelfTest Key: IGNITE-12597 URL: https://issues.apache.org/jira/browse/IGNITE-12597 Project: Ignite Issue Type: Sub-task Reporter: Vladimir Steshin Assignee: Vladimir Steshin org.apache.ignite.internal.processors.cache.GridCacheColocatedTxStoreExceptionSelfTest could be parametrized. The extending classes carry only parameters and are executed in a row. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-12596) Parametrization of IgniteCacheAbstractExecutionContextTest
Vladimir Steshin created IGNITE-12596: - Summary: Parametrization of IgniteCacheAbstractExecutionContextTest Key: IGNITE-12596 URL: https://issues.apache.org/jira/browse/IGNITE-12596 Project: Ignite Issue Type: Sub-task Environment: org.apache.ignite.internal.processors.cache.context.IgniteCacheAbstractExecutionContextTest is run 3 times with different params supplied only via inheritance. The problem is that the target test suites do not always include the extending classes for every combination of params: sometimes only 2 extending classes are involved in the tests, sometimes 3. I am thinking of using the subclasses of IgniteCacheAbstractExecutionContextTest as a set of params. Reporter: Vladimir Steshin Assignee: Vladimir Steshin -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-12595) Parametrization of GridCacheSetAbstractSelfTest
Vladimir Steshin created IGNITE-12595: - Summary: Parametrization of GridCacheSetAbstractSelfTest Key: IGNITE-12595 URL: https://issues.apache.org/jira/browse/IGNITE-12595 Project: Ignite Issue Type: Sub-task Reporter: Vladimir Steshin Assignee: Vladimir Steshin org.apache.ignite.internal.processors.cache.datastructures.GridCacheSetAbstractSelfTest could be run with params. It is not the best candidate, but it would still reduce the test code base. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-12583) Parametrization of JdbcThinBulkLoadAbstractSelfTest
Vladimir Steshin created IGNITE-12583: - Summary: Parametrization of JdbcThinBulkLoadAbstractSelfTest Key: IGNITE-12583 URL: https://issues.apache.org/jira/browse/IGNITE-12583 Project: Ignite Issue Type: Sub-task Reporter: Vladimir Steshin Assignee: Vladimir Steshin org.apache.ignite.jdbc.thin.JdbcThinBulkLoadAbstractSelfTest is extended several times by classes that only override parameter-assigning getters, such as {code:java} protected CacheMode cacheMode() { return CacheMode.REPLICATED; } protected CacheAtomicityMode atomicityMode() { return CacheAtomicityMode.TRANSACTIONAL; } protected boolean nearCache() { return false; } {code} These should be replaced with test params. -- This message was sent by Atlassian Jira (v8.3.4#803005)
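Replacing the getter-only subclasses with params could start from a generated parameter matrix, sketched below in plain Java. This is an assumption-laden illustration: BulkLoadParamsSketch and its nested enums are hypothetical local stand-ins for the real Ignite CacheMode and CacheAtomicityMode, and the real test may need only a subset of these combinations. With JUnit 4's Parameterized runner, a method like parameters() would typically carry the @Parameterized.Parameters annotation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: builds every (cacheMode, atomicityMode, nearCache) combination. */
class BulkLoadParamsSketch {
    /** Local stand-in for Ignite's CacheMode (values assumed for illustration). */
    enum CacheMode { PARTITIONED, REPLICATED }

    /** Local stand-in for Ignite's CacheAtomicityMode (values assumed for illustration). */
    enum CacheAtomicityMode { ATOMIC, TRANSACTIONAL }

    /** One row per test run, replacing one getter-only subclass each. */
    static List<Object[]> parameters() {
        List<Object[]> res = new ArrayList<>();

        for (CacheMode mode : CacheMode.values()) {
            for (CacheAtomicityMode atomicity : CacheAtomicityMode.values()) {
                for (boolean nearCache : new boolean[] {false, true})
                    res.add(new Object[] {mode, atomicity, nearCache});
            }
        }

        return res;
    }
}
```

Generating the matrix in one place also addresses the concern about suites that include only some of the subclasses: every combination is run, or is excluded explicitly.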