[ https://issues.apache.org/jira/browse/IGNITE-9026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ryabov Dmitrii updated IGNITE-9026: ----------------------------------- Fix Version/s: 2.8 > Two levels of Peer class loading fails in CONTINUOUS mode > --------------------------------------------------------- > > Key: IGNITE-9026 > URL: https://issues.apache.org/jira/browse/IGNITE-9026 > Project: Ignite > Issue Type: Bug > Affects Versions: 2.5 > Reporter: David Harvey > Assignee: David Harvey > Priority: Major > Fix For: 2.8 > > Attachments: master_1b3742f4d7_p2p_two_hops.patch > > > We had an seemingly functional system in SHARED_MODE, where we have a custom > StreamReceiver that sometimes sends closures on the peer class loaded code to > other servers. However, we ended up running out of Metaspace, because we had > > 6000 class loaders! We suspected a regression in this change > [https://github.com/apache/ignite/commit/d2050237ee2b760d1c9cbc906b281790fd0976b4#diff-3fae20691c16a617d0c6158b0f61df3c], > so we switched to CONTINUOUS mode. We then started getting failures to > load some of the classes for the closures on the second server. Through > some testing and code inspection, there seems to be the following flaws > between GridDeploymentCommunication.sendResourceRequest and its two callers. > The callers iterate though all the participant nodes until they find an > online node that responds to the request (timeout is treated as offline > node), with either success or failure, and then the loop terminates. The > assumption is that all nodes are equally capable of providing the resource, > so if one fails, then the others would also fail. > The first flaw is that GridDeploymentCommunication.sendResourceRequest() has > a check for a cycle, i.e., whether the destination node is one of the nodes > that originated or forwarded this request, and in that case, a failure > response is faked. However, that causes the caller's loop to terminate. So > depending on the order of the nodes in the participant list, > sendResourceRequest() may fail before trying any nodes because it has one of > the calling nodes on this list. It should instead be skipping any of the > calling nodes. > Example with 1 client node a 2 server nodes: C1 sends data to S1, which > forwards closure to S2. C1 also sends to S2 which forwards to S1. So now > the node lists on S1 and S2 contain C1 and the other S node. If the order > of the node lists on S1 is (S2,C1) and on S2 (S1,C1), then when S1 tries to > load a class, it will try S2, then S2 will try S1, but will get a fake > failure generated, causing S2 not to try more nodes (i.e., C1), and causing > S1 also not to try more nodes. > The other flaw is the assumption that all participants have equal access to > the resource. Assume S1 knows about userVersion1 via S3 and S4, with S3 > though C1 and S4 through C2. If C2 fails, then S4 is not capable of getting > back to a master, but S1 has no way of knowing that. -- This message was sent by Atlassian JIRA (v7.6.3#76005)