[jira] [Commented] (IGNITE-9026) Two levels of Peer class loading fails in CONTINUOUS mode

2019-10-10 Thread Maxim Muzafarov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-9026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948448#comment-16948448
 ] 

Maxim Muzafarov commented on IGNITE-9026:
-

Moved to the next release due to inactivity. Please feel free to move it back 
if you are able to complete the ticket by the 2.8 code freeze date, December 2, 
2019.


> Two levels of Peer class loading fails in CONTINUOUS mode
> ---------------------------------------------------------
>
> Key: IGNITE-9026
> URL: https://issues.apache.org/jira/browse/IGNITE-9026
> Project: Ignite
> Issue Type: Bug
> Affects Versions: 2.5
> Reporter: David Harvey
> Assignee: David Harvey
> Priority: Major
> Fix For: 2.8
>
> Attachments: master_1b3742f4d7_p2p_two_hops.patch
>
>
> We had a seemingly functional system in SHARED mode, where we have a custom 
> StreamReceiver that sometimes sends closures from the peer-class-loaded code to 
> other servers. However, we ended up running out of Metaspace, because we had 
> more than 6000 class loaders! We suspected a regression from this change 
> [https://github.com/apache/ignite/commit/d2050237ee2b760d1c9cbc906b281790fd0976b4#diff-3fae20691c16a617d0c6158b0f61df3c],
> so we switched to CONTINUOUS mode. We then started getting failures to 
> load some of the classes for the closures on the second server. Through 
> some testing and code inspection, there seem to be the following flaws in the 
> interaction between GridDeploymentCommunication.sendResourceRequest and its two callers.
> The callers iterate through all the participant nodes until they find an 
> online node that responds to the request (a timeout is treated as an offline 
> node) with either success or failure, and then the loop terminates. The 
> assumption is that all nodes are equally capable of providing the resource, 
> so if one fails, the others would also fail.
> The first flaw is that GridDeploymentCommunication.sendResourceRequest() has 
> a check for a cycle, i.e., whether the destination node is one of the nodes 
> that originated or forwarded this request, and in that case a failure 
> response is faked. However, that causes the caller's loop to terminate. So, 
> depending on the order of the nodes in the participant list, 
> sendResourceRequest() may fail before trying any nodes, because it has one of 
> the calling nodes on its list. It should instead skip any of the 
> calling nodes.
> Example with 1 client node and 2 server nodes: C1 sends data to S1, which 
> forwards a closure to S2. C1 also sends to S2, which forwards to S1. So now 
> the node lists on S1 and S2 contain C1 and the other server node. If the order 
> of the node list on S1 is (S2, C1) and on S2 is (S1, C1), then when S1 tries to 
> load a class, it will try S2; S2 will then try S1 but will get a faked 
> failure, causing S2 not to try more nodes (i.e., C1), and causing 
> S1 also not to try more nodes.
> The other flaw is the assumption that all participants have equal access to 
> the resource. Assume S1 knows about userVersion1 via S3 and S4, with S3 
> reached through C1 and S4 through C2. If C2 fails, then S4 is not capable of 
> getting back to a master, but S1 has no way of knowing that.
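
To make the first flaw concrete, here is a minimal, hypothetical sketch of the caller loop and the proposed skip-instead-of-fail behavior. Apart from the sendResourceRequest name, every type and method below is invented for illustration and does not match Ignite's actual internals.

{code:java}
import java.util.List;
import java.util.Set;

// Simplified model of the loop in sendResourceRequest()'s two callers.
class ResourceRequestLoop {
    /** Outcome of asking a single node for a resource. */
    enum Response { SUCCESS, FAILURE, OFFLINE }

    interface Communication {
        // Asks one node for the resource; OFFLINE also covers timeouts.
        Response sendResourceRequest(String rsrcName, String nodeId);
    }

    /**
     * Current behavior: when 'nodeId' already originated or forwarded the
     * request, a FAILURE response is faked and the loop below stops without
     * trying the remaining participants (e.g. C1 in the S1/S2 example above).
     * The fix sketched here skips such nodes up front instead.
     */
    static boolean requestResource(String rsrcName,
        List<String> participants,
        Set<String> callingNodes, // nodes that originated/forwarded the request
        Communication comm) {
        for (String nodeId : participants) {
            if (callingNodes.contains(nodeId))
                continue; // Skip the cycle instead of faking a terminal failure.

            Response res = comm.sendResourceRequest(rsrcName, nodeId);

            if (res == Response.OFFLINE)
                continue; // Timed out or left the grid: try the next node.

            if (res == Response.SUCCESS)
                return true;

            // Proposed: on FAILURE keep trying, since participants do not in
            // fact have equal access to the resource (the second flaw).
        }
        return false;
    }
}
{code}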



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-9026) Two levels of Peer class loading fails in CONTINUOUS mode

2018-12-14 Thread Dmitriy Pavlov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-9026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721485#comment-16721485
 ] 

Dmitriy Pavlov commented on IGNITE-9026:


[~akalashnikov], what should our next steps be? Should we move the issue to Open?



[jira] [Commented] (IGNITE-9026) Two levels of Peer class loading fails in CONTINUOUS mode

2018-11-27 Thread Anton Kalashnikov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-9026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700043#comment-16700043
 ] 

Anton Kalashnikov commented on IGNITE-9026:
---

[~syssoftsol], thanks for your participation. I have a couple of notes:
In your case you can try using PRIVATE mode; it's the simplest and clearest way. 
Classes will be loaded from the job's initiator node and will be undeployed when 
that node leaves.
As a prototype for tests you can use SharedDeploymentTest.
In general, SHARED and CONTINUOUS modes have rather unclear behavior for loading 
classes from other nodes. In fact, when we have chained tasks (one task calling 
another task), the participant (master) for all classes should be the node that 
initiated the first task, so we should always load classes from that node. As a 
result, I think that GridDeploymentClassLoader#nodeList is not necessary. In any 
case, DeploymentMode behavior should be revised; I think we will discuss it soon 
on the dev list.
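
For reference, switching a node to PRIVATE mode is a one-line configuration change; a minimal sketch using the standard Ignite configuration API (discovery and cache settings omitted):

{code:java}
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DeploymentMode;
import org.apache.ignite.configuration.IgniteConfiguration;

public class PrivateModeStartup {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Peer class loading must be enabled for any DeploymentMode to apply.
        cfg.setPeerClassLoadingEnabled(true);

        // PRIVATE: classes are loaded from the task initiator node and are
        // undeployed when that node leaves, sidestepping the SHARED/CONTINUOUS
        // participant-list logic discussed in this ticket.
        cfg.setDeploymentMode(DeploymentMode.PRIVATE);

        Ignite ignite = Ignition.start(cfg);
    }
}
{code}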



[jira] [Commented] (IGNITE-9026) Two levels of Peer class loading fails in CONTINUOUS mode

2018-09-19 Thread Ryabov Dmitrii (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-9026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16620305#comment-16620305
 ] 

Ryabov Dmitrii commented on IGNITE-9026:


PDS tests were unmuted 2 days ago (the branch has a 7-day-old master), so they 
should be OK. Test {{testGetReadThrough}} had 2 failures 2 weeks ago. 
[~syssoftsol], please rebase your branch on current master and rerun "Run All" 
on [TeamCity|https://ci.ignite.apache.org].



[jira] [Commented] (IGNITE-9026) Two levels of Peer class loading fails in CONTINUOUS mode

2018-09-19 Thread Ignite TC Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-9026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16620297#comment-16620297
 ] 

Ignite TC Bot commented on IGNITE-9026:
---

{panel:title=Possible 
Blockers|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1}
{color:#d04437}SPI{color} [[tests 0 TIMEOUT , Exit Code 
|https://ci.ignite.apache.org/viewLog.html?buildId=1900586]]
* TcpDiscoverySslSelfTest.testNoExtraNodeFailedMessage (last started)

{color:#d04437}PDS (Direct IO) 2{color} [[tests 
2|https://ci.ignite.apache.org/viewLog.html?buildId=1900606]]
* IgnitePdsNativeIoTestSuite2: IgnitePersistentStoreDataStructuresTest.testSet 
- 0,0% fails in last 100 master runs.

{color:#d04437}PDS 2{color} [[tests 
1|https://ci.ignite.apache.org/viewLog.html?buildId=1900610]]
* IgnitePdsTestSuite2: IgnitePersistentStoreDataStructuresTest.testSet - 0,0% 
fails in last 100 master runs.

{color:#d04437}Cache (Expiry Policy){color} [[tests 
4|https://ci.ignite.apache.org/viewLog.html?buildId=1900621]]
* IgniteCacheExpiryPolicyTestSuite: 
IgniteCacheTxExpiryPolicyWithStoreTest.testGetReadThrough - 0,0% fails in last 
100 master runs.

{panel}
[TeamCity Run 
All|http://ci.ignite.apache.org/viewLog.html?buildId=1900650&buildTypeId=IgniteTests24Java8_RunAll]



[jira] [Commented] (IGNITE-9026) Two levels of Peer class loading fails in CONTINUOUS mode

2018-09-18 Thread Ryabov Dmitrii (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-9026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619224#comment-16619224
 ] 

Ryabov Dmitrii commented on IGNITE-9026:


Hello, [~syssoftsol]. I left some comments about [coding 
style|https://cwiki.apache.org/confluence/display/IGNITE/Coding+Guidelines] in 
[upsource|https://reviews.ignite.apache.org/ignite/review/IGNT-CR-864].



[jira] [Commented] (IGNITE-9026) Two levels of Peer class loading fails in CONTINUOUS mode

2018-09-12 Thread David Harvey (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-9026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612962#comment-16612962
 ] 

David Harvey commented on IGNITE-9026:
--

The pull request identifies the production code change we are running, which 
addressed the issue we saw. I have not yet written the test that demonstrates 
the failure.



[jira] [Commented] (IGNITE-9026) Two levels of Peer class loading fails in CONTINUOUS mode

2018-09-12 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-9026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612823#comment-16612823
 ] 

ASF GitHub Bot commented on IGNITE-9026:


GitHub user DaveWHarvey opened a pull request:

https://github.com/apache/ignite/pull/4741

IGNITE-9026 fix random class loading failures

Skip recursive resource requests to originating nodes, rather than failing 
the entire request. Continue to search other nodes on errors, because the 
assumption that all nodes have the same view is incorrect.
Restrict the recursive searches that a node should do when looking for 
resources by avoiding the nodes that the sender has or will search.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/percipiomedia/ignite p2p_two_hops

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/ignite/pull/4741.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4741


commit 218469c6157f2aada33acb69adac60e25112a73a
Author: Dave Harvey 
Date:   2018-07-18T20:51:50Z

IGNITE-9026 fix random class loading failures

Skip recursive resource requests to originating nodes, rather than failing 
the entire request. Continue to search other nodes on errors, because the 
assumption that all nodes have the same view is incorrect.
Restrict the recursive searches that a node should do when looking for 
resources by avoiding the nodes that the sender has or will search.
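
A minimal, hypothetical sketch of the second part of the change: each request carries the set of nodes the sender has already searched or will search, so recursive lookups avoid them instead of bouncing the request back. All names below are invented for illustration and do not match Ignite's actual internals.

{code:java}
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RecursiveResourceSearch {
    interface Node {
        String id();
        boolean hasResource(String rsrcName);     // can serve it locally
        List<Node> participants(String rsrcName); // nodes it would ask next
    }

    static boolean search(Node self, String rsrcName, Set<String> alreadySearched) {
        if (self.hasResource(rsrcName))
            return true;

        // Pass downstream everything this node has or will search, including
        // itself, so a recursive request can never revisit those nodes.
        Set<String> searched = new HashSet<>(alreadySearched);
        searched.add(self.id());
        for (Node n : self.participants(rsrcName))
            searched.add(n.id());

        for (Node n : self.participants(rsrcName)) {
            if (alreadySearched.contains(n.id()))
                continue; // The sender has or will search this node: skip it.

            if (search(n, rsrcName, searched))
                return true;
            // On failure, fall through and try the next participant.
        }
        return false;
    }
}
{code}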






[jira] [Commented] (IGNITE-9026) Two levels of Peer class loading fails in CONTINUOUS mode

2018-09-11 Thread David Harvey (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-9026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16611324#comment-16611324
 ] 

David Harvey commented on IGNITE-9026:
--

We had a related issue that currently appears to be a separate bug: when the 
server went back to the client, it got a network timeout on the P2P message, 
even though the client stayed in the grid. It would seem that the P2P message 
should not fail on a network error that the node did not consider fatal. After 
that, every subsequent attempt to load the class failed.



[jira] [Commented] (IGNITE-9026) Two levels of Peer class loading fails in CONTINUOUS mode

2018-09-10 Thread David Harvey (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-9026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609846#comment-16609846
 ] 

David Harvey commented on IGNITE-9026:
--

On the test front, it looks like there are no tests at all focused on this 
scenario: send a closure to node A, which then sends a closure to node B.

I believe the existing bug can be demonstrated easily with a client C and 
servers A and B.
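
One possible shape for such a test, as a hypothetical sketch against the public streamer API (cluster startup for servers A and B, cache creation, and the separate-classpath packaging that actually forces peer class loading are all omitted):

{code:java}
import java.util.Collection;
import java.util.Map;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;
import org.apache.ignite.stream.StreamReceiver;

public class TwoHopP2PSketch {
    /** Receiver peer-class-loaded onto server A; triggers the second hop. */
    static class ForwardingReceiver implements StreamReceiver<Integer, String> {
        @Override public void receive(IgniteCache<Integer, String> cache,
            Collection<Map.Entry<Integer, String>> entries) {
            for (Map.Entry<Integer, String> e : entries)
                cache.put(e.getKey(), e.getValue());

            // Second hop: this closure must now be peer-class-loaded on
            // server B via server A, not directly from client C.
            Ignition.localIgnite().compute().broadcast(
                () -> System.out.println("two-hop closure ran"));
        }
    }

    /** Runs on client C: streams one entry; the receiver runs on a server. */
    public static void run(Ignite client) {
        try (IgniteDataStreamer<Integer, String> streamer =
                 client.dataStreamer("test-cache")) {
            streamer.allowOverwrite(true); // needed for a custom receiver
            streamer.receiver(new ForwardingReceiver());
            streamer.addData(1, "value");
        }
    }
}
{code}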



[jira] [Commented] (IGNITE-9026) Two levels of Peer class loading fails in CONTINUOUS mode

2018-09-10 Thread David Harvey (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-9026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609705#comment-16609705
 ] 

David Harvey commented on IGNITE-9026:
--

I'm finishing my first submission now, and will move to this over the next few 
days, either writing the test or at least sharing a patch.



[jira] [Commented] (IGNITE-9026) Two levels of Peer class loading fails in CONTINUOUS mode

2018-09-10 Thread Stanislav Lukyanov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-9026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609169#comment-16609169
 ] 

Stanislav Lukyanov commented on IGNITE-9026:


[~syssoftsol], can you share the fix as a pull request or a patch? Perhaps 
someone would be able to help with the test.
