[jira] [Commented] (IGNITE-3558) Affinity task hangs when Collision SPI produces a lot of job rejections & Failover SPI produces many attempts

2017-09-06 Thread Taras Ledkov (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155188#comment-16155188
 ] 

Taras Ledkov commented on IGNITE-3558:
--

Waiting for [test 
results|https://ci.ignite.apache.org/project.html?projectId=Ignite20Tests&tab=projectOverview&branch_Ignite20Tests=pull%2F1326%2Fhead]

> Affinity task hangs when Collision SPI produces a lot of job rejections & 
> Failover SPI produces many attempts
> -
>
> Key: IGNITE-3558
> URL: https://issues.apache.org/jira/browse/IGNITE-3558
> Project: Ignite
>  Issue Type: Bug
>  Components: compute
>Reporter: Taras Ledkov
>Assignee: Taras Ledkov
> Fix For: 2.3
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> The test to reproduce:
> {{IgniteCacheLockPartitionOnAffinityRunWithCollisionSpiTest.testJobFinishing}}
> *Root cause*
> {{GridJobExecuteResponse}} isn't sent from the target node because 
> {{GridJobWorker}} instances get confused in the {{CollisionContext}}.
> *Suggestion*
> The method {{GridJobProcessor.CollisionJobContext.cancel()}}
> uses {{passiveJobs.remove(jobWorker.getJobId(), jobWorker)}}. 
> *passiveJobs* is a ConcurrentHashMap, and {{GridJobWorker.equals()}} 
> is implemented as a comparison of jobId.
> So, when two threads try to cancel two workers with *the same jobId*, we 
> have the following case:
> - thread0: removes jobWorker0 and cancels jobWorker0;
> - thread0: puts jobWorker1 (because jobWorker0 has already been removed);
> - thread1: (holds a reference to jobWorker0) tries to cancel it;
> - thread1: removes jobWorker1 instead of jobWorker0 (because jobId is used to 
> identify the worker);
> - thread1: doesn't send ExecuteResponse because jobWorker0 has already been canceled.
> *Proposal*
> Try to use the system default equals for GridJobWorker.
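
The sequence described above can be reproduced in isolation. The following is a minimal single-threaded sketch, not the actual Ignite code: {{IdEqualsWorker}} and {{IdentityWorker}} are hypothetical stand-ins for {{GridJobWorker}}, and the interleaving of thread0 and thread1 is replayed sequentially. It relies only on the documented behavior of {{ConcurrentHashMap.remove(key, value)}}, which compares the value using {{equals()}}.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class PassiveJobsRemoveSketch {
    /** Hypothetical stand-in for GridJobWorker: equality is defined by job id only. */
    static final class IdEqualsWorker {
        final String jobId;

        IdEqualsWorker(String jobId) { this.jobId = jobId; }

        @Override public boolean equals(Object o) {
            return o instanceof IdEqualsWorker && jobId.equals(((IdEqualsWorker) o).jobId);
        }

        @Override public int hashCode() { return jobId.hashCode(); }
    }

    /** Same stand-in, but with the default Object.equals (identity comparison). */
    static final class IdentityWorker {
        final String jobId;

        IdentityWorker(String jobId) { this.jobId = jobId; }
    }

    /** Replays the interleaving with id-based equals; returns the result of the stale remove. */
    static boolean staleRemoveWithIdEquals() {
        ConcurrentMap<String, IdEqualsWorker> passiveJobs = new ConcurrentHashMap<>();
        IdEqualsWorker worker0 = new IdEqualsWorker("job-1");
        IdEqualsWorker worker1 = new IdEqualsWorker("job-1");

        passiveJobs.put("job-1", worker0);
        passiveJobs.remove("job-1", worker0);        // thread0: removes and cancels worker0
        passiveJobs.put("job-1", worker1);           // thread0: puts worker1 under the same id
        return passiveJobs.remove("job-1", worker0); // thread1: stale reference to worker0
    }

    /** Replays the same interleaving with identity equals. */
    static boolean staleRemoveWithIdentityEquals() {
        ConcurrentMap<String, IdentityWorker> passiveJobs = new ConcurrentHashMap<>();
        IdentityWorker worker0 = new IdentityWorker("job-1");
        IdentityWorker worker1 = new IdentityWorker("job-1");

        passiveJobs.put("job-1", worker0);
        passiveJobs.remove("job-1", worker0);
        passiveJobs.put("job-1", worker1);
        return passiveJobs.remove("job-1", worker0); // worker1 stays in the map
    }

    public static void main(String[] args) {
        // true: worker1 was removed by mistake through the stale worker0 reference
        System.out.println("id-based equals removed worker1: " + staleRemoveWithIdEquals());
        // false: the stale remove is a no-op, worker1 is still registered
        System.out.println("identity equals removed worker1: " + staleRemoveWithIdentityEquals());
    }
}
```

With id-based equality the stale remove succeeds and silently drops jobWorker1, which matches the missing response in the scenario. With the default equals the conditional remove only succeeds for the exact instance that is still in the map.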



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (IGNITE-3558) Affinity task hangs when Collision SPI produces a lot of job rejections & Failover SPI produces many attempts

2017-09-06 Thread Yakov Zhdanov (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155176#comment-16155176
 ] 

Yakov Zhdanov commented on IGNITE-3558:
---

I am ok with the changes.

[~tledkov-gridgain], please rerun TeamCity and let's proceed.



[jira] [Commented] (IGNITE-3558) Affinity task hangs when Collision SPI produces a lot of job rejections & Failover SPI produces many attempts

2017-09-05 Thread Vladimir Ozerov (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153182#comment-16153182
 ] 

Vladimir Ozerov commented on IGNITE-3558:
-

[~tledkov-gridgain], [~sboikov], any more comments here? Are we going to implement 
anything as a part of this ticket?



[jira] [Commented] (IGNITE-3558) Affinity task hangs when Collision SPI produces a lot of job rejections & Failover SPI produces many attempts

2017-02-16 Thread Taras Ledkov (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15869797#comment-15869797
 ] 

Taras Ledkov commented on IGNITE-3558:
--

[~vozerov], [~sboikov], please review the patch.

> Affinity task hangs when Collision SPI produces a lot of job rejections & 
> Failover SPI produces many attempts
> -
>
> Key: IGNITE-3558
> URL: https://issues.apache.org/jira/browse/IGNITE-3558
> Project: Ignite
>  Issue Type: Bug
>  Components: compute
>Reporter: Taras Ledkov
>Assignee: Taras Ledkov
> Fix For: 2.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h


[jira] [Commented] (IGNITE-3558) Affinity task hangs when Collision SPI produces a lot of job rejections & Failover SPI produces many attempts

2016-12-05 Thread Taras Ledkov (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15722540#comment-15722540
 ] 

Taras Ledkov commented on IGNITE-3558:
--

Pull request to run tests: 
[pull/1316|https://github.com/apache/ignite/pull/1316]
