[jira] [Commented] (IGNITE-3558) Affinity task hangs when Collision SPI produces a lot of job rejections & Failover SPI produces many attempts
[ https://issues.apache.org/jira/browse/IGNITE-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155188#comment-16155188 ]

Taras Ledkov commented on IGNITE-3558:
--------------------------------------

Waiting for [test results|https://ci.ignite.apache.org/project.html?projectId=Ignite20Tests&tab=projectOverview&branch_Ignite20Tests=pull%2F1326%2Fhead]

> Affinity task hangs when Collision SPI produces a lot of job rejections & Failover SPI produces many attempts
> -------------------------------------------------------------------------------------------------------------
>
> Key: IGNITE-3558
> URL: https://issues.apache.org/jira/browse/IGNITE-3558
> Project: Ignite
> Issue Type: Bug
> Components: compute
> Reporter: Taras Ledkov
> Assignee: Taras Ledkov
> Fix For: 2.3
>
> Time Spent: 3h
> Remaining Estimate: 0h
>
> The test to reproduce:
> {{IgniteCacheLockPartitionOnAffinityRunWithCollisionSpiTest.testJobFinishing}}
> *Root cause*
> {{GridJobExecuteResponse}} isn't sent from the target node because the {{GridJobWorker}} instances in the {{CollisionContext}} get confused with each other.
> *Suggestion*
> The method {{GridJobProcessor.CollisionJobContext.cancel()}} uses {{passiveJobs.remove(jobWorker.getJobId(), jobWorker)}}.
> *passiveJobs* is a ConcurrentHashMap, and {{GridJobWorker.equals()}} is implemented as a comparison of jobId.
> So, when two threads try to cancel two workers with *the same jobId*, we have the following case:
> - thread0: removes jobWorker0 and cancels jobWorker0;
> - thread0: puts jobWorker1 (because jobWorker0 has already been removed);
> - thread1: (has a copy of jobWorker0) tries to cancel it;
> - thread1: removes jobWorker1 instead of jobWorker0 (because jobId is used to identify the worker);
> - thread1: doesn't send ExecuteResponse because jobWorker0 has already been canceled.
> *Proposal*
> Try to use the system default equals for GridJobWorker.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
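The interleaving described in the issue can be replayed deterministically on a single thread. The following is only a minimal sketch and not Ignite code: the classes {{IdEqualsWorker}} and {{IdentityWorker}} and the key "job-1" are hypothetical stand-ins for {{GridJobWorker}} and its job id. It relies on one documented fact, namely that {{ConcurrentHashMap.remove(key, value)}} removes the entry only when the mapped value equals the given value.

```java
import java.util.concurrent.ConcurrentHashMap;

public class StaleRemoveDemo {
    /** Hypothetical worker whose equals() compares only the job id. */
    static final class IdEqualsWorker {
        final String jobId;

        IdEqualsWorker(String jobId) { this.jobId = jobId; }

        @Override public boolean equals(Object o) {
            return o instanceof IdEqualsWorker && jobId.equals(((IdEqualsWorker) o).jobId);
        }

        @Override public int hashCode() { return jobId.hashCode(); }
    }

    /** Hypothetical worker that keeps the default identity-based equals(). */
    static final class IdentityWorker {
        final String jobId;

        IdentityWorker(String jobId) { this.jobId = jobId; }
    }

    /** Replays the steps from the issue with id-based equality. */
    static boolean staleRemoveWithIdEquals() {
        ConcurrentHashMap<String, IdEqualsWorker> passiveJobs = new ConcurrentHashMap<>();
        IdEqualsWorker worker0 = new IdEqualsWorker("job-1");
        IdEqualsWorker worker1 = new IdEqualsWorker("job-1");
        passiveJobs.put("job-1", worker0);

        IdEqualsWorker stale = worker0;       // "thread1" still holds worker0
        passiveJobs.remove("job-1", worker0); // "thread0" removes worker0
        passiveJobs.put("job-1", worker1);    // "thread0" puts worker1

        // "thread1" removes by (key, stale value): worker1 is removed by mistake.
        return passiveJobs.remove("job-1", stale);
    }

    /** Replays the same steps with the default identity equality. */
    static boolean staleRemoveWithIdentityEquals() {
        ConcurrentHashMap<String, IdentityWorker> passiveJobs = new ConcurrentHashMap<>();
        IdentityWorker worker0 = new IdentityWorker("job-1");
        IdentityWorker worker1 = new IdentityWorker("job-1");
        passiveJobs.put("job-1", worker0);

        IdentityWorker stale = worker0;
        passiveJobs.remove("job-1", worker0);
        passiveJobs.put("job-1", worker1);

        // The stale reference no longer matches the mapped value, so nothing is removed.
        return passiveJobs.remove("job-1", stale);
    }

    public static void main(String[] args) {
        System.out.println("id-based equals, stale remove succeeded: " + staleRemoveWithIdEquals());
        System.out.println("identity equals, stale remove succeeded: " + staleRemoveWithIdentityEquals());
    }
}
```

With id-based equality the stale remove succeeds and evicts the newer worker. With identity equality it is a no-op, which is the behavior the proposal aims for.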
[jira] [Commented] (IGNITE-3558) Affinity task hangs when Collision SPI produces a lot of job rejections & Failover SPI produces many attempts
[ https://issues.apache.org/jira/browse/IGNITE-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155176#comment-16155176 ]

Yakov Zhdanov commented on IGNITE-3558:
---------------------------------------

I am OK with the changes. [~tledkov-gridgain], please rerun TeamCity and let's proceed.
[jira] [Commented] (IGNITE-3558) Affinity task hangs when Collision SPI produces a lot of job rejections & Failover SPI produces many attempts
[ https://issues.apache.org/jira/browse/IGNITE-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153182#comment-16153182 ]

Vladimir Ozerov commented on IGNITE-3558:
-----------------------------------------

[~tledkov-gridgain], [~sboikov], any more comments here? Are we going to implement anything as part of this ticket?
[jira] [Commented] (IGNITE-3558) Affinity task hangs when Collision SPI produces a lot of job rejections & Failover SPI produces many attempts
[ https://issues.apache.org/jira/browse/IGNITE-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15869797#comment-15869797 ]

Taras Ledkov commented on IGNITE-3558:
--------------------------------------

[~vozerov], [~sboikov], please review the patch.
[jira] [Commented] (IGNITE-3558) Affinity task hangs when Collision SPI produces a lot of job rejections & Failover SPI produces many attempts
[ https://issues.apache.org/jira/browse/IGNITE-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15722540#comment-15722540 ]

Taras Ledkov commented on IGNITE-3558:
--------------------------------------

Pull request to run tests: [pull/1316|https://github.com/apache/ignite/pull/1316]