[jira] [Commented] (GEODE-1885) Missing subscription event with Offheap partitioned region during bucket rebalance.

[ https://issues.apache.org/jira/browse/GEODE-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15527787#comment-15527787 ]

ASF subversion and git services commented on GEODE-1885:

Commit 55a65840a4e4d427acaed1182aca869bf92ecae6 in incubator-geode's branch refs/heads/feature/GEODE-1801 from [~dschneider]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-geode.git;h=55a6584 ]

GEODE-1885: fix infinite loop

The previous fix for GEODE-1885 introduced a hang on off-heap regions. If the region is concurrently closed or destroyed while other threads are modifying it, a modifying thread can get stuck in a hot loop that never terminates.

The hot loop is in AbstractRegionMap, where it tests the existing region entry it finds to see whether it can be modified. If the region entry has a value that says it is removed, the operation spins around and tries again. It expects the thread that marked the entry as removed to also remove it from the map. The previous fix for GEODE-1885 can cause that remove to not happen.

So this fix does two things:
1. On retry, remove the existing removed region entry from the map.
2. putEntryIfAbsent now only releases the current entry if it has an off-heap reference. This prevents an infinite loop that was caused by the current thread, which had just added a new entry with REMOVE_PHASE1, releasing it (changing it to REMOVE_PHASE2) because it sees that the region is closed/destroyed.

> Missing subscription event with Offheap partitioned region during bucket rebalance.
> ---
>
> Key: GEODE-1885
> URL: https://issues.apache.org/jira/browse/GEODE-1885
> Project: Geode
> Issue Type: Bug
> Components: offheap
> Reporter: Anilkumar Gingade
> Assignee: Darrel Schneider
> Fix For: 1.0.0-incubating
>
> During a transaction operation, if a concurrent redundant bucket rebalance is in progress, the client can miss a subscription event if its primary queue is hosted on the node from which the bucket is moved.
> Consider a three-node cluster N1, N2 and N3, with:
> - Client C1 connected to node N2.
> - Primary bucket region B1 on N1, and a secondary bucket for B1 on N2.
> - A transaction is started on N2, which creates an entry on B1.
> - The TX is committed. At the same time, the bucket B1 on N2 is moved to N3.
> - The TX commit message from N1 is sent to N2. This also includes the subscription message, satisfying the client C1.
> - On N2, for an offheap region, when the bucket is not found locally, the exception response is sent back to N1 without processing the subscription message.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
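The first part of the fix described in the commit message above (on retry, remove the stale removed entry instead of waiting for another thread to do it) can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not Geode's AbstractRegionMap: the class RetryLoopSketch, the REMOVED token, and the put method are hypothetical names, and a plain ConcurrentHashMap stands in for the region map.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical sketch of a retry loop; not Geode's AbstractRegionMap. */
final class RetryLoopSketch {
    /** Hypothetical stand-in for a token that marks an entry as removed. */
    static final Object REMOVED = new Object();

    /**
     * Puts a value, retrying when a removed entry is found. Instead of
     * assuming that the thread which marked the entry as removed will also
     * take it out of the map, the retry path removes the stale entry itself,
     * so the loop cannot spin forever on the same entry.
     *
     * @return the number of loop iterations that were needed
     */
    static int put(ConcurrentMap<String, Object> map, String key, Object value) {
        int attempts = 0;
        while (true) {
            attempts++;
            Object existing = map.putIfAbsent(key, value);
            if (existing == null) {
                return attempts; // no entry was present; ours is now in the map
            }
            if (existing == REMOVED) {
                // Removes the mapping only if the key still maps to REMOVED.
                map.remove(key, REMOVED);
                continue; // retry against a map without the stale entry
            }
            // A live entry exists: swap it atomically, or retry if it changed.
            if (map.replace(key, existing, value)) {
                return attempts;
            }
        }
    }

    public static void main(String[] args) {
        ConcurrentMap<String, Object> map = new ConcurrentHashMap<>();
        map.put("k", REMOVED);
        System.out.println("attempts: " + put(map, "k", "v1"));
    }
}
```

Without the map.remove call in the retry path, a single thread that finds a leftover REMOVED entry would loop indefinitely, which is the hang the commit message describes.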
[jira] [Commented] (GEODE-1885) Missing subscription event with Offheap partitioned region during bucket rebalance.

[ https://issues.apache.org/jira/browse/GEODE-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524264#comment-15524264 ]

ASF subversion and git services commented on GEODE-1885:

Commit 55a65840a4e4d427acaed1182aca869bf92ecae6 in incubator-geode's branch refs/heads/develop from [~dschneider]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-geode.git;h=55a6584 ]

GEODE-1885: fix infinite loop

The previous fix for GEODE-1885 introduced a hang on off-heap regions. If the region is concurrently closed or destroyed while other threads are modifying it, a modifying thread can get stuck in a hot loop that never terminates.

The hot loop is in AbstractRegionMap, where it tests the existing region entry it finds to see whether it can be modified. If the region entry has a value that says it is removed, the operation spins around and tries again. It expects the thread that marked the entry as removed to also remove it from the map. The previous fix for GEODE-1885 can cause that remove to not happen.

So this fix does two things:
1. On retry, remove the existing removed region entry from the map.
2. putEntryIfAbsent now only releases the current entry if it has an off-heap reference. This prevents an infinite loop that was caused by the current thread, which had just added a new entry with REMOVE_PHASE1, releasing it (changing it to REMOVE_PHASE2) because it sees that the region is closed/destroyed.

> Missing subscription event with Offheap partitioned region during bucket rebalance.
> ---
>
> Key: GEODE-1885
> URL: https://issues.apache.org/jira/browse/GEODE-1885
> Project: Geode
> Issue Type: Bug
> Components: offheap
> Reporter: Anilkumar Gingade
> Assignee: Darrel Schneider
> Fix For: 1.0.0-incubating
>
> During a transaction operation, if a concurrent redundant bucket rebalance is in progress, the client can miss a subscription event if its primary queue is hosted on the node from which the bucket is moved.
> Consider a three-node cluster N1, N2 and N3, with:
> - Client C1 connected to node N2.
> - Primary bucket region B1 on N1, and a secondary bucket for B1 on N2.
> - A transaction is started on N2, which creates an entry on B1.
> - The TX is committed. At the same time, the bucket B1 on N2 is moved to N3.
> - The TX commit message from N1 is sent to N2. This also includes the subscription message, satisfying the client C1.
> - On N2, for an offheap region, when the bucket is not found locally, the exception response is sent back to N1 without processing the subscription message.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (GEODE-1885) Missing subscription event with Offheap partitioned region during bucket rebalance.

[ https://issues.apache.org/jira/browse/GEODE-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15514048#comment-15514048 ]

Darrel Schneider commented on GEODE-1885:

This fix caused a deadlock. If an offheap region is being destroyed while concurrent modifications are being done, and a clear is done on it, then the deadlock can happen.

The deadlock is caused by the code setting the offheap region entry value to a REMOVE token but not throwing an exception. This causes the higher-level code to leave the entry in the map (if we had thrown an exception, the higher-level code would have removed the entry from the map). Then another thread that holds the RVV read lock keeps seeing this entry with the REMOVE token, spinning around, and seeing it again. Holding the RVV read lock blocks the clear, which is trying to get the RVV write lock. The clear in turn blocks the region destroy from completing, because the destroy waits for an in-progress clear.

> Missing subscription event with Offheap partitioned region during bucket rebalance.
> ---
>
> Key: GEODE-1885
> URL: https://issues.apache.org/jira/browse/GEODE-1885
> Project: Geode
> Issue Type: Bug
> Components: offheap
> Reporter: Anilkumar Gingade
> Assignee: Darrel Schneider
> Fix For: 1.0.0-incubating
>
> During a transaction operation, if a concurrent redundant bucket rebalance is in progress, the client can miss a subscription event if its primary queue is hosted on the node from which the bucket is moved.
> Consider a three-node cluster N1, N2 and N3, with:
> - Client C1 connected to node N2.
> - Primary bucket region B1 on N1, and a secondary bucket for B1 on N2.
> - A transaction is started on N2, which creates an entry on B1.
> - The TX is committed. At the same time, the bucket B1 on N2 is moved to N3.
> - The TX commit message from N1 is sent to N2. This also includes the subscription message, satisfying the client C1.
> - On N2, for an offheap region, when the bucket is not found locally, the exception response is sent back to N1 without processing the subscription message.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
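The lock interaction in the comment above (a thread that never releases the RVV read lock keeps the clear from ever getting the RVV write lock) can be illustrated with a small sketch. It assumes a java.util.concurrent ReentrantReadWriteLock as a stand-in for the RVV lock, which may not match Geode's actual lock implementation; RvvLockSketch and clearCanLock are hypothetical names, and a latch stands in for the thread spinning on the REMOVE token.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch; a ReentrantReadWriteLock stands in for the RVV lock. */
final class RvvLockSketch {
    /**
     * Reports whether a "clear" can take the write lock within 200 ms.
     * When modifierStuck is true, a modifier thread keeps holding the read
     * lock, standing in for the thread that spins on the REMOVE token.
     */
    static boolean clearCanLock(boolean modifierStuck) {
        ReentrantReadWriteLock rvvLock = new ReentrantReadWriteLock();
        CountDownLatch readLockHeld = new CountDownLatch(1);
        CountDownLatch stopSpinning = new CountDownLatch(1);

        Thread modifier = new Thread(() -> {
            rvvLock.readLock().lock();
            try {
                readLockHeld.countDown();
                if (modifierStuck) {
                    stopSpinning.await(); // never finishes on its own
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                rvvLock.readLock().unlock();
            }
        });
        modifier.start();

        try {
            readLockHeld.await();
            if (!modifierStuck) {
                modifier.join(); // modifier is done and has released the read lock
            }
            // The "clear" needs the write lock, which requires that no reader holds the lock.
            boolean acquired = rvvLock.writeLock().tryLock(200, TimeUnit.MILLISECONDS);
            if (acquired) {
                rvvLock.writeLock().unlock();
            }
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            stopSpinning.countDown(); // let the stuck modifier exit so the demo ends
        }
    }

    public static void main(String[] args) {
        System.out.println("clear proceeds while modifier is stuck: " + clearCanLock(true));
        System.out.println("clear proceeds after modifier finishes: " + clearCanLock(false));
    }
}
```

The sketch uses a timed tryLock only so that it terminates. With an untimed lock call, as in the scenario described, the clear would wait for as long as the reader keeps spinning, and anything waiting on the clear would wait with it.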
[jira] [Commented] (GEODE-1885) Missing subscription event with Offheap partitioned region during bucket rebalance.

[ https://issues.apache.org/jira/browse/GEODE-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15497018#comment-15497018 ]

ASF subversion and git services commented on GEODE-1885:

Commit dbdf6fe1a99cbfb54e39c4de52802404d30404d4 in incubator-geode's branch refs/heads/develop from [~dschneider]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-geode.git;h=dbdf6fe ]

GEODE-1885: fix eclipse compilation issue

> Missing subscription event with Offheap partitioned region during bucket rebalance.
> ---
>
> Key: GEODE-1885
> URL: https://issues.apache.org/jira/browse/GEODE-1885
> Project: Geode
> Issue Type: Bug
> Components: offheap
> Reporter: Anilkumar Gingade
> Assignee: Anilkumar Gingade
> Fix For: 1.0.0-incubating
>
> During a transaction operation, if a concurrent redundant bucket rebalance is in progress, the client can miss a subscription event if its primary queue is hosted on the node from which the bucket is moved.
> Consider a three-node cluster N1, N2 and N3, with:
> - Client C1 connected to node N2.
> - Primary bucket region B1 on N1, and a secondary bucket for B1 on N2.
> - A transaction is started on N2, which creates an entry on B1.
> - The TX is committed. At the same time, the bucket B1 on N2 is moved to N3.
> - The TX commit message from N1 is sent to N2. This also includes the subscription message, satisfying the client C1.
> - On N2, for an offheap region, when the bucket is not found locally, the exception response is sent back to N1 without processing the subscription message.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (GEODE-1885) Missing subscription event with Offheap partitioned region during bucket rebalance.

[ https://issues.apache.org/jira/browse/GEODE-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15494571#comment-15494571 ]

ASF subversion and git services commented on GEODE-1885:

Commit 9b710ab0af2bc6af2667010c004ad4798b0b8700 in incubator-geode's branch refs/heads/develop from [~agingade]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-geode.git;h=9b710ab ]

GEODE-1885: Removed call to check readiness (region check) after the offheap region entry is released.

GEODE-1885: Missing subscription event with Offheap partitioned region during bucket rebalance.

During a transaction commit on a redundant bucket region, if the bucket region is moved, the callback logic (which delivers subscription events) was not invoked because of the check-readiness call made for offheap regions. Check-readiness throws an exception if the region is not found, which causes the code to return early without sending the subscription events. In this scenario, calling check-readiness is not needed...

> Missing subscription event with Offheap partitioned region during bucket rebalance.
> ---
>
> Key: GEODE-1885
> URL: https://issues.apache.org/jira/browse/GEODE-1885
> Project: Geode
> Issue Type: Bug
> Components: offheap
> Reporter: Anilkumar Gingade
> Assignee: Anilkumar Gingade
>
> During a transaction operation, if a concurrent redundant bucket rebalance is in progress, the client can miss a subscription event if its primary queue is hosted on the node from which the bucket is moved.
> Consider a three-node cluster N1, N2 and N3, with:
> - Client C1 connected to node N2.
> - Primary bucket region B1 on N1, and a secondary bucket for B1 on N2.
> - A transaction is started on N2, which creates an entry on B1.
> - The TX is committed. At the same time, the bucket B1 on N2 is moved to N3.
> - The TX commit message from N1 is sent to N2. This also includes the subscription message, satisfying the client C1.
> - On N2, for an offheap region, when the bucket is not found locally, the exception response is sent back to N1 without processing the subscription message.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
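The ordering problem described in the commit message above (a readiness check that throws before the callbacks run, so the subscription event is never delivered) can be sketched in a few lines. This is a simplified illustration, not Geode's commit path: CallbackOrderSketch, applyCommit, and both of its parameters are hypothetical, and a list of strings stands in for the events delivered to the client queue.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the check-then-callback ordering; not Geode code. */
final class CallbackOrderSketch {
    /**
     * Applies a commit on the node that used to host the bucket.
     *
     * @param regionPresent      whether the bucket region is still hosted locally
     * @param checkReadinessFirst whether a readiness check runs before the callbacks
     * @return the events that were delivered to subscribers
     */
    static List<String> applyCommit(boolean regionPresent, boolean checkReadinessFirst) {
        List<String> delivered = new ArrayList<>();
        try {
            if (checkReadinessFirst && !regionPresent) {
                // Stands in for a readiness check that throws when the region is gone.
                throw new IllegalStateException("region not found");
            }
            // Callback logic: deliver the subscription event.
            delivered.add("subscription-event");
        } catch (IllegalStateException e) {
            // Early return: the exception is reported back, callbacks are skipped.
        }
        return delivered;
    }

    public static void main(String[] args) {
        System.out.println("with check:    " + applyCommit(false, true));
        System.out.println("without check: " + applyCommit(false, false));
    }
}
```

With the check in place and the bucket already moved, the returned list is empty, which corresponds to the missed subscription event. Dropping the check lets the callback run even though the bucket is no longer local.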