[jira] [Updated] (IGNITE-8391) Removing some WAL history segments leads to WAL rebalance hanging

2020-12-29 Thread Maxim Muzafarov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Muzafarov updated IGNITE-8391:

Fix Version/s: (was: 2.10)

> Removing some WAL history segments leads to WAL rebalance hanging
> -
>
> Key: IGNITE-8391
> URL: https://issues.apache.org/jira/browse/IGNITE-8391
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.4
>Reporter: Pavel Kovalenko
>Priority: Major
>
> Problem:
> 1) Start 2 nodes, load some data to it.
> 2) Stop node 2, load some data to cache.
> 3) Remove WAL archived segment which doesn't contain Checkpoint record needed 
> to find start point for WAL rebalance, but contains necessary data for 
> rebalancing. 
> 4) Start node 2, this node will start rebalance data from node 1 using WAL.
> Rebalance will be hanged with following assertion:
> {noformat}
> java.lang.AssertionError: Partitions after rebalance should be either done or 
> missing: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 
> 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:417)
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:364)
>   at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:379)
>   at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1054)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:579)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:99)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1603)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:125)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2752)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1516)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:125)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1485)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
>  
> This happened because we never reached necessary data and updateCounters 
> contained in removed WAL segment.
> To resolve such problems we should introduce some fallback strategy if 
> rebalance by WAL has been failed. Example of fallback strategy is - re-run 
> full rebalance for partitions that were not able properly rebalanced using 
> WAL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (IGNITE-8391) Removing some WAL history segments leads to WAL rebalance hanging

2020-06-26 Thread Aleksey Plekhanov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksey Plekhanov updated IGNITE-8391:
--
Fix Version/s: (was: 2.9)
   2.10

> Removing some WAL history segments leads to WAL rebalance hanging
> -
>
> Key: IGNITE-8391
> URL: https://issues.apache.org/jira/browse/IGNITE-8391
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.4
>Reporter: Pavel Kovalenko
>Priority: Major
> Fix For: 2.10
>
>
> Problem:
> 1) Start 2 nodes, load some data to it.
> 2) Stop node 2, load some data to cache.
> 3) Remove WAL archived segment which doesn't contain Checkpoint record needed 
> to find start point for WAL rebalance, but contains necessary data for 
> rebalancing. 
> 4) Start node 2, this node will start rebalance data from node 1 using WAL.
> Rebalance will be hanged with following assertion:
> {noformat}
> java.lang.AssertionError: Partitions after rebalance should be either done or 
> missing: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 
> 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:417)
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:364)
>   at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:379)
>   at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1054)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:579)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:99)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1603)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:125)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2752)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1516)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:125)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1485)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
>  
> This happened because we never reached necessary data and updateCounters 
> contained in removed WAL segment.
> To resolve such problems we should introduce some fallback strategy if 
> rebalance by WAL has been failed. Example of fallback strategy is - re-run 
> full rebalance for partitions that were not able properly rebalanced using 
> WAL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (IGNITE-8391) Removing some WAL history segments leads to WAL rebalance hanging

2019-10-08 Thread Maxim Muzafarov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Muzafarov updated IGNITE-8391:

Fix Version/s: (was: 2.8)
   2.9

> Removing some WAL history segments leads to WAL rebalance hanging
> -
>
> Key: IGNITE-8391
> URL: https://issues.apache.org/jira/browse/IGNITE-8391
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.4
>Reporter: Pavel Kovalenko
>Assignee: Vladislav Pyatkov
>Priority: Major
> Fix For: 2.9
>
>
> Problem:
> 1) Start 2 nodes, load some data to it.
> 2) Stop node 2, load some data to cache.
> 3) Remove WAL archived segment which doesn't contain Checkpoint record needed 
> to find start point for WAL rebalance, but contains necessary data for 
> rebalancing. 
> 4) Start node 2, this node will start rebalance data from node 1 using WAL.
> Rebalance will be hanged with following assertion:
> {noformat}
> java.lang.AssertionError: Partitions after rebalance should be either done or 
> missing: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 
> 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:417)
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:364)
>   at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:379)
>   at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1054)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:579)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:99)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1603)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:125)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2752)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1516)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:125)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1485)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
>  
> This happened because we never reached necessary data and updateCounters 
> contained in removed WAL segment.
> To resolve such problems we should introduce some fallback strategy if 
> rebalance by WAL has been failed. Example of fallback strategy is - re-run 
> full rebalance for partitions that were not able properly rebalanced using 
> WAL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (IGNITE-8391) Removing some WAL history segments leads to WAL rebalance hanging

2018-09-24 Thread Pavel Kovalenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko updated IGNITE-8391:

Fix Version/s: (was: 2.7)
   2.8

> Removing some WAL history segments leads to WAL rebalance hanging
> -
>
> Key: IGNITE-8391
> URL: https://issues.apache.org/jira/browse/IGNITE-8391
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.4
>Reporter: Pavel Kovalenko
>Priority: Major
> Fix For: 2.8
>
>
> Problem:
> 1) Start 2 nodes, load some data to it.
> 2) Stop node 2, load some data to cache.
> 3) Remove WAL archived segment which doesn't contain Checkpoint record needed 
> to find start point for WAL rebalance, but contains necessary data for 
> rebalancing. 
> 4) Start node 2, this node will start rebalance data from node 1 using WAL.
> Rebalance will be hanged with following assertion:
> {noformat}
> java.lang.AssertionError: Partitions after rebalance should be either done or 
> missing: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 
> 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:417)
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:364)
>   at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:379)
>   at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1054)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:579)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:99)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1603)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:125)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2752)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1516)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:125)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1485)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
>  
> This happened because we never reached necessary data and updateCounters 
> contained in removed WAL segment.
> To resolve such problems we should introduce some fallback strategy if 
> rebalance by WAL has been failed. Example of fallback strategy is - re-run 
> full rebalance for partitions that were not able properly rebalanced using 
> WAL.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (IGNITE-8391) Removing some WAL history segments leads to WAL rebalance hanging

2018-06-26 Thread Dmitriy Pavlov (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Pavlov updated IGNITE-8391:
---
Fix Version/s: (was: 2.6)
   2.7

> Removing some WAL history segments leads to WAL rebalance hanging
> -
>
> Key: IGNITE-8391
> URL: https://issues.apache.org/jira/browse/IGNITE-8391
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.4
>Reporter: Pavel Kovalenko
>Priority: Major
> Fix For: 2.7
>
>
> Problem:
> 1) Start 2 nodes, load some data to it.
> 2) Stop node 2, load some data to cache.
> 3) Remove WAL archived segment which doesn't contain Checkpoint record needed 
> to find start point for WAL rebalance, but contains necessary data for 
> rebalancing. 
> 4) Start node 2, this node will start rebalance data from node 1 using WAL.
> Rebalance will be hanged with following assertion:
> {noformat}
> java.lang.AssertionError: Partitions after rebalance should be either done or 
> missing: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 
> 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:417)
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:364)
>   at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:379)
>   at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1054)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:579)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:99)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1603)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:125)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2752)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1516)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:125)
>   at 
> org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1485)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
>  
> This happened because we never reached necessary data and updateCounters 
> contained in removed WAL segment.
> To resolve such problems we should introduce some fallback strategy if 
> rebalance by WAL has been failed. Example of fallback strategy is - re-run 
> full rebalance for partitions that were not able properly rebalanced using 
> WAL.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)