[ https://issues.apache.org/jira/browse/IGNITE-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16694592#comment-16694592 ]
ASF GitHub Bot commented on IGNITE-8391:
----------------------------------------

GitHub user vldpyatkov opened a pull request:

    https://github.com/apache/ignite/pull/5459

    IGNITE-8391 Removing some WAL history segments leads to WAL rebalance hanging

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gridgain/apache-ignite ignite-8391

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/ignite/pull/5459.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5459

----
commit 3c83346f167f60fed1311883325d7bc3a596a7f0
Author: vd-pyatkov <vpyatkov@...>
Date:   2018-11-21T11:50:00Z

    IGNITE-8391 Removing some WAL history segments leads to WAL rebalance hanging

----

> Removing some WAL history segments leads to WAL rebalance hanging
> -----------------------------------------------------------------
>
>                 Key: IGNITE-8391
>                 URL: https://issues.apache.org/jira/browse/IGNITE-8391
>             Project: Ignite
>          Issue Type: Bug
>          Components: cache
>   Affects Versions: 2.4
>            Reporter: Pavel Kovalenko
>            Assignee: Vladislav Pyatkov
>            Priority: Major
>             Fix For: 2.8
>
>
> Problem:
> 1) Start 2 nodes and load some data into them.
> 2) Stop node 2 and load some more data into the cache.
> 3) Remove an archived WAL segment that doesn't contain the Checkpoint record
> needed to find the start point for WAL rebalance, but does contain data
> necessary for rebalancing.
> 4) Start node 2. This node will start rebalancing data from node 1 using WAL.
> Rebalance will hang with the following assertion:
> {noformat}
> java.lang.AssertionError: Partitions after rebalance should be either done or missing: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
> 	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:417)
> 	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:364)
> 	at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:379)
> 	at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364)
> 	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1054)
> 	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:579)
> 	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:99)
> 	at org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1603)
> 	at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
> 	at org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:125)
> 	at org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2752)
> 	at org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1516)
> 	at org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:125)
> 	at org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1485)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
> {noformat}
>
> This happens because we never reach the necessary data and updateCounters,
> which were contained in the removed WAL segment.
> To resolve such problems we should introduce a fallback strategy for the case
> when WAL rebalance fails. One example of a fallback strategy: re-run full
> rebalance for the partitions that could not be properly rebalanced using WAL.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
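
For illustration, the fallback strategy proposed in the issue description (re-run full rebalance for partitions that WAL rebalance could not finish) might be sketched roughly as below. This is not the code from the pull request and does not use any Ignite API: the class name, the method name and the three partition sets are hypothetical, chosen only to mirror the wording of the assertion ("either done or missing").

```java
import java.util.Set;
import java.util.TreeSet;

/** Illustrative sketch only: none of these names are real Ignite classes or methods. */
final class WalRebalanceFallbackSketch {
    /**
     * Picks the partitions that should fall back to full rebalance.
     *
     * @param requested partitions the demander asked to rebalance from WAL
     * @param done      partitions for which the WAL walk reached the needed update counter
     * @param missing   partitions the supplier does not own (nothing to send for them)
     * @return partitions that are neither done nor missing, in ascending order
     */
    static Set<Integer> partitionsForFullRebalance(
        Set<Integer> requested, Set<Integer> done, Set<Integer> missing) {
        Set<Integer> remaining = new TreeSet<>(requested);

        remaining.removeAll(done);
        remaining.removeAll(missing);

        // Assumption: instead of asserting that this set is empty,
        // the caller re-requests these partitions with full rebalance.
        return remaining;
    }

    public static void main(String[] args) {
        Set<Integer> requested = new TreeSet<>();

        for (int p = 0; p < 32; p++)
            requested.add(p);

        // As in the report: the needed WAL segment is gone, so no partition completes.
        Set<Integer> done = new TreeSet<>();
        Set<Integer> missing = new TreeSet<>();

        System.out.println("Fall back to full rebalance for: "
            + partitionsForFullRebalance(requested, done, missing));
    }
}
```

In the scenario from the report, all 32 partitions would end up in the fallback set, because the removed segment held the data and update counters the WAL walk needed to reach.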