Re: Failed to send partition supply message to node: 5423e6b5-c9be-4eb8-8f68-e643357ec2b3 class org.apache.ignite.IgniteCheckedException: Could not find start pointer for partition
This sounds strange. There definitely should be a cause for such behaviour. Rebalancing happens only after a topology change (node join/leave, deactivation/activation).

Could you please share logs from the node with the exception you mentioned in your message, from the node with id "5423e6b5-c9be-4eb8-8f68-e643357ec2b3", and from the coordinator (oldest) node (you can find this node by grepping for "crd=true" in the logs), so that we can find the root cause of such behaviour? Cache configurations and data storage configurations would also be very useful for debugging.

1) If rebalancing didn't happen, you should notice MOVING partitions in your cache groups (from the metrics MXBeans or Visor). It's possible to write data to such partitions and to read from them (this depends on the PartitionLossPolicy configured in your caches). If you have at least 1 owner (OWNING state) for each such replicated partition, there is no data loss. Such MOVING partitions will be properly rebalanced after a node restart, and the data will become consistent between primary and backup partitions.

2) If part*.bin files are corrupted, you may notice it only during a node restart, during a subsequent cluster deactivation/activation, or if you have less RAM than your data size and the node swaps (replaces) pages to/from disk. In normal cluster operation this is undetectable, since all data resides in RAM.

On Wed, 26 Dec 2018 at 13:44, aMark wrote:

> Thanks Pavel for prompt response.
>
> I could confirm that node "5423e6b5-c9be-4eb8-8f68-e643357ec2b3" (and no
> other node in the cluster) did not go down, not sure how did stale data
> cropped up on few nodes. And this type of exception is coming from every
> server node in the cluster.
>
> What happens if re-balancing did not happen properly due to this exception,
> could it lead to data loss ?
> does data get corrupted on the part*.bin files (in persistent store) in the
> Ignite cache due to this exception ?
>
> Thanks,
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
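The "at least 1 owner per partition" rule from point 1 can be illustrated with a small stand-alone sketch. It does not use any Ignite API: the `State` enum, the class name, and the sample partition map are all hypothetical, and the per-partition states are assumed to have been collected by hand from the MXBeans or Visor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PartitionOwnershipCheck {
    /** Hypothetical local stand-in for the two partition states named in the reply. */
    enum State { MOVING, OWNING }

    /**
     * Returns the ids (ascending) of partitions that have no copy in OWNING state.
     * An empty list corresponds to the "no data loss" case described in point 1.
     */
    static List<Integer> partitionsWithoutOwner(Map<Integer, List<State>> copiesByPartition) {
        List<Integer> missing = new ArrayList<>();
        // TreeMap only gives a stable, sorted iteration order for the report.
        for (Map.Entry<Integer, List<State>> e : new TreeMap<>(copiesByPartition).entrySet()) {
            if (!e.getValue().contains(State.OWNING)) {
                missing.add(e.getKey());
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Made-up sample: partition id -> state of each copy across the nodes.
        Map<Integer, List<State>> copies = new HashMap<>();
        copies.put(8, Arrays.asList(State.OWNING, State.OWNING));
        copies.put(9, Arrays.asList(State.OWNING, State.MOVING));
        copies.put(10, Arrays.asList(State.MOVING, State.MOVING));

        // Partition 9 still has one owner; partition 10 has none.
        System.out.println(partitionsWithoutOwner(copies)); // prints [10]
    }
}
```

In this made-up sample, partition 9 is the harmless case (one copy is still MOVING, but an owner exists), while partition 10 is the one that would need attention.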
Re: Failed to send partition supply message to node: 5423e6b5-c9be-4eb8-8f68-e643357ec2b3 class org.apache.ignite.IgniteCheckedException: Could not find start pointer for partition
Thanks, Pavel, for the prompt response.

I can confirm that node "5423e6b5-c9be-4eb8-8f68-e643357ec2b3" (and no other node in the cluster) did not go down, so I am not sure how stale data cropped up on a few nodes. This type of exception is coming from every server node in the cluster.

What happens if rebalancing did not happen properly due to this exception? Could it lead to data loss?

Does data in the part*.bin files (in the persistent store) of the Ignite cache get corrupted due to this exception?

Thanks,
Re: Failed to send partition supply message to node: 5423e6b5-c9be-4eb8-8f68-e643357ec2b3 class org.apache.ignite.IgniteCheckedException: Could not find start pointer for partition
Hello,

It means that the node with id "5423e6b5-c9be-4eb8-8f68-e643357ec2b3" has outdated data (possibly due to a restart) and has started to rebalance the missed updates, using the WAL, from a node with up-to-date data (the node where you see the exception).

WAL rebalance is used when the number of entries in some partition exceeds the threshold controlled by the system property IGNITE_PDS_WAL_REBALANCE_THRESHOLD, whose default value is 500k entries. WAL rebalance is very efficient when a node has a lot of data and was down only for a short period. Unfortunately, this mechanism is currently unstable and may lead to errors like the one you noticed. Very few users have that much persisted data in a single partition. There are a couple of tickets [1], [2], [3] which should be fixed in the 2.8 release and make it more robust.

To avoid this problem, you should set the JVM system property IGNITE_PDS_WAL_REBALANCE_THRESHOLD to some very high value (e.g. 2,000,000) in all Ignite instances and perform a rolling restart. In this case the default full rebalance will be used. It's a slower but more reliable approach.

[1] https://issues.apache.org/jira/browse/IGNITE-8459
[2] https://issues.apache.org/jira/browse/IGNITE-8391
[3] https://issues.apache.org/jira/browse/IGNITE-10078

On Wed, 26 Dec 2018 at 11:19, aMark wrote:

> Hi,
>
> We are using Ignite 2.6 as persistent store in Partitioned Mode having 12
> server node running in cluster, each node is running on different machine.
>
> There are around 48 client JVM as well which connect to cluster to fetch
> the data.
>
> Recently we have started getting following exception on server nodes
> (Though clients are still able to read/write data):
>
> 2018-12-25 02:59:48,423 ERROR [sys-#22846%a738c793-6e94-48cc-b6cf-d53ccab5f0fe%] {} org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier - Failed to send partition supply message to node: 5423e6b5-c9be-4eb8-8f68-e643357ec2b3 class org.apache.ignite.IgniteCheckedException: Could not find start pointer for partition [part=9, partCntrSince=484857]
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.historicalIterator(GridCacheOffheapManager.java:792)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.historicalIterator(GridCacheOffheapManager.java:90)
>     at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.rebalanceIterator(IgniteCacheOffheapManagerImpl.java:893)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:283)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:364)
>     at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:379)
>     at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364)
>     at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1054)
>     at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:579)
>     at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:99)
>     at org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1603)
>     at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
>     at org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:125)
>     at org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2752)
>     at org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1516)
>     at org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:125)
>     at org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1485)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
>
> Does someone has any idea about the exception and possible resolution as
> well ?
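As a rough illustration of the workaround in the reply above, here is a minimal stand-alone sketch. It does not call Ignite: it only models the rule as the reply states it (WAL rebalance when a partition's entry count exceeds the threshold) and shows how a JVM system property with that name is set and read. The class name, method names, and the 600,000-entry partition are hypothetical; the property name, the 500k default, and the 2,000,000 value come from the reply. How Ignite itself reads and applies the property is not shown here.

```java
final class WalRebalanceThreshold {
    /** Property name taken from the reply above. */
    static final String PROP = "IGNITE_PDS_WAL_REBALANCE_THRESHOLD";

    /** Default quoted in the reply above (500k entries). */
    static final long DEFAULT_THRESHOLD = 500_000L;

    /** Reads the threshold from the JVM system properties, falling back to the quoted default. */
    static long configuredThreshold() {
        return Long.getLong(PROP, DEFAULT_THRESHOLD);
    }

    /** Simplified model of the rule in the reply: WAL rebalance when entries exceed the threshold. */
    static boolean wouldUseWalRebalance(long partitionEntries, long threshold) {
        return partitionEntries > threshold;
    }

    public static void main(String[] args) {
        long entries = 600_000L; // hypothetical partition size

        System.out.println("threshold=" + configuredThreshold()
            + " walRebalance=" + wouldUseWalRebalance(entries, configuredThreshold()));

        // In a real deployment the value would normally be passed on the command line of
        // every server node, e.g. -DIGNITE_PDS_WAL_REBALANCE_THRESHOLD=2000000, so that it
        // is already in place when the node starts. Setting it here is for illustration only.
        System.setProperty(PROP, "2000000");

        System.out.println("threshold=" + configuredThreshold()
            + " walRebalance=" + wouldUseWalRebalance(entries, configuredThreshold()));
    }
}
```

Under this model, a 600,000-entry partition crosses the 500k default but stays below the raised threshold, which is why the reply expects full rebalance to be used after the change.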
Failed to send partition supply message to node: 5423e6b5-c9be-4eb8-8f68-e643357ec2b3 class org.apache.ignite.IgniteCheckedException: Could not find start pointer for partition
Hi,

We are using Ignite 2.6 as a persistent store in Partitioned mode, with 12 server nodes running in the cluster, each node on a different machine.

There are also around 48 client JVMs which connect to the cluster to fetch the data.

Recently we have started getting the following exception on server nodes (though clients are still able to read/write data):

2018-12-25 02:59:48,423 ERROR [sys-#22846%a738c793-6e94-48cc-b6cf-d53ccab5f0fe%] {} org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier - Failed to send partition supply message to node: 5423e6b5-c9be-4eb8-8f68-e643357ec2b3 class org.apache.ignite.IgniteCheckedException: Could not find start pointer for partition [part=9, partCntrSince=484857]
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.historicalIterator(GridCacheOffheapManager.java:792)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.historicalIterator(GridCacheOffheapManager.java:90)
    at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.rebalanceIterator(IgniteCacheOffheapManagerImpl.java:893)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:283)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:364)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:379)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1054)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:579)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:99)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1603)
    at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
    at org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:125)
    at org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2752)
    at org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1516)
    at org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:125)
    at org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1485)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Does anyone have any idea about this exception and a possible resolution?