Re: Failed to send partition supply message to node: 5423e6b5-c9be-4eb8-8f68-e643357ec2b3 class org.apache.ignite.IgniteCheckedException: Could not find start pointer for partition

2018-12-26 Thread Pavel Kovalenko
This sounds strange. There should definitely be a cause of such behaviour.
Rebalancing happens only after a topology change (node join/leave,
deactivation/activation).
Could you please share logs from the node with the exception you mentioned
in your message, from the node with id "5423e6b5-c9be-4eb8-8f68-e643357ec2b3",
and from the coordinator (oldest) node (you can find it by grepping for
"crd=true" in the logs), so we can find the root cause of this behaviour?
Cache configurations / data storage configurations would also be very
useful for debugging.

1) If rebalancing didn't complete, you should see MOVING partitions in your
cache groups (via the metrics MXBeans or Visor; see the sketch after this
list). Whether such partitions can be written to and read from depends on
the PartitionLossPolicy configured for your caches. If there is at least one
owner (OWNING state) for each such replicated partition, there is no data
loss. Such MOVING partitions will be properly rebalanced after a node
restart, and the data will become consistent between primary and backup
partitions.
2) If part*.bin files are corrupted, you may only notice it during a node
restart, a subsequent cluster deactivation/activation, or if you have less
RAM than data and the node swaps (replaces) pages to/from disk. In normal
cluster operation this is undetectable, since all the data is in RAM.
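
For reference, here is a minimal, untested sketch of checking for partitions
that are still rebalancing from a thick Java client. It assumes cache
statistics are enabled (CacheConfiguration.setStatisticsEnabled(true)) and
that your Ignite version exposes getRebalancingPartitionsCount() in
CacheMetrics; the config file name is a hypothetical placeholder:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMetrics;

public class RebalanceCheck {
    public static void main(String[] args) {
        Ignition.setClientMode(true);

        // "client-config.xml" is a placeholder for your client configuration.
        try (Ignite ignite = Ignition.start("client-config.xml")) {
            for (String cacheName : ignite.cacheNames()) {
                // Cluster-wide cache metrics; require statistics to be enabled.
                CacheMetrics m = ignite.cache(cacheName).metrics();

                // Partitions that are not yet in OWNING state on all owners.
                if (m.getRebalancingPartitionsCount() > 0)
                    System.out.println(cacheName + ": "
                        + m.getRebalancingPartitionsCount()
                        + " partition(s) still rebalancing");
            }
        }
    }
}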


Wed, 26 Dec 2018 at 13:44, aMark:

> Thanks Pavel for the prompt response.
>
> I can confirm that node "5423e6b5-c9be-4eb8-8f68-e643357ec2b3" (and no
> other node in the cluster) did not go down, so I am not sure how stale data
> cropped up on a few nodes. And this type of exception is coming from every
> server node in the cluster.
>
> What happens if rebalancing did not complete properly because of this
> exception? Could it lead to data loss?
> Does the data in the part*.bin files (in the persistent store) of the
> Ignite cache get corrupted because of this exception?
>
> Thanks,


Re: Failed to send partition supply message to node: 5423e6b5-c9be-4eb8-8f68-e643357ec2b3 class org.apache.ignite.IgniteCheckedException: Could not find start pointer for partition

2018-12-26 Thread aMark
Thanks Pavel for the prompt response.

I can confirm that node "5423e6b5-c9be-4eb8-8f68-e643357ec2b3" (and no
other node in the cluster) did not go down, so I am not sure how stale data
cropped up on a few nodes. And this type of exception is coming from every
server node in the cluster.

What happens if rebalancing did not complete properly because of this
exception? Could it lead to data loss?
Does the data in the part*.bin files (in the persistent store) of the
Ignite cache get corrupted because of this exception?

Thanks,


Re: Failed to send partition supply message to node: 5423e6b5-c9be-4eb8-8f68-e643357ec2b3 class org.apache.ignite.IgniteCheckedException: Could not find start pointer for partition

2018-12-26 Thread Pavel Kovalenko
Hello,

It means that the node with id "5423e6b5-c9be-4eb8-8f68-e643357ec2b3" has
outdated data (possibly due to a restart) and started to rebalance the
missed updates, using the WAL, from a node with up-to-date data (the node
where you see the exception).
WAL rebalance is used when the number of entries in a partition exceeds the
threshold controlled by the system property IGNITE_PDS_WAL_REBALANCE_THRESHOLD,
whose default value is 500k entries. WAL rebalance is very efficient when a
node holds a lot of data and was down only for a short period.
Unfortunately this mechanism is currently unstable and may lead to errors
like the one you noticed; very few users have that much persisted data in a
single partition. There are a few tickets [1], [2], [3] which should be
fixed in the 2.8 release and make it more robust.

To avoid this problem, set the JVM system property
IGNITE_PDS_WAL_REBALANCE_THRESHOLD to some very high value (e.g.
2,000,000) on all Ignite instances and perform a rolling restart. In that
case the default full rebalance will be used; it is a slower but more
durable approach. A sketch of how to set the property follows the ticket
links below.

[1] https://issues.apache.org/jira/browse/IGNITE-8459
[2] https://issues.apache.org/jira/browse/IGNITE-8391
[3] https://issues.apache.org/jira/browse/IGNITE-10078
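
For illustration, a minimal sketch of setting the property before node
startup; the same effect can be achieved by passing
-DIGNITE_PDS_WAL_REBALANCE_THRESHOLD=2000000 to the JVM. The class and
config file names are hypothetical placeholders:

import org.apache.ignite.Ignition;

public class ServerNodeStartup {
    public static void main(String[] args) {
        // Must be set before Ignition.start() so historical (WAL) rebalance
        // is effectively disabled and the default full rebalance is used.
        System.setProperty("IGNITE_PDS_WAL_REBALANCE_THRESHOLD", "2000000");

        // "server-config.xml" is a placeholder for your node configuration.
        Ignition.start("server-config.xml");
    }
}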

Wed, 26 Dec 2018 at 11:19, aMark:

> Hi,
>
> We are using Ignite 2.6 as a persistent store in partitioned mode, with 12
> server nodes running in a cluster; each node runs on a different machine.
>
> There are also around 48 client JVMs which connect to the cluster to fetch
> the data.
>
> Recently we have started getting the following exception on the server
> nodes (though clients are still able to read/write data):
>
> 2018-12-25 02:59:48,423 ERROR [sys-#22846%a738c793-6e94-48cc-b6cf-d53ccab5f0fe%] {}
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier
> - Failed to send partition supply message to node: 5423e6b5-c9be-4eb8-8f68-e643357ec2b3
> class org.apache.ignite.IgniteCheckedException: Could not find start pointer for partition [part=9, partCntrSince=484857]
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.historicalIterator(GridCacheOffheapManager.java:792)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.historicalIterator(GridCacheOffheapManager.java:90)
>     at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.rebalanceIterator(IgniteCacheOffheapManagerImpl.java:893)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:283)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:364)
>     at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:379)
>     at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364)
>     at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1054)
>     at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:579)
>     at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:99)
>     at org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1603)
>     at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
>     at org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:125)
>     at org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2752)
>     at org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1516)
>     at org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:125)
>     at org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1485)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
>
>
> Does anyone have any idea about the exception, and a possible resolution
> as well?


Failed to send partition supply message to node: 5423e6b5-c9be-4eb8-8f68-e643357ec2b3 class org.apache.ignite.IgniteCheckedException: Could not find start pointer for partition

2018-12-26 Thread aMark
Hi, 

We are using Ignite 2.6 as a persistent store in partitioned mode, with 12
server nodes running in a cluster; each node runs on a different machine.

There are also around 48 client JVMs which connect to the cluster to fetch
the data.

Recently we have started getting the following exception on the server
nodes (though clients are still able to read/write data):

2018-12-25 02:59:48,423 ERROR [sys-#22846%a738c793-6e94-48cc-b6cf-d53ccab5f0fe%] {}
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier
- Failed to send partition supply message to node: 5423e6b5-c9be-4eb8-8f68-e643357ec2b3
class org.apache.ignite.IgniteCheckedException: Could not find start pointer for partition [part=9, partCntrSince=484857]
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.historicalIterator(GridCacheOffheapManager.java:792)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.historicalIterator(GridCacheOffheapManager.java:90)
    at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.rebalanceIterator(IgniteCacheOffheapManagerImpl.java:893)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:283)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:364)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:379)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1054)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:579)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:99)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1603)
    at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
    at org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:125)
    at org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2752)
    at org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1516)
    at org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:125)
    at org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1485)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)


Does anyone have any idea about the exception, and a possible resolution
as well?



