Re: Nodes failed to join the cluster after restarting

Ivan Bessonov Mon, 16 Nov 2020 01:00:29 -0800

Hello,

There must be a bug somewhere: during node start, the node updates its
distributed metastorage content and then tries to join an already activated
cluster, thus creating a conflict. It's hard to tell exactly which data
caused the conflict, especially without any logs.


The topic that you mentioned (
http://apache-ignite-users.70518.x6.nabble.com/Question-about-baseline-topology-and-cluster-activation-td34336.html)
seems to be about the same problem, but the issue
https://issues.apache.org/jira/browse/IGNITE-12850 is not related to it.

If you have logs from those unsuccessful restart attempts, it would be very
helpful.

Sadly, distributed metastorage is an internal component for storing settings
and has no public documentation. The developer documentation is probably
outdated and incomplete. But just in case: the "version id" that the message
refers to is stored in the field
"org.apache.ignite.internal.processors.metastorage.persistence.DistributedMetaStorageImpl#ver".
It is incremented on every distributed metastorage setting update. You can
find your error message in the same class.
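
To illustrate the idea, here is a minimal toy model of the check. It is only
a sketch and not the actual Ignite code: the class and method names
(MetastorageVersionSketch, ToyMetastorage, validateJoin) are hypothetical, and
the real validation in DistributedMetaStorageImpl does more than compare one
counter. It assumes only what is described above: the version id grows by one
on every update, and an active cluster rejects a joining node whose version id
is larger than its own.

```java
public class MetastorageVersionSketch {
    /** Toy stand-in for the "ver" field: a counter bumped on every update. */
    static final class ToyMetastorage {
        private long versionId;

        /** Simulates one settings update; the stored data itself is omitted. */
        void write(String key, String value) {
            versionId++;
        }

        long versionId() {
            return versionId;
        }
    }

    /**
     * Hypothetical join check: returns an error message if the join must be
     * rejected, or null if the node may join.
     */
    static String validateJoin(ToyMetastorage cluster, ToyMetastorage joining, boolean clusterActive) {
        if (clusterActive && joining.versionId() > cluster.versionId())
            return "Attempting to join node with larger distributed metastorage version id.";

        return null;
    }

    public static void main(String[] args) {
        ToyMetastorage cluster = new ToyMetastorage();
        ToyMetastorage restarted = new ToyMetastorage();

        // Both sides have seen the same three updates.
        for (int i = 0; i < 3; i++) {
            cluster.write("key" + i, "value");
            restarted.write("key" + i, "value");
        }

        // The restarting node makes one extra local update before joining.
        restarted.write("extra", "value");

        System.out.println(validateJoin(cluster, restarted, true));
    }
}
```

In this model the restarted node ends up one update ahead of the active
cluster, so the join is refused. That is the kind of conflict I suspect here,
but the logs are needed to confirm it.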

Please follow up with more questions and logs if possible. I hope we'll
figure it out.

Thank you!

Fri, 13 Nov 2020 at 02:23, Cong Guo <nadbpwa...@gmail.com>:

> Hi,
>
> I have a 3-node cluster with persistence enabled. All the three nodes are
> in the baseline topology. The ignite version is 2.8.1.
>
> When I restart the first node, it encounters an error and fails to join
> the cluster. The error message is "Caused by:
> org.apache.ignite.spi.IgniteSpiException: Attempting to join node with
> larger distributed metastorage version id. The node is most likely in
> invalid state and can't be joined." I try several times but get the same
> error.
>
> Then I restart the second node, it encounters the same error. After I
> restart the third node, the other two nodes can start successfully and join
> the cluster. When I restart the nodes, I do not change the baseline
> topology. I cannot reproduce this error now.
>
> I find someone else has the same problem.
> http://apache-ignite-users.70518.x6.nabble.com/Question-about-baseline-topology-and-cluster-activation-td34336.html
>
> The answer is corruption in the metastorage. I do not see any issue of the
> metastorage files. However, it is a small probability event to have files
> on two different machines corrupted at the same time. Is it possible that
> this is another bug like
> https://issues.apache.org/jira/browse/IGNITE-12850?
>
> Do you have any document about how the version id is updated and read?
> Could you please show me in the source code where the version id is read
> when a node starts and where the version id is updated when a node stops?
> Thank you!
>
>
>

-- 
Sincerely yours,
Ivan Bessonov
