Sorry, I see that you use TcpDiscoverySpi.

Wed, Nov 18, 2020 at 10:44, Ivan Bessonov <bessonov...@gmail.com>:
> Hello,
>
> these parameters are configured automatically, I know that you don't
> configure them. And since all of the "automatic" configuration is
> complete, the chances of seeing the same bug again are low.
>
> Understanding the reason is tricky; we would need to debug the starting
> node or at least add more logs. Is this possible? I see that you're
> asking me about the code.
>
> Knowing the content of "ver" and "histCache.toArray()" in
> "org.apache.ignite.internal.processors.metastorage.persistence.DistributedMetaStorageImpl#collectJoiningNodeData"
> would certainly help. More specifically: ver.id() and
> Arrays.stream(histCache.toArray()).map(item -> Arrays.toString(item.keys())).collect(Collectors.joining(","))
>
> Honestly, I have no idea how your situation is even possible; otherwise
> we would find the solution rather quickly. Needless to say, I can't
> reproduce it. The error message that you see was created for the case
> when you join your node to the wrong cluster.
>
> Do you have any custom code during the node start? And one more
> question: which discovery SPI are you using, TCP or ZooKeeper?
>
> Wed, Nov 18, 2020 at 02:29, Cong Guo <nadbpwa...@gmail.com>:
>
>> Hi,
>>
>> The parameter values on the two other nodes are the same. Actually, I
>> do not configure these values. When you enable native persistence, you
>> will see these logs by default. Nothing is special. When this error
>> occurs on the restarting node, nothing happens on the two other nodes.
>> When I restart the second node, it also fails due to the same error.
>>
>> I will still need to restart the nodes in the future, one by one,
>> without stopping the service. This issue may happen again. The
>> workaround requires deactivating the cluster and stopping the service,
>> which does not work in a production environment.
>>
>> I think we need to fix this bug, or at least understand the reason, so
>> we can avoid it.
>> Could you please tell me where this version value could be modified
>> when a node just starts? Do you have any guesses about this bug now? I
>> can help analyze the code. Thank you.
>>
>> On Tue, Nov 17, 2020 at 4:09 AM Ivan Bessonov <bessonov...@gmail.com>
>> wrote:
>>
>>> Thank you for the reply!
>>>
>>> Right now the only existing distributed properties I see are these:
>>> - Baseline parameter 'baselineAutoAdjustEnabled' was changed from
>>>   'null' to 'false'
>>> - Baseline parameter 'baselineAutoAdjustTimeout' was changed from
>>>   'null' to '300000'
>>> - SQL parameter 'sql.disabledFunctions' was changed from 'null' to
>>>   '[FILE_WRITE, CANCEL_SESSION, MEMORY_USED, CSVREAD, LINK_SCHEMA,
>>>   MEMORY_FREE, FILE_READ, CSVWRITE, SESSION_ID, LOCK_MODE]'
>>>
>>> I wonder what values they have on the nodes that rejected the new
>>> node. I suggest sending the logs of those nodes as well. Right now I
>>> believe that this bug won't happen again on your installation, but
>>> that only makes it more elusive...
>>>
>>> The most probable reason is that the node (somehow) initialized some
>>> properties with defaults before joining the cluster, while the cluster
>>> didn't have those values at all. The rule is that an activated cluster
>>> can't accept changed properties from a joining node. So the workaround
>>> would be deactivating the cluster, joining the node, and activating it
>>> again. But as I said, I don't think you'll ever see this bug again.
>>>
>>> Tue, Nov 17, 2020 at 07:34, Cong Guo <nadbpwa...@gmail.com>:
>>>
>>>> Hi,
>>>>
>>>> Please find the attached log for a complete but failed reboot. You
>>>> can see the exceptions.
>>>>
>>>> On Mon, Nov 16, 2020 at 4:00 AM Ivan Bessonov
>>>> <bessonov...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> there must be a bug somewhere during node start: the node updates
>>>>> its distributed metastorage content and then tries to join an
>>>>> already activated cluster, thus creating a conflict.
>>>>> It's hard to tell the exact data that caused the conflict,
>>>>> especially without any logs.
>>>>>
>>>>> The topic that you mentioned (
>>>>> http://apache-ignite-users.70518.x6.nabble.com/Question-about-baseline-topology-and-cluster-activation-td34336.html)
>>>>> seems to be about the same problem, but the issue
>>>>> https://issues.apache.org/jira/browse/IGNITE-12850 is not related
>>>>> to it.
>>>>>
>>>>> If you have logs from those unsuccessful restart attempts, they
>>>>> would be very helpful.
>>>>>
>>>>> Sadly, distributed metastorage is an internal component for storing
>>>>> settings and has no public documentation. The developer
>>>>> documentation is probably outdated and incomplete. But just in
>>>>> case: the "version id" that the message refers to is located in the
>>>>> field
>>>>> "org.apache.ignite.internal.processors.metastorage.persistence.DistributedMetaStorageImpl#ver",
>>>>> and it is incremented on every distributed metastorage setting
>>>>> update. You can find your error message in the same class.
>>>>>
>>>>> Please follow up with more questions and logs if possible; I hope
>>>>> we'll figure it out.
>>>>>
>>>>> Thank you!
>>>>>
>>>>> Fri, Nov 13, 2020 at 02:23, Cong Guo <nadbpwa...@gmail.com>:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have a 3-node cluster with persistence enabled. All three nodes
>>>>>> are in the baseline topology. The Ignite version is 2.8.1.
>>>>>>
>>>>>> When I restart the first node, it encounters an error and fails to
>>>>>> join the cluster. The error message is "Caused by:
>>>>>> org.apache.ignite.spi.IgniteSpiException: Attempting to join node
>>>>>> with larger distributed metastorage version id. The node is most
>>>>>> likely in invalid state and can't be joined." I try several times
>>>>>> but get the same error.
>>>>>>
>>>>>> Then I restart the second node, and it encounters the same error.
>>>>>> After I restart the third node, the other two nodes can start
>>>>>> successfully and join the cluster.
>>>>>> When I restart the nodes, I do not change the baseline topology.
>>>>>> I cannot reproduce this error now.
>>>>>>
>>>>>> I found that someone else has had the same problem:
>>>>>> http://apache-ignite-users.70518.x6.nabble.com/Question-about-baseline-topology-and-cluster-activation-td34336.html
>>>>>>
>>>>>> The answer there is corruption in the metastorage. I do not see
>>>>>> any issue with the metastorage files. However, having files on two
>>>>>> different machines corrupted at the same time is a low-probability
>>>>>> event. Is it possible that this is another bug like
>>>>>> https://issues.apache.org/jira/browse/IGNITE-12850?
>>>>>>
>>>>>> Do you have any documentation about how the version id is updated
>>>>>> and read? Could you please show me in the source code where the
>>>>>> version id is read when a node starts and where it is updated when
>>>>>> a node stops? Thank you!

--
Sincerely yours,
Ivan Bessonov
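
[Editor's note] The diagnostic expression suggested in the thread can be tried out in isolation. The sketch below is not Ignite code: `HistItem` is a hypothetical stand-in for the internal history-item type held by `histCache` in `DistributedMetaStorageImpl`, and the key names are made-up examples. Only the stream expression itself is taken verbatim from the thread.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class MetastorageDiagnostics {

    // Hypothetical stand-in for Ignite's internal history-item class;
    // assumed only to expose keys() the way the thread's expression uses it.
    static final class HistItem {
        private final String[] keys;

        HistItem(String... keys) { this.keys = keys; }

        String[] keys() { return keys; }
    }

    // The expression from the thread, wrapped in a helper:
    // Arrays.stream(histCache.toArray())
    //       .map(item -> Arrays.toString(item.keys()))
    //       .collect(Collectors.joining(","))
    static String joinedKeys(HistItem[] histItems) {
        return Arrays.stream(histItems)
            .map(item -> Arrays.toString(item.keys()))
            .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // Stand-in data for histCache.toArray(); key names are examples only.
        HistItem[] histItems = {
            new HistItem("baselineAutoAdjustEnabled"),
            new HistItem("baselineAutoAdjustTimeout", "sql.disabledFunctions")
        };

        System.out.println(joinedKeys(histItems));
        // prints: [baselineAutoAdjustEnabled],[baselineAutoAdjustTimeout, sql.disabledFunctions]
    }
}
```

Evaluating the real expression in a debugger (or a temporary log line) inside `collectJoiningNodeData` would print one bracketed key list per history entry, which is exactly the information requested above.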
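
[Editor's note] The version-id rule described in the thread — every distributed metastorage update increments `ver`, and an activated cluster rejects a joining node whose version id is ahead of its own — can be illustrated with a toy model. This is not Ignite's actual implementation; the class, fields, and check below are simplified assumptions that only mirror the behavior described above.

```java
import java.util.HashMap;
import java.util.Map;

public class MetaStorageModel {
    // Toy counterpart of DistributedMetaStorageImpl#ver (simplified assumption).
    private long verId;

    private final Map<String, String> props = new HashMap<>();

    long versionId() { return verId; }

    // Every metastorage settings update bumps the version id,
    // as described in the thread.
    void write(String key, String value) {
        props.put(key, value);
        verId++;
    }

    // Join-time check: the cluster rejects a node whose metastorage
    // version id is larger than the cluster's own.
    void validateJoin(MetaStorageModel joining) {
        if (joining.verId > this.verId)
            throw new IllegalStateException(
                "Attempting to join node with larger distributed metastorage version id.");
    }

    public static void main(String[] args) {
        MetaStorageModel cluster = new MetaStorageModel();
        cluster.write("baselineAutoAdjustEnabled", "false");   // cluster id -> 1

        MetaStorageModel node = new MetaStorageModel();
        node.write("baselineAutoAdjustEnabled", "false");      // node id -> 1
        node.write("sql.disabledFunctions", "example-list");   // node id -> 2

        try {
            cluster.validateJoin(node); // node is ahead (2 > 1) -> rejected
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

This matches the failure mode hypothesized in the thread: if a starting node somehow writes properties (and bumps its local version id) before joining, the activated cluster sees a larger version id on the joiner and refuses the join.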