Hi, sadly, the logs from the latest message show nothing. There are no visible issues with the code either; I already checked it. Sorry to say, but what we need are additional logs in the Ignite code and a stable reproducer, and we have neither.
You shouldn't worry about it, I think. It's most likely a bug that only occurs once.

Thu, Nov 19, 2020 at 02:50, Cong Guo <nadbpwa...@gmail.com>:

> Hi,
>
> I attach the log from the only working node while the two others are
> restarted. There is no error message other than the "failed to join"
> message. I do not see any clue in the log. I cannot reproduce this issue
> either. That's why I am asking about the code. Maybe you know certain
> suspicious places. Thank you.
>
> On Wed, Nov 18, 2020 at 2:45 AM Ivan Bessonov <bessonov...@gmail.com>
> wrote:
>
>> Sorry, I see that you use TcpDiscoverySpi.
>>
>> Wed, Nov 18, 2020 at 10:44, Ivan Bessonov <bessonov...@gmail.com>:
>>
>>> Hello,
>>>
>>> these parameters are configured automatically; I know that you don't
>>> configure them. And given that all of the "automatic" configuration is
>>> already complete, the chances of seeing the same bug again are low.
>>>
>>> Understanding the reason is tricky; we would need to debug the
>>> starting node or at least add more logs. Is this possible? I see that
>>> you're asking me about the code.
>>>
>>> Knowing the content of "ver" and "histCache.toArray()" in
>>> "org.apache.ignite.internal.processors.metastorage.persistence.DistributedMetaStorageImpl#collectJoiningNodeData"
>>> would certainly help. More specifically: ver.id() and
>>> Arrays.stream(histCache.toArray()).map(item ->
>>> Arrays.toString(item.keys())).collect(Collectors.joining(","))
>>>
>>> Honestly, I have no idea how your situation is even possible;
>>> otherwise we would find the solution rather quickly. Needless to say, I
>>> can't reproduce it. The error message you see was created for the case
>>> when you join your node to the wrong cluster.
>>>
>>> Do you have any custom code during node start? And one more question:
>>> what discovery SPI are you using, TCP or Zookeeper?
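[For reference, the diagnostic output Ivan asks for above could be produced with something like the sketch below. This is not a patch against the real DistributedMetaStorageImpl, whose internals are not public API; HistoryItem is a hypothetical stand-in for Ignite's internal history entry type, and only the two expressions quoted in the email are taken from it.]

    import java.util.Arrays;
    import java.util.stream.Collectors;

    public class MetastorageDebugLog {
        // Hypothetical stand-in for the internal history entry type; in the
        // real code, items would come from the "histCache" field of
        // DistributedMetaStorageImpl inside collectJoiningNodeData().
        static class HistoryItem {
            private final String[] keys;

            HistoryItem(String... keys) { this.keys = keys; }

            String[] keys() { return keys; }
        }

        public static void main(String[] args) {
            long verId = 42L; // stands in for ver.id()

            HistoryItem[] histCache = {
                new HistoryItem("baselineAutoAdjustEnabled"),
                new HistoryItem("baselineAutoAdjustTimeout", "sql.disabledFunctions")
            };

            // The exact expression Ivan suggests dumping: every key of every
            // history item, joined into a single line.
            String keys = Arrays.stream(histCache)
                .map(item -> Arrays.toString(item.keys()))
                .collect(Collectors.joining(","));

            System.out.println("DMS ver.id=" + verId + ", histCache keys=" + keys);
        }
    }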
>>> Wed, Nov 18, 2020 at 02:29, Cong Guo <nadbpwa...@gmail.com>:
>>>
>>>> Hi,
>>>>
>>>> The parameter values on the two other nodes are the same. Actually, I
>>>> do not configure these values; when you enable native persistence, you
>>>> see these log lines by default. Nothing is special. When this error
>>>> occurs on the restarting node, nothing happens on the two other nodes.
>>>> When I restart the second node, it also fails with the same error.
>>>>
>>>> I will still need to restart the nodes in the future, one by one,
>>>> without stopping the service, so this issue may happen again. The
>>>> workaround requires deactivating the cluster and stopping the service,
>>>> which does not work in a production environment.
>>>>
>>>> I think we need to fix this bug, or at least understand the reason, so
>>>> we can avoid it. Could you please tell me where this version value
>>>> could be modified when a node just starts? Do you have any guess about
>>>> this bug now? I can help analyze the code. Thank you.
>>>>
>>>> On Tue, Nov 17, 2020 at 4:09 AM Ivan Bessonov <bessonov...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thank you for the reply!
>>>>>
>>>>> Right now the only existing distributed properties I see are these:
>>>>> - Baseline parameter 'baselineAutoAdjustEnabled' was changed from
>>>>> 'null' to 'false'
>>>>> - Baseline parameter 'baselineAutoAdjustTimeout' was changed from
>>>>> 'null' to '300000'
>>>>> - SQL parameter 'sql.disabledFunctions' was changed from 'null' to
>>>>> '[FILE_WRITE, CANCEL_SESSION, MEMORY_USED, CSVREAD, LINK_SCHEMA,
>>>>> MEMORY_FREE, FILE_READ, CSVWRITE, SESSION_ID, LOCK_MODE]'
>>>>>
>>>>> I wonder what values they have on the nodes that rejected the new
>>>>> node. I suggest sending the logs of those nodes as well.
>>>>> Right now I believe that this bug won't happen again on your
>>>>> installation, but that only makes it more elusive...
>>>>>
>>>>> The most probable reason is that the node (somehow) initialized some
>>>>> properties with defaults before joining the cluster, while the cluster
>>>>> didn't have those values at all.
>>>>> The rule is that an activated cluster can't accept changed properties
>>>>> from a joining node. So the workaround would be deactivating the
>>>>> cluster, joining the node, and activating it again. But as I said, I
>>>>> don't think you'll see this bug ever again.
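[The workaround Ivan describes above can be driven from Ignite's public cluster API. A minimal sketch, assuming it is run from a node that is already part of the cluster and that Ignition.start() can find a default configuration, e.g. under IGNITE_HOME:]

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;

    public class DeactivateJoinActivate {
        public static void main(String[] args) {
            // Assumes a default configuration is available to Ignition.
            Ignite ignite = Ignition.start();

            // Deactivate the cluster so that, per the rule described above,
            // the join-time check on changed properties no longer applies.
            ignite.cluster().active(false);

            // ... start the node that previously failed to join here ...

            // Activate the cluster again once the node has joined.
            ignite.cluster().active(true);
        }
    }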
>>>>> Tue, Nov 17, 2020 at 07:34, Cong Guo <nadbpwa...@gmail.com>:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Please find the attached log for a complete but failed reboot. You
>>>>>> can see the exceptions.
>>>>>>
>>>>>> On Mon, Nov 16, 2020 at 4:00 AM Ivan Bessonov <bessonov...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> there must be a bug somewhere during node start: the node updates
>>>>>>> its distributed metastorage content and then tries to join an
>>>>>>> already activated cluster, thus creating a conflict. It's hard to
>>>>>>> tell which data caused the conflict, especially without any logs.
>>>>>>>
>>>>>>> The topic that you mentioned (
>>>>>>> http://apache-ignite-users.70518.x6.nabble.com/Question-about-baseline-topology-and-cluster-activation-td34336.html)
>>>>>>> seems to be about the same problem, but the issue
>>>>>>> https://issues.apache.org/jira/browse/IGNITE-12850 is not related
>>>>>>> to it.
>>>>>>>
>>>>>>> If you have logs from those unsuccessful restart attempts, they
>>>>>>> would be very helpful.
>>>>>>>
>>>>>>> Sadly, distributed metastorage is an internal component for storing
>>>>>>> settings and has no public documentation. The developer
>>>>>>> documentation is probably outdated and incomplete. But just in case:
>>>>>>> the "version id" that the message refers to is the field
>>>>>>> "org.apache.ignite.internal.processors.metastorage.persistence.DistributedMetaStorageImpl#ver",
>>>>>>> and it is incremented on every distributed metastorage setting
>>>>>>> update. You can find your error message in the same class.
>>>>>>>
>>>>>>> Please follow up with more questions and logs if possible; I hope
>>>>>>> we'll figure it out.
>>>>>>>
>>>>>>> Thank you!
>>>>>>>
>>>>>>> Fri, Nov 13, 2020 at 02:23, Cong Guo <nadbpwa...@gmail.com>:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have a 3-node cluster with persistence enabled. All three nodes
>>>>>>>> are in the baseline topology. The Ignite version is 2.8.1.
>>>>>>>>
>>>>>>>> When I restart the first node, it encounters an error and fails to
>>>>>>>> join the cluster. The error message is "Caused by:
>>>>>>>> org.apache.ignite.spi.IgniteSpiException: Attempting to join node
>>>>>>>> with larger distributed metastorage version id. The node is most
>>>>>>>> likely in invalid state and can't be joined." I try several times
>>>>>>>> but get the same error.
>>>>>>>>
>>>>>>>> Then I restart the second node, and it encounters the same error.
>>>>>>>> After I restart the third node, the other two nodes can start
>>>>>>>> successfully and join the cluster. When I restart the nodes, I do
>>>>>>>> not change the baseline topology. I cannot reproduce this error now.
>>>>>>>>
>>>>>>>> I find that someone else has had the same problem:
>>>>>>>> http://apache-ignite-users.70518.x6.nabble.com/Question-about-baseline-topology-and-cluster-activation-td34336.html
>>>>>>>>
>>>>>>>> The answer there was corruption in the metastorage. I do not see
>>>>>>>> any issue with the metastorage files. Besides, it would be a very
>>>>>>>> low-probability event for files on two different machines to be
>>>>>>>> corrupted at the same time. Is it possible that this is another bug
>>>>>>>> like https://issues.apache.org/jira/browse/IGNITE-12850?
>>>>>>>>
>>>>>>>> Do you have any documentation on how the version id is updated and
>>>>>>>> read? Could you please show me in the source code where the version
>>>>>>>> id is read when a node starts and where it is updated when a node
>>>>>>>> stops? Thank you!

--
Sincerely yours,
Ivan Bessonov
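[To make the failure mode discussed in this thread concrete: the check behind the error message boils down to comparing the joining node's persisted distributed metastorage version id against the cluster's. The sketch below is a simplified stand-alone illustration of that rule, not Ignite's actual validation code, which lives in DistributedMetaStorageImpl:]

    public class JoinValidationSketch {
        // Simplified model of the join-validation rule described in this
        // thread, not Ignite's actual code: a node whose persisted
        // metastorage version id is ahead of the cluster's cannot join an
        // already activated cluster.
        static String validateJoin(long clusterVerId, long joiningNodeVerId) {
            if (joiningNodeVerId > clusterVerId)
                return "Attempting to join node with larger distributed " +
                    "metastorage version id. The node is most likely in " +
                    "invalid state and can't be joined.";

            return null; // null means the join is allowed
        }

        public static void main(String[] args) {
            // The cluster is at version 10, but the restarting node somehow
            // persisted version 11 before joining: the join is rejected.
            System.out.println(validateJoin(10L, 11L));
        }
    }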