Hi, sadly, the logs from the latest message show nothing. There are no visible issues with the code either; I already checked it. Sorry to say, but what we need are additional logs in the Ignite code and a stable reproducer, and we have neither.
You shouldn't worry about it, I think. It's most likely a bug that only occurs once.

Thu, Nov 19, 2020 at 02:50, Cong Guo <nadbpwa...@gmail.com>:

> Hi,
>
> I attach the log from the only working node while the two others are
> restarted. There is no error message other than the "failed to join"
> message. I do not see any clue in the log. I cannot reproduce this issue
> either. That's why I am asking about the code. Maybe you know certain
> suspicious places. Thank you.
>
> On Wed, Nov 18, 2020 at 2:45 AM Ivan Bessonov <bessonov...@gmail.com>
> wrote:
>
>> Sorry, I see that you use TcpDiscoverySpi.
>>
>> Wed, Nov 18, 2020 at 10:44, Ivan Bessonov <bessonov...@gmail.com>:
>>
>>> Hello,
>>>
>>> these parameters are configured automatically; I know that you don't
>>> configure them. And given that all of the "automatic" configuration is
>>> already complete, the chances of seeing the same bug again are low.
>>>
>>> Understanding the reason is tricky; we would need to debug the
>>> starting node or at least add more logs. Is this possible? I see that
>>> you're asking me about the code.
>>>
>>> Knowing the content of "ver" and "histCache.toArray()" in
>>> "org.apache.ignite.internal.processors.metastorage.persistence.DistributedMetaStorageImpl#collectJoiningNodeData"
>>> would certainly help. More specifically: ver.id() and
>>> Arrays.stream(histCache.toArray()).map(item ->
>>> Arrays.toString(item.keys())).collect(Collectors.joining(","))
>>>
>>> Honestly, I have no idea how your situation is even possible;
>>> otherwise we would find the solution rather quickly. Needless to say, I
>>> can't reproduce it. The error message you see was created for the case
>>> when you join your node to the wrong cluster.
>>>
>>> Do you have any custom code during node start? And one more question:
>>> what discovery SPI are you using, TCP or Zookeeper?
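[For reference, the diagnostic output Ivan asks for above could be produced with something like the sketch below. This is not a patch against the real DistributedMetaStorageImpl, whose internals are not public API; HistoryItem is a hypothetical stand-in for Ignite's internal history entry type, and only the two expressions quoted in the email are taken from it.]

    import java.util.Arrays;
    import java.util.stream.Collectors;

    public class MetastorageDebugLog {
        // Hypothetical stand-in for the internal history entry type; in the
        // real code, items would come from the "histCache" field of
        // DistributedMetaStorageImpl inside collectJoiningNodeData().
        static class HistoryItem {
            private final String[] keys;

            HistoryItem(String... keys) { this.keys = keys; }

            String[] keys() { return keys; }
        }

        public static void main(String[] args) {
            long verId = 42L; // stands in for ver.id()

            HistoryItem[] histCache = {
                new HistoryItem("baselineAutoAdjustEnabled"),
                new HistoryItem("baselineAutoAdjustTimeout", "sql.disabledFunctions")
            };

            // The exact expression Ivan suggests dumping: every key of every
            // history item, joined into a single line.
            String keys = Arrays.stream(histCache)
                .map(item -> Arrays.toString(item.keys()))
                .collect(Collectors.joining(","));

            System.out.println("DMS ver.id=" + verId + ", histCache keys=" + keys);
        }
    }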
>>> Wed, Nov 18, 2020 at 02:29, Cong Guo <nadbpwa...@gmail.com>:
>>>
>>>> Hi,
>>>>
>>>> The parameter values on the two other nodes are the same. Actually, I
>>>> do not configure these values; when you enable native persistence, you
>>>> see these log lines by default. Nothing is special. When this error
>>>> occurs on the restarting node, nothing happens on the two other nodes.
>>>> When I restart the second node, it also fails with the same error.
>>>>
>>>> I will still need to restart the nodes in the future, one by one,
>>>> without stopping the service, so this issue may happen again. The
>>>> workaround requires deactivating the cluster and stopping the service,
>>>> which does not work in a production environment.
>>>>
>>>> I think we need to fix this bug, or at least understand the reason, so
>>>> we can avoid it. Could you please tell me where this version value
>>>> could be modified when a node just starts? Do you have any guess about
>>>> this bug now? I can help analyze the code. Thank you.
>>>>
>>>> On Tue, Nov 17, 2020 at 4:09 AM Ivan Bessonov <bessonov...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thank you for the reply!
>>>>>
>>>>> Right now the only existing distributed properties I see are these:
>>>>> - Baseline parameter 'baselineAutoAdjustEnabled' was changed from
>>>>> 'null' to 'false'
>>>>> - Baseline parameter 'baselineAutoAdjustTimeout' was changed from
>>>>> 'null' to '300000'
>>>>> - SQL parameter 'sql.disabledFunctions' was changed from 'null' to
>>>>> '[FILE_WRITE, CANCEL_SESSION, MEMORY_USED, CSVREAD, LINK_SCHEMA,
>>>>> MEMORY_FREE, FILE_READ, CSVWRITE, SESSION_ID, LOCK_MODE]'
>>>>>
>>>>> I wonder what values they have on the nodes that rejected the new
>>>>> node. I suggest sending the logs of those nodes as well.
>>>>> Right now I believe that this bug won't happen again on your
>>>>> installation, but that only makes it more elusive...
>>>>>
>>>>> The most probable reason is that the node (somehow) initialized some
>>>>> properties with defaults before joining the cluster, while the cluster
>>>>> didn't have those values at all.
>>>>> The rule is that an activated cluster can't accept changed properties
>>>>> from a joining node. So the workaround would be deactivating the
>>>>> cluster, joining the node, and activating it again. But as I said, I
>>>>> don't think you'll see this bug ever again.
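[The workaround Ivan describes above can be driven from Ignite's public cluster API. A minimal sketch, assuming it is run from a node that is already part of the cluster and that Ignition.start() can find a default configuration, e.g. under IGNITE_HOME:]

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;

    public class DeactivateJoinActivate {
        public static void main(String[] args) {
            // Assumes a default configuration is available to Ignition.
            Ignite ignite = Ignition.start();

            // Deactivate the cluster so that, per the rule described above,
            // the join-time check on changed properties no longer applies.
            ignite.cluster().active(false);

            // ... start the node that previously failed to join here ...

            // Activate the cluster again once the node has joined.
            ignite.cluster().active(true);
        }
    }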
>>>>> Tue, Nov 17, 2020 at 07:34, Cong Guo <nadbpwa...@gmail.com>:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Please find the attached log for a complete but failed reboot. You
>>>>>> can see the exceptions.
>>>>>>
>>>>>> On Mon, Nov 16, 2020 at 4:00 AM Ivan Bessonov <bessonov...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> there must be a bug somewhere during node start: the node updates
>>>>>>> its distributed metastorage content and then tries to join an
>>>>>>> already activated cluster, thus creating a conflict. It's hard to
>>>>>>> tell which data caused the conflict, especially without any logs.
>>>>>>>
>>>>>>> The topic that you mentioned (
>>>>>>> http://apache-ignite-users.70518.x6.nabble.com/Question-about-baseline-topology-and-cluster-activation-td34336.html)
>>>>>>> seems to be about the same problem, but the issue
>>>>>>> https://issues.apache.org/jira/browse/IGNITE-12850 is not related
>>>>>>> to it.
>>>>>>>
>>>>>>> If you have logs from those unsuccessful restart attempts, they
>>>>>>> would be very helpful.
>>>>>>>
>>>>>>> Sadly, distributed metastorage is an internal component for storing
>>>>>>> settings and has no public documentation. The developer
>>>>>>> documentation is probably outdated and incomplete. But just in case:
>>>>>>> the "version id" that the message refers to is the field
>>>>>>> "org.apache.ignite.internal.processors.metastorage.persistence.DistributedMetaStorageImpl#ver",
>>>>>>> and it is incremented on every distributed metastorage setting
>>>>>>> update. You can find your error message in the same class.
>>>>>>>
>>>>>>> Please follow up with more questions and logs if possible; I hope
>>>>>>> we'll figure it out.
>>>>>>>
>>>>>>> Thank you!
>>>>>>>
>>>>>>> Fri, Nov 13, 2020 at 02:23, Cong Guo <nadbpwa...@gmail.com>:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have a 3-node cluster with persistence enabled. All three nodes
>>>>>>>> are in the baseline topology. The Ignite version is 2.8.1.
>>>>>>>>
>>>>>>>> When I restart the first node, it encounters an error and fails to
>>>>>>>> join the cluster. The error message is "Caused by:
>>>>>>>> org.apache.ignite.spi.IgniteSpiException: Attempting to join node
>>>>>>>> with larger distributed metastorage version id. The node is most
>>>>>>>> likely in invalid state and can't be joined." I try several times
>>>>>>>> but get the same error.
>>>>>>>>
>>>>>>>> Then I restart the second node, and it encounters the same error.
>>>>>>>> After I restart the third node, the other two nodes can start
>>>>>>>> successfully and join the cluster. When I restart the nodes, I do
>>>>>>>> not change the baseline topology. I cannot reproduce this error now.
>>>>>>>>
>>>>>>>> I find that someone else has had the same problem:
>>>>>>>> http://apache-ignite-users.70518.x6.nabble.com/Question-about-baseline-topology-and-cluster-activation-td34336.html
>>>>>>>>
>>>>>>>> The answer there was corruption in the metastorage. I do not see
>>>>>>>> any issue with the metastorage files. Besides, it would be a very
>>>>>>>> low-probability event for files on two different machines to be
>>>>>>>> corrupted at the same time. Is it possible that this is another bug
>>>>>>>> like https://issues.apache.org/jira/browse/IGNITE-12850?
>>>>>>>>
>>>>>>>> Do you have any documentation on how the version id is updated and
>>>>>>>> read? Could you please show me in the source code where the version
>>>>>>>> id is read when a node starts and where it is updated when a node
>>>>>>>> stops? Thank you!

--
Sincerely yours,
Ivan Bessonov
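[To make the failure mode discussed in this thread concrete: the check behind the error message boils down to comparing the joining node's persisted distributed metastorage version id against the cluster's. The sketch below is a simplified stand-alone illustration of that rule, not Ignite's actual validation code, which lives in DistributedMetaStorageImpl:]

    public class JoinValidationSketch {
        // Simplified model of the join-validation rule described in this
        // thread, not Ignite's actual code: a node whose persisted
        // metastorage version id is ahead of the cluster's cannot join an
        // already activated cluster.
        static String validateJoin(long clusterVerId, long joiningNodeVerId) {
            if (joiningNodeVerId > clusterVerId)
                return "Attempting to join node with larger distributed " +
                    "metastorage version id. The node is most likely in " +
                    "invalid state and can't be joined.";

            return null; // null means the join is allowed
        }

        public static void main(String[] args) {
            // The cluster is at version 10, but the restarting node somehow
            // persisted version 11 before joining: the join is rejected.
            System.out.println(validateJoin(10L, 11L));
        }
    }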