Sorry, I see that you use TcpDiscoverySpi.

Wed, Nov 18, 2020 at 10:44, Ivan Bessonov <bessonov...@gmail.com>:
> Hello,
>
> these parameters are configured automatically, I know that you don't
> configure them. And since all of the "automatic" configuration is
> complete, the chances of seeing the same bug again are low.
>
> Understanding the reason is tricky; we would need to debug the starting
> node or at least add more logs. Is this possible? I see that you're
> asking me about the code.
>
> Knowing the content of "ver" and "histCache.toArray()" in
> "org.apache.ignite.internal.processors.metastorage.persistence.DistributedMetaStorageImpl#collectJoiningNodeData"
> would certainly help. More specifically: ver.id() and
> Arrays.stream(histCache.toArray()).map(item -> Arrays.toString(item.keys())).collect(Collectors.joining(","))
>
> Honestly, I have no idea how your situation is even possible; otherwise
> we would find the solution rather quickly. Needless to say, I can't
> reproduce it. The error message that you see was created for the case
> when you join your node to the wrong cluster.
>
> Do you have any custom code during the node start? And one more
> question: which discovery SPI are you using, TCP or ZooKeeper?
>
> Wed, Nov 18, 2020 at 02:29, Cong Guo <nadbpwa...@gmail.com>:
>
>> Hi,
>>
>> The parameter values on the two other nodes are the same. Actually, I
>> do not configure these values. When you enable native persistence, you
>> will see these logs by default. Nothing is special. When this error
>> occurs on the restarting node, nothing happens on the two other nodes.
>> When I restart the second node, it also fails due to the same error.
>>
>> I will still need to restart the nodes in the future, one by one,
>> without stopping the service. This issue may happen again. The
>> workaround requires deactivating the cluster and stopping the service,
>> which does not work in a production environment.
>>
>> I think we need to fix this bug, or at least understand the reason, so
>> we can avoid it.
>> Could you please tell me where this version value could be modified
>> when a node just starts? Do you have any guesses about this bug now? I
>> can help analyze the code. Thank you.
>>
>> On Tue, Nov 17, 2020 at 4:09 AM Ivan Bessonov <bessonov...@gmail.com>
>> wrote:
>>
>>> Thank you for the reply!
>>>
>>> Right now the only existing distributed properties I see are these:
>>> - Baseline parameter 'baselineAutoAdjustEnabled' was changed from
>>>   'null' to 'false'
>>> - Baseline parameter 'baselineAutoAdjustTimeout' was changed from
>>>   'null' to '300000'
>>> - SQL parameter 'sql.disabledFunctions' was changed from 'null' to
>>>   '[FILE_WRITE, CANCEL_SESSION, MEMORY_USED, CSVREAD, LINK_SCHEMA,
>>>   MEMORY_FREE, FILE_READ, CSVWRITE, SESSION_ID, LOCK_MODE]'
>>>
>>> I wonder what values they have on the nodes that rejected the new
>>> node. I suggest sending the logs of those nodes as well. Right now I
>>> believe that this bug won't happen again on your installation, but
>>> that only makes it more elusive...
>>>
>>> The most probable reason is that the node (somehow) initialized some
>>> properties with defaults before joining the cluster, while the cluster
>>> didn't have those values at all. The rule is that an activated cluster
>>> can't accept changed properties from a joining node. So the workaround
>>> would be deactivating the cluster, joining the node, and activating it
>>> again. But as I said, I don't think you'll ever see this bug again.
>>>
>>> Tue, Nov 17, 2020 at 07:34, Cong Guo <nadbpwa...@gmail.com>:
>>>
>>>> Hi,
>>>>
>>>> Please find the attached log for a complete but failed reboot. You
>>>> can see the exceptions.
>>>>
>>>> On Mon, Nov 16, 2020 at 4:00 AM Ivan Bessonov
>>>> <bessonov...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> there must be a bug somewhere during node start: the node updates
>>>>> its distributed metastorage content and then tries to join an
>>>>> already activated cluster, thus creating a conflict.
>>>>> It's hard to tell the exact data that caused the conflict,
>>>>> especially without any logs.
>>>>>
>>>>> The topic that you mentioned (
>>>>> http://apache-ignite-users.70518.x6.nabble.com/Question-about-baseline-topology-and-cluster-activation-td34336.html)
>>>>> seems to be about the same problem, but the issue
>>>>> https://issues.apache.org/jira/browse/IGNITE-12850 is not related
>>>>> to it.
>>>>>
>>>>> If you have logs from those unsuccessful restart attempts, they
>>>>> would be very helpful.
>>>>>
>>>>> Sadly, distributed metastorage is an internal component for storing
>>>>> settings and has no public documentation. The developer
>>>>> documentation is probably outdated and incomplete. But just in
>>>>> case: the "version id" that the message refers to is located in the
>>>>> field
>>>>> "org.apache.ignite.internal.processors.metastorage.persistence.DistributedMetaStorageImpl#ver",
>>>>> and it is incremented on every distributed metastorage setting
>>>>> update. You can find your error message in the same class.
>>>>>
>>>>> Please follow up with more questions and logs if possible; I hope
>>>>> we'll figure it out.
>>>>>
>>>>> Thank you!
>>>>>
>>>>> Fri, Nov 13, 2020 at 02:23, Cong Guo <nadbpwa...@gmail.com>:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have a 3-node cluster with persistence enabled. All three nodes
>>>>>> are in the baseline topology. The Ignite version is 2.8.1.
>>>>>>
>>>>>> When I restart the first node, it encounters an error and fails to
>>>>>> join the cluster. The error message is "Caused by:
>>>>>> org.apache.ignite.spi.IgniteSpiException: Attempting to join node
>>>>>> with larger distributed metastorage version id. The node is most
>>>>>> likely in invalid state and can't be joined." I try several times
>>>>>> but get the same error.
>>>>>>
>>>>>> Then I restart the second node, and it encounters the same error.
>>>>>> After I restart the third node, the other two nodes can start
>>>>>> successfully and join the cluster.
>>>>>> When I restart the nodes, I do not change the baseline topology.
>>>>>> I cannot reproduce this error now.
>>>>>>
>>>>>> I found that someone else has had the same problem:
>>>>>> http://apache-ignite-users.70518.x6.nabble.com/Question-about-baseline-topology-and-cluster-activation-td34336.html
>>>>>>
>>>>>> The answer there is corruption in the metastorage. I do not see
>>>>>> any issue with the metastorage files. However, having files on two
>>>>>> different machines corrupted at the same time is a low-probability
>>>>>> event. Is it possible that this is another bug like
>>>>>> https://issues.apache.org/jira/browse/IGNITE-12850?
>>>>>>
>>>>>> Do you have any documentation about how the version id is updated
>>>>>> and read? Could you please show me in the source code where the
>>>>>> version id is read when a node starts and where it is updated when
>>>>>> a node stops? Thank you!

--
Sincerely yours,
Ivan Bessonov
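
[Editor's note] The diagnostic expression suggested in the thread can be tried out in isolation. The sketch below is not Ignite code: `HistItem` is a hypothetical stand-in for the internal history-item type held by `histCache` in `DistributedMetaStorageImpl`, and the key names are made-up examples. Only the stream expression itself is taken verbatim from the thread.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class MetastorageDiagnostics {

    // Hypothetical stand-in for Ignite's internal history-item class;
    // assumed only to expose keys() the way the thread's expression uses it.
    static final class HistItem {
        private final String[] keys;

        HistItem(String... keys) { this.keys = keys; }

        String[] keys() { return keys; }
    }

    // The expression from the thread, wrapped in a helper:
    // Arrays.stream(histCache.toArray())
    //       .map(item -> Arrays.toString(item.keys()))
    //       .collect(Collectors.joining(","))
    static String joinedKeys(HistItem[] histItems) {
        return Arrays.stream(histItems)
            .map(item -> Arrays.toString(item.keys()))
            .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // Stand-in data for histCache.toArray(); key names are examples only.
        HistItem[] histItems = {
            new HistItem("baselineAutoAdjustEnabled"),
            new HistItem("baselineAutoAdjustTimeout", "sql.disabledFunctions")
        };

        System.out.println(joinedKeys(histItems));
        // prints: [baselineAutoAdjustEnabled],[baselineAutoAdjustTimeout, sql.disabledFunctions]
    }
}
```

Evaluating the real expression in a debugger (or a temporary log line) inside `collectJoiningNodeData` would print one bracketed key list per history entry, which is exactly the information requested above.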
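
[Editor's note] The version-id rule described in the thread — every distributed metastorage update increments `ver`, and an activated cluster rejects a joining node whose version id is ahead of its own — can be illustrated with a toy model. This is not Ignite's actual implementation; the class, fields, and check below are simplified assumptions that only mirror the behavior described above.

```java
import java.util.HashMap;
import java.util.Map;

public class MetaStorageModel {
    // Toy counterpart of DistributedMetaStorageImpl#ver (simplified assumption).
    private long verId;

    private final Map<String, String> props = new HashMap<>();

    long versionId() { return verId; }

    // Every metastorage settings update bumps the version id,
    // as described in the thread.
    void write(String key, String value) {
        props.put(key, value);
        verId++;
    }

    // Join-time check: the cluster rejects a node whose metastorage
    // version id is larger than the cluster's own.
    void validateJoin(MetaStorageModel joining) {
        if (joining.verId > this.verId)
            throw new IllegalStateException(
                "Attempting to join node with larger distributed metastorage version id.");
    }

    public static void main(String[] args) {
        MetaStorageModel cluster = new MetaStorageModel();
        cluster.write("baselineAutoAdjustEnabled", "false");   // cluster id -> 1

        MetaStorageModel node = new MetaStorageModel();
        node.write("baselineAutoAdjustEnabled", "false");      // node id -> 1
        node.write("sql.disabledFunctions", "example-list");   // node id -> 2

        try {
            cluster.validateJoin(node); // node is ahead (2 > 1) -> rejected
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

This matches the failure mode hypothesized in the thread: if a starting node somehow writes properties (and bumps its local version id) before joining, the activated cluster sees a larger version id on the joiner and refuses the join.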