Hi, > Note: the counter used to store the next sequence number is a signed int > (4bytes) maintained by the parent node, the counter will overflow when > incremented beyond 2147483647 (resulting in a name "-2147483648").
I wrote a test[1] and found that overflow of cversion results in NODEEXISTS[2]. I think this is better than what the doc[3] says. There is no wrap-around on the client side. > In our use case, we create *persistent* *sequential* nodes. We store the > sequence id in the client application and use it as a globally unique id. Given that you have overflowed the cversion, I think you could take a look at ZOOKEEPER-4332[4]. ZooKeeper does not work well with massive children when listing. BookKeeper's HierarchicalLedgerManager[5] is a real world example for this. > New > === > if (parentCVersion > currentParentCVersion > *|| parentCVersion == Integer.MIN_VALUE && > currentParentCVersion == Integer.MAX_VALUE) *{ > parent.stat.setCversion(parent > CVersion); > parent.stat.setPzxid(zxid); > } This breaks "monotonically increasing" and gains no "uniqueness". The cversion will wrap around again given your cases. > It would be better to use a longer integer. > 1. Increasing to a 64-bit counter certainly solves the problem, but this > might require conversion of zk data when the current counter is stored as > 32-bit I support this, but it demands massive work and probably relates to a long term goal[6]. The "int" fact of "cversion" is exposed both in API(Stat) and storage(StatPersisted). [1]: https://github.com/kezhuw/zookeeper/commit/755b1168156c28e4fc2813be593ac67514e8bdc7#diff-1af986ce48b5d4bb4b8e51374a70cc6e109a04c70d9f450be3df8f302010341cR59 [2]: https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/PrepRequestProcessor.java#L675 [3]: https://zookeeper.apache.org/doc/r3.8.1/zookeeperProgrammers.html#Sequence+Nodes+--+Unique+Naming [4]: https://issues.apache.org/jira/browse/ZOOKEEPER-4332 [5]: https://bookkeeper.apache.org/docs/getting-started/concepts/#hierarchical-ledger-manager [6]: https://issues.apache.org/jira/browse/ZOOKEEPER-102 Kezhu Wang On Fri, Jun 16, 2023 at 11:33 AM Josef Roehrl <josef.roe...@fuseforward.com.invalid> wrote: > > I wanted to add 2 things. > > > 1. Increasing to a 64-bit counter certainly solves the problem, but this > might require conversion of zk data when the current counter is stored as > 32-bit > 2. A client that relies on a unique version that it uses as a reference > outside of zk should verify that a version that it receives does not already > exist outside zk. This applies even if 1. is considered, should a zk quorum > be reset or lose its data. > > Josef Roehrl > FuseForward | Senior Architect - Professional Services > [https://fuseforward.atlassian.net/wiki/download/attachments/512327681/image001.png?version=1&modificationDate=1537397840684&cacheVersion=1&api=v2] > Website<https://fuseforward.com/?utm_source=Email%20Signature&utm_medium=email%20signature&utm_campaign=email%20signature> > | > Newsletter<https://fuseforward.com/subscribe-to-our-newsletter/?utm_source=Email%20Signature&utm_medium=Email%20Signature&utm_campaign=Email%20Signature> > | Twitter<https://twitter.com/fuseforward> | > LinkedIn<https://www.linkedin.com/company/fuseforward/?originalSubdomain=ca> > > ________________________________ > From: Ted Dunning <ted.dunn...@gmail.com> > Sent: Thursday, June 15, 2023 6:53 PM > To: dev@zookeeper.apache.org <dev@zookeeper.apache.org> > Subject: Re: Name of sequence node is not unique > > The invariant is that the value should be increasing except in failure > modes. You found a somewhat surprising failure mode. > > Please compute how long it would take for a 64-bit counter to overflow if > incremented a million times per second. (hint, half a million years). > Remember that zk only does things at less than 100,000 per second > > On Thu, Jun 15, 2023, 17:03 Li Wang <li4w...@gmail.com> wrote: > > > Thanks a lot for your inputs, Ted. > > > > On Thu, Jun 15, 2023 at 2:52 PM Ted Dunning <ted.dunn...@gmail.com> wrote: > > > > > Breaking a semantic invariant is a dangerous solution here. > > > > Totally agree. We should not break a semantic invariant if there is one. > > What's the semantic invariant here and how ZK is supposed to behave in the > > overflow case? > > > > > > > It would be better to use a longer integer. > > > > > Yes, I thought about this too. Longer integer will overflow too, so the > > issue of not generating unique numbers will still exist. > > > > Best, > > > > Li > > > > > > > > > > > > > > On Thu, Jun 15, 2023, 13:35 Li Wang <li4w...@gmail.com> wrote: > > > > > > > Thanks for the response, Enrico. > > > > > > > > Please see comments below. > > > > > > > > On Thu, Jun 15, 2023 at 5:47 AM Enrico Olivelli <eolive...@gmail.com> > > > > wrote: > > > > > > > > > Li, > > > > > thanks for reporting your problem. > > > > > > > > > > Most likely you have found a bug. > > > > > > > > > > I have one question, related to your use case, > > > > > is the problem that the numbers are not "unique" or that the number > > is > > > > > not monotonically increasing ? > > > > > > > > > > > > > Technically speaking, monotonically increasing means either always > > > > increasing or *remaining constant. *With tha*t, * the problem is only > > > > the numbers are not "unique'. In this case, the parent cversion > > > > remains 2147483647 > > > > after reaching Integer.MAX_VALUE, not unique any more. > > > > > > > > > > > > > > Do you have 2147483647 concurrent sessions and you found that two > > > > > sessions got the same sequenceId ? > > > > > or are you storing the sequenceId somewhere and you use it as a > > > > > globally unique id, not only among the connected sessions but also > > > > > among all the sessions that are ever connected to the cluster ? > > > > > > > > > > > > > In our use case, we create *persistent* *sequential* nodes. We store > > the > > > > sequence id in the client application and use it as a globally unique > > id. > > > > Currently Zookeeper guarantees the following non-overflow case but not > > > > after overflow. > > > > > > > > 1. Monotonically increasing > > > > 2. Uniqueness > > > > 3. Sequentially increase by 1 > > > > > > > > For customers who have an overflow use case and can handle the sequence > > > > number cycling in a circular fashion, how about having a simple patch > > > > to support it and handle the overflow case better? The change is > > adding > > > a > > > > condition to allow the wraparound when it flows to negative. We can > > also > > > > have a property to control whether to add the additional condition if > > > > needed. > > > > > > > > New > > > > === > > > > if (parentCVersion > currentParentCVersion > > > > *|| parentCVersion == Integer.MIN_VALUE && > > > > currentParentCVersion == Integer.MAX_VALUE) *{ > > > > parent.stat.setCversion(parentCVersion); > > > > parent.stat.setPzxid(zxid); > > > > } > > > > > > > > Current > > > > ===== > > > > if (parentCVersion > parent.stat.getCversion()) { > > > > parent.stat.setCversion(parentCVersion); > > > > parent.stat.setPzxid(zxid); > > > > } > > > > > > > > Please let me know what you think. Any input or feedback would be > > > > appreciated. > > > > > > > > Thanks, > > > > > > > > Li > > > > > > > > > > > > > Enrico > > > > > > > > > > Il giorno ven 9 giu 2023 alle ore 21:10 Li Wang <li4w...@gmail.com> > > ha > > > > > scritto: > > > > > > > > > > > > Hello, > > > > > > > > > > > > We are running 3.7.1 in production and running into an "issue" that > > > the > > > > > > names of sequence nodes are not unique after the counter hits the > > max > > > > int > > > > > > (i.e 2147483647) and overflows. I would like to start a thread to > > > > > discuss > > > > > > the following > > > > > > > > > > > > 1. Is this a bug or "expected" behavior? > > > > > > 2. Is ZK supposed to support the overflow scenario and need to make > > > > sure > > > > > > the name is unique when overflow happens? > > > > > > > > > > > > The name is not unique after hitting the max int value because of > > we > > > > > > have the following in zk code base: > > > > > > > > > > > > 1. The cversion of parent znode is used to build the child name in > > > > > > PrepRequestProcessor > > > > > > > > > > > > int parentCVersion = parentRecord.stat.getCversion(); > > > > > > if (createMode.isSequential()) { > > > > > > path = path + String.format(Locale.ENGLISH, "%010d", > > > > > > parentCVersion); > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/ > > > > > > > > java/org/apache/zookeeper/server/PrepRequestProcessor.java#L668-L671 > > > > > > > > > > > > > > > > > > 2. The parent znode is read from either > > zks.outstandingChangesForPath > > > > map > > > > > > or zk database/datatree. > > > > > > > > > > > > lastChange = zks.outstandingChangesForPath.get(path); > > > > > > if (lastChange == null) { > > > > > > DataNode n = zks.getZKDatabase().getNode(path); > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/PrepRequestProcessor.java#L168-L170 > > > > > > > > > > > > > > > > > > > > > > > > 3. The cversion of the parent node in outstandingChangesForPath map > > > is > > > > > > always updated but not in zk database as we added the following > > code > > > > in > > > > > 3.6 > > > > > > > > > > > > if (parentCVersion > parent.stat.getCversion()) { > > > > > > parent.stat.setCversion(parentCVersion); > > > > > > parent.stat.setPzxid(zxid); > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/DataTree.java#L477-L480 > > > > > > > > > > > > https://issues.apache.org/jira/browse/ZOOKEEPER-3249 > > > > > > > > > > > > > > > > > > When overflow happens, the new parentCversion is changed to > > > > -2147483648. > > > > > > It's updated in the outstandingChangesForPath map. It's not updated > > > in > > > > > > DataTree and the value stays as 2147483647 because -2147483648 is > > > less > > > > > > than 2147483647, so the cVerson is inconsistent in ZK. > > > > > > > > > > > > Due to the inconsistent cVersion, when the next request comes in > > > after > > > > > > overflow, the sequence number is non-deterministic and not unique > > > > > depending > > > > > > on where the parent node is read from. It can be 2147483647 if the > > > > > > parent node is read from DataTree or -2147483648, -2147483647 and > > so > > > > on > > > > > if > > > > > > it's from the outstandingChangesForPath map. > > > > > > > > > > > > We have the following doc about unique naming but no info on > > > > "expected" > > > > > > behavior after overflow. > > > > > > > > > > > > Sequence Nodes -- Unique Naming > > > > > > > > > > > > > > > > > > When creating a znode you can also request that ZooKeeper append a > > > > > > monotonically increasing counter to the end of path. This counter > > is > > > > > unique > > > > > > to the parent znode. The counter has a format of %010d -- that is > > 10 > > > > > digits > > > > > > with 0 (zero) padding (the counter is formatted in this way to > > > simplify > > > > > > sorting), i.e. "0000000001". See Queue Recipe for an example use of > > > > this > > > > > > feature. Note: the counter used to store the next sequence number > > is > > > a > > > > > > signed int (4bytes) maintained by the parent node, the counter will > > > > > > overflow when incremented beyond 2147483647 (resulting in a name > > > > > > "-2147483648"). > > > > > > > > > > > > > > > > > > > > > > > > Please let me know if you have any comments or inputs. > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > > Li > > > > > > > > > > > > > >