Hello,

Thanks all for the inputs and discussions. It's been a while but we have
some updates to share

1. We have looked into our use case and decided to re-work our application
to address the sequence overflow issue
2. Since using 32 bits is sufficient for the use cases that ZK is designed
for (i.e. coordination), I will close the JIRA ticket
https://issues.apache.org/jira/browse/ZOOKEEPER-4706

Thanks,

Li

On Sat, Jun 17, 2023 at 9:01 AM Ted Dunning <ted.dunn...@gmail.com> wrote:

> I can't remember if the API in question returns the int in question
> directly or as part of a node name.
>
> If the value is embedded in a string node name, then no protocol difference
> would be required. You would still get a unique string. It would just not
> be parseable back as a valid int. A doc warning to that effect would
> probably be sufficient.
>
> If the value is returned as an int, we could add a new call that returns a
> long. The int version could be *slightly* changed so that it throws an
> error if the counter has exceeded 2^31. Users who never could hit the limit
> (essentially everybody using ZK normally) would never hit the corner
> condition. People who were on the way to hit the error condition will get
> an informative error. And people who understand the problem will use the
> long version and never have the problem.
>
> In the end, however, I really do think that this is not something that is
> going to impact any realistic use of the software. A sustained average
> creation rate of >1 node per second is something that I have never seen and
> I would strong suggest that if I were to see such a use that it is
> substantially in error. If you want to create unique values, it is much
> better to have a fast process that generates 128 bit id's and just use ZK
> to update the next starting point each time the generating process
> restarts. This can easily decrease the ZK load by a factor of 10^6 or more
> and gives much higher performance than Zk alone ever could.
>
> For instance, on process restart, increment the next starting point by 10^9
> and start generating unique values. If you have 1000 such generators, each
> will increment the starting point and overall, the starting point will be
> incremented by 10^12 \approx 2^40. With a 128 bit result, that leaves you
> with 2^80 restarts before you have a risk of collision. This configuration
> could probably generate 10^9 unique values per second for the remaining
> life of the earth.
>
>
>
> On Fri, Jun 16, 2023 at 3:43 AM Enrico Olivelli <eolive...@gmail.com>
> wrote:
>
> > Li,
> >
> > Il giorno ven 16 giu 2023 alle ore 11:43 Li Wang <li4w...@gmail.com> ha
> > scritto:
> > >
> > > Thanks for the comments and inputs, Ted, Josef and Kezhu. The
> discussions
> > > are great.
> > >
> > > > The invariant is that the value should be increasing except in
> > > failure modes. You found a somewhat surprising failure mode.
> > > Does this mean the value is supposed to be always increasing, never
> > > decreasing or remaining constant, otherwise it's a failure mode and we
> > need
> > > to fix it?
> > >
> > > > Please compute how long it would take for a 64-bit counter to
> overflow
> > > if incremented a million times per second. (hint, half a million
> years).
> > > Haha, it would take way too long (> 0.3M years) for a 64-bit counter to
> > > overflow, so we don't need to worry about it.
> > >
> > > > It would be better to use a longer integer.
> > > >   1.  Increasing to a 64-bit counter certainly solves the problem,
> but
> > > this might require conversion of zk data when the current counter is
> > stored
> > > as 32-bit
> > > >   I support this, but it demands massive work and probably relates to
> > > a long term goal[6].  The "int" fact of "cversion" is exposed both
> > > in API(Stat) and storage(StatPersisted).
> > > I agree. Using a 64-bit counter is a better way but the scope of the
> > change
> > > is big. If we have a consensus on this is the way to go, maybe we can
> > scope
> > > it out and break it into sub-tasks so we can have more people
> contribute
> > to
> > > it if there is a need.
> > >
> > > > I wrote a test[1] and found that overflow of cversion results
> > > in NODEEXISTS[2]. I think this is better than what the doc[3] says.
> > > > There is no wrap-around on the client side.
> > > NODEEXISTS is thrown because the same prefix is used in the test. If
> you
> > > use different prefixes, you will get the wrap-around, as the child path
> > is
> > > different even if the sequence is the same.
> > >
> > > > I think you could take a look at ZOOKEEPER-4332[4]. ZooKeeper does
> not
> > > work well with massive children
> > > Yes, to work around that, we only keep the most recent x number of
> > children
> > > and have a paginated way of listing the children.
> > >
> > > > This breaks "monotonically increasing" and gains no"uniqueness". The
> > > cversion will wrap around again given your cases.
> > > It breaks "monotonically increasing" only at the point of overflow, it
> is
> > > "monotonically increasing" before overflow and after overflow.
> > > It gains more ""uniqueness" as the range of "uniqueness" is increased
> > from
> > > [0, 2147483647] to [-2147483648, 2147483647].
> > > This is not ideal and I agree using a 64 counter would be better if we
> > can.
> >
> > I may be fine with a proposal that allows you to extend the range of
> > values, but I am not sure that allowing it
> > would make much difference or introduce more problems: we are seeing
> > that the sequence will no longer be monotonically increasing, so we.
> > lose a feature
> >
> > I am supporting the initiative to move the counter to be 64 bits, as
> > far as we are able to
> > ensure 100% disk and wire protocol compatibility with clusters and
> > applications that do not overflow.
> >
> > I am still a little bit worried about your use case.
> > I may be wrong that the ids generated by ZooKeeper are not meant to be
> > globally unique outside of the scope of "coordination" of the clients
> > connected to the cluster.
> > For instance it is not expected that you use ZooKeeper to generate a
> > "customer id" or a "session id" (where the session is a session with a
> > web browser).
> >
> > It would be better to store a "counter" and use "compare and set",
> > but, again, zookeeper is not meant to be used on "hot paths", there
> > are many systems built over zookeeper
> > that provide higher level features with high performance (like
> > BookKeeper, or let me cite HerdDB, a small but powerful project I
> > worked on some time ago).
> >
> > So in the end my suggestions are:
> > - work together on the 64 bits counter
> > - rework your application to not use ZK to generate globally unique
> > ids that are not used for coordination
> >
> > Enrico
> >
> > >
> > > Thanks,
> > >
> > > Li
> > >
> > > On Thu, Jun 15, 2023 at 9:31 PM Kezhu Wang <kez...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > > Note: the counter used to store the next sequence number is a
> signed
> > int
> > > > (4bytes) maintained by the parent node, the counter will overflow
> when
> > > > incremented beyond 2147483647 (resulting in a name "-2147483648").
> > > >
> > > > I wrote a test[1] and found that overflow of cversion results in
> > > > NODEEXISTS[2]. I think this is better than what the doc[3] says.
> There
> > > > is no wrap-around on the client side.
> > > >
> > > > > In our use case, we create *persistent* *sequential* nodes. We
> store
> > the
> > > > sequence id in the client application and use it as a globally unique
> > id.
> > > >
> > > > Given that you have overflowed the cversion, I think you could take a
> > > > look at ZOOKEEPER-4332[4]. ZooKeeper does not work well with massive
> > > > children when listing. BookKeeper's HierarchicalLedgerManager[5] is a
> > > > real world example for this.
> > > >
> > > >
> > > > > New
> > > > > ===
> > > > > if (parentCVersion > currentParentCVersion
> > > > >                 *|| parentCVersion == Integer.MIN_VALUE &&
> > > > > currentParentCVersion == Integer.MAX_VALUE) *{
> > > > >                 parent.stat.setCversion(parent
> > > > > CVersion);
> > > > >                 parent.stat.setPzxid(zxid);
> > > > >           }
> > > >
> > > > This breaks "monotonically increasing" and gains no "uniqueness". The
> > > > cversion will wrap around again given your cases.
> > > >
> > > > > It would be better to use a longer integer.
> > > > >   1.  Increasing to a 64-bit counter certainly solves the problem,
> > but
> > > > this might require conversion of zk data when the current counter is
> > stored
> > > > as 32-bit
> > > >
> > > > I support this, but it demands massive work and probably relates to a
> > > > long term goal[6].  The "int" fact of "cversion" is exposed both in
> > > > API(Stat) and storage(StatPersisted).
> > > >
> > > >
> > > > [1]:
> > > >
> >
> https://github.com/kezhuw/zookeeper/commit/755b1168156c28e4fc2813be593ac67514e8bdc7#diff-1af986ce48b5d4bb4b8e51374a70cc6e109a04c70d9f450be3df8f302010341cR59
> > > > [2]:
> > > >
> >
> https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/PrepRequestProcessor.java#L675
> > > > [3]:
> > > >
> >
> https://zookeeper.apache.org/doc/r3.8.1/zookeeperProgrammers.html#Sequence+Nodes+--+Unique+Naming
> > > > [4]: https://issues.apache.org/jira/browse/ZOOKEEPER-4332
> > > > [5]:
> > > >
> >
> https://bookkeeper.apache.org/docs/getting-started/concepts/#hierarchical-ledger-manager
> > > > [6]: https://issues.apache.org/jira/browse/ZOOKEEPER-102
> > > >
> > > > Kezhu Wang
> > > >
> > > >
> > > > On Fri, Jun 16, 2023 at 11:33 AM Josef Roehrl
> > > > <josef.roe...@fuseforward.com.invalid> wrote:
> > > > >
> > > > > I wanted to add 2 things.
> > > > >
> > > > >
> > > > >   1.  Increasing to a 64-bit counter certainly solves the problem,
> > but
> > > > this might require conversion of zk data when the current counter is
> > stored
> > > > as 32-bit
> > > > >   2.  A client that relies on a unique version that it uses as a
> > > > reference outside of zk should verify that a version that it receives
> > does
> > > > not already exist outside zk. This applies even if 1. is considered,
> > should
> > > > a zk quorum be reset or lose its data.
> > > > >
> > > > > Josef Roehrl
> > > > > FuseForward | Senior Architect - Professional Services
> > > > > [
> > > >
> >
> https://fuseforward.atlassian.net/wiki/download/attachments/512327681/image001.png?version=1&modificationDate=1537397840684&cacheVersion=1&api=v2
> > > > ]
> > > > > Website<
> > > >
> >
> https://fuseforward.com/?utm_source=Email%20Signature&utm_medium=email%20signature&utm_campaign=email%20signature
> > >
> > > > | Newsletter<
> > > >
> >
> https://fuseforward.com/subscribe-to-our-newsletter/?utm_source=Email%20Signature&utm_medium=Email%20Signature&utm_campaign=Email%20Signature
> > >
> > > > | Twitter<https://twitter.com/fuseforward> | LinkedIn<
> > > > https://www.linkedin.com/company/fuseforward/?originalSubdomain=ca>
> > > > >
> > > > > ________________________________
> > > > > From: Ted Dunning <ted.dunn...@gmail.com>
> > > > > Sent: Thursday, June 15, 2023 6:53 PM
> > > > > To: dev@zookeeper.apache.org <dev@zookeeper.apache.org>
> > > > > Subject: Re: Name of sequence node is not unique
> > > > >
> > > > > The invariant is that the value should be increasing except in
> > failure
> > > > > modes. You found a somewhat surprising failure mode.
> > > > >
> > > > > Please compute how long it would take for a 64-bit counter to
> > overflow if
> > > > > incremented a million times per second. (hint, half a million
> years).
> > > > > Remember that zk only does things at less than 100,000 per second
> > > > >
> > > > > On Thu, Jun 15, 2023, 17:03 Li Wang <li4w...@gmail.com> wrote:
> > > > >
> > > > > > Thanks a lot for your inputs, Ted.
> > > > > >
> > > > > > On Thu, Jun 15, 2023 at 2:52 PM Ted Dunning <
> ted.dunn...@gmail.com
> > >
> > > > wrote:
> > > > > >
> > > > > > > Breaking a semantic invariant is a dangerous solution here.
> > > > > >
> > > > > > Totally agree. We should not break a semantic invariant if there
> is
> > > > one.
> > > > > > What's the semantic invariant here and how ZK is supposed to
> > behave in
> > > > the
> > > > > > overflow case?
> > > > > >
> > > > > >
> > > > > > > It would be better to use a longer integer.
> > > > > > >
> > > > > > Yes, I thought about this too. Longer integer will overflow too,
> > so the
> > > > > > issue of not generating unique numbers will still exist.
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > > Li
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Jun 15, 2023, 13:35 Li Wang <li4w...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Thanks for the response, Enrico.
> > > > > > > >
> > > > > > > > Please see comments below.
> > > > > > > >
> > > > > > > > On Thu, Jun 15, 2023 at 5:47 AM Enrico Olivelli <
> > > > eolive...@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Li,
> > > > > > > > > thanks for reporting your problem.
> > > > > > > > >
> > > > > > > > > Most likely you have found a bug.
> > > > > > > > >
> > > > > > > > > I have one question, related to your use case,
> > > > > > > > > is the problem that the numbers are not "unique" or that
> the
> > > > number
> > > > > > is
> > > > > > > > > not monotonically increasing ?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Technically speaking, monotonically increasing means either
> > always
> > > > > > > > increasing or *remaining constant. *With tha*t, * the problem
> > is
> > > > only
> > > > > > > > the numbers are not "unique'. In this case, the parent
> cversion
> > > > > > > > remains 2147483647
> > > > > > > > after reaching Integer.MAX_VALUE, not unique any more.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Do you have 2147483647 concurrent sessions and you found
> > that two
> > > > > > > > > sessions got the same sequenceId ?
> > > > > > > > > or are you storing the sequenceId somewhere and you use it
> > as a
> > > > > > > > > globally unique id, not only among the connected sessions
> but
> > > > also
> > > > > > > > > among all the sessions that are ever connected to the
> > cluster ?
> > > > > > > > >
> > > > > > > >
> > > > > > > > In our use case, we create *persistent* *sequential* nodes.
> We
> > > > store
> > > > > > the
> > > > > > > > sequence id in the client application and use it as a
> globally
> > > > unique
> > > > > > id.
> > > > > > > > Currently Zookeeper guarantees the following non-overflow
> case
> > but
> > > > not
> > > > > > > > after overflow.
> > > > > > > >
> > > > > > > > 1. Monotonically increasing
> > > > > > > > 2. Uniqueness
> > > > > > > > 3. Sequentially increase by 1
> > > > > > > >
> > > > > > > > For customers who have an overflow use case and can handle
> the
> > > > sequence
> > > > > > > > number cycling in a circular fashion, how about having a
> simple
> > > > patch
> > > > > > > > to support it and handle the overflow case better?  The
> change
> > is
> > > > > > adding
> > > > > > > a
> > > > > > > > condition to allow the wraparound when it flows to negative.
> > We can
> > > > > > also
> > > > > > > > have a property to control whether to add the additional
> > condition
> > > > if
> > > > > > > > needed.
> > > > > > > >
> > > > > > > > New
> > > > > > > > ===
> > > > > > > > if (parentCVersion > currentParentCVersion
> > > > > > > >                 *|| parentCVersion == Integer.MIN_VALUE &&
> > > > > > > > currentParentCVersion == Integer.MAX_VALUE) *{
> > > > > > > >                 parent.stat.setCversion(parentCVersion);
> > > > > > > >                 parent.stat.setPzxid(zxid);
> > > > > > > >           }
> > > > > > > >
> > > > > > > > Current
> > > > > > > > =====
> > > > > > > > if (parentCVersion > parent.stat.getCversion()) {
> > > > > > > >                 parent.stat.setCversion(parentCVersion);
> > > > > > > >                 parent.stat.setPzxid(zxid);
> > > > > > > >             }
> > > > > > > >
> > > > > > > > Please let me know what you think. Any input or feedback
> would
> > be
> > > > > > > > appreciated.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Li
> > > > > > > >
> > > > > > > >
> > > > > > > > > Enrico
> > > > > > > > >
> > > > > > > > > Il giorno ven 9 giu 2023 alle ore 21:10 Li Wang <
> > > > li4w...@gmail.com>
> > > > > > ha
> > > > > > > > > scritto:
> > > > > > > > > >
> > > > > > > > > > Hello,
> > > > > > > > > >
> > > > > > > > > > We are running 3.7.1 in production and running into an
> > "issue"
> > > > that
> > > > > > > the
> > > > > > > > > > names of sequence nodes are not unique after the counter
> > hits
> > > > the
> > > > > > max
> > > > > > > > int
> > > > > > > > > > (i.e 2147483647) and overflows.  I would like to start a
> > > > thread to
> > > > > > > > > discuss
> > > > > > > > > > the following
> > > > > > > > > >
> > > > > > > > > > 1. Is this a bug or "expected" behavior?
> > > > > > > > > > 2. Is ZK supposed to support the overflow scenario and
> > need to
> > > > make
> > > > > > > > sure
> > > > > > > > > > the name is unique when overflow happens?
> > > > > > > > > >
> > > > > > > > > > The name is not unique after hitting the max int value
> > because
> > > > of
> > > > > > we
> > > > > > > > > > have the following in zk  code base:
> > > > > > > > > >
> > > > > > > > > > 1.  The cversion of parent znode is used to build the
> child
> > > > name in
> > > > > > > > > > PrepRequestProcessor
> > > > > > > > > >
> > > > > > > > > >         int parentCVersion =
> > parentRecord.stat.getCversion();
> > > > > > > > > >         if (createMode.isSequential()) {
> > > > > > > > > >             path = path + String.format(Locale.ENGLISH,
> > > > "%010d",
> > > > > > > > > > parentCVersion);
> > > > > > > > > >         }
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> >
> https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/
> > > > > > > > > >
> > > > > >
> > java/org/apache/zookeeper/server/PrepRequestProcessor.java#L668-L671
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > 2. The parent znode is read from either
> > > > > > zks.outstandingChangesForPath
> > > > > > > > map
> > > > > > > > > > or zk database/datatree.
> > > > > > > > > >
> > > > > > > > > >            lastChange =
> > > > zks.outstandingChangesForPath.get(path);
> > > > > > > > > >             if (lastChange == null) {
> > > > > > > > > >                 DataNode n =
> > zks.getZKDatabase().getNode(path);
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> >
> https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/PrepRequestProcessor.java#L168-L170
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > 3. The cversion of the parent node in
> > > > outstandingChangesForPath map
> > > > > > > is
> > > > > > > > > > always updated  but not in zk database as we added the
> > > > following
> > > > > > code
> > > > > > > > in
> > > > > > > > > 3.6
> > > > > > > > > >
> > > > > > > > > >             if (parentCVersion >
> > parent.stat.getCversion()) {
> > > > > > > > > >                 parent.stat.setCversion(parentCVersion);
> > > > > > > > > >                 parent.stat.setPzxid(zxid);
> > > > > > > > > >             }
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> >
> https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/DataTree.java#L477-L480
> > > > > > > > > >
> > > > > > > > > > https://issues.apache.org/jira/browse/ZOOKEEPER-3249
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > When overflow happens, the new parentCversion is changed
> to
> > > > > > > > -2147483648.
> > > > > > > > > > It's updated in the outstandingChangesForPath map. It's
> not
> > > > updated
> > > > > > > in
> > > > > > > > > > DataTree and the value stays as 2147483647  because
> > > > -2147483648 is
> > > > > > > less
> > > > > > > > > > than 2147483647, so the cVerson is inconsistent in  ZK.
> > > > > > > > > >
> > > > > > > > > > Due to the inconsistent cVersion, when the next request
> > comes
> > > > in
> > > > > > > after
> > > > > > > > > > overflow, the sequence number is non-deterministic and
> not
> > > > unique
> > > > > > > > > depending
> > > > > > > > > > on where the parent node is read from.  It can be
> > 2147483647
> > > > if the
> > > > > > > > > > parent node is read from DataTree or -2147483648,
> > -2147483647
> > > > and
> > > > > > so
> > > > > > > > on
> > > > > > > > > if
> > > > > > > > > > it's from the outstandingChangesForPath map.
> > > > > > > > > >
> > > > > > > > > > We have the following doc about unique naming but no info
> > on
> > > > > > > > "expected"
> > > > > > > > > > behavior after overflow.
> > > > > > > > > >
> > > > > > > > > > Sequence Nodes -- Unique Naming
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > When creating a znode you can also request that ZooKeeper
> > > > append a
> > > > > > > > > > monotonically increasing counter to the end of path. This
> > > > counter
> > > > > > is
> > > > > > > > > unique
> > > > > > > > > > to the parent znode. The counter has a format of %010d --
> > that
> > > > is
> > > > > > 10
> > > > > > > > > digits
> > > > > > > > > > with 0 (zero) padding (the counter is formatted in this
> > way to
> > > > > > > simplify
> > > > > > > > > > sorting), i.e. "0000000001". See Queue Recipe for an
> > example
> > > > use of
> > > > > > > > this
> > > > > > > > > > feature. Note: the counter used to store the next
> sequence
> > > > number
> > > > > > is
> > > > > > > a
> > > > > > > > > > signed int (4bytes) maintained by the parent node, the
> > counter
> > > > will
> > > > > > > > > > overflow when incremented beyond 2147483647 (resulting
> in a
> > > > name
> > > > > > > > > > "-2147483648").
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Please let me know if you have any comments or inputs.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Li
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> >
>

Reply via email to