Li,

Il giorno ven 16 giu 2023 alle ore 11:43 Li Wang <li4w...@gmail.com> ha scritto:
>
> Thanks for the comments and inputs, Ted, Josef and Kezhu. The discussions
> are great.
>
> > The invariant is that the value should be increasing except in
> failure modes. You found a somewhat surprising failure mode.
> Does this mean the value is supposed to be always increasing, never
> decreasing or remaining constant, otherwise it's a failure mode and we need
> to fix it?
>
> > Please compute how long it would take for a 64-bit counter to overflow
> if incremented a million times per second. (hint, half a million years).
> Haha, it would take way too long (> 0.3M years) for a 64-bit counter to
> overflow, so we don't need to worry about it.
>
> > It would be better to use a longer integer.
> >   1.  Increasing to a 64-bit counter certainly solves the problem, but
> this might require conversion of zk data when the current counter is stored
> as 32-bit
> >   I support this, but it demands massive work and probably relates to
> a long term goal[6].  The "int" fact of "cversion" is exposed both
> in API(Stat) and storage(StatPersisted).
> I agree. Using a 64-bit counter is a better way but the scope of the change
> is big. If we have a consensus on this is the way to go, maybe we can scope
> it out and break it into sub-tasks so we can have more people contribute to
> it if there is a need.
>
> > I wrote a test[1] and found that overflow of cversion results
> in NODEEXISTS[2]. I think this is better than what the doc[3] says.
> > There is no wrap-around on the client side.
> NODEEXISTS is thrown because the same prefix is used in the test. If you
> use different prefixes, you will get the wrap-around, as the child path is
> different even if the sequence is the same.
>
> > I think you could take a look at ZOOKEEPER-4332[4]. ZooKeeper does not
> work well with massive children
> Yes, to work around that, we only keep the most recent x number of children
> and have a paginated way of listing the children.
>
> > This breaks "monotonically increasing" and gains no"uniqueness". The
> cversion will wrap around again given your cases.
> It breaks "monotonically increasing" only at the point of overflow, it is
> "monotonically increasing" before overflow and after overflow.
> It gains more ""uniqueness" as the range of "uniqueness" is increased from
> [0, 2147483647] to [-2147483648, 2147483647].
> This is not ideal and I agree using a 64 counter would be better if we can.

I may be fine with a proposal that allows you to extend the range of
values, but I am not sure that allowing it
would make much difference or introduce more problems: we are seeing
that the sequence will no longer be monotonically increasing, so we.
lose a feature

I am supporting the initiative to move the counter to be 64 bits, as
far as we are able to
ensure 100% disk and wire protocol compatibility with clusters and
applications that do not overflow.

I am still a little bit worried about your use case.
I may be wrong that the ids generated by ZooKeeper are not meant to be
globally unique outside of the scope of "coordination" of the clients
connected to the cluster.
For instance it is not expected that you use ZooKeeper to generate a
"customer id" or a "session id" (where the session is a session with a
web browser).

It would be better to store a "counter" and use "compare and set",
but, again, zookeeper is not meant to be used on "hot paths", there
are many systems built over zookeeper
that provide higher level features with high performance (like
BookKeeper, or let me cite HerdDB, a small but powerful project I
worked on some time ago).

So in the end my suggestions are:
- work together on the 64 bits counter
- rework your application to not use ZK to generate globally unique
ids that are not used for coordination

Enrico

>
> Thanks,
>
> Li
>
> On Thu, Jun 15, 2023 at 9:31 PM Kezhu Wang <kez...@gmail.com> wrote:
>
> > Hi,
> >
> > > Note: the counter used to store the next sequence number is a signed int
> > (4bytes) maintained by the parent node, the counter will overflow when
> > incremented beyond 2147483647 (resulting in a name "-2147483648").
> >
> > I wrote a test[1] and found that overflow of cversion results in
> > NODEEXISTS[2]. I think this is better than what the doc[3] says. There
> > is no wrap-around on the client side.
> >
> > > In our use case, we create *persistent* *sequential* nodes. We store the
> > sequence id in the client application and use it as a globally unique id.
> >
> > Given that you have overflowed the cversion, I think you could take a
> > look at ZOOKEEPER-4332[4]. ZooKeeper does not work well with massive
> > children when listing. BookKeeper's HierarchicalLedgerManager[5] is a
> > real world example for this.
> >
> >
> > > New
> > > ===
> > > if (parentCVersion > currentParentCVersion
> > >                 *|| parentCVersion == Integer.MIN_VALUE &&
> > > currentParentCVersion == Integer.MAX_VALUE) *{
> > >                 parent.stat.setCversion(parent
> > > CVersion);
> > >                 parent.stat.setPzxid(zxid);
> > >           }
> >
> > This breaks "monotonically increasing" and gains no "uniqueness". The
> > cversion will wrap around again given your cases.
> >
> > > It would be better to use a longer integer.
> > >   1.  Increasing to a 64-bit counter certainly solves the problem, but
> > this might require conversion of zk data when the current counter is stored
> > as 32-bit
> >
> > I support this, but it demands massive work and probably relates to a
> > long term goal[6].  The "int" fact of "cversion" is exposed both in
> > API(Stat) and storage(StatPersisted).
> >
> >
> > [1]:
> > https://github.com/kezhuw/zookeeper/commit/755b1168156c28e4fc2813be593ac67514e8bdc7#diff-1af986ce48b5d4bb4b8e51374a70cc6e109a04c70d9f450be3df8f302010341cR59
> > [2]:
> > https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/PrepRequestProcessor.java#L675
> > [3]:
> > https://zookeeper.apache.org/doc/r3.8.1/zookeeperProgrammers.html#Sequence+Nodes+--+Unique+Naming
> > [4]: https://issues.apache.org/jira/browse/ZOOKEEPER-4332
> > [5]:
> > https://bookkeeper.apache.org/docs/getting-started/concepts/#hierarchical-ledger-manager
> > [6]: https://issues.apache.org/jira/browse/ZOOKEEPER-102
> >
> > Kezhu Wang
> >
> >
> > On Fri, Jun 16, 2023 at 11:33 AM Josef Roehrl
> > <josef.roe...@fuseforward.com.invalid> wrote:
> > >
> > > I wanted to add 2 things.
> > >
> > >
> > >   1.  Increasing to a 64-bit counter certainly solves the problem, but
> > this might require conversion of zk data when the current counter is stored
> > as 32-bit
> > >   2.  A client that relies on a unique version that it uses as a
> > reference outside of zk should verify that a version that it receives does
> > not already exist outside zk. This applies even if 1. is considered, should
> > a zk quorum be reset or lose its data.
> > >
> > > Josef Roehrl
> > > FuseForward | Senior Architect - Professional Services
> > > [
> > https://fuseforward.atlassian.net/wiki/download/attachments/512327681/image001.png?version=1&modificationDate=1537397840684&cacheVersion=1&api=v2
> > ]
> > > Website<
> > https://fuseforward.com/?utm_source=Email%20Signature&utm_medium=email%20signature&utm_campaign=email%20signature>
> > | Newsletter<
> > https://fuseforward.com/subscribe-to-our-newsletter/?utm_source=Email%20Signature&utm_medium=Email%20Signature&utm_campaign=Email%20Signature>
> > | Twitter<https://twitter.com/fuseforward> | LinkedIn<
> > https://www.linkedin.com/company/fuseforward/?originalSubdomain=ca>
> > >
> > > ________________________________
> > > From: Ted Dunning <ted.dunn...@gmail.com>
> > > Sent: Thursday, June 15, 2023 6:53 PM
> > > To: dev@zookeeper.apache.org <dev@zookeeper.apache.org>
> > > Subject: Re: Name of sequence node is not unique
> > >
> > > The invariant is that the value should be increasing except in failure
> > > modes. You found a somewhat surprising failure mode.
> > >
> > > Please compute how long it would take for a 64-bit counter to overflow if
> > > incremented a million times per second. (hint, half a million years).
> > > Remember that zk only does things at less than 100,000 per second
> > >
> > > On Thu, Jun 15, 2023, 17:03 Li Wang <li4w...@gmail.com> wrote:
> > >
> > > > Thanks a lot for your inputs, Ted.
> > > >
> > > > On Thu, Jun 15, 2023 at 2:52 PM Ted Dunning <ted.dunn...@gmail.com>
> > wrote:
> > > >
> > > > > Breaking a semantic invariant is a dangerous solution here.
> > > >
> > > > Totally agree. We should not break a semantic invariant if there is
> > one.
> > > > What's the semantic invariant here and how ZK is supposed to behave in
> > the
> > > > overflow case?
> > > >
> > > >
> > > > > It would be better to use a longer integer.
> > > > >
> > > > Yes, I thought about this too. Longer integer will overflow too, so the
> > > > issue of not generating unique numbers will still exist.
> > > >
> > > > Best,
> > > >
> > > > Li
> > > >
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Jun 15, 2023, 13:35 Li Wang <li4w...@gmail.com> wrote:
> > > > >
> > > > > > Thanks for the response, Enrico.
> > > > > >
> > > > > > Please see comments below.
> > > > > >
> > > > > > On Thu, Jun 15, 2023 at 5:47 AM Enrico Olivelli <
> > eolive...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Li,
> > > > > > > thanks for reporting your problem.
> > > > > > >
> > > > > > > Most likely you have found a bug.
> > > > > > >
> > > > > > > I have one question, related to your use case,
> > > > > > > is the problem that the numbers are not "unique" or that the
> > number
> > > > is
> > > > > > > not monotonically increasing ?
> > > > > > >
> > > > > >
> > > > > > Technically speaking, monotonically increasing means either always
> > > > > > increasing or *remaining constant. *With tha*t, * the problem is
> > only
> > > > > > the numbers are not "unique'. In this case, the parent cversion
> > > > > > remains 2147483647
> > > > > > after reaching Integer.MAX_VALUE, not unique any more.
> > > > > >
> > > > > > >
> > > > > > > Do you have 2147483647 concurrent sessions and you found that two
> > > > > > > sessions got the same sequenceId ?
> > > > > > > or are you storing the sequenceId somewhere and you use it as a
> > > > > > > globally unique id, not only among the connected sessions but
> > also
> > > > > > > among all the sessions that are ever connected to the cluster ?
> > > > > > >
> > > > > >
> > > > > > In our use case, we create *persistent* *sequential* nodes. We
> > store
> > > > the
> > > > > > sequence id in the client application and use it as a globally
> > unique
> > > > id.
> > > > > > Currently Zookeeper guarantees the following non-overflow case but
> > not
> > > > > > after overflow.
> > > > > >
> > > > > > 1. Monotonically increasing
> > > > > > 2. Uniqueness
> > > > > > 3. Sequentially increase by 1
> > > > > >
> > > > > > For customers who have an overflow use case and can handle the
> > sequence
> > > > > > number cycling in a circular fashion, how about having a simple
> > patch
> > > > > > to support it and handle the overflow case better?  The change is
> > > > adding
> > > > > a
> > > > > > condition to allow the wraparound when it flows to negative. We can
> > > > also
> > > > > > have a property to control whether to add the additional condition
> > if
> > > > > > needed.
> > > > > >
> > > > > > New
> > > > > > ===
> > > > > > if (parentCVersion > currentParentCVersion
> > > > > >                 *|| parentCVersion == Integer.MIN_VALUE &&
> > > > > > currentParentCVersion == Integer.MAX_VALUE) *{
> > > > > >                 parent.stat.setCversion(parentCVersion);
> > > > > >                 parent.stat.setPzxid(zxid);
> > > > > >           }
> > > > > >
> > > > > > Current
> > > > > > =====
> > > > > > if (parentCVersion > parent.stat.getCversion()) {
> > > > > >                 parent.stat.setCversion(parentCVersion);
> > > > > >                 parent.stat.setPzxid(zxid);
> > > > > >             }
> > > > > >
> > > > > > Please let me know what you think. Any input or feedback would be
> > > > > > appreciated.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Li
> > > > > >
> > > > > >
> > > > > > > Enrico
> > > > > > >
> > > > > > > Il giorno ven 9 giu 2023 alle ore 21:10 Li Wang <
> > li4w...@gmail.com>
> > > > ha
> > > > > > > scritto:
> > > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > We are running 3.7.1 in production and running into an "issue"
> > that
> > > > > the
> > > > > > > > names of sequence nodes are not unique after the counter hits
> > the
> > > > max
> > > > > > int
> > > > > > > > (i.e 2147483647) and overflows.  I would like to start a
> > thread to
> > > > > > > discuss
> > > > > > > > the following
> > > > > > > >
> > > > > > > > 1. Is this a bug or "expected" behavior?
> > > > > > > > 2. Is ZK supposed to support the overflow scenario and need to
> > make
> > > > > > sure
> > > > > > > > the name is unique when overflow happens?
> > > > > > > >
> > > > > > > > The name is not unique after hitting the max int value because
> > of
> > > > we
> > > > > > > > have the following in zk  code base:
> > > > > > > >
> > > > > > > > 1.  The cversion of parent znode is used to build the child
> > name in
> > > > > > > > PrepRequestProcessor
> > > > > > > >
> > > > > > > >         int parentCVersion = parentRecord.stat.getCversion();
> > > > > > > >         if (createMode.isSequential()) {
> > > > > > > >             path = path + String.format(Locale.ENGLISH,
> > "%010d",
> > > > > > > > parentCVersion);
> > > > > > > >         }
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/
> > > > > > > >
> > > > java/org/apache/zookeeper/server/PrepRequestProcessor.java#L668-L671
> > > > > > > >
> > > > > > > >
> > > > > > > > 2. The parent znode is read from either
> > > > zks.outstandingChangesForPath
> > > > > > map
> > > > > > > > or zk database/datatree.
> > > > > > > >
> > > > > > > >            lastChange =
> > zks.outstandingChangesForPath.get(path);
> > > > > > > >             if (lastChange == null) {
> > > > > > > >                 DataNode n = zks.getZKDatabase().getNode(path);
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/PrepRequestProcessor.java#L168-L170
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > 3. The cversion of the parent node in
> > outstandingChangesForPath map
> > > > > is
> > > > > > > > always updated  but not in zk database as we added the
> > following
> > > > code
> > > > > > in
> > > > > > > 3.6
> > > > > > > >
> > > > > > > >             if (parentCVersion > parent.stat.getCversion()) {
> > > > > > > >                 parent.stat.setCversion(parentCVersion);
> > > > > > > >                 parent.stat.setPzxid(zxid);
> > > > > > > >             }
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/DataTree.java#L477-L480
> > > > > > > >
> > > > > > > > https://issues.apache.org/jira/browse/ZOOKEEPER-3249
> > > > > > > >
> > > > > > > >
> > > > > > > > When overflow happens, the new parentCversion is changed to
> > > > > > -2147483648.
> > > > > > > > It's updated in the outstandingChangesForPath map. It's not
> > updated
> > > > > in
> > > > > > > > DataTree and the value stays as 2147483647  because
> > -2147483648 is
> > > > > less
> > > > > > > > than 2147483647, so the cVerson is inconsistent in  ZK.
> > > > > > > >
> > > > > > > > Due to the inconsistent cVersion, when the next request comes
> > in
> > > > > after
> > > > > > > > overflow, the sequence number is non-deterministic and not
> > unique
> > > > > > > depending
> > > > > > > > on where the parent node is read from.  It can be 2147483647
> > if the
> > > > > > > > parent node is read from DataTree or -2147483648,  -2147483647
> > and
> > > > so
> > > > > > on
> > > > > > > if
> > > > > > > > it's from the outstandingChangesForPath map.
> > > > > > > >
> > > > > > > > We have the following doc about unique naming but no info on
> > > > > > "expected"
> > > > > > > > behavior after overflow.
> > > > > > > >
> > > > > > > > Sequence Nodes -- Unique Naming
> > > > > > > >
> > > > > > > >
> > > > > > > > When creating a znode you can also request that ZooKeeper
> > append a
> > > > > > > > monotonically increasing counter to the end of path. This
> > counter
> > > > is
> > > > > > > unique
> > > > > > > > to the parent znode. The counter has a format of %010d -- that
> > is
> > > > 10
> > > > > > > digits
> > > > > > > > with 0 (zero) padding (the counter is formatted in this way to
> > > > > simplify
> > > > > > > > sorting), i.e. "0000000001". See Queue Recipe for an example
> > use of
> > > > > > this
> > > > > > > > feature. Note: the counter used to store the next sequence
> > number
> > > > is
> > > > > a
> > > > > > > > signed int (4bytes) maintained by the parent node, the counter
> > will
> > > > > > > > overflow when incremented beyond 2147483647 (resulting in a
> > name
> > > > > > > > "-2147483648").
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Please let me know if you have any comments or inputs.
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > >
> > > > > > > > Li
> > > > > > >
> > > > > >
> > > > >
> > > >
> >

Reply via email to