Hi Nozomi,

Thank you for the explanation. It's clearer to me now. Maybe instead of
aggregatePublisherStatsByProducerName, name it allowStatsAggregationByIndex
to be more descriptive. Defaults to false which makes this fix take effect.

On Mon, Nov 28, 2022 at 9:36 PM Nozomi Kurihara <nkuri...@apache.org> wrote:

> Hi Lin,
>
> > What if we change aggregatePublisherStatsByProducerName to default to
> true, and logs a warning if it's set to false?
>
> Currently "aggregatePublisherStatsByProducerName=true" means that stats is
> aggregated by NOT ONLY producer name BUT index when clients don't support
> partial producer.
>
> https://github.com/apache/pulsar/blob/6a719480b2afc84f5acb85fedf3accf8861eb97a/conf/broker.conf#L1476-L1478
>
> Thus
> > change aggregatePublisherStatsByProducerName to default to true
> is not enough to avoid the performance issue since the index-based
> aggregation can still work.
>
> To address this issue, https://github.com/apache/pulsar/pull/18254
> completely removes the index-based aggregation logic.
> However this change will cause another problem, breaking the backward
> compatibility.
> Though I agree that the impact is less than the performance issue, I think
> we should keep the existing behavior too.
>
> That's why I proposed "force" setting by which users have 3 options:
>
> (1) force=true
> * stats is aggregated by only producer name
> * for those who want to avoid the performance issue
>
> (2) force=false & aggregatePublisherStatsByProducerName=true
> * stats is aggregated by producer name for clients which supports partial
> producer, otherwise by index
> * for those who want to keep the backward-compatibility
>
> (3) force=false & aggregatePublisherStatsByProducerName=false
> * stats is aggregated by only index
> * (I'm not sure whether there are users who need this option)
>
> Best Regards,
> Nozomi
>
> 2022年11月29日(火) 8:55 Lin Zhao <lin.z...@streamnative.io.invalid>:
>
> > In my opinion, the performance issue reported is to critical that it
> should
> > be considered a regression. The only option is to either revert the
> initial
> > change or fix it as soon as possible.
> >
> > The backward compatibility issue introduced to this fix, on the other
> hand,
> > doesn't impact the behavior of the broker other than the reported json
> from
> > the stats call. It's a much lesser evil than broker running out of memory
> > with tens of millions of ProducerStatsImpl as I have seen.
> >
> > I would prefer deprecating aggregatePublisherStatsByProducerName, but at
> > the same time I see the reason behind keeping the feature intact in a
> minor
> > version change. Nozimo's proposal of
> > forceAggregatePublisherStatsByProducerName is basically keeping*
> > aggregatePublisherStatsByProducerName=false
> > *an option right? What if we change aggregatePublisherStatsByProducerName
> > to default to true, and logs a warning if it's set to false?
> >
> > On Thu, Nov 24, 2022 at 10:34 PM Nozomi Kurihara <nkuri...@apache.org>
> > wrote:
> >
> > > Hi Heesung,
> > >
> > > Thank you for opening this discussion.
> > >
> > > IMO backward-compatibility should be kept at least across minor
> releases.
> > >
> > > Although the performance issue you mentioned (e.g. memory burst, high
> GC
> > > and OOM) looks problematic, backward-compatibility is also important.
> > > I think which has higher priority depends on the case.
> > >
> > > Your current change seems to remove the index-based aggregation
> > completely.
> > > However I think we should keep room for choice.
> > >
> > > In order to allow users (here Pulsar server-side admins in particular)
> to
> > > choose the performance or backward-compatibility, how about
> introducing a
> > > "force" setting, e.g. "forceAggregatePublisherStatsByProducerName"?
> > >
> > > Those who place more importance on the performance than
> > > backward-compatibility can set this flag to true.
> > > Others, those who want to keep backward-compatibility, set this flag to
> > > false.
> > >
> > > By the way, I'm not sure, is the producer name generation logic already
> > > implemented in C++, Go and other clients?
> > > If not so, first we should implement it before switching the
> > > producer-name-based aggregation.
> > >
> > > Best Regards,
> > > Nozomi
> > >
> > > 2022年11月17日(木) 8:51 Heesung Sohn <heesung.s...@streamnative.io
> .invalid>:
> > >
> > > > Hi,
> > > >
> > > > To add more about the backward incompatibility issue
> > > > <https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182
> >,
> > > >
> > > > Before fix:
> > > > % ./bin/pulsar-admin topics partitioned-stats
> > > > persistent://public/default/pt
> > > > ...
> > > >   "publishers" : [ {
> > > >     "msgRateIn" : 0.0,
> > > >     "msgThroughputIn" : 0.0,
> > > >     "averageMsgSize" : 0.0,
> > > >     "chunkedMessageRate" : 0.0,
> > > >     "producerId" : 0,
> > > >     "supportsPartialProducer" : false
> > > >   } ],
> > > >
> > > > After fix:
> > > > % ./bin/pulsar-admin topics partitioned-stats
> > > > persistent://public/default/pt
> > > > ...
> > > >   "publishers" : [ {
> > > >     "msgRateIn" : 0.0,
> > > >     "msgThroughputIn" : 0.0,
> > > >     "averageMsgSize" : 0.0,
> > > >     "chunkedMessageRate" : 0.0,
> > > >     "producerId" : 0,
> > > >     "supportsPartialProducer" : true,
> > > >     "producerName" : "standalone-0-1"
> > > >   }, {
> > > >     "msgRateIn" : 0.0,
> > > >     "msgThroughputIn" : 0.0,
> > > >     "averageMsgSize" : 0.0,
> > > >     "chunkedMessageRate" : 0.0,
> > > >     "producerId" : 0,
> > > >     "supportsPartialProducer" : true,
> > > >     "producerName" : "standalone-0-0"
> > > >   } ],
> > > > ...
> > > >
> > > >
> > > > The broker side's producer name generation has been there since this
> PR
> > > > <https://github.com/apache/pulsar/pull/1178>(1.22-incubating).
> > > > ProducerName was automatically generated in this format
> > > > {clusterName}-{brokerInstanceId}-{producerNameGenerationCounter}
> until
> > > > 2.10.
> > > > So, a producer to a partitioned topic(2) results in the two producer
> > > names,
> > > > like the following.
> > > > standalone-0-0
> > > > standalone-0-1
> > > >
> > > > And since 2.10, by default, partitioned producers have the same
> > producer
> > > > name between partitions by this PR
> > > > <https://github.com/apache/pulsar/pull/10279>(generated on the
> client
> > > > side).
> > > > standalone-0-0
> > > >
> > > > Hence, the impacted versions(backward incompatibility issue
> > > > <https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182
> >)
> > > by
> > > > the proposed fix(Option 1 below) are < 2.10.
> > > >
> > > > When we aggregate stats between partitions, the default(
> > > > aggregatePublisherStatsByProducerName=false) aggregates the producer
> > > stats
> > > > by the index of the producer stat list from each partition. So, when
> > > lucky,
> > > > it could output a single producer stat. However, this method can be
> > > buggy,
> > > > as each partition could return a different size and index of the
> > > > producer stat list.
> > > >
> > > >
> > > > To fix the original issue(described here
> > > > <https://github.com/apache/pulsar/pull/18254>), I think we have the
> > > > following options.
> > > >
> > > > Option 1(proposed): Deprecate aggregatePublisherStatsByProducerName
> > (the
> > > > current PR here <https://github.com/apache/pulsar/pull/18254>) in
> the
> > > next
> > > > release and live with behavior change where we get `topics
> > > > partitioned-stats` per-producer-and-partition from old clients(Ver.
> > > <2.10),
> > > > instead of stats per-producer.
> > > >
> > > > Option 2: Defer the Option 1 fix and push it to the next major
> > > > version(3.0.0), as this is a breaking change.
> > > >
> > > > Option 3: Keep aggregatePublisherStatsByProducerName config but
> change
> > > the
> > > > default to aggregatePublisherStatsByProducerName=true.
> > > >
> > > > Option 4: As a long-term fix, create separate Admin-APIs for
> publisher
> > > and
> > > > subscription stats and drop their stats from `topics
> partitioned-stats`
> > > as
> > > > it is expensive to aggregate them on the fly. (for thousands of
> > > publishers
> > > > and subscriptions). Push this change to the next major version.
> > > >
> > > > Or other suggestions?
> > > >
> > > > Regards,
> > > > Heesung
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Mon, Nov 14, 2022 at 7:56 PM Heesung Sohn <
> > > heesung.s...@streamnative.io
> > > > >
> > > > wrote:
> > > >
> > > > > Dear Pulsar Community,
> > > > >
> > > > > We recently found a bug in `pulsar-admin topics partitioned-stats
> > api`
> > > > that
> > > > > could incur a memory burst, high GC time, or OOM.
> > > > >
> > > > > For this issue, I proposed a fix
> > > > > <https://github.com/apache/pulsar/pull/18254> by deprecating the
> > > > aggregatePublisherStatsByProducerName
> > > > > config and always aggregating the publishers' stats by
> publisherName,
> > > > > instead of the list
> > index(aggregatePublisherStatsByProducerName=false,
> > > > > default).
> > > > >
> > > > >
> > > > >    -  The index-based aggregation is inherently wrong in a highly
> > > > >    concurrent producer environment(where the order and size of the
> > > > publisher
> > > > >    stat list are not guaranteed to be the same). The publisher
> stats
> > > > need to
> > > > >    be aggregated by a unique key, preferably the producer
> > > > >    name(aggregatePublisherStatsByProducerName=true).
> > > > >
> > > > >
> > > > > However, this fix will break some of the old client's compatibility
> > > since
> > > > > the way Pulsar generates the producer name has changed over time,
> as
> > > > > described here
> > > > > <
> https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182
> > >.
> > > > >
> > > > > As I replied here
> > > > > <
> https://github.com/apache/pulsar/pull/18254#issuecomment-1312084363
> > >,
> > > > although
> > > > > it is not desirable, I think we could be lenient on this change in
> > the
> > > > stat
> > > > > API response(assuming thispublishers'stat struct is used for human
> > > admins
> > > > > only for ad-hoc checks).
> > > > >
> > > > > Are we OK with this non-backward-compatible fix for some of the old
> > > > > clients? Or, do you have any other suggestions?
> > > > >
> > > > > One idea for a long-term fix could be:
> > > > > When there are thousands of producers(consumers) for a
> > > > > (partitioned-)topic, it is expensive to aggregate each
> > > > > publisher(subscriptions)'s stats on-the-fly across the brokers.
> > > > Alternatively,
> > > > > for the next major version, I think we could further define
> > > > > producers(subscriptions)' API like the below and drop the
> publishers
> > > and
> > > > > subscriptions structs from topics (partitioned-)stats returns.
> > > > >
> > > > > pulsar-admin publishers list my-topic --last-pagination-key xyz
> > > > > pulsar-admin publishers stats my-producer
> > > > >
> > > > > # similarly for subscriptions
> > > > >
> > > > > Regards,
> > > > > Heesung
> > > > >
> > > >
> > >
> >
>

Reply via email to