Oops.  I realized I just replied to Omnia 🤦‍♀️

Here was my response for the mailing thread:

Hey Omnia,
Sorry to hear this is a problem for you as well. :(
> I have some concerns about the impact of this option on the
> transactional producers, for example, what will happen to an ongoing
> transaction associated with an expired PID? Would this leave the
> transactions in a "hanging" state?
We currently check whether a producer ID has an ongoing transaction and do
not expire it if it does. I suspect we will continue to do this with any
solution we pick.
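
For anyone following along, the guard looks roughly like this (a sketch
only; the names below are made up, the real logic lives in the broker's
producer state management):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative sketch of "don't expire a PID with an open transaction".
    class PidExpirationSketch {
        record Entry(long lastTimestampMs, boolean hasOngoingTxn) {}

        final Map<Long, Entry> producers = new ConcurrentHashMap<>();
        final long producerIdExpirationMs = 86_400_000L; // e.g. one day

        void maybeExpire(long producerId, long nowMs) {
            Entry entry = producers.get(producerId);
            if (entry == null || entry.hasOngoingTxn()) {
                return; // never expire a PID that still has an ongoing transaction
            }
            if (nowMs - entry.lastTimestampMs() > producerIdExpirationMs) {
                producers.remove(producerId);
            }
        }
    }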

My team members and I looked into the throttling case, and it can get
tricky since it means we need to throttle the produce request before it is
processed. That means we either block all requests or have to hold the
request in memory (which is not great if we are trying to save memory).
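
To make that trade-off concrete, a throttle in front of request processing
would look something like this (purely illustrative; the quota lookup and
the queue are hypothetical):

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // Sketch: throttling before processing means the un-processed request has
    // to live somewhere, either on a handler thread we block or on the heap.
    class ProduceThrottleSketch {
        record ParkedRequest(Object request, long releaseAtMs) {}

        final Queue<ParkedRequest> parked = new ConcurrentLinkedQueue<>();

        void onProduce(Object request, String principal, long nowMs) {
            long delayMs = throttleTimeMs(principal);  // hypothetical quota lookup
            if (delayMs > 0) {
                // the whole request sits in memory until the delay elapses
                parked.add(new ParkedRequest(request, nowMs + delayMs));
            } else {
                process(request);
            }
        }

        long throttleTimeMs(String principal) { return 0L; }  // placeholder
        void process(Object request) {}                       // placeholder
    }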

I recently had another discussion with my team, and it does seem like there
should be a way to make it clearer to clients what is going on. A lot of
this protocol is implicit. I'm wondering if there is a way to improve the
story for newer clients (i.e., if we choose to expire based on a size limit,
we should include a response indicating the ID has expired). We also
discussed ways to redefine the guarantees so that users who have stronger
idempotency requirements can ensure them (over availability/memory
concerns). Let me know if you have any ideas here.
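
To sketch what the size-limit option could look like (rough illustration
only; the entry type is a stand-in, and the "tell the client its ID expired"
part does not exist today):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Stand-in for whatever per-PID state the broker keeps.
    record PidEntry(short epoch, int lastSequence, boolean hasOngoingTxn) {}

    // Size-bounded, LRU-style PID cache: when we exceed the limit, the least
    // recently used entry is dropped.
    class BoundedPidCache extends LinkedHashMap<Long, PidEntry> {
        private final int maxEntries;

        BoundedPidCache(int maxEntries) {
            super(16, 0.75f, true); // access-order, so iteration order is LRU
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<Long, PidEntry> eldest) {
            // Skip entries with an ongoing transaction; ideally eviction would
            // also surface an explicit error on that PID's next produce request,
            // which is the "response indicating the ID has expired" idea above.
            return size() > maxEntries && !eldest.getValue().hasOngoingTxn();
        }
    }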

Thanks again for commenting here, hopefully we can come to a good solution.

On Tue, Oct 18, 2022 at 1:11 AM Luke Chen <show...@gmail.com> wrote:

> Hi Omnia,
>
> Thanks for your reply.
>
> For (3), you said:
> > - I have some concerns about the impact of this option on the
> transactional
> producers, for example, what will happen to an ongoing transaction
> associated with an expired PID? Would this leave the transactions in a
> "hanging" state?
>
> - How will we notify the client that the transaction can't continue due to
> an expired PID?
>
> - If a PID gets marked as `expired`, this will mean that
> `admin.DescribeProducers` will not list it, which will make
> `kafka-transactions.sh --list` a bit tricky as we can't identify whether
> there are transactions linked to this expired PID or not. The same concern
> applies to `kafka-transactions.sh --find-hanging`.
>
> --> Yes, you're right. Those are also concerns for this solution.
> Currently, there's no way to notify clients about the expiration.
> Also, the ongoing transactions will be left hanging. As for the admin CLI,
> we hadn't thought about it. Good point.
> In summary, there are many issues that need to be fixed before we can
> adopt this solution.
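
(For context on the tooling concern: the admin client can only report PIDs
that are still in the broker's cache, so anything we expire simply
disappears from this view. Roughly, with the existing Admin API from
KIP-664:)

    import java.util.List;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.ProducerState;
    import org.apache.kafka.common.TopicPartition;

    class DescribeProducersExample {
        // Prints the producer IDs the broker still remembers for one partition;
        // an expired PID would simply be missing from this output.
        static void printActiveProducerIds(Admin admin, TopicPartition tp) throws Exception {
            List<ProducerState> producers = admin.describeProducers(List.of(tp))
                    .partitionResult(tp).get().activeProducers();
            for (ProducerState p : producers) {
                System.out.printf("pid=%d epoch=%d lastSeq=%d%n",
                        p.producerId(), p.producerEpoch(), p.lastSequence());
            }
        }
    }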
>
>
> For (5), you said:
> > I am assuming you mean KafkaPrincipal here! If so, is your concern here
> > that those good clients that use the same principal as a rogue one will
> > get throttled?
> >
> > If this is the case, then I believe it should be okay, as other throttling
> > in Kafka on `/config/users/<user>` has the same behaviour.
> >
> > What about applying the limit/throttling to
> > `/config/users/<user>/clients/<client-id>`, similar to what we have with
> > client quotas? This should reduce the concern about throttling good
> > clients, right?
>
> --> Yes, I mean KafkaPrincipal (sorry, I didn't make that clear).
> Yes, we were thinking about throttling by KafkaPrincipal. Client ID is
> also workable.
> It's just that these 2 attributes are not required.
> That is, it's possible we treat all clients as the same one: {default
> KafkaPrincipal + default clientID}, and apply throttling to that.
> Do you have any thoughts about it?
> Maybe skip throttling for {default KafkaPrincipal + default clientID}?
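
(For what it's worth, the existing quota machinery already targets the
user + client-id combination, so a producer ID quota could plausibly reuse
that shape. The snippet below sets a quota key that exists today; a
PID-specific key would be new:)

    import java.util.List;
    import java.util.Map;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.common.quota.ClientQuotaAlteration;
    import org.apache.kafka.common.quota.ClientQuotaEntity;

    class ClientQuotaExample {
        // Targets /config/users/<user>/clients/<client-id> via the Admin API.
        static void setQuota(Admin admin) throws Exception {
            ClientQuotaEntity entity = new ClientQuotaEntity(Map.of(
                    ClientQuotaEntity.USER, "some-user",
                    ClientQuotaEntity.CLIENT_ID, "some-client"));
            // "producer_byte_rate" exists today; a PID-allocation quota key
            // would have to be added, but could follow the same pattern.
            ClientQuotaAlteration alteration = new ClientQuotaAlteration(entity,
                    List.of(new ClientQuotaAlteration.Op("producer_byte_rate", 1_048_576.0)));
            admin.alterClientQuotas(List.of(alteration)).all().get();
        }
    }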
>
> Luke
>
>
>
> On Sat, Oct 15, 2022 at 2:33 AM Omnia Ibrahim <o.g.h.ibra...@gmail.com>
> wrote:
>
>> Hi Luke & Justine,
>> Thanks for looking into this issue; we have been experiencing this because
>> of rogue clients as well.
>>
>> > 3. Having a limit on the number of active producer IDs (sort of like an
>> > LRU cache)
>> > -> The idea here is that if we hit a misconfigured client, we will expire
>> > the older entries. The concern here is that we risk losing idempotency
>> > guarantees, and currently we don't have a way to notify clients about
>> > losing idempotency guarantees. Besides, the least recently used entries
>> > that get removed are not always from the "bad" clients.
>>
>> - I have some concerns about the impact of this option on the
>> transactional
>> producers, for example, what will happen to an ongoing transaction
>> associated with an expired PID? Would this leave the transactions in a
>> "hanging" state?
>>
>> - How will we notify the client that the transaction can't continue due to
>> an expired PID?
>>
>> - If a PID gets marked as `expired`, this will mean that
>> `admin.DescribeProducers` will not list it, which will make
>> `kafka-transactions.sh --list` a bit tricky as we can't identify whether
>> there are transactions linked to this expired PID or not. The same concern
>> applies to `kafka-transactions.sh --find-hanging`.
>>
>>
>> > 5. Limit/throttle the producer ID based on the principle
>> > -> Although we can limit the impact to a certain principle with this
>> > idea, the same concern still exists as with solutions #1 and #2.
>>
>> I am assuming you mean KafkaPrincipal here! If so, is your concern here
>> that those good clients that use the same principal as a rogue one will
>> get throttled?
>>
>> If this is the case, then I believe it should be okay, as other throttling
>> in Kafka on `/config/users/<user>` has the same behaviour.
>>
>>
>> What about applying the limit/throttling to
>> `/config/users/<user>/clients/<client-id>`, similar to what we have with
>> client quotas? This should reduce the concern about throttling good
>> clients, right?
>>
>> best,
>>
>> Omnia
>>
>> On Tue, Oct 11, 2022 at 4:18 AM Luke Chen <show...@gmail.com> wrote:
>>
>> > Bump this thread to see if there are any comments/thoughts.
>> > Thanks.
>> >
>> > Luke
>> >
>> > On Mon, Sep 26, 2022 at 11:06 AM Luke Chen <show...@gmail.com> wrote:
>> >
>> > > Hi devs,
>> > >
>> > > As stated in the motivation section in KIP-854
>> > > <
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-854+Separate+configuration+for+producer+ID+expiry
>> > >:
>> > >
>> > > With idempotent producers becoming the default in Kafka, unless
>> > > otherwise specified, all new producers will be given producer IDs.
>> > > Some (inefficient) applications may now create many non-transactional
>> > > idempotent producers. Each of these producers will be assigned a
>> > > producer ID, and these IDs and their metadata are stored in broker
>> > > memory, which might cause brokers to run out of memory.
>> > >
>> > > Justine (in cc) and I, along with some other team members, are working
>> > > on solutions for this issue, but none of them solves it completely
>> > > without side effects. Among them, we can't decide whether to sacrifice
>> > > "availability" or "idempotency guarantees". Some of these solutions
>> > > sacrifice produce availability (1, 2, 5) and others sacrifice
>> > > idempotency guarantees (3). It would be useful to know whether people
>> > > generally have a preference one way or the other, or what other better
>> > > solutions there might be.
>> > >
>> > > Here are the proposals we came up with:
>> > >
>> > > 1. Limit the total number of active producer ID allocations.
>> > > -> This is the simplest solution, but since the OOM issue is usually
>> > > caused by a rogue or misconfigured client, this solution might "punish"
>> > > good clients by keeping them from sending messages.
>> > >
>> > > 2. Throttling the producer ID allocation rate
>> > > -> Same concern as solution #1.
>> > >
>> > > 3. Having a limit on the number of active producer IDs (sort of like
>> > > an LRU cache)
>> > > -> The idea here is that if we hit a misconfigured client, we will
>> > > expire the older entries. The concern here is that we risk losing
>> > > idempotency guarantees, and currently we don't have a way to notify
>> > > clients about losing idempotency guarantees. Besides, the least
>> > > recently used entries that get removed are not always from the "bad"
>> > > clients.
>> > >
>> > > 4. Allow clients to "close" the producer ID usage
>> > > -> We can provide a way for a producer to "close" its producer ID
>> > > usage. Currently, we only have the INIT_PRODUCER_ID request to
>> > > allocate one. After that, we keep the producer ID metadata in the
>> > > broker even if the producer is "closed". With a close API (ex:
>> > > END_PRODUCER_ID), we could remove the entry on the broker side, and
>> > > on the client side we could send it when the producer closes (see the
>> > > sketch after this list). The concern is that old clients (including
>> > > non-Java clients) will still suffer from the OOM issue.
>> > >
>> > > 5. Limit/throttle the producer ID based on the principle
>> > > -> Although we can limit the impact to a certain principle with this
>> > > idea, the same concern still exists as with solutions #1 and #2.
>> > >
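
(To make proposal 4 above a bit more concrete, this is the rough shape I
picture. Nothing here exists today; the request type and both hooks are
purely hypothetical:)

    // Entirely hypothetical: there is no END_PRODUCER_ID request in Kafka today.
    class EndProducerIdSketch {
        record EndProducerIdRequest(long producerId, short epoch) {}

        final java.util.Map<Long, Object> producerStateCache = new java.util.HashMap<>();

        // Client side: when the producer closes cleanly, tell the broker the
        // PID can be forgotten.
        void onProducerClose(long producerId, short epoch) {
            send(new EndProducerIdRequest(producerId, epoch));
        }

        // Broker side: drop the cached state so it stops consuming memory.
        void handleEndProducerId(EndProducerIdRequest request) {
            producerStateCache.remove(request.producerId());
        }

        void send(EndProducerIdRequest request) {}  // placeholder transport
    }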
>> > > Any thoughts/feedback are welcome.
>> > >
>> > > Thank you.
>> > > Luke
>> > >
>> >
>>
>
