Re: [VOTE] KIP-846: Processor-level Streams metrics for records/bytes Producedd

2022-06-01 Thread Guozhang Wang
+1 (binding)

Thanks Sophie!

On Tue, May 31, 2022 at 11:28 AM Walker Carlson
 wrote:

> +1 non binding
>
> On Tue, May 31, 2022 at 12:19 PM John Roesler  wrote:
>
> > +1 (binding)
> >
> > Thanks,
> > John
> >
> > On Mon, May 30, 2022, at 13:00, Bill Bejeck wrote:
> > > +1 (binding)
> > >
> > > -Bill
> > >
> > > On Mon, May 30, 2022 at 4:49 AM Sagar 
> wrote:
> > >
> > >> +1 (non-binding).
> > >>
> > >> Thanks!
> > >> Sagar.
> > >>
> > >> On Mon, May 30, 2022 at 1:11 PM Bruno Cadonna 
> > wrote:
> > >>
> > >> > +1 (binding)
> > >> >
> > >> > Thanks!
> > >> > Bruno
> > >> >
> > >> > On 30.05.22 09:36, Sophie Blee-Goldman wrote:
> > >> > > Hey all,
> > >> > >
> > >> > >   I'd like to kick off the voting thread for the KIP I proposed to
> > add
> > >> > > processor-level "bytes/records produced" metrics to Kafka Streams.
> > >> > >
> > >> > > Thanks!
> > >> > >
> > >> > > KIP-846: Task-level Streams metrics for bytes/records Produced
> > >> > > <
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211886093
> > >> > >
> > >> > >
> > >> > > Cheers,
> > >> > > Sophie
> > >> > >
> > >> >
> > >>
> >
>


-- 
-- Guozhang


[jira] [Created] (KAFKA-13953) kafka Console consumer fails with CorruptRecordException

2022-06-01 Thread Aldan Brito (Jira)
Aldan Brito created KAFKA-13953:
---

 Summary: kafka Console consumer fails with CorruptRecordException 
 Key: KAFKA-13953
 URL: https://issues.apache.org/jira/browse/KAFKA-13953
 Project: Kafka
  Issue Type: Bug
  Components: consumer, controller, core
Affects Versions: 2.7.0
Reporter: Aldan Brito


Kafka consumer fails with corrupt record exception. 
{code:java}
opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server 192.168.7.93:28104 
--topic BQR-PULL-DEFAULT --from-beginning > 
/opt/nokia/kafka-zookeeper-clustering/kafka/topic-data/tmpsdh/dumptest
[{*}2022-05-15 18:34:15,146]{*} ERROR Error processing message, terminating 
consumer process:  (kafka.tools.ConsoleConsumer$)
org.apache.kafka.common.KafkaException: Received exception when fetching the 
next record from BQR-PULL-DEFAULT-30. If needed, please seek past the record to 
continue consumption.
        at 
org.apache.kafka.clients.consumer.internals.Fetcher$CompletedFetch.fetchRecords(Fetcher.java:1577)
        at 
org.apache.kafka.clients.consumer.internals.Fetcher$CompletedFetch.access$1700(Fetcher.java:1432)
        at 
org.apache.kafka.clients.consumer.internals.Fetcher.fetchRecords(Fetcher.java:684)
        at 
org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:635)
        at 
org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1276)
        at 
org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1237)
        at 
org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1210)
        at 
kafka.tools.ConsoleConsumer$ConsumerWrapper.receive(ConsoleConsumer.scala:438)
        at kafka.tools.ConsoleConsumer$.process(ConsoleConsumer.scala:104)
        at kafka.tools.ConsoleConsumer$.run(ConsoleConsumer.scala:78)
        at kafka.tools.ConsoleConsumer$.main(ConsoleConsumer.scala:55)
        at kafka.tools.ConsoleConsumer.main(ConsoleConsumer.scala)
Caused by: org.apache.kafka.common.errors.CorruptRecordException: Record size 0 
is less than the minimum record overhead (14)
Processed a total of 15765197 messages {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


Re: [VOTE] KIP-834: Pause / Resume KafkaStreams Topologies

2022-06-01 Thread Sophie Blee-Goldman
Hey Jim, thanks for the update. I'm on the same side as Guozhang here, as
I've expressed during
the original discussion I think it would be confusing and possibly harmful
to continue *any* kind of
processing or action within Streams while it is "paused". In fact I sort of
assumed we were including
active task restoration under the umbrella of standby tasks when we decided
to pause those as well,
but since they are technically different I can see why we might want to
consider them separately.

I would say that for now we should just keep the semantics simple and
obvious, and if users express
a desire to pause applications from active processing but allow them to
catch up as restoring actives,
lagging standbys, warmup tasks, or so on then we can always add that
functionality to the feature later on

On Wed, Jun 1, 2022 at 11:41 AM Guozhang Wang  wrote:

> Hello Jim,
>
> I think If our primary goal would be to reduce resource utilization and
> potentially to stop the streaming pipeline for investigating possible bugs
> etc, then we should also pause active tasks' restoration as well since that
> 1) may still use resources, and 2) may load in bad data.
>
> Guozhang
>
>
>
>
>
>
>
> On Wed, Jun 1, 2022 at 5:53 AM Jim Hughes 
> wrote:
>
> > Hi all,
> >
> > While reviewing my PR for KIP-834, Bruno noticed a case that we may not
> > have discussed enough.*
> >
> > During the discussion, we decided that standby tasks would be paused.  In
> > order to do this, there are changes to the StoreChangelogReader around
> > where it does restorations.  Bruno noticed that the restoration of active
> > tasks is not paused in my PR.
> >
> > From my point of view, I was hoping to let active tasks
> restore/consume/etc
> > in order that the Kafka Streams instance could transition to RUNNING
> > (assuming that it was started paused).  I believe Bruno's position is
> that
> > if we are pausing restoration for standby tasks, then restoration should
> be
> > paused for active tasks as well.
> >
> > Since this point hasn't been discussed like this, the KIP is unclear
> about
> > this detail.
> >
> > What do folks think?
> >
> > Thanks in advance,
> >
> > Jim
> >
> > * https://github.com/apache/kafka/pull/12161#discussion_r886732983
> >
> > On Mon, May 16, 2022 at 11:07 AM Jim Hughes 
> wrote:
> >
> > > Hi all,
> > >
> > >
> > > With 5 binding votes (John, Bruno, Sophie, Matthias, Bill) and 4
> > > non-binding votes (Guozhang, Luke, Leah, Walker), the vote for KIP-834
> > > passes!
> > >
> > >
> > > Thanks all for the great discussion.
> > >
> > > I have a PR up here: https://github.com/apache/kafka/pull/12161
> > >
> > >
> > > Thanks in advance for feedback on the PR!
> > >
> > >
> > > Cheers,
> > >
> > >
> > > JIm
> > >
> > > On Fri, May 13, 2022 at 12:04 PM Walker Carlson
> > >  wrote:
> > >
> > >> +1 from me (non-binding)
> > >>
> > >> Walker
> > >>
> > >> On Wed, May 11, 2022 at 12:36 PM Leah Thomas
> >  > >> >
> > >> wrote:
> > >>
> > >> > Thanks Jim, great discussion. +1 from me (non-binding)
> > >> >
> > >> > Cheers,
> > >> > Leah
> > >> >
> > >> > On Wed, May 11, 2022 at 10:14 AM Bill Bejeck 
> > wrote:
> > >> >
> > >> > > Thanks for the KIP!
> > >> > >
> > >> > > +1 (binding)
> > >> > >
> > >> > > -Bill
> > >> > >
> > >> > > On Wed, May 11, 2022 at 9:36 AM Luke Chen 
> > wrote:
> > >> > >
> > >> > > > Hi Jim,
> > >> > > >
> > >> > > > I'm +1. (please add some note in KIP about the stream resetting
> > tool
> > >> > > can't
> > >> > > > be used in paused state)
> > >> > > > Thanks for the KIP!
> > >> > > >
> > >> > > > Luke
> > >> > > >
> > >> > > > On Wed, May 11, 2022 at 9:09 AM Guozhang Wang <
> wangg...@gmail.com
> > >
> > >> > > wrote:
> > >> > > >
> > >> > > > > Thanks Jim. +1 from me.
> > >> > > > >
> > >> > > > > On Tue, May 10, 2022 at 4:51 PM Matthias J. Sax <
> > mj...@apache.org
> > >> >
> > >> > > > wrote:
> > >> > > > >
> > >> > > > > > I had one minor question on the discuss thread. It's mainly
> > >> about
> > >> > > > > > clarifying and document the user contract. I am fine either
> > way.
> > >> > > > > >
> > >> > > > > > +1 (binding)
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > -Matthias
> > >> > > > > >
> > >> > > > > > On 5/10/22 12:32 PM, Sophie Blee-Goldman wrote:
> > >> > > > > > > Thanks for the KIP! +1 (binding)
> > >> > > > > > >
> > >> > > > > > > On Tue, May 10, 2022, 12:24 PM Bruno Cadonna <
> > >> cado...@apache.org
> > >> > >
> > >> > > > > wrote:
> > >> > > > > > >
> > >> > > > > > >> Thanks Jim,
> > >> > > > > > >>
> > >> > > > > > >> +1 (binding)
> > >> > > > > > >>
> > >> > > > > > >> Best,
> > >> > > > > > >> Bruno
> > >> > > > > > >>
> > >> > > > > > >> On 10.05.22 21:19, John Roesler wrote:
> > >> > > > > > >>> Thanks Jim,
> > >> > > > > > >>>
> > >> > > > > > >>> I’m +1 (binding)
> > >> > > > > > >>>
> > >> > > > > > >>> -John
> > >> > > > > > >>>
> > >> > > > > > >>> On Tue, May 10, 2022, at 14:05, Jim Hughes wrote:
> > >> > > > > >  Hi all,
> > >> > > > > > 
> > >> > > > 

Re: [DISCUSS] KIP-846: Task-level Streams metrics for bytes/records Produced

2022-06-01 Thread Sophie Blee-Goldman
Thanks Guozhang -- I'll definitely make sure we have this benchmarked with
an eye for any regressions before it
makes it into a release. That said, if it's any comfort, we use the cached
system time and record only count and
sum type metrics, so unlike the *-rate *metrics for example we don't have
to make any calls to the system clock
which I know we have found to impact performance when investigating past
regressions.

On Wed, Jun 1, 2022 at 11:07 AM Guozhang Wang  wrote:

> Thanks Sophie, that makes sense. Also I agree that since we are adding it
> at finest granularity for consumed metrics, it's better to have symmetry
> and add produced metrics at processor-node level as well.
>
> Regarding the benchmarks, I think that would be critical to have before we
> add in the next release, since we propose to make this INFO level, meaning
> it would be on by default, and collecting per-record is a critical path.
>
>
> Guozhang
>
> On Wed, Jun 1, 2022 at 8:04 AM Sophie Blee-Goldman
>  wrote:
>
> > I just want to send out a small update -- I decided to include the "-
> > *consumed*" metrics in the KIP alongside the
> > *"-produced"* metrics after all, for reasons I address in the paragraph I
> > added at the end of the motivation section.
> > Please let me know if you have any questions or concerns
> >
> > Cheers,
> > Sophie
> >
> > On Wed, Jun 1, 2022 at 1:38 AM Sophie Blee-Goldman 
> > wrote:
> >
> > > Just a quick question: for filling the gap of sub-topology
> visibilities,
> > >> would task-level produced metrics be sufficient?
> > >
> > >
> > > If I understand your question correctly, you're asking whether we could
> > > just report at the task/subtopology
> > > level since we mainly want the bytes/throughput produced by the
> > > subtopology itself?
> > >
> > > Note that since we're only reporting this metric at the sink nodes,
> it's
> > > basically the same as a task-level metric
> > > for any subtopology that has only one sink and only scales with the
> > number
> > > of output topics. If you're concerned
> > > about a potential performance impact I would say that (a) making these
> > > task-level doesn't buy us much, if anything,
> > > and (b) we can always revisit this if our benchmarks do indeed reveal a
> > > regression but for now let's not over-
> > > optimize too much :)
> > >
> > > Also, the general philosophy  behind the KAfka Streams metrics thus far
> > > has been to report at the finest granularity
> > > and allow users to roll them up into whatever scope they want to
> > aggregate
> > > over. This KIP adopts the same approach.
> > >
> > > On Tue, May 31, 2022 at 12:02 PM Guozhang Wang 
> > wrote:
> > >
> > >> Hi Sophie,
> > >>
> > >> Just a quick question: for filling the gap of sub-topology
> visibilities,
> > >> would task-level produced metrics be sufficient?
> > >>
> > >> On Mon, May 30, 2022 at 10:59 AM Bill Bejeck 
> wrote:
> > >>
> > >> > Thanks for the KIP Sophie.
> > >> >
> > >> > I'm in favor of this change as well. I don't have any comments in
> > >> > addition to the ones already expressed.
> > >> >
> > >> > -Bill
> > >> >
> > >> > On Mon, May 30, 2022 at 4:55 AM Sagar 
> > >> wrote:
> > >> >
> > >> > > Hi Sophie,
> > >> > >
> > >> > > A very minor comment but you might want to remove this KIP
> template
> > >> > related
> > >> > > information from the top of the KIP:
> > >> > >
> > >> > > *This page is meant as a template for writing a KIP
> > >> > > <
> > >> > >
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
> > >> > > >.
> > >> > > To create a KIP choose Tools->Copy on this page and modify with
> your
> > >> > > content and replace the heading with the next KIP number and a
> > >> > description
> > >> > > of your issue. Replace anything in italics with your own
> > description.*
> > >> > >
> > >> > >
> > >> > > Thanks!
> > >> > > Sagar.
> > >> > >
> > >> > > On Mon, May 30, 2022 at 1:04 PM Sophie Blee-Goldman
> > >> > >  wrote:
> > >> > >
> > >> > > > >
> > >> > > > > Why does the title of the KIP talk about task-level metrics,
> but
> > >> the
> > >> > > > > specified metrics are on processor-level?
> > >> > > >
> > >> > > >
> > >> > > > Ah, my mistake -- it should indeed say "processor-level
> metrics".
> > >> > Thanks
> > >> > > > for the catch Bruno, the title has been fixed.
> > >> > > >
> > >> > > > Since there don't seem to be any concerns I'll proceed with
> > kicking
> > >> off
> > >> > > the
> > >> > > > vote. Thanks all!
> > >> > > >
> > >> > > > On Mon, May 30, 2022 at 12:01 AM Bruno Cadonna <
> > cado...@apache.org>
> > >> > > wrote:
> > >> > > >
> > >> > > > > Thanks for the KIP, Sophie!
> > >> > > > >
> > >> > > > > I am also in favor of this KIP!
> > >> > > > >
> > >> > > > > I have one minor question:
> > >> > > > >
> > >> > > > > Why does the title of the KIP talk about task-level metrics,
> but
> > >> the
> > >> > > > > specified metrics are on processor-level?
> > >> > > > >
> > >> > > > > For the rest, 

Re: [VOTE] KIP-746: Revise KRaft Metadata Records

2022-06-01 Thread Colin McCabe
Hi all,

I updated this with the changes to FeatureLevelRecord that we did in KIP-778. 
Since the original version was never implemented, it would have been confusing 
to leave it there, I think.

best,
Colin

On Wed, Jul 21, 2021, at 17:41, Colin McCabe wrote:
> Hi all,
>
> I made an addendum to this KIP to reflect some of the changes we did 
> recently. Specifically, we decided that since we are not supporting 
> KRaft upgrade from 2.8 to 3.0, we should keep the message versions at 0 
> and increment the frame version instead.
>
> best,
> Colin
>
>
> On Thu, Jun 10, 2021, at 08:38, Colin McCabe wrote:
>> Hi all,
>> 
>> Thanks for the discussion and votes.
>> 
>> The vote passes with binding +1s from:
>> Jason Gustafson
>> Jun Rao
>> David Arthur
>> 
>> and a non-binding +1 from
>> Israel Ekpo
>> 
>> best,
>> Colin
>> 
>> 
>> On Wed, Jun 9, 2021, at 18:15, David Arthur wrote:
>> > Thanks, Colin, looks good to me. +1
>> > 
>> > On Wed, Jun 9, 2021 at 8:32 PM Israel Ekpo  wrote:
>> > 
>> > > Makes sense to me
>> > >
>> > > +1 (non-binding)
>> > >
>> > >
>> > > On Wed, Jun 9, 2021 at 7:05 PM Jun Rao  wrote:
>> > >
>> > > > Hi, Colin,
>> > > >
>> > > > Thanks for the KIP. +1 from me.
>> > > >
>> > > > Jun
>> > > >
>> > > > On Wed, Jun 9, 2021 at 9:36 AM Jason Gustafson
>> > > > > > > >
>> > > > wrote:
>> > > >
>> > > > > +1 Thanks Colin!
>> > > > >
>> > > > > On Thu, Jun 3, 2021 at 4:30 PM Colin McCabe 
>> > > wrote:
>> > > > >
>> > > > > > Hi all,
>> > > > > >
>> > > > > > I'd like to call a vote for KIP-746: Revise KRaft Metadata Records.
>> > > > This
>> > > > > > is a minor KIP which revises the KRaft metadata records slightly 
>> > > > > > for
>> > > > the
>> > > > > > upcoming 3.0 release.
>> > > > > >
>> > > > > > The KIP is at: https://cwiki.apache.org/confluence/x/34zOCg
>> > > > > >
>> > > > > > best,
>> > > > > > Colin
>> > > > > >
>> > > > >
>> > > >
>> > >
>> > 
>> > 
>> > -- 
>> > David Arthur
>> > 
>>


Re: [DISCUSS] KIP-844: Transactional State Stores

2022-06-01 Thread Guozhang Wang
Alex,

Thanks for your replies! That is very helpful.

Just to broaden our discussions a bit here, I think there are some other
approaches in parallel to the idea of "enforce to only persist upon
explicit flush" and I'd like to throw one here -- not really advocating it,
but just for us to compare the pros and cons:

1) We let the StateStore's `flush` function to return a token instead of
returning `void`.
2) We add another `rollback(token)` interface of StateStore which would
effectively rollback the state as indicated by the token to the snapshot
when the corresponding `flush` is called.
3) We encode the token and commit as part of
`producer#sendOffsetsToTransaction`.

Users could optionally implement the new functions, or they can just not
return the token at all and not implement the second function. Again, the
APIs are just for the sake of illustration, not feeling they are the most
natural :)

Then the procedure would be:

1. the previous checkpointed offset is 100
...
3. flush store, make sure all writes are persisted; get the returned token
that indicates the snapshot of 200.
4. producer.sendOffsetsToTransaction(token); producer.commitTransaction();
5. Update the checkpoint file (say, the new value is 200).

Then if there's a failure, say between 3/4, we would get the token from the
last committed txn, and first we would do the restoration (which may get
the state to somewhere between 100 and 200), then call
`store.rollback(token)` to rollback to the snapshot of offset 100.

The pros is that we would then not need to enforce the state stores to not
persist any data during the txn: for stores that may not be able to
implement the `rollback` function, they can still reduce its impl to "not
persisting any data" via this API, but for stores that can indeed support
the rollback, their implementation may be more efficient. The cons though,
on top of my head are 1) more complicated logic differentiating between EOS
with and without store rollback support, and ALOS, 2) encoding the token as
part of the commit offset is not ideal if it is big, 3) the recovery logic
including the state store is also a bit more complicated.


Guozhang





On Wed, Jun 1, 2022 at 1:29 PM Alexander Sorokoumov
 wrote:

> Hi Guozhang,
>
> But I'm still trying to clarify how it guarantees EOS, and it seems that we
> > would achieve it by enforcing to not persist any data written within this
> > transaction until step 4. Is that correct?
>
>
> This is correct. Both alternatives - in-memory WriteBatchWithIndex and
> transactionality via the secondary store guarantee EOS by not persisting
> data in the "main" state store until it is committed in the changelog
> topic.
>
> Oh what I meant is not what KStream code does, but that StateStore impl
> > classes themselves could potentially flush data to become persisted
> > asynchronously
>
>
> Thank you for elaborating! You are correct, the underlying state store
> should not persist data until the streams app calls StateStore#flush. There
> are 2 options how a State Store implementation can guarantee that - either
> keep uncommitted writes in memory or be able to roll back the changes that
> were not committed during recovery. RocksDB's WriteBatchWithIndex is an
> implementation of the first option. A considered alternative, Transactions
> via Secondary State Store for Uncommitted Changes, is the way to implement
> the second option.
>
> As everyone correctly pointed out, keeping uncommitted data in memory
> introduces a very real risk of OOM that we will need to handle. The more I
> think about it, the more I lean towards going with the Transactions via
> Secondary Store as the way to implement transactionality as it does not
> have that issue.
>
> Best,
> Alex
>
>
> On Wed, Jun 1, 2022 at 12:59 PM Guozhang Wang  wrote:
>
> > Hello Alex,
> >
> > > we flush the cache, but not the underlying state store.
> >
> > You're right. The ordering I mentioned above is actually:
> >
> > ...
> > 3. producer.sendOffsetsToTransaction(); producer.commitTransaction();
> > 4. flush store, make sure all writes are persisted.
> > 5. Update the checkpoint file to 200.
> >
> > But I'm still trying to clarify how it guarantees EOS, and it seems that
> we
> > would achieve it by enforcing to not persist any data written within this
> > transaction until step 4. Is that correct?
> >
> > > Can you please point me to the place in the codebase where we trigger
> > async flush before the commit?
> >
> > Oh what I meant is not what KStream code does, but that StateStore impl
> > classes themselves could potentially flush data to become persisted
> > asynchronously, e.g. RocksDB does that naturally out of the control of
> > KStream code. I think it is related to my previous question: if we think
> by
> > guaranteeing EOS at the state store level, we would effectively ask the
> > impl classes that "you should not persist any data until `flush` is
> called
> > explicitly", is the StateStore interface the right level to 

Re: [DISCUSS] KIP-844: Transactional State Stores

2022-06-01 Thread Alexander Sorokoumov
Hi Guozhang,

But I'm still trying to clarify how it guarantees EOS, and it seems that we
> would achieve it by enforcing to not persist any data written within this
> transaction until step 4. Is that correct?


This is correct. Both alternatives - in-memory WriteBatchWithIndex and
transactionality via the secondary store guarantee EOS by not persisting
data in the "main" state store until it is committed in the changelog topic.

Oh what I meant is not what KStream code does, but that StateStore impl
> classes themselves could potentially flush data to become persisted
> asynchronously


Thank you for elaborating! You are correct, the underlying state store
should not persist data until the streams app calls StateStore#flush. There
are 2 options how a State Store implementation can guarantee that - either
keep uncommitted writes in memory or be able to roll back the changes that
were not committed during recovery. RocksDB's WriteBatchWithIndex is an
implementation of the first option. A considered alternative, Transactions
via Secondary State Store for Uncommitted Changes, is the way to implement
the second option.

As everyone correctly pointed out, keeping uncommitted data in memory
introduces a very real risk of OOM that we will need to handle. The more I
think about it, the more I lean towards going with the Transactions via
Secondary Store as the way to implement transactionality as it does not
have that issue.

Best,
Alex


On Wed, Jun 1, 2022 at 12:59 PM Guozhang Wang  wrote:

> Hello Alex,
>
> > we flush the cache, but not the underlying state store.
>
> You're right. The ordering I mentioned above is actually:
>
> ...
> 3. producer.sendOffsetsToTransaction(); producer.commitTransaction();
> 4. flush store, make sure all writes are persisted.
> 5. Update the checkpoint file to 200.
>
> But I'm still trying to clarify how it guarantees EOS, and it seems that we
> would achieve it by enforcing to not persist any data written within this
> transaction until step 4. Is that correct?
>
> > Can you please point me to the place in the codebase where we trigger
> async flush before the commit?
>
> Oh what I meant is not what KStream code does, but that StateStore impl
> classes themselves could potentially flush data to become persisted
> asynchronously, e.g. RocksDB does that naturally out of the control of
> KStream code. I think it is related to my previous question: if we think by
> guaranteeing EOS at the state store level, we would effectively ask the
> impl classes that "you should not persist any data until `flush` is called
> explicitly", is the StateStore interface the right level to enforce such
> mechanisms, or should we just do that on top of the StateStores, e.g.
> during the transaction we just keep all the writes in the cache (of course
> we need to consider how to work around memory pressure as previously
> mentioned), and then upon committing, we just write the cached records as a
> whole into the store and then call flush.
>
>
> Guozhang
>
>
>
>
>
>
>
> On Tue, May 31, 2022 at 4:08 PM Alexander Sorokoumov
>  wrote:
>
> > Hey,
> >
> > Thank you for the wealth of great suggestions and questions! I am going
> to
> > address the feedback in batches and update the proposal async, as it is
> > probably going to be easier for everyone. I will also write a separate
> > message after making updates to the KIP.
> >
> > @John,
> >
> > > Did you consider instead just adding the option to the
> > > RocksDB*StoreSupplier classes and the factories in Stores ?
> >
> > Thank you for suggesting that. I think that this idea is better than
> what I
> > came up with and will update the KIP with configuring transactionality
> via
> > the suppliers and Stores.
> >
> > what is the advantage over just doing the same thing with the RecordCache
> > > and not introducing the WriteBatch at all?
> >
> > Can you point me to RecordCache? I can't find it in the project. The
> > advantage would be that WriteBatch guarantees write atomicity. As far as
> I
> > understood the way RecordCache works, it might leave the system in an
> > inconsistent state during crash failure on write.
> >
> > You mentioned that a transactional store can help reduce duplication in
> the
> > > case of ALOS
> >
> > I will remove claims about ALOS from the proposal. Thank you for
> > elaborating!
> >
> > As a reminder, we have a new IQv2 mechanism now. Should we propose any
> > > changes to IQv1 to support this transactional mechanism, versus just
> > > proposing it for IQv2? Certainly, it seems strange only to propose a
> > change
> > > for IQv1 and not v2.
> >
> >
> >  I will update the proposal with complementary API changes for IQv2
> >
> > What should IQ do if I request to readCommitted on a non-transactional
> > > store?
> >
> > We can assume that non-transactional stores commit on write, so IQ works
> in
> > the same way with non-transactional stores regardless of the value of
> > readCommitted.
> >
> >
> >  @Guozhang,
> >
> > * If we 

Re: [DISCUSS] KIP-844: Transactional State Stores

2022-06-01 Thread Guozhang Wang
Hello Alex,

> we flush the cache, but not the underlying state store.

You're right. The ordering I mentioned above is actually:

...
3. producer.sendOffsetsToTransaction(); producer.commitTransaction();
4. flush store, make sure all writes are persisted.
5. Update the checkpoint file to 200.

But I'm still trying to clarify how it guarantees EOS, and it seems that we
would achieve it by enforcing to not persist any data written within this
transaction until step 4. Is that correct?

> Can you please point me to the place in the codebase where we trigger
async flush before the commit?

Oh what I meant is not what KStream code does, but that StateStore impl
classes themselves could potentially flush data to become persisted
asynchronously, e.g. RocksDB does that naturally out of the control of
KStream code. I think it is related to my previous question: if we think by
guaranteeing EOS at the state store level, we would effectively ask the
impl classes that "you should not persist any data until `flush` is called
explicitly", is the StateStore interface the right level to enforce such
mechanisms, or should we just do that on top of the StateStores, e.g.
during the transaction we just keep all the writes in the cache (of course
we need to consider how to work around memory pressure as previously
mentioned), and then upon committing, we just write the cached records as a
whole into the store and then call flush.


Guozhang







On Tue, May 31, 2022 at 4:08 PM Alexander Sorokoumov
 wrote:

> Hey,
>
> Thank you for the wealth of great suggestions and questions! I am going to
> address the feedback in batches and update the proposal async, as it is
> probably going to be easier for everyone. I will also write a separate
> message after making updates to the KIP.
>
> @John,
>
> > Did you consider instead just adding the option to the
> > RocksDB*StoreSupplier classes and the factories in Stores ?
>
> Thank you for suggesting that. I think that this idea is better than what I
> came up with and will update the KIP with configuring transactionality via
> the suppliers and Stores.
>
> what is the advantage over just doing the same thing with the RecordCache
> > and not introducing the WriteBatch at all?
>
> Can you point me to RecordCache? I can't find it in the project. The
> advantage would be that WriteBatch guarantees write atomicity. As far as I
> understood the way RecordCache works, it might leave the system in an
> inconsistent state during crash failure on write.
>
> You mentioned that a transactional store can help reduce duplication in the
> > case of ALOS
>
> I will remove claims about ALOS from the proposal. Thank you for
> elaborating!
>
> As a reminder, we have a new IQv2 mechanism now. Should we propose any
> > changes to IQv1 to support this transactional mechanism, versus just
> > proposing it for IQv2? Certainly, it seems strange only to propose a
> change
> > for IQv1 and not v2.
>
>
>  I will update the proposal with complementary API changes for IQv2
>
> What should IQ do if I request to readCommitted on a non-transactional
> > store?
>
> We can assume that non-transactional stores commit on write, so IQ works in
> the same way with non-transactional stores regardless of the value of
> readCommitted.
>
>
>  @Guozhang,
>
> * If we crash between line 3 and 4, then at that time the local persistent
> > store image is representing as of offset 200, but upon recovery all
> > changelog records from 100 to log-end-offset would be considered as
> aborted
> > and not be replayed and we would restart processing from position 100.
> > Restart processing will violate EOS.I'm not sure how e.g. RocksDB's
> > WriteBatchWithIndex would make sure that the step 4 and step 5 could be
> > done atomically here.
>
>
> Could you please point me to the place in the codebase where a task flushes
> the store before committing the transaction?
> Looking at TaskExecutor (
>
> https://github.com/apache/kafka/blob/4c9eeef5b2dff9a4f0977fbc5ac7eaaf930d0d0e/streams/src/main/java/org/apache/kafka/streams/processor/internals/TaskExecutor.java#L144-L167
> ),
> StreamTask#prepareCommit (
>
> https://github.com/apache/kafka/blob/4c9eeef5b2dff9a4f0977fbc5ac7eaaf930d0d0e/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamTask.java#L398
> ),
> and CachedStateStore (
>
> https://github.com/apache/kafka/blob/4c9eeef5b2dff9a4f0977fbc5ac7eaaf930d0d0e/streams/src/main/java/org/apache/kafka/streams/state/internals/CachedStateStore.java#L29-L34
> )
> we flush the cache, but not the underlying state store. Explicit
> StateStore#flush happens in AbstractTask#maybeWriteCheckpoint (
>
> https://github.com/apache/kafka/blob/4c9eeef5b2dff9a4f0977fbc5ac7eaaf930d0d0e/streams/src/main/java/org/apache/kafka/streams/processor/internals/AbstractTask.java#L91-L99
> ).
> Is there something I am missing here?
>
> Today all cached data that have not been flushed are not committed for
> > sure, but even flushed data to the persistent 

Re: [VOTE] KIP-834: Pause / Resume KafkaStreams Topologies

2022-06-01 Thread Guozhang Wang
Hello Jim,

I think If our primary goal would be to reduce resource utilization and
potentially to stop the streaming pipeline for investigating possible bugs
etc, then we should also pause active tasks' restoration as well since that
1) may still use resources, and 2) may load in bad data.

Guozhang







On Wed, Jun 1, 2022 at 5:53 AM Jim Hughes 
wrote:

> Hi all,
>
> While reviewing my PR for KIP-834, Bruno noticed a case that we may not
> have discussed enough.*
>
> During the discussion, we decided that standby tasks would be paused.  In
> order to do this, there are changes to the StoreChangelogReader around
> where it does restorations.  Bruno noticed that the restoration of active
> tasks is not paused in my PR.
>
> From my point of view, I was hoping to let active tasks restore/consume/etc
> in order that the Kafka Streams instance could transition to RUNNING
> (assuming that it was started paused).  I believe Bruno's position is that
> if we are pausing restoration for standby tasks, then restoration should be
> paused for active tasks as well.
>
> Since this point hasn't been discussed like this, the KIP is unclear about
> this detail.
>
> What do folks think?
>
> Thanks in advance,
>
> Jim
>
> * https://github.com/apache/kafka/pull/12161#discussion_r886732983
>
> On Mon, May 16, 2022 at 11:07 AM Jim Hughes  wrote:
>
> > Hi all,
> >
> >
> > With 5 binding votes (John, Bruno, Sophie, Matthias, Bill) and 4
> > non-binding votes (Guozhang, Luke, Leah, Walker), the vote for KIP-834
> > passes!
> >
> >
> > Thanks all for the great discussion.
> >
> > I have a PR up here: https://github.com/apache/kafka/pull/12161
> >
> >
> > Thanks in advance for feedback on the PR!
> >
> >
> > Cheers,
> >
> >
> > JIm
> >
> > On Fri, May 13, 2022 at 12:04 PM Walker Carlson
> >  wrote:
> >
> >> +1 from me (non-binding)
> >>
> >> Walker
> >>
> >> On Wed, May 11, 2022 at 12:36 PM Leah Thomas
>  >> >
> >> wrote:
> >>
> >> > Thanks Jim, great discussion. +1 from me (non-binding)
> >> >
> >> > Cheers,
> >> > Leah
> >> >
> >> > On Wed, May 11, 2022 at 10:14 AM Bill Bejeck 
> wrote:
> >> >
> >> > > Thanks for the KIP!
> >> > >
> >> > > +1 (binding)
> >> > >
> >> > > -Bill
> >> > >
> >> > > On Wed, May 11, 2022 at 9:36 AM Luke Chen 
> wrote:
> >> > >
> >> > > > Hi Jim,
> >> > > >
> >> > > > I'm +1. (please add some note in KIP about the stream resetting
> tool
> >> > > can't
> >> > > > be used in paused state)
> >> > > > Thanks for the KIP!
> >> > > >
> >> > > > Luke
> >> > > >
> >> > > > On Wed, May 11, 2022 at 9:09 AM Guozhang Wang  >
> >> > > wrote:
> >> > > >
> >> > > > > Thanks Jim. +1 from me.
> >> > > > >
> >> > > > > On Tue, May 10, 2022 at 4:51 PM Matthias J. Sax <
> mj...@apache.org
> >> >
> >> > > > wrote:
> >> > > > >
> >> > > > > > I had one minor question on the discuss thread. It's mainly
> >> about
> >> > > > > > clarifying and document the user contract. I am fine either
> way.
> >> > > > > >
> >> > > > > > +1 (binding)
> >> > > > > >
> >> > > > > >
> >> > > > > > -Matthias
> >> > > > > >
> >> > > > > > On 5/10/22 12:32 PM, Sophie Blee-Goldman wrote:
> >> > > > > > > Thanks for the KIP! +1 (binding)
> >> > > > > > >
> >> > > > > > > On Tue, May 10, 2022, 12:24 PM Bruno Cadonna <
> >> cado...@apache.org
> >> > >
> >> > > > > wrote:
> >> > > > > > >
> >> > > > > > >> Thanks Jim,
> >> > > > > > >>
> >> > > > > > >> +1 (binding)
> >> > > > > > >>
> >> > > > > > >> Best,
> >> > > > > > >> Bruno
> >> > > > > > >>
> >> > > > > > >> On 10.05.22 21:19, John Roesler wrote:
> >> > > > > > >>> Thanks Jim,
> >> > > > > > >>>
> >> > > > > > >>> I’m +1 (binding)
> >> > > > > > >>>
> >> > > > > > >>> -John
> >> > > > > > >>>
> >> > > > > > >>> On Tue, May 10, 2022, at 14:05, Jim Hughes wrote:
> >> > > > > >  Hi all,
> >> > > > > > 
> >> > > > > >  I'm asking for a vote on KIP-834:
> >> > > > > > 
> >> > > > > > >>
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832
> >> > > > > > 
> >> > > > > >  Thanks in advance!
> >> > > > > > 
> >> > > > > >  Jim
> >> > > > > > >>
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > > >
> >> > > > > --
> >> > > > > -- Guozhang
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>


-- 
-- Guozhang


Re: [VOTE] KIP-843: Adding metricOrElseCreate method to Metrics

2022-06-01 Thread Guozhang Wang
I think `addMetricIfAbsent` is a good function name, and like John said
people in the Java world are familiar with its return value semantics as
well.

Regarding deprecating the existing functions, I feel it is not necessary
just for the function semantics difference between `sensors` and `metrics`
(Ismael may chime in here if you have other good reasons).


Guozhang

On Wed, Jun 1, 2022 at 9:50 AM Sagar  wrote:

> So, we have multiple options in terms of names, at this point I actually
> liked John's suggestion to use addMetricIfAbsent or something along those
> lines.
>
> Regarding the deprecation of sensor/metric method, I am not sure... Would
> like to know others' thoughts.
>
> Thanks!
> Sagar.
>
> On Wed, Jun 1, 2022 at 2:28 AM Guozhang Wang  wrote:
>
> > Hey Ismael, just checking do you mean the `metric` method instead?
> >
> > On Tue, May 31, 2022 at 1:45 PM Ismael Juma  wrote:
> >
> > > Should we deprecate the `sensor` method? One other thing to take into
> > > account is that these methods are meant to be used like a dsl for
> > > configuring sensors and metrics. So brevity is a plus (but clarity is
> > > critical still).
> > >
> > > Ismael
> > >
> > > On Tue, May 31, 2022 at 11:09 AM John Roesler 
> > wrote:
> > >
> > > > Generally, I agree with Ismael that having a new, weird name will
> make
> > it
> > > > hard to keep them straight. Then again, we need to make them
> different
> > to
> > > > prevent confusion about their semantics. To be clear, I'll be a +1
> > > > regardless of how we break this dilemma.
> > > >
> > > > One suggestion: We currently have addMetric to add a new metric. We
> can
> > > > take some inspiration from the Java Map interface and call this new
> > > method
> > > > `addMetricIfAbsent`. Having the same prefix should help discovery,
> and
> > > > following the Map convention should help confusion.
> > > >
> > > > Thanks all,
> > > > -John
> > > >
> > > >
> > > >
> > > > On Tue, May 31, 2022, at 12:13, Sagar wrote:
> > > > > Oh yeah there's another metric function which is get-only. I think
> we
> > > > > should go ahead with getOrCreateMetric.
> > > > >
> > > > > Thanks!
> > > > > Sagar.
> > > > >
> > > > > On Tue, May 31, 2022 at 10:02 PM Guozhang Wang  >
> > > > wrote:
> > > > >
> > > > >> I'd prefer the getOrCreateMetric function name, since for the
> > > existing "
> > > > >> sensor(String name)" function that only takes a single `String`
> > > > parameter,
> > > > >> its semantics is already "get or create". Whereas the existing
> > > > >> "metric(MetricName)" function's semantics is "get" only. So in my
> > > mind,
> > > > the
> > > > >> inconsistent conventions in function signatures already exist
> today.
> > > And
> > > > >> with the other option we would need to educate users that "all the
> > > > `sensor`
> > > > >> functions are get-or-create, but, please remember that the
> `metric`
> > > > >> function with just the metric name is get-only, while other
> `metric`
> > > > >> overrides with more parameters are get-or-create", which I think
> is
> > > even
> > > > >> more confusing.
> > > > >>
> > > > >>
> > > > >> Guozhang
> > > > >>
> > > > >>
> > > > >> On Mon, May 30, 2022 at 9:51 PM Sagar 
> > > > wrote:
> > > > >>
> > > > >> > Hi Ismael,
> > > > >> >
> > > > >> > I guess in that case, we will have to go with the name *metric*-
> > > > similar
> > > > >> to
> > > > >> > *sensor* - which David pointed out above because I think that's
> > the
> > > > >> closest
> > > > >> > method which either gets or creates a new sensor. Current
> > addMetric
> > > in
> > > > >> the
> > > > >> > Metrics class throw an IllegalArguementException when the metric
> > > > already
> > > > >> > exists and that's why I still think getOrCreateMetric still
> > > signifies
> > > > the
> > > > >> > action correctly. Or how about addOrGetMetric or getOrAddMetric,
> > > just
> > > > >> > replacing create with add to keep it similar to the already
> > present
> > > > >> > addMetric method.
> > > > >> >
> > > > >> > Thanks!
> > > > >> > Sagar.
> > > > >> >
> > > > >> > On Tue, May 31, 2022 at 1:19 AM Ismael Juma 
> > > > wrote:
> > > > >> >
> > > > >> > > I think it's confusing to use two completely different naming
> > > > >> conventions
> > > > >> > > in the same class. We either stick with the existing
> convention
> > or
> > > > we
> > > > >> > > create a new one and deprecate old method(s). I am not sure
> > there
> > > is
> > > > >> > enough
> > > > >> > > value in this case for the latter, but it would be good to
> hear
> > > what
> > > > >> > others
> > > > >> > > think.
> > > > >> > >
> > > > >> > > Ismael
> > > > >> > >
> > > > >> > > On Mon, May 30, 2022, 2:08 AM Bruno Cadonna <
> cado...@apache.org
> > >
> > > > >> wrote:
> > > > >> > >
> > > > >> > > > Hi,
> > > > >> > > >
> > > > >> > > > I would also lean towards getOrCreateMetric() for the
> reasons
> > > > pointed
> > > > >> > > > out by Sagar. But I am fine either way.
> > > > >> > > >
> > > > >> > > > Best,
> > > > >> 

Re: [DISCUSS] KIP-846: Task-level Streams metrics for bytes/records Produced

2022-06-01 Thread Guozhang Wang
Thanks Sophie, that makes sense. Also I agree that since we are adding it
at finest granularity for consumed metrics, it's better to have symmetry
and add produced metrics at processor-node level as well.

Regarding the benchmarks, I think that would be critical to have before we
add in the next release, since we propose to make this INFO level, meaning
it would be on by default, and collecting per-record is a critical path.


Guozhang

On Wed, Jun 1, 2022 at 8:04 AM Sophie Blee-Goldman
 wrote:

> I just want to send out a small update -- I decided to include the "-
> *consumed*" metrics in the KIP alongside the
> *"-produced"* metrics after all, for reasons I address in the paragraph I
> added at the end of the motivation section.
> Please let me know if you have any questions or concerns
>
> Cheers,
> Sophie
>
> On Wed, Jun 1, 2022 at 1:38 AM Sophie Blee-Goldman 
> wrote:
>
> > Just a quick question: for filling the gap of sub-topology visibilities,
> >> would task-level produced metrics be sufficient?
> >
> >
> > If I understand your question correctly, you're asking whether we could
> > just report at the task/subtopology
> > level since we mainly want the bytes/throughput produced by the
> > subtopology itself?
> >
> > Note that since we're only reporting this metric at the sink nodes, it's
> > basically the same as a task-level metric
> > for any subtopology that has only one sink and only scales with the
> number
> > of output topics. If you're concerned
> > about a potential performance impact I would say that (a) making these
> > task-level doesn't buy us much, if anything,
> > and (b) we can always revisit this if our benchmarks do indeed reveal a
> > regression but for now let's not over-
> > optimize too much :)
> >
> > Also, the general philosophy  behind the KAfka Streams metrics thus far
> > has been to report at the finest granularity
> > and allow users to roll them up into whatever scope they want to
> aggregate
> > over. This KIP adopts the same approach.
> >
> > On Tue, May 31, 2022 at 12:02 PM Guozhang Wang 
> wrote:
> >
> >> Hi Sophie,
> >>
> >> Just a quick question: for filling the gap of sub-topology visibilities,
> >> would task-level produced metrics be sufficient?
> >>
> >> On Mon, May 30, 2022 at 10:59 AM Bill Bejeck  wrote:
> >>
> >> > Thanks for the KIP Sophie.
> >> >
> >> > I'm in favor of this change as well. I don't have any comments in
> >> > addition to the ones already expressed.
> >> >
> >> > -Bill
> >> >
> >> > On Mon, May 30, 2022 at 4:55 AM Sagar 
> >> wrote:
> >> >
> >> > > Hi Sophie,
> >> > >
> >> > > A very minor comment but you might want to remove this KIP template
> >> > related
> >> > > information from the top of the KIP:
> >> > >
> >> > > *This page is meant as a template for writing a KIP
> >> > > <
> >> > >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
> >> > > >.
> >> > > To create a KIP choose Tools->Copy on this page and modify with your
> >> > > content and replace the heading with the next KIP number and a
> >> > description
> >> > > of your issue. Replace anything in italics with your own
> description.*
> >> > >
> >> > >
> >> > > Thanks!
> >> > > Sagar.
> >> > >
> >> > > On Mon, May 30, 2022 at 1:04 PM Sophie Blee-Goldman
> >> > >  wrote:
> >> > >
> >> > > > >
> >> > > > > Why does the title of the KIP talk about task-level metrics, but
> >> the
> >> > > > > specified metrics are on processor-level?
> >> > > >
> >> > > >
> >> > > > Ah, my mistake -- it should indeed say "processor-level metrics".
> >> > Thanks
> >> > > > for the catch Bruno, the title has been fixed.
> >> > > >
> >> > > > Since there don't seem to be any concerns I'll proceed with
> kicking
> >> off
> >> > > the
> >> > > > vote. Thanks all!
> >> > > >
> >> > > > On Mon, May 30, 2022 at 12:01 AM Bruno Cadonna <
> cado...@apache.org>
> >> > > wrote:
> >> > > >
> >> > > > > Thanks for the KIP, Sophie!
> >> > > > >
> >> > > > > I am also in favor of this KIP!
> >> > > > >
> >> > > > > I have one minor question:
> >> > > > >
> >> > > > > Why does the title of the KIP talk about task-level metrics, but
> >> the
> >> > > > > specified metrics are on processor-level?
> >> > > > >
> >> > > > > For the rest, I am +1.
> >> > > > >
> >> > > > > Best,
> >> > > > > Bruno
> >> > > > >
> >> > > > > On 29.05.22 00:20, John Roesler wrote:
> >> > > > > > Thanks for the well motivated and documented KIP, Sophie! I’m
> in
> >> > > favor
> >> > > > > of this change.
> >> > > > > >
> >> > > > > > -John
> >> > > > > >
> >> > > > > > On Sat, May 28, 2022, at 06:42, Sophie Blee-Goldman wrote:
> >> > > > > >> Hey all,
> >> > > > > >>
> >> > > > > >> I'd like to propose a very small KIP to add two metrics that
> >> will
> >> > > help
> >> > > > > fill
> >> > > > > >> a gap in the derivable produced and consumed metrics. Please
> >> take
> >> > a
> >> > > > look
> >> > > > > >> and reply here with any questions or concerns.
> >> > > > > >>
> >> > > > > >> KIP-846: 

Re: [DISCUSS] KIP-841: Fenced replicas should not be allowed to join the ISR in KRaft

2022-06-01 Thread José Armando García Sancio
Thanks for the updates to the KIP.

I like enumerating invariants. Is it safe to say that if
`InControlledShutdown` is true then `Fenced` must be false.


Re: [DISCUSS] KIP-827: Expose logdirs total and usable space via Kafka API

2022-06-01 Thread Cong Ding
Thanks for the explanation. I think the question is that if we have disk
utilization in our environment, what is the use case for KIP-827? The disk
utilization in our environment can already do the job. Is there anything I
missed?

Thanks,
Cong

On Tue, May 31, 2022 at 2:57 AM Mickael Maison 
wrote:

> Hi Cong,
>
> Kafka does not expose disk utilization metrics. This is something you
> need to provide in your environment. You definitively should have a
> mechanism for exposing metrics from your Kafka broker hosts and you
> should absolutely monitor disk usage and have appropriate alerts.
>
> Thanks,
> Mickael
>
> On Thu, May 26, 2022 at 7:34 PM Jun Rao  wrote:
> >
> > Hi, Igor,
> >
> > Thanks for the reply.
> >
> > I agree that this KIP could be useful for improving the tool for moving
> > data across disks. It would be useful to clarify on the main motivation
> of
> > the KIP. Also, DescribeLogDirsResponse already includes the size of each
> > partition on a disk. So, it seems that UsableBytes is redundant since
> it's
> > derivable.
> >
> > Thanks,
> >
> > Jun
> >
> > On Thu, May 26, 2022 at 3:30 AM Igor Soarez  wrote:
> >
> > > Hi,
> > >
> > > This can also be quite useful to make better use of existing
> functionality
> > > in the Kafka API — moving replicas between log directories via
> > > ALTER_REPLICA_LOG_DIRS. If usable space information is also available
> the
> > > caller can make better decisions using the same API. It means a more
> > > consistent way of interacting with Kafka to manage replicas locations
> > > within a broker without having to correlate Kafka metrics with
> information
> > > from the Kafka API.
> > >
> > > --
> > > Igor
> > >
> > > On Wed, May 25, 2022, at 8:16 PM, Jun Rao wrote:
> > > > Hi, Mickael,
> > > >
> > > > Thanks for the KIP.  Since this is mostly for monitoring and
> alerting,
> > > > could we expose them as metrics instead of as part of the API? We
> already
> > > > have a size metric per log. Perhaps we could extend that to add
> > > used/total
> > > > metrics per disk?
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Thu, May 19, 2022 at 10:21 PM Raman Verma
>  > > >
> > > > wrote:
> > > >
> > > >> Hello Mikael,
> > > >>
> > > >> Thanks for the KIP.
> > > >>
> > > >> I see that the API response contains some information about each
> > > partition.
> > > >> ```
> > > >> { "name": "PartitionSize", "type": "int64", "versions": "0+",
> > > >>   "about": "The size of the log segments in this partition in
> bytes." }
> > > >> ```
> > > >> Can this be summed up to provide a used space in a `log.dir`
> > > >> This will also be specific to a `log.dir` (for the case where
> multiple
> > > >> log.dir are hosted on the same underlying device)
> > > >>
> > > >> On Thu, May 19, 2022 at 10:21 AM Cong Ding
> 
> > > >> wrote:
> > > >> >
> > > >> > Hey Mickael,
> > > >> >
> > > >> > Great KIP!
> > > >> >
> > > >> > I have one question:
> > > >> >
> > > >> > You mentioned "DescribeLogDirs is usually a low volume API. This
> > > change
> > > >> > should not
> > > >> > significantly affect the latency of this API." and "That would
> allow
> > > to
> > > >> > easily validate whether disk operations (like a resize), or topic
> > > >> deletion
> > > >> > (log deletion only happen after a short delay) have completed." I
> > > wonder
> > > >> if
> > > >> > there is an existing metric/API that can allow administrators to
> > > >> determine
> > > >> > whether we need to resize? If administrators use this API to
> determine
> > > >> > whether we need a resize, would this API become a high-volume
> API? I
> > > >> > understand we don't want this API to be a high-volume one because
> the
> > > API
> > > >> > is already costly by returning `"name": "Topics"`.
> > > >> >
> > > >> > Cong
> > > >> >
> > > >> > On Thu, Apr 7, 2022 at 2:17 AM Mickael Maison <
> > > mickael.mai...@gmail.com>
> > > >> > wrote:
> > > >> >
> > > >> > > Hi,
> > > >> > >
> > > >> > > I wrote a small KIP to expose the total and usable space of
> logdirs
> > > >> > > via the DescribeLogDirs API:
> > > >> > >
> > > >> > >
> > > >>
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-827%3A+Expose+logdirs+total+and+usable+space+via+Kafka+API
> > > >> > >
> > > >> > > Please take a look and let me know if you have any feedback.
> > > >> > >
> > > >> > > Thanks,
> > > >> > > Mickael
> > > >> > >
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Best Regards,
> > > >> Raman Verma
> > > >>
> > >
>


Re: [DISCUSS] KIP-714: Client metrics and observability

2022-06-01 Thread Jun Rao
Hi, Magnus,

51. Just to clarify my question.  (1) Are standard metrics required for
every client for this KIP to function?  (2) Are we converting existing java
metrics to the standard metrics and deprecating the old ones? If so, could
we list all existing java metrics that need to be renamed and the
corresponding new name?

Thanks,

Jun

On Tue, May 31, 2022 at 3:29 PM Jun Rao  wrote:

> Hi, Magnus,
>
> Thanks for the reply.
>
> 51. I think it's fine to have a list of recommended metrics for every
> client to implement. I am just not sure that standardizing on the metric
> names across all clients is practical. The list of common metrics in the
> KIP have completely different names from the java metric names. Some of
> them have different types. For example, some of the common metrics have a
> type of histogram, but the java client metrics don't use histogram in
> general. Requiring the operator to translate those names and understand the
> subtle differences across clients seem to cause more confusion during
> troubleshooting.
>
> Thanks,
>
> Jun
>
> On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill 
> wrote:
>
>> Den fre 20 maj 2022 kl 01:23 skrev Jun Rao :
>>
>> > Hi, Magus,
>> >
>> > Thanks for the reply.
>> >
>> > 50. Sounds good.
>> >
>> > 51. I miss-understood the proposal in the KIP then. The proposal is to
>> > define a set of common metric names that every client should implement.
>> The
>> > problem is that every client already has its own set of metrics with its
>> > own names. I am not sure that we could easily agree upon a common set of
>> > metrics that work with all clients. There are likely to be some metrics
>> > that are client specific. Translating between the common name and client
>> > specific name is probably going to add more confusion. As mentioned in
>> the
>> > KIP, similar metrics from different clients could have subtle
>> > semantic differences. Could we just let each client use its own set of
>> > metric names?
>> >
>>
>> We identified a common set of metrics that should be relevant for most
>> client implementations,
>> they're the ones listed in the KIP.
>> A supporting client does not have to implement all those metrics, only the
>> ones that makes sense
>> based on that client implementation, and a client may implement other
>> metrics that are not listed
>> in the KIP under its own namespace.
>> This approach has two benefits:
>>  - there will be a common set of metrics that most/all clients implement,
>> which makes monitoring
>>   and troubleshooting easier across fleets with multiple Kafka client
>> languages/implementations.
>>  - client-specific metrics are still possible, so if there is no suitable
>> standard metric a client can still
>>provide what special metrics it has.
>>
>>
>> Thanks,
>> Magnus
>>
>>
>> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill 
>> wrote:
>> >
>> > > Den ons 18 maj 2022 kl 19:57 skrev Jun Rao > >:
>> > >
>> > > > Hi, Magnus,
>> > > >
>> > >
>> > > Hi Jun
>> > >
>> > >
>> > > >
>> > > > Thanks for the updated KIP. Just a couple of more comments.
>> > > >
>> > > > 50. To troubleshoot a particular client issue, I imagine that the
>> > client
>> > > > needs to identify its client_instance_id. How does the client find
>> this
>> > > > out? Do we plan to include client_instance_id in the client log,
>> expose
>> > > it
>> > > > as a metric or something else?
>> > > >
>> > >
>> > > The KIP suggests that client implementations emit an informative log
>> > > message
>> > > with the assigned client-instance-id once it is retrieved (once per
>> > client
>> > > instance lifetime).
>> > > There's also a clientInstanceId() method that an application can use
>> to
>> > > retrieve
>> > > the client instance id and emit through whatever side channels makes
>> > sense.
>> > >
>> > >
>> > >
>> > > > 51. The KIP lists a bunch of metrics that need to be collected at
>> the
>> > > > client side. However, it seems quite a few useful java client
>> metrics
>> > > like
>> > > > the following are missing.
>> > > > buffer-total-bytes
>> > > > buffer-available-bytes
>> > > >
>> > >
>> > > These are covered by client.producer.record.queue.bytes and
>> > > client.producer.record.queue.max.bytes.
>> > >
>> > >
>> > > > bufferpool-wait-time
>> > > >
>> > >
>> > > Missing, but somewhat implementation specific.
>> > > If it was up to me we would add this later if there's a need.
>> > >
>> > >
>> > >
>> > > > batch-size-avg
>> > > > batch-size-max
>> > > >
>> > >
>> > > These are missing and would be suitably represented as a histogram.
>> I'll
>> > > add them.
>> > >
>> > >
>> > >
>> > > > io-wait-ratio
>> > > > io-ratio
>> > > >
>> > >
>> > > There's client.io.wait.time which should cover io-wait-ratio.
>> > > We could add a client.io.time as well, now or in a later KIP.
>> > >
>> > > Thanks,
>> > > Magnus
>> > >
>> > >
>> > >
>> > >
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Jun
>> > > >
>> > > > On Mon, Apr 4, 2022 at 10:01 AM Jun 

Re: [VOTE] KIP-841: Fenced replicas should not be allowed to join the ISR in KRaft

2022-06-01 Thread Colin McCabe
Thanks, David. With these changes, I am +1 (binding)

Great to see this KIP moving forward!

Colin


On Wed, Jun 1, 2022, at 02:04, David Jacot wrote:
> Hi Colin,
>
> Thanks for your feedback! Please find my answers below.
>
>> However, I wonder if this will be feasible for ZK-based brokers to do. We 
>> still support pre-topic-id IBP versions there, right? This might end up 
>> being more complex than you were hoping.
>
> That's right. We support pre-topic-id IBP versions for ZK-based
> brokers. We will only use version 2 if all the topics have an ID when
> the request is constructed. Version 1 is used otherwise. I have
> already implemented this part in the draft PR if you want to see how
> it looks. It brings a little more complexity in the request
> building/handling but that works. I think that it is definitely worth
> it.
>
>> We should add a comment in AlterPartitionResponse about the new error code 
>> INELIGIBLE_REPLICA. This is very important for error codes so we can track 
>> which ones are returned in which RPC version.
>
> Totally. I have those comments in the PR but I forgot to add them in
> the KIP. I will fix this.
>
>> I also wonder if we should add a special error code for the case where the 
>> AlterPartitions call completes a reassignment AND the completed reassignment 
>> no longer includes the previous leader. We have been overloading 
>> FENCED_LEADER_EPOCH for this case, but this is coonfusing to operators 
>> (especially since this RPC does not support error messages, as opposed to 
>> codes)
>
> Interesting. I was not aware of this one. We don't do this in the ZK
> controller, right? I do agree that using FENCED_LEADER_EPOCH is
> confusing here. We could introduce a NEW_LEADER_ELECTED error code for
> this purpose.
>
>> One thing, though, is that we should define how this interacts with replica 
>> placement. It seems to me that replicas should not be able to be placed on 
>> these inControlledShutdown nodes (unless done manually via the explicit 
>> placement API).
>
> Yeah, I agree that we need to clarify this. If not mistaken, replicas
> can be placed on fenced nodes so I don't see why it should be
> different for inControlledShutdown nodes. Otherwise we will again have
> that create topic issue when the cluster has only 3 nodes. However I
> think that we have to guarantee two other invariants:
> 1) an inControlledShutdown node should not be added to the ISR when a
> partition is created.
> 2) an inControlledShutdown node should not be picked as a leader for a
> partition. (e.g. KAFKA-13944)
>
> Let me add this part to the KIP.
>
>> I also think we should spell out the fact that once you go into controlled 
>> shutdown, you don't come out except by creating a new broker instance. (new 
>> incarnation ID). This also makes me wonder if we need to support the 
>> shutting down -> not shutting down transition in 
>> BrokerRegistrationChangeRecord, since we don't plan on using that transition.
>
> That makes sense. I will add this and remove that transition from
> BrokerRegistrationChangeRecord.
>
>> Finally, RegisterBrokerRecord / BrokerRegistrationChangeRecord should be 
>> bumped to the next RPC version since we have added a new field. You will 
>> also need to assign yourself a new IBP / MetadataVersion. (For 
>> BrokerRegistrationChangeRecord it would be possible to avoid the version 
>> bump, since we're using tagged fields, but it's better to have it for 
>> consistency, I think.)
>
> Noted.
>
> Let me update the KIP to incorporate all your feedback.
>
> Cheers,
> David
>
> On Tue, May 31, 2022 at 9:42 PM Colin McCabe  wrote:
>>
>> > We should add a comment in AlterPartitionResponse about the new error
>> > code INELIGIBLE_REPLICA. This is very important for error codes so we
>> > can track which ones are returned in which RPC version. I also wonder
>>
>> Here I'm referring to AlterPartitionResponse.json
>>
>> cheers,
>> Colin
>>
>> >
>> >
>> > On Tue, May 31, 2022, at 08:36, David Jacot wrote:
>> >> Hi folks,
>> >>
>> >> I'd like to start a vote for KIP-841:
>> >> https://cwiki.apache.org/confluence/x/phmhD.
>> >>
>> >> Thanks,
>> >> David


Re: [VOTE] KIP-843: Adding metricOrElseCreate method to Metrics

2022-06-01 Thread Sagar
So, we have multiple options in terms of names, at this point I actually
liked John's suggestion to use addMetricIfAbsent or something along those
lines.

Regarding the deprecation of sensor/metric method, I am not sure... Would
like to know others' thoughts.

Thanks!
Sagar.

On Wed, Jun 1, 2022 at 2:28 AM Guozhang Wang  wrote:

> Hey Ismael, just checking do you mean the `metric` method instead?
>
> On Tue, May 31, 2022 at 1:45 PM Ismael Juma  wrote:
>
> > Should we deprecate the `sensor` method? One other thing to take into
> > account is that these methods are meant to be used like a dsl for
> > configuring sensors and metrics. So brevity is a plus (but clarity is
> > critical still).
> >
> > Ismael
> >
> > On Tue, May 31, 2022 at 11:09 AM John Roesler 
> wrote:
> >
> > > Generally, I agree with Ismael that having a new, weird name will make
> it
> > > hard to keep them straight. Then again, we need to make them different
> to
> > > prevent confusion about their semantics. To be clear, I'll be a +1
> > > regardless of how we break this dilemma.
> > >
> > > One suggestion: We currently have addMetric to add a new metric. We can
> > > take some inspiration from the Java Map interface and call this new
> > method
> > > `addMetricIfAbsent`. Having the same prefix should help discovery, and
> > > following the Map convention should help confusion.
> > >
> > > Thanks all,
> > > -John
> > >
> > >
> > >
> > > On Tue, May 31, 2022, at 12:13, Sagar wrote:
> > > > Oh yeah there's another metric function which is get-only. I think we
> > > > should go ahead with getOrCreateMetric.
> > > >
> > > > Thanks!
> > > > Sagar.
> > > >
> > > > On Tue, May 31, 2022 at 10:02 PM Guozhang Wang 
> > > wrote:
> > > >
> > > >> I'd prefer the getOrCreateMetric function name, since for the
> > existing "
> > > >> sensor(String name)" function that only takes a single `String`
> > > parameter,
> > > >> its semantics is already "get or create". Whereas the existing
> > > >> "metric(MetricName)" function's semantics is "get" only. So in my
> > mind,
> > > the
> > > >> inconsistent conventions in function signatures already exist today.
> > And
> > > >> with the other option we would need to educate users that "all the
> > > `sensor`
> > > >> functions are get-or-create, but, please remember that the `metric`
> > > >> function with just the metric name is get-only, while other `metric`
> > > >> overrides with more parameters are get-or-create", which I think is
> > even
> > > >> more confusing.
> > > >>
> > > >>
> > > >> Guozhang
> > > >>
> > > >>
> > > >> On Mon, May 30, 2022 at 9:51 PM Sagar 
> > > wrote:
> > > >>
> > > >> > Hi Ismael,
> > > >> >
> > > >> > I guess in that case, we will have to go with the name *metric*-
> > > similar
> > > >> to
> > > >> > *sensor* - which David pointed out above because I think that's
> the
> > > >> closest
> > > >> > method which either gets or creates a new sensor. Current
> addMetric
> > in
> > > >> the
> > > >> > Metrics class throw an IllegalArguementException when the metric
> > > already
> > > >> > exists and that's why I still think getOrCreateMetric still
> > signifies
> > > the
> > > >> > action correctly. Or how about addOrGetMetric or getOrAddMetric,
> > just
> > > >> > replacing create with add to keep it similar to the already
> present
> > > >> > addMetric method.
> > > >> >
> > > >> > Thanks!
> > > >> > Sagar.
> > > >> >
> > > >> > On Tue, May 31, 2022 at 1:19 AM Ismael Juma 
> > > wrote:
> > > >> >
> > > >> > > I think it's confusing to use two completely different naming
> > > >> conventions
> > > >> > > in the same class. We either stick with the existing convention
> or
> > > we
> > > >> > > create a new one and deprecate old method(s). I am not sure
> there
> > is
> > > >> > enough
> > > >> > > value in this case for the latter, but it would be good to hear
> > what
> > > >> > others
> > > >> > > think.
> > > >> > >
> > > >> > > Ismael
> > > >> > >
> > > >> > > On Mon, May 30, 2022, 2:08 AM Bruno Cadonna  >
> > > >> wrote:
> > > >> > >
> > > >> > > > Hi,
> > > >> > > >
> > > >> > > > I would also lean towards getOrCreateMetric() for the reasons
> > > pointed
> > > >> > > > out by Sagar. But I am fine either way.
> > > >> > > >
> > > >> > > > Best,
> > > >> > > > Bruno
> > > >> > > >
> > > >> > > > On 30.05.22 10:54, Sagar wrote:
> > > >> > > > > Hi Bruno/David,
> > > >> > > > >
> > > >> > > > > Thanks for the suggestions. I would personally lean towards
> > > using
> > > >> > > > > getOrCreateMetric as it clearly explains the intent. Having
> > said
> > > >> > that,
> > > >> > > if
> > > >> > > > > we want to use just metric(similar to sensor), that should
> > also
> > > be
> > > >> > ok.
> > > >> > > > Just
> > > >> > > > > that I feel getOrCreateMetric is easily understandable.
> > > >> > > > >
> > > >> > > > > Thanks!
> > > >> > > > > Sagar.
> > > >> > > > >
> > > >> > > > > On Mon, May 30, 2022 at 2:16 PM David Jacot
> > > >> > >  > > >> > > > >
> > > >> > > > > wrote:
> > 

Re: [DISCUSS] KIP-846: Task-level Streams metrics for bytes/records Produced

2022-06-01 Thread Sophie Blee-Goldman
I just want to send out a small update -- I decided to include the "-
*consumed*" metrics in the KIP alongside the
*"-produced"* metrics after all, for reasons I address in the paragraph I
added at the end of the motivation section.
Please let me know if you have any questions or concerns

Cheers,
Sophie

On Wed, Jun 1, 2022 at 1:38 AM Sophie Blee-Goldman 
wrote:

> Just a quick question: for filling the gap of sub-topology visibilities,
>> would task-level produced metrics be sufficient?
>
>
> If I understand your question correctly, you're asking whether we could
> just report at the task/subtopology
> level since we mainly want the bytes/throughput produced by the
> subtopology itself?
>
> Note that since we're only reporting this metric at the sink nodes, it's
> basically the same as a task-level metric
> for any subtopology that has only one sink and only scales with the number
> of output topics. If you're concerned
> about a potential performance impact I would say that (a) making these
> task-level doesn't buy us much, if anything,
> and (b) we can always revisit this if our benchmarks do indeed reveal a
> regression but for now let's not over-
> optimize too much :)
>
> Also, the general philosophy  behind the KAfka Streams metrics thus far
> has been to report at the finest granularity
> and allow users to roll them up into whatever scope they want to aggregate
> over. This KIP adopts the same approach.
>
> On Tue, May 31, 2022 at 12:02 PM Guozhang Wang  wrote:
>
>> Hi Sophie,
>>
>> Just a quick question: for filling the gap of sub-topology visibilities,
>> would task-level produced metrics be sufficient?
>>
>> On Mon, May 30, 2022 at 10:59 AM Bill Bejeck  wrote:
>>
>> > Thanks for the KIP Sophie.
>> >
>> > I'm in favor of this change as well. I don't have any comments in
>> > addition to the ones already expressed.
>> >
>> > -Bill
>> >
>> > On Mon, May 30, 2022 at 4:55 AM Sagar 
>> wrote:
>> >
>> > > Hi Sophie,
>> > >
>> > > A very minor comment but you might want to remove this KIP template
>> > related
>> > > information from the top of the KIP:
>> > >
>> > > *This page is meant as a template for writing a KIP
>> > > <
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
>> > > >.
>> > > To create a KIP choose Tools->Copy on this page and modify with your
>> > > content and replace the heading with the next KIP number and a
>> > description
>> > > of your issue. Replace anything in italics with your own description.*
>> > >
>> > >
>> > > Thanks!
>> > > Sagar.
>> > >
>> > > On Mon, May 30, 2022 at 1:04 PM Sophie Blee-Goldman
>> > >  wrote:
>> > >
>> > > > >
>> > > > > Why does the title of the KIP talk about task-level metrics, but
>> the
>> > > > > specified metrics are on processor-level?
>> > > >
>> > > >
>> > > > Ah, my mistake -- it should indeed say "processor-level metrics".
>> > Thanks
>> > > > for the catch Bruno, the title has been fixed.
>> > > >
>> > > > Since there don't seem to be any concerns I'll proceed with kicking
>> off
>> > > the
>> > > > vote. Thanks all!
>> > > >
>> > > > On Mon, May 30, 2022 at 12:01 AM Bruno Cadonna 
>> > > wrote:
>> > > >
>> > > > > Thanks for the KIP, Sophie!
>> > > > >
>> > > > > I am also in favor of this KIP!
>> > > > >
>> > > > > I have one minor question:
>> > > > >
>> > > > > Why does the title of the KIP talk about task-level metrics, but
>> the
>> > > > > specified metrics are on processor-level?
>> > > > >
>> > > > > For the rest, I am +1.
>> > > > >
>> > > > > Best,
>> > > > > Bruno
>> > > > >
>> > > > > On 29.05.22 00:20, John Roesler wrote:
>> > > > > > Thanks for the well motivated and documented KIP, Sophie! I’m in
>> > > favor
>> > > > > of this change.
>> > > > > >
>> > > > > > -John
>> > > > > >
>> > > > > > On Sat, May 28, 2022, at 06:42, Sophie Blee-Goldman wrote:
>> > > > > >> Hey all,
>> > > > > >>
>> > > > > >> I'd like to propose a very small KIP to add two metrics that
>> will
>> > > help
>> > > > > fill
>> > > > > >> a gap in the derivable produced and consumed metrics. Please
>> take
>> > a
>> > > > look
>> > > > > >> and reply here with any questions or concerns.
>> > > > > >>
>> > > > > >> KIP-846: Task-level Streams metrics for bytes/records Produced
>> > > > > >> <
>> > > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211886093
>> > > > > >
>> > > > > >>
>> > > > > >> Given the small nature of this I'm going to call for a vote
>> soon,
>> > > but
>> > > > > >> please don't hesitate to raise anything you feel should be
>> > discussed
>> > > > in
>> > > > > >> more detail first.
>> > > > > >>
>> > > > > >> Thanks!
>> > > > > >> Sophie
>> > > > >
>> > > >
>> > >
>> >
>>
>>
>> --
>> -- Guozhang
>>
>


[GitHub] [kafka-site] ZachLC opened a new pull request, #413: Add Covage to powered-by

2022-06-01 Thread GitBox


ZachLC opened a new pull request, #413:
URL: https://github.com/apache/kafka-site/pull/413

   We really are fans at Covage. congrats of what you've done with Kafka.
   
![image](https://user-images.githubusercontent.com/106680797/171433210-905085eb-1c02-4783-a35a-ffb8ef22b3bb.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [VOTE] KIP-834: Pause / Resume KafkaStreams Topologies

2022-06-01 Thread Jim Hughes
Hi all,

While reviewing my PR for KIP-834, Bruno noticed a case that we may not
have discussed enough.*

During the discussion, we decided that standby tasks would be paused.  In
order to do this, there are changes to the StoreChangelogReader around
where it does restorations.  Bruno noticed that the restoration of active
tasks is not paused in my PR.

>From my point of view, I was hoping to let active tasks restore/consume/etc
in order that the Kafka Streams instance could transition to RUNNING
(assuming that it was started paused).  I believe Bruno's position is that
if we are pausing restoration for standby tasks, then restoration should be
paused for active tasks as well.

Since this point hasn't been discussed like this, the KIP is unclear about
this detail.

What do folks think?

Thanks in advance,

Jim

* https://github.com/apache/kafka/pull/12161#discussion_r886732983

On Mon, May 16, 2022 at 11:07 AM Jim Hughes  wrote:

> Hi all,
>
>
> With 5 binding votes (John, Bruno, Sophie, Matthias, Bill) and 4
> non-binding votes (Guozhang, Luke, Leah, Walker), the vote for KIP-834
> passes!
>
>
> Thanks all for the great discussion.
>
> I have a PR up here: https://github.com/apache/kafka/pull/12161
>
>
> Thanks in advance for feedback on the PR!
>
>
> Cheers,
>
>
> JIm
>
> On Fri, May 13, 2022 at 12:04 PM Walker Carlson
>  wrote:
>
>> +1 from me (non-binding)
>>
>> Walker
>>
>> On Wed, May 11, 2022 at 12:36 PM Leah Thomas > >
>> wrote:
>>
>> > Thanks Jim, great discussion. +1 from me (non-binding)
>> >
>> > Cheers,
>> > Leah
>> >
>> > On Wed, May 11, 2022 at 10:14 AM Bill Bejeck  wrote:
>> >
>> > > Thanks for the KIP!
>> > >
>> > > +1 (binding)
>> > >
>> > > -Bill
>> > >
>> > > On Wed, May 11, 2022 at 9:36 AM Luke Chen  wrote:
>> > >
>> > > > Hi Jim,
>> > > >
>> > > > I'm +1. (please add some note in KIP about the stream resetting tool
>> > > can't
>> > > > be used in paused state)
>> > > > Thanks for the KIP!
>> > > >
>> > > > Luke
>> > > >
>> > > > On Wed, May 11, 2022 at 9:09 AM Guozhang Wang 
>> > > wrote:
>> > > >
>> > > > > Thanks Jim. +1 from me.
>> > > > >
>> > > > > On Tue, May 10, 2022 at 4:51 PM Matthias J. Sax > >
>> > > > wrote:
>> > > > >
>> > > > > > I had one minor question on the discuss thread. It's mainly
>> about
>> > > > > > clarifying and document the user contract. I am fine either way.
>> > > > > >
>> > > > > > +1 (binding)
>> > > > > >
>> > > > > >
>> > > > > > -Matthias
>> > > > > >
>> > > > > > On 5/10/22 12:32 PM, Sophie Blee-Goldman wrote:
>> > > > > > > Thanks for the KIP! +1 (binding)
>> > > > > > >
>> > > > > > > On Tue, May 10, 2022, 12:24 PM Bruno Cadonna <
>> cado...@apache.org
>> > >
>> > > > > wrote:
>> > > > > > >
>> > > > > > >> Thanks Jim,
>> > > > > > >>
>> > > > > > >> +1 (binding)
>> > > > > > >>
>> > > > > > >> Best,
>> > > > > > >> Bruno
>> > > > > > >>
>> > > > > > >> On 10.05.22 21:19, John Roesler wrote:
>> > > > > > >>> Thanks Jim,
>> > > > > > >>>
>> > > > > > >>> I’m +1 (binding)
>> > > > > > >>>
>> > > > > > >>> -John
>> > > > > > >>>
>> > > > > > >>> On Tue, May 10, 2022, at 14:05, Jim Hughes wrote:
>> > > > > >  Hi all,
>> > > > > > 
>> > > > > >  I'm asking for a vote on KIP-834:
>> > > > > > 
>> > > > > > >>
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832
>> > > > > > 
>> > > > > >  Thanks in advance!
>> > > > > > 
>> > > > > >  Jim
>> > > > > > >>
>> > > > > > >
>> > > > > >
>> > > > >
>> > > > >
>> > > > > --
>> > > > > -- Guozhang
>> > > > >
>> > > >
>> > >
>> >
>>
>


Jenkins build is still unstable: Kafka » Kafka Branch Builder » trunk #965

2022-06-01 Thread Apache Jenkins Server
See 




Re: [DISCUSS] KIP-841: Fenced replicas should not be allowed to join the ISR in KRaft

2022-06-01 Thread David Jacot
Hi Ismael,

That's the plan. I just noticed that I forgot to change the version of
the field in the AlterPartitionResponse. Let me fix that.

Thanks,
David

On Wed, Jun 1, 2022 at 10:40 AM Ismael Juma  wrote:
>
> Hi David,
>
> If you are adding topic IDs, should we remove topic names?
>
> Ismael
>
> On Tue, May 31, 2022, 6:36 AM David Jacot 
> wrote:
>
> > Hi all,
> >
> > Thanks for your feedback.
> >
> > I just updated the KIP as follows:
> > * I propose to add a `TopicId` field at the same time while we are bumping
> > the AlterPartition API version.
> > * I propose to add the `InShuttingDown` field to the registration records
> > to
> > track if a broker is in controlled shutdown. This is useful for debugging
> > and
> > it would also prevent unnecessary AlterPartition requests from getting sent
> > to the controller by the leaders. This will also help for KAFKA-13944.
> >
> > Artem - That's a good question. As explained by Colin we don't have this
> > notion of fenced replicas in the ZK world.
> >
> > José - That's right. I have updated that section as you suggested.
> >
> > Best,
> > David
> >
> > On Tue, May 24, 2022 at 5:39 PM José Armando García Sancio
> >  wrote:
> > >
> > > Hi David,
> > >
> > > Thanks for the KIP. In the "Compatibility, Deprecation, and Migration
> > > Plan", you have:
> > >
> > > > The change is backward compatible.
> > >
> > > I assume that we don't need to increase the metadata.version/IBP for
> > > AlterPartition because AlterPartitionManager uses ApiVersions for that
> > > channel. Should we mention that in that section?
> > >
> > > --
> > > -José
> >


Re: [VOTE] KIP-841: Fenced replicas should not be allowed to join the ISR in KRaft

2022-06-01 Thread David Jacot
Hi Colin,

Thanks for your feedback! Please find my answers below.

> However, I wonder if this will be feasible for ZK-based brokers to do. We 
> still support pre-topic-id IBP versions there, right? This might end up being 
> more complex than you were hoping.

That's right. We support pre-topic-id IBP versions for ZK-based
brokers. We will only use version 2 if all the topics have an ID when
the request is constructed. Version 1 is used otherwise. I have
already implemented this part in the draft PR if you want to see how
it looks. It brings a little more complexity in the request
building/handling but that works. I think that it is definitely worth
it.

> We should add a comment in AlterPartitionResponse about the new error code 
> INELIGIBLE_REPLICA. This is very important for error codes so we can track 
> which ones are returned in which RPC version.

Totally. I have those comments in the PR but I forgot to add them in
the KIP. I will fix this.

> I also wonder if we should add a special error code for the case where the 
> AlterPartitions call completes a reassignment AND the completed reassignment 
> no longer includes the previous leader. We have been overloading 
> FENCED_LEADER_EPOCH for this case, but this is coonfusing to operators 
> (especially since this RPC does not support error messages, as opposed to 
> codes)

Interesting. I was not aware of this one. We don't do this in the ZK
controller, right? I do agree that using FENCED_LEADER_EPOCH is
confusing here. We could introduce a NEW_LEADER_ELECTED error code for
this purpose.

> One thing, though, is that we should define how this interacts with replica 
> placement. It seems to me that replicas should not be able to be placed on 
> these inControlledShutdown nodes (unless done manually via the explicit 
> placement API).

Yeah, I agree that we need to clarify this. If not mistaken, replicas
can be placed on fenced nodes so I don't see why it should be
different for inControlledShutdown nodes. Otherwise we will again have
that create topic issue when the cluster has only 3 nodes. However I
think that we have to guarantee two other invariants:
1) an inControlledShutdown node should not be added to the ISR when a
partition is created.
2) an inControlledShutdown node should not be picked as a leader for a
partition. (e.g. KAFKA-13944)

Let me add this part to the KIP.

> I also think we should spell out the fact that once you go into controlled 
> shutdown, you don't come out except by creating a new broker instance. (new 
> incarnation ID). This also makes me wonder if we need to support the shutting 
> down -> not shutting down transition in BrokerRegistrationChangeRecord, since 
> we don't plan on using that transition.

That makes sense. I will add this and remove that transition from
BrokerRegistrationChangeRecord.

> Finally, RegisterBrokerRecord / BrokerRegistrationChangeRecord should be 
> bumped to the next RPC version since we have added a new field. You will also 
> need to assign yourself a new IBP / MetadataVersion. (For 
> BrokerRegistrationChangeRecord it would be possible to avoid the version 
> bump, since we're using tagged fields, but it's better to have it for 
> consistency, I think.)

Noted.

Let me update the KIP to incorporate all your feedback.

Cheers,
David

On Tue, May 31, 2022 at 9:42 PM Colin McCabe  wrote:
>
> > We should add a comment in AlterPartitionResponse about the new error
> > code INELIGIBLE_REPLICA. This is very important for error codes so we
> > can track which ones are returned in which RPC version. I also wonder
>
> Here I'm referring to AlterPartitionResponse.json
>
> cheers,
> Colin
>
> >
> >
> > On Tue, May 31, 2022, at 08:36, David Jacot wrote:
> >> Hi folks,
> >>
> >> I'd like to start a vote for KIP-841:
> >> https://cwiki.apache.org/confluence/x/phmhD.
> >>
> >> Thanks,
> >> David


[VOTE] KIP-840: Config file option for MessageReader/MessageFormatter in ConsoleProducer/ConsoleConsumer

2022-06-01 Thread Alexandre Garnier
Hi everyone!

I propose to start voting for KIP-840:
https://cwiki.apache.org/confluence/x/bBqhD

Thanks,
-- 
Alex


Re: [DISCUSS] KIP-841: Fenced replicas should not be allowed to join the ISR in KRaft

2022-06-01 Thread Ismael Juma
Hi David,

If you are adding topic IDs, should we remove topic names?

Ismael

On Tue, May 31, 2022, 6:36 AM David Jacot 
wrote:

> Hi all,
>
> Thanks for your feedback.
>
> I just updated the KIP as follows:
> * I propose to add a `TopicId` field at the same time while we are bumping
> the AlterPartition API version.
> * I propose to add the `InShuttingDown` field to the registration records
> to
> track if a broker is in controlled shutdown. This is useful for debugging
> and
> it would also prevent unnecessary AlterPartition requests from getting sent
> to the controller by the leaders. This will also help for KAFKA-13944.
>
> Artem - That's a good question. As explained by Colin we don't have this
> notion of fenced replicas in the ZK world.
>
> José - That's right. I have updated that section as you suggested.
>
> Best,
> David
>
> On Tue, May 24, 2022 at 5:39 PM José Armando García Sancio
>  wrote:
> >
> > Hi David,
> >
> > Thanks for the KIP. In the "Compatibility, Deprecation, and Migration
> > Plan", you have:
> >
> > > The change is backward compatible.
> >
> > I assume that we don't need to increase the metadata.version/IBP for
> > AlterPartition because AlterPartitionManager uses ApiVersions for that
> > channel. Should we mention that in that section?
> >
> > --
> > -José
>


Re: [DISCUSS] KIP-846: Task-level Streams metrics for bytes/records Produced

2022-06-01 Thread Sophie Blee-Goldman
>
> Just a quick question: for filling the gap of sub-topology visibilities,
> would task-level produced metrics be sufficient?


If I understand your question correctly, you're asking whether we could
just report at the task/subtopology
level since we mainly want the bytes/throughput produced by the subtopology
itself?

Note that since we're only reporting this metric at the sink nodes, it's
basically the same as a task-level metric
for any subtopology that has only one sink and only scales with the number
of output topics. If you're concerned
about a potential performance impact I would say that (a) making these
task-level doesn't buy us much, if anything,
and (b) we can always revisit this if our benchmarks do indeed reveal a
regression but for now let's not over-
optimize too much :)

Also, the general philosophy  behind the KAfka Streams metrics thus far has
been to report at the finest granularity
and allow users to roll them up into whatever scope they want to aggregate
over. This KIP adopts the same approach.

On Tue, May 31, 2022 at 12:02 PM Guozhang Wang  wrote:

> Hi Sophie,
>
> Just a quick question: for filling the gap of sub-topology visibilities,
> would task-level produced metrics be sufficient?
>
> On Mon, May 30, 2022 at 10:59 AM Bill Bejeck  wrote:
>
> > Thanks for the KIP Sophie.
> >
> > I'm in favor of this change as well. I don't have any comments in
> > addition to the ones already expressed.
> >
> > -Bill
> >
> > On Mon, May 30, 2022 at 4:55 AM Sagar  wrote:
> >
> > > Hi Sophie,
> > >
> > > A very minor comment but you might want to remove this KIP template
> > related
> > > information from the top of the KIP:
> > >
> > > *This page is meant as a template for writing a KIP
> > > <
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
> > > >.
> > > To create a KIP choose Tools->Copy on this page and modify with your
> > > content and replace the heading with the next KIP number and a
> > description
> > > of your issue. Replace anything in italics with your own description.*
> > >
> > >
> > > Thanks!
> > > Sagar.
> > >
> > > On Mon, May 30, 2022 at 1:04 PM Sophie Blee-Goldman
> > >  wrote:
> > >
> > > > >
> > > > > Why does the title of the KIP talk about task-level metrics, but
> the
> > > > > specified metrics are on processor-level?
> > > >
> > > >
> > > > Ah, my mistake -- it should indeed say "processor-level metrics".
> > Thanks
> > > > for the catch Bruno, the title has been fixed.
> > > >
> > > > Since there don't seem to be any concerns I'll proceed with kicking
> off
> > > the
> > > > vote. Thanks all!
> > > >
> > > > On Mon, May 30, 2022 at 12:01 AM Bruno Cadonna 
> > > wrote:
> > > >
> > > > > Thanks for the KIP, Sophie!
> > > > >
> > > > > I am also in favor of this KIP!
> > > > >
> > > > > I have one minor question:
> > > > >
> > > > > Why does the title of the KIP talk about task-level metrics, but
> the
> > > > > specified metrics are on processor-level?
> > > > >
> > > > > For the rest, I am +1.
> > > > >
> > > > > Best,
> > > > > Bruno
> > > > >
> > > > > On 29.05.22 00:20, John Roesler wrote:
> > > > > > Thanks for the well motivated and documented KIP, Sophie! I’m in
> > > favor
> > > > > of this change.
> > > > > >
> > > > > > -John
> > > > > >
> > > > > > On Sat, May 28, 2022, at 06:42, Sophie Blee-Goldman wrote:
> > > > > >> Hey all,
> > > > > >>
> > > > > >> I'd like to propose a very small KIP to add two metrics that
> will
> > > help
> > > > > fill
> > > > > >> a gap in the derivable produced and consumed metrics. Please
> take
> > a
> > > > look
> > > > > >> and reply here with any questions or concerns.
> > > > > >>
> > > > > >> KIP-846: Task-level Streams metrics for bytes/records Produced
> > > > > >> <
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211886093
> > > > > >
> > > > > >>
> > > > > >> Given the small nature of this I'm going to call for a vote
> soon,
> > > but
> > > > > >> please don't hesitate to raise anything you feel should be
> > discussed
> > > > in
> > > > > >> more detail first.
> > > > > >>
> > > > > >> Thanks!
> > > > > >> Sophie
> > > > >
> > > >
> > >
> >
>
>
> --
> -- Guozhang
>


Re: [VOTE] KIP-827: Expose logdirs total and usable space via Kafka API

2022-06-01 Thread Mickael Maison
Hi,

Thanks for the votes and feedback!

The vote passes with:
- 4 +1 (binding) votes from Luke, Tom, Jun and John
- 3 +1 (non-binding) votes from Igor, Divij and Federico

Thanks,
Mickael

On Wed, Jun 1, 2022 at 3:31 AM John Roesler  wrote:
>
> Thanks for the KIP Mickael,
>
> I'm +1 (binding)
>
> -John
>
> On Tue, May 31, 2022, at 18:48, Jun Rao wrote:
> > Hi, Mickael,
> >
> > Thanks for the KIP. +1
> >
> > Jun
> >
> > On Wed, May 25, 2022 at 7:54 AM Tom Bentley  wrote:
> >
> >> Hi Mickael,
> >>
> >> Thanks for the KIP! +1 (binding).
> >>
> >> Kind regards,
> >>
> >> Tom
> >>
> >> On Thu, 19 May 2022 at 11:28, Federico Valeri 
> >> wrote:
> >>
> >> > Thanks Mickael.
> >> >
> >> > +1 (non binding)
> >> >
> >> > On Wed, May 18, 2022 at 11:08 AM Divij Vaidya 
> >> > wrote:
> >> > >
> >> > > +1 non binding.
> >> > >
> >> > > Divij Vaidya
> >> > >
> >> > >
> >> > >
> >> > > On Tue, May 17, 2022 at 6:16 PM Igor Soarez  wrote:
> >> > >
> >> > > > Thanks for this KIP Mickael.
> >> > > >
> >> > > > +1 non binding
> >> > > >
> >> > > > --
> >> > > > Igor
> >> > > >
> >> > > > On Tue, May 17, 2022, at 2:48 PM, Luke Chen wrote:
> >> > > > > Hi Mickael,
> >> > > > >
> >> > > > > +1 (binding) from me.
> >> > > > > Thanks for the KIP!
> >> > > > >
> >> > > > > Luke
> >> > > > >
> >> > > > > On Tue, May 17, 2022 at 9:30 PM Mickael Maison <
> >> > mickael.mai...@gmail.com
> >> > > > >
> >> > > > > wrote:
> >> > > > >
> >> > > > >> Hi,
> >> > > > >>
> >> > > > >> I'd like to start a vote on KIP-827. It proposes exposing the
> >> total
> >> > > > >> and usable space of logdirs
> >> > > > >> via the DescribeLogDirs API:
> >> > > > >>
> >> > > > >>
> >> > > >
> >> >
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-827%3A+Expose+logdirs+total+and+usable+space+via+Kafka+API
> >> > > > >>
> >> > > > >> Thanks,
> >> > > > >> Mickael
> >> > > > >>
> >> > > >
> >> >
> >> >
> >>


[jira] [Created] (KAFKA-13952) Infinite retry timeout is not working

2022-06-01 Thread Jakub Malek (Jira)
Jakub Malek created KAFKA-13952:
---

 Summary: Infinite retry timeout is not working
 Key: KAFKA-13952
 URL: https://issues.apache.org/jira/browse/KAFKA-13952
 Project: Kafka
  Issue Type: Bug
  Components: KafkaConnect
Reporter: Jakub Malek


The 
[documentation|https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/ConnectorConfig.java#L129]
 for {{errors.retry.timeout}} property says:
{noformat}
The maximum duration in milliseconds that a failed operation will be 
reattempted. The default is 0, which means no retries will be attempted. Use -1 
for infinite retries.{noformat}

But it seems that value {{-1}} is not respected by the 
[RetryWithToleranceOperator|https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/errors/RetryWithToleranceOperator.java]
 that simply compares elapsed time until {{startTime + errorRetryTimeout}} is 
exceeded.

I was also not able to find any conversion of the raw config value before 
{{RetryWithToleranceOperator}} is initialized.
I run a simple test with a connector using mocked transformation plugin that 
throws the {{RetriableException}} and it seems to prove my claim.

I'm not sure if it's documentation or implementation error or maybe I've missed 
something.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)