PANIC: Unable to recover the cluster after all the controllers in KRaft mode were dead at the same time

2023-12-01 Thread Jesus Cea

Kafka 3.6.0.

I have a KRaft cluster with three quorum servers. A power failure killed 
all the controllers at the same time. After rebooting, the controllers 
cannot connect to each other, so the cluster is down.


Log:

"""
[...]
[2023-12-01 20:29:24,931] INFO [MetadataLoader id=1000] 
initializeNewPublishers: the loader is still catching up because we 
still don't know the high water mark yet. 
(org.apache.kafka.image.loader.MetadataLoader)
[2023-12-01 20:29:24,957] ERROR [RaftManager id=1000] Unexpected error 
UNKNOWN_SERVER_ERROR in VOTE response: 
InboundResponse(correlationId=6698, data=VoteResponseData(errorCode=-1, 
topics=[]), sourceId=1002) (org.apache.kafka.raft.KafkaRaftClient)
[2023-12-01 20:29:24,957] INFO [RaftManager id=1000] Vote request 
VoteRequestData(clusterId='*EDITED*', 
topics=[TopicData(topicName='__cluster_metadata', 
partitions=[PartitionData(partitionIndex=0, candidateEpoch=13359, 
candidateId=1001, lastOffsetEpoch=13079, lastOffset=6304466)])]) with 
epoch 13359 is rejected (org.apache.kafka.raft.KafkaRaftClient)
[2023-12-01 20:29:24,983] ERROR [RaftManager id=1000] Unexpected error 
UNKNOWN_SERVER_ERROR in VOTE response: 
InboundResponse(correlationId=6699, data=VoteResponseData(errorCode=-1, 
topics=[]), sourceId=1001) (org.apache.kafka.raft.KafkaRaftClient)
[2023-12-01 20:29:25,020] ERROR [RaftManager id=1000] Unexpected error 
UNKNOWN_SERVER_ERROR in VOTE response: 
InboundResponse(correlationId=6700, data=VoteResponseData(errorCode=-1, 
topics=[]), sourceId=1002) (org.apache.kafka.raft.KafkaRaftClient)
[2023-12-01 20:29:25,032] INFO [MetadataLoader id=1000] 
initializeNewPublishers: the loader is still catching up because we 
still don't know the high water mark yet. 
(org.apache.kafka.image.loader.MetadataLoader)
[2023-12-01 20:29:25,044] ERROR [RaftManager id=1000] Unexpected error 
UNKNOWN_SERVER_ERROR in VOTE response: 
InboundResponse(correlationId=6701, data=VoteResponseData(errorCode=-1, 
topics=[]), sourceId=1001) (org.apache.kafka.raft.KafkaRaftClient)
[2023-12-01 20:29:25,082] ERROR [RaftManager id=1000] Unexpected error 
UNKNOWN_SERVER_ERROR in VOTE response: 
InboundResponse(correlationId=6702, data=VoteResponseData(errorCode=-1, 
topics=[]), sourceId=1002) (org.apache.kafka.raft.KafkaRaftClient)
[2023-12-01 20:29:25,105] ERROR [RaftManager id=1000] Unexpected error 
UNKNOWN_SERVER_ERROR in VOTE response: 
InboundResponse(correlationId=6703, data=VoteResponseData(errorCode=-1, 
topics=[]), sourceId=1001) (org.apache.kafka.raft.KafkaRaftClient)
[2023-12-01 20:29:25,133] INFO [MetadataLoader id=1000] 
initializeNewPublishers: the loader is still catching up because we 
still don't know the high water mark yet. 
(org.apache.kafka.image.loader.MetadataLoader)

[...]
"""

I use SASL_SSL. The controller credentials are hard-wired in the 
configuration, so knowledge of the metadata "high water mark" should not 
be necessary:


"""
listener.name.controller.sasl.enabled.mechanisms=PLAIN,SCRAM-SHA-256
listener.name.controller.plain.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
    username="controller" \
    password="*EDITED*" \
    user_controller="*EDITED*";

listener.name.controller.scram-sha-256.sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
    username="*EDITED*" \
    password="*EDITED*";

"""

--
Jesús Cea Avión - j...@jcea.es - https://www.jcea.es/


Re: PANIC: Unable to recover the cluster after all the controllers in KRaft mode were dead at the same time

2023-12-01 Thread Jesus Cea

On 1/12/23 20:42, Jesus Cea wrote:
I use SASL_SSL. The controller credentials are hard-wired in the 
configuration, so knowledge of the metadata "high water mark" should not 
be necessary:


"""
listener.name.controller.sasl.enabled.mechanisms=PLAIN,SCRAM-SHA-256
listener.name.controller.plain.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
    username="controller" \
    password="*EDITED*" \
    user_controller="*EDITED*";

listener.name.controller.scram-sha-256.sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
    username="*EDITED*" \
    password="*EDITED*";
"""


Since I am using SASL/PLAIN over SASL_SSL for inter-controller 
authentication, because of https://issues.apache.org/jira/browse/KAFKA-15513, 
I just added the controller's user to "super.users" on the three quorum 
servers and the cluster worked again. Then I did a rolling restart of each 
controller to retire that "super" permission without breaking the quorum.
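
For reference, a minimal sketch of that temporary workaround, assuming the 
controller principal is literally "controller" as in the JAAS config above 
(the exact principal name is an assumption):

"""
# Temporary addition to server.properties on each quorum server: grants the
# inter-controller principal full access, so its requests are not rejected
# while the SCRAM credentials in the metadata log are still unavailable.
super.users=User:controller
"""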


Things look good so far.

Any suggestions, besides distributing the controllers geographically?

Thanks.

Have a nice weekend.

--
Jesús Cea Avión - j...@jcea.es - https://www.jcea.es/


Re: Kafka 2.7.2 to 3.5.1 upgrade

2023-12-01 Thread Haruki Okada
Hi.

I'm not sure whether Kafka Manager has such a bug, but you should first 
check whether there actually are any under-replicated partitions, using the 
`kafka-topics.sh` command with the `--under-replicated-partitions` option as 
shown below.
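
A minimal sketch of that check (the bootstrap server address is an 
assumption; point it at one of your brokers):

"""
# Lists only the partitions whose in-sync replica set is smaller than the
# configured replica set.
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions
"""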

On Thu, Nov 30, 2023, 23:41 Lud Antonie wrote:

> Hello,
>
> After upgrading from 2.7.2 to 3.5.1, some topics are missing a partition on
> one or two brokers.
> Kafka Manager shows "Under replicated%" for the topic.
> Looking at the topic, partitions are missing on some of the 3 brokers (1
> partition in my case).
> A rollback restores "Under replicated%" to 0 again (the desired value).
>
> Is this a bug in Kafka or in Kafka Manager?
>
> Best regards,
> Lud Antonie
>


-- 

Okada Haruki
ocadar...@gmail.com



Re: Relation between fetch.max.bytes, max.partition.fetch.bytes & max.poll.records

2023-12-01 Thread Haruki Okada
Hi.

`max.poll.records` has no effect on fetch requests; it only limits how many
records a single poll() returns from the consumer's already-fetched buffer
(ref:
https://kafka.apache.org/35/documentation.html#consumerconfigs_max.poll.records
)

Then, how much data is returned for a single fetch request depends on the
partition-leader assignment. (note: we assume follower fetching is not used
here)
If all partition leaders are on the same broker, 40MB (2MB * 20 partitions)
will be returned for a single fetch request.
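
A minimal sketch of the consumer settings from the question, annotated with
what each one actually bounds (the values are the question's own, not
recommendations):

"""
# max.poll.records caps how many records a single poll() returns to the
# application; it does not change what a fetch request brings back.
max.poll.records=20
# fetch.max.bytes bounds the total data per fetch response (50 MB here); it
# is not an absolute maximum.
fetch.max.bytes=52428800
# max.partition.fetch.bytes bounds the data per partition (1 MB here), but an
# oversized first batch is still returned so the consumer can make progress.
max.partition.fetch.bytes=1048576
"""

So poll() will hand the application at most 20 records at a time, even though
a single fetch may have buffered much more than that.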

On Thu, Nov 30, 2023, 17:10 Debraj Manna wrote:

> The doc states that fetch.max.bytes & max.partition.fetch.bytes are not
> absolute maximums:
>
> > If the first record batch in the first non-empty partition of the fetch is
> > larger than this limit, the batch will still be returned to ensure that
> > the consumer can make progress.
>
>
> I am getting a bit confused.
>
> Let's say I have a configuration like below with sufficient messages in
> each partition
>
>
>- Partitions in a topic 20
>- Single message size 2MB
>- Consumers 5
>- max.poll.records 20
>- fetch.max.bytes 50 MB
>- max.partition.fetch.bytes 1 MB.
>
> The broker configs message.max.bytes and max.message.bytes are set to the
> default 100MB
>
> If the consumer does a poll, will it receive 20 records? If yes, then do
> fetch.max.bytes & max.partition.fetch.bytes have no significance compared
> with max.poll.records?
>
>
>- Java Kafka Client - 3.5.1
>- Kafka Broker - 2.8.1
>


-- 

Okada Haruki
ocadar...@gmail.com



Re: Kafka 2.7.2 to 3.5.1 upgrade

2023-12-01 Thread megh vidani
Hi Lud,

For the topics where you're seeing under-replicated partitions: did you try
to increase the number of partitions at any time after creating those topics,
before the upgrade?

We faced issues with 2.8.0 earlier, in which we had increased the number of
partitions for some topics, and for those topics we used to see
under-replicated partitions after every restart.

The reason this happened was a bug in Kafka which assigned a new topicId
(different from the original topicId) to the newly added partitions in the
partition.metadata file, and upon restart of the Kafka brokers this topicId
did not reconcile between the brokers and ZooKeeper. One way to spot the
mismatch is sketched below.
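
A minimal sketch for checking whether the on-disk topicId diverges for the
added partitions (the log directory and topic name are assumptions; compare
the output against the topic ID shown by `kafka-topics.sh --describe`):

"""
# Each partition directory contains a partition.metadata file recording the
# topic_id for that partition; all partitions of one topic should report
# the same ID.
grep -H topic_id /var/lib/kafka/data/my-topic-*/partition.metadata
"""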

Thanks,
Megh

On Thu, Nov 30, 2023, 20:10 Lud Antonie wrote:

> Hello,
>
> After upgrading from 2.7.2 to 3.5.1, some topics are missing a partition on
> one or two brokers.
> Kafka Manager shows "Under replicated%" for the topic.
> Looking at the topic, partitions are missing on some of the 3 brokers (1
> partition in my case).
> A rollback restores "Under replicated%" to 0 again (the desired value).
>
> Is this a bug in Kafka or in Kafka Manager?
>
> Best regards,
> Lud Antonie
>