[jira] [Resolved] (KAFKA-15106) AbstractStickyAssignor may get stuck in 3.5

2023-08-03 Thread li xiangyuan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

li xiangyuan resolved KAFKA-15106.
--
Resolution: Fixed

> AbstractStickyAssignor may get stuck in 3.5
> ---
>
> Key: KAFKA-15106
> URL: https://issues.apache.org/jira/browse/KAFKA-15106
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 3.5.0
>Reporter: li xiangyuan
>Assignee: li xiangyuan
>Priority: Major
> Fix For: 3.6.0
>
>
> This can be reproduced easily in a unit test: in
> org.apache.kafka.clients.consumer.internals.AbstractStickyAssignorTest#testLargeAssignmentAndGroupWithNonEqualSubscription,
> set partitionCount=200 and consumerCount=20, and you will see that
> isBalanced returns false forever.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15170) CooperativeStickyAssignor cannot adjust assignment correctly

2023-07-10 Thread li xiangyuan (Jira)
li xiangyuan created KAFKA-15170:


 Summary: CooperativeStickyAssignor cannot adjust assignment 
correctly
 Key: KAFKA-15170
 URL: https://issues.apache.org/jira/browse/KAFKA-15170
 Project: Kafka
  Issue Type: Bug
  Components: consumer
Affects Versions: 3.5.0
Reporter: li xiangyuan
Assignee: li xiangyuan


AbstractStickyAssignor uses ConstrainedAssignmentBuilder to build the assignment 
when all consumers in the group subscribe to the same topic list, but it fails to add 
every partition that moves to another owner to 
``partitionsWithMultiplePreviousOwners``.

 

The reason is that the function assignOwnedPartitions does not add partitions whose 
rack mismatches that of their previous owner to allRevokedPartitions, and only partitions in 
this list are later added to partitionsWithMultiplePreviousOwners.

 

In a cooperative rebalance, partitions that have changed owner must be removed from 
the final assignment, otherwise this leads to incorrect consume behavior. I have already 
raised a PR, please take a look, thanks:

 

https://github.com/apache/kafka/pull/13965
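
To make the invariant concrete, here is a minimal, hypothetical check (plain Java, not Kafka code; all names are illustrative) of the rule described above: a partition whose owner changes between two assignments must not appear in the new owner's assignment in the same rebalance, because the previous owner has to revoke it first.
{code:java}
import java.util.*;

public class CooperativeInvariantCheck {

    // Returns descriptions of partitions that jumped directly to a new owner.
    static List<String> violations(Map<String, Set<String>> oldAssignment,
                                   Map<String, Set<String>> newAssignment) {
        Map<String, String> previousOwner = new HashMap<>();
        oldAssignment.forEach((consumer, parts) ->
                parts.forEach(p -> previousOwner.put(p, consumer)));

        List<String> result = new ArrayList<>();
        newAssignment.forEach((consumer, parts) -> {
            for (String p : parts) {
                String prev = previousOwner.get(p);
                if (prev != null && !prev.equals(consumer)) {
                    result.add(p + " moved " + prev + " -> " + consumer
                            + " without being revoked first");
                }
            }
        });
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> before = Map.of(
                "c1", Set.of("t-0", "t-1"),
                "c2", Set.of("t-2"));
        // t-1 jumps straight from c1 to c2; a cooperative assignor must instead
        // leave t-1 unassigned in this round and hand it out in the next one.
        Map<String, Set<String>> after = Map.of(
                "c1", Set.of("t-0"),
                "c2", Set.of("t-1", "t-2"));
        violations(before, after).forEach(System.out::println);
    }
}
{code}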



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15106) AbstractStickyAssignor may get stuck in 3.5

2023-06-19 Thread li xiangyuan (Jira)
li xiangyuan created KAFKA-15106:


 Summary: AbstractStickyAssignor may get stuck in 3.5
 Key: KAFKA-15106
 URL: https://issues.apache.org/jira/browse/KAFKA-15106
 Project: Kafka
  Issue Type: Bug
  Components: clients
Affects Versions: 3.5.0
Reporter: li xiangyuan


This can be reproduced easily in a unit test: in 
org.apache.kafka.clients.consumer.internals.AbstractStickyAssignorTest#testLargeAssignmentAndGroupWithNonEqualSubscription, 
set partitionCount=200 and consumerCount=20, and you will see that 
isBalanced returns false forever.
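
For readers without the test suite at hand, here is a rough standalone sketch that drives the sticky assignor through the public ConsumerPartitionAssignor API with 200 partitions per topic and 20 consumers whose subscriptions differ. The topic names, broker node and subscription mix are assumptions made for illustration; the authoritative reproduction remains the parameterized project test named above, and whether this exact sketch hangs depends on hitting the affected code path in 3.5.0.
{code:java}
import java.util.*;
import java.util.stream.*;
import org.apache.kafka.clients.consumer.ConsumerPartitionAssignor.GroupSubscription;
import org.apache.kafka.clients.consumer.ConsumerPartitionAssignor.Subscription;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.Node;
import org.apache.kafka.common.PartitionInfo;

public class StickyAssignorRepro {
    public static void main(String[] args) {
        int partitionCount = 200;
        int consumerCount = 20;
        Node node = new Node(0, "localhost", 9092);
        List<String> topics = List.of("topic-a", "topic-b");   // assumed topic names

        // partitionCount partitions per topic, all led by one rack-unaware node.
        List<PartitionInfo> partitions = topics.stream()
                .flatMap(t -> IntStream.range(0, partitionCount)
                        .mapToObj(p -> new PartitionInfo(t, p, node,
                                new Node[]{node}, new Node[]{node})))
                .collect(Collectors.toList());
        Cluster cluster = new Cluster("test-cluster", List.of(node), partitions,
                Collections.emptySet(), Collections.emptySet());

        // Non-equal subscriptions: every consumer subscribes to topic-a,
        // only every second consumer also subscribes to topic-b.
        Map<String, Subscription> subscriptions = new HashMap<>();
        for (int i = 0; i < consumerCount; i++) {
            List<String> subscribed = (i % 2 == 0) ? topics : List.of("topic-a");
            subscriptions.put("consumer-" + i, new Subscription(subscribed));
        }

        // On an affected 3.5.0 client the assignment loop can fail to terminate
        // inside a call like this; on a fixed client it returns promptly.
        new CooperativeStickyAssignor()
                .assign(cluster, new GroupSubscription(subscriptions))
                .groupAssignment()
                .forEach((c, a) -> System.out.println(c + " -> " + a.partitions().size()));
    }
}
{code}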



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14914) binarySearch in AbstractIndex may execute in an infinite loop

2023-04-17 Thread li xiangyuan (Jira)
li xiangyuan created KAFKA-14914:


 Summary: binarySearch in AbstractIndex may execute in an infinite loop
 Key: KAFKA-14914
 URL: https://issues.apache.org/jira/browse/KAFKA-14914
 Project: Kafka
  Issue Type: Bug
  Components: core
Affects Versions: 2.4.0
Reporter: li xiangyuan
 Attachments: stack.1.txt, stack.2.txt, stack.3.txt

Recently our servers in the production environment have suddenly stopped handling 
requests several times (so far 3 times in less than 10 days). Please check the uploaded 
stack files: they show that one io thread (data-plane-kafka-request-handler-11) holds 
the read lock of a Partition's leaderIsrUpdateLock and keeps running the binarySearch 
function. Once another thread (kafka-scheduler-2) requests the write lock, all requests 
that read this partition wait for the read lock, which uses up all io threads, 
and then this broker cannot handle any request.

The 3 stack files were fetched at intervals of about 6 minutes. From my standpoint, 
the binarySearch function obviously causes the deadlock, and I 
presume the index entries in the offset index (at least in the mmap) are 
not sorted.
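
For anyone unfamiliar with the pattern, here is a minimal self-contained sketch (plain java.util.concurrent, not Kafka's Partition code) of the pile-up described above: one reader never returns while holding the read lock, a writer queues behind it, and every later reader then blocks behind the queued writer, which is how all io threads get used up. A jstack of this program shows the analogous shape.
{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadWriteLockPileUp {
    public static void main(String[] args) throws InterruptedException {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

        // Reader stuck forever while holding the read lock,
        // like an io thread spinning inside binarySearch.
        new Thread(() -> {
            lock.readLock().lock();
            while (true) { /* simulates the non-terminating binarySearch */ }
        }, "stuck-reader").start();

        TimeUnit.MILLISECONDS.sleep(100);

        // Writer waiting for the write lock, like kafka-scheduler-2.
        new Thread(() -> {
            lock.writeLock().lock();   // blocks forever
            lock.writeLock().unlock();
        }, "writer").start();

        TimeUnit.MILLISECONDS.sleep(100);

        // New readers (like the other io threads) now block behind the queued
        // writer even though only a read lock is currently held.
        for (int i = 0; i < 3; i++) {
            new Thread(() -> {
                lock.readLock().lock();    // blocks forever
                lock.readLock().unlock();
            }, "blocked-reader-" + i).start();
        }
    }
}
{code}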

 

Detailed information:

This problem appeared on 2 brokers.

broker version: 2.4.0

jvm: openjdk 11

hardware: aws c7g.4xlarge. This is an arm64 server; we recently upgraded our 
servers from c6g.4xlarge to this type. When we used c6g we never hit this 
problem, so we don't know whether arm in general or the aws c7g instance type has any problem.

other: once we restart the broker, it recovers, so we suspect the offset index file 
itself is not corrupted and that something may be wrong with the mmap.

Please give any suggestion to solve this problem, thanks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-9648) kafka server should resize the backlog when creating the server socket

2020-03-04 Thread li xiangyuan (Jira)
li xiangyuan created KAFKA-9648:
---

 Summary: kafka server should resize the backlog when creating the server socket
 Key: KAFKA-9648
 URL: https://issues.apache.org/jira/browse/KAFKA-9648
 Project: Kafka
  Issue Type: Improvement
  Components: core
Affects Versions: 0.10.0.1
Reporter: li xiangyuan


I have described a mysterious problem before 
(https://issues.apache.org/jira/browse/KAFKA-9211): in that issue I found that the kafka 
server triggers tcp congestion control under some conditions. We have finally found 
the root cause.

When a kafka server restarts for any reason and a preferred replica leader election 
is then executed, lots of partition leaders are given back to it and a cluster metadata 
update is triggered. Then all clients establish connections to this server. At that 
moment many tcp connection requests are waiting in the tcp syn queue, and then 
in the accept queue.

kafka creates the server socket in SocketServer.scala:

 
{code:java}
serverChannel.socket.bind(socketAddress);
{code}
This bind method has an overload with a second parameter "backlog"; min(backlog, tcp_max_syn_backlog) 
decides the queue length. Because kafka doesn't pass it, the default value of 50 is used.
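
For illustration, here is a minimal plain-Java sketch (port and backlog value are assumptions for the example; this is not the SocketServer.scala code) of what passing an explicit backlog, such as the 512 mentioned at the end of this report, looks like:
{code:java}
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;

public class BindWithBacklog {
    public static void main(String[] args) throws Exception {
        ServerSocketChannel serverChannel = ServerSocketChannel.open();
        // Two-argument bind: the second argument is the accept backlog hint.
        // Leaving it out makes the JDK fall back to its default of 50; the
        // effective queue length is still capped by kernel settings.
        serverChannel.socket().bind(new InetSocketAddress(9092), 512);
        System.out.println("listening on " + serverChannel.getLocalAddress());
        serverChannel.close();
    }
}
{code}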

If this queue is full and tcp_syncookies = 0, new connection requests will 
be rejected. If tcp_syncookies = 1, the tcp syncookie mechanism is triggered. 
This mechanism allows linux to handle more tcp syn requests, but it 
loses many tcp options, including "wscale", the one that allows a tcp 
connection to send many more bytes per tcp packet. Because syncookies were triggered, 
wscale is lost, and such a tcp connection transfers data very slowly, 
forever, until the connection is closed and another tcp connection is established.

So after a preferred replica election is executed, lots of new tcp connections are 
established without wscale set, and much of the network traffic to this server runs 
at a very slow speed.

I'm not sure whether newer linux versions have resolved this problem, but kafka 
should also set the backlog to a larger value. We have now modified it to 512 and 
everything seems ok.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-9646) kafka consumer causes high cpu usage

2020-03-04 Thread li xiangyuan (Jira)
li xiangyuan created KAFKA-9646:
---

 Summary: kafka consumer causes high cpu usage
 Key: KAFKA-9646
 URL: https://issues.apache.org/jira/browse/KAFKA-9646
 Project: Kafka
  Issue Type: Improvement
  Components: clients
Affects Versions: 2.3.0
 Environment: centos-7 3.10.0-957.21.3.el7.x86_64

Reporter: li xiangyuan
 Attachments: 0.10.0.1.svg, 2.4.0.svg, cpu_use

Recently we upgraded the kafka server from 0.10.0.1 to 2.3.0 successfully, and 
because kafka supports fetching records from the closest broker since 2.4.0, we decided 
to upgrade our clients from 0.10.0.1 to 2.4.0 directly.

After the upgrade, we found some applications use much more cpu than before. The 
worst one went up from 45% to 70%, so we had to roll back that application.

We profiled this application in the test environment (each run executes for 6 minutes) 
and got a cpu flame graph for each of the 2 kafka-clients versions. I have uploaded these files.

We found that after the upgrade to 2.4.0, Selector.selectNow causes the highest cpu usage. This 
application subscribes to 20 topics with 6 consumer threads each, and 19 of the 
topics have a low produce rate (less than 1 message per minute). We set 
fetch.max.wait.ms to 5000; cpu usage dropped a little but is still high.

 

Then I wrote a test application that subscribes to 1 topic with 120 consumer 
threads. With the 2.4.0 client, cpu usage is about 40%; with 0.10.0.1, cpu 
usage is less than 10%.
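
For reference, the test application was essentially of the following shape; the broker address, group id and topic name here are placeholders, not the real values. Each thread runs its own KafkaConsumer poll loop against a mostly idle topic, so nearly every poll() returns empty.
{code:java}
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class IdleConsumerCpuTest {
    public static void main(String[] args) {
        for (int i = 0; i < 120; i++) {             // 120 consumer threads
            new Thread(() -> {
                Properties props = new Properties();
                props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
                props.put(ConsumerConfig.GROUP_ID_CONFIG, "cpu-test");                // placeholder
                props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                        "org.apache.kafka.common.serialization.StringDeserializer");
                props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                        "org.apache.kafka.common.serialization.StringDeserializer");
                try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                    consumer.subscribe(List.of("cpu-test-topic"));                    // placeholder
                    while (true) {
                        // The topic is almost idle, so nearly every poll returns empty;
                        // with the 2.4.0 client this loop burned far more cpu than with 0.10.0.1.
                        consumer.poll(Duration.ofMillis(500));
                    }
                }
            }, "consumer-" + i).start();
        }
    }
}
{code}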

Then I tried 2.4.0 with a modified org.apache.kafka.common.network.Selector#select; the old 
code is below:
{code:java}
if (timeoutMs == 0L)
    return this.nioSelector.selectNow();
else
    return this.nioSelector.select(timeoutMs);
{code}
and changed it to:
{code:java}
if (timeoutMs == 0) {
    timeoutMs = 1;
}
return this.nioSelector.select(timeoutMs);
{code}
After this change, cpu usage is about 20%. I have uploaded the cpu usage picture.

I'm wondering why selectNow causes such high cpu usage. Maybe the 2.4.0 client makes 
too many useless select calls? Or does linux have a performance issue when many threads 
call selectNow concurrently?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-9211) kafka upgrade to 2.3.0 causes produce speed decrease

2019-11-19 Thread li xiangyuan (Jira)
li xiangyuan created KAFKA-9211:
---

 Summary: kafka upgrade to 2.3.0 causes produce speed decrease
 Key: KAFKA-9211
 URL: https://issues.apache.org/jira/browse/KAFKA-9211
 Project: Kafka
  Issue Type: Bug
  Components: controller, producer 
Affects Versions: 2.3.0
Reporter: li xiangyuan
 Attachments: broker-jstack.txt, producer-jstack.txt

Recently we have been trying to upgrade kafka from 0.10.0.1 to 2.3.0.

We have 15 clusters in the production env, each with 3~6 brokers.

We know a kafka upgrade should:
      1. replace the code with the 2.3.0 jar and restart all brokers one by one
      2. unset inter.broker.protocol.version=0.10.0.1 and restart all brokers one by one
      3. unset log.message.format.version=0.10.0.1 and restart all brokers one by one
 
For now we have already done steps 1 & 2 in 12 clusters, but when we tried to 
run step 2 on the remaining clusters (which have already done step 1), we found that some topics 
drop produce speed badly.
We have researched this issue for a long time; since we couldn't test it in the 
production environment and couldn't reproduce it in the test environment, we 
couldn't find the root cause.
Now I can only describe the situation in as much detail as I know; I hope someone 
can help us.
 
1. Because of bug KAFKA-8653, I added the code below in the KafkaApis.scala 
handleJoinGroupRequest function:
{code:java}
if (rebalanceTimeoutMs <= 0) {
  rebalanceTimeoutMs = joinGroupRequest.data.sessionTimeoutMs
}
{code}

2. One cluster whose upgrade failed has 6 8C16G brokers and about 200 topics with 2 
replicas; every broker keeps 3000+ partitions and 1500+ leader partitions, but 
most topics have a very low produce rate, less than about 50 messages/sec. 
Only one topic, with 300 partitions, has more than 2500 messages/sec, and more 
than 20 consumer groups consume messages from it.

So this whole cluster produces 4K messages/sec, with 11 MB/sec in and 240 MB/sec 
out, and more than 90% of the traffic is made by that topic with 2500 messages/sec.

When we unset inter.broker.protocol.version=0.10.0.1 on 5 or 6 servers and 
restart, this topic's produce rate drops to about 200 messages/sec. I don't 
know whether the way we use kafka could trigger any problem.

3. We use kafka wrapped by spring-kafka and set the KafkaTemplate's autoFlush=true, 
so each producer.send call also executes producer.flush immediately. I 
know the flush method decreases produce performance dramatically, but at least 
nothing seemed wrong before upgrade step 2. I doubt whether it is a 
problem now after the upgrade.
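
For clarity, here is a rough sketch (broker address and topic are placeholders, and this is the plain-producer equivalent rather than our spring-kafka code) of what autoFlush=true amounts to on every send: a send immediately followed by a blocking flush, which defeats batching.
{code:java}
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AutoFlushEquivalent {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.ACKS_CONFIG, "1");                           // acks=1 as in item 5
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // With KafkaTemplate autoFlush=true, every send is effectively:
            producer.send(new ProducerRecord<>("some-topic", "key", "value")); // placeholder topic
            producer.flush(); // blocks until the record is sent, so batching is lost
        }
    }
}
{code}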

4. I noticed that when the produce speed decreases, consumer groups with large message 
lag still consume messages without any change or decrease in consume speed, so I 
guess only the ProduceRequest rate drops, but not the FetchRequest rate.

5. We haven't set any throttle configuration, and all producers use acks=1 (so it's 
not slow broker replica fetching). When this problem is triggered, both the servers' and 
the producers' cpu usage goes down, and the servers' io util stays below 30%, so it 
shouldn't be a hardware problem.

6. This event is triggered often (almost 100%) once most brokers have done upgrade step 
2: after an automatic leader election executes, we can observe the 
produce speed drop, and we have to downgrade the brokers (set 
inter.broker.protocol.version=0.10.0.1) and restart them one by one, then it 
returns to normal. Some clusters have to downgrade all brokers, but some clusters 
can leave 1 or 2 brokers without downgrading; I notice that the broker that doesn't need a 
downgrade is the controller.

7. I have taken jstack dumps of a producer & the servers. Although I did this on a different 
cluster, we can notice that their threads really seem to be idle.

8. Both the 0.10.0.1 & 2.3.0 kafka-clients trigger this problem too.

9. While the largest topic always drops produce speed, other topics 
drop produce speed randomly: topicA may drop speed in the first upgrade 
attempt but not the next, and topicB may not drop speed in the first attempt but drop 
in another attempt.

10. In fact, the largest cluster has the same topic & group usage scenario 
mentioned above, but its largest topic has 12,000 messages/sec, and its upgrade fails 
already at step 1 (just using the 2.3.0 jar).


Any help would be appreciated, thanks. I'm very sad now...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)