[jira] [Comment Edited] (KAFKA-10888) Sticky partitioner leads to uneven message production, resulting in abnormal delays in some partitions

2021-02-17 Thread Jason Gustafson (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-10888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17286089#comment-17286089
 ] 

Jason Gustafson edited comment on KAFKA-10888 at 2/17/21, 7:04 PM:
---

[~junrao] That's a good observation. The partitioner knows the approximate size 
of each batch that gets sent, so it could take that into account before 
rotating partitions. We were thinking of approaches that involved the 
partitioner being aware of how many bytes were in flight, but propagating this 
information from the accumulator to the partitioner is kind of messy. 

Perhaps it is simpler for the partitioner to track the total bytes sent to 
each partition. Maybe we could see this analogously to the TCP window size. 
For example, say you start with an expected size of 10 records (or 1k bytes, 
or whatever). The partitioner keeps writing to the partition until this many 
records/bytes have been sent, regardless of the batching. If the batch is not 
filled when the limit is reached, then we move on to the next partition, but 
we increase the limit. On the other hand, if a batch gets sent before the 
limit is reached, we continue writing to the partition until the limit is 
reached and decrease the limit when we move on to the next partition. In this 
way, we can keep a better balance between partitions.
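
To make this concrete, here is a rough sketch of the idea (this is not the 
actual producer partitioner; the class, method names, and tuning constants are 
assumptions for illustration):

{code:java}
// Hypothetical sketch of the adaptive byte-limit rotation described above.
import java.util.concurrent.ThreadLocalRandom;

public class AdaptiveStickySketch {
    private static final long MIN_THRESHOLD_BYTES = 1024;

    private final int numPartitions;
    private int currentPartition;
    private long bytesSentToCurrent = 0;
    private long thresholdBytes = 16 * 1024;      // assumed starting "window"
    private boolean batchSentBeforeLimit = false;

    public AdaptiveStickySketch(int numPartitions) {
        this.numPartitions = numPartitions;
        this.currentPartition = ThreadLocalRandom.current().nextInt(numPartitions);
    }

    /** Called for every record; returns the partition to append it to. */
    public int partitionFor(int recordSizeBytes) {
        if (bytesSentToCurrent >= thresholdBytes) {
            // Limit reached: rotate. If a batch already drained before the limit,
            // shrink the window; if no batch ever filled, grow it.
            thresholdBytes = batchSentBeforeLimit
                    ? Math.max(MIN_THRESHOLD_BYTES, thresholdBytes - thresholdBytes / 4)
                    : thresholdBytes + thresholdBytes / 4;
            currentPartition = (currentPartition + 1) % numPartitions;
            bytesSentToCurrent = 0;
            batchSentBeforeLimit = false;
        }
        bytesSentToCurrent += recordSizeBytes;
        return currentPartition;
    }

    /** The accumulator would call this when it drains a batch for the current partition. */
    public void onBatchDrained() {
        if (bytesSentToCurrent < thresholdBytes) {
            batchSentBeforeLimit = true;
        }
    }

    public static void main(String[] args) {
        AdaptiveStickySketch chooser = new AdaptiveStickySketch(3);
        for (int i = 0; i < 8; i++) {
            System.out.println("record " + i + " -> partition " + chooser.partitionFor(4096));
        }
    }
}
{code}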


was (Author: hachikuji):
[~junrao] That's a good observation. The partitioner knows the approximate size 
of each batch that gets sent, so it could take that into account before 
rotating partitions. We were thinking of approaches that involved the 
partitioner being aware of how many bytes were in flight, but propagating this 
information from the accumulator to the partitioner is kind of messy. 

Perhaps it is simpler for the partitioner to track the total bytes sent to 
each partition. Maybe we could see this analogously to the TCP window size. 
For example, say you start with an expected size of 10 records (or 1k bytes, 
or whatever). The partitioner keeps writing to the partition until this many 
records/bytes have been sent, regardless of the batching. If the batch is not 
filled when the limit is reached, then we move on to the next partition, but 
we increase the batch size. On the other hand, if a batch gets sent before the 
limit is reached, then we decrease the size. In this way, we can keep a better 
balance between partitions.

>  Sticky partitioner leads to uneven message production, resulting in 
> abnormal delays in some partitions
> --
>
> Key: KAFKA-10888
> URL: https://issues.apache.org/jira/browse/KAFKA-10888
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, producer 
> Affects Versions: 2.4.1
> Reporter: jr
> Priority: Major
> Attachments: image-2020-12-24-21-05-02-800.png, 
> image-2020-12-24-21-09-47-692.png, image-2020-12-24-21-10-24-407.png
>
>
>   110 producers, 550 partitions, 550 consumers, 5-node Kafka cluster.
>   The producers use null keys with the sticky partitioner; the total 
> production rate is about 1,000,000 (100w) TPS.
>   Partition delays are abnormal and message distribution is uneven; the 
> partitions that receive more messages show abnormally high production and 
> consumption delays.
>   I cannot find a reason why the sticky partitioner would make the message 
> distribution uneven at this production rate.
>   I can't switch to the round-robin partitioner, since that would increase 
> delay and CPU cost. Does the sticky partitioner's design cause uneven message 
> distribution, or is this behavior abnormal? How can it be solved?
>   !image-2020-12-24-21-09-47-692.png!
> As shown in the picture, the uneven distribution is concentrated on some 
> partitions and some brokers; there seems to be a pattern.
> This problem does not occur in only one cluster, but in many high-TPS 
> clusters. The problem is more obvious on the test cluster we built.
> !image-2020-12-24-21-10-24-407.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KAFKA-10888) Sticky partitioner leads to uneven message production, resulting in abnormal delays in some partitions

2021-02-16 Thread Jason Gustafson (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-10888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17285383#comment-17285383
 ] 

Jason Gustafson edited comment on KAFKA-10888 at 2/16/21, 6:09 PM:
---

We have found one cause of imbalance when the sticky partitioner is used. 
Basically the intuition behind the sticky partitioner breaks down a little bit 
when a small `linger.ms` is in use (sadly this is the default). The user is 
opting out of batching with this setting which means there is often little 
opportunity to fill batches before they get drained and sent. That leaves the 
door open to sustained imbalance in some cases.
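
For reference, the settings in question look like this in a minimal producer 
(the broker address and topic below are placeholders):

{code:java}
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class LingerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // linger.ms=0 is the default: a batch is sendable as soon as a record arrives,
        // so "fast" partitions rarely accumulate more than a record or two per batch.
        props.put("linger.ms", "0");
        // batch.size (default 16384 bytes) only caps how large a batch can grow.
        props.put("batch.size", "16384");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Null key: partition choice is left to the (sticky) partitioner.
            producer.send(new ProducerRecord<>("test-topic", null, "payload"));
        }
    }
}
{code}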

To see why, suppose that we have a producer writing to 3 partitions with 
linger.ms=0 and one partition slows down a little bit for some reason. It could 
be a leader change or some transient network issue. The producer will have to 
hold onto the batches for that partition until it becomes available. While it 
is holding onto those batches, additional batches will begin piling up. Each of 
these batches is likely to get filled because the producer is not ready to send 
to this partition yet.

Consider this from the perspective of the sticky partitioner. Every time the 
slow partition gets selected, the producer will fill the batches completely. On 
the other hand, the remaining "fast" partitions will likely not get their 
batches filled because of the `linger.ms=0` setting. As soon as a single record 
is available, it might get sent. So more data ends up getting written to the 
partition that has already started to build a backlog. And even after the cause 
of the original slowness (e.g. leader change) gets resolved, it might take some 
time for this imbalance to recover. We believe this can even create a runaway 
effect if the partition cannot catch up with the handicap of the additional 
load.
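
A toy simulation of this scenario (assumptions: a 16-record batch cap, the 
slow partition stays slow for the whole run, and linger.ms=0 modeled as "an 
available partition drains its batch after every record") shows the shape of 
the problem:

{code:java}
import java.util.Random;

public class StickyImbalanceSimulation {
    public static void main(String[] args) {
        int numPartitions = 3;
        int slowPartition = 0;       // never ready to send, so its batches fill up
        int batchCapRecords = 16;    // stand-in for the batch.size limit
        long[] records = new long[numPartitions];
        long[] batches = new long[numPartitions];
        int openBatchRecords = 0;

        Random random = new Random(42);
        int sticky = random.nextInt(numPartitions);

        for (int i = 0; i < 100_000; i++) {
            records[sticky]++;
            openBatchRecords++;

            // linger.ms=0: a fast partition's batch drains right away; the slow
            // partition's batch only completes once it hits the record cap.
            boolean drained = sticky != slowPartition || openBatchRecords >= batchCapRecords;
            if (drained) {
                batches[sticky]++;
                openBatchRecords = 0;
                // A new batch will be needed next, so the sticky partitioner re-picks.
                sticky = random.nextInt(numPartitions);
            }
        }

        for (int p = 0; p < numPartitions; p++) {
            System.out.printf("partition %d: records=%d batches=%d records/batch=%.2f%n",
                    p, records[p], batches[p], records[p] / (double) batches[p]);
        }
    }
}
{code}

On a typical run, each partition ends up with a similar number of batches, but 
the slow partition gets around 16 records per batch versus about 1 on the 
others, which is the same shape as the real data below.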

We analyzed one case where we thought this might be going on. Below I've 
summarized the writes over a period of one hour to 3 partitions. Partition 0 
here is the "slow" partition. All partitions get roughly the same number of 
batches, but the slow partition has much bigger batch sizes.

{code}
Partition  TotalBatches  TotalBytes  TotalRecords  BytesPerBatch  RecordsPerBatch
0          1683          25953200    25228         15420.80       14.99
1          1713          7836878     4622          4574.94        2.70
2          1711          7546212     4381          4410.41        2.56
{code}

After restarting the application, the producer was healthy again. It just was 
not able to recover with the imbalanced workload.


was (Author: hachikuji):
We have found one cause of imbalance when the sticky partitioner is used. 
Basically the intuition behind the sticky partitioner breaks down a little bit 
when a small `linger.ms` is in use (sadly this is the default). The user is 
sort of opting out of batching with this setting which means there is little 
opportunity to fill batches before they get drained and sent. That leaves the 
door open for a kind of imbalanced write problem.

To see why, suppose that we have a producer writing to 3 partitions with 
linger.ms=0 and one partition slows down a little bit for some reason. It could 
be a leader change or some transient network issue. The producer will have to 
hold onto the batches for that partition until it becomes available. While it 
is holding onto those batches, additional batches will begin piling up. Each of 
these batches is likely to get filled because the producer is not ready to send 
to this partition yet.

Consider this from the perspective of the sticky partitioner. Every time the 
slow partition gets selected, the producer will fill the batches completely. On 
the other hand, the remaining "fast" partitions will likely not get their 
batches filled because of the `linger.ms=0` setting. As soon as a single record 
is available, it might get sent. So more data ends up getting written to the 
partition that has already started to build a backlog. And even after the cause 
of the original slowness (e.g. leader change) gets resolved, it might take some 
time for this imbalance to recover. We believe this can even create a runaway 
effect if the partition cannot catch up with the handicap of the additional 
load.

We analyzed one case where we thought this might be going on. Below I've 
summarized the writes over a period of one hour to 3 partitions. Partition 0 
here is the "slow" partition. All partitions get roughly the same number of 
batches, but the slow partition has much bigger batch sizes.

{code}
Partition  TotalBatches  TotalBytes  TotalRecords  BytesPerBatch  RecordsPerBatch
0          1683          25953200    25228         15420.80       14.99
1          1713          7836878     4622          4574.94        2.70
2          1711          7546212     4381          4410.41        2.56
{code}

After restarting the application, the producer was healthy again. It just was 
not able to recover with the imbalanced workload.