[ https://issues.apache.org/jira/browse/KAFKA-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Babrou updated KAFKA-6414:
-------------------------------
    Description: 
Let's suppose the following starting point:

* 1 topic
* 1 partition
* 1 reader
* 24h retention period
* leader outbound bandwidth is 3x its inbound bandwidth (1x replication + 1x
reader + 1x slack = total outbound)

In this scenario, when a replica fails and needs to be brought back from
scratch, it can catch up at 2x inbound bandwidth (1x regular replication + 1x
slack).

At 2x catch-up speed, the replica reaches the point where the leader is now in
24h / 2x = 12h. However, during those 12h the oldest 12h of the topic will fall
off the retention cliff and be deleted. That data is of absolutely no use: it
will never be read from the replica in any scenario. And this does not even
account for the additional 12h of data that accumulates while we catch up.
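To make the arithmetic concrete, here is a back-of-the-envelope sketch (class
and variable names are illustrative only, nothing here refers to actual Kafka
code or configuration):

{code:java}
// Back-of-the-envelope numbers from the scenario above.
public class CatchUpMath {
    public static void main(String[] args) {
        double retentionHours = 24.0;
        double catchUpSpeed = 2.0; // catch-up bandwidth / inbound bandwidth

        // Time to replicate the existing 24h backlog at 2x inbound rate:
        double catchUpHours = retentionHours / catchUpSpeed; // 24 / 2 = 12h

        // By then, the oldest catchUpHours of data have fallen off the
        // retention cliff, so that part of the transfer was pure waste
        // (and another 12h of fresh data accumulated in the meantime):
        double wastedFraction = catchUpHours / retentionHours; // 50%

        System.out.printf("catch-up: %.0fh, wasted replication: %.0f%%%n",
                catchUpHours, wastedFraction * 100);
    }
}
{code}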

My suggestion is to refill sufficiently out-of-sync replicas backwards from the
tip: newest segments first, oldest segments last. Then we can stop when we hit
the retention cliff and replicate far less data. The lower the ratio of
catch-up bandwidth to inbound bandwidth, the higher the returns. This also sets
a hard cap on recovery time: it will be no longer than the retention period as
long as catch-up speed is >1x (if it's less, you're forever out of ISR anyway).
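For illustration, a minimal sketch of what refilling backwards from the tip
could look like, assuming a hypothetical segment-fetch API (SegmentFetcher and
Segment are made-up stand-ins, not real Kafka classes):

{code:java}
import java.util.List;

// Hypothetical sketch only: SegmentFetcher and Segment do not exist in Kafka.
public class InverseRefill {

    record Segment(long baseOffset, long createdMs) {}

    interface SegmentFetcher {
        // Leader's segments ordered newest first (the "tip" comes first).
        List<Segment> leaderSegmentsNewestFirst();
        void fetch(Segment segment);
    }

    static void refill(SegmentFetcher fetcher, long retentionMs, long nowMs) {
        for (Segment segment : fetcher.leaderSegmentsNewestFirst()) {
            // Stop at the retention cliff: anything older would be deleted
            // before the replica could ever serve it.
            if (nowMs - segment.createdMs() > retentionMs) {
                break;
            }
            fetcher.fetch(segment);
        }
    }
}
{code}

The key property is the early break: segments past the cliff are never
transferred at all.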

What exactly "sufficiently out of sync" means in terms of lag is a topic for
debate. The default segment size is 1 GiB; I'd say that being more than one
full segment behind probably warrants this.
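As a strawman, the trigger could be as simple as the following (hypothetical
helper; the 1 GiB constant mirrors the log.segment.bytes default):

{code:java}
// Hypothetical trigger: only switch to backward refill when the replica is
// more than one full segment behind. 1 GiB is the log.segment.bytes default.
static boolean warrantsInverseReplication(long lagBytes) {
    final long SEGMENT_BYTES = 1L << 30; // 1 GiB
    return lagBytes > SEGMENT_BYTES;
}
{code}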

As of now, the workaround for slow recovery appears to be reducing retention to
speed up recovery, which doesn't seem very friendly.



> Inverse replication for replicas that are far behind
> ----------------------------------------------------
>
>                 Key: KAFKA-6414
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6414
>             Project: Kafka
>          Issue Type: Bug
>          Components: replication
>            Reporter: Ivan Babrou
>


