[jira] [Updated] (KAFKA-7886) Some partitions are fully truncated during recovery when log.message.format = 0.10.2 & inter.broker.protocol >= 0.11

2019-01-31 Thread Chris Bogan (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Bogan updated KAFKA-7886:
---
Attachment: (was: java.activation-graph.png)

> Some partitions are fully truncated during recovery when log.message.format = 
> 0.10.2 & inter.broker.protocol >= 0.11
> 
>
> Key: KAFKA-7886
> URL: https://issues.apache.org/jira/browse/KAFKA-7886
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.11.0.0, 2.0.1, 2.1.0
> Environment: centos 7 
>Reporter: Hervé RIVIERE
>Priority: Major
> Attachments: broker.log
>
>
> On a cluster running Kafka 2.0.1, with brokers configured with
>  * inter.broker.protocol.version = 2.0
>  * log.message.format.version = 0.10.2
>  
> In this configuration, when a broker is restarted (clean shutdown), the 
> recovery process for some partitions does not take the high watermark into 
> account and truncates and re-downloads the full partition.
> Typically, for brokers with 500 partitions each / 5 TB of disk usage, the 
> recovery process with this configuration lasts up to 1 hour, whereas it 
> usually takes less than 10 min on the same broker when 
> inter.broker.protocol.version = log.message.format.version.
>  The set of re-downloaded partitions is not predictable: after several 
> restarts of the same broker, the partitions re-downloaded are not always the same.
> Broker log filtered for one specific partition that was re-downloaded (the 
> truncation offset 12878451349 corresponds to the log start offset):
>  
> {code:java}
> 2019-01-31 09:23:34,703 INFO [ProducerStateManager partition=my_topic-11] 
> Writing producer snapshot at offset 13132373966 
> (kafka.log.ProducerStateManager)
> 2019-01-31 09:25:15,245 INFO [Log partition=my_topic-11, dir=/var/lib/kafka] 
> Loading producer state till offset 13132373966 with message format version 1 
> (kafka.log.Log)
> 2019-01-31 09:25:15,245 INFO [ProducerStateManager partition=my_topic-11] 
> Writing producer snapshot at offset 13130789408 
> (kafka.log.ProducerStateManager)
> 2019-01-31 09:25:15,249 INFO [ProducerStateManager partition=my_topic-11] 
> Writing producer snapshot at offset 13131829288 
> (kafka.log.ProducerStateManager)
> 2019-01-31 09:25:15,388 INFO [ProducerStateManager partition=my_topic-11] 
> Writing producer snapshot at offset 13132373966 
> (kafka.log.ProducerStateManager)
> 2019-01-31 09:25:15,388 INFO [Log partition=my_topic-11, dir=/var/lib/kafka] 
> Completed load of log with 243 segments, log start offset 12878451349 and log 
> end offset 13132373966 in 46273 ms (kafka.log.Log)
> 2019-01-31 09:28:38,226 INFO Replica loaded for partition my_topic-11 with 
> initial high watermark 13132373966 (kafka.cluster.Replica)
> 2019-01-31 09:28:38,226 INFO Replica loaded for partition my_topic-11 with 
> initial high watermark 0 (kafka.cluster.Replica)
> 2019-01-31 09:28:38,226 INFO Replica loaded for partition my_topic-11 with 
> initial high watermark 0 (kafka.cluster.Replica)
> 2019-01-31 09:28:42,132 INFO The cleaning for partition my_topic-11 is 
> aborted and paused (kafka.log.LogCleaner)
> 2019-01-31 09:28:42,133 INFO [Log partition=my_topic-11, dir=/var/lib/kafka] 
> Truncating to offset 12878451349 (kafka.log.Log)
> 2019-01-31 09:28:42,135 INFO [Log partition=my_topic-11, dir=/var/lib/kafka] 
> Scheduling log segment [baseOffset 12879521312, size 536869342] for deletion. 
> (kafka.log.Log)
> (...)
> 2019-01-31 09:28:42,521 INFO [Log partition=my_topic-11, dir=/var/lib/kafka] 
> Scheduling log segment [baseOffset 13131829288, size 280543535] for deletion. 
> (kafka.log.Log)
> 2019-01-31 09:28:43,870 WARN [ReplicaFetcher replicaId=11, leaderId=13, 
> fetcherId=1] Truncating my_topic-11 to offset 12878451349 below high 
> watermark 13132373966 (kafka.server.ReplicaFetcherThread)
> 2019-01-31 09:29:03,703 INFO [Log partition=my_topic-11, dir=/var/lib/kafka] 
> Found deletable segments with base offsets [12878451349] due to retention 
> time 25920ms breach (kafka.log.Log)
> 2019-01-31 09:28:42,550 INFO Compaction for partition my_topic-11 is resumed 
> (kafka.log.LogManager)
> {code}
>  
> We successfully reproduced the same bug with Kafka 0.11, 2.0.1 & 2.1.0.
>  
> The same issue appears when we do a rolling restart switching 
> log.message.format to 2.0.
> The issue disappears when all brokers have log.message.format = 2.0 & 
> inter.broker.protocol = 2.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7886) Some partitions are fully truncated during recovery when log.message.format = 0.10.2 & inter.broker.protocol >= 0.11

2019-01-31 Thread Chris Bogan (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Bogan updated KAFKA-7886:
---
Attachment: java.activation-graph.png




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7886) Some partitions are fully truncated during recovery when log.message.format = 0.10.2 & inter.broker.protocol >= 0.11

2019-01-31 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/KAFKA-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hervé RIVIERE updated KAFKA-7886:
-
Description: 
On a cluster running Kafka 2.0.1, with brokers configured with the following settings (a quick way to read them back is sketched just below):
 * inter.broker.protocol.version = 2.0
 * log.message.format.version = 0.10.2
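
The two settings involved can be read back from a running broker with the Kafka AdminClient (2.0 clients). This is only an illustrative sketch, not part of the original report; the bootstrap address and broker id are placeholders, not values from this cluster.

{code:java}
// Sketch (not from the report): reads the two relevant settings from one
// broker via the AdminClient. "broker-1:9092" and broker id "11" are placeholders.
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class CheckBrokerFormats {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-1:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource broker =
                new ConfigResource(ConfigResource.Type.BROKER, "11"); // placeholder broker id
            Config config = admin.describeConfigs(Collections.singleton(broker))
                                 .all().get().get(broker);
            System.out.println("inter.broker.protocol.version = "
                    + config.get("inter.broker.protocol.version").value());
            System.out.println("log.message.format.version = "
                    + config.get("log.message.format.version").value());
        }
    }
}
{code}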

 

In this configuration, when a broker is restarted (clean shutdown), the 
recovery process for some partitions does not take the high watermark into 
account and truncates and re-downloads the full partition.

Typically, for brokers with 500 partitions each / 5 TB of disk usage, the 
recovery process with this configuration lasts up to 1 hour, whereas it 
usually takes less than 10 min on the same broker when 
inter.broker.protocol.version = log.message.format.version.
 The set of re-downloaded partitions is not predictable: after several restarts of 
the same broker, the partitions re-downloaded are not always the same.

Broker log filtered for one specific partition that was re-downloaded (the 
truncation offset 12878451349 corresponds to the log start offset):

 
{code:java}
2019-01-31 09:23:34,703 INFO [ProducerStateManager partition=my_topic-11] 
Writing producer snapshot at offset 13132373966 (kafka.log.ProducerStateManager)
2019-01-31 09:25:15,245 INFO [Log partition=my_topic-11, dir=/var/lib/kafka] 
Loading producer state till offset 13132373966 with message format version 1 
(kafka.log.Log)
2019-01-31 09:25:15,245 INFO [ProducerStateManager partition=my_topic-11] 
Writing producer snapshot at offset 13130789408 (kafka.log.ProducerStateManager)
2019-01-31 09:25:15,249 INFO [ProducerStateManager partition=my_topic-11] 
Writing producer snapshot at offset 13131829288 (kafka.log.ProducerStateManager)
2019-01-31 09:25:15,388 INFO [ProducerStateManager partition=my_topic-11] 
Writing producer snapshot at offset 13132373966 (kafka.log.ProducerStateManager)

2019-01-31 09:25:15,388 INFO [Log partition=my_topic-11, dir=/var/lib/kafka] 
Completed load of log with 243 segments, log start offset 12878451349 and log 
end offset 13132373966 in 46273 ms (kafka.log.Log)

2019-01-31 09:28:38,226 INFO Replica loaded for partition my_topic-11 with 
initial high watermark 13132373966 (kafka.cluster.Replica)
2019-01-31 09:28:38,226 INFO Replica loaded for partition my_topic-11 with 
initial high watermark 0 (kafka.cluster.Replica)
2019-01-31 09:28:38,226 INFO Replica loaded for partition my_topic-11 with 
initial high watermark 0 (kafka.cluster.Replica)
2019-01-31 09:28:42,132 INFO The cleaning for partition my_topic-11 is aborted 
and paused (kafka.log.LogCleaner)

2019-01-31 09:28:42,133 INFO [Log partition=my_topic-11, dir=/var/lib/kafka] 
Truncating to offset 12878451349 (kafka.log.Log)

2019-01-31 09:28:42,135 INFO [Log partition=my_topic-11, dir=/var/lib/kafka] 
Scheduling log segment [baseOffset 12879521312, size 536869342] for deletion. 
(kafka.log.Log)
(...)
2019-01-31 09:28:42,521 INFO [Log partition=my_topic-11, dir=/var/lib/kafka] 
Scheduling log segment [baseOffset 13131829288, size 280543535] for deletion. 
(kafka.log.Log)
2019-01-31 09:28:43,870 WARN [ReplicaFetcher replicaId=11, leaderId=13, 
fetcherId=1] Truncating my_topic-11 to offset 12878451349 below high watermark 
13132373966 (kafka.server.ReplicaFetcherThread)
2019-01-31 09:29:03,703 INFO [Log partition=my_topic-11, dir=/var/lib/kafka] 
Found deletable segments with base offsets [12878451349] due to retention time 
25920ms breach (kafka.log.Log)
2019-01-31 09:28:42,550 INFO Compaction for partition my_topic-11 is resumed 
(kafka.log.LogManager)
{code}
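
For reference, the affected partitions can be enumerated by matching the ReplicaFetcherThread warning shown above in the broker log. The following is a minimal editorial sketch (not from the report); the log file path is a placeholder.

{code:java}
// Sketch: scans a broker log for "Truncating ... below high watermark"
// warnings and prints, per partition, roughly how many offsets must be re-fetched.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindDeepTruncations {
    private static final Pattern TRUNCATION = Pattern.compile(
            "Truncating (\\S+) to offset (\\d+) below high watermark (\\d+)");

    public static void main(String[] args) throws IOException {
        // Placeholder path; point this at the broker's server log.
        Files.lines(Paths.get("/var/log/kafka/server.log")).forEach(line -> {
            Matcher m = TRUNCATION.matcher(line);
            if (m.find()) {
                long truncateTo = Long.parseLong(m.group(2));
                long highWatermark = Long.parseLong(m.group(3));
                System.out.printf("%s: truncated to %d, high watermark %d, ~%d offsets to re-fetch%n",
                        m.group(1), truncateTo, highWatermark, highWatermark - truncateTo);
            }
        });
    }
}
{code}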
 

We successfully reproduced the same bug with Kafka 0.11, 2.0.1 & 2.1.0.

 

The same issue appears when we do a rolling restart switching 
log.message.format to 2.0.

The issue disappears when all brokers have log.message.format = 2.0 & 
inter.broker.protocol = 2.0.
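
To verify that alignment across the whole cluster, the per-broker check sketched earlier can simply be repeated for every broker id and any mismatch reported. Again an editorial sketch with placeholder broker ids and bootstrap address; the string comparison is naive, since reported values may carry an -IVn suffix.

{code:java}
// Sketch: flags brokers whose log.message.format.version differs from
// inter.broker.protocol.version. Broker ids and bootstrap address are placeholders.
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class CheckFormatAlignment {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-1:9092"); // placeholder
        List<String> brokerIds = Arrays.asList("11", "12", "13"); // placeholder ids
        try (AdminClient admin = AdminClient.create(props)) {
            for (String id : brokerIds) {
                ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, id);
                Config config = admin.describeConfigs(Collections.singleton(broker))
                                     .all().get().get(broker);
                String ibp = config.get("inter.broker.protocol.version").value();
                String fmt = config.get("log.message.format.version").value();
                if (!ibp.equals(fmt)) { // naive comparison, see note above
                    System.out.printf("broker %s: inter.broker.protocol.version=%s, "
                            + "log.message.format.version=%s (mismatch)%n", id, ibp, fmt);
                }
            }
        }
    }
}
{code}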

  was:
On a cluster running Kafka 2.0.1, with brokers configured with
 * inter.broker.protocol.version = 2.0
 * log.message.format.version = 0.10.2

 

In this configuration, when a broker is restarted (clean shutdown), the 
recovery process for some partitions does not take the high watermark into 
account and truncates and re-downloads the full partition.

Typically, for brokers with 500 partitions each / 5 TB of disk usage, the 
recovery process with this configuration lasts up to 1 hour, whereas it 
usually takes less than 10 min on the same broker when 
inter.broker.protocol.version = log.message.format.version.
 The set of re-downloaded partitions is not predictable: after several restarts of 
the same broker, the partitions re-downloaded are not always the same.

Broker log filtered for one specific partition that was re-downloaded (the 
truncation offset 12878451349 corresponds to the log start offset):

 
{code:java}
2019-01-31 09:23:34,703 INFO [ProducerStateManager partition=my_topic-11] 
Writing producer snapshot at offset 13132373966 (kafka.log.ProducerStateManager)

[jira] [Updated] (KAFKA-7886) Some partitions are fully truncated during recovery when log.message.format = 0.10.2 & inter.broker.protocol >= 0.11

2019-01-31 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/KAFKA-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hervé RIVIERE updated KAFKA-7886:
-
Description: 
On a cluster running Kafka 2.0.1, with brokers configured with
 * inter.broker.protocol.version = 2.0
 * log.message.format.version = 0.10.2

 

In this configuration, when a broker is restarted (clean shutdown), the 
recovery process for some partitions does not take the high watermark into 
account and truncates and re-downloads the full partition.

Typically, for brokers with 500 partitions each / 5 TB of disk usage, the 
recovery process with this configuration lasts up to 1 hour, whereas it 
usually takes less than 10 min on the same broker when 
inter.broker.protocol.version = log.message.format.version.
 The set of re-downloaded partitions is not predictable: after several restarts of 
the same broker, the partitions re-downloaded are not always the same.

Broker log filtered for one specific partition that was re-downloaded (the 
truncation offset 12878451349 corresponds to the log start offset):

 
{code:java}
2019-01-31 09:23:34,703 INFO [ProducerStateManager partition=my_topic-11] 
Writing producer snapshot at offset 13132373966 (kafka.log.ProducerStateManager)
2019-01-31 09:25:15,245 INFO [Log partition=my_topic-11, dir=/var/lib/kafka] 
Loading producer state till offset 13132373966 with message format version 1 
(kafka.log.Log)
2019-01-31 09:25:15,245 INFO [ProducerStateManager partition=my_topic-11] 
Writing producer snapshot at offset 13130789408 (kafka.log.ProducerStateManager)
2019-01-31 09:25:15,249 INFO [ProducerStateManager partition=my_topic-11] 
Writing producer snapshot at offset 13131829288 (kafka.log.ProducerStateManager)
2019-01-31 09:25:15,388 INFO [ProducerStateManager partition=my_topic-11] 
Writing producer snapshot at offset 13132373966 (kafka.log.ProducerStateManager)

2019-01-31 09:25:15,388 INFO [Log partition=my_topic-11, dir=/var/lib/kafka] 
Completed load of log with 243 segments, log start offset 12878451349 and log 
end offset 13132373966 in 46273 ms (kafka.log.Log)

2019-01-31 09:28:38,226 INFO Replica loaded for partition my_topic-11 with 
initial high watermark 13132373966 (kafka.cluster.Replica)
2019-01-31 09:28:38,226 INFO Replica loaded for partition my_topic-11 with 
initial high watermark 0 (kafka.cluster.Replica)
2019-01-31 09:28:38,226 INFO Replica loaded for partition my_topic-11 with 
initial high watermark 0 (kafka.cluster.Replica)
2019-01-31 09:28:42,132 INFO The cleaning for partition my_topic-11 is aborted 
and paused (kafka.log.LogCleaner)

2019-01-31 09:28:42,133 INFO [Log partition=my_topic-11, dir=/var/lib/kafka] 
Truncating to offset 12878451349 (kafka.log.Log)

2019-01-31 09:28:42,135 INFO [Log partition=my_topic-11, dir=/var/lib/kafka] 
Scheduling log segment [baseOffset 12879521312, size 536869342] for deletion. 
(kafka.log.Log)
(...)
2019-01-31 09:28:42,521 INFO [Log partition=my_topic-11, dir=/var/lib/kafka] 
Scheduling log segment [baseOffset 13131829288, size 280543535] for deletion. 
(kafka.log.Log)
2019-01-31 09:28:43,870 WARN [ReplicaFetcher replicaId=11, leaderId=13, 
fetcherId=1] Truncating my_topic-11 to offset 12878451349 below high watermark 
13132373966 (kafka.server.ReplicaFetcherThread)
2019-01-31 09:29:03,703 INFO [Log partition=my_topic-11, dir=/var/lib/kafka] 
Found deletable segments with base offsets [12878451349] due to retention time 
25920ms breach (kafka.log.Log)
2019-01-31 09:28:42,550 INFO Compaction for partition my_topic-11 is resumed 
(kafka.log.LogManager)
{code}
 

We successfully reproduced the same bug with Kafka 0.11, 2.0.1 & 2.1.0.

 

The same issue appears when we do a rolling restart switching 
log.message.format to 2.0.

Issue disappears when all brokers are in 

 

  was:
On a cluster running Kafka 2.0.1, with brokers configured with
* inter.broker.protocol.version = 2.0
* log.message.format.version = 0.10.2

 

In this configuration, when a broker is restarted (clean shutdown), the 
recovery process for some partitions does not take the high watermark into 
account and truncates and re-downloads the full partition.


Typically, for brokers with 500 partitions each / 5 TB of disk usage, the 
recovery process with this configuration lasts up to 1 hour, whereas it 
usually takes less than 10 min on the same broker when 
inter.broker.protocol.version = log.message.format.version.
The set of re-downloaded partitions is not predictable: after several restarts of 
the same broker, the partitions re-downloaded are not always the same.


Broker log filtered for one specific partition that was re-downloaded (the 
truncation offset 12878451349 corresponds to the log start offset):

 
{code:java}
2019-01-31 09:23:34,703 INFO [ProducerStateManager partition=my_topic-11] 
Writing producer snapshot at offset 13132373966 (kafka.log.ProducerStateManager)
2019-01-31 09:25:15,245 INFO [Log partition=my_topic-11,