[
https://issues.apache.org/jira/browse/KAFKA-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408726#comment-15408726
]
Jun Rao commented on KAFKA-3518:
--------------------------------
Took a look at the log. The following are my findings.
>From report.txt, it shows the following messages are missing: 33867, 33870,
>33873, 33876, 33879, 33882, 33885
Grepping those message from the producer output show the following. All the
missing messages are consecutive in the log of test_topic partition 2.
{"topic":"test_topic","partition":2,"name":"producer_send_success","value":"33867","class":"class
org.apache.kafka.tools.VerifiableProducer","time_ms":1459946788801,"offset":11289,"key":null}
{"topic":"test_topic","partition":2,"name":"producer_send_success","value":"33870","class":"class
org.apache.kafka.tools.VerifiableProducer","time_ms":1459946788801,"offset":11290,"key":null}
{"topic":"test_topic","partition":2,"name":"producer_send_success","value":"33873","class":"class
org.apache.kafka.tools.VerifiableProducer","time_ms":1459946788801,"offset":11291,"key":null}
{"topic":"test_topic","partition":2,"name":"producer_send_success","value":"33876","class":"class
org.apache.kafka.tools.VerifiableProducer","time_ms":1459946788801,"offset":11292,"key":null}
{"topic":"test_topic","partition":2,"name":"producer_send_success","value":"33879","class":"class
org.apache.kafka.tools.VerifiableProducer","time_ms":1459946788801,"offset":11293,"key":null}
{"topic":"test_topic","partition":2,"name":"producer_send_success","value":"33882","class":"class
org.apache.kafka.tools.VerifiableProducer","time_ms":1459946788801,"offset":11294,"key":null}
{"topic":"test_topic","partition":2,"name":"producer_send_success","value":"33885","class":"class
org.apache.kafka.tools.VerifiableProducer","time_ms":1459946788801,"offset":11295,"key":null}
The broker logs show the following.
worker9:
[2016-04-06 12:46:28,812] INFO Truncating log test_topic-2 to offset 11296.
(kafka.log.Log)
[2016-04-06 12:46:36,867] INFO Truncating log test_topic-2 to offset 11296.
(kafka.log.Log)
[2016-04-06 12:46:39,032] INFO Truncating log test_topic-2 to offset 11296.
(kafka.log.Log)
[2016-04-06 12:46:39,319] INFO Truncating log test_topic-2 to offset 11289.
(kafka.log.Log)
[2016-04-06 12:48:33,940] INFO Truncating log test_topic-2 to offset 19965.
(kafka.log.Log)
worker10:
[2016-04-06 12:45:47,104] INFO Truncating log test_topic-2 to offset 0.
(kafka.log.Log)
[2016-04-06 12:46:16,088] INFO Truncating log test_topic-2 to offset 3952.
(kafka.log.Log)
[2016-04-06 12:46:54,141] INFO Truncating log test_topic-2 to offset 11289.
(kafka.log.Log)
[2016-04-06 12:46:56,494] INFO Truncating log test_topic-2 to offset 16968.
(kafka.log.Log)
worker 1:
[2016-04-06 12:45:47,108] INFO Truncating log test_topic-2 to offset 0.
(kafka.log.Log)
[2016-04-06 12:46:26,867] INFO Truncating log test_topic-2 to offset 7790.
(kafka.log.Log)
[2016-04-06 12:46:28,849] INFO Truncating log test_topic-2 to offset 11289.
(kafka.log.Log)
[2016-04-06 12:46:56,489] INFO Truncating log test_topic-2 to offset 16972.
(kafka.log.Log)
[2016-04-06 12:47:03,900] INFO Truncating log test_topic-2 to offset 16972.
(kafka.log.Log)
[2016-04-06 12:48:33,954] INFO Truncating log test_topic-2 to offset 19965.
(kafka.log.Log)
Offset 11289 is truncated a few times. Currently, there is a known issue
(KAFKA-1211) that if the follower becomes the leader very quickly after
truncation, some previously committed messages could be lost. This can happen
if the broker are restarted very quickly, which doesn't seem to be the case
from the test_log.debug.
[DEBUG - 2016-04-06 12:46:05,764 - remoteaccount - _ssh_quiet - lineno:213]:
Trying to run remote command: ssh ubuntu@worker10 -o 'HostName
ec2-54-187-119-87.us-west-2.compute.amazonaws.com' -o 'Port 22' -o
'UserKnownHostsFile /dev/null' -o 'StrictHostKeyChecking no' -o
'PasswordAuthentication no' -o 'IdentityFile /var/lib/jenkins/muckrake.pem' -o
'IdentitiesOnly yes' -o 'LogLevel FATAL' 'kill -15 28525'
[DEBUG - 2016-04-06 12:46:17,411 - remoteaccount - _ssh_quiet - lineno:213]:
Trying to run remote command: ssh ubuntu@worker1 -o 'HostName
ec2-54-213-143-217.us-west-2.compute.amazonaws.com' -o 'Port 22' -o
'UserKnownHostsFile /dev/null' -o 'StrictHostKeyChecking no' -o
'PasswordAuthentication no' -o 'IdentityFile /var/lib/jenkins/muckrake.pem' -o
'IdentitiesOnly yes' -o 'LogLevel FATAL' 'kill -15 7855'
[DEBUG - 2016-04-06 12:46:28,122 - remoteaccount - _ssh_quiet - lineno:213]:
Trying to run remote command: ssh ubuntu@worker9 -o 'HostName
ec2-54-200-156-141.us-west-2.compute.amazonaws.com' -o 'Port 22' -o
'UserKnownHostsFile /dev/null' -o 'StrictHostKeyChecking no' -o
'PasswordAuthentication no' -o 'IdentityFile /var/lib/jenkins/muckrake.pem' -o
'IdentitiesOnly yes' -o 'LogLevel FATAL' 'kill -15 8372'
[DEBUG - 2016-04-06 12:46:38,291 - remoteaccount - _ssh_quiet - lineno:213]:
Trying to run remote command: ssh ubuntu@worker10 -o 'HostName
ec2-54-187-119-87.us-west-2.compute.amazonaws.com' -o 'Port 22' -o
'UserKnownHostsFile /dev/null' -o 'StrictHostKeyChecking no' -o
'PasswordAuthentication no' -o 'IdentityFile /var/lib/jenkins/muckrake.pem' -o
'IdentitiesOnly yes' -o 'LogLevel FATAL' 'kill -15 30050'
[DEBUG - 2016-04-06 12:46:55,532 - remoteaccount - _ssh_quiet - lineno:213]:
Trying to run remote command: ssh ubuntu@worker1 -o 'HostName
ec2-54-213-143-217.us-west-2.compute.amazonaws.com' -o 'Port 22' -o
'UserKnownHostsFile /dev/null' -o 'StrictHostKeyChecking no' -o
'PasswordAuthentication no' -o 'IdentityFile /var/lib/jenkins/muckrake.pem' -o
'IdentitiesOnly yes' -o 'LogLevel FATAL' 'kill -15 9259'
Another possibility is due to KAFKA-3670 during preferred leader election. I
wasn't able to fully verify if this is indeed the case since we didn't enable
TRACE level logging in the controller and state-change.log where we log leader
change information. Filed KAFKA-4021 to track that.
We fixed KAFKA-3670 on May 8. I checked the system test history after May 8.
This test only failed 2 more times. However, the failure is not due to message
loss. In one case
(http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2016-05-10--001.1462898929--apache--trunk--6f18732/report.html),
the broker can't be started and in the other
(http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2016-05-09--001.1462811506--apache--trunk--f5b98b8/report.html),
the controller can't be identified. So, it seems this particular issue hasn't
shown up again and is likely fixed by KAFKA-3670. I think we can close this
jira for now.
The more complete fix will happen in KAFKA-1211, but it may take some time.
> Transient failure in ReplicationTest.test_replication_with_broker_failure
> with ConsumerTimeoutException
> -------------------------------------------------------------------------------------------------------
>
> Key: KAFKA-3518
> URL: https://issues.apache.org/jira/browse/KAFKA-3518
> Project: Kafka
> Issue Type: Bug
> Components: system tests
> Reporter: Ewen Cheslack-Postava
>
> From
> http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2016-04-06--001.1459956800--apache--trunk--99d2329/report.html
> this test failed on a recent run:
> Module: kafkatest.tests.core.replication_test
> Class: ReplicationTest
> Method: test_replication_with_broker_failure
> Arguments:
> {
> "broker_type": "leader",
> "failure_mode": "clean_bounce",
> "security_protocol": "SASL_PLAINTEXT"
> }
> It is missing 7 messages on the consumer trying to validate the messages. The
> consumer hit its timeout, which should be sufficiently long to capture any
> successfully produced messages. I didn't notice anything obviously wrong in
> the logs. Could be some sort of consumer stall?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)