[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart
[ https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034892#comment-15034892 ]

Rajini Sivaram commented on KAFKA-2891:
---------------------------------------

[~benstopford] The logs from my failing test runs all show the same pattern - ISR set to 1 and messages acked when leader is the only ISR. When the leader gets killed by the test, messages are lost, as you would expect. The test was intended to run with min.insync.replicas set to 2, but due to a bug in the way min.insync.replicas was being set for topics, it was being left as default of one. All tests which currently set min.insync.replicas have copied the same config with the result that the config is never set. I have updated the PR for KAFKA-2642 with a fix for the min.insync.replicas setting in all the tests which set this. Have scheduled a build with the fix and will check the results in the morning.

> Gaps in messages delivered by new consumer after Kafka restart
> --------------------------------------------------------------
>
>                 Key: KAFKA-2891
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2891
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer
>    Affects Versions: 0.9.0.0
>            Reporter: Rajini Sivaram
>            Priority: Critical
>
> Replication tests when run with the new consumer with SSL/SASL were failing
> very often because messages were not being consumed from some topics after a
> Kafka restart. The fix in KAFKA-2877 has made this a lot better. But I am
> still seeing some failures (less often now) because a small set of messages
> are not received after Kafka restart. This failure looks slightly different
> from the one before the fix for KAFKA-2877 was applied, hence the new defect.
> The test fails because not all acked messages are received by the consumer,
> and the number of messages missing are quite small.
> [~benstopford] Are the upgrade tests working reliably with KAFKA-2877 now?
> Not sure if any of these log entries are important:
> {quote}
> [2015-11-25 14:41:12,342] INFO SyncGroup for group test-consumer-group failed
> due to NOT_COORDINATOR_FOR_GROUP, will find new coordinator and rejoin
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,342] INFO Marking the coordinator 2147483644 dead.
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,958] INFO Attempt to join group test-consumer-group
> failed due to unknown member id, resetting and retrying.
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:42,437] INFO Fetch offset null is out of range, resetting
> offset (org.apache.kafka.clients.consumer.internals.Fetcher)
> {quote}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
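
The failure pattern described in the comment above (messages acked while the leader is the only ISR, then lost when the leader is killed) can be illustrated with a toy model. This is only a sketch, not Kafka code: the `Partition` class, its methods, and `run` are hypothetical names invented for the illustration. It assumes a producer using acks=all (min.insync.replicas only takes effect in that mode) and assumes that only the follower's log survives the hard kill.

```python
class Partition:
    """Toy model of one partition with a leader and a single follower.

    Hypothetical illustration only; this is not how Kafka is implemented.
    """

    def __init__(self, min_insync_replicas):
        self.min_insync_replicas = min_insync_replicas
        self.leader_log = []
        self.follower_log = []
        self.follower_in_sync = True

    def isr_size(self):
        # The leader is always in the ISR; the follower only while in sync.
        return 2 if self.follower_in_sync else 1

    def produce(self, msg):
        """Return True if the write is acked (producer assumed to use acks=all)."""
        if self.isr_size() < self.min_insync_replicas:
            return False  # write rejected: not enough in-sync replicas
        self.leader_log.append(msg)
        if self.follower_in_sync:
            self.follower_log.append(msg)
        return True

    def kill_leader(self):
        """Hard-kill the leader; assume only the follower's log survives."""
        return list(self.follower_log)


def run(min_insync_replicas):
    """Produce five messages, drop the follower from the ISR before the
    fourth, kill the leader, and return the acked messages that were lost."""
    partition = Partition(min_insync_replicas)
    acked = set()
    for msg in range(5):
        if msg == 3:
            partition.follower_in_sync = False
        if partition.produce(msg):
            acked.add(msg)
    surviving = set(partition.kill_leader())
    return acked - surviving


print(run(1))  # messages acked with the leader as the only ISR are lost
print(run(2))  # nothing acked is lost; the late writes were rejected instead
```

With the default of 1 the model loses acked messages; with 2 the same writes are refused rather than acked, which is the behaviour the test was intended to rely on.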
[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart
[ https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034222#comment-15034222 ]

Ben Stopford commented on KAFKA-2891:
-------------------------------------

[~rsivaram] I found an error in my analysis of KAFKA-2909, meaning that JIRA refers to actual data loss. KAFKA-2908 remains a client-side issue. This puts more evidence behind your theory that nodes are being killed before data is replicated. I'll be interested to see if this change is stable on EC2.
[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart
[ https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034103#comment-15034103 ]

Ben Stopford commented on KAFKA-2891:
-------------------------------------

[~rsivaram] The one thing we know for sure is that putting time between bounces solves the problem. Checking for the ISR to have two entries is a good option. You may even need to pad with pauses. It'd be great to get this test merged though, even if we have to go back to refactor it later.
[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart
[ https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033850#comment-15033850 ]

Rajini Sivaram commented on KAFKA-2891:
---------------------------------------

[~benstopford] I don't see errors in my local replication test runs when run with PLAINTEXT with either new consumer or old consumer. But it could just be hiding timing issues because the consumer is faster. I will run the tests again tonight with the fix from KAFKA-2913. I am hopeful that once the issues you are seeing are fixed, the replication tests would just work :-)
[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart
[ https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033724#comment-15033724 ]

Ben Stopford commented on KAFKA-2891:
-------------------------------------

[~rsivaram] So, in my investigations, even with min.insync.replicas = 2 + clean_shutdown, additional pauses are needed between bounces to get long-term stability on EC2. My theory is that this is a consumer-side problem, because I don't see evidence of data loss in Kafka. Maybe by waiting for the ISR to hit 2 you are getting similar behaviour. Your test is a little more extreme though, due to the hard_bounce.
[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart
[ https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033694#comment-15033694 ]

Rajini Sivaram commented on KAFKA-2891:
---------------------------------------

[~benstopford] Yes, you are right, replication test does set min.insync.replicas, ignore my previous comment.
[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart
[ https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033560#comment-15033560 ]

Ben Stopford commented on KAFKA-2891:
-------------------------------------

[~rsivaram] That sounds reasonable to me. I'm also surprised it works reliably with hard bounce currently. Note also that there are a couple of examples (in subtasks) of intermittent failures which look consumer related (as the data makes it to Kafka). Jason kindly took a look at this yesterday with one related fix [KAFKA-2913|https://issues.apache.org/jira/browse/KAFKA-2913].
[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart
[ https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033457#comment-15033457 ]

Rajini Sivaram commented on KAFKA-2891:
---------------------------------------

[~geoffra] Replication tests expect all ack'ed messages to be received even though they run with the default min.insync.replicas=1. The test kills the leader of a partition in a loop while messages are being produced and consumed. This can (and does) result in the ISR dropping down to 1 (just the leader is in the ISR list). Messages published when there are no other in-sync replicas are lost if the leader (the only ISR) is killed. It seems to me that the test's expectations are too high. When I modify the test (hard_bounce with SSL/SASL) to wait until there are at least two entries in the ISR list before killing the leader, it passes reliably in my local test runs. I wonder if the only reason this test has been working is that PLAINTEXT consumers keep up with the producer and hence are unlikely to lose messages. Would it be a reasonable change to the test to ensure that there are at least two ISRs before killing the leader?
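
The proposed test change (wait until the ISR has at least two entries before killing the leader) could be sketched as a small polling helper. This is a minimal sketch under stated assumptions: `wait_for_isr_size` and its parameters are hypothetical names, and `get_isr` stands for whatever caller-supplied function the test would use to look up the current ISR; how the real system tests obtain the ISR is not shown here.

```python
import time


def wait_for_isr_size(get_isr, min_size=2, timeout_sec=60.0, backoff_sec=0.5,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll get_isr() until it returns at least min_size replica ids.

    get_isr is a hypothetical caller-supplied callable returning the current
    in-sync replica ids of the partition under test. Returns the ISR that
    satisfied the condition, or raises TimeoutError once timeout_sec elapses.
    """
    deadline = clock() + timeout_sec
    while True:
        isr = list(get_isr())
        if len(isr) >= min_size:
            return isr
        if clock() >= deadline:
            raise TimeoutError(
                "ISR %r still smaller than %d after %.1fs"
                % (isr, min_size, timeout_sec))
        sleep(backoff_sec)


# Simulated ISR lookups: the second replica rejoins on the third poll.
polls = iter([[1], [1], [1, 2]])
isr = wait_for_isr_size(lambda: next(polls), backoff_sec=0.0)
print(isr)
```

In a bounce loop, a call like this would sit immediately before the step that kills the leader, so that every acked message has had a chance to reach a second replica.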
[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart
[ https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15030455#comment-15030455 ]

Ben Stopford commented on KAFKA-2891:
-------------------------------------

Sorry [~hachikuji]- typo - should have said "implies the problem should not be consumer side". now changed.
[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart
[ https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15030337#comment-15030337 ]

Jason Gustafson commented on KAFKA-2891:
----------------------------------------

[~benstopford] To be clear, are you saying that the message gap is on the server side? In other words, the messages were successfully acked by the producer, but were then lost?
[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart
[ https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15030082#comment-15030082 ]

Ben Stopford commented on KAFKA-2891:
-------------------------------------

One more bit of info - when this problem occurs the missing messages are not in the server data files. This implies the problem should not be on the consumer side. However, we don't seem to see this when the old consumer is used.
[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart
[ https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028771#comment-15028771 ]

Rajini Sivaram commented on KAFKA-2891:
---------------------------------------

[~benstopford] Thank you, it looks like the same problem as KAFKA-2827 in my test logs too. Will rerun the tests when that is fixed.
[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart
[ https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15027301#comment-15027301 ]

Ben Stopford commented on KAFKA-2891:
-------------------------------------

So I'm starting to think the problem may be related to https://issues.apache.org/jira/browse/KAFKA-2827 (in my case at least). There are periods where the ISR drops to 1, which it shouldn't do during a clean bounce. Adding artificial pauses between node restarts also appears to remove the problem. Not definitive yet. Just a heads up.
[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart
[ https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15026907#comment-15026907 ]

Ben Stopford commented on KAFKA-2891:
-------------------------------------

Yes. I get exactly the same. It worked fine for about six runs, then I got a run with:

At least one acked message did not appear in the consumed messages. acked_minus_consumed: set([29073, 29067, 29076, 29070, 29079])

which I have not seen before (i.e. just a few messages missing).
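
The `acked_minus_consumed` value in the failure message above reads as a plain set difference between the message ids the producer saw acked and the ids the consumer received. A minimal sketch of that check follows; `missing_acked` is a hypothetical helper name, not the test framework's own function, and the surrounding id range in the example is made up for illustration (only the five missing ids come from the comment).

```python
def missing_acked(acked, consumed):
    """Return, in sorted order, the acked message ids that were never consumed."""
    return sorted(set(acked) - set(consumed))


# The five ids below are the ones reported in the comment; the surrounding
# range 29060..29084 is an invented example, not data from the test run.
reported_missing = {29073, 29067, 29076, 29070, 29079}
acked = range(29060, 29085)
consumed = [m for m in acked if m not in reported_missing]

gaps = missing_acked(acked, consumed)
print(gaps)
```

Sorting the result makes a small gap like this one easier to read than the unordered `set([...])` form in the original message.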