Rajini Sivaram created KAFKA-3488:
-------------------------------------

             Summary: commitAsync() fails if metadata update creates new SASL/SSL connection
                 Key: KAFKA-3488
                 URL: https://issues.apache.org/jira/browse/KAFKA-3488
             Project: Kafka
          Issue Type: Bug
          Components: consumer
    Affects Versions: 0.9.0.1
            Reporter: Rajini Sivaram
            Assignee: Rajini Sivaram

Sasl/SslConsumerTest.testSimpleConsumption() fails intermittently with a failure in {{commitAsync()}}. The exception stack trace shows:

{quote}
kafka.api.SaslPlaintextConsumerTest.testSimpleConsumption FAILED
    java.lang.AssertionError: expected:<1> but was:<0>
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.failNotEquals(Assert.java:834)
        at org.junit.Assert.assertEquals(Assert.java:645)
        at org.junit.Assert.assertEquals(Assert.java:631)
        at kafka.api.BaseConsumerTest.awaitCommitCallback(BaseConsumerTest.scala:340)
        at kafka.api.BaseConsumerTest.testSimpleConsumption(BaseConsumerTest.scala:85)
{quote}

I have recreated this with some additional tracing. The tests run with a very small metadata expiry interval, which triggers metadata updates quite often. If a metadata request immediately following a {{commitAsync()}} call creates a new SSL/SASL connection, {{ConsumerNetworkClient.poll}} returns to process the connection handshake packets. Since {{ConsumerNetworkClient.poll}} discards all unsent packets before returning, this can cause the commit to fail: the callback is invoked with {{SendFailedException}}.

I understand that {{ConsumerNetworkClient.poll()}} discards unsent packets rather than buffering them in order to keep the code simple. And perhaps it is ok for {{commitAsync}} to fail occasionally, since the callback does indicate that the caller should retry. But it feels like an unnecessary limitation: it requires error handling in client applications when there are no real failures, and it makes reliable testing much harder. Since special handling to fix issues like KAFKA-3412 and KAFKA-2672 adds more complexity to the code anyway, and because failures that affect only SSL/SASL are much harder to debug, it may be worth considering improving this behaviour.

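The failure sequence described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: {{ToyClient}}, {{Request}}, {{startHandshake}} and {{commitOnce}} are hypothetical names invented for illustration, they are not the actual {{ConsumerNetworkClient}} code, and {{IllegalStateException}} merely stands in for {{SendFailedException}}. It shows a poll that is used up by a handshake and then fails whatever is still unsent.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

// Toy model only: these classes are hypothetical and are NOT Kafka's ConsumerNetworkClient.
class DiscardingPollSketch {

    /** A queued request plus the callback to invoke on completion (null error = success). */
    static final class Request {
        final String name;
        final Consumer<Exception> callback;

        Request(String name, Consumer<Exception> callback) {
            this.name = name;
            this.callback = callback;
        }
    }

    /** Minimal client whose poll() fails whatever is still unsent when it returns. */
    static final class ToyClient {
        private final Deque<Request> unsent = new ArrayDeque<>();
        private boolean handshakePending;

        void send(Request r) {
            unsent.add(r);
        }

        // Models a metadata update that opens a new SSL/SASL connection.
        void startHandshake() {
            handshakePending = true;
        }

        void poll() {
            if (handshakePending) {
                // This poll is used up by handshake packets; nothing queued gets sent.
                handshakePending = false;
            } else {
                while (!unsent.isEmpty()) {
                    unsent.poll().callback.accept(null); // sent and acknowledged
                }
            }
            // Anything still unsent is discarded and its callback sees a failure.
            // IllegalStateException is a stand-in for the real exception type.
            while (!unsent.isEmpty()) {
                Request r = unsent.poll();
                r.callback.accept(new IllegalStateException("send failed: " + r.name));
            }
        }
    }

    /** Queues one commit, optionally starts a handshake first, polls once, returns outcomes. */
    static List<String> commitOnce(boolean handshakeBeforePoll) {
        List<String> outcomes = new ArrayList<>();
        ToyClient client = new ToyClient();
        client.send(new Request("commit", err -> outcomes.add(err == null ? "ok" : "failed")));
        if (handshakeBeforePoll) {
            client.startHandshake();
        }
        client.poll();
        return outcomes;
    }

    public static void main(String[] args) {
        System.out.println("no handshake:   " + commitOnce(false));
        System.out.println("with handshake: " + commitOnce(true));
    }
}
```

In this model the commit succeeds when no handshake intervenes and its callback receives an error when one does, even though nothing has actually gone wrong with the connection. A buffering variant would simply leave the request queued for the next poll instead of failing it.
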
I will see if I can submit a PR for the specific issue I was seeing with the impact of handshakes on {{commitAsync()}}, but I will be interested in views on improving the logic in {{ConsumerNetworkClient}}.