Rajini Sivaram created KAFKA-3488:
-------------------------------------

             Summary: commitAsync() fails if metadata update creates new 
SASL/SSL connection
                 Key: KAFKA-3488
                 URL: https://issues.apache.org/jira/browse/KAFKA-3488
             Project: Kafka
          Issue Type: Bug
          Components: consumer
    Affects Versions: 0.9.0.1
            Reporter: Rajini Sivaram
            Assignee: Rajini Sivaram


Sasl/SslConsumerTest.testSimpleConsumption() fails intermittently with a 
failure in {{commitAsync()}}. The exception stack trace shows:

{quote}
kafka.api.SaslPlaintextConsumerTest.testSimpleConsumption FAILED
java.lang.AssertionError: expected:<1> but was:<0>
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.failNotEquals(Assert.java:834)
        at org.junit.Assert.assertEquals(Assert.java:645)
        at org.junit.Assert.assertEquals(Assert.java:631)
        at 
kafka.api.BaseConsumerTest.awaitCommitCallback(BaseConsumerTest.scala:340)
        at 
kafka.api.BaseConsumerTest.testSimpleConsumption(BaseConsumerTest.scala:85)
{quote}

I have recreated this with some additional trace. The tests run with a very 
small metadata expiry interval, triggering metadata updates quite often. If a 
metadata request immediately following a {{commitAsync()}} call creates a new 
SSL/SASL connection, {{ConsumerNetworkClient.poll}} returns to process the 
connection handshake packets. Since {{ConsumerNetworkClient.poll}} discards all 
unsent packets before returning from poll, this can result in the failure of 
the commit - the callback is invoked with {{SendFailedException}}.

I understand that {{ConsumerNetworkClient.poll()}} discards unsent packets 
rather than buffer them to keep the code simple. And perhaps it is ok to fail 
{{commitAsync}} occasionally since the callback does indicate that the caller 
should retry. But it feels like an unnecessary limitation that requires error 
handling in client applications when there are no real failures and makes it 
much harder to test reliably. As special handling to fix issues like 
KAFKA-3412, KAFKA-2672 adds more complexity to the code anyway, and because it 
is much harder to debug failures that affect only SSL/SASL, it may be worth 
considering improving this behaviour.

I will see if I can submit a PR for the specific issue I was seeing with the 
impact of handshakes on {{commitAsync()}}, but I will be interested in views on 
improving the logic in {{ConsumerNetworkClient}}.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to