[ 
https://issues.apache.org/jira/browse/KAFKA-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220272#comment-15220272
 ] 

Jason Gustafson commented on KAFKA-3488:
----------------------------------------

This has come up a few times in slightly different scenarios, so it's 
definitely worth fixing. A few options come to mind:

1. The current behavior is basically a consequence of the fact that 
NetworkClient only accepts one request send for each connection on each 
invocation of poll(). If it supported multiple request sends, then we wouldn't 
have to touch ConsumerNetworkClient. For the consumer, requests are usually 
tiny, so we ought to be able to send a bunch of them at once.
2. We can have ConsumerNetworkClient "try harder" to get the requests out. 
Instead of throwing on the first send failure, we can have it go back into 
poll(). This should clear the pending request send in NetworkClient so that 
another can be sent. This is actually how it was written initially, so you can 
look back at the history to see what that looked like. I don't recall whether 
there was a great reason to change it, but I think we were concerned about the 
overhead from multiple iterations of poll().
3. We can reschedule the async commit to be sent later. Because send failure 
remains a possibility even with the above fixes, this is really the only way to 
guarantee that a commit eventually will get sent. Unfortunately, it introduces 
the possibility of commit reordering, so we need an approach to deal with that 
(e.g. sequence numbers).

Among the these options, (2) probably has the least risk, but it might be 
worthwhile seeing what (1) would look like. There may be other options as well 
that I haven't thought of. 

> commitAsync() fails if metadata update creates new SASL/SSL connection
> ----------------------------------------------------------------------
>
>                 Key: KAFKA-3488
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3488
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer
>    Affects Versions: 0.9.0.1
>            Reporter: Rajini Sivaram
>            Assignee: Rajini Sivaram
>             Fix For: 0.10.0.0
>
>
> Sasl/SslConsumerTest.testSimpleConsumption() fails intermittently with a 
> failure in {{commitAsync()}}. The exception stack trace shows:
> {quote}
> kafka.api.SaslPlaintextConsumerTest.testSimpleConsumption FAILED
> java.lang.AssertionError: expected:<1> but was:<0>
>       at org.junit.Assert.fail(Assert.java:88)
>       at org.junit.Assert.failNotEquals(Assert.java:834)
>       at org.junit.Assert.assertEquals(Assert.java:645)
>       at org.junit.Assert.assertEquals(Assert.java:631)
>       at 
> kafka.api.BaseConsumerTest.awaitCommitCallback(BaseConsumerTest.scala:340)
>       at 
> kafka.api.BaseConsumerTest.testSimpleConsumption(BaseConsumerTest.scala:85)
> {quote}
> I have recreated this with some additional trace. The tests run with a very 
> small metadata expiry interval, triggering metadata updates quite often. If a 
> metadata request immediately following a {{commitAsync()}} call creates a new 
> SSL/SASL connection, {{ConsumerNetworkClient.poll}} returns to process the 
> connection handshake packets. Since {{ConsumerNetworkClient.poll}} discards 
> all unsent packets before returning from poll, this can result in the failure 
> of the commit - the callback is invoked with {{SendFailedException}}.
> I understand that {{ConsumerNetworkClient.poll()}} discards unsent packets 
> rather than buffer them to keep the code simple. And perhaps it is ok to fail 
> {{commitAsync}} occasionally since the callback does indicate that the caller 
> should retry. But it feels like an unnecessary limitation that requires error 
> handling in client applications when there are no real failures and makes it 
> much harder to test reliably. As special handling to fix issues like 
> KAFKA-3412, KAFKA-2672 adds more complexity to the code anyway, and because 
> it is much harder to debug failures that affect only SSL/SASL, it may be 
> worth considering improving this behaviour.
> I will see if I can submit a PR for the specific issue I was seeing with the 
> impact of handshakes on {{commitAsync()}}, but I will be interested in views 
> on improving the logic in {{ConsumerNetworkClient}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to