[ https://issues.apache.org/jira/browse/KAFKA-350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neha Narkhede updated KAFKA-350:
--------------------------------

    Attachment: kafka-350-v3.patch

Thanks for the review!

Jun's comments -

20. Made getLeaderRequest and getLogSegmentMetadataRequest private.

21. partitionAndCollate() is used by the dispatchSerializedData() API. The contract of dispatchSerializedData() is to return the list of messages that were not sent successfully, so that they can be used on the next retry. Now, if partitionAndCollate() fails, we need dispatchSerializedData() to return the entire input list of messages; if something else fails after partitionAndCollate(), we need to return a filtered list of messages back. To handle this, partitionAndCollate() returns an Option. I agree that this class has very complex code that needs a cleanup. Have filed KAFKA-409 for that.

22. We need to know the absolute end offset of the segment for truncation during the make-follower state change. Have changed the name of the API to absoluteEndOffset.

23. We have KAFKA-376 filed to ensure that the consumer sees data only after the HW. This patch exposes data up to the log end offset to replicas as well as consumers, so FileMessageSet.read() exposes all data up to the log end offset.

24. Removed the unused import.

25. Hmm, fixed the error message.

26. Removed the TODOs related to KAFKA-350.

27. Introduced the connection timeout and session timeout variables in the test harness and set them to 6s.

Jay's comments -

2.3 So we talked about this offline; stating it here so that others can follow. There are 2 choices here -

2.3.1. One is to keep request timeout = socket timeout. This would make us fully dependent on the socket for the timeout. So, any request that took longer than that on the server for some reason (GC, bugs, etc.) would throw SocketTimeoutException on the producer client. The only downside to this is that the producer client wouldn't know why it failed, or whether the data got written to 0, 1, or n replicas.
Although this should be very rare; in the normal case, the server processes the request and is able to send the response within the timeout. This assumes the timeout value was set counting the network delay as well. So if the request is travelling cross-colo, it is expected that you set the timeout keeping in mind the cross-colo network latency.

2.3.2. The second choice is to set the socket timeout >> request timeout. This would ensure that in some rare failure cases on the server, the producer client would still be able to get back a response (most of the time) with a descriptive error message explaining the cause of the failure. However, it would also mean that under some failures (a server-side kernel bug crashing the server, etc.), the effective timeout would actually be the socket timeout, which is usually set to a much higher value. This can confuse users who might expect the timeout to be the request timeout. Also, having two different timeouts seems to complicate the guarantees provided by Kafka.

2.3.3. I personally think option 1 is simpler and provides no worse guarantees than option 2. This patch just sets the socket timeout to be the request timeout.

2.4 Yeah, I think differentiating between "this should not be possible" and "this should not be possible in Kafka" is a little tricky. On one hand, it seems nicer to know that any exception thrown by Kafka will either be KafkaException or some exception that extends KafkaException. On the other hand, some states are just impossible and must be treated like assertions; for example, a SocketChannel key that is not readable or writable or even valid. In such cases, it might be slightly more convenient to have IllegalStateException. I'm not sure I know the right answer here. I took a pass over all the IllegalStateException usages and converted the ones I think should be KafkaException, but I might not have done it in the best way possible.
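The distinction discussed in 2.4 can be illustrated with a minimal sketch. Note that only KafkaException is a real Kafka class; the subclass, method names, and conditions below are hypothetical illustrations, not actual Kafka code.

```java
// Sketch of the exception policy from 2.4: conditions Kafka can legitimately
// reach surface as KafkaException (so callers can catch the Kafka hierarchy),
// while truly impossible states are treated like assertions via
// IllegalStateException. All names except KafkaException are hypothetical.
class KafkaException extends RuntimeException {
    public KafkaException(String message) { super(message); }
}

// Hypothetical subclass for a reachable failure condition.
class LeaderNotAvailableException extends KafkaException {
    public LeaderNotAvailableException(String message) { super(message); }
}

public class ExceptionPolicy {
    // Reachable condition (e.g. no leader elected yet): part of Kafka's
    // failure model, so it belongs in the KafkaException hierarchy.
    static void checkLeader(boolean leaderKnown) {
        if (!leaderKnown) {
            throw new LeaderNotAvailableException("no leader elected for partition");
        }
    }

    // "Should be impossible" state (e.g. a selection key that is neither
    // readable nor writable): treat it like an assertion.
    static void checkKey(boolean readable, boolean writable) {
        if (!readable && !writable) {
            throw new IllegalStateException("key is neither readable nor writable");
        }
    }

    public static void main(String[] args) {
        try {
            checkLeader(false);
        } catch (KafkaException e) {
            // prints: caught KafkaException: no leader elected for partition
            System.out.println("caught KafkaException: " + e.getMessage());
        }
        try {
            checkKey(false, false);
        } catch (IllegalStateException e) {
            // prints: caught IllegalStateException: key is neither readable nor writable
            System.out.println("caught IllegalStateException: " + e.getMessage());
        }
    }
}
```

The practical difference for callers: a catch-all on KafkaException picks up the first case but deliberately lets the second one propagate, since an impossible state indicates a bug rather than a recoverable condition.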
> Enable message replication in the presence of controlled failures
> -----------------------------------------------------------------
>
>                 Key: KAFKA-350
>                 URL: https://issues.apache.org/jira/browse/KAFKA-350
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: Neha Narkhede
>            Assignee: Neha Narkhede
>         Attachments: kafka-350-v1.patch, kafka-350-v2.patch, kafka-350-v3.patch
>
>
> KAFKA-46 introduced the message replication feature in the absence of server failures. This JIRA will improve the log recovery logic and fix other bugs to enable message replication to happen in the presence of controlled server failures.