[ https://issues.apache.org/jira/browse/KAFKA-350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neha Narkhede updated KAFKA-350:
--------------------------------

    Attachment: kafka-350-v3.patch

Thanks for the review!

Jun's comments -

20. Made getLeaderRequest and getLogSegmentMetadataRequest private.

21. partitionAndCollate() is used by the dispatchSerializedData() API. The contract of dispatchSerializedData() is to return the list of messages that were not sent successfully, so that they can be used on the next retry. Now, if partitionAndCollate() fails, we need dispatchSerializedData() to return the entire input list of messages; if something else fails after partitionAndCollate(), we need to return a filtered list of messages back. To handle this, partitionAndCollate() returns an Option. I agree that this class has very complex code that needs a cleanup. Have filed KAFKA-409 for that.

22. We need to know the absolute end offset of the segment for truncation during the make-follower state change. Have changed the name of the API to absoluteEndOffset.

23. We have KAFKA-376 filed to ensure that the consumer sees data only after the HW. This patch exposes data up to the log end offset to replicas as well as consumers, so FileMessageSet.read() exposes all data up to the log end offset.

24. Removed the unused import.

25. Hmm, fixed the error message.

26. Removed the TODOs related to KAFKA-350.

27. Introduced the connection timeout and session timeout variables in the test harness and set them to 6s.

Jay's comments -

2.3 So we talked about this offline; stating it here so that others can follow. There are 2 choices here -

2.3.1. One is to keep request timeout = socket timeout. This would make us fully dependent on the socket for the timeout. So, any request that took longer than that on the server for some reason (GC, bugs, etc.) would throw SocketTimeoutException on the producer client. The only downside to this is that the producer client wouldn't know why it failed, or whether the data got written to 0, 1, or n replicas.
Although this should be very rare; in the normal case, the server processes the request and is able to send the response within the timeout. This assumes the timeout value was set counting the network delay as well. So if the request is travelling cross-colo, it is expected that you set the timeout keeping in mind the cross-colo network latency.

2.3.2. The second choice is to set the socket timeout >> request timeout. This would ensure that in some rare failure cases on the server, the producer client would still be able to get back a response (most of the time) with a descriptive error message explaining the cause of the failure. However, it would also mean that under some failures (a server-side kernel bug crashing the server, etc.), the effective timeout would actually be the socket timeout, which is usually set to a much higher value. This can confuse users who might expect the timeout to be the request timeout. Also, having two different timeouts seems to complicate the guarantees provided by Kafka.

2.3.3. I personally think option 1 is simpler and provides no worse guarantees than option 2. This patch just sets the socket timeout to be the request timeout.

2.4 Yeah, I think differentiating between "this should not be possible" and "this should not be possible in Kafka" is a little tricky. On one hand, it seems nicer to know that any exception thrown by Kafka will either be KafkaException or some exception that extends KafkaException. On the other hand, some states are just impossible and must be treated like assertions; for example, a SocketChannel key that is not readable or writable or even valid. In such cases, it might be slightly more convenient to have IllegalStateException. I'm not sure I know the right answer here. I took a pass over all the IllegalStateException usages and converted the ones I think should be KafkaException, but I might not have done it in the best way possible.
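The distinction discussed in 2.4 can be illustrated with a minimal sketch. Note that only KafkaException is a real Kafka class; the subclass, method names, and conditions below are hypothetical illustrations, not actual Kafka code.

```java
// Sketch of the exception policy from 2.4: conditions Kafka can legitimately
// reach surface as KafkaException (so callers can catch the Kafka hierarchy),
// while truly impossible states are treated like assertions via
// IllegalStateException. All names except KafkaException are hypothetical.
class KafkaException extends RuntimeException {
    public KafkaException(String message) { super(message); }
}

// Hypothetical subclass for a reachable failure condition.
class LeaderNotAvailableException extends KafkaException {
    public LeaderNotAvailableException(String message) { super(message); }
}

public class ExceptionPolicy {
    // Reachable condition (e.g. no leader elected yet): part of Kafka's
    // failure model, so it belongs in the KafkaException hierarchy.
    static void checkLeader(boolean leaderKnown) {
        if (!leaderKnown) {
            throw new LeaderNotAvailableException("no leader elected for partition");
        }
    }

    // "Should be impossible" state (e.g. a selection key that is neither
    // readable nor writable): treat it like an assertion.
    static void checkKey(boolean readable, boolean writable) {
        if (!readable && !writable) {
            throw new IllegalStateException("key is neither readable nor writable");
        }
    }

    public static void main(String[] args) {
        try {
            checkLeader(false);
        } catch (KafkaException e) {
            // prints: caught KafkaException: no leader elected for partition
            System.out.println("caught KafkaException: " + e.getMessage());
        }
        try {
            checkKey(false, false);
        } catch (IllegalStateException e) {
            // prints: caught IllegalStateException: key is neither readable nor writable
            System.out.println("caught IllegalStateException: " + e.getMessage());
        }
    }
}
```

The practical difference for callers: a catch-all on KafkaException picks up the first case but deliberately lets the second one propagate, since an impossible state indicates a bug rather than a recoverable condition.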
> Enable message replication in the presence of controlled failures
> -----------------------------------------------------------------
>
>                 Key: KAFKA-350
>                 URL: https://issues.apache.org/jira/browse/KAFKA-350
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: Neha Narkhede
>            Assignee: Neha Narkhede
>         Attachments: kafka-350-v1.patch, kafka-350-v2.patch, kafka-350-v3.patch
>
>
> KAFKA-46 introduced the message replication feature in the absence of server failures. This JIRA will improve the log recovery logic and fix other bugs to enable message replication to happen in the presence of controlled server failures.