[ https://issues.apache.org/jira/browse/KAFKA-350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neha Narkhede updated KAFKA-350:
--------------------------------

    Attachment: kafka-350-v2.patch

Thanks for the review, Jay! Here is another patch fixing almost all review comments:

1. Log
    1. Refactoring of the hw logic is part of KAFKA-405. It has moved to a new file, HighwaterMarkCheckpoint, and is controlled by the ReplicaManager. The Log does not and should not know about high watermarks.
    2. The getter/setter for hw is removed as part of KAFKA-405 anyway.
    3. Agree with you on the weak public interface; Log especially needs a cleanup. I think you attempted that as part of KAFKA-371, and I've cleaned up quite a few things as part of KAFKA-405 and this patch. Nevertheless, fixed nextAppendOffset as part of this patch; it is not required when we have logEndOffset. Also removed getEndOffset from FileMessageSet and added an endOffset() API to LogSegment, in addition to a size() API. This is useful during truncation of the log based on the high watermark.
    4. Fixed the javadoc and removed the debug statement.
2. Config options
    1. isr.keep.in.sync.time.ms set to 10 seconds is also very lax. A healthy follower should be able to catch up in hundreds of milliseconds even with a lot of topics, with a worst case of maybe 4-5 seconds. We will know the latency better when we run some large-scale tests. But yes, the issue with the system test is independent of what the right value should be; I was just explaining how I discovered this issue :)
    2. Good point about renaming it to replica.max.lag.time.ms. Also changed isr.keep.in.sync.bytes to replica.max.lag.bytes.
    3. Since the producer does a blocking read on the socket, the socket timeout cannot be greater than the request timeout. If it were, the request timeout guarantee would be violated, no?
3. SocketServer.scala
    1. Removed the info log statement.
    2. Yes, wanted to simplify testing/debugging (thread dumps) when there are multiple servers on one machine. Not sure if this is the best way to do that.
    3. processNewResponses() doesn't have to process outstanding requests during shutdown; it can shut down by ignoring them and those requests will time out anyway. But yes, good point about the event loop doing it instead. Fixed it.
4. System testing
    1. Fixed the hacky change. ProducerPerformance still needs a complete redo; filed bug KAFKA-408 to do that.
5. AdminUtils
    1. Need this change to return appropriate error codes for the topic metadata request. Without it, all produce requests time out while fetching metadata, since in the system test one replica is always down at any point in time.
6. Partition.scala
    1. Removed the trace statement.
7. ErrorMapping
    1. Removed getMaxErrorCode and its usages from ErrorMapping.
    2. Good point. Introduced a new class KafkaException and converted IllegalStateException and IllegalArgumentException to it.
    3. NoLeaderForPartitionException is in fact marked as deleted in the patch.
8. ConsoleConsumer
    1. Good point. The finally block there didn't really make any sense.
9. FileMessageSet
    1. I haven't added the log statements in this patch; we always had them.
    2. The purpose of adding them was to help with debugging producer-side queue-full exceptions. We turned on DEBUG to see what the server-side flush sizes and latencies were. I think these statements were added before we had the ability to monitor these, and whoever added the log flush monitoring may have forgotten to remove them. I removed them in this patch.
    3. Removed the "recover upto" log statement too.
10. DefaultEventHandler
    1. Agree with the unwieldy procedural code. Filed bug KAFKA-409 to clean this up.
11. ZkUtils
    1. Renamed LeaderExists -> LeaderExistsListener.
12. ZookeeperTestHarness
    1. The sleep is actually not required; the EmbeddedZookeeper constructor returns only after the zk server is completely started.
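The idea behind point 1.1, that high watermark persistence belongs to the ReplicaManager rather than the Log, can be sketched as a small checkpoint file that is written atomically and re-read on startup. This is a minimal Java illustration only; the actual HighwaterMarkCheckpoint in the patch is Scala, and the on-disk layout shown here (version line, entry count, then "partition offset" lines) is an assumption for illustration.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Sketch of a high-watermark checkpoint owned by the ReplicaManager, not the Log.
// The file format here is hypothetical, not the exact format from the patch.
public class HighwaterMarkCheckpoint {
    private final Path file;

    public HighwaterMarkCheckpoint(Path file) { this.file = file; }

    // Persist the high watermark of every local partition atomically:
    // write a temp file first, then rename it over the old checkpoint,
    // so a crash mid-write never leaves a half-written checkpoint.
    public void write(Map<String, Long> highWatermarks) throws IOException {
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        List<String> lines = new ArrayList<>();
        lines.add("0");                                    // format version
        lines.add(String.valueOf(highWatermarks.size()));  // entry count
        highWatermarks.forEach((tp, hw) -> lines.add(tp + " " + hw));
        Files.write(tmp, lines);
        Files.move(tmp, file, StandardCopyOption.ATOMIC_MOVE);
    }

    // Read the checkpoint back on startup; a missing file means no prior state.
    public Map<String, Long> read() throws IOException {
        Map<String, Long> result = new HashMap<>();
        if (!Files.exists(file)) return result;
        List<String> lines = Files.readAllLines(file);
        int count = Integer.parseInt(lines.get(1));
        for (int i = 0; i < count; i++) {
            String[] parts = lines.get(2 + i).split(" ");
            result.put(parts[0], Long.parseLong(parts[1]));
        }
        return result;
    }
}
```

Keeping this class out of Log means the Log API stays purely about appending and reading messages, which is the separation of concerns argued for above.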
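Point 1.3 mentions that LogSegment's new endOffset() API is useful when truncating the log to the high watermark. A hedged sketch of why: on restart, deciding which segments survive truncation requires each segment's end offset, not just its byte size. The Segment class and truncateTo helper below are illustrative stand-ins, not the actual Scala types.

```java
import java.util.*;

// Hypothetical illustration of truncating a segmented log back to the
// high watermark: segments fully below hw are kept, the segment straddling
// hw is cut at hw, and segments entirely above hw are dropped.
public class TruncateToHighWatermark {
    // Minimal stand-in for a log segment covering offsets [baseOffset, endOffset).
    public static final class Segment {
        final long baseOffset;
        final long endOffset;
        Segment(long baseOffset, long endOffset) {
            this.baseOffset = baseOffset;
            this.endOffset = endOffset;
        }
    }

    public static List<Segment> truncateTo(List<Segment> segments, long hw) {
        List<Segment> kept = new ArrayList<>();
        for (Segment s : segments) {
            if (s.endOffset <= hw)
                kept.add(s);                                 // fully below hw: keep as-is
            else if (s.baseOffset < hw)
                kept.add(new Segment(s.baseOffset, hw));     // straddles hw: cut at hw
            // else: entirely above hw, drop it
        }
        return kept;
    }
}
```

Without an endOffset() on each segment, this decision would need a scan of the segment's messages, which is exactly the awkwardness the new API avoids.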
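The two renamed configs in point 2.2 correspond to two independent lag conditions for keeping a follower in the ISR: it falls out of sync if it has not fetched within replica.max.lag.time.ms, or if it trails the leader by more than replica.max.lag.bytes. A rough Java sketch of that predicate, with parameter names assumed for illustration (lag is expressed here as an offset delta for simplicity):

```java
// Illustrative sketch of the two ISR membership conditions behind
// replica.max.lag.time.ms and replica.max.lag.bytes. Not the actual
// Kafka code; names and units are assumptions.
public class IsrLagCheck {
    public static boolean isCaughtUp(long nowMs, long lastFetchTimeMs,
                                     long leaderEndOffset, long followerEndOffset,
                                     long maxLagTimeMs, long maxLag) {
        // "Stuck" follower: has not issued a fetch recently enough.
        boolean stuck = nowMs - lastFetchTimeMs > maxLagTimeMs;
        // "Slow" follower: fetching, but falling too far behind the leader.
        boolean slow = leaderEndOffset - followerEndOffset > maxLag;
        return !stuck && !slow;
    }
}
```

The discussion above about 10 seconds being lax is about tuning maxLagTimeMs: too large and a dead follower lingers in the ISR, too small and a briefly-busy follower is ejected unnecessarily.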
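The invariant asserted in point 2.3 (a blocking socket read means the socket timeout must not exceed the request timeout) is the kind of constraint that can be enforced at config-validation time. A minimal sketch, with method and message names that are illustrative rather than actual Kafka config keys:

```java
// Sketch of the timeout invariant from point 2.3: if the socket timeout were
// larger than the request timeout, a blocking read could outlive the request
// deadline and silently violate the request-timeout guarantee.
public class ProducerConfigCheck {
    public static void validateTimeouts(long socketTimeoutMs, long requestTimeoutMs) {
        if (socketTimeoutMs > requestTimeoutMs)
            throw new IllegalArgumentException(
                "socket timeout (" + socketTimeoutMs + " ms) must be <= request timeout ("
                + requestTimeoutMs + " ms): the blocking socket read would otherwise "
                + "outlive the request deadline");
    }
}
```

Failing fast at startup is preferable here, since the misconfiguration would otherwise only surface as sporadic timeout violations under load.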
> Enable message replication in the presence of controlled failures
> -----------------------------------------------------------------
>
>                 Key: KAFKA-350
>                 URL: https://issues.apache.org/jira/browse/KAFKA-350
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: Neha Narkhede
>            Assignee: Neha Narkhede
>         Attachments: kafka-350-v1.patch, kafka-350-v2.patch
>
> KAFKA-46 introduced the message replication feature in the absence of server failures. This JIRA will improve the log recovery logic and fix other bugs to enable message replication to happen in the presence of controlled server failures.