[
https://issues.apache.org/jira/browse/KAFKA-350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Neha Narkhede updated KAFKA-350:
--------------------------------
Attachment: kafka-350-v2.patch
Thanks for the review, Jay! Here is another patch addressing almost all of the
review comments -
1. Log
1. Refactoring of the high watermark logic is part of KAFKA-405. It has been
moved to a new file, HighwaterMarkCheckpoint, and is controlled by the
ReplicaManager. The Log does not and should not know about high watermarks.
2. The getter/setter for the high watermark is removed as part of KAFKA-405
anyway.
3. Agree with you on the weak public interface; Log especially needs a cleanup.
I think you attempted that as part of KAFKA-371. I've cleaned up quite a few
things as part of KAFKA-405 and this patch. Nevertheless, addressed
nextAppendOffset as part of this patch; it is not required when we have
logEndOffset. Also removed getEndOffset from FileMessageSet, and added an
endOffset() API to LogSegment in addition to a size() API. This is useful when
truncating the log based on the high watermark.
4. Fixed the javadoc and removed the debug statement.
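For illustration, a rough sketch of what a high-watermark checkpoint file could look like once the logic lives outside Log; the class shape and file format here are hypothetical, not the actual KAFKA-405 code:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of a high-watermark checkpoint: the ReplicaManager
// periodically persists the high watermark so it survives a broker restart,
// and the Log never needs to know about it.
class HighwaterMarkCheckpoint {
    private final Path file;

    HighwaterMarkCheckpoint(Path file) {
        this.file = file;
    }

    // Write to a temp file first, then atomically rename it, so a crash
    // mid-write never leaves a corrupt checkpoint behind.
    synchronized void write(long highWatermark) throws IOException {
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        Files.write(tmp, Long.toString(highWatermark)
                            .getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING,
                   StandardCopyOption.ATOMIC_MOVE);
    }

    // Return the last checkpointed offset, or 0 if none exists yet.
    synchronized long read() throws IOException {
        if (!Files.exists(file)) return 0L;
        String s = new String(Files.readAllBytes(file),
                              StandardCharsets.UTF_8).trim();
        return s.isEmpty() ? 0L : Long.parseLong(s);
    }
}
```

On restart, a replica can then truncate its log to the checkpointed offset rather than re-deriving it from the log itself.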
2. Config options
1. Setting isr.keep.in.sync.time.ms to 10 seconds is also very lax. A healthy
follower should be able to catch up within hundreds of milliseconds even with a
lot of topics, with a worst case of maybe 4-5 seconds. We will know the latency
better once we run some large-scale tests. But yes, the issue with the system
test is independent of what the right value should be; I was just explaining
how I discovered this issue. :)
2. Good point about renaming it to replica.max.lag.time.ms. Also changed
isr.keep.in.sync.bytes to replica.max.lag.bytes.
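For reference, a broker config fragment with the renamed options might look like the following; the values shown are placeholders for illustration, not the patch's actual defaults:

```properties
# Follower is dropped from the ISR if it hasn't caught up within this time
replica.max.lag.time.ms=10000
# ...or if it has fallen more than this many bytes behind the leader
replica.max.lag.bytes=4000
```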
3. Since the producer does a blocking read on the socket, the socket timeout
cannot be greater than the request timeout. If it were, the request-timeout
guarantee would be violated, no?
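The constraint above can be sketched as a one-line clamp; the class and method names here are hypothetical, purely to illustrate the reasoning:

```java
// Illustrative sketch of the timeout constraint discussed above: the socket
// timeout is the only bound on how long a blocking read can block, so if it
// exceeded the request timeout, the producer could wait on a response longer
// than the request timeout promises.
class TimeoutConfig {
    // Cap the socket timeout at the request timeout so the blocking read
    // can never outlive the request-timeout guarantee.
    static int effectiveSocketTimeoutMs(int socketTimeoutMs, int requestTimeoutMs) {
        return Math.min(socketTimeoutMs, requestTimeoutMs);
    }
}
```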
3. SocketServer.scala
1. Removed info log statement
2. Yes, I wanted to simplify testing/debugging (thread dumps) when there are
multiple servers on one machine. Not sure if this is the best way to do that.
3. processNewResponses() doesn't have to process outstanding requests during
shutdown; it can shut down ignoring them, and those requests will time out
anyway. But yes, good point about the event loop doing it instead. Fixed it.
4. System testing
1. Fixed the hacky change. I still think ProducerPerformance needs a complete
redo; filed KAFKA-408 to do that.
5. AdminUtils
1. Need this change to return appropriate error codes for the topic metadata
request. Without it, all produce requests time out while fetching metadata,
since in the system test one replica is always down at any point in time.
6. Partition.scala
1. Removed the trace statement.
7. ErrorMapping
1. Removed getMaxErrorCode and its usage from ErrorMapping
2. Good point. Introduced a new class, KafkaException, and converted the uses
of IllegalStateException and IllegalArgumentException to it.
3. NoLeaderForPartitionException is in fact marked as deleted in the patch
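A rough sketch of what such a KafkaException could look like; the constructor shapes are illustrative, not necessarily the patch's actual class:

```java
// Hypothetical sketch of a KafkaException: a single runtime exception type
// thrown by internal code instead of the generic IllegalStateException /
// IllegalArgumentException, so callers can catch Kafka-specific failures
// uniformly while still carrying the original cause.
class KafkaException extends RuntimeException {
    KafkaException(String message) {
        super(message);
    }

    KafkaException(String message, Throwable cause) {
        super(message, cause);
    }
}
```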
8. ConsoleConsumer
1. Good point. The finally block there didn't really make any sense.
9. FileMessageSet
1. I didn't add those log statements in this patch; we always had them.
2. The purpose of adding them was to help debug producer-side queue-full
exceptions. We turned on DEBUG to see what the server-side flush sizes and
latencies were. I think these statements were added when we didn't have the
ability to monitor these metrics; whoever added the log flush monitoring
probably forgot to remove them. I removed them in this patch.
3. Removed the "recover upto" log statement too.
10. DefaultEventHandler
1. Agree about the unwieldy procedural code. Filed KAFKA-409 to clean this up.
11. ZkUtils
1. Renamed LeaderExists -> LeaderExistsListener
12. ZookeeperTestHarness
1. The sleep is actually not required; the EmbeddedZookeeper constructor
returns only after the ZooKeeper server has completely started.
> Enable message replication in the presence of controlled failures
> -----------------------------------------------------------------
>
> Key: KAFKA-350
> URL: https://issues.apache.org/jira/browse/KAFKA-350
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.8
> Reporter: Neha Narkhede
> Assignee: Neha Narkhede
> Attachments: kafka-350-v1.patch, kafka-350-v2.patch
>
>
> KAFKA-46 introduced message replication feature in the absence of server
> failures. This JIRA will improve the log recovery logic and fix other bugs to
> enable message replication to happen in the presence of controlled server
> failures
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira