[ https://issues.apache.org/jira/browse/KAFKA-350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neha Narkhede updated KAFKA-350:
--------------------------------

    Attachment: kafka-350-v2.patch

Thanks for the review, Jay! Here is another patch fixing almost all review comments:

1. Log
    1. Refactoring of the hw logic is part of KAFKA-405. It has moved to a new file, HighwaterMarkCheckpoint, and is controlled by the ReplicaManager. The Log does not and should not know about high watermarks.
    2. The getter/setter for hw is removed as part of KAFKA-405 anyway.
    3. Agree with you on the weak public interface; Log especially needs a cleanup. I think you attempted that as part of KAFKA-371, and I've cleaned up quite a few things as part of KAFKA-405 and this patch. Nevertheless, fixed nextAppendOffset as part of this patch; it is not required when we have logEndOffset. Also removed getEndOffset from FileMessageSet and added an endOffset() API to LogSegment, in addition to a size() API. This is useful during truncation of the log based on the high watermark.
    4. Fixed the javadoc and removed the debug statement.
2. Config options
    1. isr.keep.in.sync.time.ms set to 10 seconds is also very lax. A healthy follower should be able to catch up in hundreds of milliseconds even with a lot of topics, with a worst case of maybe 4-5 seconds. We will know the latency better when we run some large-scale tests. But yes, the issue with the system test is independent of what the right value should be; I was just explaining how I discovered this issue :)
    2. Good point about renaming it to replica.max.lag.time.ms. Also changed isr.keep.in.sync.bytes to replica.max.lag.bytes.
    3. Since the producer does a blocking read on the socket, the socket timeout cannot be greater than the request timeout. If it were, the request timeout guarantee would be violated, no?
3. SocketServer.scala
    1. Removed the info log statement.
    2. Yes, wanted to simplify testing/debugging (thread dumps) when there are multiple servers on one machine. Not sure if this is the best way to do that.
    3. processNewResponses() doesn't have to process outstanding requests during shutdown; it can shut down by ignoring them and those requests will time out anyway. But yes, good point about the event loop doing it instead. Fixed it.
4. System testing
    1. Fixed the hacky change. ProducerPerformance still needs a complete redo; filed bug KAFKA-408 to do that.
5. AdminUtils
    1. Need this change to return appropriate error codes for the topic metadata request. Without it, all produce requests time out while fetching metadata, since in the system test one replica is always down at any point in time.
6. Partition.scala
    1. Removed the trace statement.
7. ErrorMapping
    1. Removed getMaxErrorCode and its usages from ErrorMapping.
    2. Good point. Introduced a new class KafkaException and converted IllegalStateException and IllegalArgumentException to it.
    3. NoLeaderForPartitionException is in fact marked as deleted in the patch.
8. ConsoleConsumer
    1. Good point. The finally block there didn't really make any sense.
9. FileMessageSet
    1. I haven't added the log statements in this patch; we always had them.
    2. The purpose of adding them was to help with debugging producer-side queue-full exceptions. We turned on DEBUG to see what the server-side flush sizes and latencies were. I think these statements were added before we had the ability to monitor these, and whoever added the log flush monitoring may have forgotten to remove them. I removed them in this patch.
    3. Removed the "recover upto" log statement too.
10. DefaultEventHandler
    1. Agree with the unwieldy procedural code. Filed bug KAFKA-409 to clean this up.
11. ZkUtils
    1. Renamed LeaderExists -> LeaderExistsListener.
12. ZookeeperTestHarness
    1. The sleep is actually not required; the EmbeddedZookeeper constructor returns only after the zk server is completely started.
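The idea behind point 1.1, that high watermark persistence belongs to the ReplicaManager rather than the Log, can be sketched as a small checkpoint file that is written atomically and re-read on startup. This is a minimal Java illustration only; the actual HighwaterMarkCheckpoint in the patch is Scala, and the on-disk layout shown here (version line, entry count, then "partition offset" lines) is an assumption for illustration.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Sketch of a high-watermark checkpoint owned by the ReplicaManager, not the Log.
// The file format here is hypothetical, not the exact format from the patch.
public class HighwaterMarkCheckpoint {
    private final Path file;

    public HighwaterMarkCheckpoint(Path file) { this.file = file; }

    // Persist the high watermark of every local partition atomically:
    // write a temp file first, then rename it over the old checkpoint,
    // so a crash mid-write never leaves a half-written checkpoint.
    public void write(Map<String, Long> highWatermarks) throws IOException {
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        List<String> lines = new ArrayList<>();
        lines.add("0");                                    // format version
        lines.add(String.valueOf(highWatermarks.size()));  // entry count
        highWatermarks.forEach((tp, hw) -> lines.add(tp + " " + hw));
        Files.write(tmp, lines);
        Files.move(tmp, file, StandardCopyOption.ATOMIC_MOVE);
    }

    // Read the checkpoint back on startup; a missing file means no prior state.
    public Map<String, Long> read() throws IOException {
        Map<String, Long> result = new HashMap<>();
        if (!Files.exists(file)) return result;
        List<String> lines = Files.readAllLines(file);
        int count = Integer.parseInt(lines.get(1));
        for (int i = 0; i < count; i++) {
            String[] parts = lines.get(2 + i).split(" ");
            result.put(parts[0], Long.parseLong(parts[1]));
        }
        return result;
    }
}
```

Keeping this class out of Log means the Log API stays purely about appending and reading messages, which is the separation of concerns argued for above.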
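Point 1.3 mentions that LogSegment's new endOffset() API is useful when truncating the log to the high watermark. A hedged sketch of why: on restart, deciding which segments survive truncation requires each segment's end offset, not just its byte size. The Segment class and truncateTo helper below are illustrative stand-ins, not the actual Scala types.

```java
import java.util.*;

// Hypothetical illustration of truncating a segmented log back to the
// high watermark: segments fully below hw are kept, the segment straddling
// hw is cut at hw, and segments entirely above hw are dropped.
public class TruncateToHighWatermark {
    // Minimal stand-in for a log segment covering offsets [baseOffset, endOffset).
    public static final class Segment {
        final long baseOffset;
        final long endOffset;
        Segment(long baseOffset, long endOffset) {
            this.baseOffset = baseOffset;
            this.endOffset = endOffset;
        }
    }

    public static List<Segment> truncateTo(List<Segment> segments, long hw) {
        List<Segment> kept = new ArrayList<>();
        for (Segment s : segments) {
            if (s.endOffset <= hw)
                kept.add(s);                                 // fully below hw: keep as-is
            else if (s.baseOffset < hw)
                kept.add(new Segment(s.baseOffset, hw));     // straddles hw: cut at hw
            // else: entirely above hw, drop it
        }
        return kept;
    }
}
```

Without an endOffset() on each segment, this decision would need a scan of the segment's messages, which is exactly the awkwardness the new API avoids.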
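The two renamed configs in point 2.2 correspond to two independent lag conditions for keeping a follower in the ISR: it falls out of sync if it has not fetched within replica.max.lag.time.ms, or if it trails the leader by more than replica.max.lag.bytes. A rough Java sketch of that predicate, with parameter names assumed for illustration (lag is expressed here as an offset delta for simplicity):

```java
// Illustrative sketch of the two ISR membership conditions behind
// replica.max.lag.time.ms and replica.max.lag.bytes. Not the actual
// Kafka code; names and units are assumptions.
public class IsrLagCheck {
    public static boolean isCaughtUp(long nowMs, long lastFetchTimeMs,
                                     long leaderEndOffset, long followerEndOffset,
                                     long maxLagTimeMs, long maxLag) {
        // "Stuck" follower: has not issued a fetch recently enough.
        boolean stuck = nowMs - lastFetchTimeMs > maxLagTimeMs;
        // "Slow" follower: fetching, but falling too far behind the leader.
        boolean slow = leaderEndOffset - followerEndOffset > maxLag;
        return !stuck && !slow;
    }
}
```

The discussion above about 10 seconds being lax is about tuning maxLagTimeMs: too large and a dead follower lingers in the ISR, too small and a briefly-busy follower is ejected unnecessarily.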
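The invariant asserted in point 2.3 (a blocking socket read means the socket timeout must not exceed the request timeout) is the kind of constraint that can be enforced at config-validation time. A minimal sketch, with method and message names that are illustrative rather than actual Kafka config keys:

```java
// Sketch of the timeout invariant from point 2.3: if the socket timeout were
// larger than the request timeout, a blocking read could outlive the request
// deadline and silently violate the request-timeout guarantee.
public class ProducerConfigCheck {
    public static void validateTimeouts(long socketTimeoutMs, long requestTimeoutMs) {
        if (socketTimeoutMs > requestTimeoutMs)
            throw new IllegalArgumentException(
                "socket timeout (" + socketTimeoutMs + " ms) must be <= request timeout ("
                + requestTimeoutMs + " ms): the blocking socket read would otherwise "
                + "outlive the request deadline");
    }
}
```

Failing fast at startup is preferable here, since the misconfiguration would otherwise only surface as sporadic timeout violations under load.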
> Enable message replication in the presence of controlled failures
> -----------------------------------------------------------------
>
>                 Key: KAFKA-350
>                 URL: https://issues.apache.org/jira/browse/KAFKA-350
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: Neha Narkhede
>            Assignee: Neha Narkhede
>         Attachments: kafka-350-v1.patch, kafka-350-v2.patch
>
> KAFKA-46 introduced the message replication feature in the absence of server failures. This JIRA will improve the log recovery logic and fix other bugs to enable message replication to happen in the presence of controlled server failures.