[jira] [Updated] (KAFKA-691) Fault tolerance broken with replication factor 1

2013-01-16 Thread Jun Rao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao updated KAFKA-691:
--

Attachment: (was: kafka-691_extra.patch)

> Fault tolerance broken with replication factor 1
> 
>
> Key: KAFKA-691
> URL: https://issues.apache.org/jira/browse/KAFKA-691
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8
>Reporter: Jay Kreps
>Assignee: Maxime Brugidou
> Fix For: 0.8
>
> Attachments: kafka-691_extra.patch, KAFKA-691-v1.patch, 
> KAFKA-691-v2.patch
>
>
> In 0.7, if a partition was down we would just send the message elsewhere. This 
> meant that the partitioning was really more of a "stickiness" than a hard 
> guarantee, which made it impossible to depend on for partitioned, stateful 
> processing.
> In 0.8, when running with replication, this should generally not be a problem, 
> as partitions are now highly available and fail over to other replicas. 
> However, with replication factor = 1 this no longer really works in most 
> cases, since a dead broker will produce errors for the partitions it hosts.
> I am not sure of the best fix. Intuitively, I think this is something that 
> should be handled by the Partitioner interface. However, the partitioner 
> currently has no knowledge of which nodes are available, so you could use a 
> random partitioner, but that would keep going back to the down node.
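
For context, here is a minimal Scala sketch of the shape of the 0.8 producer 
Partitioner contract (a paraphrase for illustration, not the exact source): the 
partitioner only sees the key and the partition count, so it has no way to 
avoid a dead broker.

    // Paraphrased sketch of the 0.8 Partitioner contract (the real trait
    // lives in kafka.producer and may differ in detail).
    trait Partitioner[T] {
      // Returns a partition in [0, numPartitions). Nothing here describes
      // broker or leader availability.
      def partition(key: T, numPartitions: Int): Int
    }

    // Even a random partitioner keeps hitting the down node with probability
    // 1/numPartitions per send, because it cannot see liveness.
    class RandomPartitioner[T] extends Partitioner[T] {
      private val rand = new java.util.Random
      def partition(key: T, numPartitions: Int): Int = rand.nextInt(numPartitions)
    }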

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (KAFKA-691) Fault tolerance broken with replication factor 1

2013-01-16 Thread Jun Rao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao updated KAFKA-691:
--

Attachment: kafka-691_extra.patch

Attaching the correct patch (kafka-691_extra.patch).



[jira] [Updated] (KAFKA-691) Fault tolerance broken with replication factor 1

2013-01-16 Thread Jun Rao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao updated KAFKA-691:
--

Attachment: kafka-691_extra.patch

The last patch introduced a bug: DefaultEventHandler.getPartition() is expected 
to return an index into the partition list, not the actual partition id. 
Attaching a patch that fixes the issue.
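
To make the contract concrete, a minimal Scala sketch with stub types 
(illustrative only, not the patch code):

    // Stub standing in for Kafka's partition metadata, for illustration only.
    case class PartitionMeta(partitionId: Int, leaderAvailable: Boolean)

    // getPartition must return a POSITION in `partitions`, because the caller
    // looks the chosen partition up by index. Returning the broker-assigned
    // partitionId breaks once the list is filtered (e.g. to partitions with an
    // available leader), since an id can exceed the filtered list's bounds.
    def getPartitionIndex(key: Any, partitions: IndexedSeq[PartitionMeta]): Int =
      if (key == null)
        scala.util.Random.nextInt(partitions.size)  // any position is fine
      else
        ((key.hashCode % partitions.size) + partitions.size) % partitions.size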



[jira] [Updated] (KAFKA-691) Fault tolerance broken with replication factor 1

2013-01-10 Thread Maxime Brugidou (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxime Brugidou updated KAFKA-691:
--

Attachment: KAFKA-691-v2.patch

Thanks for your feedback. I updated the patch (v2) according to your notes (1. 
and 2.).

For 3, I believe you are right, except that:
3.1 It seems (correct me if I'm wrong) that a rebalance happens at consumer 
initialization, which means a consumer can't start if a broker is down.
3.2 Can a rebalance be triggered when a partition is added or moved? Having a 
broker down shouldn't prevent me from reassigning or adding partitions.




[jira] [Updated] (KAFKA-691) Fault tolerance broken with replication factor 1

2013-01-10 Thread Maxime Brugidou (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxime Brugidou updated KAFKA-691:
--

Attachment: KAFKA-691-v1.patch

Here is a first draft (v1) patch.

1. Added the producer property "producer.metadata.refresh.interval.ms", which 
defaults to 600000 (10 min).

2. The metadata is refreshed every 10 min (only if a message is sent), and the 
set of topics to refresh is tracked in the topicMetadataToRefresh set (cleared 
after every refresh). I think the added value of refreshing regardless of 
partition availability is that it detects new partitions.

3. The good news is that I didn't touch the Partitioner API. I only changed the 
code to use the available partitions when the key is null (as suggested by 
Jun); it will also throw an UnknownTopicOrPartitionException("No leader for any 
partition") if no partition is available at all. A rough sketch follows below.

Let me know what you think about this patch. I ran a producer with this code 
successfully and tested it with a broker down.

I now have some concerns about the consumer: the refresh.leader.backoff.ms 
config could help me (if I increase it to, say, 10 min; rough sketch below), 
BUT the rebalance fails in any case since there is no leader for some 
partitions.

I don't have a good workaround yet for that, any help/suggestion appreciated.
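
For what it's worth, the consumer-side mitigation mentioned above would look 
roughly like this (values and the surrounding properties are illustrative):

    // Mitigation sketch, not a fix: raising refresh.leader.backoff.ms only
    // spaces out leader re-discovery retries; the rebalance still fails while
    // partitions have no leader. Requires the 0.8 consumer classes.
    val props = new java.util.Properties
    props.put("zookeeper.connect", "localhost:2181")  // placeholder
    props.put("group.id", "example-group")            // placeholder
    props.put("refresh.leader.backoff.ms", "600000")  // 10 min vs. small default
    val config = new kafka.consumer.ConsumerConfig(props)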
