[jira] [Commented] (KAFKA-2841) Group metadata cache loading is not safe when reloading a partition

2015-11-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15010067#comment-15010067
 ] 

ASF GitHub Bot commented on KAFKA-2841:
---

Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/530


> Group metadata cache loading is not safe when reloading a partition
> ---
>
> Key: KAFKA-2841
> URL: https://issues.apache.org/jira/browse/KAFKA-2841
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.9.0.0
> Reporter: Jason Gustafson
> Assignee: Jason Gustafson
> Priority: Blocker
> Fix For: 0.9.0.0
>
>
> If the coordinator receives a LeaderAndIsr request which includes a higher
> leader epoch for one of the partitions that it owns, then it will reload the
> offset/metadata for that partition. This can happen because the leader
> epoch is also incremented for ISR changes which do not result in a new leader
> for the partition. Currently, the coordinator replaces cached metadata values
> blindly on reloading, which can result in weird behavior such as unexpected
> session timeouts or request timeouts while rebalancing.
>
> To fix this, we need to check that the group being loaded has a higher
> generation than the cached value before replacing it. Also, if we have to
> replace a cached value (which shouldn't happen except when loading), we need
> to be very careful to ensure that any active delayed operations won't affect
> the group.
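
The generation check proposed in the description can be sketched roughly as
follows. This is only an illustration, not the code from the actual patch:
the class GroupCacheSketch, the GroupMetadata stand-in, and the method
addLoadedGroup are hypothetical names, and the real coordinator keeps far
more state per group.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class GroupCacheSketch {
    /** Hypothetical stand-in for the coordinator's cached group state. */
    static final class GroupMetadata {
        final String groupId;
        final int generation;

        GroupMetadata(String groupId, int generation) {
            this.groupId = groupId;
            this.generation = generation;
        }
    }

    private final ConcurrentMap<String, GroupMetadata> cache = new ConcurrentHashMap<>();

    /**
     * Adds a group read back from the log. An existing cached entry is kept
     * unless the loaded group has a strictly higher generation. Returns the
     * entry that is in the cache after the call.
     */
    GroupMetadata addLoadedGroup(GroupMetadata loaded) {
        return cache.merge(loaded.groupId, loaded,
            (cached, candidate) -> candidate.generation > cached.generation ? candidate : cached);
    }

    public static void main(String[] args) {
        GroupCacheSketch sketch = new GroupCacheSketch();
        GroupMetadata first = sketch.addLoadedGroup(new GroupMetadata("g", 5));
        // A reload that yields an older generation must not replace the cached object.
        GroupMetadata afterStaleReload = sketch.addLoadedGroup(new GroupMetadata("g", 3));
        System.out.println(afterStaleReload == first); // prints "true"
        // A genuinely newer generation does replace it.
        System.out.println(sketch.addLoadedGroup(new GroupMetadata("g", 6)).generation); // prints "6"
    }
}
```

Keeping the same object in the cache on a stale reload matters here because,
per the description, delayed operations hold a reference to the original
metadata object.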



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2841) Group metadata cache loading is not safe when reloading a partition

2015-11-16 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007035#comment-15007035
 ] 

Jun Rao commented on KAFKA-2841:


I thought the current logic is that if a group is being loaded, the group is 
not accessible until the load completes? Once the load completes, the group 
should have the latest info.



[jira] [Commented] (KAFKA-2841) Group metadata cache loading is not safe when reloading a partition

2015-11-16 Thread Jason Gustafson (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007059#comment-15007059
 ] 

Jason Gustafson commented on KAFKA-2841:


[~junrao] That is correct. The problem is that the group may be loaded more 
than once and the cached metadata object which holds group and member state may 
be replaced. When this happens, you can get very strange behavior since 
join/sync response callbacks may be lost and delayed operations (which still 
refer to the original metadata object) can cause conflicts. My patch makes this 
safer by preventing this replacement from taking place when a partition is 
loaded and by cleaning up the group state when the cached metadata is unloaded 
due to partition emigration to a new leader.
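
The unload half of this can be illustrated with a small sketch. It is not
taken from the patch: the Group class, its methods, and the error string
passed to the callbacks are all assumptions made for the example. The idea
shown is that unloading answers every pending join/sync callback instead of
dropping it, and marks the group dead so that a delayed operation still
holding the old object does nothing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class GroupUnloadSketch {
    /** Hypothetical group state: a dead flag plus callbacks awaiting a join/sync response. */
    static final class Group {
        private boolean dead = false;
        private final List<Consumer<String>> pendingCallbacks = new ArrayList<>();

        synchronized void awaitResponse(Consumer<String> callback) {
            pendingCallbacks.add(callback);
        }

        /** Called when the partition moves away: mark the group dead and fail every waiter. */
        synchronized void unload() {
            dead = true;
            for (Consumer<String> callback : pendingCallbacks) {
                // Illustrative error name; the waiter should rediscover the coordinator.
                callback.accept("NOT_COORDINATOR_FOR_GROUP");
            }
            pendingCallbacks.clear();
        }

        /** A delayed operation (for example a session timeout) calls this when it fires. */
        synchronized boolean onDelayedOperation() {
            if (dead) {
                return false; // stale operation: the group it refers to was unloaded
            }
            // ... act on the live group here ...
            return true;
        }
    }

    public static void main(String[] args) {
        Group group = new Group();
        List<String> responses = new ArrayList<>();
        group.awaitResponse(responses::add);
        group.unload();
        System.out.println(responses);                  // prints "[NOT_COORDINATOR_FOR_GROUP]"
        System.out.println(group.onDelayedOperation()); // prints "false"
    }
}
```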



[jira] [Commented] (KAFKA-2841) Group metadata cache loading is not safe when reloading a partition

2015-11-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005715#comment-15005715
 ] 

ASF GitHub Bot commented on KAFKA-2841:
---

GitHub user hachikuji opened a pull request:

https://github.com/apache/kafka/pull/530

KAFKA-2841: safe group metadata cache loading/unloading



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hachikuji/kafka KAFKA-2841

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/530.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #530


commit 881380eac954e0906ef2ec0fe3d5d8e067473a35
Author: Jason Gustafson 
Date:   2015-11-14T23:54:25Z

KAFKA-2841: safe group metadata cache loading/unloading






[jira] [Commented] (KAFKA-2841) Group metadata cache loading is not safe when reloading a partition

2015-11-13 Thread Jason Gustafson (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005174#comment-15005174
 ] 

Jason Gustafson commented on KAFKA-2841:


[~guozhang] Yeah, I think that addresses the major issue, but I think it's 
still worthwhile to add the checks when inserting into the metadata cache.



[jira] [Commented] (KAFKA-2841) Group metadata cache loading is not safe when reloading a partition

2015-11-13 Thread Guozhang Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005170#comment-15005170
 ] 

Guozhang Wang commented on KAFKA-2841:
--

[~hachikuji] Is this solvable in KAFKA-2721?
