[jira] [Commented] (KAFKA-7725) Add a delay for further CG rebalances, beyond KIP-134 group.initial.rebalance.delay.ms
[ https://issues.apache.org/jira/browse/KAFKA-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16730800#comment-16730800 ]

Boyang Chen commented on KAFKA-7725:

Thanks [~astubbs] for sharing this issue. Will add this to the KIP-345 Jira list.

> Add a delay for further CG rebalances, beyond KIP-134 group.initial.rebalance.delay.ms
> --
>
>                 Key: KAFKA-7725
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7725
>             Project: Kafka
>          Issue Type: New Feature
>          Components: clients, consumer, core
>    Affects Versions: 2.1.0
>            Reporter: Antony Stubbs
>            Priority: Major
>
> KIP-134 group.initial.rebalance.delay.ms was a good start, but there are much
> bigger problems: once a system is up and running, consumers can leave and join
> in large numbers, causing rebalance storms. One example is Mesosphere deploying
> a new version of an app. Say there are 10 instances, then 10 more instances are
> deployed with the new version, then the old 10 are scaled down. Ideally this
> would be 1 or 2 rebalances, instead of 20.
> The trade-off is that if the delay is 5 seconds, every consumer joining within
> that window would extend it by another 5 seconds, potentially causing
> partitions to never be processed. To mitigate this, either a max rebalance
> delay could also be added, or multiple consumers joining won't extend the
> rebalance delay, so that it's always a max of 5 seconds.
> Related:
> [KIP-345: Introduce static membership protocol to reduce consumer rebalances|https://cwiki.apache.org/confluence/display/KAFKA/KIP-345%3A+Introduce+static+membership+protocol+to+reduce+consumer+rebalances]
> KAFKA-7018: persist memberId for consumer restart

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
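The two mitigations described in KAFKA-7725 (a maximum rebalance delay, or a delay that later joins never extend) can be sketched with one small helper. This is only an illustration of the idea under stated assumptions, not broker code: the class name `RebalanceDelayWindow`, its methods, and the `maxDelayMs` parameter are hypothetical and do not exist in Kafka.

```java
/** Hypothetical helper (not a Kafka class): tracks when a delayed rebalance should start. */
final class RebalanceDelayWindow {
    private final long delayMs;     // per-change delay, e.g. 5000
    private final long maxDelayMs;  // assumed upper bound on the total delay
    private long windowStartMs = -1L; // -1 means no window is currently open
    private long deadlineMs = -1L;

    RebalanceDelayWindow(long delayMs, long maxDelayMs) {
        if (delayMs < 0 || maxDelayMs < delayMs) {
            throw new IllegalArgumentException("require 0 <= delayMs <= maxDelayMs");
        }
        this.delayMs = delayMs;
        this.maxDelayMs = maxDelayMs;
    }

    /** Records a join or leave at nowMs and returns the time at which the rebalance should start. */
    long onMembershipChange(long nowMs) {
        if (windowStartMs < 0) {
            windowStartMs = nowMs; // first change opens the window
        }
        // Each change pushes the deadline out by delayMs, but never past the cap.
        deadlineMs = Math.min(nowMs + delayMs, windowStartMs + maxDelayMs);
        return deadlineMs;
    }

    /** True once a window is open and its deadline has passed. */
    boolean shouldRebalance(long nowMs) {
        return windowStartMs >= 0 && nowMs >= deadlineMs;
    }

    /** Closes the window after the rebalance has been triggered. */
    void reset() {
        windowStartMs = -1L;
        deadlineMs = -1L;
    }
}
```

Constructing the helper with `maxDelayMs` equal to `delayMs` gives the second variant from the issue, where additional consumers joining never extend the delay beyond 5 seconds.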
[jira] [Updated] (KAFKA-6863) Kafka clients should try to use multiple DNS resolved IP addresses if the first one fails
[ https://issues.apache.org/jira/browse/KAFKA-6863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Bogan updated KAFKA-6863:
---
    Attachment: www.apache.orglicensesLICENSE-2.0.pdf

> Kafka clients should try to use multiple DNS resolved IP addresses if the first one fails
> -
>
>                 Key: KAFKA-6863
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6863
>             Project: Kafka
>          Issue Type: Improvement
>          Components: clients
>    Affects Versions: 1.0.0, 1.1.0
>            Reporter: Edoardo Comar
>            Assignee: Edoardo Comar
>            Priority: Major
>             Fix For: 2.1.0
>
>         Attachments: www.apache.orglicensesLICENSE-2.0.pdf
>
>
> Currently Kafka clients resolve a symbolic hostname using
> {{new InetSocketAddress(String hostname, int port)}}
> which only picks one IP address even if the DNS has multiple records for the
> hostname, as it calls
> {{InetAddress.getAllByName(host)[0]}}
> For some environments where the hostnames are mapped by the DNS to multiple
> IPs, e.g. in clouds where the IPs point to the external load balancers, it
> would be preferable that the client, on failing to connect to one of the IPs,
> would try the other ones before giving up the connection.
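The behaviour requested in KAFKA-6863 can be sketched as follows: resolve every record for the hostname with `InetAddress.getAllByName` and try each address until one connects. This is a minimal blocking sketch, not the actual client change (Kafka clients use non-blocking channels). The names `Dialer`, `MultiAddressConnector`, `connect`, and `connectToAny` are hypothetical and introduced only for illustration.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.UnknownHostException;

/** Hypothetical callback that opens a connection to one concrete address. */
interface Dialer<T> {
    T dial(InetSocketAddress target) throws IOException;
}

/** Hypothetical helper (not a Kafka class): tries every resolved address in order. */
final class MultiAddressConnector {

    /** Resolves all records for host and tries each one with a blocking socket. */
    static Socket connect(String host, int port, int timeoutMs) throws IOException {
        InetAddress[] all = InetAddress.getAllByName(host);
        return connectToAny(all, port, target -> {
            Socket socket = new Socket();
            try {
                socket.connect(target, timeoutMs);
                return socket;
            } catch (IOException e) {
                socket.close(); // release the failed socket before trying the next address
                throw e;
            }
        });
    }

    /** Returns the first successful connection, or rethrows the last failure. */
    static <T> T connectToAny(InetAddress[] addresses, int port, Dialer<T> dialer)
            throws IOException {
        IOException lastFailure = null;
        for (InetAddress address : addresses) {
            try {
                return dialer.dial(new InetSocketAddress(address, port));
            } catch (IOException e) {
                lastFailure = e; // remember the failure and move on to the next address
            }
        }
        if (lastFailure == null) {
            throw new UnknownHostException("no addresses to try");
        }
        throw lastFailure;
    }
}
```

Separating the address loop from the dialing callback keeps the retry order testable without opening real sockets.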
[jira] [Updated] (KAFKA-7755) Kubernetes - Kafka clients are resolving DNS entries only one time
[ https://issues.apache.org/jira/browse/KAFKA-7755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Bogan updated KAFKA-7755:
---
    Attachment: pom.xml

> Kubernetes - Kafka clients are resolving DNS entries only one time
> --
>
>                 Key: KAFKA-7755
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7755
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients
>    Affects Versions: 2.1.0, 2.2.0, 2.1.1
>         Environment: Kubernetes
>            Reporter: Loïc Monney
>            Priority: Blocker
>         Attachments: pom.xml
>
>
> *Introduction*
> Since 2.1.0, Kafka clients support multiple DNS resolved IP addresses if the
> first one fails. This change was introduced by
> https://issues.apache.org/jira/browse/KAFKA-6863. However, this DNS resolution
> is now performed only once by the clients. This is not a problem if all
> brokers have fixed IP addresses, but it is definitely an issue when Kafka
> brokers run on top of Kubernetes. New Kubernetes pods receive a different IP
> address, so once all brokers have been restarted, clients won't be able to
> reconnect to any broker.
> *Impact*
> Everyone running Kafka 2.1 or later on top of Kubernetes is impacted when a
> rolling restart is performed.
> *Root cause*
> Since https://issues.apache.org/jira/browse/KAFKA-6863, Kafka clients resolve
> DNS entries only once.
> *Proposed solution*
> In
> [https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/ClusterConnectionStates.java#L368]
> Kafka clients should perform the DNS resolution again when all IP addresses
> have been "used" (when _index_ is back to 0)
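The proposed solution in KAFKA-7755 (resolve again once the address index wraps back to 0) could look roughly like the sketch below. It is not the code in `ClusterConnectionStates`; the names `HostResolver`, `RotatingAddresses`, `current`, and `moveToNext` are hypothetical. The resolver is injected so the rotation can be exercised without real DNS, and in practice the JVM's own DNS cache settings may still return stale records.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Collections;
import java.util.List;

/** Hypothetical lookup hook; a real one could wrap InetAddress.getAllByName(host). */
interface HostResolver {
    List<InetAddress> resolve(String host) throws UnknownHostException;
}

/** Hypothetical helper (not a Kafka class): rotates through addresses and re-resolves on wrap. */
final class RotatingAddresses {
    private final String host;
    private final HostResolver resolver;
    private List<InetAddress> addresses = Collections.emptyList();
    private int index = 0;

    RotatingAddresses(String host, HostResolver resolver) {
        this.host = host;
        this.resolver = resolver;
    }

    /** Returns the address to try next, resolving the host first if nothing is cached. */
    InetAddress current() throws UnknownHostException {
        if (addresses.isEmpty()) {
            addresses = resolver.resolve(host);
            if (addresses.isEmpty()) {
                throw new UnknownHostException("No addresses found for " + host);
            }
            index = 0;
        }
        return addresses.get(index);
    }

    /** Call after a failed connection attempt to advance to the next address. */
    void moveToNext() {
        if (addresses.isEmpty()) {
            return;
        }
        index = (index + 1) % addresses.size();
        if (index == 0) {
            // Every address has been used once: drop the cached list so that
            // the next call to current() performs a fresh DNS lookup.
            addresses = Collections.emptyList();
        }
    }
}
```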
[jira] [Commented] (KAFKA-7728) Add JoinReason to the join group request for better rebalance handling
[ https://issues.apache.org/jira/browse/KAFKA-7728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16730594#comment-16730594 ]

Boyang Chen commented on KAFKA-7728:

[~enether] Thanks for the thoughts! I think compatibility should be considered, as the JoinGroupRequest version will be bumped. I have already come up with some common join reasons for a potential enum type:
{code:java}
public enum JoinGroupReason {
    BLIND("blind"),                         // Join request from a start-up consumer
    SELF_META_CHANGE("self_meta_change"),   // The consumer metadata has changed
    TOPIC_META_CHANGE("topic_meta_change"); // The topic metadata changed (must be from the leader)

    private final String name;

    JoinGroupReason(String name) {
        this.name = name;
    }
}
{code}
The self metadata change might be trivial to realize now, but I think it would be better to discuss more scenarios before we finalize anything. Let's brainstorm more join reasons that would help the broker make decisions.

> Add JoinReason to the join group request for better rebalance handling
> --
>
>                 Key: KAFKA-7728
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7728
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Boyang Chen
>            Assignee: Mayuresh Gharat
>            Priority: Major
>              Labels: consumer, mirror-maker, needs-kip
>
> Recently [~mgharat] and I discussed the current rebalance logic for handling
> the leader's join group request. So far we blindly trigger a rebalance when
> the leader rejoins. The caveat is that KIP-345 does not cover this effort,
> and if a consumer group is not using sticky assignment but another strategy
> such as round robin, the redundant rebalance could still shuffle the topic
> partitions around consumers (for example, in a mirror maker application).
> I checked on the broker side, and here is what we currently do:
>
> {code:java}
> if (group.isLeader(memberId) || !member.matches(protocols))
> // force a rebalance if a member has changed metadata or if the leader sends JoinGroup.
> // The latter allows the leader to trigger rebalances for changes affecting assignment
> // which do not affect the member metadata (such as topic metadata changes for the consumer)
> {code}
> Based on the broker logic, we only need to trigger a rebalance for a leader
> rejoin when a topic metadata change has happened. I also looked at the
> ConsumerCoordinator code on the client side, and found the metadata
> monitoring logic here:
> {code:java}
> public boolean rejoinNeededOrPending() {
>     ...
>     // we need to rejoin if we performed the assignment and metadata has changed
>     if (assignmentSnapshot != null && !assignmentSnapshot.equals(metadataSnapshot))
>         return true;
> }{code}
> Instead of just returning true, we could introduce a new enum field called
> JoinReason that indicates the purpose of the rejoin, so that we don't need to
> do a full rebalance when the leader is just going through a rolling bounce.
> Adding this enum field to the join group request would tell us whether the
> leader is rejoining due to a topic metadata change. If yes, we trigger a
> rebalance; if no, we shouldn't trigger a rebalance.
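The decision proposed in KAFKA-7728 can be sketched as a small predicate: a member whose protocols changed always forces a rebalance, while a rejoining leader forces one only when it reports a topic metadata change. Everything below is an assumption for discussion, not the broker implementation: the `RebalanceDecision` class is hypothetical, the enum constants are adapted from the comment above, and treating a missing reason (`null`, e.g. from an older client) as the current behaviour is one possible way to handle the compatibility concern.

```java
/** Hypothetical join reasons, adapted from the enum suggested in the comment. */
enum JoinReason {
    BLIND,             // join request from a start-up consumer
    SELF_META_CHANGE,  // the consumer's own metadata has changed
    TOPIC_META_CHANGE  // the topic metadata changed (must come from the leader)
}

/** Hypothetical helper (not broker code): decides whether a JoinGroup forces a rebalance. */
final class RebalanceDecision {

    static boolean shouldRebalance(boolean isLeader, boolean protocolsMatch, JoinReason reason) {
        if (!protocolsMatch) {
            return true; // member metadata changed: rebalance, as the broker does today
        }
        if (reason == null) {
            // Assumption: requests without a reason keep the current behaviour,
            // where any leader rejoin triggers a rebalance.
            return isLeader;
        }
        // With a reason available, a leader rejoin only matters for topic metadata changes.
        return isLeader && reason == JoinReason.TOPIC_META_CHANGE;
    }
}
```

Under this sketch, a leader that restarts during a rolling bounce and rejoins with `BLIND` and unchanged protocols would not trigger a rebalance.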