Re: [PR] KAFKA-16465: Fix consumer sys test revocation validation [kafka]
lucasbru merged PR #15778: URL: https://github.com/apache/kafka/pull/15778 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] KAFKA-16465: Fix consumer sys test revocation validation [kafka]
lianetm commented on code in PR #15778: URL: https://github.com/apache/kafka/pull/15778#discussion_r1581508063 ## tests/kafkatest/tests/client/consumer_test.py: ## @@ -242,16 +242,15 @@ def test_static_consumer_bounce(self, clean_shutdown, static_membership, bounce_ self.rolling_bounce_consumers(consumer, keep_alive=num_keep_alive, num_bounces=num_bounces) num_revokes_after_bounce = consumer.num_revokes_for_alive() - num_revokes_before_bounce - -check_condition = num_revokes_after_bounce != 0 + # under static membership, the live consumer shall not revoke any current running partitions, # since there is no global rebalance being triggered. if static_membership: -check_condition = num_revokes_after_bounce == 0 - -assert check_condition, \ -"Total revoked count %d does not match the expectation of having 0 revokes as %d" % \ -(num_revokes_after_bounce, check_condition) +assert num_revokes_after_bounce == 0, \ +"Unexpected revocation triggered when bouncing static member. Expecting 0 but had %d revocations" % num_revokes_after_bounce +elif consumer.is_eager(): +assert num_revokes_after_bounce != 0, \ Review Comment: Done. I also created the [Jira](https://issues.apache.org/jira/browse/KAFKA-16628) assigned to me to add a test to cover all the non-eager paths that we know we don't have yet. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] KAFKA-16465: Fix consumer sys test revocation validation [kafka]
lucasbru commented on code in PR #15778: URL: https://github.com/apache/kafka/pull/15778#discussion_r1581306529 ## tests/kafkatest/tests/client/consumer_test.py: ## @@ -242,16 +242,15 @@ def test_static_consumer_bounce(self, clean_shutdown, static_membership, bounce_ self.rolling_bounce_consumers(consumer, keep_alive=num_keep_alive, num_bounces=num_bounces) num_revokes_after_bounce = consumer.num_revokes_for_alive() - num_revokes_before_bounce - -check_condition = num_revokes_after_bounce != 0 + # under static membership, the live consumer shall not revoke any current running partitions, # since there is no global rebalance being triggered. if static_membership: -check_condition = num_revokes_after_bounce == 0 - -assert check_condition, \ -"Total revoked count %d does not match the expectation of having 0 revokes as %d" % \ -(num_revokes_after_bounce, check_condition) +assert num_revokes_after_bounce == 0, \ +"Unexpected revocation triggered when bouncing static member. Expecting 0 but had %d revocations" % num_revokes_after_bounce +elif consumer.is_eager(): +assert num_revokes_after_bounce != 0, \ Review Comment: Yes, I think that's a good plan. Can you update this PR to rename the test and create a separate JIRA for the new test? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] KAFKA-16465: Fix consumer sys test revocation validation [kafka]
lianetm commented on code in PR #15778: URL: https://github.com/apache/kafka/pull/15778#discussion_r1581203195 ## tests/kafkatest/tests/client/consumer_test.py: ## @@ -242,16 +242,15 @@ def test_static_consumer_bounce(self, clean_shutdown, static_membership, bounce_ self.rolling_bounce_consumers(consumer, keep_alive=num_keep_alive, num_bounces=num_bounces) num_revokes_after_bounce = consumer.num_revokes_for_alive() - num_revokes_before_bounce - -check_condition = num_revokes_after_bounce != 0 + # under static membership, the live consumer shall not revoke any current running partitions, # since there is no global rebalance being triggered. if static_membership: -check_condition = num_revokes_after_bounce == 0 - -assert check_condition, \ -"Total revoked count %d does not match the expectation of having 0 revokes as %d" % \ -(num_revokes_after_bounce, check_condition) +assert num_revokes_after_bounce == 0, \ +"Unexpected revocation triggered when bouncing static member. Expecting 0 but had %d revocations" % num_revokes_after_bounce +elif consumer.is_eager(): +assert num_revokes_after_bounce != 0, \ Review Comment: > So maybe we should instead try to get the set of partitions and check that it didn't change? Doable, but that wouldn't truly ensure the static membership behaviour either I guess, because the test is intentionally leaving 1 member alive that could be the leader or not. So the assignment would remain the same regardless of the static membership under CONSUMER (or Cooperative) if the single partition belongs to the live member (single partition that I guess was intentionally decided to be able to easily check the delivery semantics after the bounces) So as I see it this test is specifically crafted to validate the stickiness that static membership brings into the `RangeAssignor` ( nicely explained in [this section](https://github.com/apache/kafka/blob/e7792258df934a5c8470c2925c5d164c7d5a8e6c/clients/src/main/java/org/apache/kafka/clients/consumer/RangeAssignor.java#L58-L78) of the `RangeAssignor` class doc btw). We're trying to apply it to the CONSUMER protocol but finding not much value given the purpose of the test and the shape it has (bounce n-1 nodes, check at least 1 revocation if dynamic, none if static, regardless of partition ownership). Listening to myself telling you this makes me reconsider if we should just not run this test for the new protocol, as it was truly never intended or run for CooperativeAssignor? (I would probably rename it to something like `test_eager_stickiness_on_static_consumer_bounce` , make the use of RangeAssignor explicit, and then it looks clearer that we don't want to run such test on Cooper ativeAssignor or CONSUMER protocol. We do have other tests that ensure that static membership behaves as expected for the new protocol (ex. `test_fencing_static_consumer`), but agreed that the "owned partition not re-assigned for a static member that is bounced" is not covered in sys tests (not for CONSUMER, not for Cooperative either). We could think of a new test to cover that. The shape would be different I expect, because it would either have to rely on bouncing a member with assignment while having others with none, or ensuring that all members have at least 1 partition. It would also need a different, more complex delivery semantics validation, if any. I would just suggest a different Jira/PR for a new test, to be able to finalize migrating the current sys tests that apply to the new protocol. Makes sense? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] KAFKA-16465: Fix consumer sys test revocation validation [kafka]
lucasbru commented on code in PR #15778: URL: https://github.com/apache/kafka/pull/15778#discussion_r1579343899 ## tests/kafkatest/tests/client/consumer_test.py: ## @@ -242,16 +242,15 @@ def test_static_consumer_bounce(self, clean_shutdown, static_membership, bounce_ self.rolling_bounce_consumers(consumer, keep_alive=num_keep_alive, num_bounces=num_bounces) num_revokes_after_bounce = consumer.num_revokes_for_alive() - num_revokes_before_bounce - -check_condition = num_revokes_after_bounce != 0 + # under static membership, the live consumer shall not revoke any current running partitions, # since there is no global rebalance being triggered. if static_membership: -check_condition = num_revokes_after_bounce == 0 - -assert check_condition, \ -"Total revoked count %d does not match the expectation of having 0 revokes as %d" % \ -(num_revokes_after_bounce, check_condition) +assert num_revokes_after_bounce == 0, \ +"Unexpected revocation triggered when bouncing static member. Expecting 0 but had %d revocations" % num_revokes_after_bounce +elif consumer.is_eager(): +assert num_revokes_after_bounce != 0, \ Review Comment: No, lets not remove a variation. I'm just thinking about the CONSUMER protocol case. If a dynamic client can pass this test with `num_revokes_after_bounce = 0`, how meaningful is it really to check that static clients have `num_revokes_after_bounce = 0`. I could write a consumer that ignores the static membership configuration completely, and still pass this test, right? The test description writes ``` In order to make sure the behavior of static members are different from dynamic ones, we take both static and dynamic membership into this test suite. ``` But in the `CONSUMER` protocol, it seems the behavior isn't all that difference, at least if we only look at `num_revokes_after_bounce`. So maybe we should instead try to get the set of partitions and check that it didn't change? Let me know if I'm understanding something wrong. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] KAFKA-16465: Fix consumer sys test revocation validation [kafka]
lianetm commented on code in PR #15778: URL: https://github.com/apache/kafka/pull/15778#discussion_r1578363393 ## tests/kafkatest/tests/client/consumer_test.py: ## @@ -242,16 +242,15 @@ def test_static_consumer_bounce(self, clean_shutdown, static_membership, bounce_ self.rolling_bounce_consumers(consumer, keep_alive=num_keep_alive, num_bounces=num_bounces) num_revokes_after_bounce = consumer.num_revokes_for_alive() - num_revokes_before_bounce - -check_condition = num_revokes_after_bounce != 0 + # under static membership, the live consumer shall not revoke any current running partitions, # since there is no global rebalance being triggered. if static_membership: -check_condition = num_revokes_after_bounce == 0 - -assert check_condition, \ -"Total revoked count %d does not match the expectation of having 0 revokes as %d" % \ -(num_revokes_after_bounce, check_condition) +assert num_revokes_after_bounce == 0, \ +"Unexpected revocation triggered when bouncing static member. Expecting 0 but had %d revocations" % num_revokes_after_bounce +elif consumer.is_eager(): +assert num_revokes_after_bounce != 0, \ Review Comment: This test was not testing the cooperative case before because it was using the base [setup_consumer](https://github.com/apache/kafka/blob/b8b2415d5e006cf91c0f74dcf60b764933c9c1d0/tests/kafkatest/tests/verifiable_consumer_test.py#L56) that has a range assignor by default. We do have a way to detect rebalances (number of calls to partitions assigned), but that does not allow any special check for dynamic only, since static members will get those same calls to partitions assigned (the static membership is only making that the group assignment is not re-computed, so partition is not re-assigned while the session lasts, expecting that the member might come back, but from the callbacks POV, the static member that leaves gets the same as the dynamic one). I agree that the value of the dynamic + cooperative is questionable, but there is value in verifying the message delivery semantics further down, when checking the total consumed. The `test_consumer_bounce` does a very similar check for dynamic members btw, and again, focusing on the delivery semantics, but both tests are not exactly the same (mainly keeping 1 node alive or not), so I would lean towards not removing the combination from this static test, WDYT? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] KAFKA-16465: Fix consumer sys test revocation validation [kafka]
lianetm commented on code in PR #15778: URL: https://github.com/apache/kafka/pull/15778#discussion_r1578363393 ## tests/kafkatest/tests/client/consumer_test.py: ## @@ -242,16 +242,15 @@ def test_static_consumer_bounce(self, clean_shutdown, static_membership, bounce_ self.rolling_bounce_consumers(consumer, keep_alive=num_keep_alive, num_bounces=num_bounces) num_revokes_after_bounce = consumer.num_revokes_for_alive() - num_revokes_before_bounce - -check_condition = num_revokes_after_bounce != 0 + # under static membership, the live consumer shall not revoke any current running partitions, # since there is no global rebalance being triggered. if static_membership: -check_condition = num_revokes_after_bounce == 0 - -assert check_condition, \ -"Total revoked count %d does not match the expectation of having 0 revokes as %d" % \ -(num_revokes_after_bounce, check_condition) +assert num_revokes_after_bounce == 0, \ +"Unexpected revocation triggered when bouncing static member. Expecting 0 but had %d revocations" % num_revokes_after_bounce +elif consumer.is_eager(): +assert num_revokes_after_bounce != 0, \ Review Comment: This test was not testing the cooperative case before because it was using the base [setup_consumer](https://github.com/apache/kafka/blob/b8b2415d5e006cf91c0f74dcf60b764933c9c1d0/tests/kafkatest/tests/verifiable_consumer_test.py#L56) that has a range assignor by default. We do have a way to detect rebalances (number of calls to partitions assigned), but that does not allow any special check for dynamic only, since static members will get those same calls to partitions assigned (the static membership is only making that the group assignment is not re-computed, so partition is not re-assigned while the session lasts, expecting that the member might come back, but from the callbacks POV the static member that leaves gets the same as the dynamic one). I agree that the value of the dynamic + cooperative is questionable, but there is value in verifying the message delivery semantics further down, when checking the total consumed. The `test_consumer_bounce` does a very similar check for dynamic members btw, and again, focusing on the delivery semantics, but both tests are not exactly the same (mainly keeping 1 node alive or not), so I would lean towards not removing the combination from this static test, WDYT? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] KAFKA-16465: Fix consumer sys test revocation validation [kafka]
lianetm commented on code in PR #15778: URL: https://github.com/apache/kafka/pull/15778#discussion_r1578363393 ## tests/kafkatest/tests/client/consumer_test.py: ## @@ -242,16 +242,15 @@ def test_static_consumer_bounce(self, clean_shutdown, static_membership, bounce_ self.rolling_bounce_consumers(consumer, keep_alive=num_keep_alive, num_bounces=num_bounces) num_revokes_after_bounce = consumer.num_revokes_for_alive() - num_revokes_before_bounce - -check_condition = num_revokes_after_bounce != 0 + # under static membership, the live consumer shall not revoke any current running partitions, # since there is no global rebalance being triggered. if static_membership: -check_condition = num_revokes_after_bounce == 0 - -assert check_condition, \ -"Total revoked count %d does not match the expectation of having 0 revokes as %d" % \ -(num_revokes_after_bounce, check_condition) +assert num_revokes_after_bounce == 0, \ +"Unexpected revocation triggered when bouncing static member. Expecting 0 but had %d revocations" % num_revokes_after_bounce +elif consumer.is_eager(): +assert num_revokes_after_bounce != 0, \ Review Comment: This test was not testing the cooperative case before because it was using the base [setup_consumer](https://github.com/apache/kafka/blob/b8b2415d5e006cf91c0f74dcf60b764933c9c1d0/tests/kafkatest/tests/verifiable_consumer_test.py#L56) that has a range assignor by default. We do have a way to detect rebalances (number of calls to partitions assigned), but that does not allow any special check for dynamic only, since static members will get those same calls to partitions assigned (the static membership is only making that the group assignment is not re-computed, so partition is not re-assigned while the session lasts, expecting that the member might come back). I agree that the value of the dynamic + cooperative is questionable, but there is value in verifying the message delivery semantics further down, when checking the total consumed. The `test_consumer_bounce` does a very similar check for dynamic members btw, and again, focusing on the delivery semantics, but both tests are not exactly the same (mainly keeping 1 node alive or not), so I would lean towards not removing the combination from this static test, WDYT? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] KAFKA-16465: Fix consumer sys test revocation validation [kafka]
lianetm commented on code in PR #15778: URL: https://github.com/apache/kafka/pull/15778#discussion_r1578363393 ## tests/kafkatest/tests/client/consumer_test.py: ## @@ -242,16 +242,15 @@ def test_static_consumer_bounce(self, clean_shutdown, static_membership, bounce_ self.rolling_bounce_consumers(consumer, keep_alive=num_keep_alive, num_bounces=num_bounces) num_revokes_after_bounce = consumer.num_revokes_for_alive() - num_revokes_before_bounce - -check_condition = num_revokes_after_bounce != 0 + # under static membership, the live consumer shall not revoke any current running partitions, # since there is no global rebalance being triggered. if static_membership: -check_condition = num_revokes_after_bounce == 0 - -assert check_condition, \ -"Total revoked count %d does not match the expectation of having 0 revokes as %d" % \ -(num_revokes_after_bounce, check_condition) +assert num_revokes_after_bounce == 0, \ +"Unexpected revocation triggered when bouncing static member. Expecting 0 but had %d revocations" % num_revokes_after_bounce +elif consumer.is_eager(): +assert num_revokes_after_bounce != 0, \ Review Comment: This test was not testing the cooperative case before because it was using the base [setup_consumer](https://github.com/apache/kafka/blob/b8b2415d5e006cf91c0f74dcf60b764933c9c1d0/tests/kafkatest/tests/verifiable_consumer_test.py#L56) that has a range assignor by default. We do have a way to detect rebalances (number of calls to partitions assigned), but that does not allow any special check for dynamic only, since static members will get those same calls to partitions assigned (the static membership is only making that the partition is not re-assigned while the session lasts). I agree that the value of the dynamic + cooperative is questionable, but there is value in verifying the message delivery semantics further down, when checking the total consumed. The `test_consumer_bounce` does a very similar check for dynamic members btw, and again, focusing on the delivery semantics, but both tests are not exactly the same (mainly keeping 1 node alive or not), so I would lean towards not removing the combination from this static test, WDYT? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] KAFKA-16465: Fix consumer sys test revocation validation [kafka]
lianetm commented on code in PR #15778: URL: https://github.com/apache/kafka/pull/15778#discussion_r1578363393 ## tests/kafkatest/tests/client/consumer_test.py: ## @@ -242,16 +242,15 @@ def test_static_consumer_bounce(self, clean_shutdown, static_membership, bounce_ self.rolling_bounce_consumers(consumer, keep_alive=num_keep_alive, num_bounces=num_bounces) num_revokes_after_bounce = consumer.num_revokes_for_alive() - num_revokes_before_bounce - -check_condition = num_revokes_after_bounce != 0 + # under static membership, the live consumer shall not revoke any current running partitions, # since there is no global rebalance being triggered. if static_membership: -check_condition = num_revokes_after_bounce == 0 - -assert check_condition, \ -"Total revoked count %d does not match the expectation of having 0 revokes as %d" % \ -(num_revokes_after_bounce, check_condition) +assert num_revokes_after_bounce == 0, \ +"Unexpected revocation triggered when bouncing static member. Expecting 0 but had %d revocations" % num_revokes_after_bounce +elif consumer.is_eager(): +assert num_revokes_after_bounce != 0, \ Review Comment: This test was not testing the cooperative case before because it was using the base [setup_consumer](https://github.com/apache/kafka/blob/b8b2415d5e006cf91c0f74dcf60b764933c9c1d0/tests/kafkatest/tests/verifiable_consumer_test.py#L56) that has a range assignor by default. We do have a way to detect rebalances (number of calls to partitions assigned), but that does not allow any special check for dynamic only, since static members will get those same calls to partitions assigned (the static membership is only making that the partition is not re-assigned/revoked while the session lasts). I agree that the value of the dynamic + cooperative is questionable, but there is value in verifying the message delivery semantics further down, when checking the total consumed. The `test_consumer_bounce` does a very similar check for dynamic members btw, and again, focusing on the delivery semantics, but both tests are not exactly the same (mainly keeping 1 node alive or not), so I would lean towards not removing the combination from this static test, WDYT? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] KAFKA-16465: Fix consumer sys test revocation validation [kafka]
lucasbru commented on code in PR #15778: URL: https://github.com/apache/kafka/pull/15778#discussion_r1575801343 ## tests/kafkatest/tests/client/consumer_test.py: ## @@ -242,16 +242,15 @@ def test_static_consumer_bounce(self, clean_shutdown, static_membership, bounce_ self.rolling_bounce_consumers(consumer, keep_alive=num_keep_alive, num_bounces=num_bounces) num_revokes_after_bounce = consumer.num_revokes_for_alive() - num_revokes_before_bounce - -check_condition = num_revokes_after_bounce != 0 + # under static membership, the live consumer shall not revoke any current running partitions, # since there is no global rebalance being triggered. if static_membership: -check_condition = num_revokes_after_bounce == 0 - -assert check_condition, \ -"Total revoked count %d does not match the expectation of having 0 revokes as %d" % \ -(num_revokes_after_bounce, check_condition) +assert num_revokes_after_bounce == 0, \ +"Unexpected revocation triggered when bouncing static member. Expecting 0 but had %d revocations" % num_revokes_after_bounce +elif consumer.is_eager(): +assert num_revokes_after_bounce != 0, \ Review Comment: Did this test work before for `CooperativeStickyAssignor`, or was this case not tested? Also, is there _any_ way now to detect that without static membership, a rebalance happens during the roll? It seems like in the case where static_membership == false, and is_eager == false, we are not asserting anything now, so I wonder how much value is still there in that combination. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] KAFKA-16465: Fix consumer sys test revocation validation [kafka]
lianetm commented on PR #15778: URL: https://github.com/apache/kafka/pull/15778#issuecomment-2069255159 Hey @lucasbru , could you take a look when you have a chance? Similar issue to the one on https://github.com/apache/kafka/pull/15661. With this I get successful runs of the test (some flaky combinations, mostly around failures already filed/in-flight like failures removing unexpected assignment from list, timeout shutting down) SESSION REPORT (ALL TESTS) ducktape version: 0.11.4 session_id: 2024-04-22--006 run time: 66 minutes 30.269 seconds tests run:16 passed: 14 flaky:2 failed: 0 ignored: 0 Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[PR] KAFKA-16465: Fix consumer sys test revocation validation [kafka]
lianetm opened a new pull request, #15778: URL: https://github.com/apache/kafka/pull/15778 This fixes a consumer system test that was failing for the new protocol. The failure was because the test was expecting the eager behaviour of partitions being revoked on every rebalance, and it was wrongfully applying it to the runs with the new protocol too. This same situation was previously identified and fixed in other parts of the sys test with https://github.com/apache/kafka/pull/15661. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org