Yida Wu has posted comments on this change. ( http://gerrit.cloudera.org:8080/23204 )
Change subject: IMPALA-14234: Fix a version mismatch DCHECK hit when admissiond cluster membership recovering ...................................................................... Patch Set 4: (9 comments) http://gerrit.cloudera.org:8080/#/c/23204/2/be/src/scheduling/cluster-membership-mgr.cc File be/src/scheduling/cluster-membership-mgr.cc: http://gerrit.cloudera.org:8080/#/c/23204/2/be/src/scheduling/cluster-membership-mgr.cc@300 PS2, Line 300: // to current membership and cleared. : new_state = recovering_membership_; : if (current_membership_ != nullptr : > I think this is possible if we get a full membership update which starts fr Maybe I can file another jira for the timestamp based approach for the membership versioning. http://gerrit.cloudera.org:8080/#/c/23204/3/be/src/scheduling/cluster-membership-mgr.cc File be/src/scheduling/cluster-membership-mgr.cc: http://gerrit.cloudera.org:8080/#/c/23204/3/be/src/scheduling/cluster-membership-mgr.cc@305 PS3, Line 305: } > I'm not quite understanding why the if statement is necessary. Wouldn't it It can have a loop stuff for the recovery process, in line 313 it will bump the version , "new_state->version += 1;", so i guess we want it to keep increasing during the recovery process but not smaller than the current one, then when it quits the recovery process, it will be overwritten to the current membership. http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py File tests/custom_cluster/test_admission_controller.py: http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py@2238 PS3, Line 2238: This sequence can trigger a DCHECK hit in AdmissionD. > Any idea what percentage of the time the DCHECK fails? If we know that num >From my experience, the DCHECK fails more than 50% of the time, often on the >first run. Since the test can run for a few minutes and the bug isn’t >critical, it may not be worth the added time to run it multiple times. A >single run should be sufficient to catch the issue in builds. http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py@2243 PS3, Line 2243: select count(*) from functional.alltypes limit 1 > A coordinator-only query such as "select 1" would be better to eliminate an "select 1" is a trivial query, can skip the admissiond. I think it might be better to use a bit complex short query, which could be better to ensure the cluster is working as expected after all the things. http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py@2254 PS3, Line 2254: create_hs2_client() > Nit: Beeswax is deprecated, use HS2 if possible. Also in line 2259. Done http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py@2271 PS3, Line 2271: statestore.wait_for_exit() > Use statestore.wait_for_exit() instead of sleep Done http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py@2274 PS3, Line 2274: coord1.wait_for_exit() > Use coord1.wait_for_exit() instead of sleep Done http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py@2278 PS3, Line 2278: handle3 = client2.execute_async(short_query) > Does this query need a slight delay to have a better chance of failing the I think the current step is intentional, the error typically occurs when the new query and the statestore happen to start around the same time. Not sure if AC_BEFORE_ADMISSION would help to repro, but since the DCHECK fails more than 50% of the time, maybe it is good enough to catch it? http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py@2305 PS3, Line 2305: except Exception: > Why the need to restart coord1? This test class appears to be restarting t It might fail the teardown check earlier, but seems adding the closing handle can help. Removed the coord start stuff. -- To view, visit http://gerrit.cloudera.org:8080/23204 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iea77347bd4775abd7866817146e326c7c5042f5e Gerrit-Change-Number: 23204 Gerrit-PatchSet: 4 Gerrit-Owner: Yida Wu <[email protected]> Gerrit-Reviewer: Abhishek Rawat <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Jason Fehr <[email protected]> Gerrit-Reviewer: Riza Suminto <[email protected]> Gerrit-Reviewer: Wenzhe Zhou <[email protected]> Gerrit-Reviewer: Yida Wu <[email protected]> Gerrit-Comment-Date: Mon, 04 Aug 2025 20:59:07 +0000 Gerrit-HasComments: Yes
