Yida Wu has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/23204 )

Change subject: IMPALA-14234: Fix a version mismatch DCHECK hit when admissiond 
cluster membership recovering
......................................................................


Patch Set 4:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/23204/2/be/src/scheduling/cluster-membership-mgr.cc
File be/src/scheduling/cluster-membership-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/23204/2/be/src/scheduling/cluster-membership-mgr.cc@300
PS2, Line 300:       // to current membership and cleared.
             :       new_state = recovering_membership_;
             :       if (current_membership_ != nullptr
             :
> I think this is possible if we get a full membership update which starts fr
Maybe I can file another jira for the timestamp based approach for the 
membership versioning.


http://gerrit.cloudera.org:8080/#/c/23204/3/be/src/scheduling/cluster-membership-mgr.cc
File be/src/scheduling/cluster-membership-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/23204/3/be/src/scheduling/cluster-membership-mgr.cc@305
PS3, Line 305:       }
> I'm not quite understanding why the if statement is necessary.  Wouldn't it
It can have a loop stuff for the recovery process, in line 313 it will bump the 
version , "new_state->version += 1;", so i guess we want it to keep increasing 
during the recovery process but not smaller than the current one, then when it 
quits the recovery process, it will be overwritten to the current membership.


http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py
File tests/custom_cluster/test_admission_controller.py:

http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py@2238
PS3, Line 2238:     This sequence can trigger a DCHECK hit in AdmissionD.
> Any idea what percentage of the time the DCHECK fails?  If we know that num
>From my experience, the DCHECK fails more than 50% of the time, often on the 
>first run. Since the test can run for a few minutes and the bug isn’t 
>critical, it may not be worth the added time to run it multiple times. A 
>single run should be sufficient to catch the issue in builds.


http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py@2243
PS3, Line 2243: select count(*) from functional.alltypes limit 1
> A coordinator-only query such as "select 1" would be better to eliminate an
"select 1" is a trivial query, can skip the admissiond. I think it might be 
better to use a bit complex short query, which could be better to ensure the 
cluster is working as expected after all the things.


http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py@2254
PS3, Line 2254: create_hs2_client()
> Nit:  Beeswax is deprecated, use HS2 if possible.  Also in line 2259.
Done


http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py@2271
PS3, Line 2271:     statestore.wait_for_exit()
> Use statestore.wait_for_exit() instead of sleep
Done


http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py@2274
PS3, Line 2274:     coord1.wait_for_exit()
> Use coord1.wait_for_exit() instead of sleep
Done


http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py@2278
PS3, Line 2278:     handle3 = client2.execute_async(short_query)
> Does this query need a slight delay to have a better chance of failing the
I think the current step is intentional, the error typically occurs when the 
new query and the statestore happen to start around the same time. Not sure if 
AC_BEFORE_ADMISSION would help to repro, but since the DCHECK fails more than 
50% of the time, maybe it is good enough to catch it?


http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py@2305
PS3, Line 2305:     except Exception:
> Why the need to restart coord1?  This test class appears to be restarting t
It might fail the teardown check earlier, but seems adding the closing handle 
can help. Removed the coord start stuff.



--
To view, visit http://gerrit.cloudera.org:8080/23204
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iea77347bd4775abd7866817146e326c7c5042f5e
Gerrit-Change-Number: 23204
Gerrit-PatchSet: 4
Gerrit-Owner: Yida Wu <[email protected]>
Gerrit-Reviewer: Abhishek Rawat <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Jason Fehr <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Reviewer: Wenzhe Zhou <[email protected]>
Gerrit-Reviewer: Yida Wu <[email protected]>
Gerrit-Comment-Date: Mon, 04 Aug 2025 20:59:07 +0000
Gerrit-HasComments: Yes

Reply via email to