[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-13 Thread Wenzhe Zhou (Code Review)
Wenzhe Zhou has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/21520


Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..

IMPALA-13159: Fix query cancellation caused by statestore failover

A momentary inconsistent cluster membership state after statestore
failover results in query cancellation.
We already have code to handle inconsistent cluster membership after
statestore restarting. This patch adds code to handle inconsistent
cluster membership state after statestore failover in the same way.

Testing:
 - Added a new test case to verify that inconsistent cluster
   membership after statestore failover will not result in query
   cancellation.
 - Fixed closing client issue for Catalogd HA test case
   test_catalogd_failover_with_sync_ddl when the test fails.
 - Passed core test.

Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
---
M be/src/scheduling/cluster-membership-mgr.cc
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore-subscriber.h
M tests/custom_cluster/test_catalogd_ha.py
M tests/custom_cluster/test_statestored_ha.py
5 files changed, 111 insertions(+), 30 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/21520/1
--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 1
Gerrit-Owner: Wenzhe Zhou 


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-13 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 1:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/16345/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 1
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Comment-Date: Fri, 14 Jun 2024 04:22:25 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-14 Thread Riza Suminto (Code Review)
Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 1:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/21520/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/21520/1//COMMIT_MSG@12
PS1, Line 12: This patch adds code to handle inconsistent
: cluster membership state after statestore failover in the same 
way.
Please clarify what is the expected behavior. When is query should cancel and 
not cancel?
Is both failover scenario and regular StateStore restart scenario should not 
cancel RUNNING query?


http://gerrit.cloudera.org:8080/#/c/21520/1/be/src/statestore/statestore-subscriber.cc
File be/src/statestore/statestore-subscriber.cc:

http://gerrit.cloudera.org:8080/#/c/21520/1/be/src/statestore/statestore-subscriber.cc@1095
PS1, Line 1095: bool has_failed_before = connection_failure_metric_->GetValue() 
> 0;
  :   bool in_failed_grace_period = 
MilliSecondsSinceLastRegistration()
  :   < FLAGS_statestore_subscriber_recovery_grace_period_ms;
I think 'has_disconnect_before' and 'in_disconnect_grace_period' is a better 
name to distinguish from failover scenario.

My understanding is that, disconnect only means temporary network issue to 
StateStore, and is not changing from active StateStore to standby StateStore.


http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py
File tests/custom_cluster/test_statestored_ha.py:

http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py@677
PS1, Line 677: # Wait for long enough for the standby statestored to detect the 
failure of active
 :   # statestored and assign itself in active role.
Is this 2s (statestore_peer_timeout_seconds)?


http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py@688
PS1, Line 688: # Now kill a backend, and make sure the query fails.
Why not let query finish?
I thought this test want to show that query completes and not impacted by 
StateStore failover.



--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 1
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Comment-Date: Fri, 14 Jun 2024 18:12:59 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-14 Thread Wenzhe Zhou (Code Review)
Wenzhe Zhou has uploaded a new patch set (#2). ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..

IMPALA-13159: Fix query cancellation caused by statestore failover

A momentary inconsistent cluster membership state after statestore
failover results in query cancellation.
We already have code to handle inconsistent cluster membership after
statestore restarting by defining a post-recovery grace period. During
the grace period, don't update the current cluster membership so that
the inconsistent membership will not be used to cancel queries on
coordinators and executors.
This patch handles inconsistent cluster membership state after
statestore failover in the same way.

Testing:
 - Added a new test case to verify that inconsistent cluster
   membership after statestore failover will not result in query
   cancellation.
 - Fixed closing client issue for Catalogd HA test case
   test_catalogd_failover_with_sync_ddl when the test fails.
 - Passed core test.

Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
---
M be/src/scheduling/cluster-membership-mgr.cc
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore-subscriber.h
M tests/custom_cluster/test_catalogd_ha.py
M tests/custom_cluster/test_statestored_ha.py
5 files changed, 112 insertions(+), 31 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/21520/2
--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 2
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-14 Thread Wenzhe Zhou (Code Review)
Wenzhe Zhou has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py
File tests/custom_cluster/test_statestored_ha.py:

http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py@688
PS1, Line 688: # Now kill a backend, and make sure the query fails.
> If the query is cancelled, its state will be changed to QueryState.CANCELLE
Finishing this query takes about 120 seconds on my local machine. It seems too 
long for one test case. Without waiting to finish, it takes about 19 seconds.



--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 2
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Fri, 14 Jun 2024 20:25:40 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-14 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 2:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/16356/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 2
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Fri, 14 Jun 2024 20:45:03 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-14 Thread Riza Suminto (Code Review)
Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 2:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/21520/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/21520/1//COMMIT_MSG@12
PS1, Line 12: y defining a post-recovery grace period. During
: the grace period, don't update the current cluster membership so 
th
> Define a post-recovery grace period after statestore has been disconnected
Thanks!
Follow up question: after grace period is over, is it the case that StateStore 
will send new cluster membership update to Coordinator and Coordinator will 
cancel the query because cluster membership has changed since query start 
RUNNING?


http://gerrit.cloudera.org:8080/#/c/21520/1/be/src/statestore/statestore-subscriber.cc
File be/src/statestore/statestore-subscriber.cc:

http://gerrit.cloudera.org:8080/#/c/21520/1/be/src/statestore/statestore-subscriber.cc@1095
PS1, Line 1095: bool has_disconnect_before = 
connection_failure_metric_->GetValue() > 0;
  :   bool in_disconnect_grace_period = 
MilliSecondsSinceLastRegistration()
  :   < FLAGS_statestore_subscriber_recovery_grace_period_ms;
> Renamed two variables.
Done


http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py
File tests/custom_cluster/test_statestored_ha.py:

http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py@653
PS1, Line 653: """Test that a momentary inconsistent cluster membership state 
after statestore
 : service fail-over will not result in query cancellation. 
Also make sure that query
 : get cancelled if a backend actually went down."""
Is there an existing opposite test that shows query cancelled after grace 
period due to cluster membership difference?


http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py@688
PS1, Line 688: # Now kill a backend, and make sure the query fails.
> Finishing this query takes about 120 seconds on my local machine. It seems
In that case, is it better to cancel through query handle instead?

client.cancel(handle)

In this test, is the cancelation reason because cluster membership change due 
to stopped backend, or just because row transmission broken?



--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 2
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Fri, 14 Jun 2024 20:58:54 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-14 Thread Michael Smith (Code Review)
Michael Smith has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/21520/2/be/src/statestore/statestore-subscriber.h
File be/src/statestore/statestore-subscriber.h:

http://gerrit.cloudera.org:8080/#/c/21520/2/be/src/statestore/statestore-subscriber.h@345
PS2, Line 345:   return MonotonicMillis() - last_failover_time_.Load();
nit: Might be nice to DCHECK that the result is not negative.


http://gerrit.cloudera.org:8080/#/c/21520/2/be/src/statestore/statestore-subscriber.cc
File be/src/statestore/statestore-subscriber.cc:

http://gerrit.cloudera.org:8080/#/c/21520/2/be/src/statestore/statestore-subscriber.cc@1098
PS2, Line 1098:   bool has_failover_before = last_failover_time_.Load() > 0;
Do we really need to check this? If it's 0, then MilliSecondsSinceLastFailover 
will be a very large number and not less than 
statestore_subscriber_recovery_grace_period_ms. Right now it means we read from 
an atomic twice in a row.


http://gerrit.cloudera.org:8080/#/c/21520/2/tests/custom_cluster/test_statestored_ha.py
File tests/custom_cluster/test_statestored_ha.py:

http://gerrit.cloudera.org:8080/#/c/21520/2/tests/custom_cluster/test_statestored_ha.py@692
PS2, Line 692: assert False, "Query expected to fail"
Can we verify that it had to wait longer than the CANCELLATION_GRACE_PERIOD_S 
for the query to fail?



--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 2
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Fri, 14 Jun 2024 21:25:21 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-14 Thread Riza Suminto (Code Review)
Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py
File tests/custom_cluster/test_statestored_ha.py:

http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py@688
PS1, Line 688: # Now kill a backend, and make sure the query fails.
> In that case, is it better to cancel through query handle instead?
What will happen if, after client.execute_async(slow_query), you start 4th 
Impalad, and then kill that 4th Impalad at this point?
Will that stop the query, even if the 4th Impalad is not part of query 
execution?



--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 2
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Fri, 14 Jun 2024 22:42:26 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-14 Thread Wenzhe Zhou (Code Review)
Wenzhe Zhou has uploaded a new patch set (#3). ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..

IMPALA-13159: Fix query cancellation caused by statestore failover

A momentary inconsistent cluster membership state after statestore
failover results in query cancellation.
We already have code to handle inconsistent cluster membership after
statestore restarting by defining a post-recovery grace period. During
the grace period, don't update the current cluster membership so that
the inconsistent membership will not be used to cancel queries on
coordinators and executors.
This patch handles inconsistent cluster membership state after
statestore failover in the same way.

Testing:
 - Added a new test case to verify that inconsistent cluster
   membership after statestore failover will not result in query
   cancellation.
 - Fixed closing client issue for Catalogd HA test case
   test_catalogd_failover_with_sync_ddl when the test fails.
 - Passed core test.

Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
---
M be/src/scheduling/cluster-membership-mgr.cc
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore-subscriber.h
M tests/custom_cluster/test_catalogd_ha.py
M tests/custom_cluster/test_statestored_ha.py
5 files changed, 113 insertions(+), 31 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/21520/3
--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 3
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-14 Thread Wenzhe Zhou (Code Review)
Wenzhe Zhou has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 3:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/21520/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/21520/1//COMMIT_MSG@12
PS1, Line 12: y defining a post-recovery grace period. During
: the grace period, don't update the current cluster membership so 
th
> Thanks!
Active statestored keeps sending cluster membership periodically to all 
subscribers, including coordinators/executors. Sending intervals are 100 ms by 
default. After grace period, the collected cluster membership should be 
consistent and it's safe to use the current cluster membership to detect if a 
node is unhealthy and cancel running queries which are assigned to unhealthy 
nodes.


http://gerrit.cloudera.org:8080/#/c/21520/2/be/src/statestore/statestore-subscriber.h
File be/src/statestore/statestore-subscriber.h:

http://gerrit.cloudera.org:8080/#/c/21520/2/be/src/statestore/statestore-subscriber.h@345
PS2, Line 345:   int64_t time_ms = MonotonicMillis() - 
last_failover_time_.Load();
> nit: Might be nice to DCHECK that the result is not negative.
Added DCHECK


http://gerrit.cloudera.org:8080/#/c/21520/2/be/src/statestore/statestore-subscriber.cc
File be/src/statestore/statestore-subscriber.cc:

http://gerrit.cloudera.org:8080/#/c/21520/2/be/src/statestore/statestore-subscriber.cc@1098
PS2, Line 1098:   bool in_failover_grace_period = 
MilliSecondsSinceLastFailover()
> Do we really need to check this? If it's 0, then MilliSecondsSinceLastFailo
Right, removed has_failover_before


http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py
File tests/custom_cluster/test_statestored_ha.py:

http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py@653
PS1, Line 653: """Test that a momentary inconsistent cluster membership state 
after statestore
 : service fail-over will not result in query cancellation. 
Also make sure that query
 : get cancelled if a backend actually went down aft
> Is there an existing opposite test that shows query cancelled after grace p
Second part of this test case shows query cancelled after grace period due to 
cluster membership change.
Updated comments.


http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py@688
PS1, Line 688: # Now kill a backend, and make sure the query fails.
> In that case, is it better to cancel through query handle instead?
Cancellation reason is cluster membership change after grace period.


http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py@688
PS1, Line 688: # Now kill a backend, and make sure the query fails.
> What will happen if, after client.execute_async(slow_query), you start 4th
The query will not be affected if 4th impalad has no scheduled task.


http://gerrit.cloudera.org:8080/#/c/21520/2/tests/custom_cluster/test_statestored_ha.py
File tests/custom_cluster/test_statestored_ha.py:

http://gerrit.cloudera.org:8080/#/c/21520/2/tests/custom_cluster/test_statestored_ha.py@692
PS2, Line 692: assert False, "Query expected to fail"
> Can we verify that it had to wait longer than the CANCELLATION_GRACE_PERIOD
Yes, need to add another test case.



--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 3
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Fri, 14 Jun 2024 23:51:47 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-14 Thread Michael Smith (Code Review)
Michael Smith has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/21520/2/be/src/statestore/statestore-subscriber.h
File be/src/statestore/statestore-subscriber.h:

http://gerrit.cloudera.org:8080/#/c/21520/2/be/src/statestore/statestore-subscriber.h@345
PS2, Line 345:   int64_t time_ms = MonotonicMillis() - 
last_failover_time_.Load();
> nit: Might be nice to DCHECK that the result is not negative.
Done


http://gerrit.cloudera.org:8080/#/c/21520/2/be/src/statestore/statestore-subscriber.cc
File be/src/statestore/statestore-subscriber.cc:

http://gerrit.cloudera.org:8080/#/c/21520/2/be/src/statestore/statestore-subscriber.cc@1098
PS2, Line 1098:   bool in_failover_grace_period = 
MilliSecondsSinceLastFailover()
> Do we really need to check this? If it's 0, then MilliSecondsSinceLastFailo
Done



--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 3
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Fri, 14 Jun 2024 23:51:53 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-14 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 3:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/16357/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 3
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Sat, 15 Jun 2024 00:16:07 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-14 Thread Riza Suminto (Code Review)
Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 3: Code-Review+1

(2 comments)

http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py
File tests/custom_cluster/test_statestored_ha.py:

http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py@653
PS1, Line 653: """Test that a momentary inconsistent cluster membership state 
after statestore
 : service fail-over will not result in query cancellation. 
Also make sure that query
 : get cancelled if a backend actually went down aft
> Second part of this test case shows query cancelled after grace period due
Ack


http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py@688
PS1, Line 688: # Now kill a backend, and make sure the query fails.
> The query will not be affected if 4th impalad has no scheduled task.
Ack



--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 3
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Sat, 15 Jun 2024 01:32:23 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-15 Thread Abhishek Rawat (Code Review)
Abhishek Rawat has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/21520/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/21520/3//COMMIT_MSG@12
PS3, Line 12: During
: the grace period, don't update the current cluster membership so 
that
: the inconsistent membership will not be used to cancel queries on
: coordinators and executors.
Does this also mean that during the grace period, newly added executors won't 
be part of the cluster?



--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 3
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Sat, 15 Jun 2024 18:06:18 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-15 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/21520/4/tests/custom_cluster/test_statestored_ha.py
File tests/custom_cluster/test_statestored_ha.py:

http://gerrit.cloudera.org:8080/#/c/21520/4/tests/custom_cluster/test_statestored_ha.py@727
PS4, Line 727: <
flake8: W504 line break after binary operator



--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 4
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Sun, 16 Jun 2024 01:20:55 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-15 Thread Wenzhe Zhou (Code Review)
Wenzhe Zhou has uploaded a new patch set (#4). ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..

IMPALA-13159: Fix query cancellation caused by statestore failover

A momentary inconsistent cluster membership state after statestore
failover results in query cancellation.
We already have code to handle inconsistent cluster membership after
statestore restarting by defining a post-recovery grace period. During
the grace period, don't update the current cluster membership so that
the inconsistent membership will not be used to cancel queries on
coordinators and executors.
This patch handles inconsistent cluster membership state after
statestore failover in the same way.

Testing:
 - Added a new test case to verify that inconsistent cluster
   membership after statestore failover will not result in query
   cancellation.
 - Fixed closing client issue for Catalogd HA test case
   test_catalogd_failover_with_sync_ddl when the test fails.
 - Passed core test.

Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
---
M be/src/scheduling/cluster-membership-mgr.cc
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore-subscriber.h
M tests/custom_cluster/test_catalogd_ha.py
M tests/custom_cluster/test_statestored_ha.py
5 files changed, 156 insertions(+), 31 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/21520/4
--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 4
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-15 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 4:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/16359/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 4
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Sun, 16 Jun 2024 01:43:55 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-15 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/21520/5/tests/custom_cluster/test_statestored_ha.py
File tests/custom_cluster/test_statestored_ha.py:

http://gerrit.cloudera.org:8080/#/c/21520/5/tests/custom_cluster/test_statestored_ha.py@727
PS5, Line 727: \
flake8: E502 the backslash is redundant between brackets



--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 5
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Sun, 16 Jun 2024 02:47:15 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-15 Thread Wenzhe Zhou (Code Review)
Wenzhe Zhou has uploaded a new patch set (#5). ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..

IMPALA-13159: Fix query cancellation caused by statestore failover

A momentary inconsistent cluster membership state after statestore
failover results in query cancellation.
We already have code to handle inconsistent cluster membership after
statestore restarting by defining a post-recovery grace period. During
the grace period, don't update the current cluster membership so that
the inconsistent membership will not be used to cancel queries on
coordinators and executors.
This patch handles inconsistent cluster membership state after
statestore failover in the same way.

Testing:
 - Added a new test case to verify that inconsistent cluster
   membership after statestore failover will not result in query
   cancellation.
 - Fixed closing client issue for Catalogd HA test case
   test_catalogd_failover_with_sync_ddl when the test fails.
 - Passed core test.

Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
---
M be/src/scheduling/cluster-membership-mgr.cc
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore-subscriber.h
M tests/custom_cluster/test_catalogd_ha.py
M tests/custom_cluster/test_statestored_ha.py
5 files changed, 156 insertions(+), 31 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/21520/5
--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 5
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-15 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/21520/6/tests/custom_cluster/test_statestored_ha.py
File tests/custom_cluster/test_statestored_ha.py:

http://gerrit.cloudera.org:8080/#/c/21520/6/tests/custom_cluster/test_statestored_ha.py@727
PS6, Line 727: +
flake8: W504 line break after binary operator



--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 6
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Sun, 16 Jun 2024 02:59:31 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-15 Thread Wenzhe Zhou (Code Review)
Wenzhe Zhou has uploaded a new patch set (#6). ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..

IMPALA-13159: Fix query cancellation caused by statestore failover

A momentary inconsistent cluster membership state after statestore
failover results in query cancellation.
We already have code to handle inconsistent cluster membership after
statestore restarting by defining a post-recovery grace period. During
the grace period, don't update the current cluster membership so that
the inconsistent membership will not be used to cancel queries on
coordinators and executors.
This patch handles inconsistent cluster membership state after
statestore failover in the same way.

Testing:
 - Added a new test case to verify that inconsistent cluster
   membership after statestore failover will not result in query
   cancellation.
 - Fixed closing client issue for Catalogd HA test case
   test_catalogd_failover_with_sync_ddl when the test fails.
 - Passed core test.

Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
---
M be/src/scheduling/cluster-membership-mgr.cc
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore-subscriber.h
M tests/custom_cluster/test_catalogd_ha.py
M tests/custom_cluster/test_statestored_ha.py
5 files changed, 156 insertions(+), 31 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/21520/6
--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 6
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-15 Thread Wenzhe Zhou (Code Review)
Wenzhe Zhou has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/21520/5/tests/custom_cluster/test_statestored_ha.py
File tests/custom_cluster/test_statestored_ha.py:

http://gerrit.cloudera.org:8080/#/c/21520/5/tests/custom_cluster/test_statestored_ha.py@727
PS5, Line 727:
> flake8: E502 the backslash is redundant between brackets
Done



--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 6
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Sun, 16 Jun 2024 02:58:15 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-15 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 5:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/16360/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 5
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Sun, 16 Jun 2024 03:10:47 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-15 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 6:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/16361/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 6
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Sun, 16 Jun 2024 03:20:28 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-15 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 7:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/10723/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 7
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Sun, 16 Jun 2024 03:48:22 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-15 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 7: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/10723/


--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 7
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Sun, 16 Jun 2024 03:48:22 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-15 Thread Wenzhe Zhou (Code Review)
Wenzhe Zhou has removed a vote on this change.

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Removed Verified-1 by Impala Public Jenkins 
--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 7
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-17 Thread Wenzhe Zhou (Code Review)
Wenzhe Zhou has uploaded a new patch set (#8). ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..

IMPALA-13159: Fix query cancellation caused by statestore failover

A momentary inconsistent cluster membership state after statestore
failover results in query cancellation.
We already have code to handle inconsistent cluster membership after
statestore restarting by defining a post-recovery grace period. During
the grace period, don't update the current cluster membership so that
the inconsistent membership will not be used to cancel queries on
coordinators and executors.
This patch handles inconsistent cluster membership state after
statestore failover in the same way.

Testing:
 - Added a new test case to verify that inconsistent cluster
   membership after statestore failover will not result in query
   cancellation.
 - Fixed closing client issue for Catalogd HA test case
   test_catalogd_failover_with_sync_ddl when the test fails.
 - Passed core test.

Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
---
M be/src/scheduling/cluster-membership-mgr.cc
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore-subscriber.h
M tests/custom_cluster/test_catalogd_ha.py
M tests/custom_cluster/test_statestored_ha.py
5 files changed, 157 insertions(+), 31 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/21520/8
--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 8
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-17 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/21520/8/tests/custom_cluster/test_statestored_ha.py
File tests/custom_cluster/test_statestored_ha.py:

http://gerrit.cloudera.org:8080/#/c/21520/8/tests/custom_cluster/test_statestored_ha.py@727
PS8, Line 727: +
flake8: W504 line break after binary operator



--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 8
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Mon, 17 Jun 2024 14:26:32 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-17 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 8:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/16367/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 8
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Mon, 17 Jun 2024 14:49:38 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-17 Thread Wenzhe Zhou (Code Review)
Wenzhe Zhou has uploaded a new patch set (#9). ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..

IMPALA-13159: Fix query cancellation caused by statestore failover

A momentary inconsistent cluster membership state after statestore
failover results in query cancellation.
We already have code to handle inconsistent cluster membership after
statestore restarting by defining a post-recovery grace period. During
the grace period, don't update the current cluster membership so that
the inconsistent membership will not be used to cancel queries on
coordinators and executors.
This patch handles inconsistent cluster membership state after
statestore failover in the same way.

Testing:
 - Added a new test case to verify that inconsistent cluster
   membership after statestore failover will not result in query
   cancellation.
 - Fixed closing client issue for Catalogd HA test case
   test_catalogd_failover_with_sync_ddl when the test fails.
 - Passed core test.

Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
---
M be/src/scheduling/cluster-membership-mgr.cc
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore-subscriber.h
M tests/custom_cluster/test_catalogd_ha.py
M tests/custom_cluster/test_statestored_ha.py
5 files changed, 157 insertions(+), 31 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/21520/9
--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 9
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-17 Thread Wenzhe Zhou (Code Review)
Wenzhe Zhou has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/21520/8/tests/custom_cluster/test_statestored_ha.py
File tests/custom_cluster/test_statestored_ha.py:

http://gerrit.cloudera.org:8080/#/c/21520/8/tests/custom_cluster/test_statestored_ha.py@727
PS8, Line 727: P
> flake8: W504 line break after binary operator
Done



--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 9
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Mon, 17 Jun 2024 16:52:20 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-17 Thread Riza Suminto (Code Review)
Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 9: Code-Review+1

(2 comments)

http://gerrit.cloudera.org:8080/#/c/21520/9/tests/custom_cluster/test_statestored_ha.py
File tests/custom_cluster/test_statestored_ha.py:

http://gerrit.cloudera.org:8080/#/c/21520/9/tests/custom_cluster/test_statestored_ha.py@704
PS9, Line 704: sleep(1)
nit: why sleep(1) when wait_until_ready=True above?


http://gerrit.cloudera.org:8080/#/c/21520/9/tests/custom_cluster/test_statestored_ha.py@705
PS9, Line 705:   assert(not 
statestore_service_0.get_metric_value("statestore.active-status"))
 :   
assert(statestore_service_1.get_metric_value("statestore.active-status"))
nit: These assertion and others about active-status are not really 
straightforward to understand. Adding human-readable error message will be 
helpful.

  assert(not 
statestore_service_0.get_metric_value("statestore.active-status")), 'First 
statestored must not be active'
  
assert(statestore_service_1.get_metric_value("statestore.active-status")), 
'Second statestored must be active'



--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 9
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Mon, 17 Jun 2024 17:14:15 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-17 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 9:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/16372/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 9
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Mon, 17 Jun 2024 17:14:34 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-17 Thread Wenzhe Zhou (Code Review)
Wenzhe Zhou has uploaded a new patch set (#10). ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..

IMPALA-13159: Fix query cancellation caused by statestore failover

A momentary inconsistent cluster membership state after statestore
failover results in query cancellation.
We already have code to handle inconsistent cluster membership after
statestore restarting by defining a post-recovery grace period. During
the grace period, don't update the current cluster membership so that
the inconsistent membership will not be used to cancel queries on
coordinators and executors.
This patch handles inconsistent cluster membership state after
statestore failover in the same way.

Testing:
 - Added a new test case to verify that inconsistent cluster
   membership after statestore failover will not result in query
   cancellation.
 - Fixed closing client issue for Catalogd HA test case
   test_catalogd_failover_with_sync_ddl when the test fails.
 - Passed core test.

Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
---
M be/src/scheduling/cluster-membership-mgr.cc
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore-subscriber.h
M tests/custom_cluster/test_catalogd_ha.py
M tests/custom_cluster/test_statestored_ha.py
5 files changed, 165 insertions(+), 31 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/21520/10
--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 10
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-17 Thread Wenzhe Zhou (Code Review)
Wenzhe Zhou has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 10:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/21520/9/tests/custom_cluster/test_statestored_ha.py
File tests/custom_cluster/test_statestored_ha.py:

http://gerrit.cloudera.org:8080/#/c/21520/9/tests/custom_cluster/test_statestored_ha.py@704
PS9, Line 704:
> nit: why sleep(1) when wait_until_ready=True above?
add a buffer time for metrics to be updated on statestored.


http://gerrit.cloudera.org:8080/#/c/21520/9/tests/custom_cluster/test_statestored_ha.py@705
PS9, Line 705:   # Restart original active statestored. Verify that the 
statestored does not resume
 :   # its active role.
> nit: These assertion and others about active-status are not really straight
added error messages for assertions.



--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 10
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Mon, 17 Jun 2024 17:49:12 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-17 Thread Riza Suminto (Code Review)
Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 10: Code-Review+1


--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 10
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Mon, 17 Jun 2024 17:55:56 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-17 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 10:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/16373/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 10
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Mon, 17 Jun 2024 18:13:00 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-17 Thread Andrew Sherman (Code Review)
Andrew Sherman has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 9:

(1 comment)

Quick question

http://gerrit.cloudera.org:8080/#/c/21520/9/be/src/statestore/statestore-subscriber.cc
File be/src/statestore/statestore-subscriber.cc:

http://gerrit.cloudera.org:8080/#/c/21520/9/be/src/statestore/statestore-subscriber.cc@1098
PS9, Line 1098:   bool in_failover_grace_period = 
MilliSecondsSinceLastFailover()
At startup last_failover_time_ is zero, so for the first 
statestore_subscriber_recovery_grace_period_ms we are considered to be in the 
failover grace period? Could that be a problem?



--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 9
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Mon, 17 Jun 2024 18:23:41 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-17 Thread Wenzhe Zhou (Code Review)
Wenzhe Zhou has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 10:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/21520/9/be/src/statestore/statestore-subscriber.cc
File be/src/statestore/statestore-subscriber.cc:

http://gerrit.cloudera.org:8080/#/c/21520/9/be/src/statestore/statestore-subscriber.cc@1098
PS9, Line 1098:   bool in_failover_grace_period = 
MilliSecondsSinceLastFailover()
> At startup last_failover_time_ is zero, so for the first statestore_subscri
At startup, last_failover_time_ is zero. If last_failover_time_ is not updated, 
MilliSecondsSinceLastFailover() return a very big number and it must be greater 
than FLAGS_statestore_subscriber_recovery_grace_period_ms so we don't need to 
check if last_failover_time_ is not zero.



--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 10
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Mon, 17 Jun 2024 18:32:39 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-17 Thread Michael Smith (Code Review)
Michael Smith has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 10:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/21520/10/tests/custom_cluster/test_statestored_ha.py
File tests/custom_cluster/test_statestored_ha.py:

http://gerrit.cloudera.org:8080/#/c/21520/10/tests/custom_cluster/test_statestored_ha.py@680
PS10, Line 680:   self.wait_for_state(handle, QueryState.RUNNING, 1000)
This waits for 1000 seconds? That seems excessive. I'd try 10s to handle slower 
test environments.


http://gerrit.cloudera.org:8080/#/c/21520/10/tests/custom_cluster/test_statestored_ha.py@710
PS10, Line 710:   assert (not 
statestore_service_0.get_metric_value("statestore.active-status")), \
Maybe wait_for_metric_value would be appropriate here instead of a sleep.


http://gerrit.cloudera.org:8080/#/c/21520/10/tests/custom_cluster/test_statestored_ha.py@717
PS10, Line 717:   self.wait_for_state(handle, QueryState.RUNNING, 1000)
nit: Again waiting 1000 seconds.


http://gerrit.cloudera.org:8080/#/c/21520/10/tests/custom_cluster/test_statestored_ha.py@736
PS10, Line 736:   while (time.time() - start_time < timeout_s):
nit: I think you could also use wait_for_state(EXCEPTION), then rely on your 
assert that time.time()-start_time > timeout_s.



--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 10
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Mon, 17 Jun 2024 18:56:00 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-17 Thread Wenzhe Zhou (Code Review)
Wenzhe Zhou has uploaded a new patch set (#12). ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..

IMPALA-13159: Fix query cancellation caused by statestore failover

A momentary inconsistent cluster membership state after statestore
failover results in query cancellation.
We already have code to handle inconsistent cluster membership after
statestore restarting by defining a post-recovery grace period. During
the grace period, don't update the current cluster membership so that
the inconsistent membership will not be used to cancel queries on
coordinators and executors.
This patch handles inconsistent cluster membership state after
statestore failover in the same way.

Testing:
 - Added a new test case to verify that inconsistent cluster
   membership after statestore failover will not result in query
   cancellation.
 - Fixed closing client issue for Catalogd HA test case
   test_catalogd_failover_with_sync_ddl when the test fails.
 - Passed core test.

Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
---
M be/src/scheduling/cluster-membership-mgr.cc
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore-subscriber.h
M tests/custom_cluster/test_catalogd_ha.py
M tests/custom_cluster/test_statestored_ha.py
5 files changed, 160 insertions(+), 31 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/21520/12
--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 12
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-17 Thread Michael Smith (Code Review)
Michael Smith has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 12: Code-Review+1


--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 12
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Mon, 17 Jun 2024 20:56:50 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-17 Thread Michael Smith (Code Review)
Michael Smith has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 12:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/21520/10/tests/custom_cluster/test_statestored_ha.py
File tests/custom_cluster/test_statestored_ha.py:

http://gerrit.cloudera.org:8080/#/c/21520/10/tests/custom_cluster/test_statestored_ha.py@680
PS10, Line 680:   self.wait_for_state(handle, QueryState.RUNNING, 120, 
client)
> It's timeout for query to reach RUNNING state. Normally query should reach
Ack


http://gerrit.cloudera.org:8080/#/c/21520/10/tests/custom_cluster/test_statestored_ha.py@710
PS10, Line 710:   assert (not 
statestore_service_0.get_metric_value("statestore.active-status")), \
> Changed to wait_for_metric_value
Ack


http://gerrit.cloudera.org:8080/#/c/21520/10/tests/custom_cluster/test_statestored_ha.py@717
PS10, Line 717:   self.wait_for_state(handle, QueryState.RUNNING, 120, 
client)
> changed to 120
Ack


http://gerrit.cloudera.org:8080/#/c/21520/10/tests/custom_cluster/test_statestored_ha.py@736
PS10, Line 736:   self.wait_for_state(handle, QueryState.EXCEPTION, 
timeout_s, client)
> changed to use wait_for_state(EXCEPTION)
Ack



--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 12
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Mon, 17 Jun 2024 20:57:07 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-17 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 12:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/16377/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 12
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Mon, 17 Jun 2024 21:18:04 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-17 Thread Riza Suminto (Code Review)
Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 12: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 12
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Mon, 17 Jun 2024 22:10:36 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-17 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 12:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/10728/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 12
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Mon, 17 Jun 2024 22:23:37 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-17 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..


Patch Set 12: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 12
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou 
Gerrit-Comment-Date: Tue, 18 Jun 2024 03:27:57 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

2024-06-17 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/21520 )

Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
..

IMPALA-13159: Fix query cancellation caused by statestore failover

A momentary inconsistent cluster membership state after statestore
failover results in query cancellation.
We already have code to handle inconsistent cluster membership after
statestore restarting by defining a post-recovery grace period. During
the grace period, don't update the current cluster membership so that
the inconsistent membership will not be used to cancel queries on
coordinators and executors.
This patch handles inconsistent cluster membership state after
statestore failover in the same way.

Testing:
 - Added a new test case to verify that inconsistent cluster
   membership after statestore failover will not result in query
   cancellation.
 - Fixed closing client issue for Catalogd HA test case
   test_catalogd_failover_with_sync_ddl when the test fails.
 - Passed core test.

Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Reviewed-on: http://gerrit.cloudera.org:8080/21520
Reviewed-by: Michael Smith 
Reviewed-by: Riza Suminto 
Tested-by: Impala Public Jenkins 
---
M be/src/scheduling/cluster-membership-mgr.cc
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore-subscriber.h
M tests/custom_cluster/test_catalogd_ha.py
M tests/custom_cluster/test_statestored_ha.py
5 files changed, 160 insertions(+), 31 deletions(-)

Approvals:
  Michael Smith: Looks good to me, but someone else must approve
  Riza Suminto: Looks good to me, approved
  Impala Public Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 13
Gerrit-Owner: Wenzhe Zhou 
Gerrit-Reviewer: Abhishek Rawat 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Smith 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Wenzhe Zhou