[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2019-02-07 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/12049 )

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..

IMPALA-4555: Make QueryState's status reporting more robust

QueryState periodically collects runtime profiles from all of its
fragment instances and sends them to the coordinator. Previously, each
time this happens, if the rpc fails, QueryState will retry twice after
a configurable timeout and then cancel the fragment instances under
the assumption that the coordinator no longer exists.

We've found in real clusters that this logic is too sensitive to
failed rpcs and can result in fragment instances being cancelled even
in cases where the coordinator is still running.

This patch makes a few improvements to this logic:
- When a report fails to send, instead of retrying the same report
  quickly (after waiting report_status_retry_interval_ms), we wait the
  regular reporting interval (status_report_interval_ms), regenerate
  any stale portions of the report, and then retry.
- A new flag, --status_report_max_retries, is introduced, which
  controls the number of failed reports that are allowed before the
  query is cancelled. --report_status_retry_interval_ms is removed.
- Backoff is used for repeated failed attempts, such that for a period
  between retries of 't', on try 'n' the actual timeout will be t * n.

Testing:
- Added a test which results in a large number of failed intermediate
  status reports but still succeeds.

Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Reviewed-on: http://gerrit.cloudera.org:8080/12049
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 
---
M be/src/common/global-flags.cc
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M common/protobuf/control_service.proto
M tests/custom_cluster/test_rpc_timeout.py
8 files changed, 192 insertions(+), 65 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 9
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Marshall 


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2019-02-07 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12049 )

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..


Patch Set 8: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 8
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Comment-Date: Thu, 07 Feb 2019 22:21:50 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2019-02-07 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12049 )

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..


Patch Set 8:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3736/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 8
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Comment-Date: Thu, 07 Feb 2019 18:08:38 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2019-02-07 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12049 )

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..


Patch Set 8: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 8
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Comment-Date: Thu, 07 Feb 2019 18:08:37 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2019-02-06 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12049 )

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..


Patch Set 7: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 7
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Comment-Date: Thu, 07 Feb 2019 02:46:55 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2019-02-06 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12049 )

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12049/5/tests/custom_cluster/test_rpc_timeout.py
File tests/custom_cluster/test_rpc_timeout.py:

http://gerrit.cloudera.org:8080/#/c/12049/5/tests/custom_cluster/test_rpc_timeout.py@158
PS5, Line 158: # Since the sleep time (1000ms) is much longer than the rpc 
timeout (100ms), all
 : # reports will appear to fail. The query is designed to 
result in many intermediate
 : # status reports but fewer than the max allowed failures, so 
the query should succeed.
 : query_options = {'debug_action': 
'REPORT_EXEC_STATUS_DELAY:SLEEP@1000'}
> Sure, happy to write such a test if you think there's additional value in i
Thanks for pointing out the insert errors below. LGTM.



--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 5
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Comment-Date: Thu, 07 Feb 2019 02:46:45 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2019-02-06 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12049 )

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..


Patch Set 7:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/2007/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 7
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Comment-Date: Wed, 06 Feb 2019 21:49:23 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2019-02-06 Thread Thomas Marshall (Code Review)
Hello Michael Ho, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12049

to look at the new patch set (#7).

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..

IMPALA-4555: Make QueryState's status reporting more robust

QueryState periodically collects runtime profiles from all of its
fragment instances and sends them to the coordinator. Previously, each
time this happens, if the rpc fails, QueryState will retry twice after
a configurable timeout and then cancel the fragment instances under
the assumption that the coordinator no longer exists.

We've found in real clusters that this logic is too sensitive to
failed rpcs and can result in fragment instances being cancelled even
in cases where the coordinator is still running.

This patch makes a few improvements to this logic:
- When a report fails to send, instead of retrying the same report
  quickly (after waiting report_status_retry_interval_ms), we wait the
  regular reporting interval (status_report_interval_ms), regenerate
  any stale portions of the report, and then retry.
- A new flag, --status_report_max_retries, is introduced, which
  controls the number of failed reports that are allowed before the
  query is cancelled. --report_status_retry_interval_ms is removed.
- Backoff is used for repeated failed attempts, such that for a period
  between retries of 't', on try 'n' the actual timeout will be t * n.

Testing:
- Added a test which results in a large number of failed intermediate
  status reports but still succeeds.

Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
---
M be/src/common/global-flags.cc
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M common/protobuf/control_service.proto
M tests/custom_cluster/test_rpc_timeout.py
8 files changed, 192 insertions(+), 65 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/49/12049/7
--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 7
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Marshall 


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2019-02-06 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12049 )

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..


Patch Set 6:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/2005/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 6
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Comment-Date: Wed, 06 Feb 2019 20:59:41 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2019-02-06 Thread Thomas Marshall (Code Review)
Hello Michael Ho, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12049

to look at the new patch set (#6).

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..

IMPALA-4555: Make QueryState's status reporting more robust

QueryState periodically collects runtime profiles from all of its
fragment instances and sends them to the coordinator. Previously, each
time this happens, if the rpc fails, QueryState will retry twice after
a configurable timeout and then cancel the fragment instances under
the assumption that the coordinator no longer exists.

We've found in real clusters that this logic is too sensitive to
failed rpcs and can result in fragment instances being cancelled even
in cases where the coordinator is still running.

This patch makes a few improvements to this logic:
- When a report fails to send, instead of retrying the same report
  quickly (after waiting report_status_retry_interval_ms), we wait the
  regular reporting interval (status_report_interval_ms), regenerate
  any stale portions of the report, and then retry.
- A new flag, --status_report_max_retries, is introduced, which
  controls the number of failed reports that are allowed before the
  query is cancelled. --report_status_retry_interval_ms is removed.
- Backoff is used for repeated failed attempts, such that for a period
  between retries of 't', on try 'n' the actual timeout will be t * n.

Testing:
- Added a test which results in a large number of failed intermediate
  status reports but still succeeds.

Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
---
M be/src/common/global-flags.cc
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M common/protobuf/control_service.proto
M tests/custom_cluster/test_rpc_timeout.py
8 files changed, 192 insertions(+), 65 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/49/12049/6
--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 6
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Marshall 


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2019-02-06 Thread Thomas Marshall (Code Review)
Thomas Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12049 )

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..


Patch Set 6:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/fragment-instance-state.cc
File be/src/runtime/fragment-instance-state.cc:

http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/fragment-instance-state.cc@259
PS5, Line 259: // Since execution was already finished, the conten
> Can you please a comment on why we don't advance the seq number here to hel
Done


http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/fragment-instance-state.cc@300
PS5, Line 300: ) !=
> typo ?
Done


http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/query-state.h
File be/src/runtime/query-state.h:

http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/query-state.h@421
PS5, Line 421:   /// Returns the amount of time in ms to wait before sending 
the next status report,
> in milliseconds
Done


http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/query-state.h@422
PS5, Line 422: terval with
> This isn't correct, right ?
Done


http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/query-state.h@424
PS5, Line 424: int64_t GetReportWaitTimeMs()
> int64_t GetReportWaitTimeMs() const;
Done


http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/query-state.cc@415
PS5, Line 415:
> A minor quirk I just noticed. If the user sets FLAGS_status_report_interval
Done


http://gerrit.cloudera.org:8080/#/c/12049/5/tests/custom_cluster/test_rpc_timeout.py
File tests/custom_cluster/test_rpc_timeout.py:

http://gerrit.cloudera.org:8080/#/c/12049/5/tests/custom_cluster/test_rpc_timeout.py@158
PS5, Line 158: # Since the sleep time (1000ms) is much longer than the rpc 
timeout (100ms), all
 : # reports will appear to fail. The query is designed to 
result in many intermediate
 : # status reports but fewer than the max allowed failures, so 
the query should succeed.
 : query_options = {'debug_action': 
'REPORT_EXEC_STATUS_DELAY:SLEEP@1000'}
> This is a good test. Can you please add a test which exercises the stateful
Sure, happy to write such a test if you think there's additional value in it, 
though I'll point out that this test does exercise the stateful reporting part 
via the insert errors checked for below.

Do you know if we have such a udf already available or would I need to write 
one? If it generates errors probabilistically, how do we check that the errors 
that are reported back were in fact the errors that were produced during the 
run?



--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 6
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Comment-Date: Wed, 06 Feb 2019 20:40:56 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2019-02-05 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12049 )

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..


Patch Set 5: Code-Review+2

(7 comments)

LGTM. Please address the remaining comments.

http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/fragment-instance-state.cc
File be/src/runtime/fragment-instance-state.cc:

http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/fragment-instance-state.cc@259
PS5, Line 259: instance_status->set_report_seq_no(report_seq_no_);
Can you please a comment on why we don't advance the seq number here to help 
readers understand it ?


http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/fragment-instance-state.cc@300
PS5, Line 300: on in
typo ?


http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/query-state.h
File be/src/runtime/query-state.h:

http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/query-state.h@421
PS5, Line 421:   /// Returns the amount of time to wait before sending the next 
status report, calculated
in milliseconds


http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/query-state.h@422
PS5, Line 422:  exponential
This isn't correct, right ?


http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/query-state.h@424
PS5, Line 424: int64_t GetReportWaitTimeMs();
int64_t GetReportWaitTimeMs() const;


http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/query-state.cc@415
PS5, Line 415: FLAGS_status_report_interval_ms
A minor quirk I just noticed. If the user sets FLAGS_status_report_interval_ms 
to 0, we may want to have a minimum sleep time. Otherwise, if we failed to send 
the last report, we may retry without sleeping.

I guess my earlier suggestion to use the same timeout may have led to this 
quirk but the current timeout logic still seems simpler overall.


http://gerrit.cloudera.org:8080/#/c/12049/5/tests/custom_cluster/test_rpc_timeout.py
File tests/custom_cluster/test_rpc_timeout.py:

http://gerrit.cloudera.org:8080/#/c/12049/5/tests/custom_cluster/test_rpc_timeout.py@158
PS5, Line 158: # Since the sleep time (1000ms) is much longer than the rpc 
timeout (100ms), all
 : # reports will appear to fail. The query is designed to 
result in many intermediate
 : # status reports but fewer than the max allowed failures, so 
the query should succeed.
 : query_options = {'debug_action': 
'REPORT_EXEC_STATUS_DELAY:SLEEP@1000'}
This is a good test. Can you please add a test which exercises the stateful 
reporting part by running some UDFs which may generate errors 
probabilistically. (e.g. as a scan predicate which calls the UDF). In this 
case, we will also start accumulating errors in the fragment instance state.



--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 5
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Comment-Date: Tue, 05 Feb 2019 23:11:51 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2019-01-31 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12049 )

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..


Patch Set 5:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/1947/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 5
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Comment-Date: Fri, 01 Feb 2019 01:14:48 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2019-01-31 Thread Thomas Marshall (Code Review)
Hello Michael Ho, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12049

to look at the new patch set (#5).

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..

IMPALA-4555: Make QueryState's status reporting more robust

QueryState periodically collects runtime profiles from all of its
fragment instances and sends them to the coordinator. Previously, each
time this happens, if the rpc fails, QueryState will retry twice after
a configurable timeout and then cancel the fragment instances under
the assumption that the coordinator no longer exists.

We've found in real clusters that this logic is too sensitive to
failed rpcs and can result in fragment instances being cancelled even
in cases where the coordinator is still running.

This patch makes a few improvements to this logic:
- When a report fails to send, instead of retrying the same report
  quickly (after waiting report_status_retry_interval_ms), we wait the
  regular reporting interval (status_report_interval_ms), regenerate
  any stale portions of the report, and then retry.
- A new flag, --status_report_max_retries, is introduced, which
  controls the number of failed reports that are allowed before the
  query is cancelled. --report_status_retry_interval_ms is removed.
- Exponential backoff is used, such that for a period between
  retries of 't', on try 'n' the actual timeout will be t * n.

Testing:
- Added a test which results in a large number of failed intermediate
  status reports but still succeeds.

Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
---
M be/src/common/global-flags.cc
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M common/protobuf/control_service.proto
M tests/custom_cluster/test_rpc_timeout.py
8 files changed, 185 insertions(+), 63 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/49/12049/5
--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 5
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Marshall 


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2019-01-31 Thread Thomas Marshall (Code Review)
Thomas Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12049 )

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..


Patch Set 5:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/12049/4/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/12049/4/be/src/runtime/coordinator-backend-state.cc@344
PS4, Line 344:   for (const auto& stateful_report : 
instance_exec_status.stateful_report()) {
> DCHECK_LE(stateful_report.report_seq_no(), report_seq_no());
Done


http://gerrit.cloudera.org:8080/#/c/12049/4/be/src/runtime/fragment-instance-state.h
File be/src/runtime/fragment-instance-state.h:

http://gerrit.cloudera.org:8080/#/c/12049/4/be/src/runtime/fragment-instance-state.h@170
PS4, Line 170: :vector<
> typo
Done


http://gerrit.cloudera.org:8080/#/c/12049/4/be/src/runtime/fragment-instance-state.h@171
PS4, Line 171:
> nit: prev_stateful_reports_;
Done


http://gerrit.cloudera.org:8080/#/c/12049/4/be/src/runtime/fragment-instance-state.cc
File be/src/runtime/fragment-instance-state.cc:

http://gerrit.cloudera.org:8080/#/c/12049/4/be/src/runtime/fragment-instance-state.cc@253
PS4, Line 253: DCHECK(!final_report_sent_
> Please see comments below about some potential simplification we can do.
Done


http://gerrit.cloudera.org:8080/#/c/12049/4/be/src/runtime/fragment-instance-state.cc@279
PS4, Line 279: {prev_stateful_reports_.begin(), 
prev_stateful_reports_.end()};
> nit: indent 4
Done


http://gerrit.cloudera.org:8080/#/c/12049/4/be/src/runtime/fragment-instance-state.cc@298
PS4, Line 298: if (num_reports > 0 && prev_statef
> As discussed in person, this could have been simpler by not storing the fin
Done


http://gerrit.cloudera.org:8080/#/c/12049/4/be/src/runtime/query-state.h
File be/src/runtime/query-state.h:

http://gerrit.cloudera.org:8080/#/c/12049/4/be/src/runtime/query-state.h@410
PS4, Line 410: report'
> stale ?
Done


http://gerrit.cloudera.org:8080/#/c/12049/4/be/src/runtime/query-state.h@411
PS4, Line 411: profiles_forest'.
> stale ?
Done


http://gerrit.cloudera.org:8080/#/c/12049/4/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/12049/4/be/src/runtime/query-state.cc@563
PS4, Line 563: portExecStatus();
> Given this is quite clunky to write and used at more than one places, may b
Done



--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 5
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Comment-Date: Fri, 01 Feb 2019 00:30:31 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2019-01-31 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12049 )

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..


Patch Set 4:

(9 comments)

Looking good. Some more minor comments.

http://gerrit.cloudera.org:8080/#/c/12049/4/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/12049/4/be/src/runtime/coordinator-backend-state.cc@344
PS4, Line 344:   for (const auto& stateful_report : 
instance_exec_status.stateful_report()) {
DCHECK_LE(stateful_report.report_seq_no(), report_seq_no());


http://gerrit.cloudera.org:8080/#/c/12049/4/be/src/runtime/fragment-instance-state.h
File be/src/runtime/fragment-instance-state.h:

http://gerrit.cloudera.org:8080/#/c/12049/4/be/src/runtime/fragment-instance-state.h@170
PS4, Line 170: recieved
typo


http://gerrit.cloudera.org:8080/#/c/12049/4/be/src/runtime/fragment-instance-state.h@171
PS4, Line 171: stateful_reports_
nit: prev_stateful_reports_;


http://gerrit.cloudera.org:8080/#/c/12049/4/be/src/runtime/fragment-instance-state.cc
File be/src/runtime/fragment-instance-state.cc:

http://gerrit.cloudera.org:8080/#/c/12049/4/be/src/runtime/fragment-instance-state.cc@253
PS4, Line 253: if (final_report_saved_) {
Please see comments below about some potential simplification we can do.


http://gerrit.cloudera.org:8080/#/c/12049/4/be/src/runtime/fragment-instance-state.cc@279
PS4, Line 279:   {stateful_reports_.begin(), stateful_reports_.end()};
nit: indent 4


http://gerrit.cloudera.org:8080/#/c/12049/4/be/src/runtime/fragment-instance-state.cc@298
PS4, Line 298: if (instance_exec_status.done()) {
As discussed in person, this could have been simpler by not storing the final 
profile here. We just have to keep track of the fact that we don't need to bump 
the sequence number once the final report has been generated at least once.


http://gerrit.cloudera.org:8080/#/c/12049/4/be/src/runtime/query-state.h
File be/src/runtime/query-state.h:

http://gerrit.cloudera.org:8080/#/c/12049/4/be/src/runtime/query-state.h@410
PS4, Line 410: report_
stale ?


http://gerrit.cloudera.org:8080/#/c/12049/4/be/src/runtime/query-state.h@411
PS4, Line 411: profiles_forest_'
stale ?


http://gerrit.cloudera.org:8080/#/c/12049/4/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/12049/4/be/src/runtime/query-state.cc@563
PS4, Line 563: FLAGS_status_report_interval_ms * (num_failed_reports_ + 1)
Given this is quite clunky to write and used at more than one places, may be it 
makes sense to factor it into a function itself. This also makes changing the 
logic of sleep time (e.g. exponential backoff) in the future easier.



--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 4
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Comment-Date: Thu, 31 Jan 2019 19:46:51 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2019-01-29 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12049 )

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..


Patch Set 4:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/1920/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 4
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Comment-Date: Wed, 30 Jan 2019 01:04:57 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2019-01-29 Thread Thomas Marshall (Code Review)
Thomas Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12049 )

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..


Patch Set 4:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/12049/2/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/12049/2/be/src/runtime/coordinator-backend-state.cc@300
PS2, Line 300: lExecState* dml_exec_sta
> Thanks for the explanation. This seems a bit subtle indeed.
Done


http://gerrit.cloudera.org:8080/#/c/12049/3/be/src/runtime/query-state.h
File be/src/runtime/query-state.h:

http://gerrit.cloudera.org:8080/#/c/12049/3/be/src/runtime/query-state.h@394
PS3, Line 394: eryCtx&
> nit: succeeded, matching "failed" below.
Done


http://gerrit.cloudera.org:8080/#/c/12049/3/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/12049/3/be/src/runtime/query-state.cc@382
PS3, Line 382: if (rpc_status.ok()) {
 :   fis->ReportSuccessful(instance_exec_status);
> Reading the documentation of vector::erase(), it's not exactly an efficient
Done


http://gerrit.cloudera.org:8080/#/c/12049/2/common/protobuf/control_service.proto
File common/protobuf/control_service.proto:

http://gerrit.cloudera.org:8080/#/c/12049/2/common/protobuf/control_service.proto@117
PS2, Line 117:   // Sequence number prevents out-of-order or duplicated updates 
from being applied.
 :   // 'report_seq_no' will be <= the
> May be worth documenting the relationship of this seq_no with that in Fragm
Done


http://gerrit.cloudera.org:8080/#/c/12049/2/common/protobuf/control_service.proto@138
PS2, Line 138:   optional FInstanceExecStatePB current_state = 4;
 :
 :   // Cumulative structural changes made by the
> Sounds good to me.
IMPALA-8139



--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 4
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Comment-Date: Tue, 29 Jan 2019 20:56:49 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2019-01-29 Thread Thomas Marshall (Code Review)
Hello Michael Ho, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12049

to look at the new patch set (#4).

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..

IMPALA-4555: Make QueryState's status reporting more robust

QueryState periodically collects runtime profiles from all of its
fragment instances and sends them to the coordinator. Previously, each
time this happens, if the rpc fails, QueryState will retry twice after
a configurable timeout and then cancel the fragment instances under
the assumption that the coordinator no longer exists.

We've found in real clusters that this logic is too sensitive to
failed rpcs and can result in fragment instances being cancelled even
in cases where the coordinator is still running.

This patch makes a few improvements to this logic:
- When a report fails to send, instead of retrying the same report
  quickly (after waiting report_status_retry_interval_ms), we wait the
  regular reporting interval (status_report_interval_ms), regenerate
  any stale portions of the report, and then retry.
- A new flag, --status_report_max_retries, is introduced, which
  controls the number of failed reports that are allowed before the
  query is cancelled. --report_status_retry_interval_ms is removed.
- Exponential backoff is used, such that for a period between
  retries of 't', on try 'n' the actual timeout will be t * n.

Testing:
- Added a test which results in a large number of failed intermediate
  status reports but still succeeds.

Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
---
M be/src/common/global-flags.cc
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M common/protobuf/control_service.proto
M tests/custom_cluster/test_rpc_timeout.py
8 files changed, 192 insertions(+), 64 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/49/12049/4
--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 4
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Marshall 


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2019-01-24 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12049 )

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..


Patch Set 3:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/12049/2/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/12049/2/be/src/runtime/coordinator-backend-state.cc@300
PS2, Line 300: nstance_exec_status :
> Yes, but instance_stats->done_ can be set by an intermediate report (if tha
Thanks for the explanation. This seems a bit subtle indeed.

Would things be simpler if we just stop bumping the sequence number after 
generating the last report for a fragment instance ? In other words, we need to 
record the fact that the final report may have been generated already, which is 
different from whether final report has been sent.


http://gerrit.cloudera.org:8080/#/c/12049/3/be/src/runtime/query-state.h
File be/src/runtime/query-state.h:

http://gerrit.cloudera.org:8080/#/c/12049/3/be/src/runtime/query-state.h@394
PS3, Line 394: succeeds
nit: succeeded, matching "failed" below.


http://gerrit.cloudera.org:8080/#/c/12049/3/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/12049/3/be/src/runtime/query-state.cc@382
PS3, Line 382: it = report_.mutable_instance_exec_status()->erase(it);
 : profile_it = 
profiles_forest_.profile_trees.erase(profile_it);
Reading the documentation of vector::erase(), it's not exactly an efficient 
operation as it seems to do quite a bit of copying. Doing this in a list may 
end up being O(n^2) here.

I suppose the reason for doing this dance here is to make UpdateReport() work 
on both code paths of ReportSuccessful() and ReportFailed().

Assuming the common case is that the report succeeds, it seems to be doing 
quite a bit of extra work just to erase each element one by one. Shouldn't the 
penalty of copying be paid instead when the report fails, which is presumably 
not too common ? In other words, should we just copy / move the stateful report 
into the fragment instance state instead ?

A slightly more efficient approach of what this loop is trying to achieve could 
be swapping elements at the end of the list to this spot being erased.  In 
which case, it would be O(n) worst case. However, I am not sure if this 
complexity is warranted.


http://gerrit.cloudera.org:8080/#/c/12049/2/common/protobuf/control_service.proto
File common/protobuf/control_service.proto:

http://gerrit.cloudera.org:8080/#/c/12049/2/common/protobuf/control_service.proto@117
PS2, Line 117:   // Sequence number prevents out-of-order or duplicated updates 
from being applied.
 :   optional int64 report_seq_no = 1;
May be worth documenting the relationship of this seq_no with that in 
FragmentInstanceExecStatusPB. If I understand it correctly, this one should be 
<= that in FragmentInstanceExecStatusPB.


http://gerrit.cloudera.org:8080/#/c/12049/2/common/protobuf/control_service.proto@138
PS2, Line 138:   // Cumulative structural changes made by the table sink of 
this fragment
 :   // instance. This is sent only when 'done' above is true. Not 
idempotent.
 :   optional DmlExecStatusPB dml_exec_status = 5;
> Its not quite straight-forward to do, so given this patch is already reason
Sounds good to me.



--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 3
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 24 Jan 2019 21:55:21 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2019-01-22 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12049 )

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..


Patch Set 3:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/1850/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 3
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 23 Jan 2019 01:16:58 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2019-01-22 Thread Thomas Marshall (Code Review)
Hello Michael Ho, Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12049

to look at the new patch set (#3).

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..

IMPALA-4555: Make QueryState's status reporting more robust

QueryState periodically collects runtime profiles from all of its
fragment instances and sends them to the coordinator. Previously, each
time this happens, if the rpc fails, QueryState will retry twice after
a configurable timeout and then cancel the fragment instances under
the assumption that the coordinator no longer exists.

We've found in real clusters that this logic is too sensitive to
failed rpcs and can result in fragment instances being cancelled even
in cases where the coordinator is still running.

This patch makes a few improvements to this logic:
- When a report fails to send, instead of retrying the same report
  quickly (after waiting report_status_retry_interval_ms), we wait the
  regular reporting interval (status_report_interval_ms), regenerate
  any stale portions of the report, and then retry.
- A new flag, --status_report_max_retries, is introduced, which
  controls the number of failed reports that are allowed before the
  query is cancelled. --report_status_retry_interval_ms is removed.
- Exponential backoff is used, such that for a period between
  retries of 't', on try 'n' the actual timeout will be t * n.

Testing:
- Added a test which results in a large number of failed intermediate
  status reports but still succeeds.

Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
---
M be/src/common/global-flags.cc
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M common/protobuf/control_service.proto
M tests/custom_cluster/test_rpc_timeout.py
8 files changed, 195 insertions(+), 93 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/49/12049/3
--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 3
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2019-01-22 Thread Thomas Marshall (Code Review)
Thomas Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12049 )

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..


Patch Set 3:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/12049/2/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/12049/2/be/src/runtime/coordinator-backend-state.cc@300
PS2, Line 300: nstance_exec_status :
> If we send the final report multiple times, won't report_seq_no == last_rep
Yes, but instance_stats->done_ can be set by an intermediate report (if that 
particular fragment is done but others aren't). So the following sequence could 
occur:
- Some fragment finishes but others are still running, backends sends a report, 
backend thinks the report failed but actually the coordinator received it.
- Backend sends another report. This one has a higher sequence number, but 
still has the info for the finished fragment as we didn't think that it was 
received. Coordinator already processed the report for the finished fragment, 
so we want to skip processing it again.


http://gerrit.cloudera.org:8080/#/c/12049/2/be/src/runtime/coordinator-backend-state.cc@320
PS2, Line 320: n_ran
> const auto&
Done


http://gerrit.cloudera.org:8080/#/c/12049/2/be/src/runtime/fragment-instance-state.h
File be/src/runtime/fragment-instance-state.h:

http://gerrit.cloudera.org:8080/#/c/12049/2/be/src/runtime/fragment-instance-state.h@101
PS2, Line 101: received
> nit: received
Done


http://gerrit.cloudera.org:8080/#/c/12049/2/be/src/runtime/query-state.h
File be/src/runtime/query-state.h:

http://gerrit.cloudera.org:8080/#/c/12049/2/be/src/runtime/query-state.h@387
PS2, Line 387: The number of failed intermediate reports since the la
> This is reset to 0 upon a successful RPC, right ? So, will it be clearer to
Done


http://gerrit.cloudera.org:8080/#/c/12049/2/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/12049/2/be/src/runtime/query-state.cc@386
PS2, Line 386: } else {
> const TUniqueId&
Done


http://gerrit.cloudera.org:8080/#/c/12049/2/be/src/runtime/query-state.cc@563
PS2, Line 563: e all fragment instances finished preparing successfully, start
> Some more thoughts on this:
Done


http://gerrit.cloudera.org:8080/#/c/12049/2/common/protobuf/control_service.proto
File common/protobuf/control_service.proto:

http://gerrit.cloudera.org:8080/#/c/12049/2/common/protobuf/control_service.proto@116
PS2, Line 116: StatefulStatusPB {
> Is there a more succinct name for it ? StatefulStatusPB ?
Done


http://gerrit.cloudera.org:8080/#/c/12049/2/common/protobuf/control_service.proto@138
PS2, Line 138:   // Cumulative structural changes made by the table sink of 
this fragment
 :   // instance. This is sent only when 'done' above is true. Not 
idempotent.
 :   optional DmlExecStatusPB dml_exec_status = 5;
> Why isn't this moved into FInstanceExecStatusStatefulPB
Its not quite straight-forward to do, so given this patch is already reasonably 
large and complicated, I think its better to leave as followup work.

If you agree, I'll file a JIRA for it.



--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 3
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 23 Jan 2019 00:25:25 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2018-12-20 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12049 )

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12049/2/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/12049/2/be/src/runtime/query-state.cc@563
PS2, Line 563: FLAGS_status_report_interval_ms * (num_failed_reports_ + 1))) {
> This backoff time seems way larger than the retry interval used when failin
Some more thoughts on this:

The behavior in this patch is that we will back off on the intermediate reports 
with at least 5 seconds (or whatever FLAGS_status_report_interval_ms is set to) 
but then we only spin for about 300ms (or whatever 
FLAGS_report_status_retry_interval_ms is set to) when it fails to be sent. This 
behavior seems rather complicated without much benefit. Having 2 different 
flags controlling the behavior also makes it hard to use and reason about the 
behavior.

An executor is merely using this periodic status report as a proxy to determine 
the liveness of the coordinator.  So, wouldn't the behavior be simpler and more 
consistent if an executor just considers the coordinator as dysfunctional after 
 FLAGS_status_report_max_retries failed reports in a row. In other words, an 
executor provides a total time allowance of  FLAGS_status_report_max_retries * 
FLAGS_report_interval_ms (or potentially longer with exponential backoff) for 
coordinator to respond before giving up.


With the above policy, we don't distinguish between intermediate report and 
final report for the backoff behavior. A clock will be started on the first 
failure and an executor gives up after the time allowance has passed.

As a side note, an executor should ideally be able to rely on statestore to 
provide proper cluster membership but the reality is that it's insufficient for 
determining the health of a node (see IMPALA-7872). So, we fall back to this 
poor-man approach of using periodic status report as a heartbeat to probe the 
coordinator.



--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 2
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 20 Dec 2018 20:15:17 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2018-12-17 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12049 )

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..


Patch Set 2:

(8 comments)

Some trivial comments for now. I have yet to think through the retry logic to 
see if they can be simplified and more consistent.

http://gerrit.cloudera.org:8080/#/c/12049/2/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/12049/2/be/src/runtime/coordinator-backend-state.cc@300
PS2, Line 300: || instance_stats->done_
If we send the final report multiple times, won't report_seq_no == 
last_report_seq_no here ?


http://gerrit.cloudera.org:8080/#/c/12049/2/be/src/runtime/coordinator-backend-state.cc@320
PS2, Line 320: auto
const auto&


http://gerrit.cloudera.org:8080/#/c/12049/2/be/src/runtime/fragment-instance-state.h
File be/src/runtime/fragment-instance-state.h:

http://gerrit.cloudera.org:8080/#/c/12049/2/be/src/runtime/fragment-instance-state.h@101
PS2, Line 101: recieved
nit: received


http://gerrit.cloudera.org:8080/#/c/12049/2/be/src/runtime/query-state.h
File be/src/runtime/query-state.h:

http://gerrit.cloudera.org:8080/#/c/12049/2/be/src/runtime/query-state.h@387
PS2, Line 387: The number of consecutive failed intermediate reports.
This is reset to 0 upon a successful RPC, right ? So, will it be clearer to say 
"The number of failed intermediate reports since the last successfully sent 
report" ?


http://gerrit.cloudera.org:8080/#/c/12049/2/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/12049/2/be/src/runtime/query-state.cc@386
PS2, Line 386: TUniqueId
const TUniqueId&


http://gerrit.cloudera.org:8080/#/c/12049/2/be/src/runtime/query-state.cc@563
PS2, Line 563: FLAGS_status_report_interval_ms * (num_failed_reports_ + 1))) {
This backoff time seems way larger than the retry interval used when failing to 
send the final report. This seems a bit inconsistent.


http://gerrit.cloudera.org:8080/#/c/12049/2/common/protobuf/control_service.proto
File common/protobuf/control_service.proto:

http://gerrit.cloudera.org:8080/#/c/12049/2/common/protobuf/control_service.proto@116
PS2, Line 116: FInstanceExecStatusStatefulPB
Is there a more succinct name for it ? StatefulStatusPB ?


http://gerrit.cloudera.org:8080/#/c/12049/2/common/protobuf/control_service.proto@138
PS2, Line 138:   // Cumulative structural changes made by the table sink of 
this fragment
 :   // instance. This is sent only when 'done' above is true. Not 
idempotent.
 :   optional DmlExecStatusPB dml_exec_status = 5;
Why isn't this moved into FInstanceExecStatusStatefulPB



--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 2
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 18 Dec 2018 00:43:38 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2018-12-14 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12049 )

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..


Patch Set 2:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/1619/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 2
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 15 Dec 2018 00:01:06 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2018-12-14 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12049 )

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..


Patch Set 2:

Hue polls the HS2 GetLog() endpoint - i.e. ImpalaServer::GetLog(), but the main 
purpose of that afaik is to see the query progress.


--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 2
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 14 Dec 2018 23:55:50 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2018-12-14 Thread Thomas Marshall (Code Review)
Thomas Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12049 )

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..


Patch Set 1:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/12049/1/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/12049/1/be/src/runtime/coordinator-backend-state.cc@299
PS1, Line 299: report_seq_no <= instance_stats->last_report_seq_no_ || 
instance_stats->done_) {
> Wouldn't this be a problem for intermediate report ? If the coordinator is
You're right that this approach may cause non-idempotent parts of the report to 
be applied twice (at the moment, only the error log).

As discussed offline, only bumping the seq no on the executor side if the 
report was successful doesn't work - since we'll wait 5s+ before sending the 
new report, it will contain new info that we do want to have applied.

I also looked into the idea of just deferring sending the error log until the 
final message. Currently, the error log is only exposed by the beeswax call 
get_log(). I don't know if this is commonly called by users in the middle of 
query execution (the shell calls it, but only at the end of the query), but 
going forward we probably want to expose this in more ways (eg. as part of the 
runtime profile), and I think it's a better user experience to make the errors 
accessible as soon as possible, for example to help identify issues with 
queries that are very long running or appear to hang. Further, there are other 
non-idempotent things that we may want to include (eg. it would be nice to have 
the dml stats incrementally updated instead of just at the end), so for these 
reasons I think its worth trying to come up with a solution for this now.

The solution I put together separates the non-idempotent parts of the report 
out into a 'stateful' report. If the report rpc fails, each fragment instance 
stores the stateful report it had generated, and adds it to the next report 
associated with its original sequence number. This ensures that the coordinator 
will only apply stateful portions of the report for sequence numbers it hasn't 
seen yet. Once a report is successful, all the stored stateful reports are 
cleared.

Downsides to this approach vs. just sending the error log with the final report:
- Copying the stateful report into the fragment instance's state takes time and 
adds memory usage and sending multiple stateful reports increases the size of 
the report rpc and puts more load on the coordinator. These issues will only 
occur if a report rpc actually fails, which is hopefully rare, but if things 
are already failing this may bog the system down even more.
- Additional code complexity.


http://gerrit.cloudera.org:8080/#/c/12049/1/be/src/runtime/fragment-instance-state.h
File be/src/runtime/fragment-instance-state.h:

http://gerrit.cloudera.org:8080/#/c/12049/1/be/src/runtime/fragment-instance-state.h@99
PS1, Line 99: void SetFinalReportSent() { final_report_sent_ = true; }
> May help to add comments on the expected caller and on when this is expecte
Done


http://gerrit.cloudera.org:8080/#/c/12049/1/be/src/runtime/query-state.h
File be/src/runtime/query-state.h:

http://gerrit.cloudera.org:8080/#/c/12049/1/be/src/runtime/query-state.h@417
PS1, Line 417: (bool final
> Comment on the input parameter.
Done


http://gerrit.cloudera.org:8080/#/c/12049/1/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/12049/1/be/src/runtime/query-state.cc@58
PS1, Line 58: DEFINE_int32(status_report_max_failures, 3,
: "Max number of consecutive failed status reports to allow 
before cancelling");
> Actually, max_retries seems to be safer in the sense that it guarantees a m
Done


http://gerrit.cloudera.org:8080/#/c/12049/1/be/src/runtime/query-state.cc@324
PS1, Line 324: Try to send the RPC 3 times before failing. Sleep for 100ms 
between retries.
> May need to be updated.
Done


http://gerrit.cloudera.org:8080/#/c/12049/1/be/src/runtime/query-state.cc@368
PS1, Line 368: fis_map_[id]->runtime_state()->ClearUnreportedErrors();
> There is a race here: we may be clearing newly added errors we added after
Done



--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 1
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Comment-Date: Fri, 14 Dec 2018 23:38:08 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2018-12-14 Thread Thomas Marshall (Code Review)
Hello Michael Ho, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12049

to look at the new patch set (#2).

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..

IMPALA-4555: Make QueryState's status reporting more robust

QueryState periodically collects runtime profiles from all of its
fragment instances and sends them to the coordinator. Previously, each
time this happens, if the rpc fails, QueryState will retry twice after
a configurable timeout and then cancel the fragment instances under
the assumption that the coordinator no longer exists.

We've found in real clusters that this logic is too sensitive to
failed rpcs and can result in fragment instances being cancelled even
in cases where the coordinator is still running.

This patch makes a few improvements to this logic:
- When an intermediate report is generated, if the first attempt to
  send it to the coordinator fails, the report is discarded. Only the
  final report, when all of the fragments instances have completed, is
  retried.
- Exponential backoff is used, both for the time between generating
  intermediate reports (controlled by FLAG_status_report_interval_ms)
  and for the time between retries for the final report (controlled by
  FLAG_report_status_retry_interval_ms) such that for a period between
  retries of 't', on try 'n' the actual timeout will be t * n.

Testing:
- Added a test which results in a large number of failed intermediate
  status reports but still succeeds.

Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M common/protobuf/control_service.proto
M tests/custom_cluster/test_rpc_timeout.py
7 files changed, 139 insertions(+), 26 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/49/12049/2
--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 2
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2018-12-11 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12049 )

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/12049/1/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/12049/1/be/src/runtime/query-state.cc@58
PS1, Line 58: DEFINE_int32(status_report_max_failures, 3,
: "Max number of consecutive failed status reports to allow 
before cancelling");
> I thought we want to use a fixed timeout approach for the maximum retries ?
Actually, max_retries seems to be safer in the sense that it guarantees a 
minimum amount of time the thread will sleep before giving up as we know the 
sleep time between each retry.

With an absolute timeout, there is no guarantee on the number of retries we 
will do. If the system is overloaded, the query state thread may not get to run 
very often before expiration so the number of retries is non-deterministic.


http://gerrit.cloudera.org:8080/#/c/12049/1/be/src/runtime/query-state.cc@368
PS1, Line 368: fis_map_[id]->runtime_state()->ClearUnreportedErrors();
There is a race here: we may be clearing newly added errors we added after the 
profile was computed.



--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 1
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 
Gerrit-Comment-Date: Tue, 11 Dec 2018 22:25:25 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2018-12-10 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12049 )

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..


Patch Set 1:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/12049/1/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/12049/1/be/src/runtime/coordinator-backend-state.cc@299
PS1, Line 299: report_seq_no <= instance_stats->last_report_seq_no_ || 
instance_stats->done_) {
Wouldn't this be a problem for intermediate report ? If the coordinator is slow 
to respond to the RPC, the client side (i.e. the executor) may assume that the 
intermediate reports were never applied even if it was actually applied by the 
coordinator. In which case, the new report may come with a larger sequence 
number which is larger than the old one and that may cause duplicated report.

May be it makes sense to keep track of whether a FIS' report was successfully 
applied at the executor side and bump the sequence number if the last one was 
applied successfully. Not sure if there is any corner case which may not have 
been considered in this alternate approach ?!


http://gerrit.cloudera.org:8080/#/c/12049/1/be/src/runtime/fragment-instance-state.h
File be/src/runtime/fragment-instance-state.h:

http://gerrit.cloudera.org:8080/#/c/12049/1/be/src/runtime/fragment-instance-state.h@99
PS1, Line 99: void SetFinalReportSent() { final_report_sent_ = true; }
May help to add comments on the expected caller and on when this is expected to 
be called ?


http://gerrit.cloudera.org:8080/#/c/12049/1/be/src/runtime/query-state.h
File be/src/runtime/query-state.h:

http://gerrit.cloudera.org:8080/#/c/12049/1/be/src/runtime/query-state.h@417
PS1, Line 417: (bool final
Comment on the input parameter.


http://gerrit.cloudera.org:8080/#/c/12049/1/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/12049/1/be/src/runtime/query-state.cc@58
PS1, Line 58: DEFINE_int32(status_report_max_failures, 3,
: "Max number of consecutive failed status reports to allow 
before cancelling");
I thought we want to use a fixed timeout approach for the maximum retries ? I 
am okay with max_retries too but I think it may make sense to have larger 
number of retries than 3.


http://gerrit.cloudera.org:8080/#/c/12049/1/be/src/runtime/query-state.cc@324
PS1, Line 324: Try to send the RPC 3 times before failing. Sleep for 100ms 
between retries.
May need to be updated.



--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 1
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 
Gerrit-Comment-Date: Mon, 10 Dec 2018 21:56:58 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2018-12-06 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12049 )

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..


Patch Set 1:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/1555/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 1
Gerrit-Owner: Thomas Marshall 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Michael Ho 
Gerrit-Comment-Date: Thu, 06 Dec 2018 23:39:34 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

2018-12-06 Thread Thomas Marshall (Code Review)
Thomas Marshall has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/12049


Change subject: IMPALA-4555: Make QueryState's status reporting more robust
..

IMPALA-4555: Make QueryState's status reporting more robust

QueryState periodically collects runtime profiles from all of its
fragment instances and sends them to the coordinator. Previously, each
time this happens, if the rpc fails, QueryState will retry twice after
a configurable timeout and then cancel the fragment instances under
the assumption that the coordinator no longer exists.

We've found in real clusters that this logic is too sensitive to
failed rpcs and can result in fragment instances being cancelled even
in cases where the coordinator is still running.

This patch makes a few improvements to this logic:
- When an intermediate report is generated, if the first attempt to
  send it to the coordinator fails, the report is discarded. Only the
  final report, when all of the fragments instances have completed, is
  retried.
- Exponential backoff is used, both for the time between generating
  intermediate reports (controlled by FLAG_status_report_interval_ms)
  and for the time between retries for the final report (controlled by
  FLAG_report_status_retry_interval_ms) such that for a period between
  retries of 't', on try 'n' the actual timeout will be t * n.

Testing:
- Added a test which results in a large number of failed intermediate
  status reports but still succeeds.

Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M tests/custom_cluster/test_rpc_timeout.py
8 files changed, 61 insertions(+), 15 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/49/12049/1
--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 1
Gerrit-Owner: Thomas Marshall