[Impala-ASF-CR] IMPALA-6662: Make stress test resilient to hangs due to client crashes

Lars Volker (Code Review) Thu, 15 Mar 2018 14:16:37 -0700

Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9635 )


Change subject: IMPALA-6662: Make stress test resilient to hangs due to client 
crashes
......................................................................


Patch Set 4:

(6 comments)

I had a first look and left some comments but I think the approach needs a more 
thorough overhaul. It is not clear to me how queries and query-runners relate 
to each other and the data structures involved in starting new runners seem 
opaque to me. I think it would be easier to understand if we had a proper 
multiprocessing.Queue() with work items (each representing a query) and a pool 
of workers that consume items from the queue and put results into a result 
queue, which then gets processed by the main process. Then the whole bug would 
come down to an item being removed from the work queue but no corresponding 
item showing up in the result queue. It would also simplify the locking and 
computing sums of counters.

I can see how that could be out of scope for this change. Let me know if you'd 
move forward with the current change as is and I'll do another pass.

http://gerrit.cloudera.org:8080/#/c/9635/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9635/4//COMMIT_MSG@53
PS4, Line 53: Testing: Ran the stress test with the new patch a few times to 
make sure
You could add a pause to one of the runners and kill it to test this manually.


http://gerrit.cloudera.org:8080/#/c/9635/4/tests/stress/concurrent_select.py
File tests/stress/concurrent_select.py:

http://gerrit.cloudera.org:8080/#/c/9635/4/tests/stress/concurrent_select.py@327
PS4, Line 327:     # All values below are cumulative.
maybe add a comment to describe how these are populated?

It think it would also help to add a brief comment to explain what each of them 
means.


http://gerrit.cloudera.org:8080/#/c/9635/4/tests/stress/concurrent_select.py@373
PS4, Line 373:       for runner in self._query_runners:
You could just call _total_num_queries_started() and 
_total_num_queries_finished() here. However, you'd need to switch 
_query_runners_lock to RLock.


http://gerrit.cloudera.org:8080/#/c/9635/4/tests/stress/concurrent_select.py@429
PS4, Line 429:     with self._query_runners_lock:
I don't think holding the lock is necessary here. Python's GIL makes sure that 
list operations themselves are threadsafe


http://gerrit.cloudera.org:8080/#/c/9635/4/tests/stress/concurrent_select.py@528
PS4, Line 528:         while self._total_num_queries_started() < 
self._num_queries_to_run:
It is not clear to me whether a query runner will run a single query and then 
exit, or whether it will keep running queries. Here it looks like we are 
starting new runners until we have run all of the queries, one per query. 
However, in _start_single_runner it looks like we are running a loop over the 
global query queue. Does this mean all runners continue to run queries, and we 
also continue to ramp up the number of runners all the time?


http://gerrit.cloudera.org:8080/#/c/9635/4/tests/stress/concurrent_select.py@548
PS4, Line 548:               Process(target=self._start_single_runner, 
args=(query_runner, ))
This looks hairy to me, can you add a comment explaining how exactly sharing of 
data works here? My understanding of Process() is that only particular members 
of query_runner (Value, Array) will use shared memory and be synchronized.



--
To view, visit http://gerrit.cloudera.org:8080/9635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I10c5dc9b8c2fffc471bac2279e348bc1d1fec3b7
Gerrit-Change-Number: 9635
Gerrit-PatchSet: 4
Gerrit-Owner: Sailesh Mukil <sail...@cloudera.com>
Gerrit-Reviewer: David Knupp <dkn...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Michael Brown <mi...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-Comment-Date: Thu, 15 Mar 2018 21:16:11 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6662: Make stress test resilient to hangs due to client crashes

Reply via email to