Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/9635 )
Change subject: IMPALA-6662: Make stress test resilient to hangs due to client crashes ...................................................................... Patch Set 4: (6 comments) I had a first look and left some comments but I think the approach needs a more thorough overhaul. It is not clear to me how queries and query-runners relate to each other and the data structures involved in starting new runners seem opaque to me. I think it would be easier to understand if we had a proper multiprocessing.Queue() with work items (each representing a query) and a pool of workers that consume items from the queue and put results into a result queue, which then gets processed by the main process. Then the whole bug would come down to an item being removed from the work queue but no corresponding item showing up in the result queue. It would also simplify the locking and computing sums of counters. I can see how that could be out of scope for this change. Let me know if you'd move forward with the current change as is and I'll do another pass. http://gerrit.cloudera.org:8080/#/c/9635/4//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9635/4//COMMIT_MSG@53 PS4, Line 53: Testing: Ran the stress test with the new patch a few times to make sure You could add a pause to one of the runners and kill it to test this manually. http://gerrit.cloudera.org:8080/#/c/9635/4/tests/stress/concurrent_select.py File tests/stress/concurrent_select.py: http://gerrit.cloudera.org:8080/#/c/9635/4/tests/stress/concurrent_select.py@327 PS4, Line 327: # All values below are cumulative. maybe add a comment to describe how these are populated? It think it would also help to add a brief comment to explain what each of them means. http://gerrit.cloudera.org:8080/#/c/9635/4/tests/stress/concurrent_select.py@373 PS4, Line 373: for runner in self._query_runners: You could just call _total_num_queries_started() and _total_num_queries_finished() here. However, you'd need to switch _query_runners_lock to RLock. http://gerrit.cloudera.org:8080/#/c/9635/4/tests/stress/concurrent_select.py@429 PS4, Line 429: with self._query_runners_lock: I don't think holding the lock is necessary here. Python's GIL makes sure that list operations themselves are threadsafe http://gerrit.cloudera.org:8080/#/c/9635/4/tests/stress/concurrent_select.py@528 PS4, Line 528: while self._total_num_queries_started() < self._num_queries_to_run: It is not clear to me whether a query runner will run a single query and then exit, or whether it will keep running queries. Here it looks like we are starting new runners until we have run all of the queries, one per query. However, in _start_single_runner it looks like we are running a loop over the global query queue. Does this mean all runners continue to run queries, and we also continue to ramp up the number of runners all the time? http://gerrit.cloudera.org:8080/#/c/9635/4/tests/stress/concurrent_select.py@548 PS4, Line 548: Process(target=self._start_single_runner, args=(query_runner, )) This looks hairy to me, can you add a comment explaining how exactly sharing of data works here? My understanding of Process() is that only particular members of query_runner (Value, Array) will use shared memory and be synchronized. -- To view, visit http://gerrit.cloudera.org:8080/9635 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I10c5dc9b8c2fffc471bac2279e348bc1d1fec3b7 Gerrit-Change-Number: 9635 Gerrit-PatchSet: 4 Gerrit-Owner: Sailesh Mukil <sail...@cloudera.com> Gerrit-Reviewer: David Knupp <dkn...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Michael Brown <mi...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-Comment-Date: Thu, 15 Mar 2018 21:16:11 +0000 Gerrit-HasComments: Yes