gaussianrecurrence commented on a change in pull request #813:
URL: https://github.com/apache/geode-native/pull/813#discussion_r812251799
##########
File path: cppcache/integration-test/fw_dunit.cpp
##########
@@ -361,19 +355,48 @@ void Task::setTimeout(int seconds) {
}
}
-class TestProcess : virtual public dunit::Manager {
- private:
- WorkerId m_sId;
-
+class TestProcess {
public:
TestProcess(const std::string &cmdline, uint32_t id)
- : Manager(cmdline), m_sId(id) {}
+ : id_{id}, running_{false}, cmd_{cmdline} {}
- WorkerId &getWorkerId() { return m_sId; }
+ WorkerId &getWorkerId() { return id_; }
+
+ void run() {
+ auto arguments = bpo::split_unix(cmd_);
+
+ std::string exe = arguments[0];
+ arguments.erase(arguments.begin());
+ process_ = bp::child(exe, bp::args = arguments);
+
+ process_.wait();
+ if (process_.exit_code() != 0) {
+ std::clog << "Worker " << id_.getIdName() << " exited with code "
+ << process_.exit_code() << std::endl;
+ }
+
+ running_ = false;
+ }
+
+ void start() {
+ running_ = true;
+ thread_ = std::thread{[this]() { run(); }};
+ }
+
+ void stop() {
+ if (thread_.joinable()) {
+ thread_.join();
+ }
+ }
+
+ bool running() const { return running_; }
Review comment:
The thing is, this running flag only affects the logic that checks whether
a worker has terminated prematurely. That said, you're right that for
consistency it would be best to set the flag to false at the end of stop(),
even though nothing reads it after that.
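For reference, a minimal sketch of what that would look like, reusing the
names from the diff above (the extra assignment at the end is the only
change):

  void stop() {
    if (thread_.joinable()) {
      thread_.join();
    }
    // Nothing reads the flag after stop(), but this keeps it consistent.
    running_ = false;
  }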
Regarding the shutdown logic, it's an interesting question, because I had to
look into it several times in order to solve the issue that was happening on
Windows.
This is how it works up to this change (sketched in code right after the
list):
1. The coordinator (TestDriver) waits until all of the tasks are completed
or until the shared state is set to failed.
2. In either case, the termination shared state is set to true and the
coordinator waits for all of the workers to complete their work (meaning
every worker's status is WORKER_STATE_DONE).
   While waiting for all of the workers to finish their tasks, the
coordinator allows a grace period of TASK_TIMEOUT (hardcoded as 1800
seconds). If the timeout is exceeded, the shared state is set to failed.
3. After this, the coordinator (TestDriver) is destroyed, which only means
that all of the open handles are cleaned up, nothing else (see the
ACE_Process destructor for reference). Note that once TestDriver is
destroyed, there is no guarantee that the worker processes have actually
terminated. For example, a worker could still be doing some cleanup even
though its state is set to WORKER_STATE_DONE.
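Putting the three steps above into rough code; the helpers allTasksDone(),
sharedStateFailed(), setTermination(), allWorkersDone() and
setSharedStateFailed() are hypothetical stand-ins for the real fw_dunit
shared-state accessors, not actual names from the framework:

  #include <chrono>
  #include <thread>

  // Hypothetical stand-ins for the fw_dunit shared-state accessors.
  bool allTasksDone();
  bool sharedStateFailed();
  void setTermination(bool value);
  bool allWorkersDone();  // true once every worker is WORKER_STATE_DONE
  void setSharedStateFailed();

  void coordinatorShutdown() {
    // Step 1: wait until every task is completed or the shared state failed.
    while (!allTasksDone() && !sharedStateFailed()) {
      std::this_thread::sleep_for(std::chrono::seconds(1));
    }

    // Step 2: signal termination and give the workers up to TASK_TIMEOUT
    // (hardcoded as 1800 seconds) to reach WORKER_STATE_DONE.
    setTermination(true);
    const auto deadline =
        std::chrono::steady_clock::now() + std::chrono::seconds(1800);
    while (!allWorkersDone()) {
      if (std::chrono::steady_clock::now() > deadline) {
        setSharedStateFailed();
        break;
      }
      std::this_thread::sleep_for(std::chrono::seconds(1));
    }

    // Step 3 (before this change): the coordinator is destroyed here; only
    // the open handles are released, the worker processes may still be alive.
  }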
What I changed is the 3rd step (the 1st and 2nd remain unchanged): now,
before destroying the coordinator, we wait for all of the worker processes
to actually terminate.
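With the TestProcess from the diff above, that new step boils down to
something like this (sketch; workers_ is an assumed container of TestProcess
instances held by the coordinator):

  // New step 3: join every worker's thread before tearing the coordinator
  // down. run() only returns after bp::child::wait() has observed the worker
  // process exit, so this blocks until all workers have actually terminated.
  for (auto &worker : workers_) {
    worker.stop();
  }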
The thing I noticed is that ctest on Windows waits for all of the child
processes to finish before exiting. That means that if you start up a
cluster inside a test run with ctest, ctest won't exit until all of the java
processes terminate. So, with either implementation, ctest could get stuck
if, for example, the worker in charge of starting up and tearing down the
cluster terminates unexpectedly.
The implementation before revision 1 terminated all of the workers
non-gracefully, meaning cleanup could be interrupted, leaving some members
online and hence getting ctest stuck; that's what I solved in revision 1.
So, answering your question: what guarantees are there that termination
actually happens? None, with either implementation. I get that the comment
"// worker destructor should terminate process." might be confusing, but the
thing is that the ACE_Process destructor does not do that, and if it did, it
could lead to deadlocks on Windows.
I could explore the possibility of killing all of the coordinator's
children, hence guaranteeing termination, but the thing is that there is no
cross-platform way of doing that with Boost. I did not invest much time into
this, since we already set a ctest timeout, and it's highly unlikely that
the worker in charge of shutting down the cluster terminates unexpectedly
(it would require something like GFSH segfaulting).
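For completeness, the closest thing Boost.Process offers portably is
hard-killing the processes the coordinator itself tracks, roughly like this
(sketch; process() is a hypothetical accessor on TestProcess, and this still
would not reach grandchildren such as the java members started through
GFSH):

  for (auto &worker : workers_) {
    auto &child = worker.process();  // hypothetical accessor for bp::child
    if (child.running()) {
      // terminate() is a hard kill (TerminateProcess on Windows, SIGKILL on
      // POSIX), so any cleanup the worker was doing is simply lost.
      child.terminate();
    }
  }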
So, all things considered (and sorry for the long read), do you think it's
worth looking into how to kill all of the children, or should we just rely
on the ctest timeout for this highly unlikely scenario?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]