gaussianrecurrence commented on a change in pull request #813:
URL: https://github.com/apache/geode-native/pull/813#discussion_r812251799
##########
File path: cppcache/integration-test/fw_dunit.cpp
##########
@@ -361,19 +355,48 @@ void Task::setTimeout(int seconds) {
}
}
-class TestProcess : virtual public dunit::Manager {
- private:
- WorkerId m_sId;
-
+class TestProcess {
public:
TestProcess(const std::string &cmdline, uint32_t id)
- : Manager(cmdline), m_sId(id) {}
+ : id_{id}, running_{false}, cmd_{cmdline} {}
- WorkerId &getWorkerId() { return m_sId; }
+ WorkerId &getWorkerId() { return id_; }
+
+ void run() {
+ auto arguments = bpo::split_unix(cmd_);
+
+ std::string exe = arguments[0];
+ arguments.erase(arguments.begin());
+ process_ = bp::child(exe, bp::args = arguments);
+
+ process_.wait();
+ if (process_.exit_code() != 0) {
+ std::clog << "Worker " << id_.getIdName() << " exited with code "
+ << process_.exit_code() << std::endl;
+ }
+
+ running_ = false;
+ }
+
+ void start() {
+ running_ = true;
+ thread_ = std::thread{[this]() { run(); }};
+ }
+
+ void stop() {
+ if (thread_.joinable()) {
+ thread_.join();
+ }
+ }
+
+ bool running() const { return running_; }
Review comment:
The thing is, this running flag only affects the logic that checks whether
a worker has terminated prematurely. That said, you're right that for
consistency it would be best to set the flag to false at the end of stop(),
even though nothing reads it after that.
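For reference, a minimal sketch of what that would look like, reusing the
names from the diff above (the extra assignment at the end is the only
change):

  void stop() {
    if (thread_.joinable()) {
      thread_.join();
    }
    // Nothing reads the flag after stop(), but this keeps it consistent.
    running_ = false;
  }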
Regarding the shutdown logic, it's an interesting question, because I had to
look into it several times in order to solve the issue that was happening on
Windows.
This is how it works up to this change (sketched in code right after the
list):
1. The coordinator (TestDriver) waits until all of the tasks are completed
or until the shared state is set to failed.
2. In either case, the termination shared state is set to true and the
coordinator waits for all of the workers to complete their work (meaning
every worker's status is WORKER_STATE_DONE).
   While waiting for all of the workers to finish their tasks, the
coordinator allows a grace period of TASK_TIMEOUT (hardcoded as 1800
seconds). If the timeout is exceeded, the shared state is set to failed.
3. After this, the coordinator (TestDriver) is destroyed, which only means
that all of the open handles are cleaned up, nothing else (see the
ACE_Process destructor for reference). Note that once TestDriver is
destroyed, there is no guarantee that the worker processes have actually
terminated. For example, a worker could still be doing some cleanup even
though its state is set to WORKER_STATE_DONE.
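Putting the three steps above into rough code; the helpers allTasksDone(),
sharedStateFailed(), setTermination(), allWorkersDone() and
setSharedStateFailed() are hypothetical stand-ins for the real fw_dunit
shared-state accessors, not actual names from the framework:

  #include <chrono>
  #include <thread>

  // Hypothetical stand-ins for the fw_dunit shared-state accessors.
  bool allTasksDone();
  bool sharedStateFailed();
  void setTermination(bool value);
  bool allWorkersDone();  // true once every worker is WORKER_STATE_DONE
  void setSharedStateFailed();

  void coordinatorShutdown() {
    // Step 1: wait until every task is completed or the shared state failed.
    while (!allTasksDone() && !sharedStateFailed()) {
      std::this_thread::sleep_for(std::chrono::seconds(1));
    }

    // Step 2: signal termination and give the workers up to TASK_TIMEOUT
    // (hardcoded as 1800 seconds) to reach WORKER_STATE_DONE.
    setTermination(true);
    const auto deadline =
        std::chrono::steady_clock::now() + std::chrono::seconds(1800);
    while (!allWorkersDone()) {
      if (std::chrono::steady_clock::now() > deadline) {
        setSharedStateFailed();
        break;
      }
      std::this_thread::sleep_for(std::chrono::seconds(1));
    }

    // Step 3 (before this change): the coordinator is destroyed here; only
    // the open handles are released, the worker processes may still be alive.
  }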
What I changed is the 3rd step (the 1st and 2nd remain unchanged): now,
before destroying the coordinator, we wait for all of the worker processes
to actually terminate.
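With the TestProcess from the diff above, that new step boils down to
something like this (sketch; workers_ is an assumed container of TestProcess
instances held by the coordinator):

  // New step 3: join every worker's thread before tearing the coordinator
  // down. run() only returns after bp::child::wait() has observed the worker
  // process exit, so this blocks until all workers have actually terminated.
  for (auto &worker : workers_) {
    worker.stop();
  }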
The thing I noticed is that ctest on Windows waits for all of the child
processes to finish before exiting. That means that if you start up a
cluster inside a test run with ctest, ctest won't exit until all of the java
processes terminate. So, with either implementation, ctest could get stuck
if, for example, the worker in charge of starting up and tearing down the
cluster terminates unexpectedly.
The implementation before revision 1 terminated all of the workers
non-gracefully, meaning cleanup could be interrupted, leaving some members
online and hence getting ctest stuck; that's what I solved in revision 1.
So, answering your question: what guarantees are there that termination
actually happens? None, with either implementation. I get that the comment
"// worker destructor should terminate process." might be confusing, but the
thing is that the ACE_Process destructor does not do that, and if it did, it
could lead to deadlocks on Windows.
I could explore the possibility of killing all of the coordinator's
children, hence guaranteeing termination, but the thing is that there is no
cross-platform way of doing that with Boost. I did not invest much time into
this, since we already set a ctest timeout, and it's highly unlikely that
the worker in charge of shutting down the cluster terminates unexpectedly
(it would require something like GFSH segfaulting).
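For completeness, the closest thing Boost.Process offers portably is
hard-killing the processes the coordinator itself tracks, roughly like this
(sketch; process() is a hypothetical accessor on TestProcess, and this still
would not reach grandchildren such as the java members started through
GFSH):

  for (auto &worker : workers_) {
    auto &child = worker.process();  // hypothetical accessor for bp::child
    if (child.running()) {
      // terminate() is a hard kill (TerminateProcess on Windows, SIGKILL on
      // POSIX), so any cleanup the worker was doing is simply lost.
      child.terminate();
    }
  }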
So, all things considered (and sorry for the long read), do you think it's
worth looking into how to kill all of the children, or should we just rely
on the ctest timeout for this highly unlikely scenario?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]