[ https://issues.apache.org/jira/browse/KUDU-3624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17901276#comment-17901276 ]
ASF subversion and git services commented on KUDU-3624:
-------------------------------------------------------
Commit da6211df5f8df0c53ceedd542b61634f3bab7205 in kudu's branch
refs/heads/master from Ádám Bakai
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=da6211df5 ]
[subprocess] KUDU-3624 Fix DoWait thread-safety
waitpid() returns an error if it is called with a pid that has already
been waited for. For that reason Subprocess::DoWait() stores the result
of the first waitpid() call and returns it on later calls instead of
calling waitpid() again. But in
EchoSubprocessTest.TestSubprocessMetricsOnError it can happen that
SubprocessServer::ExitCheckerThread() and Subprocess::KillAndWait()
call Subprocess::DoWait() concurrently, so both of them call waitpid().
If ExitCheckerThread() calls it second, it fails the following check:
Check failed: _s.ok() Bad status: Runtime error: Unable
to wait on child: No child processes (error 10)
To fix this behaviour, wait_mutex_ is added. While one thread is
calling waitpid(), other threads cannot call it at the same time. If
the lock cannot be acquired and the WaitMode is NON_BLOCKING, the call
returns as if nothing happened. The unit test
SubprocessTest.TestMultiThreadWait was added to verify that two wait
calls can be executed concurrently.
Change-Id: I1cb540860b439c26e1c8529123c8b29940d9f84f
Reviewed-on: http://gerrit.cloudera.org:8080/22056
Tested-by: Alexey Serbin <[email protected]>
Reviewed-by: Alexey Serbin <[email protected]>
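The locking scheme described in the commit message can be sketched roughly
as follows. This is a minimal illustration for POSIX systems, not the actual
Kudu code: ChildWaiter, SpawnChild and BothWaitersAgree are hypothetical
names invented for this example, and only wait_mutex_ and the
BLOCKING/NON_BLOCKING distinction are taken from the text above.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <mutex>
#include <thread>

// Hypothetical stand-in for Subprocess; not the actual Kudu class.
class ChildWaiter {
 public:
  enum WaitMode { BLOCKING, NON_BLOCKING };
  enum Result { EXITED, STILL_RUNNING, FAILED };

  explicit ChildWaiter(pid_t pid) : pid_(pid) {}

  // Calls waitpid() successfully at most once; later calls return the
  // cached status instead of waiting on an already reaped pid.
  Result Wait(WaitMode mode, int* status) {
    assert(status != nullptr);
    std::unique_lock<std::mutex> lock(wait_mutex_, std::defer_lock);
    if (mode == BLOCKING) {
      lock.lock();
    } else if (!lock.try_lock()) {
      // Another thread is inside waitpid(): return as if nothing happened.
      return STILL_RUNNING;
    }
    if (reaped_) {
      *status = cached_status_;
      return EXITED;
    }
    int st = 0;
    pid_t rc;
    do {
      rc = waitpid(pid_, &st, mode == BLOCKING ? 0 : WNOHANG);
    } while (rc == -1 && errno == EINTR);
    if (rc == 0) return STILL_RUNNING;  // only possible with WNOHANG
    if (rc == -1) return FAILED;        // e.g. ECHILD
    reaped_ = true;
    cached_status_ = st;
    *status = st;
    return EXITED;
  }

 private:
  const pid_t pid_;
  std::mutex wait_mutex_;  // guards reaped_, cached_status_ and waitpid()
  bool reaped_ = false;
  int cached_status_ = 0;
};

// Forks a child that sleeps briefly and then exits with exit_code.
pid_t SpawnChild(int exit_code) {
  pid_t pid = fork();
  if (pid == 0) {
    usleep(50 * 1000);
    _exit(exit_code);
  }
  return pid;
}

// Runs two blocking waits concurrently; returns true if both report that
// the child exited and both observed the same status.
bool BothWaitersAgree(ChildWaiter* waiter, int* status) {
  int s1 = -1, s2 = -1;
  ChildWaiter::Result r1 = ChildWaiter::FAILED, r2 = ChildWaiter::FAILED;
  std::thread t1([&] { r1 = waiter->Wait(ChildWaiter::BLOCKING, &s1); });
  std::thread t2([&] { r2 = waiter->Wait(ChildWaiter::BLOCKING, &s2); });
  t1.join();
  t2.join();
  *status = s1;
  return r1 == ChildWaiter::EXITED && r2 == ChildWaiter::EXITED && s1 == s2;
}
```

In this sketch a blocking caller that loses the race simply waits for the
mutex and then picks up the cached status, while a non-blocking caller that
cannot take the lock reports that the child is still running. Calling
waitpid() directly on the same pid after the child has been reaped is
expected to fail with ECHILD, which corresponds to the "No child processes"
error in the check failure quoted above.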
> EchoSubprocessTest.TestSubprocessMetricsOnError is flaky
> --------------------------------------------------------
>
> Key: KUDU-3624
> URL: https://issues.apache.org/jira/browse/KUDU-3624
> Project: Kudu
> Issue Type: Sub-task
> Reporter: Bakai Ádám
> Assignee: Bakai Ádám
> Priority: Major
>
> {code:java}
> I20241104 14:09:34.411099 543202 server.cc:273] Received an EOF from the
> subprocess
> W20241104 14:09:34.411275 543165 server.cc:408] The subprocess has exited
> with status 9
> I20241104 14:09:34.417060 543203 server.cc:440] outbound queue shut down:
> Aborted:
> I20241104 14:09:34.417075 543200 server.cc:366] get failed, inbound queue
> shut down: Aborted:
> I20241104 14:09:34.417109 543201 server.cc:366] get failed, inbound queue
> shut down: Aborted:
> I20241104 14:09:34.417068 543199 server.cc:366] get failed, inbound queue
> shut down: Aborted:
> I20241104 14:09:37.790342 543244 server.cc:273] Received an EOF from the
> subprocess
> F20241104 14:09:37.790630 543207 server.cc:401] Check failed: _s.ok() Bad
> status: Runtime error: Unable to wait on child: No child processes (error 10)
> *** Check failure stack trace: ***
> I20241104 14:09:37.790678 543242 server.cc:366] get failed, inbound queue
> shut down: Aborted:
> I20241104 14:09:37.790673 543243 server.cc:366] get failed, inbound queue
> shut down: Aborted:
> *** Aborted at 1730729377 (unix time) try "date -d @1730729377" if you are
> using GNU date ***
> I20241104 14:09:37.790684 543241 server.cc:366] get failed, inbound queue
> shut down: Aborted:
> I20241104 14:09:37.790699 543245 server.cc:440] outbound queue shut down:
> Aborted:
> PC: @ 0x0 (unknown)
> *** SIGABRT (@0x848a9) received by PID 542889 (TID 0x704a47600700) from PID
> 542889; stack trace: ***
> @ 0x704a4d478980 (unknown)
> @ 0x704a4d0b3e87 gsignal
> @ 0x704a4d0b57f1 abort
> @ 0x704a4e171d8d google::LogMessage::Fail()
> @ 0x704a4e175b53 google::LogMessage::SendToLog()
> @ 0x704a4e17178c google::LogMessage::Flush()
> @ 0x704a4e172f19 google::LogMessageFatal::~LogMessageFatal()
> @ 0x704a4f855a7e
> kudu::subprocess::SubprocessServer::ExitCheckerThread()
> @ 0x704a4f8529bd
> _ZZN4kudu10subprocess16SubprocessServer4InitEvENKUlvE0_clEv
> @ 0x704a4f8568ab
> _ZNSt17_Function_handlerIFvvEZN4kudu10subprocess16SubprocessServer4InitEvEUlvE0_E9_M_invokeERKSt9_Any_data
> @ 0x704a4f8cb06a std::function<>::operator()()
> @ 0x704a4eff682b kudu::Thread::SuperviseThread()
> @ 0x704a4d46d6db start_thread
> @ 0x704a4d19661f clone
> zsh: abort (core dumped) ./bin/subprocess_proxy-test --gtest_repeat=999
> --gtest_break_on_failure {code}