We are using gRPC with a purpose-built GCC 7.5-2019.12 cross-compiler targeting ARM Cortex-A15 or Cortex-A9, based on this software release from Linaro: http://releases.linaro.org/components/toolchain/gcc-linaro/7.5-2019.12
The hang can easily be seen by running our gTests on these devices for just a few iterations. Whenever it happens, the following message is printed on the console, in a never-ending loop, once every few seconds:

E0607 14:59:50.307433749 6785 ev_epollex_linux.cc:519] Error shutting down fd 29. errno: 9
E0607 15:00:17.878449028 6775 ev_epollex_linux.cc:519] Error shutting down fd 31. errno: 9
E0607 15:01:02.998995428 6775 ev_epollex_linux.cc:519] Error shutting down fd 31. errno: 9
E0607 15:01:18.042279014 6775 ev_epollex_linux.cc:519] Error shutting down fd 31. errno: 9

When I searched for this message on the internet, I found this open bug on gRPC: https://github.com/grpc/grpc/issues/25289

On Thursday, March 10, 2022 at 6:40:04 AM UTC+5:30 Anh Tran wrote:

> That problem seems to go away when I upgrade to protobuf 3.20, for some
> reason.
>
> On Tuesday, March 8, 2022 at 9:56:17 AM UTC+10 Anh Tran wrote:
>
>> It seems like there's some deadlock bug or undocumented expected
>> behaviour in the gRPC source code.
>>
>> However, for the life of me, I can't make a minimal example to reproduce
>> the problem.
>>
>> On Monday, March 7, 2022 at 1:06:32 PM UTC+10 Plin Chen wrote:
>>
>>> Our situation is that when we try to close the app without shutting the
>>> gRPC client down, the app hangs there.
>>> Of course, if we shut down the gRPC client manually several seconds
>>> before quitting the app, things go as normal.
>>> ----
>>> We are using the latest gRPC-CSharp version. I notice that the Shutdown
>>> method is an async method.
>>> When the "OnApplicationQuit" event is caught and we try to invoke
>>> Shutdown (gRPC) before the app really quits, the gRPC ThreadPool seems
>>> to block the quitting process.
>>> ----
>>> We've tried 2 workarounds, and either works:
>>> 1. Catch OnApplicationQuit, invoke gRPC's Shutdown, and let the
>>> application wait for 2 seconds before trying to close the app.
>>> 2. Every time the client finishes calling a remote procedure on the
>>> grpc server, shut down the grpc client at once.
>>>
>>> On Friday, March 4, 2022 at 9:18:19 PM UTC+8 Anh Tran wrote:
>>>
>>>> I had some time to revisit this problem today. Running with
>>>> GRPC_VERBOSITY set to DEBUG gives me this (anything starting with
>>>> *[2022-03-04T23* is my own logging):
>>>>
>>>> [2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457, thread 501457] Logger level is set at info.
>>>> [2022-03-04T23:15:41+10:00] [workflow_client] [warning] [process 501457, thread 501457] Persistent not specified and will default to false
>>>> D0304 23:15:41.640009936 501457 ev_posix.cc:172] Using polling engine: epollex
>>>> D0304 23:15:41.640079032 501457 lb_policy_registry.cc:42] registering LB policy factory for "grpclb"
>>>> D0304 23:15:41.640106084 501457 lb_policy_registry.cc:42] registering LB policy factory for "priority_experimental"
>>>> D0304 23:15:41.640124027 501457 lb_policy_registry.cc:42] registering LB policy factory for "weighted_target_experimental"
>>>> D0304 23:15:41.640127329 501457 lb_policy_registry.cc:42] registering LB policy factory for "pick_first"
>>>> D0304 23:15:41.640150128 501457 lb_policy_registry.cc:42] registering LB policy factory for "round_robin"
>>>> D0304 23:15:41.640153574 501457 lb_policy_registry.cc:42] registering LB policy factory for "ring_hash_experimental"
>>>> D0304 23:15:41.640162650 501457 dns_resolver_ares.cc:624] Using ares dns resolver
>>>> D0304 23:15:41.640230087 501457 certificate_provider_registry.cc:33] registering certificate provider factory for "file_watcher"
>>>> D0304 23:15:41.640236782 501457 lb_policy_registry.cc:42] registering LB policy factory for "cds_experimental"
>>>> D0304 23:15:41.640241375 501457 lb_policy_registry.cc:42] registering LB policy factory for "xds_cluster_impl_experimental"
>>>> D0304 23:15:41.640258747 501457 lb_policy_registry.cc:42] registering LB policy factory for "xds_cluster_resolver_experimental"
>>>> D0304 23:15:41.640262652 501457 lb_policy_registry.cc:42] registering LB policy factory for "xds_cluster_manager_experimental"
>>>> [2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457, thread 501457] gRPC client at localhost:4500 is registering with the server...
>>>> I0304 23:15:41.640909979 501458 subchannel.cc:945] subchannel 0x55e6713dcea0 {address=ipv6:[::1]:4500, args=grpc.client_channel_factory=0x55e6713d6a20, grpc.default_authority=localhost:4500, grpc.internal.subchannel_pool=0x55e6713bb810, grpc.primary_user_agent=grpc-c++/1.44.0-dev, grpc.resource_quota=0x55e6713d7b40, grpc.server_uri=dns:///localhost:4500}: connect failed: {"created":"@1646399741.640833261","description":"Failed to connect to remote host: Connection refused","errno":111,"file":"/home/hoover/workspace/DPAV/Linux/OpenSource/gRPC-focal/src/core/lib/iomgr/tcp_client_posix.cc","file_line":200,"os_error":"Connection refused","syscall":"connect","target_address":"ipv6:[::1]:4500"}
>>>> [2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457, thread 501457] ... gRPC client at localhost:4500 successfully registered with server
>>>> I0304 23:15:41.642209150 501462 server_builder.cc:327] Synchronous server. Num CQs: 1, Min pollers: 1, Max Pollers: 2, CQ timeout (msec): 10000
>>>> I0304 23:15:41.642437546 501462 socket_utils_common_posix.cc:353] TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter
>>>> [2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457, thread 501457] JobClient client constructed.
>>>> [2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457, thread 501457] HelloWorld client constructed.
>>>> [2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457, thread 501457] New id assigned: localhost:4502:598ec6d9-8a75-4500-b34f-9cab3160c642
>>>> [2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457, thread 501457] Client localhost:4502:598ec6d9-8a75-4500-b34f-9cab3160c642 is trying to register client for job type: : hello_world
>>>> [2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457, thread 501463] Client localhost:4502:598ec6d9-8a75-4500-b34f-9cab3160c642 registered with Job Server 4501
>>>> [2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457, thread 501457] Client localhost:4502:598ec6d9-8a75-4500-b34f-9cab3160c642 received ACK for RegisterNewClient.
>>>> [2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457, thread 501484] Client localhost:4502:598ec6d9-8a75-4500-b34f-9cab3160c642 identified JSON message as issueNewJob.
>>>> [2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457, thread 501484] Client localhost:4502:598ec6d9-8a75-4500-b34f-9cab3160c642 accepted job.
>>>> [2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457, thread 501484] Client localhost:4502:598ec6d9-8a75-4500-b34f-9cab3160c642 acknowledged the job.
>>>> [2022-03-04T23:15:42+10:00] [workflow_client] [info] [process 501457, thread 501485] Client localhost:4502:598ec6d9-8a75-4500-b34f-9cab3160c642 identified JSON message as jobProgressUpdateAck.
>>>> [2022-03-04T23:15:42+10:00] [workflow_client] [info] [process 501457, thread 501484] Client localhost:4502:598ec6d9-8a75-4500-b34f-9cab3160c642 received ACK for final JobProgressUpdate.
>>>> [2022-03-04T23:15:42+10:00] [workflow_client] [info] [process 501457, thread 501484] Client localhost:4502:598ec6d9-8a75-4500-b34f-9cab3160c642 finished processing a job and now shutting down.
>>>> D0304 23:15:42.482888232 501484 chttp2_transport.cc:1715] ipv4:127.0.0.1:56916: Sending goaway err={"created":"@1646399742.482860742","description":"Server shutdown","file":"/home/hoover/workspace/DPAV/Linux/OpenSource/gRPC-focal/src/core/lib/surface/server.cc","file_line":481,"grpc_status":0}
>>>>
>>>> On Monday, February 14, 2022 at 1:37:12 PM UTC+10 Anh Tran wrote:
>>>>
>>>>> Hi team,
>>>>>
>>>>> Yes, that is the case. I am also confused...
>>>>>
>>>>> On Saturday, February 12, 2022 at 9:00:29 AM UTC+10 veb...@google.com wrote:
>>>>>
>>>>>> Hi Anh,
>>>>>>
>>>>>> Thanks for the stack trace, but I'm a bit confused. Is this what you
>>>>>> got while your program was stuck waiting forever? It looks a bit off,
>>>>>> because Valgrind complains about "Conditional jump or move depends on
>>>>>> uninitialised value" and the call stack appears to show a deadlock
>>>>>> check inside.
>>>>>>
>>>>>> On Thursday, February 10, 2022 at 1:10:12 AM UTC-8 Anh Tran wrote:
>>>>>>
>>>>>>> Hi veb...@google.com,
>>>>>>>
>>>>>>> Thanks for getting back to me on this. Please see my stack trace
>>>>>>> from valgrind:
>>>>>>>
>>>>>>> ==287419== Memcheck, a memory error detector
>>>>>>> ==287419== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
>>>>>>> ==287419== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
>>>>>>> ==287419== Command: /home/atran/codes/workflow_client/dist/workflow/job_client_cli --job-type slam
>>>>>>> ==287419== Parent PID: 287418
>>>>>>> ==287419==
>>>>>>> ==287419== Conditional jump or move depends on uninitialised value(s)
>>>>>>> ==287419==    at 0x91743A: void** NextStackFrame<true, false>(void**, void const*, unsigned long, unsigned long) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>>>>>> ==287419==    by 0x916F98: int UnwindImpl<false, false>(void**, int*, int, int, void const*, int*) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>>>>>> ==287419==    by 0x916D8D: absl::lts_20211102::GetStackTrace(void**, int, int) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>>>>>> ==287419==    by 0x90F27C: absl::lts_20211102::(anonymous namespace)::GetStack(void**, int) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>>>>>> ==287419==    by 0x924569: absl::lts_20211102::synchronization_internal::GraphCycles::UpdateStackTrace(absl::lts_20211102::synchronization_internal::GraphId, int, int (*)(void**, int)) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>>>>>> ==287419==    by 0x90F342: absl::lts_20211102::DeadlockCheck(absl::lts_20211102::Mutex*) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>>>>>> ==287419==    by 0x90F84C: absl::lts_20211102::DebugOnlyDeadlockCheck(absl::lts_20211102::Mutex*) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>>>>>> ==287419==    by 0x90FB6E: absl::lts_20211102::Mutex::Lock() (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>>>>>> ==287419==    by 0x8A3AD0: gpr_mu_lock (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>>>>>> ==287419==    by 0x7DFA25: begin_worker(grpc_pollset*, grpc_pollset_worker*, grpc_pollset_worker**, long) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>>>>>> ==287419==    by 0x7DFFDA: pollset_work(grpc_pollset*, grpc_pollset_worker**, long) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>>>>>> ==287419==    by 0x7716F4: pollset_work(grpc_pollset*, grpc_pollset_worker**, long) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>>>>>> ==287419== Uninitialised value was created by a stack allocation
>>>>>>> ==287419==    at 0x1C3FD0: Emesent::GRPCQueue::Stub::Register(grpc::ClientContext*, Emesent::Empty const&, Emesent::RegistrationResponse*) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>>>>>> ==287419==
>>>>>>> ==287419== Thread 7 grpcpp_sync_ser:
>>>>>>> ==287419== Conditional jump or move depends on uninitialised value(s)
>>>>>>> ==287419==    at 0x916F7F: int UnwindImpl<false, false>(void**, int*, int, int, void const*, int*) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>>>>>> ==287419==    by 0x916D8D: absl::lts_20211102::GetStackTrace(void**, int, int) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>>>>>> ==287419==    by 0x90F27C: absl::lts_20211102::(anonymous namespace)::GetStack(void**, int) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>>>>>> ==287419==    by 0x924569: absl::lts_20211102::synchronization_internal::GraphCycles::UpdateStackTrace(absl::lts_20211102::synchronization_internal::GraphId, int, int (*)(void**, int)) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>>>>>> ==287419==    by 0x90F342: absl::lts_20211102::DeadlockCheck(absl::lts_20211102::Mutex*) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>>>>>> ==287419==    by 0x90F84C: absl::lts_20211102::DebugOnlyDeadlockCheck(absl::lts_20211102::Mutex*) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>>>>>> ==287419==    by 0x90FB6E: absl::lts_20211102::Mutex::Lock() (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>>>>>> ==287419==    by 0x8A3AD0: gpr_mu_lock (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>>>>>> ==287419==    by 0x5121A4: grpc_cq_begin_op(grpc_completion_queue*, void*) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>>>>>> ==287419==    by 0x5199C4: grpc_core::Server::ShutdownAndNotify(grpc_completion_queue*, void*) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>>>>>> ==287419==    by 0x51CD63: grpc_server_shutdown_and_notify (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>>>>>> ==287419==    by 0x231AA2: grpc::Server::ShutdownInternal(gpr_timespec) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>>>>>> ==287419== Uninitialised value was created by a stack allocation
>>>>>>> ==287419==    at 0x1D37E0: std::_Function_handler<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), workflow_client::GRPCIcomm::start_and_wait()::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#1}>::_M_invoke(std::_Any_data const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>>>>>> ==287419==
>>>>>>>
>>>>>>> In case this is interesting, this is how I start my server and wait
>>>>>>> until it shuts down:
>>>>>>>
>>>>>>> std::string server_addr_and_port = base_address_ + ":" + id_;
>>>>>>> grpc::ServerBuilder server_builder;
>>>>>>> server_builder.AddListeningPort(server_addr_and_port, grpc::InsecureServerCredentials());
>>>>>>>
>>>>>>> GRPCQueueImpl grpc_queue_service;
>>>>>>> server_builder.RegisterService(&grpc_queue_service);
>>>>>>> server_ = server_builder.BuildAndStart();
>>>>>>> server_->Wait();
>>>>>>>
>>>>>>> On Thursday, February 10, 2022 at 9:08:39 AM UTC+10 ctmah...@gmail.com wrote:
>>>>>>>
>>>>>>>> I have also experienced the same issue in my work environment. Our
>>>>>>>> workaround was to have an N-second timeout, check a shutdown flag,
>>>>>>>> and then call Shutdown.
>>>>>>>>
>>>>>>>> On Feb 9, 2022 at 3:54 PM, <'veb...@google.com' via grpc.io> wrote:
>>>>>>>>
>>>>>>>> Would you capture the call stack when the Shutdown function appears
>>>>>>>> to hang?
>>>>>>>>
>>>>>>>> On Wednesday, February 2, 2022 at 10:24:14 PM UTC-8 Anh Tran wrote:
>>>>>>>>
>>>>>>>>> The last commit on my local copy of gRPC is 06 Jan 2022, so not
>>>>>>>>> that old.
>>>>>>>>>
>>>>>>>>> On Thursday, February 3, 2022 at 4:51:16 AM UTC+10 Easwar Swaminathan wrote:
>>>>>>>>>
>>>>>>>>>> Has any of y'all seen this in a more recent version of gRPC?
>>>>>>>>>> `1.30.2` is really old.
>>>>>>>>>>
>>>>>>>>>> On Tuesday, February 1, 2022 at 12:15:54 AM UTC-8 Prasanta Behera wrote:
>>>>>>>>>>
>>>>>>>>>>> No, I've not found any solution yet! I was trying to put together
>>>>>>>>>>> a hello-world kind of example to reproduce it and open a ticket,
>>>>>>>>>>> but as you said, with such simple examples it never reproduces...
>>>>>>>>>>>
>>>>>>>>>>> On Tuesday, February 1, 2022 at 1:42:43 PM UTC+5:30 Anh Tran wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I should also add that this behaviour happens during threading.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tuesday, February 1, 2022 at 3:45:04 PM UTC+10 Anh Tran wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Unfortunately, I came across the same problem.
>>>>>>>>>>>>> When trying to isolate the problem (with a simple hello-world
>>>>>>>>>>>>> server and client), it seems to go away, i.e. the server can be
>>>>>>>>>>>>> shut down.
>>>>>>>>>>>>>
>>>>>>>>>>>>> But when used in my own project, this happens. And despite my
>>>>>>>>>>>>> efforts to track down the problem, I can't seem to spot one.
>>>>>>>>>>>>> This is weird, since this version of Shutdown should force a
>>>>>>>>>>>>> shutdown after a certain period, as described in the docs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Prasanta, did you happen to find a solution to your problem?
>>>>>>>>>>>>> Maybe it will give a hint to mine.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thursday, January 20, 2022 at 10:02:43 PM UTC+10 Prasanta Behera wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> gRPC version: 1.30.2 (C++)
>>>>>>>>>>>>>> $ gcc --version
>>>>>>>>>>>>>> gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We use gRPC in synchronous mode and stop our gRPC services
>>>>>>>>>>>>>> with this code:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> m_server->Shutdown(std::chrono::system_clock::now() + std::chrono::seconds(1));
>>>>>>>>>>>>>> m_server->Wait();
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On rare occasions (about once in 1000 tries) the Shutdown()
>>>>>>>>>>>>>> call does NOT return and the process just hangs.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is there anything I'm missing to have a reliable shutdown of
>>>>>>>>>>>>>> the server?
>>>>>>>>>>>>>
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "grpc.io" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>> send an email to grpc-io+u...@googlegroups.com.
>>>>>>>> To view this discussion on the web visit
>>>>>>>> https://groups.google.com/d/msgid/grpc-io/96350454-3e52-4068-b3e1-1fea25537098n%40googlegroups.com
>>>>>>>> .