Thanks for the quick support, Wu Sheng!

Thanks & Best Regards
Xiaochao Zhang (James)
DI SW CAS MP EMK DO-CHN
No.7, Xixin Avenue, Chengdu High-Tech Zone
Chengdu, China 611731
Email: [email protected]

From: Sheng Wu <[email protected]>
Sent: Thursday, January 16, 2020 10:47 AM
To: dev <[email protected]>
Subject: Re: Question about org.apache.skywalking.apm.dependencies.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED

Inline.

Zhang, James <[email protected]> wrote on Thu, Jan 16, 2020 at 10:34 AM:

> Dear SkyWalking dev team,
>
> I have deployed the SkyWalking Java agent and the UI/OAP/ES services into our backend microservices K8s cluster. During our JMeter performance testing we see many org.apache.skywalking.apm.dependencies.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED logs on both the agent side and the OAP server side.
>
> Agent side:
>
> ERROR 2020-01-14 03:50:52:070 SkywalkingAgent-5-ServiceAndEndpointRegisterClient-0 ServiceAndEndpointRegisterClient : ServiceAndEndpointRegisterClient execute fail.
> org.apache.skywalking.apm.dependencies.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED
>     at org.apache.skywalking.apm.dependencies.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:222)
>
> ERROR 2020-01-14 03:46:22:069 SkywalkingAgent-4-JVMService-consume-0 JVMService : send JVM metrics to Collector fail.
> org.apache.skywalking.apm.dependencies.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED
>     at org.apache.skywalking.apm.dependencies.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:222)
>
> OAP server side:
>
> 2020-01-14 03:53:18,935 - org.apache.skywalking.oap.server.core.remote.client.GRPCRemoteClient -147226067 [grpc-default-executor-863] ERROR [] - DEADLINE_EXCEEDED: deadline exceeded after 19999979082ns
> io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 19999979082ns
>     at io.grpc.Status.asRuntimeException(Status.java:526) ~[grpc-core-1.15.1.jar:1.15.1]
>
> The corresponding Instance Throughput curve is also affected: it is non-flat while the exception is being logged, versus flat when there is no exception. [two throughput screenshots omitted]
>
> I checked TraceSegmentServiceClient and the related source code and found that this agent-side exception goes through the error-consume path, but the failed data is not counted into the abandoned data size statistics. [source code screenshot omitted]
>
> I am wondering: when this gRPC exception occurs, is the trace data sent to the OAP server lost or not?

Most likely, lost.

> If the trace data is lost, why is the lost data not counted into the abandoned data statistics? And isn't the metric calculation for the affected time range distorted, because the trace data collection is incomplete?

Because by using gRPC streaming, we don't know how many segments were lost.
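To make that point concrete, here is a minimal, purely hypothetical sketch (the class and field names are invented; this is not the actual TraceSegmentServiceClient code): with a streaming upload, a whole batch of segments is written onto one stream, and the error callback only says that the stream failed, so the client can at best record an upper bound on the lost segments, never an exact count.

    // Hypothetical illustration only; names are made up and do not come from the
    // SkyWalking agent source.
    import io.grpc.stub.StreamObserver;

    import java.util.List;
    import java.util.concurrent.atomic.AtomicLong;

    class StreamingUploadSketch<T> {
        private final AtomicLong inFlight = new AtomicLong();      // segments written to the current stream
        private final AtomicLong possiblyLost = new AtomicLong();  // upper bound only, never an exact count

        void upload(List<T> segments, StreamObserver<T> requestStream) {
            for (T segment : segments) {
                requestStream.onNext(segment);   // fire-and-forget onto the gRPC stream
                inFlight.incrementAndGet();
            }
            requestStream.onCompleted();
        }

        // Invoked from the response observer's onError, e.g. on DEADLINE_EXCEEDED.
        void onStreamError(Throwable cause) {
            // The stream died, but the server may already have stored some of the
            // in-flight segments, so the best the client can do is count "up to N lost".
            possiblyLost.addAndGet(inFlight.getAndSet(0));
        }
    }

A unary call per segment would give an exact failure count, but at a much higher per-segment cost, which is presumably why the streaming design accepts this uncertainty.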
> Is there any configuration needed on the agent and/or OAP server side to resolve this gRPC exception and avoid losing trace data?

I think you should increase the backend resources or resolve the network instability.

> P.S. I also hit the "trace segment has been abandoned, cause by buffer is full" issue before, because the default 5*300 buffer is not enough. In that case the trace data is lost directly on the agent side, before it is ever sent to the OAP collector.

5 * 3000 should be enough for most users, unless your system is under very high load or the network is unstable, like I said above. Since you say 10 * 3000 works better, I am guessing your network or its performance is not stable, so you need more buffer at the agent side to hold the data.

> However, after I increased the agent-side trace data buffer to 10*3000, this abandoned issue never occurred again.
>
> http-nio-0.0.0.0-9090-exec-23 TraceSegmentServiceClient : One trace segment has been abandoned, cause by buffer is full.
>
> Thanks & Best Regards
>
> Xiaochao Zhang (James)
> DI SW CAS MP EMK DO-CHN
> No.7, Xixin Avenue, Chengdu High-Tech Zone
> Chengdu, China 611731
> Email: [email protected]
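For reference, the agent-side buffer discussed in the P.S. is sized as channel count times buffer size per channel in agent.config. Assuming a 6.x-era agent where the keys are buffer.channel_size and buffer.buffer_size (worth double-checking against the agent.config shipped with your agent version), the 10 * 3000 change mentioned above would look roughly like this:

    # agent.config -- keys assumed from a 6.x-era agent; verify against your bundled config.
    # Total in-memory capacity for pending trace segments is channel_size * buffer_size.
    buffer.channel_size=${SW_BUFFER_CHANNEL_SIZE:10}
    buffer.buffer_size=${SW_BUFFER_BUFFER_SIZE:3000}

The same values can usually also be supplied through the SW_BUFFER_CHANNEL_SIZE / SW_BUFFER_BUFFER_SIZE placeholders referenced above, which tends to be more convenient in a K8s deployment than editing the file inside the image.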
