Dear Skywalking Dev team,
I have deployed the SkyWalking Java agent and the UI/OAP/ES services into our 
backend microservices K8S cluster. During JMeter performance testing we found many 
org.apache.skywalking.apm.dependencies.io.grpc.StatusRuntimeException: 
DEADLINE_EXCEEDED logs on both the agent side and the OAP server side.
Agent side:
ERROR 2020-01-14 03:50:52:070 
SkywalkingAgent-5-ServiceAndEndpointRegisterClient-0 
ServiceAndEndpointRegisterClient : ServiceAndEndpointRegisterClient execute 
fail.
org.apache.skywalking.apm.dependencies.io.grpc.StatusRuntimeException: 
DEADLINE_EXCEEDED
        at 
org.apache.skywalking.apm.dependencies.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:222)
ERROR 2020-01-14 03:46:22:069 SkywalkingAgent-4-JVMService-consume-0 JVMService 
: send JVM metrics to Collector fail.
org.apache.skywalking.apm.dependencies.io.grpc.StatusRuntimeException: 
DEADLINE_EXCEEDED
        at 
org.apache.skywalking.apm.dependencies.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:222)

OAP server side:
2020-01-14 03:53:18,935 - 
org.apache.skywalking.oap.server.core.remote.client.GRPCRemoteClient -147226067 
[grpc-default-executor-863] ERROR [] - DEADLINE_EXCEEDED: deadline exceeded 
after 19999979082ns
io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 
19999979082ns
               at io.grpc.Status.asRuntimeException(Status.java:526) 
~[grpc-core-1.15.1.jar:1.15.1]

The respective Instance Throughput curves: non-flat (with the exception logs) 
vs. flat (no exception logs).
[two inline images, one per curve, not preserved]

I checked TraceSegmentServiceClient and the related source code and found that, 
on the agent side, this exception is handled as a consumer error, but the data 
affected by the error is not counted in the abandoned-data size statistic.
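
To make the accounting question concrete, here is a minimal sketch of the two 
loss paths. It is not the actual TraceSegmentServiceClient code: the class, the 
Sender interface and all counter names are hypothetical, and the failedToSend 
counter is exactly the kind of statistic that appears to be missing today.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration only; none of these names exist in SkyWalking.
class SegmentAccountingSketch {
    // Stand-in for the gRPC call that ships a batch to the OAP server.
    interface Sender {
        void send(List<String> batch) throws Exception;
    }

    private final ArrayBlockingQueue<String> buffer;
    final AtomicLong abandoned = new AtomicLong();    // dropped because the buffer was full
    final AtomicLong failedToSend = new AtomicLong(); // drained, then lost on a send error
    final AtomicLong sent = new AtomicLong();         // delivered successfully

    SegmentAccountingSketch(int capacity) {
        this.buffer = new ArrayBlockingQueue<>(capacity);
    }

    // Producer side: a full buffer means the segment is abandoned immediately.
    void produce(String segment) {
        if (!buffer.offer(segment)) {
            abandoned.incrementAndGet();
        }
    }

    // Consumer side: a failed send loses the whole drained batch.
    void consumeOnce(Sender sender) {
        List<String> batch = new ArrayList<>();
        buffer.drainTo(batch);
        if (batch.isEmpty()) {
            return;
        }
        try {
            sender.send(batch);
            sent.addAndGet(batch.size());
        } catch (Exception e) {
            // Without this line the loss is invisible in any statistic.
            failedToSend.addAndGet(batch.size());
        }
    }

    public static void main(String[] args) {
        SegmentAccountingSketch s = new SegmentAccountingSketch(2);
        s.produce("a");
        s.produce("b");
        s.produce("c"); // buffer full, abandoned
        s.consumeOnce(batch -> { throw new RuntimeException("DEADLINE_EXCEEDED"); });
        s.produce("d");
        s.consumeOnce(batch -> { });
        System.out.println("abandoned=" + s.abandoned.get()
            + " failedToSend=" + s.failedToSend.get()
            + " sent=" + s.sent.get());
    }
}
```

In this sketch the buffer-full drop and the send failure are tracked by separate 
counters, so both kinds of loss stay visible.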

When this gRPC exception occurs, is the trace data that was being sent to the 
OAP server lost?
If the trace data is lost, why is it not counted in the abandoned-data 
statistic? And are the metrics calculated for that time range distorted 
because the collected trace data is incomplete?

Is there any configuration needed on the agent and/or OAP server side to resolve 
this gRPC exception and avoid losing trace data?

P.S.
I also hit the "trace segment has been abandoned, cause by buffer is full" 
issue earlier because the default 5*300 buffer was not large enough. In that 
case trace data is lost on the agent side before it is ever sent to the OAP 
collector.
However, after I increased the agent-side trace data buffer to 10*3000, the 
abandoned issue never occurred again.
http-nio-0.0.0.0-9090-exec-23 TraceSegmentServiceClient : One trace segment has 
been abandoned, cause by buffer is full.
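
For reference, a sketch of the buffer change described above. It assumes that 
5*300 and 10*3000 mean channel count times per-channel size, and that the 
agent.config keys are buffer.channel_size and buffer.buffer_size; please check 
both against the agent.config shipped with your agent version.

```properties
# Assumed key names; defaults are 5 channels of 300 segments each (5*300).
buffer.channel_size=10
buffer.buffer_size=3000
```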

Thanks & Best Regards

Xiaochao Zhang(James)
DI SW CAS MP EMK DO-CHN
No.7, Xixin Avenue, Chengdu High-Tech Zone
Chengdu, China  611731
Email: [email protected] <mailto:[email protected]>
