Dear SkyWalking Dev team,
I have deployed the SkyWalking Java agent and the UI/OAP/ES services into our
backend microservices K8s cluster. During JMeter performance testing we found
many org.apache.skywalking.apm.dependencies.io.grpc.StatusRuntimeException:
DEADLINE_EXCEEDED errors in the logs on both the agent side and the OAP server
side.
Agent side:
ERROR 2020-01-14 03:50:52:070 SkywalkingAgent-5-ServiceAndEndpointRegisterClient-0 ServiceAndEndpointRegisterClient : ServiceAndEndpointRegisterClient execute fail.
org.apache.skywalking.apm.dependencies.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED
    at org.apache.skywalking.apm.dependencies.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:222)
ERROR 2020-01-14 03:46:22:069 SkywalkingAgent-4-JVMService-consume-0 JVMService : send JVM metrics to Collector fail.
org.apache.skywalking.apm.dependencies.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED
    at org.apache.skywalking.apm.dependencies.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:222)
OAP server side:
2020-01-14 03:53:18,935 - org.apache.skywalking.oap.server.core.remote.client.GRPCRemoteClient -147226067 [grpc-default-executor-863] ERROR [] - DEADLINE_EXCEEDED: deadline exceeded after 19999979082ns
io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 19999979082ns
    at io.grpc.Status.asRuntimeException(Status.java:526) ~[grpc-core-1.15.1.jar:1.15.1]
The respective Instance Throughput curves also differ: the curve is non-flat
while the exception is being logged and flat when it is not.
[two inline images: non-flat curve (with exception log) vs. flat curve (no
exception log)]
I checked TraceSegmentServiceClient and the related source code and found that
this exception on the agent side is handled as a consumer error, but the data
that failed to send is not added to the abandoned data count.
[inline image]
When this gRPC exception occurs, is the trace data being sent to the OAP server
lost?
If the trace data is lost, why is it not counted in the abandoned data
statistic? And are the metrics calculated for that time range distorted because
the collected trace data is incomplete?
Is there any configuration needed on the agent side and/or the OAP server side
to resolve this gRPC exception and avoid losing trace data?
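To make the accounting question concrete, here is a minimal sketch of the behavior I would have expected. It is not SkyWalking's actual code: UplinkAccountingSketch, SegmentSender and both counters are hypothetical names used only for illustration, and it assumes a batch that fails to send is not retried.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, NOT SkyWalking's actual TraceSegmentServiceClient.
class UplinkAccountingSketch {

    /** Hypothetical stand-in for the gRPC upstream call; may throw, e.g. on DEADLINE_EXCEEDED. */
    interface SegmentSender {
        void send(List<String> segments) throws Exception;
    }

    private long uplinked;
    private long abandoned;

    /** Sends one batch and records whether it was delivered or lost. */
    void consume(List<String> batch, SegmentSender sender) {
        try {
            sender.send(batch);
            uplinked += batch.size();
        } catch (Exception e) {
            // Assumption: a failed batch is not retried, so it is counted as abandoned.
            abandoned += batch.size();
        }
    }

    long uplinked() { return uplinked; }

    long abandoned() { return abandoned; }

    public static void main(String[] args) {
        UplinkAccountingSketch acc = new UplinkAccountingSketch();
        // First batch is delivered successfully.
        acc.consume(Arrays.asList("seg-1", "seg-2"), segments -> { });
        // Second batch fails with a simulated deadline error.
        acc.consume(Arrays.asList("seg-3", "seg-4", "seg-5"), segments -> {
            throw new Exception("DEADLINE_EXCEEDED (simulated)");
        });
        System.out.println("uplinked=" + acc.uplinked() + " abandoned=" + acc.abandoned());
    }
}
```

With this kind of accounting, the two segments in the first batch would show up as uplinked and the three in the failed batch as abandoned, so the loss would be visible in the agent's statistics.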
P.S.
I previously hit the "trace segment has been abandoned, cause by buffer is full"
issue because the default 5*300 buffer was not large enough. In that case trace
data is lost on the agent side, before it is ever sent to the OAP collector.
After I increased the agent-side trace data buffer to 10*3000, the issue has
not occurred again. The log entry looked like this:
http-nio-0.0.0.0-9090-exec-23 TraceSegmentServiceClient : One trace segment has been abandoned, cause by buffer is full.
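As a rough illustration of why the larger buffer helped, the sketch below models a drop-on-full buffer with a channel count and a per-channel size. It is a simplified, single-threaded model under my own assumptions, not the agent's real buffer implementation; DropOnFullBufferSketch and the burst size of 2000 are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;

// Hypothetical sketch of a drop-on-full buffer; NOT the agent's real buffer implementation.
class DropOnFullBufferSketch {
    private final List<ArrayBlockingQueue<String>> channels = new ArrayList<>();
    private long abandoned;
    private int next;

    DropOnFullBufferSketch(int channelCount, int sizePerChannel) {
        for (int i = 0; i < channelCount; i++) {
            channels.add(new ArrayBlockingQueue<String>(sizePerChannel));
        }
    }

    /** Tries every channel once, starting round-robin; drops the segment if all are full. */
    boolean produce(String segment) {
        for (int i = 0; i < channels.size(); i++) {
            ArrayBlockingQueue<String> channel = channels.get((next + i) % channels.size());
            if (channel.offer(segment)) {
                next = (next + i + 1) % channels.size();
                return true;
            }
        }
        abandoned++; // no free slot anywhere: the segment is lost before any send attempt
        return false;
    }

    long abandoned() { return abandoned; }

    public static void main(String[] args) {
        int burst = 2000; // hypothetical burst arriving while no consumer drains the buffer
        DropOnFullBufferSketch small = new DropOnFullBufferSketch(5, 300);   // 1,500 slots
        DropOnFullBufferSketch large = new DropOnFullBufferSketch(10, 3000); // 30,000 slots
        for (int i = 0; i < burst; i++) {
            small.produce("seg-" + i);
            large.produce("seg-" + i);
        }
        System.out.println("5*300 abandoned: " + small.abandoned());
        System.out.println("10*3000 abandoned: " + large.abandoned());
    }
}
```

In this model the 5*300 buffer holds 1,500 segments, so a burst of 2,000 drops 500 of them, while the 10*3000 buffer holds 30,000 and drops none. A larger buffer only absorbs bursts, though; it does not help if the consumer side keeps failing.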
Thanks & Best Regards
Xiaochao Zhang(James)
DI SW CAS MP EMK DO-CHN
No.7, Xixin Avenue, Chengdu High-Tech Zone
Chengdu, China 611731