kezhenxu94 opened a new issue, #9814: URL: https://github.com/apache/skywalking/issues/9814
### Search before asking

- [X] I had searched in the [issues](https://github.com/apache/skywalking/issues?q=is%3Aissue) and found no similar issues.

### Apache SkyWalking Component

OAP server (apache/skywalking)

### What happened

Currently the cluster coordinator uses a polling strategy to fetch the OAP instances, with a 5-second interval:

https://github.com/apache/skywalking/blob/34cfafe398e80ca1a7299e1243827937c1a691dd/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/remote/client/RemoteClientManager.java#L96

A common case is that when some of the OAP instances are restarted or shut down, the living ones keep trying to connect to the dead instances until the next polling round, causing error logs like this:

```
2022-10-18 07:33:28,297 - org.apache.skywalking.oap.server.core.remote.client.GRPCRemoteClient -31025 [grpc-default-executor-0] ERROR [] - UNAVAILABLE: io exception
io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
    at io.grpc.Status.asRuntimeException(Status.java:535) ~[grpc-api-1.49.0.jar:1.49.0]
    at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:487) [grpc-stub-1.49.0.jar:1.49.0]
    at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:563) [grpc-core-1.49.0.jar:1.49.0]
    at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70) [grpc-core-1.49.0.jar:1.49.0]
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:744) [grpc-core-1.49.0.jar:1.49.0]
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:723) [grpc-core-1.49.0.jar:1.49.0]
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) [grpc-core-1.49.0.jar:1.49.0]
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) [grpc-core-1.49.0.jar:1.49.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
    at java.lang.Thread.run(Unknown Source) [?:?]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /10.92.1.254:31800
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
    at io.netty.channel.unix.Errors.newConnectException0(Errors.java:155) ~[netty-transport-native-unix-common-4.1.81.Final-linux-x86_64.jar:4.1.81.Final]
    at io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:128) ~[netty-transport-native-unix-common-4.1.81.Final-linux-x86_64.jar:4.1.81.Final]
    at io.netty.channel.unix.Socket.finishConnect(Socket.java:359) ~[netty-transport-native-unix-common-4.1.81.Final-linux-x86_64.jar:4.1.81.Final]
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710) ~[netty-transport-classes-epoll-4.1.81.Final.jar:4.1.81.Final]
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:687) ~[netty-transport-classes-epoll-4.1.81.Final.jar:4.1.81.Final]
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567) ~[netty-transport-classes-epoll-4.1.81.Final.jar:4.1.81.Final]
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:489) ~[netty-transport-classes-epoll-4.1.81.Final.jar:4.1.81.Final]
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:397) ~[netty-transport-classes-epoll-4.1.81.Final.jar:4.1.81.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[netty-common-4.1.81.Final.jar:4.1.81.Final]
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.81.Final.jar:4.1.81.Final]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.81.Final.jar:4.1.81.Final]
    ... 1 more
```

Most service discovery services, like Kubernetes, provide a watcher/listener mode where clients can register for instance changes and get notified right after the instances' states change. For those that don't support a listener mode, it's also easy to wrap the polling mechanism and expose it as a listener. Both approaches are sketched below.
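For the Kubernetes case, here is a minimal sketch of the watcher idea using the official kubernetes-client/java informer API. The namespace, label selector, and handler bodies are illustrative assumptions, not SkyWalking's actual code, and the exact `listNamespacedPodCall` parameter list varies between client versions:

```java
import io.kubernetes.client.informer.ResourceEventHandler;
import io.kubernetes.client.informer.SharedIndexInformer;
import io.kubernetes.client.informer.SharedInformerFactory;
import io.kubernetes.client.openapi.ApiClient;
import io.kubernetes.client.openapi.apis.CoreV1Api;
import io.kubernetes.client.openapi.models.V1Pod;
import io.kubernetes.client.openapi.models.V1PodList;
import io.kubernetes.client.util.Config;

public class OapPodWatcher {
    public static void main(String[] args) throws Exception {
        ApiClient client = Config.defaultClient();
        CoreV1Api api = new CoreV1Api(client);
        SharedInformerFactory factory = new SharedInformerFactory(client);

        // The informer lists the pods once, then keeps a long-lived watch open,
        // so membership changes arrive as events instead of on a 5s poll.
        SharedIndexInformer<V1Pod> informer = factory.sharedIndexInformerFor(
            params -> api.listNamespacedPodCall(
                "skywalking",             // namespace (illustrative)
                null, null, null, null,
                "app=skywalking-oap",     // label selector (illustrative)
                null, params.resourceVersion, null,
                params.timeoutSeconds, params.watch, null),
            V1Pod.class, V1PodList.class);

        informer.addEventHandler(new ResourceEventHandler<V1Pod>() {
            @Override public void onAdd(V1Pod pod) {
                // e.g. RemoteClientManager could refresh its client list here
                System.out.println("OAP pod added: " + pod.getMetadata().getName());
            }
            @Override public void onUpdate(V1Pod oldPod, V1Pod newPod) {
                System.out.println("OAP pod updated: " + newPod.getMetadata().getName());
            }
            @Override public void onDelete(V1Pod pod, boolean unknownFinalState) {
                // A dead instance can be evicted immediately, instead of being
                // retried (and logging connection errors) until the next poll.
                System.out.println("OAP pod deleted: " + pod.getMetadata().getName());
            }
        });
        factory.startAllRegisteredInformers();
    }
}
```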
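And for registries that only support polling, a rough sketch of wrapping the poll behind the same listener-style API. The `InstanceListener` interface and class names here are hypothetical, just to show the shape of the adapter:

```java
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class PollingListenerAdapter {
    public interface InstanceListener {          // hypothetical callback interface
        void onInstancesChanged(Set<String> instances);
    }

    private final Supplier<Set<String>> poller;  // the existing polling call
    private final InstanceListener listener;
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    private Set<String> lastSeen = Set.of();

    public PollingListenerAdapter(Supplier<Set<String>> poller, InstanceListener listener) {
        this.poller = poller;
        this.listener = listener;
    }

    public void start(long intervalSeconds) {
        scheduler.scheduleWithFixedDelay(() -> {
            Set<String> current = poller.get();
            // Fire the callback only when membership actually changed, so
            // consumers see the same semantics as with a real watcher.
            if (!current.equals(lastSeen)) {
                lastSeen = current;
                listener.onInstancesChanged(current);
            }
        }, 0, intervalSeconds, TimeUnit.SECONDS);
    }
}
```

With an adapter like this, consumers such as `RemoteClientManager` could depend on a single listener interface and stay unaware of whether the backing cluster module watches (Kubernetes) or polls (registries without a watch API).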
### What you expected to happen

The OAP should be more responsive to changes in the cluster instances' states, reducing unnecessary errors.

### How to reproduce

Start OAP in Kubernetes cluster mode with the replicas set to more than 1. After the OAP is ready, restart one of the OAP instances and observe the logs of the other living instances; there should be error logs as posted above.

### Anything else

_No response_

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
