kezhenxu94 opened a new issue, #9814: URL: https://github.com/apache/skywalking/issues/9814
### Search before asking

- [X] I had searched in the [issues](https://github.com/apache/skywalking/issues?q=is%3Aissue) and found no similar issues.

### Apache SkyWalking Component

OAP server (apache/skywalking)

### What happened

Currently the cluster coordinator uses a polling strategy to fetch the OAP instances, with a 5-second interval:

https://github.com/apache/skywalking/blob/34cfafe398e80ca1a7299e1243827937c1a691dd/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/remote/client/RemoteClientManager.java#L96

A common case is that when some of the OAP instances are restarted or shut down, the living ones keep trying to connect to the dead instances until the next polling round, causing error logs like this:

```
2022-10-18 07:33:28,297 - org.apache.skywalking.oap.server.core.remote.client.GRPCRemoteClient -31025 [grpc-default-executor-0] ERROR [] - UNAVAILABLE: io exception
io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
    at io.grpc.Status.asRuntimeException(Status.java:535) ~[grpc-api-1.49.0.jar:1.49.0]
    at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:487) [grpc-stub-1.49.0.jar:1.49.0]
    at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:563) [grpc-core-1.49.0.jar:1.49.0]
    at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70) [grpc-core-1.49.0.jar:1.49.0]
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:744) [grpc-core-1.49.0.jar:1.49.0]
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:723) [grpc-core-1.49.0.jar:1.49.0]
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) [grpc-core-1.49.0.jar:1.49.0]
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) [grpc-core-1.49.0.jar:1.49.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
    at java.lang.Thread.run(Unknown Source) [?:?]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /10.92.1.254:31800
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
    at io.netty.channel.unix.Errors.newConnectException0(Errors.java:155) ~[netty-transport-native-unix-common-4.1.81.Final-linux-x86_64.jar:4.1.81.Final]
    at io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:128) ~[netty-transport-native-unix-common-4.1.81.Final-linux-x86_64.jar:4.1.81.Final]
    at io.netty.channel.unix.Socket.finishConnect(Socket.java:359) ~[netty-transport-native-unix-common-4.1.81.Final-linux-x86_64.jar:4.1.81.Final]
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710) ~[netty-transport-classes-epoll-4.1.81.Final.jar:4.1.81.Final]
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:687) ~[netty-transport-classes-epoll-4.1.81.Final.jar:4.1.81.Final]
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567) ~[netty-transport-classes-epoll-4.1.81.Final.jar:4.1.81.Final]
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:489) ~[netty-transport-classes-epoll-4.1.81.Final.jar:4.1.81.Final]
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:397) ~[netty-transport-classes-epoll-4.1.81.Final.jar:4.1.81.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[netty-common-4.1.81.Final.jar:4.1.81.Final]
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.81.Final.jar:4.1.81.Final]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.81.Final.jar:4.1.81.Final]
    ... 1 more
```

Most service discovery services, like Kubernetes, provide a watcher/listener mode where clients can register for instance changes and get notified right after the instances' states change. For those that don't support a listener mode, it's also easy to wrap the polling mechanism and expose it as a listener. Both approaches are sketched below.
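For the Kubernetes case, here is a minimal sketch of the watcher idea using the official kubernetes-client/java informer API. The namespace, label selector, and handler bodies are illustrative assumptions, not SkyWalking's actual code, and the exact `listNamespacedPodCall` parameter list varies between client versions:

```java
import io.kubernetes.client.informer.ResourceEventHandler;
import io.kubernetes.client.informer.SharedIndexInformer;
import io.kubernetes.client.informer.SharedInformerFactory;
import io.kubernetes.client.openapi.ApiClient;
import io.kubernetes.client.openapi.apis.CoreV1Api;
import io.kubernetes.client.openapi.models.V1Pod;
import io.kubernetes.client.openapi.models.V1PodList;
import io.kubernetes.client.util.Config;

public class OapPodWatcher {
    public static void main(String[] args) throws Exception {
        ApiClient client = Config.defaultClient();
        CoreV1Api api = new CoreV1Api(client);
        SharedInformerFactory factory = new SharedInformerFactory(client);

        // The informer lists the pods once, then keeps a long-lived watch open,
        // so membership changes arrive as events instead of on a 5s poll.
        SharedIndexInformer<V1Pod> informer = factory.sharedIndexInformerFor(
            params -> api.listNamespacedPodCall(
                "skywalking",             // namespace (illustrative)
                null, null, null, null,
                "app=skywalking-oap",     // label selector (illustrative)
                null, params.resourceVersion, null,
                params.timeoutSeconds, params.watch, null),
            V1Pod.class, V1PodList.class);

        informer.addEventHandler(new ResourceEventHandler<V1Pod>() {
            @Override public void onAdd(V1Pod pod) {
                // e.g. RemoteClientManager could refresh its client list here
                System.out.println("OAP pod added: " + pod.getMetadata().getName());
            }
            @Override public void onUpdate(V1Pod oldPod, V1Pod newPod) {
                System.out.println("OAP pod updated: " + newPod.getMetadata().getName());
            }
            @Override public void onDelete(V1Pod pod, boolean unknownFinalState) {
                // A dead instance can be evicted immediately, instead of being
                // retried (and logging connection errors) until the next poll.
                System.out.println("OAP pod deleted: " + pod.getMetadata().getName());
            }
        });
        factory.startAllRegisteredInformers();
    }
}
```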
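And for registries that only support polling, a rough sketch of wrapping the poll behind the same listener-style API. The `InstanceListener` interface and class names here are hypothetical, just to show the shape of the adapter:

```java
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class PollingListenerAdapter {
    public interface InstanceListener {          // hypothetical callback interface
        void onInstancesChanged(Set<String> instances);
    }

    private final Supplier<Set<String>> poller;  // the existing polling call
    private final InstanceListener listener;
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    private Set<String> lastSeen = Set.of();

    public PollingListenerAdapter(Supplier<Set<String>> poller, InstanceListener listener) {
        this.poller = poller;
        this.listener = listener;
    }

    public void start(long intervalSeconds) {
        scheduler.scheduleWithFixedDelay(() -> {
            Set<String> current = poller.get();
            // Fire the callback only when membership actually changed, so
            // consumers see the same semantics as with a real watcher.
            if (!current.equals(lastSeen)) {
                lastSeen = current;
                listener.onInstancesChanged(current);
            }
        }, 0, intervalSeconds, TimeUnit.SECONDS);
    }
}
```

With an adapter like this, consumers such as `RemoteClientManager` could depend on a single listener interface and stay unaware of whether the backing cluster module watches (Kubernetes) or polls (registries without a watch API).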
### What you expected to happen

The OAP should be more responsive to changes in the cluster instances' states, reducing unnecessary errors.

### How to reproduce

Start OAP in Kubernetes cluster mode with the replicas set to more than 1. After the OAP is ready, restart one of the OAP instances and observe the logs of the other living instances; there should be error logs as posted above.

### Anything else

_No response_

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
