Hello, yeling, I checked the code of the old version. Since the current retry process will execute failed.remove(invocation); anyway, this will cause our retry to be done only once. I think this is a bug in the current failback.
I have given a few reviews about your pr. I need you to improve your ut now, I don't think it is rigorous at the moment. The other parts I think are correct. I will merge it after you perfect your ut. This is your first pr, expecting you more PR. :) 你好,yeling, 我查看了老版本的代码。 由于目前重试过程之后,无论如何都会执行failed.remove(invocation);,这会导致我们的重试无论如何只会进行一次。我认为这是目前failback的一个bug。 关于你的pr,我已经给出了几次review。目前需要你完善一下你的ut,我认为目前并不严谨。其他的部分我认为是正确的。在你完善你的ut之后我会merge它。 这是你的第一个pr,期待你更多的PR。:) yeling <xcwang...@gmail.com> 于2018年12月21日周五 下午10:22写道: > FailbackClusterInvoker should limit the size of failed invocations > issues:https://github.com/apache/incubator-dubbo/issues/2425 > > > In the FailbackClusterInvoker class, the failed member variable does not > control the size, which may cause memory leaks. When reading the source > code, I found some problems: > > 1. When a timed task is executed, just one invoke request fails and is > added directly to the queue. This will cause the newly failed task to > execute immediately, which does not meet the previous expectations. In > theory, the invoke request is executed once in five seconds. > > 2. The addFailed (Invocation invocation, AbstractCluster Invoker<?> > invoker) method passes in the Cluster itself. > > Look at some of the code for retry: > try { > invoker.invoke(invocation); > failed.remove(invocation); > } catch (Throwable e) { > logger.error("Failed retry to invoke method " + > invocation.getMethodName() + ", waiting again.", e); > } > Since invoker is Cluster here, the specific implementation here is Failback > Cluster Invoker. Look at the doInvoke part of the code: > > protected Result doInvoke(Invocation invocation, List<Invoker<T>> > invokers, LoadBalance loadbalance) throws RpcException { > try { > checkInvokers(invokers, invocation); > Invoker<T> invoker = select(loadbalance, invocation, invokers, > null); > return invoker.invoke(invocation); > } catch (Throwable e) { > logger.error("Failback to invoke method " + > invocation.getMethodName() + ", wait for retry in background. Ignored > exception: "+ e.getMessage() + ", ", e); > addFailed(invocation, this); > return new RpcResult(); // ignore > } > Obviously normally, there is a catch Throwable that is caught once and > never goes into retry code. I understand that we don't want such a design. > > I have made some optimization and improvement on this. > 1. Use the time wheel to solve the problem of task execution and ensure > that the failed task is not executed immediately, but after 5 seconds. > 2. Set the configurable number of failbackTasks tasks (default 100) and the > number of retries (default 3) > 3. The retry logic does not start with FailbackCluster Invoker, but follows > the following strategy > > Invoker < T > retryInvoker = select (load balance, invocation, invokers, > lastInvokers); > RetryInvoker. invoke (invocation); > > test case: > FailbackCluster InvokerTest. testRetryFailed > I have submitted PR > https://github.com/apache/incubator-dubbo/pull/2822 > 在FailbackClusterInvoker类中,成员变量failed没有控制容量,可能会引起内存泄露。在阅读源码时,发现有一些问题: > 1.当定时任务在执行时,正好一个invoke请求失败,会直接加入到队列中,这会导致刚失败的任务会立即执行,不符合之前的预期,理论上的5秒执行一次。 > 2.addFailed(Invocation invocation, AbstractClusterInvoker<?> > invoker)方法传入的是Cluster本身。 > 看下重试的部分代码: > try { > invoker.invoke(invocation); > failed.remove(invocation); > } catch (Throwable e) { > logger.error("Failed retry to invoke method " + > invocation.getMethodName() + ", waiting again.", e); > } > > 由于invoker这里是Cluster,这里具体实现是FailbackClusterInvoker,看下doInvoke部分代码: > protected Result doInvoke(Invocation invocation, List<Invoker<T>> > invokers, LoadBalance loadbalance) throws RpcException { > try { > checkInvokers(invokers, invocation); > Invoker<T> invoker = select(loadbalance, invocation, invokers, > null); > return invoker.invoke(invocation); > } catch (Throwable e) { > logger.error("Failback to invoke method " + > invocation.getMethodName() + ", wait for retry in background. Ignored > exception: "+ e.getMessage() + ", ", e); > addFailed(invocation, this); > return new RpcResult(); // ignore > } > 很明显正常情况下,是又一次被捕获异常,永远走不进重试代码中的catch Throwable,我的理解任务,这并不是想要这样的设计。 > > 我对此做了一些优化和改造: > 1.使用时间轮来解决任务执行的问题,确保失败的任务不是马上执行,而是5秒后; > 2.设置可配置的failbackTasks任务数(默认100)和复用retries重试次数(默认3) > 3.重试逻辑不从FailbackClusterInvoker开始执行,而是如下策略 > Invoker<T> retryInvoker = select(loadbalance, invocation, invokers, > lastInvokers); > retryInvoker.invoke(invocation); > > 测试用例 > FailbackClusterInvokerTest.testRetryFailed > > 我已经提交了PR,你们能阅读一下吗? > https://github.com/apache/incubator-dubbo/pull/2822 >