hjwsm1989 opened a new pull request, #3283:
URL: https://github.com/apache/brpc/pull/3283
### What problem does this PR solve?
PR #3268 ("Use monotonic time instead of wall time") migrated several hot
paths from `butil::gettimeofday_us()` to `butil::cpuwide_time_us()`,
including `LocalityAwareLoadBalancer::Weight::Update`, which now reads
`end_time_us` from the cpuwide clock. The migration was incomplete on the
caller side: `Channel::CallMethod` and the retry sites in
`Controller::OnVersionedRPCReturned` still pass `gettimeofday_us()` as the
begin time, which is then plumbed through `Controller::IssueRPC` →
`Controller::Call::begin_time_us` → `SelectIn::begin_time_us` →
`CallInfo::begin_time_us` into `Weight::Update`.
In `Weight::Update`:
```cpp
const int64_t end_time_us = butil::cpuwide_time_us(); // e.g.
5_000_000
const int64_t latency = end_time_us - ci.begin_time_us; // - wall_us
(~1.7e15)
// = ~
-1.7e15 us
if (latency <= 0) {
// time skews, ignore the sample.
return 0;
}
```
So every sample trips the time-skew short-circuit. `_time_q` never
accumulates entries, `_avg_latency` stays at 0, and locality-aware feedback
is silently disabled. The same wall/cpuwide mix also leaks into
`Controller::latency_us()` (`controller.h:215`) where `_begin_time_us` is
wallclock and the unset-end-time fallback uses cpuwide.
**Visible downstream effect**: cold-start `list://` channels with `lb=la`
and 2 backends occasionally fail RPCs with `EHOSTDOWN` (`Fail to select
server from list://...`) on retry, even when one backend is healthy.
### Bisect / reproduction
I ran a bisect against the 51 commits between brpc 1.16.1 and 1.17.0-rc2
in our internal fork (master code unchanged, only brpc varied), each step
running a 2-backend `list://` + `lb=la` + `max_retry=1` probe RPC 500
times:
| step | commit | result |
|------|---------------------------------------------|-------------|
| good | `c41e838a` Release 1.16.1 | 0/500 fail |
| ... | (4 GOOD steps) | ... |
| #47 | `771de31e` Fix UAF in LatencyRecorder | 0/500 fail |
| **#48** | **`12fb539a` Use monotonic time (#3268)** | **25/500 fail** |
| #49 | `b223e60b` Fix RDMA resource | 30/500 fail |
| bad | `604dad0c` 1.17.0-rc2 | 28/500 fail |
Single-line isolation on `12fb539a`: reverting only
```diff
int64_t LocalityAwareLoadBalancer::Weight::Update(const CallInfo& ci, ...) {
- const int64_t end_time_us = butil::cpuwide_time_us();
+ const int64_t end_time_us = butil::gettimeofday_us();
```
(while keeping all other changes from #3268) makes the test pass 500/500.
This nails the regression to that single time-source assignment, and the
fix is to migrate the corresponding callers to cpuwide.
### What is changed
This PR aligns the client-side callers with #3268's intent so all client
RPC time accounting uses `cpuwide_time_us` consistently:
| file | line | change |
|------|------|--------|
| `src/brpc/channel.cpp` | 451 | `CallMethod` entry begin time
|
| `src/brpc/channel.cpp` | 628 | sync RPC end time
|
| `src/brpc/controller.cpp` | 672 | backup-request retry `IssueRPC`
|
| `src/brpc/controller.cpp` | 705 | retry-backoff deadline check
|
| `src/brpc/controller.cpp` | 715 | regular retry `IssueRPC`
|
| `src/brpc/controller.cpp` | 803 | circuit-breaker feedback latency
|
| `src/brpc/controller.cpp` | 980 | async `OnRPCEnd`
|
| `src/brpc/controller.cpp` | 1019 | backup-thread `OnRPCEnd`
|
After this commit, `Controller::_begin_time_us` and `_end_time_us` are
both monotonic, and `Controller::latency_us()` is consistent regardless
of which branch it takes. There are no semantic changes to the public
`latency_us()` API beyond switching the underlying clock from wall to
monotonic, which is the same direction #3268 already went.
### Test
Adds
`test/brpc_load_balancer_unittest.cpp::la_records_latency_with_consistent_time_source`,
a hermetic LB-level test that exercises the invariant `Weight::Update`
relies on:
- **Path A** (regression repro): supply `gettimeofday_us()` as
`CallInfo::begin_time_us`. The test asserts `_avg_latency` stays at 0
because the time-skew short-circuit fires on every sample.
- **Path B** (post-fix): supply `cpuwide_time_us()` as
`CallInfo::begin_time_us`. The test asserts `_avg_latency > 0` because
the latency samples are accepted.
The test does not start a Server and does not depend on the network, so
it is deterministic in CI (~36ms locally).
`la_sanity` and `la_selection_too_long` continue to pass.
### Check List
- [x] No public API change.
- [x] New test added (`la_records_latency_with_consistent_time_source`).
- [x] Backward compatible: `Controller::latency_us()` is now monotonic, in
line with the direction #3268 already started.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]