hjwsm1989 opened a new pull request, #3283:
URL: https://github.com/apache/brpc/pull/3283

   ### What problem does this PR solve?
   
   PR #3268 ("Use monotonic time instead of wall time") migrated several hot
   paths from `butil::gettimeofday_us()` to `butil::cpuwide_time_us()`,
   including `LocalityAwareLoadBalancer::Weight::Update`, which now reads
   `end_time_us` from the cpuwide clock. The migration was incomplete on the
   caller side: `Channel::CallMethod` and the retry sites in
   `Controller::OnVersionedRPCReturned` still pass `gettimeofday_us()` as the
   begin time, which is then plumbed through `Controller::IssueRPC` →
   `Controller::Call::begin_time_us` → `SelectIn::begin_time_us` →
   `CallInfo::begin_time_us` into `Weight::Update`.
   
   In `Weight::Update`:
   
   ```cpp
   const int64_t end_time_us = butil::cpuwide_time_us();          // e.g. 
5_000_000
   const int64_t latency = end_time_us - ci.begin_time_us;        // - wall_us 
(~1.7e15)
                                                                  // = ~ 
-1.7e15 us
   if (latency <= 0) {
       // time skews, ignore the sample.
       return 0;
   }
   ```
   
   So every sample trips the time-skew short-circuit. `_time_q` never
   accumulates entries, `_avg_latency` stays at 0, and locality-aware feedback
   is silently disabled. The same wall/cpuwide mix also leaks into
   `Controller::latency_us()` (`controller.h:215`) where `_begin_time_us` is
   wallclock and the unset-end-time fallback uses cpuwide.
   
   **Visible downstream effect**: cold-start `list://` channels with `lb=la`
   and 2 backends occasionally fail RPCs with `EHOSTDOWN` (`Fail to select
   server from list://...`) on retry, even when one backend is healthy.
   
   ### Bisect / reproduction
   
   I ran a bisect against the 51 commits between brpc 1.16.1 and 1.17.0-rc2
   in our internal fork (master code unchanged, only brpc varied), each step
   running a 2-backend `list://` + `lb=la` + `max_retry=1` probe RPC 500
   times:
   
   | step | commit                                      | result      |
   |------|---------------------------------------------|-------------|
   | good | `c41e838a` Release 1.16.1                   |  0/500 fail |
   | ...  | (4 GOOD steps)                              | ...         |
   | #47  | `771de31e` Fix UAF in LatencyRecorder       |  0/500 fail |
   | **#48** | **`12fb539a` Use monotonic time (#3268)** | **25/500 fail** |
   | #49  | `b223e60b` Fix RDMA resource                | 30/500 fail |
   | bad  | `604dad0c` 1.17.0-rc2                       | 28/500 fail |
   
   Single-line isolation on `12fb539a`: reverting only
   
   ```diff
    int64_t LocalityAwareLoadBalancer::Weight::Update(const CallInfo& ci, ...) {
   -    const int64_t end_time_us = butil::cpuwide_time_us();
   +    const int64_t end_time_us = butil::gettimeofday_us();
   ```
   
   (while keeping all other changes from #3268) makes the test pass 500/500.
   This nails the regression to that single time-source assignment, and the
   fix is to migrate the corresponding callers to cpuwide.
   
   ### What is changed
   
   This PR aligns the client-side callers with #3268's intent so all client
   RPC time accounting uses `cpuwide_time_us` consistently:
   
   | file | line | change |
   |------|------|--------|
   | `src/brpc/channel.cpp`     | 451  | `CallMethod` entry begin time          
    |
   | `src/brpc/channel.cpp`     | 628  | sync RPC end time                      
    |
   | `src/brpc/controller.cpp`  | 672  | backup-request retry `IssueRPC`        
    |
   | `src/brpc/controller.cpp`  | 705  | retry-backoff deadline check           
    |
   | `src/brpc/controller.cpp`  | 715  | regular retry `IssueRPC`               
    |
   | `src/brpc/controller.cpp`  | 803  | circuit-breaker feedback latency       
    |
   | `src/brpc/controller.cpp`  | 980  | async `OnRPCEnd`                       
    |
   | `src/brpc/controller.cpp`  | 1019 | backup-thread `OnRPCEnd`               
    |
   
   After this commit, `Controller::_begin_time_us` and `_end_time_us` are
   both monotonic, and `Controller::latency_us()` is consistent regardless
   of which branch it takes. There are no semantic changes to the public
   `latency_us()` API beyond switching the underlying clock from wall to
   monotonic, which is the same direction #3268 already went.
   
   ### Test
   
   Adds 
`test/brpc_load_balancer_unittest.cpp::la_records_latency_with_consistent_time_source`,
   a hermetic LB-level test that exercises the invariant `Weight::Update`
   relies on:
   
   - **Path A** (regression repro): supply `gettimeofday_us()` as
     `CallInfo::begin_time_us`. The test asserts `_avg_latency` stays at 0
     because the time-skew short-circuit fires on every sample.
   - **Path B** (post-fix): supply `cpuwide_time_us()` as
     `CallInfo::begin_time_us`. The test asserts `_avg_latency > 0` because
     the latency samples are accepted.
   
   The test does not start a Server and does not depend on the network, so
   it is deterministic in CI (~36ms locally).
   
   `la_sanity` and `la_selection_too_long` continue to pass.
   
   ### Check List
   
   - [x] No public API change.
   - [x] New test added (`la_records_latency_with_consistent_time_source`).
   - [x] Backward compatible: `Controller::latency_us()` is now monotonic, in
         line with the direction #3268 already started.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to