Bug#1138503: tor: Hidden services disappear after a while

Petter Reinholdtsen Sat, 30 May 2026 22:31:19 -0700

Package: tor
Version: 0.4.9.8-0+deb13u1
Tags: patch

For a while now I have experienced that my tor hidden services,
typically ssh on my servers behind NAT, sometimes become unavailable.
The only fix I have found so far is to restart the tor daemon (or
sometimes ask for the machine to be rebooted if restarting tor is out of
reach).  I have also experienced the Debian APT services behind .onion
addresses becoming unavailable, and wondered if this is the same
problem.


With this background, I asked my local artificial idiocy setup (OpenCode
using local llama.cpp with model Qwen 3.6) to analyze the code and see
if it could find the cause and perhaps provide a fix, as well as create
a synthetic test that could trigger the problem and demonstrate that the
fix actually works.  As far as I can tell, it was able to come up with
an explanation, a fix and a test, and I will test it on my servers in
the near future to see if it improves reliability.  The problem is that
it is quite unpredictable when one of my servers becomes unavailable, so
it is hard to know if the fix worked.  Because of this, I decided to
share the findings here right now, in case someone else can help me test
it.

I plan to submit this patch upstream too, and have tried to request
access to the upstream gitlab, but got a 500 server error on my access
application and am not sure it made it through.  I will wait a few days
to see if I get any response on the request.

In any case, attached are the patches, one for the test case and another
for the fix, for your consideration.  Feel free to pass it upstream if
you believe it is the right fix.

I also asked the bullshit generator to explain its findings, which
resulted in the markdown text below.  Sharing it here as background
information.  Note the "original analysis" mentioned was done by Claw
Code using the same llama.cpp setup and model, and did not seem to
reach as good a solution as OpenCode.

# Tor Hidden Service Gradual Failure — Corrected Root Cause Analysis

## Problem Statement

Tor hidden services (v3 onion services) gradually stop working after running 
for some time. This affects both custom SSH services and standard ones. The 
failure is progressive rather than sudden: the service becomes increasingly 
unreachable until it stops entirely, requiring a restart of Tor to recover.

## Repository Context

- **Repo**: `tor-packaging` — Debian/Ubuntu packaging of Tor
- **Branch**: `debian-main`
- **Source**: Full Tor source tree under `src/` (Debian patch format)

---

## Corrected Root Cause: Intro Point `circuit_retries` Never Resets on Success

### The Bug

The per-intro-point field `ip->circuit_retries` (`hs_service.h:89`) is 
incremented every time a circuit is launched for an intro point, in 
`launch_intro_point_circuits()` at `hs_service.c:3005`:

```c
ip->circuit_retries++;
if (hs_circ_launch_intro_point(service, ip, ei, direct_conn) < 0) { ... }
```

This counter is **never reset to zero when the circuit succeeds**. It is only 
ever checked against `MAX_INTRO_POINT_CIRCUIT_RETRIES` (3, per `or.h`) in 
`should_remove_intro_point()` at `hs_service.c:2549-2550`:

```c
bool has_no_retries = (ip->circuit_retries > MAX_INTRO_POINT_CIRCUIT_RETRIES);
```

### Why This Causes Gradual Failure

The lifecycle of a single intro point under normal relay churn:

1. Intro point selected, circuit launched → `circuit_retries = 1`, circuit 
succeeds
2. ~30–60 minutes later, the circuit times out naturally (relay churn, 
hibernation, etc.)
3. Scheduled event triggers rebuild → `circuit_retries = 2`, circuit succeeds 
again
4. Another natural timeout → rebuild → `circuit_retries = 3`
5. Another timeout → rebuild → `circuit_retries = 4`, which exceeds 
`MAX_INTRO_POINT_CIRCUIT_RETRIES (3)` → **intro point removed** 
(`hs_service.c:2615`)

Each intro point can survive roughly 3–4 circuit lifecycles before being 
discarded. With a default of 3 intro points and circuits lasting ~30–60 
minutes, after several hours all three intro points will have been eliminated, 
leaving the hidden service with zero functional introduction points.

A new intro point is eventually selected to replace the removed one (via 
descriptor regeneration), but it too accumulates retries across its own 
lifecycle, eventually getting dropped again. The net effect is a gradually 
degrading service: intro points are lost faster than they can be stably 
maintained.
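
The lifecycle described above can be modelled in a few lines. This is a minimal sketch and not Tor code: `toy_intro_point_t` and `cycles_survived()` are hypothetical names, the limit of 3 is the value cited above, every circuit is assumed to open successfully, and the removal check is assumed to run right after each launch.

```c
#include <assert.h>
#include <stdbool.h>

/* Limit cited in the analysis above. */
#define MAX_INTRO_POINT_CIRCUIT_RETRIES 3

/* Toy stand-in for hs_service_intro_point_t (hypothetical, counter only). */
typedef struct {
  unsigned int circuit_retries;
} toy_intro_point_t;

/* Count how many launch->open cycles one intro point completes before the
 * retry check would remove it. Every circuit opens successfully; the only
 * variable is whether a successful open resets the counter.
 * Assumption: the removal check runs right after each launch. */
static unsigned int
cycles_survived(bool reset_on_open, unsigned int max_cycles)
{
  toy_intro_point_t ip = { 0 };

  assert(max_cycles > 0);
  for (unsigned int cycle = 0; cycle < max_cycles; cycle++) {
    ip.circuit_retries++;                 /* circuit (re)launched */
    if (ip.circuit_retries > MAX_INTRO_POINT_CIRCUIT_RETRIES)
      return cycle;                       /* removed: `cycle` cycles done */
    if (reset_on_open)
      ip.circuit_retries = 0;             /* successful open clears it */
  }
  return max_cycles;                      /* never removed within the cap */
}
```

Under these assumptions `cycles_survived(false, 100)` returns 3, matching the lifecycle list, while `cycles_survived(true, 100)` returns the cap of 100.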

### Why Restarting Tor Fixes It

Restarting Tor creates fresh `hs_service_intro_point_t` objects with 
`circuit_retries = 0`, resetting the counter for all intro points. This aligns 
exactly with the observed symptom that a restart recovers service.

---

## What Is NOT the Root Cause (Corrected from Original Analysis)

### The Retry Budget Window Is 5 Minutes, Not 3 Hours

The original analysis claimed `IntroPointPeriod` (~3 hours) gates the circuit 
launch budget. This is incorrect. The circuit launch rate-limiting uses 
`INTRO_CIRC_RETRY_PERIOD = 300 seconds` (5 minutes), defined at 
`hs_common.h:38`. The counter resets every 5 minutes (`hs_service.c:3065-3069`):

```c
if (now > (service->state.intro_circ_retry_started_time + 
INTRO_CIRC_RETRY_PERIOD)) {
    service->state.num_intro_circ_launched = 0;
}
```

The `IntroPointPeriod` (~3 hours) controls when intro points are scheduled for 
rotation — a separate mechanism.
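
To illustrate how such a 5-minute window behaves, here is a minimal sketch under stated assumptions. `toy_service_state_t` and `toy_try_launch()` are hypothetical names, the budget is passed in by the caller, and the sketch assumes the window start is moved to `now` when the counter resets, which the excerpt above does not show. The exact comparison operators in Tor may differ.

```c
#include <stdbool.h>
#include <time.h>

#define INTRO_CIRC_RETRY_PERIOD 300 /* seconds, as cited above */

/* Toy stand-in for the two service state fields in the excerpt above. */
typedef struct {
  time_t intro_circ_retry_started_time;
  unsigned int num_intro_circ_launched;
} toy_service_state_t;

/* Hypothetical helper: return true and count the launch if the budget of
 * the current window allows it, false otherwise. */
static bool
toy_try_launch(toy_service_state_t *state, time_t now, unsigned int budget)
{
  if (now > state->intro_circ_retry_started_time + INTRO_CIRC_RETRY_PERIOD) {
    /* Assumption: the window start moves to `now` together with the reset. */
    state->intro_circ_retry_started_time = now;
    state->num_intro_circ_launched = 0;
  }
  if (state->num_intro_circ_launched >= budget)
    return false;                 /* budget used up for this window */
  state->num_intro_circ_launched++;
  return true;
}
```

In this model a service that has used up its budget is blocked only until the window expires, which is why the launch budget alone would not explain a failure that persists for hours.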

### The Budget Is Generous (28 circuits per window with defaults)

With default config (`NumIntroPoints = 3`, two descriptors):

- Base: `3` wanted intro points + `2` extra = 5
- Retries: `3 × MAX_INTRO_POINT_CIRCUIT_RETRIES(3)` = 9
- Multiplier (two descriptors): × 2
- **Total: `(5 + 9) × 2` = 28 circuits per 5-minute window** (`hs_service.c:3018-3022`)

Exhausting this budget requires all 28 attempts to fail within a single 
5-minute window, which is unlikely under normal conditions. Budget exhaustion 
is not the primary cause of gradual failure.
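
The arithmetic above can be checked with a short sketch. The function and the `TOY_` macro are hypothetical names that restate the bullet list; they are not copied from `hs_service.c`, and the value of 2 extra intro points is taken from the `(3 + 2)` term above.

```c
#include <assert.h>

#define MAX_INTRO_POINT_CIRCUIT_RETRIES 3
/* Assumption: number of extra intro points, from the "(3 + 2)" term. */
#define TOY_NUM_INTRO_POINTS_EXTRA 2

/* Hypothetical restatement of the per-window circuit launch budget. */
static unsigned int
toy_max_intro_circs_per_window(unsigned int num_wanted_ip,
                               unsigned int num_descriptors)
{
  unsigned int count;

  assert(num_descriptors <= 2);   /* current and next descriptor only */
  /* Wanted intro points plus the extra ones launched in parallel. */
  count = num_wanted_ip + TOY_NUM_INTRO_POINTS_EXTRA;
  /* Allowed retries for each wanted intro point. */
  count += num_wanted_ip * MAX_INTRO_POINT_CIRCUIT_RETRIES;
  /* One full allowance per descriptor. */
  return count * num_descriptors;
}
```

With the defaults, `toy_max_intro_circs_per_window(3, 2)` evaluates to `(5 + 9) × 2 = 28`; with a single descriptor it would be 14.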

### Default `NumIntroPoints` Is 3, Not 2

The default is `NUM_INTRO_POINTS_DEFAULT = 3` (`hs_common.h:30`). The original 
analysis stated 2.

---

## Code Paths

| File | Line(s) | Role |
|------|---------|------|
| `src/feature/hs/hs_service.h` | 85–89 | `circuit_retries` field declaration and comment |
| `src/feature/hs/hs_service.c` | 3005 | `ip->circuit_retries++`: incremented on every launch |
| `src/feature/hs/hs_service.c` | 2549–2550 | Comparison against `MAX_INTRO_POINT_CIRCUIT_RETRIES` |
| `src/feature/hs/hs_service.c` | 2615–2617 | Intro point removal when retries exceeded |
| `src/feature/hs/hs_circuit.c` | 1087+ | Circuit open callback (where the fix belongs) |
| `src/feature/hs/hs_common.h` | 38, 41 | `INTRO_CIRC_RETRY_PERIOD`, `MAX_INTRO_CIRCS_PER_PERIOD` |

---

## Diagnosing the Bug in Production

### Log Message to Look For

When an intro point is removed due to retry exhaustion, 
`should_remove_intro_point()` logs at `LOG_INFO` level 
(`hs_service.c:2578-2583`):

```
Intro point <desc> (retried: N times). Removing it.
```

**Distinguishing the bug from other removal reasons:** The log suffix tells you 
why:

| Suffix | Cause | Bug or normal? |
|--------|-------|----------------|
| *(none)*, just `(retried: N times)` | Retry exhaustion | **This is the bug** |
| ` has expired` | Time or INTRODUCE2 count limit reached | Normal rotation |
| ` fell off the consensus` | Relay disappeared from network | Normal churn |

So look for lines matching this pattern where there is **no suffix** and
`N >= 4`:

```bash
grep 'Intro point.*(retried: [0-9]\+ times)\. Removing it\.' /var/log/tor/log \
  | grep -v -e ' has expired' -e ' fell off the consensus'
```

Each such line means an intro point was removed solely because its retry 
counter accumulated across successful circuit builds rather than actual 
failures. When all three intro points are removed this way, the service has 
zero functional introduction points and becomes unreachable to clients.
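
For anyone who prefers to post-process logs programmatically, the suffix table can be turned into a small classifier. This is only a sketch: the enum and function names are hypothetical, and it assumes the message layout shown above, with any suffix placed directly before ` (retried: `.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum {
  REMOVAL_NOT_MATCHED,    /* not an intro point removal line */
  REMOVAL_EXPIRED,        /* " has expired": normal rotation */
  REMOVAL_OFF_CONSENSUS,  /* " fell off the consensus": normal churn */
  REMOVAL_RETRY_EXHAUSTED /* no suffix: the case described as the bug */
} removal_reason_t;

/* True if the first `len` bytes of `s` end with `suffix`. */
static bool
ends_with(const char *s, size_t len, const char *suffix)
{
  size_t slen = strlen(suffix);
  return len >= slen && memcmp(s + len - slen, suffix, slen) == 0;
}

/* Classify one log line using the suffix table above. */
static removal_reason_t
classify_removal_line(const char *line)
{
  const char *start = strstr(line, "Intro point ");
  if (!start || !strstr(start, " times). Removing it."))
    return REMOVAL_NOT_MATCHED;

  /* Any suffix sits directly before " (retried: ". */
  const char *retried = strstr(start, " (retried: ");
  if (!retried)
    return REMOVAL_NOT_MATCHED;

  size_t prefix_len = (size_t)(retried - start);
  if (ends_with(start, prefix_len, " has expired"))
    return REMOVAL_EXPIRED;
  if (ends_with(start, prefix_len, " fell off the consensus"))
    return REMOVAL_OFF_CONSENSUS;
  return REMOVAL_RETRY_EXHAUSTED;
}
```

Feeding each log line to `classify_removal_line()` and counting the `REMOVAL_RETRY_EXHAUSTED` results gives the number of removals that fit the pattern described here.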

### Enabling Info-Level Logging

By default Tor only logs messages at `notice` level and above. To see the info-level removal messages, add to `torrc`:

```
Log info file /var/log/tor/hs-info.log
```

Alternatively, send info-level messages to syslog with `Log info syslog`.

---

## Synthetic Test

A test reproducing the bug was added to `src/test/test_hs_service.c` as 
`test_circuit_retries_reset_on_open`. It:

1. Creates a service with 3 intro points (default).
2. Mocks `node_get_by_id` so nodes are always found — isolating 
retry-exhaustion removal from "fell off consensus" removal.
3. Runs **6 rounds** of circuit turnover (`MAX_INTRO_POINT_CIRCUIT_RETRIES + 3 
= 6`): each round increments retries (simulating 
`launch_intro_point_circuits()` at `hs_service.c:3005`) then opens the circuit 
via `hs_service_circuit_has_opened()`.
4. Runs **`run_housekeeping_event()`**, the actual production housekeeping, 
which calls `cleanup_intro_points()` and, through it, `should_remove_intro_point()`.
5. Asserts all 3 intro points survive.

**Without the fix:** retries accumulate to 6 (> MAX=3), so housekeeping removes 
every intro point, leaving zero, and the service becomes unreachable. The test 
fails at the `tt_u64_op(remaining, OP_EQ, num_ips)` assertion at the end of the test.

**With the fix:** each successful open resets retries to 0, so housekeeping 
never sees retries exceeding the limit. All 3 intro points survive. Test passes.

---

## Fix

Reset `ip->circuit_retries = 0` when an intro circuit successfully opens, in 
`hs_circ_service_intro_has_opened()` at `src/feature/hs/hs_circuit.c`. A 
successful circuit open means the relay is reachable — the counter should only 
track consecutive failures.
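
The intended semantics, where only consecutive failures count, can be sketched as follows. This is a toy model rather than the patch itself: `removal_index_with_fix()` is a hypothetical helper, each launch is reduced to a single success or failure, and the removal check is assumed to run only after a launch that failed to open.

```c
#include <stddef.h>

#define MAX_INTRO_POINT_CIRCUIT_RETRIES 3

/* Hypothetical model of the patched counter. Each character of `outcomes`
 * is one circuit launch: 'S' opened successfully, anything else failed.
 * Returns the index of the launch after which the intro point would be
 * removed, or -1 if it survives the whole sequence. */
static int
removal_index_with_fix(const char *outcomes)
{
  unsigned int circuit_retries = 0;

  for (size_t i = 0; outcomes[i] != '\0'; i++) {
    circuit_retries++;                   /* every launch is counted */
    if (outcomes[i] == 'S')
      circuit_retries = 0;               /* the fix: reset on open */
    else if (circuit_retries > MAX_INTRO_POINT_CIRCUIT_RETRIES)
      return (int)i;                     /* too many failures in a row */
  }
  return -1;
}
```

In this model a run of successes such as `"SSSSSSSS"` never triggers removal, while four failures in a row still do, so the reset does not keep a genuinely unreachable relay around.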

---

## Debian-Specific Patches Checked

No Debian patches modify hidden service circuit logic. The issue exists 
upstream.

-- 
Happy hacking
Petter Reinholdtsen

Description: Add test to demonstrate hidden service becoming unavailable.
 This tries to demonstrate how repeated failures in connections to the
 intro points can cause a service to disappear completely.
 .
 This patch was created with help from OpenCode using local llama.cpp
 server with Qwen 3.6.
Author: Petter Reinholdtsen <[email protected]>
Forwarded: no
Last-Update: 2026-05-31

--- a/src/test/test_hs_service.c
+++ b/src/test/test_hs_service.c
@@ -2768,6 +2768,100 @@ test_cannot_upload_descriptors(void *arg)
   UNMOCK(get_or_state);
 }
 
+/** Test that a hidden service survives repeated circuit turnover cycles without
+ * losing all its intro points. Without the fix, each successful circuit build
+ * increments ip->circuit_retries but never resets it on success, so after enough
+ * natural timeout→rebuild cycles, retries exceed MAX_INTRO_POINT_CIRCUIT_RETRIES
+ * and housekeeping removes every intro point — making the service unreachable.
+ * With the fix, retries reset to 0 on each successful open, so the service
+ * maintains its full set of intro points indefinitely. */
+static void
+test_circuit_retries_reset_on_open(void *arg)
+{
+  int flags = CIRCLAUNCH_NEED_UPTIME | CIRCLAUNCH_IS_INTERNAL;
+  hs_service_t *service = NULL;
+
+  (void) arg;
+
+  hs_init();
+  MOCK(circuit_mark_for_close_, mock_circuit_mark_for_close);
+  MOCK(relay_send_command_from_edge_, mock_relay_send_command_from_edge);
+  /* Ensure nodes are always found so housekeeping doesn't remove IPs for
+   * "fell off consensus" — we only want to test retry-exhaustion removal. */
+  MOCK(node_get_by_id, mock_node_get_by_id);
+
+  service = helper_create_service();
+  tt_assert(service);
+
+  /* Create the full set of intro points (default: NUM_INTRO_POINTS_DEFAULT=3). */
+  int num_ips = service->config.num_intro_points;
+  for (int i = 0; i < num_ips; i++) {
+    hs_service_intro_point_t *ip = helper_create_service_ip();
+    tt_assert(ip);
+    service_intro_point_add(service->desc_current->intro_points.map, ip);
+  }
+
+  /* Verify we started with the expected number of intro points. */
+  tt_u64_op(digest256map_size(service->desc_current->intro_points.map),
+            OP_EQ, num_ips);
+
+  /* Simulate multiple rounds of natural circuit turnover: each round, every IP
+   * gets its retries incremented (as launch_intro_point_circuits() does at
+   * hs_service.c:3005) and then the circuit opens successfully via
+   * hs_service_circuit_has_opened(). After several rounds, we run housekeeping
+   * which calls cleanup_intro_points() → should_remove_intro_point(). */
+  int num_rounds = MAX_INTRO_POINT_CIRCUIT_RETRIES + 3;
+  for (int round = 0; round < num_rounds; round++) {
+    /* For each intro point, simulate: launch (retries++) → open. */
+    DIGEST256MAP_FOREACH(service->desc_current->intro_points.map, key,
+                         hs_service_intro_point_t *, ip) {
+      /* Simulate circuit launch incrementing retry counter. */
+      ip->circuit_retries++;
+
+      origin_circuit_t *circ = helper_create_origin_circuit(
+        CIRCUIT_PURPOSE_S_ESTABLISH_INTRO, flags);
+      tt_assert(circ);
+
+      ed25519_pubkey_copy(&circ->hs_ident->identity_pk,
+                          &service->keys.identity_pk);
+      ed25519_pubkey_copy(&circ->hs_ident->intro_auth_pk,
+                          &ip->auth_key_kp.pubkey);
+
+      /* Circuit opens successfully. With the fix, this resets circuit_retries
+       * to 0; without it, retries accumulate across rounds. */
+      setup_full_capture_of_logs(LOG_INFO);
+      hs_service_circuit_has_opened(circ);
+      teardown_capture_of_logs();
+
+      circuit_free_(TO_CIRCUIT(circ));
+    } DIGEST256MAP_FOREACH_END;
+  }
+
+  /* Now run housekeeping, which calls cleanup_intro_points() and removes any
+   * intro points exceeding MAX_INTRO_POINT_CIRCUIT_RETRIES. */
+  time_t now = approx_time();
+  setup_full_capture_of_logs(LOG_INFO);
+  run_housekeeping_event(now);
+  teardown_capture_of_logs();
+
+  /* With the fix: all intro points survive because retries reset to 0 each
+   * successful open, so should_remove_intro_point() never sees retries > max.
+   * Without the fix: every IP has accumulated num_rounds retries (> MAX) and
+   * gets removed, leaving zero — the service becomes unreachable. */
+  unsigned int remaining = digest256map_size(service->desc_current->intro_points.map);
+  tt_u64_op(remaining, OP_EQ, num_ips);
+
+ done:
+  if (service) {
+    remove_service(get_hs_service_map(), service);
+    hs_service_free(service);
+  }
+  hs_free_all();
+  UNMOCK(circuit_mark_for_close_);
+  UNMOCK(relay_send_command_from_edge_);
+  UNMOCK(node_get_by_id);
+}
+
 struct testcase_t hs_service_tests[] = {
   { "e2e_rend_circuit_setup", test_e2e_rend_circuit_setup, TT_FORK,
     NULL, NULL },
@@ -2811,9 +2905,11 @@ struct testcase_t hs_service_tests[] = {
     NULL, NULL },
   { "authorized_client_config_equal", test_authorized_client_config_equal,
     TT_FORK, NULL, NULL },
   { "export_client_circuit_id", test_export_client_circuit_id, TT_FORK,
     NULL, NULL },
   { "intro2_handling", test_intro2_handling, TT_FORK, NULL, NULL },
+  { "circuit_retries_reset_on_open", test_circuit_retries_reset_on_open,
+    TT_FORK, NULL, NULL },
 
   END_OF_TESTCASES
 };

Description: Reset retry counter on successful connects to avoid hidden service disappearing.
 This makes sure the intro points are not taken out of circulation
 after three tries over the lifetime of the daemon, and are instead
 only taken out after three consecutive failures.
 .
 This patch was created with help from OpenCode using local llama.cpp
 server with Qwen 3.6.
Author: Petter Reinholdtsen <[email protected]>
Forwarded: no
Last-Update: 2026-05-31

--- a/src/feature/hs/hs_circuit.c
+++ b/src/feature/hs/hs_circuit.c
@@ -1140,6 +1140,14 @@ hs_circ_service_intro_has_opened(hs_service_t *service,
            safe_str_client(service->onion_address));
   circuit_log_path(LOG_INFO, LD_REND, circ);
 
+  /* Reset the retry counter since this intro point proved reachable. The
+   * counter should only track consecutive failures; a successful circuit open
+   * means the relay is functional. Without this reset, natural circuit turnover
+   * (relay hibernation, timeouts) would accumulate retries across successive
+   * successful builds until MAX_INTRO_POINT_CIRCUIT_RETRIES is exceeded and
+   * the intro point gets removed. */
+  ip->circuit_retries = 0;
+
   /* Time to send an ESTABLISH_INTRO cell on this circuit. On error, this call
    * makes sure the circuit gets closed. */
   send_establish_intro(service, ip, circ);
--- a/src/feature/hs/hs_service.h
+++ b/src/feature/hs/hs_service.h
@@ -82,10 +82,11 @@ typedef struct hs_service_intro_point_t {
   /** The time at which this intro point should expire and stop being used. */
   time_t time_to_expire;
 
-  /** The amount of circuit creation we've made to this intro point. This is
-   * incremented every time we do a circuit relaunch on this intro point which
-   * is triggered when the circuit dies but the node is still in the
-   * consensus. After MAX_INTRO_POINT_CIRCUIT_RETRIES, we give up on it. */
+  /** The number of circuit creation attempts we've made to this intro point
+   * since its last successful circuit open. This is incremented every time we
+   * launch a circuit for this intro point and reset to zero when the circuit
+   * successfully opens. After MAX_INTRO_POINT_CIRCUIT_RETRIES, we give up on
+   * it. */
   uint32_t circuit_retries;
 
   /** Replay cache recording the encrypted part of an INTRODUCE2 cell that the
