Hi, all. There has been a lot of chatter re possible improvements to SKS on the list lately, and lots of ideas thrown around. So I thought I'd summarise the proposals here, and try to separate them out into digestible chunks.
I've ordered them from less to more controversial. My personal preference is for the first two sections (resiliency, type filters) to be implemented, and for the rest to be parked. This has turned out to be a much longer document than I expected. I don't intend to spend any further time or energy on local blacklisting, as its technical complexity increases every time I think about it, and its politics and effectiveness are questionable.

A. Concrete proposals
=====================

Version 1.X: Resiliency
-----------------------

These are ideas that fell out of the other discussions, but are applicable independently. If we want to make backwards-incompatible changes, then automatic verification of status, versions etc. will probably be necessary to prevent recon failure.

### JSON status

A standardised JSON status page could be served by all SKS-speaking services. This would ease fault detection and pool management, and is a prerequisite for reliable sanity callbacks.

### Default initial_stat=true

Also a prerequisite for sanity callbacks, and otherwise useful for debugging and fault detection.

### Sanity callbacks

Currently, a server has no way to determine whether its peers are correctly set up or have key deltas within the recon limit. If each host served a JSON status page, peers could perform a sanity check against it before allowing recon to continue. This would help contain the effects of some of the more common failure modes.

### Empty db protection

If the number of keys in the local database is less than a configured threshold, an SKS server should disable recon and throw a warning. The threshold could be set in the conf file, with a sensible default provided in the distro. This should prevent new servers from attempting recon until a reasonable dump has been loaded.

Version 2.0: Type filters with version ratchet
----------------------------------------------

This proposal seems to have the most support in principle.
It is relatively easy to implement, and directly addresses both illegal content and database bloat. It does, however, require precise choreography. It should be possible to alter the SKS code during a version bump so that:

1. All objects of an expanded but hardcoded set of types (private keys, localsigs, photo IDs, ...) are silently dropped if submitted
2. Any existing objects of these types in the database are treated as nonexistent for all operations (queries, recon, dumps, ...)
3. The above rules are only enabled on a future flag day, say 180 days after the release date of the new version
4. The version criterion for pool membership is bumped a few days in advance of the flag day
5. A crufty database could be cleaned by dumping and reloading the db locally, or a database cleaner could be run on a schedule from within SKS itself

This would purge the pool of the most obviously objectionable content (child porn, copyrighted material) with minimal collateral damage. The disadvantage is that any noncompliant peer would fail recon after flag day due to excessive delta, and would therefore need to be either depeered manually or have its recon attempts denied by a sanity callback. Other implementations (e.g. Hockeypuck) would have to move in lockstep or be depeered.

Future speculation
==================

Future A: Policy blacklisting
-----------------------------

Pay attention, kid. This is where it gets complicated.

Version ratchets may not be flexible or responsive enough to deal with specific legal issues. Policy-based blacklisting gives server operators a fine-grained tool to clean their databases of all sorts of content without having to move in lockstep with their peers. These proposals are more controversial, given that individual operators will have hands-on responsibility for managing policy, and may thereby be more exposed legally. It should be noted, however, that technical debt may not be a valid defence against legal liability. IANAL.
All of the changes in this section must be made simultaneously, otherwise various forms of recon failure are inevitable. This would involve a major rewrite of the code, which may not be considered a good use of time. If type filters have been implemented (see above), the need for local policy would be considerably reduced. If type filters were not used, however, then policy blacklists would be the main method for filtering objectionable content, which might be prohibitive. Note that locally-divergent blacklist policies have the potential to break eventual consistency across the graph (see below).

### Local blacklist

An SKS server may maintain a local blacklist of hashes that it does not want to store. At submission time, any object found in the blacklist is silently dropped. Any requests for objects in the blacklist should return `410 Gone`.

### Local dumps

When an SKS server makes a dump, it should dump all of its databases, including blacklist, peer_bl_cache and limbo (see below). This is useful a) for restoring state locally after a disaster, and b) for helping new servers bootstrap themselves to a low-delta state.

### Bootstrap limbo

When restoring from a dump, a server may simply restore the dumped blacklist and continue. But if the new server has a different policy from the source, this is not sufficient. Hashes that were added to the original blacklist for violating policies that the new server does not enforce should not be blacklisted on the new server. But they cannot be added to the local database either, because the actual data will not be found in the dump. Instead, these hashes are added to a `limbo` database that is progressively drained as and when the hashes are encountered again during submission or catchup. This is important to ensure that recon can start immediately with a complete set of hashes. Any requests for objects in limbo should return `404 Not Found`.
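The bootstrap split described above might be sketched as follows. All names are hypothetical, and the sketch assumes each dumped blacklist entry carries the name of the policy it violated, so that the restoring server can decide which entries to keep:

```python
def restore_blacklist(dumped_bl, local_policies):
    """Split a dumped blacklist into a local blacklist and a limbo set.

    dumped_bl      -- iterable of (hash, policy_name) pairs from the dump
    local_policies -- set of policy names enforced by this server
    """
    local_bl, limbo = set(), set()
    for h, policy in dumped_bl:
        if policy in local_policies:
            local_bl.add(h)   # we enforce this policy too: stays blacklisted
        else:
            limbo.add(h)      # policy not enforced here: park in limbo
    return local_bl, limbo

# Restoring server enforces only "no-photo-ids"; "dmca" entries go to limbo
dump = [("h1", "no-photo-ids"), ("h2", "dmca"), ("h3", "no-photo-ids")]
local_bl, limbo = restore_blacklist(dump, {"no-photo-ids"})
```

Note that this only works if policies are named in a canonical form across servers, which is one motivation for the canonical policy definition discussed under "Policy enforcement" below.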
If an object that matches a hash in limbo is successfully submitted or fetched, the hash is removed from limbo before the object is processed by policy.

### Peer blacklist cache

When fetching new objects from a peer during catchup, the peer may throw `410 Gone`. If this happens, we know that the peer has blacklisted the object, and we should not request it again from that peer for some time. We store the triple `(hash, peer, timestamp)` in the database `peer_bl_cache`. Similarly, if we receive `404 Not Found` during catchup, then this object is in the remote server's limbo. We add it to `peer_bl_cache` as if it were a `410`. Cache invalidation should reap it eventually.

### Fake recon

The recon algorithm is modified to operate against the set of unique hashes:

```
(SELECT hash FROM local_db)
UNION (SELECT hash FROM local_bl)
UNION (SELECT hash FROM limbo)
UNION (SELECT hash FROM peer_bl_cache WHERE peer="$PEER");
```

This ensures that deltas are kept to a minimum. Note that this may cause the remote server to request items that it does not have but that are in our blacklist or our limbo. This should only happen once, after which the offending hash will be stored in the peer's blacklist cache against our hostname. If the remote server requests an object that we have stored in our `peer_bl_cache` against its name, then our cache entry is obviously invalid; we should remove it from the cache and respond with our copy of the object, if we have one.

### Conditional catchup

Instead of requesting the N missing hashes from the delta, the server requests the following hashes:

```
(SELECT hash FROM missing_hashes)
UNION (SELECT hash FROM peer_bl_cache WHERE peer="$PEER" ORDER BY timestamp LIMIT a)
UNION (SELECT hash FROM limbo LIMIT b*N);
```

where `a` is small, perhaps even a weighted random integer from (0,1), and `b` is O(1).
These parameters will be adjusted to maintain a balance between (on one hand) timely cache invalidation and limbo draining, and (on the other) the impact of excessive requests upon the remote peer.

### Policy enforcement

Each server would be able to define its own policy. The simplest policy would be one that bans certain packet types (e.g. photo IDs). During both catchup and submission (but after limbo draining), each new object is compared with local policy. If it offends, its hash is added to the local blacklist with a reference to the offending policy, and the data is silently dropped. Policy should be defined in a canonical form, so that a) local policy can be reported on the status pages, and b) remote dumps can be compared with local policy to minimise the number of hashes that need to be placed in limbo during bootstrap.

### Local database cleaner

If policy changes, there will in general be objects left behind in the db that violate the new policy. A cleaner routine should periodically walk the database and remove any offending objects, adding their hashes to the local blacklist as if they had just been submitted. This could be implemented as an extension of the type-filter database cleaner above.

Open problem: Eventual Consistency
----------------------------------

Any introduction of blacklists opens the possibility of "policy firewalls", where servers with permissive policies may be effectively isolated from each other if all of the recon pathways between them pass through servers with more restrictive policies. Policy would therefore not only prevent the storage of violating objects locally, but also prevent their propagation across the network. The only way to break such a firewall is to create a new recon pathway that bypasses it. This could be done manually, but that places responsibility on operators to understand the policies of every other server on the graph.
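The firewall effect can be illustrated with a toy reachability check over the peering graph (server names and topology are invented for the example): an object can only propagate along paths of servers whose policies accept it.

```python
from collections import deque

def can_propagate(peerings, accepts, src, dst):
    """Return True if an object can flow from src to dst.

    peerings -- dict mapping server name to list of peer names
    accepts  -- set of servers whose policy allows the object
    """
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for peer in peerings.get(node, []):
            # the object can only transit peers that accept it
            if peer in accepts and peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return False

# Permissive servers A and D are joined only through restrictive B and C:
graph = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}
blocked = can_propagate(graph, {"A", "D"}, "A", "D")       # firewalled: False
relayed = can_propagate(graph, {"A", "B", "D"}, "A", "D")  # B relays: True
```

The second call shows the manual fix described above in graph terms: adding even one accepting server to a cut set (or peering A and D directly) restores propagation.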
### Recon for all

It might be possible to move from a recon whitelist to a recon blacklist model. Servers would spider the graph to find peers and automatically attempt to peer with them. This would ensure that eventual consistency is reached quickly, by maximising the core graph of servers that are mutually directly connected (and thus immune to firewalling). The main objection is that moving from a whitelist to a blacklist recon model opens up a significant attack surface. Sanity callbacks could be used to mitigate against human error, but not sabotage.

### Hard core

Alternatively, a group of servers that do not intend to introduce any policy restrictions could agree to remain mutually well-connected, and stay open to peering requests from all comers (subject to good behaviour). This would effectively operate as a clearing house for objects. The main objections are that a) these servers must all operate in jurisdictions where the universality of their databases is legally sound (e.g. no right to be forgotten), and b) some animals would be more equal than others.

Future B: Austerity
-------------------

In an extreme scenario, handling any user IDs at all may become impossible due to data protection regulations. On the same grounds, it may not even be possible to store third-party signatures, as these leak relationship data. In such a case, it may still be possible to run an austere keyserver network carrying self-signatures (i.e. expiry dates) and revocations only. This would require a further version ratchet, with a type filter permitting a minimum of packet types, shorn of all personally identifying information.
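Under austerity, the type filter reduces to a small whitelist of RFC 4880 packet tags. A minimal sketch, with an illustrative (not proposed) tag set:

```python
# RFC 4880 packet tags (section 4.3)
SIGNATURE, PUBLIC_KEY, PUBLIC_SUBKEY = 2, 6, 14

# Illustrative austerity whitelist: key material and signatures only, so
# user IDs (tag 13) and user attributes / photo IDs (tag 17) are dropped.
# (Separating self-signatures from third-party signatures would require
# inspecting each signature's issuer, which this sketch does not attempt.)
ALLOWED_TAGS = {SIGNATURE, PUBLIC_KEY, PUBLIC_SUBKEY}

def filter_packets(packets):
    """Drop every packet whose tag is not in the whitelist.

    packets -- iterable of (tag, body) pairs from a hypothetical
               upstream packet parser
    """
    return [(tag, body) for tag, body in packets if tag in ALLOWED_TAGS]

key = [(6, b"pubkey"), (13, b"Alice <a@example.org>"), (2, b"selfsig"),
       (17, b"photo"), (14, b"subkey"), (2, b"subkey-binding")]
stripped = filter_packets(key)  # user ID and photo packets are gone
```

The same mechanism, with a larger whitelist, would serve for the Version 2.0 type filter above; the austerity case is just the most aggressive setting of it.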
_______________________________________________ Sks-devel mailing list Sks-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/sks-devel