Overview of Components
======================

Ant Routing
-----------

https://lists.linuxfoundation.org/pipermail/lightning-dev/2020-February/002505.html

Ant Routing is a distributed pathfinding algorithm where nodes emit 
"pheromones", which are broadcasted over the entire network.
When a node receives a pheromone (a UUID plus a distance counter) from a peer 
it has a channel with, it records that pheromone locally, then broadcasts it to 
other peers it has channels with, but with the distance counter increased by 
one.

Subsequently, a payee can provide the pheromone identifier to a payer, which 
can check whether it has received that pheromone, and from which channel it 
received it.
It can then forward the payment via that channel.
The next node can itself perform this operation, looking up the pheromone 
identifier and determining which channel it came from, and so on, until the 
payment reaches the destination.
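
A minimal sketch of that pheromone relay in Python; the de-duplication rule 
(keep only the lowest distance seen per pheromone, as in the pheromone-handling 
rules later in this post) is what stops the flood from looping forever.  Class 
and field names are illustrative, not from any existing implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pheromone:
    uuid: bytes     # identifier chosen by the prospective payee
    distance: int   # hops travelled so far

class AntNode:
    def __init__(self, peers):
        self.peers = peers   # peer objects exposing a .send(pheromone) method
        self.seen = {}       # uuid -> (best distance seen, peer it arrived from)

    def on_pheromone(self, ph, from_peer):
        best = self.seen.get(ph.uuid)
        if best is not None and best[0] <= ph.distance:
            return  # already know an equal-or-shorter path; do not re-broadcast
        self.seen[ph.uuid] = (ph.distance, from_peer)
        bumped = Pheromone(ph.uuid, ph.distance + 1)
        for peer in self.peers:
            if peer is not from_peer:
                peer.send(bumped)
```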

`getroutequick`
---------------

https://lists.linuxfoundation.org/pipermail/lightning-dev/2019-August/002095.html
https://zmnscpxj.github.io/activity/2019-10-18/presentation.odp

The overall `getroutequick` algorithm simply uses a Greedy Best First Search, 
which requires a heuristic that returns a distance to the target.
The heuristic is generated from cached data: the cache itself is generated by 
first performing a Dijkstra on the entire routemap, recording the distance from 
our own node at each mapped node.
The `getroutequick` algorithm then starts at one end of the route, and the 
other end must be our own node; this typically makes sense, since the two ends 
of the route are the payer and the payee, both of whom have an interest in the 
payment succeeding by having a route available.

The power of `getroutequick` lies in pre-caching the Dijkstra run; Dijkstra is 
heavy as it requires traversing the entire routemap, but its result can be 
cached and then used in Greedy Best First Search, which is likely to traverse 
only the shortest route.
This moves the heavy lifting away from the point in time at which a payment is 
initiated, improving the payment experience by reducing the amount of time 
needed to find a route to a particular payee.
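
As a rough illustration (not the actual implementation), here is what the two 
halves look like in Python: one expensive Dijkstra pass from our own node whose 
result we cache, and a cheap Greedy Best First Search that starts from the 
payee and is steered by the cached distances.  Nodes are assumed to be plain 
comparable IDs such as strings; `graph` and the function names are made up for 
the sketch.

```python
import heapq

def dijkstra_distances(graph, own_node):
    """The heavy, cacheable part: distance from own_node to every reachable
    node.  `graph` maps node -> list of (neighbour, cost)."""
    dist = {own_node: 0}
    frontier = [(0, own_node)]
    while frontier:
        d, node = heapq.heappop(frontier)
        if d > dist.get(node, float("inf")):
            continue
        for neighbour, cost in graph.get(node, ()):
            nd = d + cost
            if nd < dist.get(neighbour, float("inf")):
                dist[neighbour] = nd
                heapq.heappush(frontier, (nd, neighbour))
    return dist

def getroutequick(graph, cached_dist, payee, own_node):
    """Greedy Best First Search from the payee end toward our own node, using
    the cached Dijkstra distances as the heuristic."""
    frontier = [(cached_dist.get(payee, float("inf")), payee)]
    came_from = {payee: None}
    while frontier:
        _, node = heapq.heappop(frontier)
        if node == own_node:
            route = []
            while node is not None:      # walk the parent chain back to payee
                route.append(node)
                node = came_from[node]
            return route                 # already in own_node -> payee order
        for neighbour, _cost in graph.get(node, ()):
            if neighbour not in came_from:
                came_from[neighbour] = node
                heapq.heappush(
                    frontier,
                    (cached_dist.get(neighbour, float("inf")), neighbour))
    return None
```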

Trampoline
----------

https://github.com/lightningnetwork/lightning-rfc/pull/654

Trampoline simply means doing onion routing at a level distinct from the 
individual channel level.
An onion route of "trampoline" nodes is formed, without specifying the exact 
channels in the route between them.
Currently, the idea is that there is an "outer" onion that describes a path to 
the next trampoline node, and an "inner" onion that describes the next-next 
trampoline node.
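
A tiny sketch of how the two onion levels could nest, purely to fix ideas; the 
field names here are assumptions and not the wire format proposed in the 
lightning-rfc pull request above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrampolinePayload:
    next_trampoline: bytes   # node ID of the following trampoline (the sender's "next-next")
    inner_onion: bytes       # remaining trampoline-level onion, still encrypted

@dataclass
class OuterHopPayload:
    short_channel_id: int                    # next channel-level hop
    trampoline: Optional[TrampolinePayload]  # present only at a trampoline node
```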

Digression: On the Current Gossip System
========================================

Currently, we use a gossip system where nodes publicly show that particular 
outputs on the blockchain layer are actually funding transaction outputs that 
back their channels.

Whenever a new node joins the Lightning network, it anchors its existence to 
the blockchain by having a channel opened with it (either it funds the channel, 
or the channel is funded towards it).
Then, on the basis of this anchored channel, it announces its existence.
This implies that bandwidth usage for this gossip system is bounded by the 
bandwidth limits we have imposed in the blockchain layer.
Thus, even though this gossip system requires the data to be broadcast globally 
to all nodes, the utilized resource is still bounded by the space limits of the 
Bitcoin blockchain layer.

Further, the gossip system allows announcements to be refreshed periodically.
However, ultimately the amount of bandwidth used for routing information 
broadcast via gossip is bounded by the amount of bandwidth we have allowed the 
blockchain layer to use (i.e. the block size and block rate parameters of the 
blockchain).

Combined Proposal
=================

Combining Ant Routing and `getroutequick`
-----------------------------------------

Ant routing, we can observe, is equivalent to a distributed form of the 
Dijkstra algorithm.
Its major drawback is that every payment causes the emission of a pheromone, 
which must be broadcasted to the entire network.
This implies using resources proportional to the total number of payments 
globally, unlike the current Lightning network where each node only gets data 
regarding payments it is involved in routing.
This reverses the advantage of the Lightning network, making it as inefficient 
(in terms of big-O, not necessarily in the lower-level details) as a blockchain.

`getroutequick` works by pre-caching the result of a Dijkstra run, with the 
result then used as the heuristic in a subsequent, faster Greedy Best First 
Search.

This leads to the realization that we could have a possible payee release a 
pheromone *once* (equivalent to the pre-cached Dijkstra run, now made 
distributed) and then subsequent payments to that payee just use the pre-cached 
Dijkstra by referring to that pheromone.
(This reduces privacy as the pheromone persistently identifies the node, but I 
fix this later; please bear with me.)

### Pheromone Handling

So what we do is: every node that wants to be a forwarding node keeps a 
mapping from non-self node IDs to distance counters.
A pheromone is then composed of:

* Node ID (33 bytes).
* A reference to a channel that the node is a participant of (8 bytes, a short 
channel ID).
* A proof that the Node ID is indeed a participant in the indicated channel (65 
bytes for the node signature, 33 bytes for the node ID of the counterparty of 
the node, maybe some more bytes for whatever details on how the channel script 
is derived or whatever).
* A distance counter (1 byte can probably fit; do we expect Lightning diameter 
to exceed 255? maybe in some really insane conditions? how attackable is this?).

When a node receives a pheromone from a peer it has a channel with, it checks 
if the node already exists in its mapping.
If it exists, and the distance counter in the pheromone is larger than or equal 
to the distance counter in the mapping, it ignores the pheromone and does not 
propagate it.
However, if the distance counter in the pheromone is less than that in the 
mapping, or the mapping does not yet contain an entry for that node, it updates 
its mapping, then sends the pheromone to the peers it has channels with, but 
with the distance counter one greater than what it now has recorded.
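
A hedged Python sketch of this update rule; the proof check is stubbed out, 
since the exact proof format is left open above, and all names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Pheromone:
    node_id: bytes           # 33-byte node ID being advertised
    short_channel_id: int    # channel the node claims to be a participant of
    channel_proof: bytes     # node signature, counterparty ID, etc., per the list above
    distance: int            # hop counter, incremented on every re-broadcast

def verify_channel_proof(ph):
    # Stub: really this would check the node signature and that the referenced
    # channel output exists onchain with the claimed participants.
    return True

class ForwardingNode:
    def __init__(self, peers):
        self.peers = peers          # peers we have channels with
        self.distance_to = {}       # non-self node ID -> distance counter

    def on_pheromone(self, ph, from_peer):
        if not verify_channel_proof(ph):
            return
        known = self.distance_to.get(ph.node_id)
        if known is not None and known <= ph.distance:
            return                  # not an improvement: drop and do not relay
        self.distance_to[ph.node_id] = ph.distance
        relayed = Pheromone(ph.node_id, ph.short_channel_id,
                            ph.channel_proof, ph.distance + 1)
        for peer in self.peers:
            if peer is not from_peer:   # sending back to the sender is pointless
                peer.send(relayed)
```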

### Distance Queries and Payment Routing

A node can ask a direct peer for that peer's distance to a specific node ID.
(The peer could lie, but I cover this case later; please bear with me.)

When a payer wants to route, the payer knows its (supposed) distance to the 
payee.
It then creates an HTLC whose timelock budget is limited to just that distance 
to the payee.
The payer then queries each online peer for its mapping of the payee node ID 
and distance counter.
It selects any online peer whose distance counter is lower than its own and 
with which it has sufficient payer-side capacity in the channel, then sends the 
HTLC out via the channel with that peer.
If there is no such peer, it could select the peer with the lowest distance 
counter (and update its own distance counter to be one higher than that peer's 
as well).

The same operation is done by each forwarding node.
It queries each peer to which it has sufficient capacity to forward, to get 
that peer's distance counter for the payee.
Then it forwards to any such peer whose distance counter is lower than its own.
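
A minimal sketch of that selection rule; `peer_reports` stands in for the 
results of querying each online peer, and everything here is illustrative 
rather than an existing API.

```python
def pick_next_hop(own_distance, peer_reports, amount):
    """peer_reports: list of (peer, reported_distance, capacity_towards_peer),
    one entry per online peer that answered the distance query.
    Returns (chosen_peer_or_None, possibly-updated own distance counter)."""
    usable = [(d, peer) for peer, d, cap in peer_reports if cap >= amount]
    if not usable:
        return None, own_distance
    nearer = [entry for entry in usable
              if own_distance is None or entry[0] < own_distance]
    if nearer:
        d, peer = min(nearer, key=lambda entry: entry[0])  # any nearer peer would do
        return peer, own_distance
    # Nobody is nearer than we thought: take the least-bad peer and bump our
    # own counter to one more than that peer's.
    d, peer = min(usable, key=lambda entry: entry[0])
    return peer, d + 1

# e.g. our counter is 3; peers report 2, 4, 2 but the last lacks capacity:
assert pick_next_hop(3, [("A", 2, 10_000), ("B", 4, 10_000), ("C", 2, 500)],
                     1_000) == ("A", 3)
```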

### Forwarding Failures

There are only two failures:

* The destination has absolutely no idea what payment you are talking about.
  * This case should include a proof that it comes from the destination, e.g. a 
signature.
* A peer is unable to forward because the HTLC it receives has insufficient 
timelock remaining for it to forward to the destination, and it reports back 
how many more hops need to be added (this has to be 1 or higher).

Failures are not encrypted, i.e. they are in plaintext.

Suppose a forwarding node currently thinks it is 2 hops away from the payee.
However, the channel to the node that is 1 hop away is closed, or that node 
goes offline, etc.

Since the forwarding node also propagated the pheromone outwards such that it 
is 2 hops away from the payee, the payer has allocated it just enough timelock 
in the HTLC that reaches it to be worth 2 hops of timelock.
The forwarding node then checks its *other* peers, in the hope that one of them 
is 1 hop away from the payee.
If it finds a peer that is one hop away, all is well with the world and it just 
forwards to that peer.

On the other hand, suppose it finds that the peer it has that is nearest to the 
payee is 4 hops away from the payee.
It updates its own distance counter to 5 (one higher than the peer that is 
nearest), then (because the HTLC has too little timelock left for that number 
of hops) it propagates a failure back to the incoming HTLC, with the indication 
+3 (i.e. it had to add +3 to its previous distance counter of 2 to get its 
current distance counter of 5).
When the previous forwarding node (which presumably had a distance counter of 
3) receives this failure, it can instead try to search other peers that have a 
distance counter of 2 or less and silently retry with those, or if there are 
none it could look for those that are still nearer than the adjusted counter of 
the failing peer (it was 2, but with the +3 is now at 5, so any peer it has 
that is 4 or less could be nearer and it can lower the adjustment of +3 to 
something lower), or if there are still none, propagate the +3 hops failure 
back to the previous node, and so on.

Similarly, when the ultimate payer receives this +N hops failure, it can 
readjust its own sense of how distant it is from that payee, then retry with 
the same peer, or a different peer that is now nearer than the updated distance 
of the peer that propagated the failure.
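
The counter arithmetic in that worked example, as a tiny illustrative helper:

```python
def recompute_counter(old_counter, peer_counters):
    """peer_counters: distance counters reported by our other usable peers for
    this payee.  Returns (new_counter, hops_to_add); a positive hops_to_add is
    the "+N" reported upstream, zero or negative means we can still forward."""
    if not peer_counters:
        return None
    new_counter = min(peer_counters) + 1
    return new_counter, new_counter - old_counter

# The example above: we thought we were 2 hops away, the 1-hop peer vanished,
# and our nearest remaining peer is 4 hops away -> update to 5, report "+3".
assert recompute_counter(2, [4, 6]) == (5, 3)
```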

This case partially handles liars who underreport their distance to the payee.
They cannot forward the payment and expect it to push through, since some node 
beyond them will see that there is too little time left in the HTLC to 
propagate, and will report this as a +N hops failure.
The lying node can then only admit its lie and report back a +N hops failure, 
or keep mum and keep the HTLC until it is about to time out (the latter 
behavior is something current forwarding nodes can already do).
(But note: a lying node could effectively censor a particular node, by 
misreporting its distance to that node as 1 and then keeping the HTLC around 
until near the locktime limit, then admit just a +1 hops failure (if it did 
not, its peer will drop the channel with the lying node onchain in order to 
enforce the locktime limit and will then search among *its* peers for how 
distant it truly is), slowly reducing its censorship but still able to do so 
among a number of nodes on the network; this will settle over time, but would 
still be a nasty attack.)
(this could be mitigated if we had a clever mathematical construct such that if 
I have proof that somebody committed to N, I can forge a proof that the same 
somebody committed to N+1 but cannot forge a proof that somebody committed to 
N-1; maybe sinking signatures, with "height" in sinking signatures being some 
constant minus the distance? but then if you get N you could propagate N; and 
you could also be running multiple nodes that you propagate N to without 
actually having channels with them.)

### Periodic Updates

Periodically, a node may want to update the Dijkstra run for itself, by 
re-emitting a new pheromone.
This refreshes the rest of the network's sense of what their distance is to 
each node on the network.

We already allow the current gossip protocol to publish similar periodic 
updates to the channel and node announcements, thus we expect the big-O 
bandwidth use to be similar with periodic pheromone re-emission.

Plugging Privacy With Trampoline Routing
----------------------------------------

The core idea of trampoline routing is simply to add onion routing at a level 
above individual channel hops.

In current Lightning, we use onion routing to encode individual channel hops.
Individual channel hops, with our new Ant Routing + `getroutequick` scheme, are 
handled by the distributed `getroutequick`.
This has the major drawback that `getroutequick` needs to know the exact 
destination.

For this new routing paradigm, we simply add onion routing at a layer above 
the channel-level routing, in much the same way that Tor adds onion routing on 
top of router-level TCP routing.
That is, we move the onion packet from the hop level to the trampoline node 
level.

An onion packet is sent to the destination that we admit to the `getroutequick` 
channel-level routing layer.
This onion packet indicates openly how much of the HTLC timelock is allocated 
to reach that destination, and how much is allocated to the destination itself.
On reaching the destination, that node unwraps the onion by one layer, and it 
might find that it is the final destination, or that it has to use the 
allocated timelock as well to forward to the next destination.

### Payer Allocation

Each node already has a mapping of nodes to distance counters.
Thus, to make a payment to a node, it need only find some number of nodes in 
this mapping.

Ideally, it finds two other nodes that are nearer to it than the payee node 
(i.e. have distance counters lower than the distance counter of the actual 
payee), in the hope that they will be in "the same" direction or area as the 
final destination.

Suppose that it currently has a destination, and selects an intermediate node 
in the trampoline onion.
It got both nodes from its local map of node ID to distance, thus it knows the 
distance from itself to both nodes (assuming this data is not too stale).
By considering itself and the two nodes as a triangle, we can conclude (by the 
triangle inequality) that the maximum distance between those two nodes is the 
sum of the distances between itself and each of the nodes.
So for example if the trampoline node is 5 hops away and the payee is 3 hops 
away, then the maximum distance between both is 8 hops.
So, the payer will allocate 8 hops for the route from the intermediate node to 
the current destination.

The payer can repeat that however many times it wants, in order to add more 
layers to the trampoline onion.
For example, it could add a second trampoline, with a distance of 6 hops away; 
it then takes the 5-hop distance of the previous trampoline node and adds them 
together, allocating 11 hops for the route from the second to the first 
trampoline, and allocating the above 8 hops for the route from the first 
trampoline to the payee.
Then, it sends out an HTLC of a total of 6 (distance to second trampoline) + 11 
(route from second to first trampoline) + 8 (route from first trampoline to 
destination) hops of timelock and fees.
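
The allocation arithmetic in this example is just repeated application of the 
triangle bound; a small illustrative helper:

```python
def trampoline_hop_budgets(distances):
    """distances: the payer's cached hop counts to each node the payment will
    visit, in visiting order (trampolines first, payee last).
    Returns the hop budget for each leg of the trampoline route."""
    budgets = [distances[0]]                 # payer -> first node visited
    for a, b in zip(distances, distances[1:]):
        budgets.append(a + b)                # triangle-inequality bound per leg
    return budgets

# The example above: second trampoline 6 hops away, first trampoline 5, payee 3:
assert trampoline_hop_budgets([6, 5, 3]) == [6, 11, 8]   # total 25 hops budgeted
```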

### Trampoline-level Failures

A trampoline node might, for some reason, not be able to find the next 
trampoline.
It might also find that the payer information is so stale that the budgeted 
number of hops for that trampoline hop is insufficient.
In those cases, the trampoline node can report back a failure.

These would have a distinct failure code compared to the +N hops needed failure 
at the channel hop level.
However, it may be possible to add the destination's "I have no idea what 
payment you are talking about" failure to these failures.
Like the onion-wrapped failures in the current Lightning onion routing, it may 
be possible to also onion-wrap the failures at the trampoline node level.
Then failures seen at the channel hop level are either "some onion-wrapped 
failure" or "add +N hops plz".
Of note is that a trampoline node that finds it has been given insufficient 
budget should not use the "add +N hops plz" message, but an onion-wrapped 
failure (which itself might be just "add +N hops plz, but to my budget").
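
An illustrative (not proposed-wire-format) way to picture the two failure 
vocabularies described above:

```python
from enum import Enum, auto

class HopLevelFailure(Enum):
    """Plaintext failures seen at the channel-hop level."""
    ADD_N_HOPS = auto()        # "add +N hops plz", N >= 1
    ONION_WRAPPED = auto()     # opaque failure destined for the payer

class TrampolineLevelFailure(Enum):
    """Onion-wrapped failures at the trampoline-node level."""
    NEXT_TRAMPOLINE_UNREACHABLE = auto()
    BUDGET_TOO_SMALL = auto()  # "add +N hops plz, but to my budget"
    UNKNOWN_PAYMENT = auto()   # the destination's "no idea what payment" failure
```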

### Payment Point Rotation

With payment point+scalar, we can have "path decorrelation", where each hop in 
the onion route changes the outgoing PTLC by a constant scalar that is secretly 
shared between itself and the original payer.

This will still work by moving the onion routing one level higher, from the 
level of channel hops to the level of trampoline nodes.
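
For concreteness, a minimal sketch of the per-hop relation (sign conventions 
vary between writeups), where r is the scalar the payer secretly shares with 
the hop, T_in and T_out are the incoming and outgoing PTLC points, and G is the 
curve generator:

```latex
% Per-hop PTLC decorrelation relation (one possible sign convention):
T_{\mathrm{out}} = T_{\mathrm{in}} - rG,
\qquad
t_{\mathrm{in}} = t_{\mathrm{out}} + r,
\quad\text{since}\quad
t_{\mathrm{in}} G = t_{\mathrm{out}} G + rG = T_{\mathrm{out}} + rG = T_{\mathrm{in}}.
```

Once the hop learns t_out from the downstream settlement it can claim its 
incoming PTLC with t_in, while an outside observer cannot link T_in and T_out 
without knowing r; applying r per trampoline hop instead of per channel hop 
does not change this relation.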

### Hidden Destinations

Another advantage of trampoline onions is that the payee can provide a 
pre-encrypted trampoline onion which the payer can prepend its own trampoline 
nodes to, i.e. hidden destinations.
This feature remains unchanged when done on top of the hop-level Ant Routing + 
`getroutequick` scheme.


Advantages and Disadvantages
----------------------------

Advantages:

* Far fewer channels have to be published offchain; a node that wants to 
receive in the future needs only publish *one* of its channels in order to act 
as proof that it is on Lightning.
  A node that expects many incoming payments can have multiple channels opened 
towards it, but only admit ownership of one channel.
  Two cooperating nodes that have a channel between them can use that channel 
to prove their existence in their individual pheromones, thereby revealing to 
the rest of the network only a small number of channels.
  This makes it much harder to map out the full network, with the hiding much 
more consistently distributed among nodes, for better risk-sharing.
  * Nodes that do not expect to receive in the future, except from direct 
peers, do not need to publish *any* channels.
    Even if they do not publish *any* channels, however, they can still 
participate in forwarding, since forwarding nodes perform local queries to 
determine the next hop, and participation in forwarding is an important privacy 
technique; a node that regularly forwards can always claim that its payment is 
actually forwarded from somebody else, a property lost with the current 
"private"^H^H^H^H^H^H^H^H^Hunpublished channels.

Disadvantages:

* Censorship attacks are trivial: just lie about your distance to the node you 
want to censor and say you know them, then just HODL on to the HTLC that should 
go to them.
  This is a major disadvantage, thus possibly a good reason to avoid this 
paradigm.
  My hope is someone else can come up with some way to mitigate or remove this 
weakness.
* We would need to impose a consistent feerate and CLTV-delta at each node.
  * On the other hand, one can argue that it is precisely the ability of 
forwarding nodes to give arbitrary feerates that allows surveillors to 
effectively "pay to surveill" by giving below-market-rate feerates, so 
***maybe*** central planning of this parameter (yuck!) would be useful... nah 
non-free-market, boo.

Regards,
ZmnSCPxj