Hi Elias,

Thanks for re-emphasizing the importance of being privacy-conscious
as we look into this work - we completely agree!

> clarify upfront whether you intend to time-box the collection period,
> where the data would be stored, and who would have access to it

Ideally, we would limit the collection period to six months. One of
our main aims in defining a common data format is to ensure that we
can provide node operators with tooling that they can run locally, so
that they do not need to export the raw data _at all_, only heavily
aggregated results.

In the case where folks are comfortable sharing their data with us,
we will follow best practices for handling this sensitive information and
will not share the data onwards at all. Fields will also be anonymized
as described in the original email. Re your concerns around timestamps,
we can also fuzz timestamps, as only the resolution period matters to
our work (thanks for flagging!).
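
As a rough illustration (a sketch only, not the actual tooling), the
anonymization of [P] fields and the timestamp fuzzing could look
something like the Python below. The function names, the keyed-hash
approach and the `max_shift_ns` parameter are assumptions made for
this example. The key point is that both timestamps of an HTLC are
shifted by the same random offset, so the resolution period is
unchanged.

```python
import hashlib
import hmac
import secrets
from typing import Tuple


def pseudonymize(value: str, key: bytes) -> str:
    """Replace an identifier (channel ID or pubkey) with a keyed hash.

    Assumption: `key` is a secret generated by the node operator and
    never shared, so the mapping cannot be recomputed from public
    graph data. Equal inputs map to equal outputs within one export.
    """
    digest = hmac.new(key, value.encode(), hashlib.sha256).hexdigest()
    return digest[:16]


def fuzz_timestamps(
    ts_added_ns: int, ts_removed_ns: int, max_shift_ns: int
) -> Tuple[int, int]:
    """Shift both timestamps by one random offset in [-max, +max].

    The difference ts_removed_ns - ts_added_ns is preserved exactly.
    """
    shift = secrets.randbelow(2 * max_shift_ns + 1) - max_shift_ns
    return ts_added_ns + shift, ts_removed_ns + shift
```

Note that drawing an independent offset per HTLC, as above, also
scrambles the relative ordering of HTLCs. Whether that is acceptable,
or whether a single offset per export would be needed instead, depends
on the analysis being run.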

I hope that addresses your concerns. Research based on real world data
is always a difficult line to walk, but we believe it is worthwhile in this
case.

Cheers,
Carla + Clara


On Thu, Aug 3, 2023 at 4:54 AM Elias Rohrer <l...@tnull.de> wrote:

> Hi Carla + Clara,
>
> I want to prefix this by saying that I'm very familiar with how limiting
> the lack of available real-world datasets can be for conducting significant
> simulations and empirical experiments on Lightning.
>
> However, it may be noteworthy that long-term collection of the proposed
> fields could potentially allow re-identification of the anonymized channel
> counterparties based on heuristics correlating with the public graph
> data, especially when datasets from multiple (possibly neighbouring)
> collection points end up being combined. Subsequently, this might
> allow further conclusions to be drawn on transferred amounts, channel
> liquidities at particular times, and, as HTLC settlement/failure timestamps
> are recorded in nanosecond resolution, potentially even the payment
> destination's identity (cf. 1 <https://arxiv.org/pdf/2006.12143.pdf>).
>
> As surrendering this kind of data therefore requires a good level of trust
> in the researchers, it might be helpful (and best practice) if you could
> clarify upfront whether you intend to time-box the collection period, where
> the data would be stored, and who would have access to it. From my point of
> view clearly defining the collection period would also be mandatory as we
> don't want to incentivise node operators to collect and store HTLC data
> longer-term, especially if it's to this degree of detail.
>
> Best,
>
> Elias
>
> ### 1. Collect Anonymized Data
> We're aware that we are dealing with sensitive and private information.
> For this reason, we propose defining a common data format that
> analysis tooling can be built around, so that node operators can run
> the analysis locally if desired. Fields marked with [P] *MUST* be
> randomized if exported to research teams.
>
> The proposed format is a CSV file with the following fields:
> * version (uint8): set to 1, included to future-proof ourselves
> against the need to change this format.
> * channel_in (uint64)[P]: the short channel ID of the incoming channel
> that forwarded the HTLC.
> * channel_out (uint64)[P]: the short channel ID of the outgoing
> channel that forwarded the HTLC.
> * peer_in (hex string)[P]: the hex encoded pubkey of the remote peer
> for the channel_in.
> * peer_out (hex string)[P]: the hex encoded pubkey of the remote peer
> for the channel_out.
> * fee_msat (uint64): the fee offered by the HTLC, expressed in msat.
> * outgoing_liquidity (float64): the portion of
> `max_htlc_value_in_flight` that is occupied on channel_out after the
> HTLC has been forwarded.
> * outgoing_slots (float64): the portion of `max_accepted_htlcs` that
> is occupied on channel_out after the HTLC has been forwarded.
> * ts_added_ns (uint64): the unix timestamp that the HTLC was added,
> expressed in nanoseconds.
> * ts_removed_ns (uint64): the unix timestamp that the HTLC was
> removed, expressed in nanoseconds.
> * htlc_settled (bool): set to 0 if the HTLC failed, and 1 if it was
> settled.
> * incoming_endorsed (int16): an integer indicating the endorsement
> status of the incoming HTLC (-1 if not present, otherwise set to the
> value in the incoming endorsement TLV).
> * outgoing_endorsed (int16): an integer indicating the endorsement
> status of the outgoing HTLC (-1 if not set, otherwise set to the
> value set in the outgoing endorsement TLV).
>
> Before we add endorsement signaling and setting via an experimental
> TLV, the last two values here will always be -1. The data is still
> incredibly useful in the meantime, and allows for easy update once the
> TLV is propagated through the network.
>
>
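For anyone who wants to experiment with the CSV format quoted above,
here is a minimal Python sketch of reading it. It assumes a header row
using the field names from the proposal and `htlc_settled` encoded as
0/1. The `HtlcForward` class and `parse_forwards` function are
hypothetical names for this example, not part of any existing tooling.

```python
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass
class HtlcForward:
    version: int
    channel_in: int
    channel_out: int
    peer_in: str
    peer_out: str
    fee_msat: int
    outgoing_liquidity: float
    outgoing_slots: float
    ts_added_ns: int
    ts_removed_ns: int
    htlc_settled: bool
    incoming_endorsed: int
    outgoing_endorsed: int

    @property
    def resolution_ns(self) -> int:
        """Time the HTLC was held, in nanoseconds."""
        return self.ts_removed_ns - self.ts_added_ns


def parse_forwards(text: str) -> List[HtlcForward]:
    """Parse CSV text with a header row into HtlcForward records."""
    forwards = []
    for raw in csv.DictReader(io.StringIO(text)):
        version = int(raw["version"])
        if version != 1:
            raise ValueError(f"unsupported format version: {version}")
        forwards.append(
            HtlcForward(
                version=version,
                channel_in=int(raw["channel_in"]),
                channel_out=int(raw["channel_out"]),
                peer_in=raw["peer_in"],
                peer_out=raw["peer_out"],
                fee_msat=int(raw["fee_msat"]),
                outgoing_liquidity=float(raw["outgoing_liquidity"]),
                outgoing_slots=float(raw["outgoing_slots"]),
                ts_added_ns=int(raw["ts_added_ns"]),
                ts_removed_ns=int(raw["ts_removed_ns"]),
                htlc_settled=raw["htlc_settled"] == "1",
                incoming_endorsed=int(raw["incoming_endorsed"]),
                outgoing_endorsed=int(raw["outgoing_endorsed"]),
            )
        )
    return forwards
```

Only `resolution_ns` depends on the timestamps here, which is why
shifting both timestamps of an HTLC by the same amount leaves this
kind of analysis unaffected.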
_______________________________________________
Lightning-dev mailing list
Lightning-dev@lists.linuxfoundation.org
https://lists.linuxfoundation.org/mailman/listinfo/lightning-dev
