sbp opened a new pull request #22: URL: https://github.com/apache/incubator-ponymail-foal/pull/22
This PR adds a new generator called **blakey**, based on [BLAKE3](https://github.com/BLAKE3-team/BLAKE3). It deprecates the old **dkim** generator.

The new generator is fast and produces short output, at just 32 characters. It uses a deterministic, cryptographically secure hash of the `lid` combined with the raw message source received by `archiver.py`, with 160 bits of security. The encoding is the same as that of the dkim generator, which makes taboo substrings unlikely.

The main difference between the new and old generators is that blakey does *not* attempt to deduplicate inputs. The dkim generator was intended to simulate a DKIM signature session input, giving the same output for different message sources that the algorithm nevertheless regarded as equivalent. The blakey generator is more like the **full** generator in that it generates a different output for every `lid` and message source pair.

The dkim generator was an early version of the one developed and discussed in [PR 517 of Ponymail](https://github.com/apache/incubator-ponymail/pull/517). That PR generated considerable disagreement as to the best method for judging two messages to be equivalent and encoding the result. There were several underlying problems, but the main one was the tension between wanting to deduplicate as thoroughly as possible by dropping information, and wanting to provide pristine archives by retaining information. The blakey generator sidesteps this issue by retaining all information. The only benefit it shares with dkim is that the output is short and therefore better suited to shareable permalinks.

Since all information is retained in the new generator, input messages are no longer deduplicated. If this PR is accepted, the idea is that deduplication could instead occur as a separate process implemented either in `archiver.py` or in a background task.
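To make the generator description above concrete, here is a minimal sketch of how a 32-character identifier can be derived from a `lid` and a raw message source. It is not the code in this PR. Several things are assumptions for illustration: `hashlib.blake2b` from the standard library stands in for BLAKE3, the `lid` and message are combined with a length prefix, the function name `short_id` is hypothetical, and the substitute alphabet (no `a`, `e`, `i` or `u`) is a guess at the kind of table that keeps taboo substrings unlikely.

```python
import base64
import hashlib

# Standard base32 alphabet and a hypothetical replacement that leaves out
# the vowels a, e, i and u (an assumption, not necessarily Foal's table).
STANDARD_B32 = b"ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"
SAFE_B32 = b"0123456789bcdfghjklmnopqrstvwxyz"
TABLE = bytes.maketrans(STANDARD_B32, SAFE_B32)


def short_id(lid: str, raw_message: bytes) -> str:
    """Return a 32-character identifier for a (lid, raw message) pair."""
    lid_bytes = lid.encode("utf-8")
    # BLAKE2b is a stand-in for BLAKE3 here; 20 bytes gives 160 bits.
    hasher = hashlib.blake2b(digest_size=20)
    # Length-prefix the lid so the lid/message boundary is unambiguous
    # (an assumed way of combining the two inputs).
    hasher.update(len(lid_bytes).to_bytes(8, "big"))
    hasher.update(lid_bytes)
    hasher.update(raw_message)
    # 20 bytes encode to exactly 32 base32 characters, with no padding.
    encoded = base64.b32encode(hasher.digest())
    return encoded.translate(TABLE).decode("ascii")
```

Because nothing is dropped from the input, the same message posted to two lists, or two sources that differ by a single byte, receive different identifiers.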
When a new message is received, existing messages could be checked for similarity. If a similar enough existing message is found, the new message can either be stored in the archive with a reference to the existing message, or discarded. Such a check would be quite efficient because, for example, the dkim generator considered messages with the same `message-id` to be equivalent, and a `message-id` query would dramatically narrow down the candidates to check for similarity. The interface can then redirect or ignore requests for messages that have a reference to an existing message.

This PR is, therefore, the first part in what could be a sequence of updates to Foal. This PR introduces blakey, and deprecates dkim. Next it would be useful to add a tool which would migrate databases using dkim to use blakey instead. Then various deduplication tools could be added, to `archiver.py` or as a background process, or both.

The significant advantage of this approach is that when deduplication takes place as a separate task with aliasing, it is both non-destructive and flexible. The approach can be changed on the fly, and existing emails are unaffected. It does not require a consensus on the difficult problem of which emails are equivalent. And it allows such approaches to be added incrementally.
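The deduplication-by-aliasing idea can be sketched roughly as follows. This is an in-memory illustration only, not a proposal for the actual storage layer: the `Archive` and `StoredMessage` classes, the `alias_of` field, the 0.9 threshold and the use of `difflib.SequenceMatcher` as the similarity test are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, List, Optional


@dataclass
class StoredMessage:
    doc_id: str
    message_id: str
    body: str
    alias_of: Optional[str] = None  # doc_id of the message this one duplicates


class Archive:
    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold
        self.by_doc_id: Dict[str, StoredMessage] = {}
        self.by_message_id: Dict[str, List[str]] = {}

    def add(self, doc_id: str, message_id: str, body: str) -> StoredMessage:
        """Store a message, aliasing it to a similar existing one if found."""
        alias_of = None
        # The message-id lookup narrows the candidates before any comparison.
        for candidate_id in self.by_message_id.get(message_id, []):
            candidate = self.by_doc_id[candidate_id]
            if candidate.alias_of is not None:
                continue  # only compare against non-alias messages
            ratio = SequenceMatcher(None, candidate.body, body).ratio()
            if ratio >= self.threshold:
                alias_of = candidate.doc_id
                break
        message = StoredMessage(doc_id, message_id, body, alias_of)
        self.by_doc_id[doc_id] = message
        self.by_message_id.setdefault(message_id, []).append(doc_id)
        return message

    def resolve(self, doc_id: str) -> StoredMessage:
        """Follow an alias, as an interface redirect would."""
        message = self.by_doc_id[doc_id]
        if message.alias_of is not None:
            return self.by_doc_id[message.alias_of]
        return message
```

Nothing is deleted in this sketch: the aliased message is still stored under its own identifier, so changing the similarity test later only means recomputing `alias_of` values.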
