sbp opened a new pull request #22:
URL: https://github.com/apache/incubator-ponymail-foal/pull/22


   This PR adds a new generator called **blakey**, based on 
[BLAKE3](https://github.com/BLAKE3-team/BLAKE3). It deprecates the old **dkim** 
generator.
   
   The new generator is fast and produces short output, just 32 characters 
long. It uses a deterministic, cryptographically secure hash of the combination 
of the `lid` and the raw message source received by `archiver.py`, with 160 
bits of security. The encoding is the same as that of the dkim generator, which 
makes taboo substrings unlikely.
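
As a rough illustration of how a 160-bit digest becomes a 32-character identifier, here is a minimal sketch. It is not the code in this PR: it uses BLAKE2b from the Python standard library as a stand-in for BLAKE3, standard lowercase base32 as a stand-in for the dkim generator's encoding, and a length prefix on the `lid` as an assumed way of combining the two inputs. The name `short_id` is hypothetical.

```python
import base64
import hashlib


def short_id(lid: bytes, raw_message: bytes) -> str:
    """Return a 32-character identifier for a (lid, raw message) pair.

    Assumptions: BLAKE2b stands in for BLAKE3, and standard base32
    stands in for the encoding used by the dkim generator.
    """
    # 20 bytes = 160 bits of digest output.
    hasher = hashlib.blake2b(digest_size=20)
    # Length-prefix the lid so that moving bytes between the lid and the
    # message source cannot produce the same hash input.
    hasher.update(len(lid).to_bytes(4, "big"))
    hasher.update(lid)
    hasher.update(raw_message)
    # 160 bits at 5 bits per base32 character is exactly 32 characters,
    # so no padding is emitted.
    return base64.b32encode(hasher.digest()).decode("ascii").lower()
```

The same `lid` and message source always give the same identifier, and changing either input changes it, which matches the no-deduplication behaviour described in this PR.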
   
   The main difference between the new and old generators is that blakey does 
*not* attempt to deduplicate inputs. In other words, whereas the dkim generator 
was intended to simulate a DKIM signature session input, giving the same output 
for different message sources which were nevertheless regarded as equivalent 
according to the algorithm, the blakey generator is more like the **full** 
generator in that it generates a different output for every `lid` and message 
source pair.
   
   The dkim generator was an early version of the one developed and discussed 
in [PR 517 of Ponymail](https://github.com/apache/incubator-ponymail/pull/517). 
That PR generated considerable disagreement as to the best method for judging 
two messages to be equivalent and encoding the result. There were several 
underlying problems, but the main one was the tension between wanting to 
deduplicate as thoroughly as possible by dropping information, and wanting to 
provide pristine archives by retaining information.
   
   The blakey generator sidesteps this issue by retaining all information. The 
only benefit it shares with dkim is that its output is short and therefore 
better suited to shareable permalinks.
   
   Since the new generator retains all information, input messages are no 
longer deduplicated. If this PR is accepted, the idea is that deduplication 
could instead occur as a separate process implemented either in `archiver.py` 
or in a background task. When a new message 
is received, existing messages could be checked for similarity. If a similar 
enough existing message is found, the new message can either be stored in the 
archive with a reference to the existing message, or discarded. Such a check 
would be quite efficient because, for example, the dkim generator considered 
messages with the same `message-id` to be equivalent, and a `message-id` query 
would dramatically narrow down the candidates to check for similarity. The 
interface can then redirect or ignore requests for messages that have a 
reference to an existing message.
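
The flow described above could look roughly like the following sketch. It is an in-memory illustration only, not part of this PR: the `Archive` and `StoredMessage` classes, the `resolve` method, and the whitespace-based equivalence check are all hypothetical, and a real implementation would query the database by `message-id` instead of using a dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class StoredMessage:
    doc_id: str
    message_id: str
    source: str
    alias_of: Optional[str] = None  # doc_id of the equivalent earlier message


class Archive:
    def __init__(self, equivalent: Callable[[str, str], bool]) -> None:
        # The equivalence policy is pluggable, so it can change over time
        # without touching messages that are already stored.
        self.equivalent = equivalent
        self.by_doc_id: Dict[str, StoredMessage] = {}
        self.by_message_id: Dict[str, List[StoredMessage]] = {}

    def add(self, doc_id: str, message_id: str, source: str) -> StoredMessage:
        alias_of = None
        # Only messages sharing the message-id are candidates.
        for existing in self.by_message_id.get(message_id, []):
            if existing.alias_of is None and self.equivalent(existing.source, source):
                alias_of = existing.doc_id
                break
        # Non-destructive: the new message is stored either way.
        doc = StoredMessage(doc_id, message_id, source, alias_of)
        self.by_doc_id[doc_id] = doc
        self.by_message_id.setdefault(message_id, []).append(doc)
        return doc

    def resolve(self, doc_id: str) -> str:
        """Return the doc_id the interface should redirect to."""
        doc = self.by_doc_id[doc_id]
        return doc.alias_of if doc.alias_of is not None else doc.doc_id


def same_ignoring_whitespace(a: str, b: str) -> bool:
    # Placeholder policy, chosen only to keep the example small.
    return a.split() == b.split()
```

Because the alias is stored next to the message instead of replacing it, swapping `same_ignoring_whitespace` for a stricter or looser check affects only messages archived afterwards.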
   
   This PR is therefore the first in what could be a sequence of updates to 
Foal. It introduces blakey and deprecates dkim. Next, it would be useful to add 
a tool that migrates databases from dkim to blakey. After that, deduplication 
tools could be added to `archiver.py`, as a background process, or both.
   
   The significant advantage of this approach is that when deduplication takes 
place as a separate task with aliasing, it can not only be performed in a 
non-destructive way but it can also be flexible. The approach can be changed on 
the fly, and existing emails are unaffected. It does not require consensus on 
the difficult problem of which emails are equivalent. And it allows such 
approaches to be added incrementally.
   



