On Sat, 19 Mar 2022 at 13:39, Daniel Gruno <[email protected]> wrote:
>
> On 19/03/2022 12.48, sebb wrote:
> > AFAICT, all distinct email sources are currently stored in the
> > database, because the id is derived from a hash of the source. (*)
> >
> > However, that does not mean that they are recoverable.
> >
> > The current database design requires that source entries are retrieved
> > via the corresponding mbox entry.
> > If a second email is received that hashes to the same mbox index, the
> > pointer back to the existing source entry will be overwritten.
> >
> > Such duplicates are not unknown; mail transport glitches can result in
> > duplication of email content (but different ezmlm archive numbers and
> > some other headers).
> >
> > In such cases, it is no longer possible to recover the original source.
>
> I think we discussed this before. One solution is to change the behavior
> at
> https://github.com/apache/incubator-ponymail-foal/blob/master/tools/archiver.py#L728
> - if an email is found to already exist with the same DKIM ID, we should
> fetch it and append the new mbox_source ID to the existing document. As
> ElasticSearch doesn't care if something is a string or an array of
> strings, this should be fine. A check for the source could then perhaps
> result in an HTTP 300 Multiple Choices response?

Yes, something like that should fix it.

> >
> > I think this could be fixed, but until it is, I don't think Pony Mail
> > can be considered as a complete archival application, as it does not
> > give access to all the emails received by a mailing list.
>
> We have to manage expectations and define what we mean by "archive". In

Exactly.
I expect an archive of an email list to contain the same emails as I
receive as a subscriber.

> my world, Pony Mail exists as a searchable/interactive archive for users
> to find _content_ and _intentions_, not necessarily as a bit-for-bit
> verbatim backup for system administrators. If people wish to insure
> against disasters, there are ways of doing that.

That is not the point.

> I find it sufficient that as long as you can find your email in the
> right place, it does not matter if it's technically a "de-duplicated
> duplicate".
>

I don't find it sufficient.
If I cannot find all my emails in the archive, I don't consider it complete.

> >
> > Sebb
> > (*) discounting hash collisions, which should be vanishingly small
>
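
For illustration only, here is a minimal sketch of the append-on-duplicate
idea discussed above. It is not the code in archiver.py: a plain dict stands
in for the ElasticSearch mbox index, and the field name `source_ids` and the
helpers `link_source` and `source_lookup` are hypothetical names made up for
this example. It shows the two pieces of the proposal: keep every source
pointer when a duplicate arrives, and report "multiple choices" when more
than one source exists.

```python
from typing import Dict, List, Tuple

# Hypothetical field name; the real mbox document schema may differ.
SOURCE_FIELD = "source_ids"


def _as_list(value) -> List[str]:
    """Normalise a stored value that may be a single string or a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def link_source(mbox_index: Dict[str, dict], mid: str, source_id: str) -> List[str]:
    """Attach source_id to mbox document `mid` without dropping earlier pointers."""
    doc = mbox_index.setdefault(mid, {"mid": mid})
    ids = _as_list(doc.get(SOURCE_FIELD))
    if source_id not in ids:
        # Append instead of overwrite, so the original source stays reachable.
        ids.append(source_id)
    doc[SOURCE_FIELD] = ids
    return ids


def source_lookup(mbox_index: Dict[str, dict], mid: str) -> Tuple[int, List[str]]:
    """Return an HTTP-style status and the source ids linked to `mid`."""
    doc = mbox_index.get(mid)
    ids = _as_list(doc.get(SOURCE_FIELD)) if doc else []
    if not ids:
        return 404, []
    if len(ids) > 1:
        # More than one source hashed to the same mbox entry.
        return 300, ids
    return 200, ids
```

With this shape, archiving a second email that hashes to the same mbox entry
leaves both source ids attached, and a lookup for that entry returns status
300 together with the list of candidates rather than silently picking one.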
