On Tue, 25 Aug 2020 at 20:27, Daniel Gruno <[email protected]> wrote:
>
> On 25/08/2020 21.24, sebb wrote:
> > On Tue, 25 Aug 2020 at 20:11, Daniel Gruno <[email protected]> wrote:
> >>
> >> On 25/08/2020 20.54, sebb wrote:
> >>> On Tue, 25 Aug 2020 at 19:42, Daniel Gruno <[email protected]> wrote:
> >>>>
> >>>> On 25/08/2020 20.35, sebb wrote:
> >>>>> On Tue, 25 Aug 2020 at 19:23, Daniel Gruno <[email protected]> wrote:
> >>>>>>
> >>>>>> On 25/08/2020 20.15, sebb wrote:
> >>>>>>> AFAICT this will generate different hashes for the same message if
> >>>>>>> they are loaded from a different source.
> >>>>>>
> >>>>>> Yeah, it will - at present, that is on purpose. We can look at doing
> >>>>>> something like using Sean's DKIM parser for this, and only hashing the
> >>>>>> output from that, with the x-archived-list-id added in from the command
> >>>>>> line --lid argument if different from the canonical list id.
> >>>>>>
> >>>>>>>
> >>>>>>> Whilst it should ensure that distinct messages don't clash, it won't
> >>>>>>> weed out actual duplicates.
> >>>>>>
> >>>>>> Right, aware of that. In most cases, if you are reloading, you are
> >>>>>> doing so with a fresh DB, and it won't matter much. In cases where you are
> >>>>>> "cascading" mbox files, it would make duplicates, but that's only a
> >>>>>> question of disk space for now, having duplicate source files won't
> >>>>>> cause malfunctions, just a few more bytes used and source alternatives.
> >>>>>
> >>>>> This has implications for the API and the UI.
> >>>>>
> >>>>> If there are multiple matches for a Permalink, in general one cannot
> >>>>> say which is correct, so all will have to be returned and displayed.
> >>>>
> >>>> I'm pondering how to address this. Currently, the prototype will return
> >>>> the first hit it finds that matches. This should really be fine, as they
> >>>> are all valid sources, so returning one or the other would not matter
> >>>> for the end-user.
> >>>
> >>> This assumes that the Permalink is sufficiently unique.
> >>> That is not true for some of the current designs.
> >>>
> >>
> >> This would be the case only if you lost your database and decided to
> >> re-image everything from scratch using foal with an older generator
> >> instead of the original pony mail, and two or more emails had collisions.
> >>
> >> I would strongly recommend against doing this unless you have no other
> >> choice or do not care about older permalinks that much.
> >>
> >> Foal is not meant as a drop-in replacement for the current Pony Mail. If
> >> you lose your old database and want complete assuredness against this,
> >> you should re-image using the old version first, and then migrate
> >> across. There will be differences in both the archiver and the UI that
> >> are not fully backwards compatible, as the 'old ways' are bugged here
> >> and there.
> >>
> >> The migrator will, once it's done, migrate everything over verbatim, so
> >> any overrides you had in the old system will apply to the new one as
> >> well, and you won't see multiple choices for old emails, only newly
> >> archived ones done with the foal archiver or importer.
> >
> > If Foal is to support non-unique generators, it must use their
> > Permalinks as the database Id, or it must support multiple matches.
> >
>
> I'm strongly in favor of ripping them out of the system altogether, and
> only supporting full and dkim for future operations. I haven't quite
> gotten around to it yet :)

The full generator is only useful for messages that always come from the same source. Unless all the headers are identical, it will produce a different output.

And until the recent removal of the archived-at header, it would not even produce the same result twice for identical archiver inputs. Nor would import-mbox produce the same result as the archiver for the same message.

It is the least stable generator. I think it would be a mistake to keep it unless you are keeping them all.
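
To illustrate the point, here is a rough Python sketch. It is not the actual Pony Mail or Foal generator code: the two message copies, the function names and the choice of "stable" headers are all made up for the example. It only shows why hashing every header gives a different ID as soon as one transport header differs, while hashing a fixed subset does not.

```python
import hashlib
from email import message_from_string

# Hypothetical example: the same message as received via two different routes.
COPY_A = (
    "Received: from mx1.example.org\n"
    "Message-ID: <abc@example.org>\n"
    "From: a@example.org\n"
    "Subject: hello\n"
    "\n"
    "Body text\n"
)
COPY_B = COPY_A.replace("mx1.example.org", "mx2.example.org")


def hash_everything(raw: str) -> str:
    """Hash the whole raw message, all headers included (illustrative only)."""
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


# Assumed subset for the example; a real generator would choose its own.
STABLE_HEADERS = ("message-id", "from", "subject")


def hash_stable_subset(raw: str) -> str:
    """Hash only a fixed set of headers plus the body (illustrative only)."""
    msg = message_from_string(raw)
    parts = [f"{name}:{msg.get(name, '')}" for name in STABLE_HEADERS]
    parts.append(msg.get_payload())
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # One differing Received header is enough to change the whole-message hash.
    print(hash_everything(COPY_A) == hash_everything(COPY_B))
    # The subset hash ignores the Received header, so both copies agree.
    print(hash_stable_subset(COPY_A) == hash_stable_subset(COPY_B))
```

The subset approach is only as unique as the headers it hashes, which is the trade-off raised earlier in the thread.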
