On Tue, 25 Aug 2020 at 19:42, Daniel Gruno <[email protected]> wrote:
>
> On 25/08/2020 20.35, sebb wrote:
> > On Tue, 25 Aug 2020 at 19:23, Daniel Gruno <[email protected]> wrote:
> >>
> >> On 25/08/2020 20.15, sebb wrote:
> >>> AFAICT this will generate different hashes for the same message if
> >>> they are loaded from a different source.
> >>
> >> Yeah, it will - at present, that is on purpose. We can look at doing
> >> something like using Sean's DKIM parser for this, and only hashing the
> >> output from that, with the x-archived-list-id added in from the command
> >> line --lid argument if different from the canonical list id.
> >>
> >>>
> >>> Whilst it should ensure that distinct messages don't clash, it won't
> >>> weed out actual duplicates.
> >>
> >> Right, aware of that. In most cases, if you are reloading, you are doing
> >> so with a fresh DB, and it won't matter much. In cases where you are
> >> "cascading" mbox files, it would make duplicates, but that's only a
> >> question of disk space for now, having duplicate source files won't
> >> cause malfunctions, just a few more bytes used and source alternatives.
> >
> > This has implications for the API and the UI.
> >
> > If there are multiple matches for a Permalink, in general one cannot
> > say which is correct, so all will have to be returned and displayed.
>
> I'm pondering how to address this. Currently, the prototype will return
> the first hit it finds that matches. This should really be fine, as they
> are all valid sources, so returning one or the other would not matter
> for the end-user.
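(Aside: a minimal Python sketch of the hashing issue under discussion. The
headers below are made up for illustration; the point is only that two copies
of the "same" email retrieved from different sources hash differently whenever
the raw bytes differ, e.g. because each hop adds its own Received: header.)

```python
import hashlib

# Two copies of one logical message, as delivered via two different relays.
copy_a = b"Received: from mx1.example.org\r\nMessage-ID: <[email protected]>\r\n\r\nHello\r\n"
copy_b = b"Received: from mx2.example.org\r\nMessage-ID: <[email protected]>\r\n\r\nHello\r\n"

# Hashing the raw source, as import-mbox.py now does, treats them as distinct.
print(hashlib.sha3_256(copy_a).hexdigest() == hashlib.sha3_256(copy_b).hexdigest())
```

Hashing only a canonicalised form (e.g. the DKIM-parser output mentioned
above) would instead collapse such copies into one document.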
This assumes that the Permalink is sufficiently unique.
That is not true for some of the current designs.

> For those that *do* care, which might be sysadmins looking into whether
> an email was received twice or such, I'm thinking we might utilize the
> 300 return code (multiple choices) for this. This would mean we are not
> duplicating it for the user in the UI, but that it's still available for
> those that want to find it.
>
> >
> > Note: in case there need to be changes to the hash input, I suggest a
> > prefix is added to identify the hash type.
> > It's not a problem if the key is slightly longer.
> >
> >>>
> >>> Ideally the parts that vary between duplicates from different sources
> >>> should be omitted from the hash.
> >>>
> >>> However it is better than the current situation.
> >>>
> >>> On Tue, 25 Aug 2020 at 19:02, <[email protected]> wrote:
> >>>>
> >>>> This is an automated email from the ASF dual-hosted git repository.
> >>>>
> >>>> humbedooh pushed a commit to branch master
> >>>> in repository
> >>>> https://gitbox.apache.org/repos/asf/incubator-ponymail-foal.git
> >>>>
> >>>>
> >>>> The following commit(s) were added to refs/heads/master by this push:
> >>>>      new 8a1baf1  store sources as sha3-256 of themselves, add a
> >>>> permalink reference to the digested doc.
> >>>> 8a1baf1 is described below
> >>>>
> >>>> commit 8a1baf16c64933f1fb7a9d281df8ebec418cf771
> >>>> Author: Daniel Gruno <[email protected]>
> >>>> AuthorDate: Tue Aug 25 20:02:44 2020 +0200
> >>>>
> >>>>     store sources as sha3-256 of themselves, add a permalink reference
> >>>>     to the digested doc.
> >>>>
> >>>>     Conform to what archiver.py does.
> >>>>
> >>>>     This allows us to store multiple copies of the same digested
> >>>>     email, if received more than once (and if the raw data differs).
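(To illustrate the prefix suggestion above: a hypothetical key-builder, not
Ponymail code. Prefixing the key with the algorithm name means the hash input
or algorithm can change later without old and new keys becoming ambiguous.)

```python
import hashlib

def source_key(raw: bytes, algo: str = "sha3-256") -> str:
    """Return a self-describing key like 'sha3-256:<hexdigest>'."""
    # hashlib.new() expects underscore-style names such as "sha3_256".
    digest = hashlib.new(algo.replace("-", "_"), raw).hexdigest()
    return f"{algo}:{digest}"
```

A later switch to, say, hashing canonicalised content could then use a new
prefix, and both key generations would coexist unambiguously in the index.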
> >>>> ---
> >>>>  tools/import-mbox.py | 7 ++++---
> >>>>  1 file changed, 4 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/tools/import-mbox.py b/tools/import-mbox.py
> >>>> index 482a2b8..663e6da 100755
> >>>> --- a/tools/import-mbox.py
> >>>> +++ b/tools/import-mbox.py
> >>>> @@ -38,6 +38,7 @@ from os.path import isdir, isfile, join
> >>>>  from threading import Lock, Thread
> >>>>  from urllib.request import urlopen
> >>>>
> >>>> +
> >>>>  import archiver
> >>>>  from plugins.elastic import Elastic
> >>>>
> >>>> @@ -208,6 +209,7 @@ class SlurpThread(Thread):
> >>>>                  for key in messages.iterkeys():
> >>>>                      message = messages.get(key)
> >>>>                      message_raw = messages.get_bytes(key)
> >>>> +                    sha3 = hashlib.sha3_256(message_raw).hexdigest()
> >>>>                      # If --filter is set, discard any messages not
> >>>>                      # matching by continuing to next email
> >>>>                      if (
> >>>>                          fromFilter
> >>>> @@ -313,9 +315,8 @@ class SlurpThread(Thread):
> >>>>                      try:  # temporary hack to try and find an
> >>>>                          # encoding issue
> >>>>                          # needs to be replaced by proper exception
> >>>>                          # handling
> >>>>                          json_source = {
> >>>> -                            "mid": json[
> >>>> -                                "mid"
> >>>> -                            ],  # needed for bulk-insert only, not
> >>>> needed in database
> >>>> +                            "permalink": json["mid"],
> >>>> +                            "mid": sha3,
> >>>>                              "message-id": json["message-id"],
> >>>>                              "source": archiver.mbox_source(raw_msg),
> >>>>                          }
> >>>>
> >>
>
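(For readers skimming the diff above: the committed change, boiled down to a
standalone sketch. `make_json_source` and its arguments are illustrative names,
not the actual function in import-mbox.py. The document id "mid" becomes the
sha3-256 of the raw source, while the previously generated id is kept as a
"permalink" field so existing archive lookups still resolve.)

```python
import hashlib

def make_json_source(json_doc: dict, message_raw: bytes, source: str) -> dict:
    # Unique per raw source copy: two byte-identical copies collide,
    # two differing copies of the same message are stored separately.
    sha3 = hashlib.sha3_256(message_raw).hexdigest()
    return {
        "permalink": json_doc["mid"],          # old-style id, used for lookups
        "mid": sha3,                           # new source-document id
        "message-id": json_doc["message-id"],
        "source": source,
    }
```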
