On Tue, 25 Aug 2020 at 19:42, Daniel Gruno <[email protected]> wrote:
>
> On 25/08/2020 20.35, sebb wrote:
> > On Tue, 25 Aug 2020 at 19:23, Daniel Gruno <[email protected]> wrote:
> >>
> >> On 25/08/2020 20.15, sebb wrote:
> >>> AFAICT this will generate different hashes for the same message if
> >>> they are loaded from a different source.
> >>
> >> Yeah, it will - at present, that is on purpose. We can look at doing
> >> something like using Sean's DKIM parser for this, and only hashing the
> >> output from that, with the x-archived-list-id added in from the command
> >> line --lid argument if different from the canonical list id.
> >>
> >>>
> >>> Whilst it should ensure that distinct messages don't clash, it won't
> >>> weed out actual duplicates.
> >>
> >> Right, aware of that. In most cases, if you are reloading, you are doing
> >> so with a fresh DB, and it won't matter much. In cases where you are
> >> "cascading" mbox files, it would make duplicates, but that's only a
> >> question of disk space for now; having duplicate source files won't
> >> cause malfunctions, just a few more bytes used and source alternatives.
> >
> > This has implications for the API and the UI.
> >
> > If there are multiple matches for a Permalink, in general one cannot
> > say which is correct, so all will have to be returned and displayed.
>
> I'm pondering how to address this. Currently, the prototype will return
> the first hit it finds that matches. This should really be fine, as they
> are all valid sources, so returning one or the other would not matter
> for the end-user.

This assumes that the Permalink is sufficiently unique.
That is not true for some of the current designs.

> For those that *do* care, which might be sysadmins looking into whether
> an email was received twice or such, I'm thinking we might utilize the
> 300 return code (multiple choices) for this. This would mean we are not
> duplicating it for the user in the UI, but that it's still available for
> those that want to find it.
>
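To make the 300 idea concrete, here is a minimal sketch of one possible
reading of it. The function name, the shape of `hits` and the response
bodies are all hypothetical, not the prototype's actual API; the point
is only that a single match returns the source directly, while several
matches return the list of candidates instead of silently picking one.

```python
from http import HTTPStatus


def permalink_response(hits):
    """Map the sources matching one Permalink to a (status, body) pair.

    `hits` is assumed to be a list of source documents, each a dict
    with a "mid" key (hypothetical shape, for illustration only).
    """
    if not hits:
        return HTTPStatus.NOT_FOUND, {"error": "no such permalink"}
    if len(hits) == 1:
        return HTTPStatus.OK, hits[0]
    # Several sources share the Permalink: list every candidate so the
    # caller can choose, rather than guessing which one is "correct".
    return HTTPStatus.MULTIPLE_CHOICES, {"choices": [h["mid"] for h in hits]}
```

A UI could still follow the first entry in "choices" by default, so
ordinary users would not see the duplicates.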
> >
> > Note: in case there need to be changes to the hash input, I suggest a
> > prefix is added to identify the hash type.
> > It's not a problem if the key is slightly longer.
> >
> >>>
> >>> Ideally the parts that vary between duplicates from different sources
> >>> should be omitted from the hash.
> >>>
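As a rough illustration of omitting the varying parts, the sketch below
drops a few headers before hashing. The header list is an assumption
about what typically differs between delivery paths, and this is a
simpler stand-in for the DKIM-parser approach mentioned above, not an
implementation of it.

```python
import email
import hashlib

# Assumption: headers that tend to differ between copies of the same
# message received via different routes. The real list needs agreeing.
VOLATILE_HEADERS = ("Received", "Return-Path", "Delivered-To", "X-Original-To")


def stable_digest(message_raw: bytes) -> str:
    """Hash a message with the route-dependent headers removed."""
    msg = email.message_from_bytes(message_raw)
    for name in VOLATILE_HEADERS:
        # Deletes every occurrence of the header; a no-op if it is absent.
        del msg[name]
    return hashlib.sha3_256(msg.as_bytes()).hexdigest()
```

Two copies that differ only in those headers then hash to the same
value, while a difference in the body or in any other header still
produces a distinct hash.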
> >>> However it is better than the current situation.
> >>>
> >>> On Tue, 25 Aug 2020 at 19:02, <[email protected]> wrote:
> >>>>
> >>>> This is an automated email from the ASF dual-hosted git repository.
> >>>>
> >>>> humbedooh pushed a commit to branch master
> >>>> in repository https://gitbox.apache.org/repos/asf/incubator-ponymail-foal.git
> >>>>
> >>>>
> >>>> The following commit(s) were added to refs/heads/master by this push:
> >>>>        new 8a1baf1  store sources as sha3-256 of themselves, add a permalink reference to the digested doc.
> >>>> 8a1baf1 is described below
> >>>>
> >>>> commit 8a1baf16c64933f1fb7a9d281df8ebec418cf771
> >>>> Author: Daniel Gruno <[email protected]>
> >>>> AuthorDate: Tue Aug 25 20:02:44 2020 +0200
> >>>>
> >>>>       store sources as sha3-256 of themselves, add a permalink reference to the digested doc.
> >>>>
> >>>>       Conform to what archiver.py does.
> >>>>
> >>>>       This allows us to store multiple copies of the same digested email, if
> >>>>       received more than once (and if the raw data differs).
> >>>> ---
> >>>>    tools/import-mbox.py | 7 ++++---
> >>>>    1 file changed, 4 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/tools/import-mbox.py b/tools/import-mbox.py
> >>>> index 482a2b8..663e6da 100755
> >>>> --- a/tools/import-mbox.py
> >>>> +++ b/tools/import-mbox.py
> >>>> @@ -38,6 +38,7 @@ from os.path import isdir, isfile, join
> >>>>    from threading import Lock, Thread
> >>>>    from urllib.request import urlopen
> >>>>
> >>>> +
> >>>>    import archiver
> >>>>    from plugins.elastic import Elastic
> >>>>
> >>>> @@ -208,6 +209,7 @@ class SlurpThread(Thread):
> >>>>                for key in messages.iterkeys():
> >>>>                    message = messages.get(key)
> >>>>                    message_raw = messages.get_bytes(key)
> >>>> +                sha3 = hashlib.sha3_256(message_raw).hexdigest()
> >>>>                    # If --filter is set, discard any messages not matching by continuing to next email
> >>>>                    if (
> >>>>                        fromFilter
> >>>> @@ -313,9 +315,8 @@ class SlurpThread(Thread):
> >>>>                        try:  # temporary hack to try and find an encoding issue
> >>>>                            # needs to be replaced by proper exception handling
> >>>>                            json_source = {
> >>>> -                            "mid": json[
> >>>> -                                "mid"
> >>>> -                            ],  # needed for bulk-insert only, not needed in database
> >>>> +                            "permalink": json["mid"],
> >>>> +                            "mid": sha3,
> >>>>                                "message-id": json["message-id"],
> >>>>                                "source": archiver.mbox_source(raw_msg),
> >>>>                            }
> >>>>
> >>
>
