On Tue, 25 Aug 2020 at 19:23, Daniel Gruno <[email protected]> wrote:
>
> On 25/08/2020 20.15, sebb wrote:
> > AFAICT this will generate different hashes for the same message if
> > they are loaded from a different source.
>
> Yeah, it will - at present, that is on purpose. We can look at doing
> something like using Sean's DKIM parser for this, and only hashing the
> output from that, with the x-archived-list-id added in from the command
> line --lid argument if different from the canonical list id.
>
> >
> > Whilst it should ensure that distinct messages don't clash, it won't
> > weed out actual duplicates.
>
> Right, aware of that. In most cases, if you are reloading, you are doing
> so with a fresh DB, and it won't matter much. In cases where you are
> "cascading" mbox files, it would make duplicates, but that's only a
> question of disk space for now, having duplicate source files won't
> cause malfunctions, just a few more bytes used and source alternatives.

This has implications for the API and the UI.

If there are multiple matches for a Permalink, in general one cannot
say which is correct, so all will have to be returned and displayed.
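
As a rough sketch of what that means for a lookup: the helper names and
sample messages below are hypothetical, not Foal's actual API, and a plain
list stands in for the source index. Only the "permalink" and "mid" field
names are taken from the commit quoted below.

```python
import hashlib


def source_doc(permalink: str, message_raw: bytes) -> dict:
    """Build a source document the way the commit does: mid = sha3-256 of the raw bytes."""
    return {
        "permalink": permalink,
        "mid": hashlib.sha3_256(message_raw).hexdigest(),
        "source": message_raw.decode("utf-8", errors="replace"),
    }


def find_sources(docs: list, permalink: str) -> list:
    """Return every source document that points at the given permalink."""
    return [doc for doc in docs if doc["permalink"] == permalink]


# The same email loaded from two sources; only the Received header differs.
body = b"Message-ID: <1@example.org>\r\nSubject: hi\r\n\r\nhello\r\n"
copy_a = b"Received: from mx1.example.org\r\n" + body
copy_b = b"Received: from mx2.example.org\r\n" + body

# "abc123" is a made-up permalink value shared by both copies.
docs = [source_doc("abc123", copy_a), source_doc("abc123", copy_b)]
matches = find_sources(docs, "abc123")
```

Both copies come back with different "mid" values, so the caller has to be
prepared for a list rather than a single document.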

Note: in case the hash input needs to change later, I suggest adding a
prefix to the key that identifies the hash type, so that keys generated
under different schemes can be told apart.
It's not a problem if the key is slightly longer.
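
For illustration, a prefixed key could look something like this. The
"sha3-256:" prefix format and the source_key()/key_scheme() helpers are
assumptions on my part, not anything that exists in the codebase.

```python
import hashlib

# Hypothetical registry of supported hash schemes; more can be added later.
HASHERS = {"sha3-256": hashlib.sha3_256}


def source_key(message_raw: bytes, scheme: str = "sha3-256") -> str:
    """Return a key of the form '<scheme>:<hexdigest>' for the raw message."""
    digest = HASHERS[scheme](message_raw).hexdigest()
    return f"{scheme}:{digest}"


def key_scheme(key: str) -> str:
    """Recover the hash scheme from a prefixed key."""
    return key.split(":", 1)[0]


key = source_key(b"raw message bytes")
```

If the hash input changes, a new prefix can be introduced and old keys
remain recognisable as belonging to the earlier scheme.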

> >
> > Ideally the parts that vary between duplicates from different sources
> > should be omitted from the hash.
> >
> > However it is better than the current situation.
> >
> > On Tue, 25 Aug 2020 at 19:02, <[email protected]> wrote:
> >>
> >> This is an automated email from the ASF dual-hosted git repository.
> >>
> >> humbedooh pushed a commit to branch master
> >> in repository https://gitbox.apache.org/repos/asf/incubator-ponymail-foal.git
> >>
> >>
> >> The following commit(s) were added to refs/heads/master by this push:
> >>       new 8a1baf1  store sources as sha3-256 of themselves, add a permalink reference to the digested doc.
> >> 8a1baf1 is described below
> >>
> >> commit 8a1baf16c64933f1fb7a9d281df8ebec418cf771
> >> Author: Daniel Gruno <[email protected]>
> >> AuthorDate: Tue Aug 25 20:02:44 2020 +0200
> >>
> >>      store sources as sha3-256 of themselves, add a permalink reference to the digested doc.
> >>
> >>      Conform to what archiver.py does.
> >>
> >>      This allows us to store multiple copies of the same digested email, if
> >>      received more than once (and if the raw data differs).
> >> ---
> >>   tools/import-mbox.py | 7 ++++---
> >>   1 file changed, 4 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/tools/import-mbox.py b/tools/import-mbox.py
> >> index 482a2b8..663e6da 100755
> >> --- a/tools/import-mbox.py
> >> +++ b/tools/import-mbox.py
> >> @@ -38,6 +38,7 @@ from os.path import isdir, isfile, join
> >>   from threading import Lock, Thread
> >>   from urllib.request import urlopen
> >>
> >> +
> >>   import archiver
> >>   from plugins.elastic import Elastic
> >>
> >> @@ -208,6 +209,7 @@ class SlurpThread(Thread):
> >>               for key in messages.iterkeys():
> >>                   message = messages.get(key)
> >>                   message_raw = messages.get_bytes(key)
> >> +                sha3 = hashlib.sha3_256(message_raw).hexdigest()
> >>                   # If --filter is set, discard any messages not matching by continuing to next email
> >>                   if (
> >>                       fromFilter
> >> @@ -313,9 +315,8 @@ class SlurpThread(Thread):
> >>                       try:  # temporary hack to try and find an encoding issue
> >>                           # needs to be replaced by proper exception handling
> >>                           json_source = {
> >> -                            "mid": json[
> >> -                                "mid"
> >> -                            ],  # needed for bulk-insert only, not needed in database
> >> +                            "permalink": json["mid"],
> >> +                            "mid": sha3,
> >>                               "message-id": json["message-id"],
> >>                               "source": archiver.mbox_source(raw_msg),
> >>                           }
> >>
>
