On Tue, 25 Aug 2020 at 19:23, Daniel Gruno <[email protected]> wrote:
>
> On 25/08/2020 20.15, sebb wrote:
> > AFAICT this will generate different hashes for the same message if
> > they are loaded from a different source.
>
> Yeah, it will - at present, that is on purpose. We can look at doing
> something like using Sean's DKIM parser for this, and only hashing the
> output from that, with the x-archived-list-id added in from the command
> line --lid argument if different from the canonical list id.
>
> >
> > Whilst it should ensure that distinct messages don't clash, it won't
> > weed out actual duplicates.
>
> Right, aware of that. In most cases, if you are reloading, you are doing
> so with a fresh DB, and it won't matter much. In cases where you are
> "cascading" mbox files, it would make duplicates, but that's only a
> question of disk space for now, having duplicate source files won't
> cause malfunctions, just a few more bytes used and source alternatives.

This has implications for the API and the UI.
If there are multiple matches for a Permalink, in general one cannot
say which is correct, so all will have to be returned and displayed.

Note: in case there need to be changes to the hash input, I suggest a
prefix is added to identify the hash type.
It's not a problem if the key is slightly longer.

> >
> > Ideally the parts that vary between duplicates from different sources
> > should be omitted from the hash.
> >
> > However it is better than the current situation.
> >
> > On Tue, 25 Aug 2020 at 19:02, <[email protected]> wrote:
> >>
> >> This is an automated email from the ASF dual-hosted git repository.
> >>
> >> humbedooh pushed a commit to branch master
> >> in repository
> >> https://gitbox.apache.org/repos/asf/incubator-ponymail-foal.git
> >>
> >>
> >> The following commit(s) were added to refs/heads/master by this push:
> >>      new 8a1baf1  store sources as sha3-256 of themselves, add a
> >> permalink reference to the digested doc.
> >> 8a1baf1 is described below
> >>
> >> commit 8a1baf16c64933f1fb7a9d281df8ebec418cf771
> >> Author: Daniel Gruno <[email protected]>
> >> AuthorDate: Tue Aug 25 20:02:44 2020 +0200
> >>
> >>     store sources as sha3-256 of themselves, add a permalink reference to
> >> the digested doc.
> >>
> >>     Conform to what archiver.py does.
> >>
> >>     This allows us to store multiple copies of the same digested email, if
> >> received more than once (and if the raw data differs).
> >> ---
> >>  tools/import-mbox.py | 7 ++++---
> >>  1 file changed, 4 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/tools/import-mbox.py b/tools/import-mbox.py
> >> index 482a2b8..663e6da 100755
> >> --- a/tools/import-mbox.py
> >> +++ b/tools/import-mbox.py
> >> @@ -38,6 +38,7 @@ from os.path import isdir, isfile, join
> >>  from threading import Lock, Thread
> >>  from urllib.request import urlopen
> >>
> >> +
> >>  import archiver
> >>  from plugins.elastic import Elastic
> >>
> >> @@ -208,6 +209,7 @@ class SlurpThread(Thread):
> >>              for key in messages.iterkeys():
> >>                  message = messages.get(key)
> >>                  message_raw = messages.get_bytes(key)
> >> +                sha3 = hashlib.sha3_256(message_raw).hexdigest()
> >>                  # If --filter is set, discard any messages not matching
> >> by continuing to next email
> >>                  if (
> >>                      fromFilter
> >> @@ -313,9 +315,8 @@ class SlurpThread(Thread):
> >>                  try:  # temporary hack to try and find an encoding
> >> issue
> >>                      # needs to be replaced by proper exception
> >> handling
> >>                      json_source = {
> >> -                        "mid": json[
> >> -                            "mid"
> >> -                        ],  # needed for bulk-insert only, not needed
> >> in database
> >> +                        "permalink": json["mid"],
> >> +                        "mid": sha3,
> >>                          "message-id": json["message-id"],
> >>                          "source": archiver.mbox_source(raw_msg),
> >>                      }
> >>
>
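
To illustrate the prefix suggestion above, here is a minimal sketch in
Python. It is not taken from import-mbox.py or archiver.py: the function
names source_key and split_key, and the "sha3-256:" label format, are
assumptions made up for this example. The only part that matches the
commit is hashing the raw message bytes with hashlib.sha3_256.

```python
import hashlib

# Hypothetical label for the hashing scheme used in the commit above.
# The label text and the ":" separator are assumptions for illustration.
SCHEME_SHA3_RAW = "sha3-256"


def source_key(message_raw: bytes, scheme: str = SCHEME_SHA3_RAW) -> str:
    """Return a source document key tagged with the scheme that produced it."""
    if scheme == SCHEME_SHA3_RAW:
        digest = hashlib.sha3_256(message_raw).hexdigest()
    else:
        raise ValueError(f"unknown hash scheme: {scheme}")
    return f"{scheme}:{digest}"


def split_key(key: str) -> tuple[str, str]:
    """Split a prefixed key back into (scheme, digest)."""
    scheme, _, digest = key.partition(":")
    return scheme, digest


# Two copies of one message that differ only in a header added in transit
# still get different keys, since the whole raw source is hashed.
copy_a = b"Message-ID: <1@example.org>\n\nhello\n"
copy_b = b"Received: from relay.example.org\n" + copy_a
key_a = source_key(copy_a)
key_b = source_key(copy_b)
```

If the hash input changes later (for example, hashing only the stable
parts of a message), a new label could be added alongside the old one, and
existing keys would remain distinguishable by their prefix.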
