Re: [incubator-ponymail-foal] branch master updated: store sources as sha3-256 of themselves, add a permalink reference to the digested doc.

Daniel Gruno Tue, 25 Aug 2020 11:43:25 -0700

On 25/08/2020 20.35, sebb wrote:

On Tue, 25 Aug 2020 at 19:23, Daniel Gruno <[email protected]> wrote:


On 25/08/2020 20.15, sebb wrote:

AFAICT this will generate different hashes for the same message if
they are loaded from a different source.


Yeah, it will - at present, that is on purpose. We can look at doing
something like using Sean's DKIM parser for this, and only hashing the
output from that, with the x-archived-list-id added in from the command
line --lid argument if different from the canonical list id.


Whilst it should ensure that distinct messages don't clash, it won't
weed out actual duplicates.


Right, aware of that. In most cases, if you are reloading, you are doing
so with a fresh DB, and it won't matter much. In cases where you are
"cascading" mbox files, it would make duplicates, but that's only a
question of disk space for now, having duplicate source files won't
cause malfunctions, just a few more bytes used and source alternatives.
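To make the trade-off above concrete, here is a minimal sketch of why hashing the raw bytes gives one message two different keys. The two copies and their Received headers are made-up examples, not data from the importer.

```python
import hashlib

# Two hypothetical copies of one email: same Message-ID and body, but a
# different Received header because they arrived via different routes.
copy_a = (
    b"Received: from mx1.example.org\r\n"
    b"Message-ID: <abc@example.org>\r\n"
    b"Subject: hello\r\n"
    b"\r\n"
    b"body\r\n"
)
copy_b = copy_a.replace(b"mx1.example.org", b"mx2.example.org")

# Same digest call as in the commit: hash the raw bytes as-is.
mid_a = hashlib.sha3_256(copy_a).hexdigest()
mid_b = hashlib.sha3_256(copy_b).hexdigest()

# The raw bytes differ, so the keys differ and both sources get stored.
print(mid_a == mid_b)
```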


This has implications for the API and the UI.

If there are multiple matches for a Permalink, in general one cannot
say which is correct, so all will have to be returned and displayed.

I'm pondering how to address this. Currently, the prototype will return
the first hit it finds that matches. This should really be fine, as they
are all valid sources, so returning one or the other would not matter
for the end-user.

For those that *do* care, which might be sysadmins looking into whether
an email was received twice or such, I'm thinking we might utilize the
300 return code (multiple choices) for this. This would mean we are not
duplicating it for the user in the UI, but that it's still available for
those that want to find it.
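A rough sketch of how that could look, framework-free. The function name, the `want_all` switch and the shape of the hit documents are all hypothetical; the thread does not specify how a caller would ask for the alternatives.

```python
from http import HTTPStatus


def source_response(hits, want_all=False):
    """Pick a response for a permalink lookup.

    hits: list of source documents sharing one permalink (hypothetical shape,
          each a dict with a "mid" key as in the commit's json_source).
    want_all: hypothetical opt-in for callers that want every alternative.
    """
    if not hits:
        return HTTPStatus.NOT_FOUND, None
    if want_all and len(hits) > 1:
        # 300 Multiple Choices: list the alternatives instead of picking one.
        return HTTPStatus.MULTIPLE_CHOICES, [hit["mid"] for hit in hits]
    # Default behaviour described above: return the first matching hit.
    return HTTPStatus.OK, hits[0]
```

A regular UI request would call `source_response(hits)` and get a single source back, while a sysadmin-style lookup would pass `want_all=True` and receive the list of keys.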


Note: in case there need to be changes to the hash input, I suggest a
prefix is added to identify the hash type.
It's not a problem if the key is slightly longer.
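As an illustration of the prefix idea, a minimal sketch. The `s3` tag, the `-` separator and the `source_key` helper are assumptions made for the example, not a format anyone in the thread has agreed on.

```python
import hashlib

# Hypothetical scheme tags mapped to the hash constructor they stand for.
HASHERS = {"s3": hashlib.sha3_256}


def source_key(raw: bytes, scheme: str = "s3") -> str:
    """Return a key such as 's3-<hexdigest>' so the hash type is self-describing."""
    digest = HASHERS[scheme](raw).hexdigest()
    return f"{scheme}-{digest}"
```

If the hash input changes later, a new tag can be added to `HASHERS` and old keys remain distinguishable from new ones by their prefix.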


Ideally the parts that vary between duplicates from different sources
should be omitted from the hash.
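One way this could be sketched with the standard library only: parse the message and hash a fixed set of headers plus the body, leaving out transport headers such as Received. The header list and the `stable_digest` helper are assumptions for illustration; this is not the DKIM-based approach mentioned earlier in the thread, and it ignores multipart bodies.

```python
import email
import hashlib

# Hypothetical choice of headers that should not vary between copies.
STABLE_HEADERS = ("message-id", "date", "from", "to", "subject")


def stable_digest(raw: bytes) -> str:
    """Hash only the parts of a message expected to be identical across sources."""
    msg = email.message_from_bytes(raw)
    hasher = hashlib.sha3_256()
    for name in STABLE_HEADERS:
        value = str(msg.get(name, "")).strip()
        hasher.update(f"{name}:{value}\n".encode("utf-8", errors="replace"))
    # For a non-multipart message this is the decoded body; multipart
    # messages yield None here and are treated as empty in this sketch.
    hasher.update(msg.get_payload(decode=True) or b"")
    return hasher.hexdigest()
```

With this, two copies that differ only in their Received headers produce the same digest, while a change to the body or to any listed header produces a different one.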

However it is better than the current situation.

On Tue, 25 Aug 2020 at 19:02, <[email protected]> wrote:


This is an automated email from the ASF dual-hosted git repository.

humbedooh pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-ponymail-foal.git


The following commit(s) were added to refs/heads/master by this push:
       new 8a1baf1  store sources as sha3-256 of themselves, add a permalink 
reference to the digested doc.
8a1baf1 is described below

commit 8a1baf16c64933f1fb7a9d281df8ebec418cf771
Author: Daniel Gruno <[email protected]>
AuthorDate: Tue Aug 25 20:02:44 2020 +0200

      store sources as sha3-256 of themselves, add a permalink reference to the 
digested doc.

      Conform to what archiver.py does.

      This allows us to store multiple copies of the same digested email, if
      received more than once (and if the raw data differs).
---
   tools/import-mbox.py | 7 ++++---
   1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/tools/import-mbox.py b/tools/import-mbox.py
index 482a2b8..663e6da 100755
--- a/tools/import-mbox.py
+++ b/tools/import-mbox.py
@@ -38,6 +38,7 @@ from os.path import isdir, isfile, join
   from threading import Lock, Thread
   from urllib.request import urlopen

+
   import archiver
   from plugins.elastic import Elastic

@@ -208,6 +209,7 @@ class SlurpThread(Thread):
               for key in messages.iterkeys():
                   message = messages.get(key)
                   message_raw = messages.get_bytes(key)
+                sha3 = hashlib.sha3_256(message_raw).hexdigest()
                   # If --filter is set, discard any messages not matching by 
continuing to next email
                   if (
                       fromFilter
@@ -313,9 +315,8 @@ class SlurpThread(Thread):
                       try:  # temporary hack to try and find an encoding issue
                           # needs to be replaced by proper exception handling
                           json_source = {
-                            "mid": json[
-                                "mid"
-                            ],  # needed for bulk-insert only, not needed in 
database
+                            "permalink": json["mid"],
+                            "mid": sha3,
                               "message-id": json["message-id"],
                               "source": archiver.mbox_source(raw_msg),
                           }
