On 25/08/2020 20.35, sebb wrote:
On Tue, 25 Aug 2020 at 19:23, Daniel Gruno <[email protected]> wrote:
On 25/08/2020 20.15, sebb wrote:
AFAICT this will generate different hashes for the same message if
they are loaded from a different source.
Yeah, it will - at present, that is on purpose. We can look at doing
something like using Sean's DKIM parser for this, and only hashing the
output from that, with the x-archived-list-id added in from the command
line --lid argument if different from the canonical list id.
Whilst it should ensure that distinct messages don't clash, it won't
weed out actual duplicates.
Right, aware of that. In most cases, if you are reloading, you are doing
so with a fresh DB, and it won't matter much. In cases where you are
"cascading" mbox files, it would create duplicates, but for now that is
only a question of disk space: duplicate source files won't cause
malfunctions, just a few more bytes used and some alternative sources.
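To make the duplicate case concrete, here is a minimal sketch using only the Python standard library. The two sample messages and their header values are made up for illustration; only the hashlib.sha3_256(...).hexdigest() call mirrors what the commit does. It shows one email, stored by two sources whose raw bytes differ in a single Received header, ending up with two different source ids.

```python
import hashlib
from email import message_from_bytes

# Two hypothetical raw copies of one email, as two sources might store it.
# Only the Received header differs; Message-ID and body are identical.
COPY_A = (
    b"Received: from mx1.example.org by archive-a.example.org\r\n"
    b"Message-ID: <abc123@example.org>\r\n"
    b"Subject: hello\r\n"
    b"\r\n"
    b"Same body.\r\n"
)
COPY_B = COPY_A.replace(b"archive-a", b"archive-b")


def source_id(raw: bytes) -> str:
    """Hash the raw message bytes, as the commit does for source documents."""
    return hashlib.sha3_256(raw).hexdigest()


id_a = source_id(COPY_A)
id_b = source_id(COPY_B)

# Both copies are the same email as far as the Message-ID is concerned...
same_message = (
    message_from_bytes(COPY_A)["Message-ID"]
    == message_from_bytes(COPY_B)["Message-ID"]
)
# ...but any byte-level difference gives a different digest.
ids_differ = id_a != id_b
```

Under these assumptions, both copies would be stored side by side, each pointing at the same permalink.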
This has implications for the API and the UI.
If there are multiple matches for a Permalink, in general one cannot
say which is correct, so all will have to be returned and displayed.
I'm pondering how to address this. Currently, the prototype will return
the first hit it finds that matches. This should really be fine, as they
are all valid sources, so returning one or the other would not matter
for the end-user.
For those that *do* care, which might be sysadmins looking into whether
an email was received twice or such, I'm thinking we might utilize the
HTTP 300 status code (Multiple Choices) for this. This would mean we are
not duplicating it for the user in the UI, but that it's still available
for those that want to find it.
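A rough sketch of how that lookup could behave, not the prototype's actual code. The function name lookup_sources, the in-memory list standing in for the source index, and the want_all flag (how a caller would signal that it cares about duplicates is not decided in this thread) are all assumptions for illustration. Only the "permalink" and "mid" field names come from the commit.

```python
from http import HTTPStatus
from typing import Dict, List, Optional, Tuple, Union

# Hypothetical stand-in for the source index: each entry is a source
# document carrying the "permalink" and "mid" fields from the commit.
Source = Dict[str, str]
Body = Optional[Union[Source, List[str]]]


def lookup_sources(
    permalink: str, index: List[Source], want_all: bool = False
) -> Tuple[int, Body]:
    """Return (HTTP status, body) for a permalink lookup."""
    matches = [doc for doc in index if doc["permalink"] == permalink]
    if not matches:
        return HTTPStatus.NOT_FOUND.value, None
    if want_all and len(matches) > 1:
        # 300 Multiple Choices: list the ids of every stored copy.
        return HTTPStatus.MULTIPLE_CHOICES.value, [doc["mid"] for doc in matches]
    # Default: the first hit, since every copy is a valid source.
    return HTTPStatus.OK.value, matches[0]
```

With this shape, the UI path never sees duplicates, while a caller that opts in gets the list of alternative source ids to choose from.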
Note: in case the hash input needs to change later, I suggest adding a
prefix to identify the hash type.
It's not a problem if the key is slightly longer.
Ideally the parts that vary between duplicates from different sources
should be omitted from the hash.
However it is better than the current situation.
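The two suggestions above (a type prefix on the key, and leaving source-dependent parts out of the hash) could be sketched roughly as follows. This is an illustration under stated assumptions, not what the commit implements: the "sha3-256:" prefix format, the prefixed_key name, and the choice of Received as the only volatile header are all guesses, and re-serializing through the email package is a simplification of whatever canonical form would really be used.

```python
import hashlib
from email import message_from_bytes

# Assumption: the prefix format and the header list are illustrative only.
HASH_PREFIX = "sha3-256"
VOLATILE_HEADERS = ("Received",)


def prefixed_key(raw: bytes) -> str:
    """Hash a message with source-dependent headers removed, tagged by hash type."""
    msg = message_from_bytes(raw)
    for name in VOLATILE_HEADERS:
        # Deletes every occurrence of the header; no error if it is absent.
        del msg[name]
    digest = hashlib.sha3_256(msg.as_bytes()).hexdigest()
    return f"{HASH_PREFIX}:{digest}"
```

Two copies that differ only in the listed headers would then share one key, and a later change of hash input or algorithm could be told apart by its prefix.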
On Tue, 25 Aug 2020 at 19:02, <[email protected]> wrote:
This is an automated email from the ASF dual-hosted git repository.
humbedooh pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-ponymail-foal.git
The following commit(s) were added to refs/heads/master by this push:
new 8a1baf1 store sources as sha3-256 of themselves, add a permalink
reference to the digested doc.
8a1baf1 is described below
commit 8a1baf16c64933f1fb7a9d281df8ebec418cf771
Author: Daniel Gruno <[email protected]>
AuthorDate: Tue Aug 25 20:02:44 2020 +0200
store sources as sha3-256 of themselves, add a permalink reference to the
digested doc.
Conform to what archiver.py does.
This allows us to store multiple copies of the same digested email, if
received more than once (and if the raw data differs).
---
tools/import-mbox.py | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/tools/import-mbox.py b/tools/import-mbox.py
index 482a2b8..663e6da 100755
--- a/tools/import-mbox.py
+++ b/tools/import-mbox.py
@@ -38,6 +38,7 @@ from os.path import isdir, isfile, join
from threading import Lock, Thread
from urllib.request import urlopen
+
import archiver
from plugins.elastic import Elastic
@@ -208,6 +209,7 @@ class SlurpThread(Thread):
for key in messages.iterkeys():
message = messages.get(key)
message_raw = messages.get_bytes(key)
+ sha3 = hashlib.sha3_256(message_raw).hexdigest()
# If --filter is set, discard any messages not matching by continuing to next email
if (
fromFilter
@@ -313,9 +315,8 @@ class SlurpThread(Thread):
try: # temporary hack to try and find an encoding issue
# needs to be replaced by proper exception handling
json_source = {
- "mid": json[
- "mid"
- ], # needed for bulk-insert only, not needed in database
+ "permalink": json["mid"],
+ "mid": sha3,
"message-id": json["message-id"],
"source": archiver.mbox_source(raw_msg),
}