To add some further thought to this, we should probably move away from the .as_bytes() as much as we can, as we have no guarantee it's going to be a stable return value (its code-base is outside our jurisdiction).

There are better alternatives for mbox files that uses the raw bytes with a seek and read, for import-mbox.py

On 25/08/2020 18.47, Daniel Gruno wrote:
On 25/08/2020 18.37, sebb wrote:
-1

This will likely change the generated ID for emails that already have
archived-at headers

No, it will not change that, as they already will have it. This is fixing an issue with randomness in the archiving.

See https://github.com/apache/incubator-ponymail-foal/blob/master/tools/archiver.py#L816

What happens is, an archived-at with *the current timestamp* will get added to all messages that do not have it, so it's not really helping at all with anything when you use the .as_bytes, as that data will always be different unless there already is such a header.

The raw_msg will deliver a much more reliable result going forward, as it doesn't add "random" data to the mix on each reload (because it is the original raw data without headers appended on the fly).

I hope this clears matters up.


Far better to ensure the archived-at header is not added to the parsed
mail before the generator is used.

On Tue, 25 Aug 2020 at 17:32, <[email protected]> wrote:

This is an automated email from the ASF dual-hosted git repository.

humbedooh pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-ponymail-foal.git


The following commit(s) were added to refs/heads/master by this push:
      new 6dfb9d8  Use raw_msg instead of as_bytes, as the latter has a new archived-at appended sometimes (and we don't want that)
6dfb9d8 is described below

commit 6dfb9d83b1fa6e0ae83bc61446a27fe555751f8c
Author: Daniel Gruno <[email protected]>
AuthorDate: Tue Aug 25 18:32:33 2020 +0200

     Use raw_msg instead of as_bytes, as the latter has a new archived-at appended sometimes (and we don't want that)
---
  tools/plugins/generators.py | 8 ++++----
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/tools/plugins/generators.py b/tools/plugins/generators.py
index 949a3b6..344f115 100644
--- a/tools/plugins/generators.py
+++ b/tools/plugins/generators.py
@@ -151,7 +151,7 @@ def dkim(_msg, _body, lid, _attachments, raw_msg):
  # Full generator: uses the entire email (including server-dependent data)
  # Used by default until August 2020.
  # See 'dkim' for recommended generation.
-def full(msg, _body, lid, _attachments, _raw_msg):
+def full(_msg, _body, lid, _attachments, raw_msg):
      """
      Full generator: uses the entire email
      (including server-dependent data)
@@ -159,15 +159,15 @@ def full(msg, _body, lid, _attachments, _raw_msg):
      but different copies of the message are likely to have different headers, thus ids

      Parameters:
-    msg - the parsed message
+    msg - the parsed message (not used)
      _body - the parsed text content (not used)
      lid - list id
      _attachments - list of attachments (not used)
-    _raw_msg - the original message bytes (not used)
+    raw_msg - the original message bytes

      Returns: "<hash>@<lid>" where hash is sha224 of message bytes
      """
-    mid = "%s@%s" % (hashlib.sha224(msg.as_bytes()).hexdigest(), lid)
+    mid = "%s@%s" % (hashlib.sha224(raw_msg).hexdigest(), lid)
      return mid





Reply via email to