On Tue, 25 Aug 2020 at 17:47, Daniel Gruno <[email protected]> wrote: > > On 25/08/2020 18.37, sebb wrote: > > -1 > > > > This will likely change the generated ID for emails that already have > > archived-at headers > > No, it will not change that, as they already will have it. This is > fixing an issue with randomness in the archiving. > > See > https://github.com/apache/incubator-ponymail-foal/blob/master/tools/archiver.py#L816
Yes, that adds an archived-at header if the mail does not already have one. Since this is added before the generator is called, it may cause generators to produce different output. This particularly affects the full generator. If the message already has an archived-at header, then the full generator output should be stable, provided it is given the same input. But if the input is switched to raw bytes from the reconstructed message, this will most likely change it, unnecessarily changing at least some ids. > > What happens is, an archived-at with *the current timestamp* will get > added to all messages that do not have it, so it's not really helping at > all with anything when you use the .as_bytes, as that data will always > be different unless there already is such a header. > > The raw_msg will deliver a much more reliable result going forward, as > it doesn't add "random" data to the mix on each reload (because it is > the original raw data without headers appended on the fly). > > I hope this clears matters up. > > > > > Far better to ensure the archived-at header is not added to the parsed > > mail before the generator is used. > > > > On Tue, 25 Aug 2020 at 17:32, <[email protected]> wrote: > >> > >> This is an automated email from the ASF dual-hosted git repository. > >> > >> humbedooh pushed a commit to branch master > >> in repository > >> https://gitbox.apache.org/repos/asf/incubator-ponymail-foal.git > >> > >> > >> The following commit(s) were added to refs/heads/master by this push: > >> new 6dfb9d8 Use raw_msg instead of as_bytes, as the latter has a > >> new archived-at appended sometimes (and we don't want that) > >> 6dfb9d8 is described below > >> > >> commit 6dfb9d83b1fa6e0ae83bc61446a27fe555751f8c > >> Author: Daniel Gruno <[email protected]> > >> AuthorDate: Tue Aug 25 18:32:33 2020 +0200 > >> > >> Use raw_msg instead of as_bytes, as the latter has a new archived-at > >> appended sometimes (and we don't want that) > >> --- > >> tools/plugins/generators.py | 8 ++++---- > >> 1 file changed, 4 insertions(+), 4 deletions(-) > >> > >> diff --git a/tools/plugins/generators.py b/tools/plugins/generators.py > >> index 949a3b6..344f115 100644 > >> --- a/tools/plugins/generators.py > >> +++ b/tools/plugins/generators.py > >> @@ -151,7 +151,7 @@ def dkim(_msg, _body, lid, _attachments, raw_msg): > >> # Full generator: uses the entire email (including server-dependent data) > >> # Used by default until August 2020. > >> # See 'dkim' for recommended generation. > >> -def full(msg, _body, lid, _attachments, _raw_msg): > >> +def full(_msg, _body, lid, _attachments, raw_msg): > >> """ > >> Full generator: uses the entire email > >> (including server-dependent data) > >> @@ -159,15 +159,15 @@ def full(msg, _body, lid, _attachments, _raw_msg): > >> but different copies of the message are likely to have different > >> headers, thus ids > >> > >> Parameters: > >> - msg - the parsed message > >> + msg - the parsed message (not used) > >> _body - the parsed text content (not used) > >> lid - list id > >> _attachments - list of attachments (not used) > >> - _raw_msg - the original message bytes (not used) > >> + raw_msg - the original message bytes > >> > >> Returns: "<hash>@<lid>" where hash is sha224 of message bytes > >> """ > >> - mid = "%s@%s" % (hashlib.sha224(msg.as_bytes()).hexdigest(), lid) > >> + mid = "%s@%s" % (hashlib.sha224(raw_msg).hexdigest(), lid) > >> return mid > >> > >> > >> >
