On Tue, 25 Aug 2020 at 18:49, Daniel Gruno <[email protected]> wrote: > > On 25/08/2020 19.43, sebb wrote: > > On Tue, 25 Aug 2020 at 17:47, Daniel Gruno <[email protected]> wrote: > >> > >> On 25/08/2020 18.37, sebb wrote: > >>> -1 > >>> > >>> This will likely change the generated ID for emails that already have > >>> archived-at headers > >> > >> No, it will not change that, as they already will have it. This is > >> fixing an issue with randomness in the archiving. > >> > >> See > >> https://github.com/apache/incubator-ponymail-foal/blob/master/tools/archiver.py#L816 > > > > Yes, that adds an archived-at header if the mail does not already have one. > > Since this is added before the generator is called, it may cause > > generators to produce different output. > > This particularly affects the full generator. > > Right, if I have an email without the archived-at header, then reloading > that will produce a different result every time. That's what I was > hoping to avoid.
Which is easy to do - don't add the archived-at header, at least not until the generator has been called. This will also fix the issue for other generators that may make use of the header. > > > > If the message already has an archived-at header, then the full > > generator output should be stable, provided it is given the same > > input. > > > > But if the input is switched to raw bytes from the reconstructed > > message, this will most likely change it, unnecessarily changing at > > least some ids. > > As I just have discovered, this is in fact happening, sort of, because > .as_bytes does (unintended) normalization of certain headers :\ > > I guess we'll have to rely on .as_bytes to never change that behavior > instead, for the full generator. Unrelated to your -1, but still > annoying as heck. Yes, that's a separate issue. Future versions may or may not be different. But at least if users are happy to stay on the current version the output will be the same. > I've also noticed that the unit tests are spitting out a toooon of error > messages that I can't quite understand. unrelated to the test results. > > > > >> > >> What happens is, an archived-at with *the current timestamp* will get > >> added to all messages that do not have it, so it's not really helping at > >> all with anything when you use the .as_bytes, as that data will always > >> be different unless there already is such a header. > >> > >> The raw_msg will deliver a much more reliable result going forward, as > >> it doesn't add "random" data to the mix on each reload (because it is > >> the original raw data without headers appended on the fly). > >> > >> I hope this clears matters up. > >> > >>> > >>> Far better to ensure the archived-at header is not added to the parsed > >>> mail before the generator is used. > >>> > >>> On Tue, 25 Aug 2020 at 17:32, <[email protected]> wrote: > >>>> > >>>> This is an automated email from the ASF dual-hosted git repository. > >>>> > >>>> humbedooh pushed a commit to branch master > >>>> in repository > >>>> https://gitbox.apache.org/repos/asf/incubator-ponymail-foal.git > >>>> > >>>> > >>>> The following commit(s) were added to refs/heads/master by this push: > >>>> new 6dfb9d8 Use raw_msg instead of as_bytes, as the latter has a > >>>> new archived-at appended sometimes (and we don't want that) > >>>> 6dfb9d8 is described below > >>>> > >>>> commit 6dfb9d83b1fa6e0ae83bc61446a27fe555751f8c > >>>> Author: Daniel Gruno <[email protected]> > >>>> AuthorDate: Tue Aug 25 18:32:33 2020 +0200 > >>>> > >>>> Use raw_msg instead of as_bytes, as the latter has a new > >>>> archived-at appended sometimes (and we don't want that) > >>>> --- > >>>> tools/plugins/generators.py | 8 ++++---- > >>>> 1 file changed, 4 insertions(+), 4 deletions(-) > >>>> > >>>> diff --git a/tools/plugins/generators.py b/tools/plugins/generators.py > >>>> index 949a3b6..344f115 100644 > >>>> --- a/tools/plugins/generators.py > >>>> +++ b/tools/plugins/generators.py > >>>> @@ -151,7 +151,7 @@ def dkim(_msg, _body, lid, _attachments, raw_msg): > >>>> # Full generator: uses the entire email (including server-dependent > >>>> data) > >>>> # Used by default until August 2020. > >>>> # See 'dkim' for recommended generation. > >>>> -def full(msg, _body, lid, _attachments, _raw_msg): > >>>> +def full(_msg, _body, lid, _attachments, raw_msg): > >>>> """ > >>>> Full generator: uses the entire email > >>>> (including server-dependent data) > >>>> @@ -159,15 +159,15 @@ def full(msg, _body, lid, _attachments, _raw_msg): > >>>> but different copies of the message are likely to have different > >>>> headers, thus ids > >>>> > >>>> Parameters: > >>>> - msg - the parsed message > >>>> + msg - the parsed message (not used) > >>>> _body - the parsed text content (not used) > >>>> lid - list id > >>>> _attachments - list of attachments (not used) > >>>> - _raw_msg - the original message bytes (not used) > >>>> + raw_msg - the original message bytes > >>>> > >>>> Returns: "<hash>@<lid>" where hash is sha224 of message bytes > >>>> """ > >>>> - mid = "%s@%s" % (hashlib.sha224(msg.as_bytes()).hexdigest(), lid) > >>>> + mid = "%s@%s" % (hashlib.sha224(raw_msg).hexdigest(), lid) > >>>> return mid > >>>> > >>>> > >>>> > >> >
