On Tue, 25 Aug 2020 at 18:49, Daniel Gruno <[email protected]> wrote:
>
> On 25/08/2020 19.43, sebb wrote:
> > On Tue, 25 Aug 2020 at 17:47, Daniel Gruno <[email protected]> wrote:
> >>
> >> On 25/08/2020 18.37, sebb wrote:
> >>> -1
> >>>
> >>> This will likely change the generated ID for emails that already have
> >>> archived-at headers
> >>
> >> No, it will not change that, as they already will have it. This is
> >> fixing an issue with randomness in the archiving.
> >>
> >> See
> >> https://github.com/apache/incubator-ponymail-foal/blob/master/tools/archiver.py#L816
> >
> > Yes, that adds an archived-at header if the mail does not already have one.
> > Since this is added before the generator is called, it may cause
> > generators to produce different output.
> > This particularly affects the full generator.
>
> Right, if I have an email without the archived-at header, then reloading
> that will produce a different result every time. That's what I was
> hoping to avoid.

Which is easy to do - don't add the archived-at header, at least not
until the generator has been called.

This will also fix the issue for other generators that may make use of
the header.

> >
> > If the message already has an archived-at header, then the full
> > generator output should be stable, provided it is given the same
> > input.
> >
> > But if the input is switched to raw bytes from the reconstructed
> > message, this will most likely change it, unnecessarily changing at
> > least some ids.
>
> As I just have discovered, this is in fact happening, sort of, because
> .as_bytes does (unintended) normalization of certain headers :\
>
> I guess we'll have to rely on .as_bytes to never change that behavior
> instead, for the full generator. Unrelated to your -1, but still
> annoying as heck.

Yes, that's a separate issue.
Future versions may or may not be different.
But at least if users are happy to stay on the current version the
output will be the same.

> I've also noticed that the unit tests are spitting out a toooon of error
> messages that I can't quite understand. unrelated to the test results.
>
> >
> >>
> >> What happens is, an archived-at with *the current timestamp* will get
> >> added to all messages that do not have it, so it's not really helping at
> >> all with anything when you use the .as_bytes, as that data will always
> >> be different unless there already is such a header.
> >>
> >> The raw_msg will deliver a much more reliable result going forward, as
> >> it doesn't add "random" data to the mix on each reload (because it is
> >> the original raw data without headers appended on the fly).
> >>
> >> I hope this clears matters up.
> >>
> >>>
> >>> Far better to ensure the archived-at header is not added to the parsed
> >>> mail before the generator is used.
> >>>
> >>> On Tue, 25 Aug 2020 at 17:32, <[email protected]> wrote:
> >>>>
> >>>> This is an automated email from the ASF dual-hosted git repository.
> >>>>
> >>>> humbedooh pushed a commit to branch master
> >>>> in repository 
> >>>> https://gitbox.apache.org/repos/asf/incubator-ponymail-foal.git
> >>>>
> >>>>
> >>>> The following commit(s) were added to refs/heads/master by this push:
> >>>>        new 6dfb9d8  Use raw_msg instead of as_bytes, as the latter has a 
> >>>> new archived-at appended sometimes (and we don't want that)
> >>>> 6dfb9d8 is described below
> >>>>
> >>>> commit 6dfb9d83b1fa6e0ae83bc61446a27fe555751f8c
> >>>> Author: Daniel Gruno <[email protected]>
> >>>> AuthorDate: Tue Aug 25 18:32:33 2020 +0200
> >>>>
> >>>>       Use raw_msg instead of as_bytes, as the latter has a new 
> >>>> archived-at appended sometimes (and we don't want that)
> >>>> ---
> >>>>    tools/plugins/generators.py | 8 ++++----
> >>>>    1 file changed, 4 insertions(+), 4 deletions(-)
> >>>>
> >>>> diff --git a/tools/plugins/generators.py b/tools/plugins/generators.py
> >>>> index 949a3b6..344f115 100644
> >>>> --- a/tools/plugins/generators.py
> >>>> +++ b/tools/plugins/generators.py
> >>>> @@ -151,7 +151,7 @@ def dkim(_msg, _body, lid, _attachments, raw_msg):
> >>>>    # Full generator: uses the entire email (including server-dependent 
> >>>> data)
> >>>>    # Used by default until August 2020.
> >>>>    # See 'dkim' for recommended generation.
> >>>> -def full(msg, _body, lid, _attachments, _raw_msg):
> >>>> +def full(_msg, _body, lid, _attachments, raw_msg):
> >>>>        """
> >>>>        Full generator: uses the entire email
> >>>>        (including server-dependent data)
> >>>> @@ -159,15 +159,15 @@ def full(msg, _body, lid, _attachments, _raw_msg):
> >>>>        but different copies of the message are likely to have different 
> >>>> headers, thus ids
> >>>>
> >>>>        Parameters:
> >>>> -    msg - the parsed message
> >>>> +    msg - the parsed message (not used)
> >>>>        _body - the parsed text content (not used)
> >>>>        lid - list id
> >>>>        _attachments - list of attachments (not used)
> >>>> -    _raw_msg - the original message bytes (not used)
> >>>> +    raw_msg - the original message bytes
> >>>>
> >>>>        Returns: "<hash>@<lid>" where hash is sha224 of message bytes
> >>>>        """
> >>>> -    mid = "%s@%s" % (hashlib.sha224(msg.as_bytes()).hexdigest(), lid)
> >>>> +    mid = "%s@%s" % (hashlib.sha224(raw_msg).hexdigest(), lid)
> >>>>        return mid
> >>>>
> >>>>
> >>>>
> >>
>

Reply via email to