On 06/09/2020 14.04, Daniel Gruno wrote:
On 05/09/2020 17.33, sebb wrote:
On Sat, 5 Sep 2020 at 08:54, <[email protected]> wrote:

This is an automated email from the ASF dual-hosted git repository.

humbedooh pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-ponymail-foal.git


The following commit(s) were added to refs/heads/master by this push:
      new fafc765  Refactor, drop the double decode attempt.
fafc765 is described below

commit fafc7651d9d02dfde727bd1f0da13722de8b3c38
Author: Daniel Gruno <[email protected]>
AuthorDate: Sat Sep 5 09:54:03 2020 +0200

     Refactor, drop the double decode attempt.

     We should assume US-ASCII, but if it's not, it's quicker,
     processing-wise, to immediately fall back to utf-8 instead of trying to
     first determine if it is indeed UTF-8-worthy. Either it'll work as
     US-ASCII, or it will work with the UTF-8 with 'replace'.

This info belongs in the code.


And it was put in the code as well. If you don't like me doing long git comments, just say that.


I'll add that the comment mainly referred to me time-testing the code and figuring out that this was faster.

That it's 0.1s quicker than the old version doesn't belong in the code itself. imho.

You could argue that figuring out a speed improvement could be mentioned on dev@, but I did not think it worth the time for such a small change.


---
  tools/archiver.py | 13 ++++++++-----
  1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/tools/archiver.py b/tools/archiver.py
index f875cc3..c52207b 100755
--- a/tools/archiver.py
+++ b/tools/archiver.py
@@ -192,12 +192,15 @@ class Body:
                          break
                      except UnicodeDecodeError:
                          pass
+            # If no character set was defined, the email MUST be US-ASCII by RFC822 defaults
+            # This isn't always the case, as we're about to discover.
              if not self.string:
-                self.string = self.bytes.decode("us-ascii", errors="replace")
-                if valid_encodings:
-                    self.character_set = "us-ascii"
-                # If no character encoding, but we find non-ASCII chars, assume bytes were UTF-8 -                elif len(self.bytes) != len(self.bytes.decode("us-ascii", "ignore")):
+                try:
+                    self.string = self.bytes.decode("us-ascii", errors="strict")
+                    if valid_encodings:
+                        self.character_set = "us-ascii"
+                except UnicodeDecodeError:
+                    # If us-ascii strict fails, it's probably undeclared UTF-8.                       # Set the .string, but not a character set, as we don't know it for sure.                       # This is mainly so the older generators won't barf.                       self.string = self.bytes.decode("utf-8", "replace")



Reply via email to