On 2022-05-08 19:15, Barry Scott wrote:


On 7 May 2022, at 22:31, Chris Angelico <ros...@gmail.com> wrote:

On Sun, 8 May 2022 at 07:19, Stefan Ram <r...@zedat.fu-berlin.de> wrote:

MRAB <pyt...@mrabarnett.plus.com> writes:
On 2022-05-07 19:47, Stefan Ram wrote:
...
def encoding( name ):
  path = pathlib.Path( name )
  for encoding in( "utf_8", "latin_1", "cp1252" ):
      try:
          with path.open( encoding=encoding, errors="strict" )as file:
              text = file.read()
          return encoding
      except UnicodeDecodeError:
          pass
  return "ascii"
Yes, it's potentially slow and might be wrong.
The result "ascii" might mean it's a binary file.
"latin-1" will decode any sequence of bytes, so it'll never try
"cp1252", nor fall back to "ascii", and falling back to "ascii" is wrong
anyway because the file could contain 0x80..0xFF, which aren't supported
by that encoding.
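
Both points can be checked directly. This is only a small sketch; the helper name decodes_as is made up for illustration and is not part of the posted function.

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if data decodes under encoding with errors="strict"."""
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


all_bytes = bytes(range(256))  # every possible byte value, 0x00..0xFF
high_byte = b"\x80"            # first byte value outside the ASCII range

# Latin-1 maps all 256 byte values, so a strict decode never raises.
latin1_ok = decodes_as(all_bytes, "latin_1")

# ASCII stops at 0x7F, so any byte in 0x80..0xFF is rejected.
ascii_ok = decodes_as(high_byte, "ascii")
```

Here latin1_ok is True and ascii_ok is False, which is why a loop that tries "latin_1" second can never reach a third candidate.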

 Thank you! It's working for my specific application where
 I'm reading from a collection of text files that should be
 encoded in either utf_8, latin_1, or ascii.


In that case, I'd exclude ASCII from the check, and just check UTF-8,
and if that fails, decode as Latin-1. Any ASCII files will decode
correctly as UTF-8, and any file will decode as Latin-1.
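
A minimal sketch of that suggestion, assuming the whole file fits in memory. The function names decode_utf8_or_latin1 and read_text_guessing are hypothetical, and reading as bytes skips the newline translation that text mode would do.

```python
import pathlib


def decode_utf8_or_latin1(data):
    """Return (text, encoding_used) for a bytes object."""
    try:
        # Pure-ASCII data is also valid UTF-8, so no separate ASCII check.
        return data.decode("utf_8", errors="strict"), "utf_8"
    except UnicodeDecodeError:
        # Latin-1 accepts every byte value, so this branch cannot fail.
        return data.decode("latin_1"), "latin_1"


def read_text_guessing(name):
    """Read a file as bytes, then apply the UTF-8 / Latin-1 fallback."""
    data = pathlib.Path(name).read_bytes()
    return decode_utf8_or_latin1(data)
```

Called as text, used = read_text_guessing("some_file.txt"), it reports which of the two encodings was applied.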

I've used this exact fallback system when decoding raw data from
Unicode-naive servers - they accept and share bytes, so it's entirely
possible to have a mix of encodings in a single stream. As long as you
can define the span of a single "unit" (say, a line, or a chunk in
some form), you can read as bytes and do the exact same "decode as
UTF-8 if possible, otherwise decode as Latin-1" dance. Sure, it's not
perfectly ideal, but it's about as good as you'll get with a lot of
US-based servers. (Depending on context, you might use CP-1252 instead
of Latin-1, but you might need errors="replace" there, since
Windows-1252 has some undefined byte values.)
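
The per-unit version might look roughly like this, assuming a "unit" is one line of a bytes stream. The name decode_unit and the sample data are invented for illustration.

```python
def decode_unit(raw, fallback="latin_1"):
    """Decode one unit as UTF-8 if possible, otherwise with the fallback."""
    try:
        return raw.decode("utf_8", errors="strict")
    except UnicodeDecodeError:
        # errors="replace" only matters for fallbacks such as cp1252,
        # where a few byte values have no character assigned.
        return raw.decode(fallback, errors="replace")


# A stream that mixes encodings line by line.
stream = (
    "caf\u00e9\n".encode("utf_8")      # UTF-8 line
    + "caf\u00e9\n".encode("latin_1")  # Latin-1 line
    + b"plain\n"                       # ASCII line
)

lines = [decode_unit(line) for line in stream.splitlines()]

# 0x81 is undefined in Python's cp1252 codec, so it becomes U+FFFD.
replaced = decode_unit(b"\x81", fallback="cp1252")
```

All three lines come out as the intended text, and the undefined byte is turned into the replacement character instead of raising.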

A very common error on Windows is that files, and especially web pages, that
claim to be UTF-8 are in fact CP-1252.

There is logic in the HTML standards to try UTF-8 and, if that fails, fall back to
CP-1252.

It's usually the left and right "smart" quote chars that cause the issue, as they
encode as invalid UTF-8.
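
The smart-quote case is easy to reproduce. This is just an illustrative sketch with a made-up sample string.

```python
quoted = "\u201chello\u201d"            # hello wrapped in curly double quotes
cp1252_bytes = quoted.encode("cp1252")  # the quotes become 0x93 and 0x94

# 0x93 and 0x94 are UTF-8 continuation bytes with no lead byte before
# them, so a strict UTF-8 decode rejects the data.
try:
    cp1252_bytes.decode("utf_8", errors="strict")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

as_cp1252 = cp1252_bytes.decode("cp1252")   # curly quotes restored
as_latin1 = cp1252_bytes.decode("latin_1")  # C1 control chars U+0093, U+0094
```

valid_utf8 ends up False. Decoding the same bytes as Latin-1 succeeds but yields control characters rather than quotes, which is where the two fallbacks differ.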

Is it CP-1252 or ISO-8859-1 (Latin-1)?
--
https://mail.python.org/mailman/listinfo/python-list