On 2022-05-07 19:35, Marco Sulla wrote:
> On Sat, 7 May 2022 at 19:02, MRAB <pyt...@mrabarnett.plus.com> wrote:
>
> On 2022-05-07 17:28, Marco Sulla wrote:
> > On Sat, 7 May 2022 at 16:08, Barry <ba...@barrys-emacs.org> wrote:
> >> You need to handle the file in bin mode and do the handling of line
> >> endings and encodings yourself. It’s not that hard for the cases you wanted.
> >
> >>>> "\n".encode("utf-16")
> > b'\xff\xfe\n\x00'
> >>>> "".encode("utf-16")
> > b'\xff\xfe'
> >>>> "a\nb".encode("utf-16")
> > b'\xff\xfea\x00\n\x00b\x00'
> >>>> "\n".encode("utf-16").lstrip("".encode("utf-16"))
> > b'\n\x00'
> >
> > Can I use the last trick to get the encoding of a LF or a CR in any
> > encoding?
>
> In the case of UTF-16, it's 2 bytes per code unit, but those 2 bytes
> could be little-endian or big-endian.
>
> As you didn't specify which you wanted, it defaulted to little-endian
> and added a BOM (U+FEFF).
>
> If you specify which endianness you want with "utf-16le" or "utf-16be",
> it won't add the BOM:
>
>  >>> # Little-endian.
>  >>> "\n".encode("utf-16le")
> b'\n\x00'
>  >>> # Big-endian.
>  >>> "\n".encode("utf-16be")
> b'\x00\n'

> Well, ok, but I need a generic method to get LF and CR for any
> encoding a user can input.
> Do you think that
>
> "\n".encode(encoding).lstrip("".encode(encoding))
>
> is good for any encoding?

'.lstrip' is the wrong method to use because it treats its argument as a set of characters, so it might strip off too many characters. A better choice is '.removeprefix'.
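
For example (a quick sketch at the interpreter; bytes.removeprefix() needs
Python 3.9 or later):

>>> "\xff".encode("utf-16")
b'\xff\xfe\xff\x00'
>>> # .lstrip treats b'\xff\xfe' as a *set* of byte values, so here it
>>> # also eats the first byte of the '\xff' code unit:
>>> "\xff".encode("utf-16").lstrip("".encode("utf-16"))
b'\x00'
>>> # .removeprefix removes only the exact BOM prefix:
>>> "\xff".encode("utf-16").removeprefix("".encode("utf-16"))
b'\xff\x00'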

> Furthermore, is there a way to get the encoding of an opened file object?

How was the file opened?


If it was opened as a text file, use the '.encoding' attribute (which just tells you what encoding was specified when it was opened, and you'd be assuming that it's the correct one).
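
For instance (the file name is made up; this only echoes back whatever
encoding open() was given, or the locale default it fell back to):

>>> f = open("example.txt", encoding="utf-16")
>>> f.encoding
'utf-16'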


If it was opened as a binary file, all you know is that it contains bytes, and determining the encoding (assuming that it is a text file) is down to heuristics (i.e. guesswork).
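
The most you can usually do up front is check for a BOM; anything past that
is guessing (or a third-party detector). A rough sketch (the helper name is
mine, not stdlib):

import codecs

def sniff_bom(raw):
    """Guess an encoding from a leading BOM, if any. Purely a heuristic:
    returning None just means 'no BOM found', not 'not text'."""
    # Check the longer BOMs first so UTF-32-LE isn't mistaken for
    # UTF-16-LE (both start with b'\xff\xfe').
    for bom, name in [
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
        (codecs.BOM_UTF8, "utf-8-sig"),
    ]:
        if raw.startswith(bom):
            # These codec names consume the BOM themselves when decoding.
            return name
    return None

You'd feed it the first 4 bytes of the file, e.g. sniff_bom(f.read(4)) for a
file opened in binary mode.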

--
https://mail.python.org/mailman/listinfo/python-list
