On 2022-08-22 00:45:56 -0000, Jon Ribbens via Python-list wrote:
> With the offset though, BeautifulSoup made an arbitrary decision to
> use ISO-8859-1 encoding and so when you chopped the bytestring at
> that offset it only worked because BeautifulSoup had happened to
> choose a 1-byte-per-character encoding. Ironically, *without* the
> "\xed\xa0\x80\xed\xbc\x9f" it wouldn't have worked.

Actually it would. The unit is bytes if you feed it with bytes, and
characters if you feed it with str. So in any case you can use the
offset on the data you fed to the parser. Maybe not what you expected,
but it seems quite useful for what Chris has in mind.

(OTOH it seems that the html parser doesn't heed any <meta charset>
tags, which seems less than ideal for more pedestrian purposes.)

> > So I would probably just let this one go through as 8859-1.
>
> It looks like BeautifulSoup is doing something like that, yes.
> Personally I would be nervous about some of my files being parsed
> as UTF-8 and some of them ISO-8859-1 (due to decoding errors rather
> than some of the files actually *being* ISO-8859-1 ;-) )

Since none of the syntactically meaningful characters have a code
>= 0x80, you can parse HTML at the byte level if you know that it's
encoded in a strict superset of ASCII (which all of the ISO-8859 family
and UTF-8 are). Only if that's not true (e.g. if your files might be
UTF-16 (or Shift-JIS or EUC, if I remember correctly)) do you have to
know the character set.

(By parsing I mean only "create a syntax tree". Obviously you have to
know the encoding to know whether to display «c3 bc» as «ü» or «Ã¼».)

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | h...@hjp.at        |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
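
PS: A minimal sketch of the byte-level argument, in case it helps. It
does not use BeautifulSoup at all; the regex TAG and the helper
tag_offsets() are made up for this illustration and are much cruder
than a real HTML parser (no comments, no ">" inside attribute values).
It only shows that tag boundaries can be found without decoding when
the encoding is a superset of ASCII, and that this breaks for UTF-16.

```python
import re

# Hypothetical, deliberately crude tag matcher working on raw bytes.
TAG = re.compile(rb"</?[A-Za-z][A-Za-z0-9]*[^<>]*>")

def tag_offsets(data):
    """Return (byte offset, raw tag) for every tag-like span in data."""
    return [(m.start(), m.group()) for m in TAG.finditer(data)]

text = "<p>Gr\u00fc\u00dfe <b>hp</b></p>"   # "Grüße" has two non-ASCII chars
utf8 = text.encode("utf-8")
latin1 = text.encode("iso-8859-1")

# Same tags in both encodings; only the byte offsets differ.
tags_utf8 = [tag for _, tag in tag_offsets(utf8)]
tags_latin1 = [tag for _, tag in tag_offsets(latin1)]

# Offsets refer to the data that was fed in, so slicing that data works.
offset_b = tag_offsets(utf8)[1][0]
slice_bytes = utf8[offset_b:offset_b + 3]

# Decoding as ISO-8859-1 maps every byte to one character, so the same
# offset also works on the (mis)decoded str.
slice_str = utf8.decode("iso-8859-1")[offset_b:offset_b + 3]

# UTF-16 is not a superset of ASCII: the byte-level matcher finds nothing.
tags_utf16 = tag_offsets(text.encode("utf-16-le"))

# The encoding only matters once you want to display the text.
as_utf8 = b"\xc3\xbc".decode("utf-8")
as_latin1 = b"\xc3\xbc".decode("iso-8859-1")
```

Here slice_bytes and slice_str both pick out the <b> tag, while
as_utf8 and as_latin1 differ, which is the "syntax tree vs. display"
distinction above.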