>> So when it is time to guess [at the character encoding of a file],
>> a source of good guesses is an important battery to include.
> The barrier to entry for the standard library is higher than mere
> usefulness.

Agreed. But "most programs will need it, and people will either include the (same) 3rd-party library themselves, or write their own workaround, or have buggy code" *is* sufficient. The points of contention are:

(1) How many programs have to deal with documents written outside their control -- and probably originating on another system?

I'm not ready to say "most" programs in general, but I think that bar is met both for web clients (for which we already supply several batteries) and for quick-and-dirty utilities.

(2) How serious are the bugs, and how annoying are the workarounds?

As someone who mostly sticks to English, and who tends to manually ignore stray bytes when dealing with a semi-binary file format, the bugs aren't that serious for me personally. So I may well choose to write buggy programs, and the bug may well never get triggered on my own machine. But having a batch process crash one run in ten (where it didn't crash at all under Python 2) is a bad thing. There are environments where (once I knew about it) I would add chardet -- if I could get approval for the 3rd-party component.

(3) How clear-cut is the *right* answer?

As I said, at one point (several years ago) the W3C and WHATWG started to standardize the "right" answer. They backed that out because vendors wanted the option to improve their detection in the future without violating standards. There are certainly situations where local knowledge can do better than a global solution like chardet, but the "right" answer is clear most of the time.

Just ignoring the problem is still a 99% answer, because most text in ASCII-mostly environments really is "close enough". But that is harder (and the One Obvious Way is less reliable) under Python 3 than it was under Python 2.
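For what it's worth, the "just ignore it" approach does exist in Python 3 today: the surrogateescape error handler smuggles undecodable bytes through as lone surrogates, so they round-trip unchanged on output. A minimal sketch (the sample bytes and file name are invented for the demo):

```python
# Reading a file of unknown encoding without crashing on stray bytes.
# errors="surrogateescape" maps each undecodable byte to a lone surrogate,
# which encodes back to the original byte on the way out -- roughly the
# Python 2 "close enough" behaviour described above.
import os
import tempfile

raw = b"mostly ascii \xff\xfe stray bytes\n"  # not valid UTF-8

path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "wb") as f:
    f.write(raw)

# With the default (strict) handler this read would raise UnicodeDecodeError:
with open(path, encoding="utf-8", errors="surrogateescape") as f:
    text = f.read()

# The stray bytes survive as surrogates and round-trip back to bytes:
assert text.encode("utf-8", errors="surrogateescape") == raw
```

The ASCII parts of `text` are ordinary str, so simple processing still works; only the stray bytes are quarantined.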
An alias for "open" that defaulted to surrogate-escape (or that returned the new "ASCIIstr" bytes hybrid) would probably be sufficient to get back (almost) to Python 2 levels of ease and reliability. But it would tend to encourage ASCII/English-only assumptions.

You could fix most of the remaining problems by scripting a web browser, except that scripting the browser in a cross-platform manner is slow and problematic, even with webbrowser.py.

"Whatever a recent Firefox does" is (almost by definition) good enough, and is available ... but maybe not in a convenient form, which is one reason chardet was created as a port thereof. Also note that Firefox assumes you will update more often than Python does.

"Whatever chardet said at the time the Python release was cut" is almost certainly good enough too. The browser makers go to great lengths to match each other even in bizarre corner cases. (Which is one reason there aren't more competing solutions.) But that doesn't mean it is *impossible* to construct a test case where they disagree -- or even one where a recent improvement in the algorithms led to regressions for one particular document. That said, such regressions should be limited to documents that were not properly labeled in the first place, and should be rare even there. Think of the changes as obscure bugfixes, akin to a program starting to handle NaN properly in a place where it "should" never see one.

-jJ

--
If there are still threading problems with my replies, please email me with details, so that I can try to resolve them. -jJ
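To make the shape of the problem concrete, here is a toy guesser. This is emphatically *not* chardet's or Firefox's algorithm -- it only sniffs BOMs, tries a UTF-8 trial decode, and falls back to Latin-1 (which never fails, so it is a "stop crashing" fallback rather than a real guess). The function name is made up:

```python
# A deliberately simplistic encoding guesser: BOM sniff, then UTF-8 trial
# decode, then Latin-1 as a last resort. Real detectors (chardet, browsers)
# add statistical models of byte frequencies per language/charset.
import codecs

def guess_encoding(data: bytes) -> str:
    # UTF-32 BOMs must be checked before UTF-16: the UTF-32-LE BOM
    # (FF FE 00 00) starts with the UTF-16-LE BOM (FF FE).
    boms = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Latin-1 decodes any byte sequence, so this branch cannot fail.
        return "latin-1"
```

Even this crude version is right surprisingly often on real-world files, which is the thread's point: a guess that is merely "good enough most of the time" is already valuable as a battery.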