On Wed, Aug 10, 2016, at 15:22, Steve Dower wrote:
> > Why? What's the use case? [byte paths]
>
> Allowing library developers who support POSIX and Windows to just use
> bytes everywhere to represent paths.

Okay, how is that use case impacted by it being mbcs instead of utf-8?

What about only doing the deprecation warning if non-ascii bytes are
present in the value?

> > For reading, I assume. When opened for writing, it should probably be
> > utf-8-sig [if it's not mbcs] to match what Notepad does. What about
> > files opened for appending or updating? In theory it could ingest the
> > whole file to see if it's valid UTF-8, but that has a time cost.
>
> Writing out the BOM automatically basically makes your files
> incompatible with other platforms, which rarely expect a BOM.

Yes, but you're not running on other platforms, you're running on the
platform you're running on. If files need to be moved between platforms,
converting files with a BOM to without ought to be the responsibility of
the same tool that converts CRLF line endings to LF.

> By
> omitting it but writing and reading UTF-8 we ensure that Python can
> handle its own files on any platform, while potentially upsetting some
> older applications on Windows or platforms that don't assume UTF-8 as a
> default.

Okay, you haven't addressed updating and appending. I realized after
posting that updating should be in binary, but that leaves appending.
Should we detect BOMs and/or attempt to detect the encoding by other
means in those cases?

> > Notepad, if there's no BOM, checks the first 256 bytes of the file for
> > whether it's likely to be utf-16 or mbcs [utf-8 isn't considered AFAIK],
> > and can get it wrong for certain very short files [i.e. the infamous
> > "this app can break"]
>
> Yeah, this is a pretty horrible idea :)

Eh, maybe the utf-16 because it can give some hilariously bad results,
but using it to differentiate between utf-8 and mbcs might not be so
bad. But what to do if all we see is ascii?

> > What to do on opening a pipe or device? [Is os.fstat able to detect
> > these cases?]
>
> We should be able to detect them, but why treat them any differently
> from a file?

Eh, I was mainly concerned about if the first few bytes aren't a BOM?
What about blocking waiting for them? But if this is delayed until the
first read then it's fine.

> It probably also entails opening the file descriptor in bytes mode,
> which might break programs that pass the fd directly to CRT functions.
> Personally I wish they wouldn't, but it's too late to stop them now.

The only thing O_TEXT does rather than O_BINARY is convert CRLF line
endings (and maybe end on ^Z), and I don't think we even expose the
constants for the CRT's unicode modes.
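
To make the appending question concrete, here is a minimal sketch of the
kind of detection being asked about: look for a BOM, and otherwise check
whether a sample of the file is valid UTF-8. This is only an illustration
under stated assumptions, not a proposal for what open() should do. The
helper name sniff_append_encoding, its parameters, and the 256-byte sample
size (borrowed from the Notepad behaviour mentioned above) are all
hypothetical, and UTF-32 BOMs are deliberately ignored.

```python
import codecs


def sniff_append_encoding(path, default="utf-8", legacy="mbcs", sample_size=256):
    """Guess an encoding name to use when appending to an existing file.

    Hypothetical helper for illustration; not an existing Python API.
    It only returns a codec name and never opens the file for writing.
    """
    try:
        with open(path, "rb") as f:
            head = f.read(sample_size)
    except FileNotFoundError:
        return default  # new file: nothing to sniff

    # A BOM settles the question. Return BOM-less / explicit-endian codec
    # names so that appending does not write a second BOM mid-file.
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if head.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if head.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"

    # Pure ASCII (or an empty file) is ambiguous: utf-8 and mbcs agree on
    # every byte seen so far, so fall back to the caller's default.
    if head.isascii():
        return default

    # Non-ASCII bytes present: accept UTF-8 only if the sample decodes.
    # If the sample may have been cut mid-character, tolerate an
    # incomplete trailing sequence by decoding with final=False.
    whole_file_read = len(head) < sample_size
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(head, final=whole_file_read)
    except UnicodeDecodeError:
        return legacy
    return "utf-8"
```

The all-ASCII branch is exactly the open question from above: the sketch
cannot decide, so it hands the decision back to the caller through the
default argument. It also reads the sample eagerly, which is the blocking
concern for pipes and devices; a real implementation would presumably
need to defer this until the first read or write.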