Albert Hopkins wrote:

> On Tue, 2010-11-30 at 11:52 +0100, Peter Otten wrote:
>> Dan Stromberg wrote:
>>
>> > I've got a couple of programs that read filenames from stdin, and then
>> > open those files and do things with them. These programs sort of do
>> > the *ix xargs thing, without requiring xargs.
>> >
>> > In Python 2, these work well. Irrespective of how filenames are
>> > encoded, things are opened OK, because it's all just a stream of
>> > single byte characters.
>>
>> I think you're wrong. The filenames' encoding as they are read from stdin
>> must be the same as the encoding used by the file system. If the file
>> system expects UTF-8 and you feed it ISO-8859-1 you'll run into errors.
>>
> I think this is wrong. In Unix there is no concept of filename
> encoding. Filenames can have any arbitrary set of bytes (except '/' and
> '\0'). But the filesystem itself neither knows nor cares about
> encoding.

I think you misunderstood what I was trying to say. If you write a list of
filenames into files.txt, using an encoding (ISO-8859-1, say) other than
the one the shell uses to display file names (on Linux typically UTF-8 these
days), and then write a Python script exist.py that reads filenames and
checks for the files' existence,

$ python3 exist.py < files.txt

will report that a file b'\xe4\xf6\xfc.txt' doesn't exist. The user looking
at his editor with the encoding set to ISO-8859-1, seeing the line

äöü.txt

and then going to the console and typing

$ ls äöü.txt

will be confused even though everything is working correctly. The system may
be shuffling bytes, but the user thinks in codepoints and sometimes assumes
that codepoints and bytes are the same.

>> You always have to know either
>>
>> (a) both the file system's and stdin's actual encoding, or
>> (b) that both encodings are the same.
>>
> If this is true, then I think that it is wrong to do in Python3. Any
> language should be able to deal with the filenames that the host OS
> allows.
>
> Anyway, going on with the OP.. can you open stdin so that you can accept
> arbitrary bytes instead of strings and then open using the bytes as the
> filename?

You can access the underlying sys.stdin.buffer, which feeds you the raw bytes
with no attempt to shoehorn them into codepoints. You can also use filenames
that are not valid in the encoding that the system uses to display filenames:

$ ls
$ python3
Python 3.1.1+ (r311:74480, Nov  2 2009, 15:45:00)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> with open(b"\xe4\xf6\xfc.txt", "w") as f:
...     f.write("hello\n")
...
6
>>>
$ ls
???.txt

> I don't have that much experience with Python3 to say for sure.

Me neither.

Peter
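
PS: Here is a minimal sketch of the stdin.buffer approach, in case it helps. It assumes one filename per line, so names that themselves contain a newline are not handled. The names iter_filenames and main are made up for this illustration and are not taken from Dan's programs. The filenames stay bytes from input to output, so nothing is ever decoded:

```python
import os
import sys


def iter_filenames(lines):
    """Yield one filename per input line as raw bytes, never decoding."""
    for line in lines:
        name = line.rstrip(b"\n")
        if name:  # skip empty lines
            yield name


def main():
    # sys.stdin.buffer is the binary stream underneath the text-mode sys.stdin.
    for name in iter_filenames(sys.stdin.buffer):
        # os.path.exists accepts bytes and passes them to the OS unchanged.
        status = b"exists" if os.path.exists(name) else b"missing"
        # Write through the binary layer too, so no encoding step can fail.
        sys.stdout.buffer.write(name + b": " + status + b"\n")


if __name__ == "__main__":
    main()
```

Saved as, say, exist_bytes.py (a hypothetical name), it would be run the same way as above: $ python3 exist_bytes.py < files.txt. The output reproduces the input bytes verbatim, so how the names look on screen still depends on the terminal's encoding.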