Hi,

On Tue, Apr 5, 2011 at 4:12 PM, Christopher Barker <chris.bar...@noaa.gov> wrote:
> On 4/5/11 3:36 PM, josef.p...@gmail.com wrote:
>>> I disagree that U makes no sense for binary file reading.
>
> I wasn't saying that it made no sense to have a "U" mode for binary file
> reading, what I meant is that by the python2 definition, it made no
> sense. In Python 2, the ONLY difference between binary and text mode is
> line-feed translation.

I think it's right to say that the difference between a text and a
binary file in python 2 is - none for unix, and '\r\n' -> '\n'
translation in windows.

The difference between 'rt' and 'U' is (this is for my own benefit):

For 'rt', a '\r' does not cause a line break - with 'U' - it does.
For 'rt' _not_ on Windows, '\r\n' stays the same - it is stripped to
'\n' with 'U'.

> As for Python 3:
>
>>> In python 3:
>>>
>>> 'b' means, "return byte objects"
>>> 't' means "return decoded strings"
>>>
>>> 'U' means two things:
>>>
>>> 1) When iterating by line, split lines at any of '\r', '\r\n', '\n'
>>> 2) When returning lines split this way, convert '\r' and '\r\n' to '\n'
>
> a) 'U' is default -- it's essentially the same as 't' (in PY3), so 't'
> means "return decoded and line-feed translated unicode objects"

Right - my argument is that the behavior implied by 'U' and 't' is
conceptually separable. 'U' is for how to do line-breaks, and
line-termination translations, 't' is for whether to decode the text
or not. In python 3.

> b) I think the line-feed conversion is done regardless of if you are
> iterating by lines, i.e. with a full-on .read(). At least that's how it
> works in py2 -- not running py3 here to test.

Yes, that looks right.

>>> If you support returning lines from a binary file (which python 3
>>> does), then I think 'U' is a sensible thing to allow - as in this
>>> case.
>
> but what is a "binary file"?

In python 3 a binary file is a file which is not decoded, and returns
bytes. It still has a concept of a 'line', as defined by line
terminators - you can iterate over one, or do .readlines().

In python 2, as you say, a binary file is essentially the same as a
text file, with the single exception of the windows \r\n -> \n
translation.

> I THINK what you are proposing is that we'd want to be able to have both
> linefeed translation and no decoding done.
> But I think that's impossible
> -- aren't the linefeeds themselves encoded differently with different
> encodings?

Right - so obviously if you open a utf-16 file as binary, terrible
things may happen - this was what Pauli was pointing out before. His
point was that utf-8 is the standard, and that we probably would not
hit many other encodings. I agree with you if you are saying that it
would be good to be able to deal with them if we can - presumably by
allowing 'rt' file objects, producing python 3 strings.

>> U looks appropriate in this case, better than the workarounds.
>> However, to me the python 3.2 docs seem to say that U only works for
>> text mode
>
> Agreed -- but I don't see the problem -- your files are either encoded
> in something that might treat newlines differently (UCS32, maybe?), in
> which case you'd want it decoded, or you are working with ascii or ansi
> or utf-8, in which case you can specify the encoding anyway.
>
> I don't understand why we'd want a binary blob for text parsing -- the
> parsing code is going to have to know something about the encoding to
> work -- it might as well get passed in to the file open call, and work
> with unicode. I suppose if we still want to assume ascii for parsing,
> then we could use 't' and then re-encode to ascii to work with it. Which
> I agree does seem heavy handed just for fixing newlines.
>
> Also, one problem I've often had with encodings is what happens if I
> think I have ascii, but really have a couple characters above 127 --
> then the default is to get an error in decoding. I'd like to be able to
> pass in a flag that either skips the un-decodable characters or replaces
> them with something, but it doesn't look like you can do that with the
> file open function in py3.
>
>> The line terminator is always b'\n' for binary files;
>
> Once you really make the distinction between text and binary, the concept
> of a "line terminator" doesn't really make sense anyway.
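
To make the binary-file case concrete, here is a minimal Python 3 sketch. It assumes that `io.BytesIO` stands in for a file opened with 'rb' (both iterate by b'\n'), and the sample bytes and function names are made up for illustration. It also shows `bytes.splitlines`, which splits undecoded bytes at any of '\r', '\r\n', '\n'; that is only safe for ASCII-compatible encodings such as utf-8, not for utf-16.

```python
import io

# Hypothetical sample mixing the three line-ending conventions in the thread.
RAW = b"one\rtwo\r\nthree\nfour"


def binary_lines(raw=RAW):
    # Iterating a binary stream yields bytes; only b"\n" ends a line,
    # so b"\r" and b"\r\n" are passed through untouched.
    return list(io.BytesIO(raw))


def split_any_newline(raw=RAW):
    # bytes.splitlines splits at \r, \r\n and \n without decoding;
    # keepends=True keeps the original terminators on each piece.
    return raw.splitlines(keepends=True)
```

The first function reflects what binary iteration does today; the second is roughly the "split like 'U', but do not decode" behaviour under discussion, done after the fact on the bytes rather than by the file object.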

Well - I was arguing that, given we can iterate over lines in binary
files, then there must be the concept of what a line is, in a binary
file, and that means that we need the concept of a line terminator.

I realize this is a discussion that would have to happen on the
python-dev list...

See you,

Matthew
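
For reference, a minimal sketch of the text-mode side of this thread in Python 3. It relies on the `newline` and `errors` parameters of the built-in `open()`; the helper name `read_back`, the temporary file, and the sample bytes are assumptions made up for illustration.

```python
import os
import tempfile


def read_back(raw, **open_kwargs):
    # Write raw bytes to a temporary file, then read it back in text mode
    # with whatever keyword arguments are passed through to open().
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(raw)
        with open(path, "r", **open_kwargs) as f:
            return f.read()
    finally:
        os.remove(path)


# Hypothetical sample mixing the three line-ending conventions.
RAW = b"one\rtwo\r\nthree\nfour"

# Default newline=None: universal newlines, applied even to a plain read().
translated = read_back(RAW, encoding="ascii")

# newline="": the text is still decoded, but terminators are left as-is.
untranslated = read_back(RAW, encoding="ascii", newline="")

# errors="replace": bytes that cannot be decoded become U+FFFD
# instead of raising UnicodeDecodeError.
replaced = read_back(b"caf\xe9\n", encoding="ascii", errors="replace")
```

So in text mode the decoding and the line-ending translation can be controlled separately (`encoding`/`errors` versus `newline`), and `errors` also accepts "ignore" to drop undecodable bytes. What is not available is the translation without any decoding at all.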