M.-A. Lemburg wrote:
> Nick Coghlan wrote:
>> Antoine Pitrou wrote:
>>> M.-A. Lemburg <mal <at> egenix.com> writes:
>>>> Please file a bug report for this. f.readlines() (or rather
>>>> the io layer) should be using Py_UNICODE_ISLINEBREAK(ch)
>>>> for detecting line break characters.
>>>
>>> Actually, no. It has been designed from the start to only recognize the
>>> "standard" line break representations found in common formats/protocols
>>> (CR, LF and CR+LF).
>>> People wanting to split on arbitrary Unicode line breaks should use
>>> str.splitlines().
>>
>> The fairly long-standing RFE relating to an arbitrarily selectable
>> newline separator seems relevant here:
>> http://bugs.python.org/issue1152248
>>
>> As with the discussion there, the problem with using str.splitlines is
>> that it prevents pipelining approaches that avoid reading a whole file
>> into memory.
>>
>> While removing the validity check from readlines() completely is
>> questionable (the readrecords() approach mentioned in the tracker issue
>> would still be better there), loosening the validity check to be based
>> on Py_UNICODE_ISLINEBREAK seems a bit more feasible. (I'd still call it
>> a feature request rather than a bug though).
>
> I've had a look at the io implementation: this appears to be
> based on the universal newline support idea which addresses
> only a fixed set of "new line" character combinations and is
> not as straightforward to extend to support all Unicode
> line break characters as I thought.
>
> What I don't understand is why the io layer tries to reinvent
> the wheel here instead of just using the codec's .readline()
> method - which *does* use .splitlines() and has full support
> for all Unicode line break characters (including the CRLF
> combination).

... and because of this, the feature is already available if you
use codecs.open() instead of the built-in open():

    import codecs

    # Write a file containing LF and U+2029 (PARAGRAPH SEPARATOR).
    with codecs.open("x.txt", "w", encoding='utf-8') as f:
        f.write("a\nb\u2029c\n")

    # The codec's readlines() splits via .splitlines(), so U+2029
    # ends a line as well.
    with codecs.open("x.txt", "r", encoding='utf-8') as f:
        n = 1
        for l in f.readlines():
            print(n, repr(l))
            n += 1

This prints:

    1 'a\n'
    2 'b\u2029'
    3 'c\n'

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Aug 06 2009)
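
On the pipelining concern quoted above (str.splitlines() wants the whole
text in memory): here is a minimal sketch of one way to get the same line
boundaries while reading in chunks. The helper name iter_unicode_lines and
its chunk_size parameter are made up for illustration; this is not part of
the io or codecs modules, and it assumes the stream does no newline
translation of its own.

```python
import io


def iter_unicode_lines(stream, chunk_size=8192):
    """Yield lines from a text stream, split wherever str.splitlines()
    would split, without reading the whole stream at once.

    Hypothetical helper for illustration only.
    """
    pending = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        lines = pending.splitlines(keepends=True)
        # Always hold back the last piece: it may be an unfinished line,
        # or end in a CR whose matching LF is still in the next chunk.
        pending = lines.pop()
        yield from lines
    if pending:
        # End of stream: whatever is left is the final line.
        yield pending


# newline="" keeps StringIO from translating line endings.
sample = io.StringIO("a\nb\u2029c\r\nd", newline="")
for n, line in enumerate(iter_unicode_lines(sample, chunk_size=2), 1):
    print(n, repr(line))
```

With a real file, the same assumption would mean opening it with
open(path, encoding="utf-8", newline="") so that the io layer hands the
line endings through untranslated. A very long line without any break is
re-split on every chunk, so this is only a sketch, not a tuned
implementation.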