Re: [Python-3000] BOM handling

2006-09-14 Thread Walter Dörwald
Josiah Carlson wrote: > Antoine Pitrou <[EMAIL PROTECTED]> wrote: >> >> On Wednesday 13 September 2006 at 09:41 -0700, Josiah Carlson wrote: >>> And is generally ignored, as per the Unicode spec; it's a "zero width >>> non-breaking space" - an invisible character with no effect on wrapping >>> or othe

Re: [Python-3000] BOM handling

2006-09-14 Thread Talin
Antoine Pitrou wrote: > Hi, > > On Wednesday 13 September 2006 at 16:14 -0700, Josiah Carlson wrote: >> In any case, I believe that the above behavior is correct for the >> context. Why? Because utf-8 has no endianness, its 'generic' decoding >> spelling of 'utf-8' is analogous to all three 'ut

Re: [Python-3000] string C API

2006-09-14 Thread Nick Coghlan
Martin v. Löwis wrote: > Jim Jewett schrieb: >> Simply delegate such methods to a hidden per-encoding subclass. >> >> The UTF-8 methods will indeed be complex, unless the solution is >> simply "someone called indexing/slicing/len, so I have to recode after >> all." >> >> The Latin-1 encoding will h

Re: [Python-3000] string C API

2006-09-14 Thread Marcin 'Qrczak' Kowalczyk
Nick Coghlan <[EMAIL PROTECTED]> writes: > Only the first such call on a given string, though - the idea > is to use lazy decoding, not to avoid decoding altogether. > Most manipulations (len, indexing, slicing, concatenation, etc) > would require decoding to at least UCS-2 (or perhaps UCS-4). Si

Re: [Python-3000] string C API

2006-09-14 Thread Antoine
> Only the first such call on a given string, though - the idea is to use > lazy > decoding, not to avoid decoding altogether. Most manipulations (len, > indexing, > slicing, concatenation, etc) would require decoding to at least UCS-2 (or > perhaps UCS-4). My two cents: For len() you can comput
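The preview above is cut off, so what was actually proposed for len() is not recoverable from this page. As an illustration only, here is one well-known way to get the code point count of well-formed UTF-8 without decoding it: count the bytes that are not continuation bytes. The function name `utf8_len` is hypothetical and does not come from the thread.

```python
def utf8_len(data: bytes) -> int:
    """Count code points in well-formed UTF-8 without decoding it.

    Assumption: `data` is valid UTF-8. Continuation bytes have the bit
    pattern 10xxxxxx; every other byte starts a new code point.
    """
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


sample = "Dörwald".encode("utf-8")
# The count matches the length of the decoded string.
assert utf8_len(sample) == len(sample.decode("utf-8"))
```

This is still a linear scan, so it saves the memory of a decoded copy but not the O(n) pass; indexing and slicing would need more than this.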

Re: [Python-3000] BOM handling

2006-09-14 Thread Paul Moore
On 9/14/06, Talin <[EMAIL PROTECTED]> wrote: > I've been reading this thread (and the ones that spawned it), and > there's something about it that's been nagging at me for a while, which > I am going to attempt to articulate. [...] > Any given Python program that I write is going to know *something

Re: [Python-3000] Pre-PEP: Easy Text File Decoding

2006-09-14 Thread Marcin 'Qrczak' Kowalczyk
David Hopwood <[EMAIL PROTECTED]> writes: > You're correct about the use of a BOM as a signature. All > Unicode-conformant applications should accept this use of a BOM in > UTF-8 (although they need not generate it); the standard is quite > clear on that. When a program generates a list of filena
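The "accept a BOM signature but need not generate one" behaviour can be sketched with the standard `utf-8-sig` codec found in current Python versions. This is a minimal illustration, not something proposed in the thread; the helper name `read_utf8_text` is hypothetical.

```python
import codecs


def read_utf8_text(data: bytes) -> str:
    """Decode UTF-8, dropping a leading BOM signature if one is present."""
    # 'utf-8-sig' strips one leading BOM on decode and is otherwise
    # identical to 'utf-8'.
    return data.decode("utf-8-sig")


with_bom = codecs.BOM_UTF8 + "héllo".encode("utf-8")
without_bom = "héllo".encode("utf-8")

assert read_utf8_text(with_bom) == "héllo"
assert read_utf8_text(without_bom) == "héllo"

# The plain 'utf-8' codec keeps the signature as a U+FEFF character.
assert with_bom.decode("utf-8") == "\ufeffhéllo"

# Writing with plain 'utf-8' never generates a BOM.
assert not "héllo".encode("utf-8").startswith(codecs.BOM_UTF8)
```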

Re: [Python-3000] BOM handling

2006-09-14 Thread Blake Winton
Talin wrote: >> My point was different : most programmers are not at your level (or >> Paul's level, etc.) when it comes to Unicode knowledge. Py3k's str type >> is supposed to be an abstracted textual type to make it easy to write >> unicode-friendly applications (isn't it?). > > The basic contro

Re: [Python-3000] BOM handling

2006-09-14 Thread Josiah Carlson
Blake Winton <[EMAIL PROTECTED]> wrote: [snip] > Um, what more data do we need for this use-case? I'm not going to > suggest an API, other than it would be nice if I didn't have to manually > figure out/hard code all the encodings. (It's my belief that I will > currently have to do that, or a

Re: [Python-3000] string C API

2006-09-14 Thread Josiah Carlson
"Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> wrote: > Nick Coghlan <[EMAIL PROTECTED]> writes: > > > Only the first such call on a given string, though - the idea > > is to use lazy decoding, not to avoid decoding altogether. > > Most manipulations (len, indexing, slicing, concatenation, etc)

Re: [Python-3000] string C API

2006-09-14 Thread Bob Ippolito
On 9/14/06, Josiah Carlson <[EMAIL PROTECTED]> wrote: > > "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> wrote: > > Nick Coghlan <[EMAIL PROTECTED]> writes: > > > > > Only the first such call on a given string, though - the idea > > > is to use lazy decoding, not to avoid decoding altogether. > >

Re: [Python-3000] BOM handling

2006-09-14 Thread Blake Winton
Josiah Carlson wrote: > Blake Winton <[EMAIL PROTECTED]> wrote: >> I'm not going to >> suggest an API, other than it would be nice if I didn't have to manually >> figure out/hard code all the encodings. (It's my belief that I will >> currently have to do that, or at least special-case XML, to r

Re: [Python-3000] BOM handling

2006-09-14 Thread Paul Prescod
As a somewhat aside: for XML encoding detection: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/363841 Paul Prescod

Re: [Python-3000] BOM handling

2006-09-14 Thread Josiah Carlson
Blake Winton <[EMAIL PROTECTED]> wrote: > Josiah Carlson wrote: > > Blake Winton <[EMAIL PROTECTED]> wrote: > >> I'm not going to > >> suggest an API, other than it would be nice if I didn't have to manually > >> figure out/hard code all the encodings. (It's my belief that I will > >> currentl

Re: [Python-3000] BOM handling

2006-09-14 Thread Paul Moore
On 9/14/06, Josiah Carlson <[EMAIL PROTECTED]> wrote: > So don't save it with a BOM and add a Python coding: directive to the > second line. Python and bash comments just happen to have the same # > delimiter, and if your editor doesn't suck, then it should understand > such a directive. However,
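For readers unfamiliar with the coding directive on the second line, the sketch below shows how current Python detects it, using `tokenize.detect_encoding` from the standard library. The sample script bytes and the helper name `source_encoding` are made up for illustration.

```python
import codecs
import io
import tokenize


def source_encoding(source: bytes) -> str:
    """Return the encoding Python would use to read this source file."""
    # detect_encoding inspects a BOM and at most the first two lines.
    encoding, _lines = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


# Shebang on line 1, coding directive on line 2, no BOM.
script = b"#!/usr/bin/env python\n# -*- coding: utf-8 -*-\nprint('hi')\n"
assert source_encoding(script) == "utf-8"

# A file that starts with a UTF-8 BOM and has no directive.
bom_script = codecs.BOM_UTF8 + b"print('hi')\n"
assert source_encoding(bom_script) == "utf-8-sig"
```

Because the shebang must be the very first bytes of the file for the kernel to honour it, the directive form leaves the file usable as an executable script, which a leading BOM does not.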

Re: [Python-3000] string C API

2006-09-14 Thread Josiah Carlson
"Bob Ippolito" <[EMAIL PROTECTED]> wrote: > The argument for UTF-8 is probably interop efficiency. Lots of C > libraries, file formats, and wire protocols use UTF-8 for interchange. > Verifying the validity of UTF-8 during string creation isn't that big > of a deal. Indeed, UTF-8 validation/creat

Re: [Python-3000] string C API

2006-09-14 Thread Bob Ippolito
On 9/14/06, Josiah Carlson <[EMAIL PROTECTED]> wrote: > > "Bob Ippolito" <[EMAIL PROTECTED]> wrote: > > The argument for UTF-8 is probably interop efficiency. Lots of C > > libraries, file formats, and wire protocols use UTF-8 for interchange. > > Verifying the validity of UTF-8 during string creat

Re: [Python-3000] BOM handling

2006-09-14 Thread Jason Orendorff
For what it's worth: in .NET, everything defaults to UTF-8, whether reading or writing. No BOM is generated when creating a new file. http://msdn2.microsoft.com/en-us/library/system.io.file.createtext.aspx Java defaults to a "default character encoding", which on Windows is the system's ANSI e

Re: [Python-3000] string C API

2006-09-14 Thread Martin v. Löwis
Nick Coghlan schrieb: > Only the first such call on a given string, though - the idea is to use > lazy decoding, not to avoid decoding altogether. Most manipulations > (len, indexing, slicing, concatenation, etc) would require decoding to > at least UCS-2 (or perhaps UCS-4). Ok. Then my objection

Re: [Python-3000] iostack, second revision

2006-09-14 Thread Anders J. Munch
Josiah Carlson wrote: > Any sane person uses os.stat(f.name) or os.fstat(f.fileno()), unless > they want to seek to the end of the file for later writing or expected > reading of data yet-to-be-written. os.fstat(f.fileno()).st_size doesn't work for file-like objects. Goodbye unit testing with S
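The objection above can be made concrete. The sketch below uses `io.BytesIO` from current Python as a stand-in for the StringIO objects mentioned in the message, and falls back to seek/tell when there is no file descriptor. The helper name `stream_size` is hypothetical.

```python
import io
import os


def stream_size(f) -> int:
    """Return the size in bytes of a seekable binary stream."""
    try:
        # Works for real files only. For a file open for writing, this
        # does not count data still sitting in Python's write buffer.
        return os.fstat(f.fileno()).st_size
    except (AttributeError, OSError):
        # In-memory streams have no descriptor: io.BytesIO.fileno() raises
        # io.UnsupportedOperation, which is a subclass of OSError.
        position = f.tell()
        try:
            return f.seek(0, os.SEEK_END)
        finally:
            f.seek(position)  # restore the caller's position


buffer = io.BytesIO(b"hello world")
buffer.seek(3)
assert stream_size(buffer) == 11
assert buffer.tell() == 3
```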

Re: [Python-3000] BOM handling

2006-09-14 Thread Michael Urman
On 9/14/06, Josiah Carlson <[EMAIL PROTECTED]> wrote: > With luck, your editor should also allow for the > non-writing of the BOM on utf-8 save (given certain conditions). If not, > contact the author(s) and request that feature. And hope they didn't write it in a language that doesn't let them c

Re: [Python-3000] BOM handling

2006-09-14 Thread Josiah Carlson
"Paul Moore" <[EMAIL PROTECTED]> wrote: > On 9/14/06, Josiah Carlson <[EMAIL PROTECTED]> wrote: > > So don't save it with a BOM and add a Python coding: directive to the > > second line. Python and bash comments just happen to have the same # > > delimiter, and if your editor doesn't suck, then i

Re: [Python-3000] BOM handling

2006-09-14 Thread David Hopwood
Paul Moore wrote: > On 9/14/06, Josiah Carlson <[EMAIL PROTECTED]> wrote: > >>So don't save it with a BOM and add a Python coding: directive to the >>second line. Python and bash comments just happen to have the same # >>delimiter, and if your editor doesn't suck, then it should understand >>such

Re: [Python-3000] iostack, second revision

2006-09-14 Thread Greg Ewing
Anders J. Munch wrote: > (note the potential race condition in > f=mmap.mmap(f.fileno(),os.fstat(f.fileno(. Not sure anything could be done about that. Even if there were an mmap-this-file-however-big-it-is call, the size of the file could still change *after* you'd mapped it. -- Greg
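In current Python, passing a length of 0 to `mmap.mmap` maps the file at whatever size it has at the time of the call, which avoids the separate fstat step. The caveat in the message still applies: the file can grow or shrink after it is mapped. A minimal sketch, with a hypothetical helper name `map_whole_file`:

```python
import mmap
import os
import tempfile


def map_whole_file(path):
    """Map an entire file read-only; the file must not be empty."""
    with open(path, "rb") as f:
        # Length 0 means "the current size of the file". The mapping
        # holds its own handle, so closing f afterwards is fine.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as out:
        out.write(b"hello mmap")
    mapped = map_whole_file(path)
    assert len(mapped) == 10
    assert mapped[:5] == b"hello"
    mapped.close()
finally:
    os.remove(path)
```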

Re: [Python-3000] iostack, second revision

2006-09-14 Thread Josiah Carlson
"Anders J. Munch" <[EMAIL PROTECTED]> wrote: > Josiah Carlson wrote: > > You were also talking about buffering writes to reduce the overhead of > > the underlying seeks and tells because of apparent "optimizations" you > > wanted to make. Here is a data integrity optimization you can make for >