Re: iterparse and unicode

2008-08-27 Thread George Sakkis
On Aug 27, 5:42 am, Fredrik Lundh <[EMAIL PROTECTED]> wrote:
> George Sakkis wrote:
> >> if you meant to write "encode", you can indeed safely do
> >> [s.encode('utf8') for s in strings] as long as all strings are returned
> >> by an ET implementation.
>
> > I was replying to the general assertion

Re: iterparse and unicode

2008-08-27 Thread Fredrik Lundh
George Sakkis wrote:
>> if you meant to write "encode", you can indeed safely do
>> [s.encode('utf8') for s in strings] as long as all strings are returned
>> by an ET implementation.
>
> I was replying to the general assertion that "in 2.x ASCII byte strings and unicode strings are compatible", not specif

Re: iterparse and unicode

2008-08-26 Thread George Sakkis
On Aug 25, 4:45 pm, Fredrik Lundh <[EMAIL PROTECTED]> wrote:
> George Sakkis wrote:
> > It depends on what you mean by "compatible"; e.g. you can't safely do
> > [s.decode('utf8') for s in strings] if you have byte strings mixed
> > with unicode.
>
> why would you want to decode strings given to yo

Re: iterparse and unicode

2008-08-25 Thread Fredrik Lundh
George Sakkis wrote:
> It depends on what you mean by "compatible"; e.g. you can't safely do
> [s.decode('utf8') for s in strings] if you have byte strings mixed
> with unicode.

why would you want to decode strings given to you by a library that returns decoded strings? if you meant to write "enc

Re: iterparse and unicode

2008-08-25 Thread George Sakkis
On Aug 24, 1:12 am, Stefan Behnel <[EMAIL PROTECTED]> wrote:
> George Sakkis wrote:
> > On Aug 21, 1:48 am, Fredrik Lundh <[EMAIL PROTECTED]> wrote:
>
> >> George Sakkis wrote:
> >>> It's interesting that the element text attributes after a successful
> >>> parse do not necessarily have the same ty

Re: iterparse and unicode

2008-08-23 Thread Stefan Behnel
George Sakkis wrote:
> It seems xml.etree.cElementTree.iterparse() is not unicode aware:
>
> from StringIO import StringIO
> from xml.etree.cElementTree import iterparse
> s = u'\u03a0\u03b1\u03bd\u03b1\u03b3\u03b9\u03ce\u03c4\u03b7\u03c2'
> for event,elem in iterparse(StringIO(s

Re: iterparse and unicode

2008-08-23 Thread Stefan Behnel
George Sakkis wrote:
> On Aug 21, 1:48 am, Fredrik Lundh <[EMAIL PROTECTED]> wrote:
>
>> George Sakkis wrote:
>>> It's interesting that the element text attributes after a successful
>>> parse do not necessarily have the same type, i.e. all be str or all
>>> unicode. I ported some text extraction

Re: iterparse and unicode

2008-08-21 Thread George Sakkis
On Aug 21, 1:48 am, Fredrik Lundh <[EMAIL PROTECTED]> wrote:
> George Sakkis wrote:
> > It's interesting that the element text attributes after a successful
> > parse do not necessarily have the same type, i.e. all be str or all
> > unicode. I ported some text extraction code from BeautifulSoup (

Re: iterparse and unicode

2008-08-20 Thread Fredrik Lundh
George Sakkis wrote:
> Thank you both for the suggestions. I made a few more experiments to
> understand how iterparse behaves with respect to three dimensions:

Spending time researching undefined behaviour is pretty pointless. ET parsers expect byte streams, because that's what XML files are.
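To illustrate the point about byte streams, here is a minimal sketch of encoding the text before handing it to iterparse. It makes some assumptions that are not in the thread: it uses Python 3's io.BytesIO and xml.etree.ElementTree in place of the StringIO and cElementTree modules from the original post, and the <name> wrapper element and the encoding declaration are hypothetical, added only so the input is a well-formed document.

```python
import io
from xml.etree.ElementTree import iterparse

# The Greek text from the original post.
text = u'\u03a0\u03b1\u03bd\u03b1\u03b3\u03b9\u03ce\u03c4\u03b7\u03c2'

# Hypothetical wrapper: a single <name> element plus an encoding declaration,
# so the parser is told how the bytes are encoded.
document = u'<?xml version="1.0" encoding="utf-8"?><name>%s</name>' % text

# Encode to bytes first; the parser then decodes them itself.
stream = io.BytesIO(document.encode('utf-8'))

# With the default events, iterparse yields one "end" event per element.
texts = [elem.text for event, elem in iterparse(stream)]
```

The element text comes back already decoded, so the caller never has to call decode() on what the parser returns.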

Re: iterparse and unicode

2008-08-20 Thread Fredrik Lundh
George Sakkis wrote:
> Traceback (most recent call last):
>   File "", line 1, in
>   File "", line 64, in __iter__
> UnicodeEncodeError: 'ascii' codec can't encode characters in position 6-15: ordinal not in range(128)
>
> Am I using it incorrectly or it doesn't currently support unicode ?

iterparse pa

Re: iterparse and unicode

2008-08-20 Thread George Sakkis
Thank you both for the suggestions. I made a few more experiments to understand how iterparse behaves with respect to three dimensions:

a. Is the encoding declared in the header (if there is one) ?
b. Is the text ascii-encodable (i.e. within range(128)) ?
c. Does the passed file object's read() me

Re: iterparse and unicode

2008-08-20 Thread John Krukoff
On Wed, 2008-08-20 at 15:36 -0700, George Sakkis wrote:
> It seems xml.etree.cElementTree.iterparse() is not unicode aware:
>
> >>> from StringIO import StringIO
> >>> from xml.etree.cElementTree import iterparse
> >>> s = u'\u03a0\u03b1\u03bd\u03b1\u03b3\u03b9\u03ce\u03c4\u03b7\u03c2'
> >>

Re: iterparse and unicode

2008-08-20 Thread John Machin
On Aug 21, 8:36 am, George Sakkis <[EMAIL PROTECTED]> wrote:
> It seems xml.etree.cElementTree.iterparse() is not unicode aware:
>
> >>> from StringIO import StringIO
> >>> from xml.etree.cElementTree import iterparse
> >>> s = u'\u03a0\u03b1\u03bd\u03b1\u03b3\u03b9\u03ce\u03c4\u03b7\u03c2'

iterparse and unicode

2008-08-20 Thread George Sakkis
It seems xml.etree.cElementTree.iterparse() is not unicode aware:

>>> from StringIO import StringIO
>>> from xml.etree.cElementTree import iterparse
>>> s = u'\u03a0\u03b1\u03bd\u03b1\u03b3\u03b9\u03ce\u03c4\u03b7\u03c2'
>>> for event,elem in iterparse(StringIO(s)):
...     print elem.text
..