Re: [Python-Dev] XML codec?

2007-11-13 Thread Walter Dörwald
Fred Drake wrote:
 On Nov 12, 2007, at 8:56 AM, Walter Dörwald wrote:
 It isn't embedded. codecs.detect_xml_encoding() is callable without
 any problems (though not documented).
 
 Not documented means not available, I think.

I just didn't think that anyone would want the detection function but not
the codec, so I left the function undocumented.

 Who would use such a function for what?
 
 Being able to detect the encoding can be useful anytime you want
 information about a file, actually.  In particular, presenting encoding
 information in a user interface (yes, you can call that contrived, but
 some people want to be able to see such things, and for them it's a
 requirement).

And if you want to display the XML, you'd need to decode it. An example
might be a text viewer, e.g. Apple's Quick Look.

 If you want to parse the XML and re-encode, it's common
 to want to re-encode in the origin encoding; it's needed for that as
 well.  If you just want to toss the text into an editor, the encoding is
 also needed.  In that case, the codec approach *might* be acceptable
 (depending on the rest of the editor implementation), but the same
 re-encoding issue applies as well.
 
 Simply, it's sometimes desired to know the encoding for purposes that
 don't require immediate decoding.  A function would be quite handy in
 these cases.

So the consensus seems to be: Add an encoding detection function
(implemented in Python) to the xml module?

Servus,
   Walter

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] XML codec?

2007-11-12 Thread Walter Dörwald
Martin v. Löwis wrote:
 I don't know. Is an XML document ill-formed if it doesn't contain an
 XML declaration, is not in UTF-8 or UTF-16, but there's external
 encoding info?
 
 If there is external encoding info, matching the actual encoding,
 it would be well-formed. Of course, preserving that information would
 be up to the application.

OK. When the application passes an encoding to the decoder this is
supposed to be the external encoding info, so for the encoder it makes
sense to assume that the encoding passed to the encoder is the external
encoding info and will be transmitted along with the encoded bytes.

 This looks good. Now we would have to extend the code to detect and
 replace the encoding in the XML declaration too.
 
 I'm still opposed to making this a codec. Right - for a pure Python
 solution, the processing of the XML declaration would still need to
 be implemented.
 
 I think there could be a much simpler routine to have the same 
 effect. - if it's less than 4 bytes, answer "need more data".
 Can there be an XML document that is less than 4 bytes? I guess not.
 
 No, the smallest document has exactly 4 characters (e.g. <f/>).
 However, external entities may be smaller, such as "x".
 
 But anyway: would a Python implementation of these two functions
 (detect_encoding()/fix_encoding()) be accepted?
 
 I could agree to a Python implementation of this algorithm as long
 as it's not packaged as a codec.

I still can't understand your objection to a codec. What's the
difference between UTF-16 decoding and XML decoding? In fact PEP 263
IMHO does specify how to decode Python source, so in theory it could be
a codec (in practice this probably wouldn't work because of
bootstrapping problems).

Servus,
   Walter



Re: [Python-Dev] XML codec?

2007-11-12 Thread Walter Dörwald
Martin v. Löwis wrote:
   In case it isn't clear - this is exactly my view also.

 But is there an API to do it?  As MAL points out that API would have
 to return not an encoding, but a pair of an encoding and the rewound
 stream.  
 
 The API wouldn't operate on streams. Instead, you pass a string, and
 it either returns the detected encoding or an indication that
 it needs more data. No streams.

But in many cases you read the data out of a stream and pass it to an
incremental XML parser. So if you're transcoding the input (either
because the XML parser can't handle the encoding in question or because
there's an external encoding specified, but it's not possible to pass
that to the parser), a codec makes the most sense.
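
The transcoding step described above can be sketched with the incremental codecs that already exist in the standard library. This is a minimal sketch, not the proposed XML codec: `transcode_chunks` is a hypothetical helper name, the source encoding is assumed to be known already (detected or supplied externally), and the XML declaration is not rewritten.

```python
import codecs

def transcode_chunks(chunks, source_encoding, target_encoding="utf-8"):
    """Re-encode an iterable of byte chunks without loading it all at once."""
    decoder = codecs.getincrementaldecoder(source_encoding)()
    encoder = codecs.getincrementalencoder(target_encoding)()
    for chunk in chunks:
        # The decoder keeps incomplete byte sequences until the next chunk.
        text = decoder.decode(chunk)
        if text:
            yield encoder.encode(text)
    # Flush whatever the decoder and encoder still hold.
    tail = decoder.decode(b"", final=True)
    yield encoder.encode(tail, final=True)
```

Each yielded chunk could be fed to an incremental parser as it arrives, so chunk boundaries that split a multi-byte character are not a problem.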

 For non-seekable, non-peekable streams (if any), what you'd
 need would be a stream that consisted of a concatenation of the
 buffered data used for detection and the continuation of the stream.
 
 The application would read data out of the stream, and pass it to
 the detection. It then can process it in whatever manner it meant to
 process it in the first place.

Servus,
   Walter


Re: [Python-Dev] XML codec?

2007-11-12 Thread M.-A. Lemburg
On 2007-11-11 23:22, Martin v. Löwis wrote:
 First, XML-RPC is not the only mechanism using XML over a network
 connection. Second, you don't want to do this if you're dealing
 with several 100 MB of data just because you want to figure
 out the encoding.
 That's my original claim/question: what SPECIFIC application do
 you have in mind that transfers XML over a network and where you
 would want to have such a stream codec?
 XML-based web services used for business integration, e.g. based
 on ebXML.

 A common use case from our everyday consulting business is e.g.
 passing market and trading data to portfolio pricing web services.
 
 I still don't see the need for this feature from this example.
 First, in ebXML messaging, the messages are typically *not* large
 (i.e. much smaller than 100 MB). Furthermore, the typical processing
 of such a message would be to pass it directly to the XML parser,
 no need for the functionality under discussion.

I don't see the point in continuing this discussion. If you think
you know better, that's fine. Just please don't generalize this
to everyone else working with Python and XML.

 Right. However, I will remain opposed to adding this to the
 standard library until I see why one would absolutely need to
 have that. Not every piece of code that is useful in some
 application should be added to the standard library.
 Agreed, but the application space of web services is large
 enough to warrant this.
 
 If that was the case, wouldn't the existing Python web service
 libraries already include such a functionality?

No.

To finalize this:

We have a -1 from Martin and a +1 from Walter, Guido and myself.
Pretty clear vote if you ask me. I'd say we end the discussion here
and move on.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Nov 12 2007)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


 Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611


Re: [Python-Dev] XML codec?

2007-11-12 Thread Fred Drake
On Nov 12, 2007, at 8:16 AM, M.-A. Lemburg wrote:
 We have a -1 from Martin and a +1 from Walter, Guido and myself.
 Pretty clear vote if you ask me. I'd say we end the discussion here
 and move on.

If we're counting, you've got a -1 on the codec from me as well.   
Martin's right: there's no value to embedding the logic of auto- 
detection into the codec.  A function somewhere in the xml package is  
all that's warranted.


   -Fred

-- 
Fred Drake   fdrake at acm.org






Re: [Python-Dev] XML codec?

2007-11-12 Thread Walter Dörwald
Fred Drake wrote:

 On Nov 12, 2007, at 8:16 AM, M.-A. Lemburg wrote:
 We have a -1 from Martin and a +1 from Walter, Guido and myself.
 Pretty clear vote if you ask me. I'd say we end the discussion here
 and move on.
 
 If we're counting, you've got a -1 on the codec from me as well.   
 Martin's right: there's no value to embedding the logic of auto- 
 detection into the codec.

It isn't embedded. codecs.detect_xml_encoding() is callable without
any problems (though not documented).

 A function somewhere in the xml package is  
 all that's warranted.

Who would use such a function for what?

Servus,
   Walter



Re: [Python-Dev] XML codec?

2007-11-12 Thread Fred Drake
On Nov 12, 2007, at 8:56 AM, Walter Dörwald wrote:
 It isn't embedded. codecs.detect_xml_encoding() is callable without
 any problems (though not documented).

Not documented means not available, I think.

 Who would use such a function for what?

Being able to detect the encoding can be useful anytime you want  
information about a file, actually.  In particular, presenting  
encoding information in a user interface (yes, you can call that  
contrived, but some people want to be able to see such things, and for  
them it's a requirement).  If you want to parse the XML and re-encode,  
it's common to want to re-encode in the origin encoding; it's needed  
for that as well.  If you just want to toss the text into an editor,  
the encoding is also needed.  In that case, the codec approach *might*  
be acceptable (depending on the rest of the editor implementation),  
but the same re-encoding issue applies as well.

Simply, it's sometimes desired to know the encoding for purposes that  
don't require immediate decoding.  A function would be quite handy
in these cases.


   -Fred

-- 
Fred Drake   fdrake at acm.org






Re: [Python-Dev] XML codec?

2007-11-12 Thread Bill Janssen
 Simply, it's sometimes desired to know the encoding for purposes that
 don't require immediate decoding.  A function would be quite handy
 in these cases.

In os.path?  os.path.encoding(location)?

Bill



Re: [Python-Dev] XML codec?

2007-11-12 Thread Fred Drake
On Nov 12, 2007, at 10:54 AM, Bill Janssen wrote:
 In os.path?  os.path.encoding(location)?


I wasn't thinking it would be that general; determining the encoding  
for an arbitrary text file is a larger problem than it is for an XML  
file.

An implementation based strictly on the rules from the XML  
specification should be in the xml package (somewhere).  Determining  
that the file is an XML file is separate.

I doubt this really makes sense in os.path.


   -Fred

-- 
Fred Drake   fdrake at acm.org






Re: [Python-Dev] XML codec?

2007-11-12 Thread Andrew McNamara
On Nov 12, 2007, at 8:16 AM, M.-A. Lemburg wrote:
 We have a -1 from Martin and a +1 from Walter, Guido and myself.
 Pretty clear vote if you ask me. I'd say we end the discussion here
 and move on.

If we're counting, you've got a -1 on the codec from me as well.   
Martin's right: there's no value to embedding the logic of auto- 
detection into the codec.  A function somewhere in the xml package is  
all that's warranted.

I agree with Fred here - it should be a function in the xml package,
not a codec. -1

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/


Re: [Python-Dev] XML codec?

2007-11-11 Thread Martin v. Löwis
 I don't know. Is an XML document ill-formed if it doesn't contain an
 XML declaration, is not in UTF-8 or UTF-16, but there's external
 encoding info?

If there is external encoding info, matching the actual encoding,
it would be well-formed. Of course, preserving that information would
be up to the application.

 This looks good. Now we would have to extend the code to detect and
 replace the encoding in the XML declaration too.

I'm still opposed to making this a codec. Right - for a pure Python
solution, the processing of the XML declaration would still need to
be implemented.

 I think there could be a much simpler routine to have the same 
 effect. - if it's less than 4 bytes, answer "need more data".
 
 Can there be an XML document that is less than 4 bytes? I guess not.

No, the smallest document has exactly 4 characters (e.g. <f/>).
However, external entities may be smaller, such as "x".

 But anyway: would a Python implementation of these two functions
 (detect_encoding()/fix_encoding()) be accepted?

I could agree to a Python implementation of this algorithm as long
as it's not packaged as a codec.

Regards,
Martin



Re: [Python-Dev] XML codec?

2007-11-11 Thread Martin v. Löwis
 A non-seekable stream is not all that uncommon in network processing.
 Right. But what is the relationship to XML encoding autodetection?
 
 It pops up whenever you need to detect the encoding of the
 incoming XML data on the network connection, e.g. in XML RPC
 or data upload mechanisms.

No, it doesn't. For XML-RPC, you pass the XML payload of the
HTTP request to the XML parser, and it deals with the encoding.

 It is also not always feasible to load all data into memory, so
 some form of buffering must be used.

Again, I don't see the use case. For XML-RPC, it's very feasible
and standard procedure to have the entire document in memory
(in a processed form).

 This approach is also needed if you want to stack stream codecs
 (not sure whether this is still possible in Py3, but that's how
 I designed them for Py2).

The design of the Py2 codecs is fairly flawed, unfortunately.

Regards,
Martin


Re: [Python-Dev] XML codec?

2007-11-11 Thread M.-A. Lemburg
On 2007-11-11 14:51, Martin v. Löwis wrote:
 A non-seekable stream is not all that uncommon in network processing.
 Right. But what is the relationship to XML encoding autodetection?
 It pops up whenever you need to detect the encoding of the
 incoming XML data on the network connection, e.g. in XML RPC
 or data upload mechanisms.
 
 No, it doesn't. For XML-RPC, you pass the XML payload of the
 HTTP request to the XML parser, and it deals with the encoding.

First, XML-RPC is not the only mechanism using XML over a network
connection. Second, you don't want to do this if you're dealing
with several 100 MB of data just because you want to figure
out the encoding.

 It is also not always feasible to load all data into memory, so
 some form of buffering must be used.
 
 Again, I don't see the use case. For XML-RPC, it's very feasible
 and standard procedure to have the entire document in memory
 (in a processed form).

You may not see the use case, but that doesn't really mean
anything if the use cases exist in real life applications,
right ?!

 This approach is also needed if you want to stack stream codecs
 (not sure whether this is still possible in Py3, but that's how
 I designed them for Py2).
 
 The design of the Py2 codecs is fairly flawed, unfortunately.

Fortunately, this sounds like a fairly flawed argument to me ;-)

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] XML codec?

2007-11-11 Thread Martin v. Löwis
 First, XML-RPC is not the only mechanism using XML over a network
 connection. Second, you don't want to do this if you're dealing
 with several 100 MB of data just because you want to figure
 out the encoding.

That's my original claim/question: what SPECIFIC application do
you have in mind that transfers XML over a network and where you
would want to have such a stream codec?

If I have 100MB of XML in a file, using the detection API, I do

  f = open(filename)
  s = f.read(100)
  while True:
    coding = xml.utils.detect_encoding(s)
    if coding is not undetermined:
       break
    s += f.read(100)
  f.close()

Having the loop here is paranoia: in my application, I might be
able to know that 100 bytes are sufficient to determine the encoding
always.
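
The read loop above can be made runnable with a stand-in detector. This is a minimal sketch under stated assumptions: `xml.utils.detect_encoding` was only proposed in this thread, so `detect_encoding` here is a hypothetical stand-in that looks at byte-order marks only, and `sniff` and its `final` flag are names invented for this illustration.

```python
import codecs
import io

# Byte-order marks, longest first so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def detect_encoding(data, final=False):
    """Hypothetical stand-in: return an encoding name, or None for 'need more data'."""
    if len(data) < 4 and not final:
        return None
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return "utf-8"  # XML default when nothing else is known

def sniff(stream, chunk_size=100):
    """Read from a binary stream until the encoding is determined.

    Returns (encoding, bytes_consumed_so_far).
    """
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        buf += chunk
        # An empty chunk means end of input, so the detector must decide now.
        coding = detect_encoding(buf, final=not chunk)
        if coding is not None:
            return coding, buf
```

Passing `final=True` at end of input keeps the loop from spinning forever on a file shorter than four bytes, a case the original loop does not cover.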

 Again, I don't see the use case. For XML-RPC, it's very feasible
 and standard procedure to have the entire document in memory
 (in a processed form).
 
 You may not see the use case, but that doesn't really mean
 anything if the use cases exist in real life applications,
 right ?!

Right. However, I will remain opposed to adding this to the
standard library until I see why one would absolutely need to
have that. Not every piece of code that is useful in some
application should be added to the standard library.

Regards,
Martin


Re: [Python-Dev] XML codec?

2007-11-11 Thread M.-A. Lemburg
On 2007-11-11 18:56, Martin v. Löwis wrote:
 First, XML-RPC is not the only mechanism using XML over a network
 connection. Second, you don't want to do this if you're dealing
 with several 100 MB of data just because you want to figure
 out the encoding.
 
 That's my original claim/question: what SPECIFIC application do
 you have in mind that transfers XML over a network and where you
 would want to have such a stream codec?

XML-based web services used for business integration, e.g. based
on ebXML.

A common use case from our everyday consulting business is e.g.
passing market and trading data to portfolio pricing web services.

 If I have 100MB of XML in a file, using the detection API, I do
 
   f = open(filename)
   s = f.read(100)
   while True:
     coding = xml.utils.detect_encoding(s)
     if coding is not undetermined:
        break
     s += f.read(100)
   f.close()
 
 Having the loop here is paranoia: in my application, I might be
 able to know that 100 bytes are sufficient to determine the encoding
 always.

Doing the detection with files is easy, but that was never
questioned.

 Again, I don't see the use case. For XML-RPC, it's very feasible
 and standard procedure to have the entire document in memory
 (in a processed form).
 You may not see the use case, but that doesn't really mean
 anything if the use cases exist in real life applications,
 right ?!
 
 Right. However, I will remain opposed to adding this to the
 standard library until I see why one would absolutely need to
 have that. Not every piece of code that is useful in some
 application should be added to the standard library.

Agreed, but the application space of web services is large
enough to warrant this.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] XML codec?

2007-11-11 Thread Martin v. Löwis
 First, XML-RPC is not the only mechanism using XML over a network
 connection. Second, you don't want to do this if you're dealing
 with several 100 MB of data just because you want to figure
 out the encoding.
 That's my original claim/question: what SPECIFIC application do
 you have in mind that transfers XML over a network and where you
 would want to have such a stream codec?
 
 XML-based web services used for business integration, e.g. based
 on ebXML.
 
 A common use case from our everyday consulting business is e.g.
 passing market and trading data to portfolio pricing web services.

I still don't see the need for this feature from this example.
First, in ebXML messaging, the messages are typically *not* large
(i.e. much smaller than 100 MB). Furthermore, the typical processing
of such a message would be to pass it directly to the XML parser,
no need for the functionality under discussion.

 Right. However, I will remain opposed to adding this to the
 standard library until I see why one would absolutely need to
 have that. Not every piece of code that is useful in some
 application should be added to the standard library.
 
 Agreed, but the application space of web services is large
 enough to warrant this.

If that was the case, wouldn't the existing Python web service
libraries already include such a functionality?

Regards,
Martin


Re: [Python-Dev] XML codec?

2007-11-10 Thread Martin v. Löwis
   In case it isn't clear - this is exactly my view also.
 
 But is there an API to do it?  As MAL points out that API would have
 to return not an encoding, but a pair of an encoding and the rewound
 stream.  

The API wouldn't operate on streams. Instead, you pass a string, and
it either returns the detected encoding or an indication that
it needs more data. No streams.

 For non-seekable, non-peekable streams (if any), what you'd
 need would be a stream that consisted of a concatenation of the
 buffered data used for detection and the continuation of the stream.

The application would read data out of the stream, and pass it to
the detection. It then can process it in whatever manner it meant to
process it in the first place.
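
For a non-seekable stream, the two positions above meet in the middle: the application buffers what it fed to the detector and replays it ahead of the rest of the stream. A minimal sketch, assuming a hypothetical `detect` callable that returns an encoding name or `None` for "need more data"; `read_chunks` and `sniff_and_replay` are names made up for this illustration.

```python
import io
import itertools

def read_chunks(stream, size=100):
    """Yield chunks from a binary stream until it is exhausted."""
    while True:
        chunk = stream.read(size)
        if not chunk:
            return
        yield chunk

def sniff_and_replay(stream, detect, size=100):
    """Return (encoding, chunks); chunks replays the consumed prefix first."""
    chunks = read_chunks(stream, size)
    seen = []
    buf = b""
    encoding = None
    for chunk in chunks:
        seen.append(chunk)
        buf += chunk
        encoding = detect(buf)
        if encoding is not None:
            break
    # The generator resumes where the loop stopped, so nothing is read twice.
    return encoding, itertools.chain(seen, chunks)
```

If the stream ends before `detect` decides, the returned encoding is `None` and the caller has to fall back to a default.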

Regards,
Martin


Re: [Python-Dev] XML codec?

2007-11-10 Thread Martin v. Löwis
 A non-seekable stream is not all that uncommon in network processing.

Right. But what is the relationship to XML encoding autodetection?

Regards,
Martin



Re: [Python-Dev] XML codec?

2007-11-10 Thread Walter Dörwald
Martin v. Löwis wrote:

 So what if the unicode string doesn't start with an XML declaration?
 Will it add one?

 No.

 Ok. So the XML document would be ill-formed then unless the encoding is
 UTF-8, right?

I don't know. Is an XML document ill-formed if it doesn't contain an XML 
declaration, is not in UTF-8 or UTF-16, but there's
external encoding info? If it is, then yes, the document would be ill-formed.

 The point of this code is not just to return whether the string starts
 with "<?xml" or not. There are actually three cases:

 Still, it's overly complex for that matter:

   * The string does start with "<?xml"

 if s.startswith("<?xml"):
   return Yes

   * The string starts with a prefix of "<?xml", i.e. we can only
 decide if it starts with "<?xml" if we have more input.

 if "<?xml".startswith(s):
   return Maybe

   * The string definitely doesn't start with "<?xml".

 return No

This looks good. Now we would have to extend the code to detect and replace the 
encoding in the XML declaration too.
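
The three cases quoted above fit in one small function. A minimal sketch, where the function name and the use of `True`/`None`/`False` for yes/maybe/no are choices made for this illustration only:

```python
XML_DECL = b"<?xml"

def starts_with_xml_decl(data):
    """Return True, False, or None when more input is needed to decide."""
    if data.startswith(XML_DECL):
        return True
    if XML_DECL.startswith(data):
        # data is a proper prefix of "<?xml" (including the empty string).
        return None
    return False
```

Empty input counts as "maybe", because the empty string is a prefix of every string.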

 What bit fiddling are you referring to specifically that you think
 is better done in C than in Python?

 The code that checks the byte signature, i.e. the first part of
 detect_xml_encoding_str().

 I can't see any *bit* fiddling there, except for the bit mask of
 candidates. For the candidate list, I cannot quite understand why
 you need a bit mask at all, since the candidates are rarely
 overlapping.

I tried many variants and that seemed to be the most straightforward one.

 I think there could be a much simpler routine to have the same
 effect.
 - if it's less than 4 bytes, answer "need more data".

Can there be an XML document that is less than 4 bytes? I guess not.

 - otherwise, implement annex F literally. Make a dictionary
   of all prefixes that are exactly 4 bytes, i.e.

   prefixes4 = {"\x00\x00\xFE\xFF":"utf-32be", ...
   ..., "\0\x3c\0\x3f":"utf-16be"}

   try: return prefixes4[s[:4]]
   except KeyError: pass
   if s.startswith(codecs.BOM_UTF16_BE): return "utf-16be"
   ...
   if s.startswith("<?xml"):
      return get_encoding_from_declaration(s)
   return "utf-8"

get_encoding_from_declaration() would have to do the same yes/no/maybe decision.
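
A pure-Python version of the routine outlined above could look roughly like this. It is a sketch, not the code from the patch: the name `detect_encoding`, the `final` flag and the regular expression are assumptions made for this illustration, the prefix table follows Appendix F of the XML 1.0 specification, and EBCDIC is left out.

```python
import codecs
import re

# Four-byte signatures, with and without a byte-order mark.
PREFIXES4 = {
    b"\x00\x00\xfe\xff": "utf-32-be",
    b"\xff\xfe\x00\x00": "utf-32-le",
    b"\x00\x00\x00\x3c": "utf-32-be",
    b"\x3c\x00\x00\x00": "utf-32-le",
    b"\x00\x3c\x00\x3f": "utf-16-be",
    b"\x3c\x00\x3f\x00": "utf-16-le",
}

_DECL_ENCODING = re.compile(rb"""encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']""")

def detect_encoding(data, final=False):
    """Return an encoding name, or None if more data is needed."""
    if len(data) < 4 and not final:
        return None
    name = PREFIXES4.get(data[:4])
    if name is not None:
        return name
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(b"<?xml"):
        end = data.find(b"?>")
        if end < 0:
            # Declaration not complete yet: the same yes/no/maybe decision.
            return "utf-8" if final else None
        match = _DECL_ENCODING.search(data, 0, end)
        if match:
            return match.group(1).decode("ascii").lower()
    return "utf-8"
```

The dictionary lookup has to come before the two-byte UTF-16 checks, since the UTF-32 LE byte-order mark starts with the UTF-16 LE one.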

But anyway: would a Python implementation of these two functions 
(detect_encoding()/fix_encoding()) be accepted?

Servus,
   Walter




Re: [Python-Dev] XML codec?

2007-11-10 Thread Walter Dörwald
Martin v. Löwis wrote:
 And what do you do once you've detected the encoding? You decode the
 input, so why not combine both into an XML decoder?

 Because it is the XML parser that does the decoding, not the
 application. Also, it is better to provide functionality in
 a modular manner (i.e. encoding detection separately from
 encodings),

It is separate. Detection is done by codecs.detect_xml_encoding(), decoding is 
done by the codec.

 and leaving integration of modules to the application,
 in particular if the integration is trivial.

Servus,
   Walter




Re: [Python-Dev] XML codec?

2007-11-09 Thread Walter Dörwald
Martin v. Löwis wrote:

 ci = codecs.lookup("xml-auto-detect")
 p = expat.ParserCreate()
 e = "utf-32"
 s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e)
 s = ci.encode(ci.decode(s)[0], encoding="utf-8")[0]
 p.Parse(s, True)
 
 So how come the document being parsed is recognized as UTF-8?

Because you can force the encoder to use a specified encoding. If you do
this and the unicode string starts with an XML declaration, the encoder
will put the specified encoding into the declaration:

import codecs

e = codecs.getencoder("xml-auto-detect")
print e(u"<?xml version='1.0' encoding='iso-8859-1'?><foo/>",
        encoding="utf-8")[0]

This prints:
<?xml version='1.0' encoding='utf-8'?><foo/>
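
The declaration rewrite described above can also be sketched as a plain function, independent of any codec. The name `fix_encoding` comes from this thread, but the implementation below is only an assumption about how it might work. One choice made here: a declaration that has no encoding gets one added, while a document without a declaration is returned unchanged.

```python
import re

_DECL = re.compile(
    r"""(<\?xml\s+version\s*=\s*(?:"[^"]*"|'[^']*'))"""  # "<?xml version=..."
    r"""(?:\s+encoding\s*=\s*(?:"[^"]*"|'[^']*'))?"""     # optional encoding
)

def fix_encoding(text, encoding):
    """Rewrite the encoding in a leading XML declaration, if there is one."""
    match = _DECL.match(text)
    if match is None:
        return text  # no declaration: nothing is added
    rest = text[match.end():]
    return "%s encoding='%s'%s" % (match.group(1), encoding, rest)
```

Anything after the encoding, such as a `standalone` attribute, is carried over untouched.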

 OK, so should I put the C code into a _xml module?
 
 I don't see the need for C code at all.

Doing the bit fiddling for
Modules/_codecsmodule.c::detect_xml_encoding_str() in C felt like the
right thing to do.

Servus,
   Walter



Re: [Python-Dev] XML codec?

2007-11-09 Thread Walter Dörwald
Adam Olsen wrote:

 On 11/8/07, Walter Dörwald wrote:
 [...]
 Furthermore encoding-detection might be part of the responsibility of
 the XML parser, but this decoding phase is totally distinct from the
 parsing phase, so why not put the decoding into a common library?
 I would not object to that - just to expose it as a codec. Adding it
 to the XML library is fine, IMO.
 But it does make sense as a codec. The decoding phase of an XML parser
 has to turn a byte stream into a unicode stream. That's the job of a codec.
 
 Yes, an XML parser should be able to use UTF-8, UTF-16, UTF-32, etc
 codecs to do the encoding.  There's no need to create a magical
 mystery codec to pick out which though.

So the code is good, if it is inside an XML parser, and it's bad if it
is inside a codec?

 It's not even sufficient for
 XML:
 
 1) round-tripping a file should be done in the original encoding.
 Containing the auto-detected encoding within a codec doesn't let you
 see what it picked.

The chosen encoding is available from the incremental encoder:

import codecs

e = codecs.getincrementalencoder("xml-auto-detect")()
e.encode(u"<?xml version='1.0' encoding='utf-32'?><foo/>", True)
print e.encoding

This prints utf-32.

 2) the encoding may be specified externally from the file/stream[1].
 The xml parser needs to handle these out-of-band encodings anyway.

It does. You can pass an encoding to the stateless decoder, the
incremental decoder and the streamreader. It will then use this encoding
instead of the one detected from the byte stream. It will even put the
correct encoding into the XML declaration (if there is one):

import codecs

d = codecs.getdecoder("xml-auto-detect")
print d("<?xml version='1.0' encoding='iso-8859-1'?><foo/>",
        encoding="utf-8")[0]

This prints:
<?xml version='1.0' encoding='utf-8'?><foo/>

Servus,
   Walter


Re: [Python-Dev] XML codec?

2007-11-09 Thread Martin v. Löwis
 Yes, an XML parser should be able to use UTF-8, UTF-16, UTF-32, etc
 codecs to do the encoding.  There's no need to create a magical
 mystery codec to pick out which though.
 
 So the code is good, if it is inside an XML parser, and it's bad if it
 is inside a codec?

Exactly so. This functionality just *isn't* a codec - there is no
encoding. Instead, it is an algorithm for *detecting* an encoding.

Regards,
Martin


Re: [Python-Dev] XML codec?

2007-11-09 Thread Martin v. Löwis
 Because you can force the encoder to use a specified encoding. If you do
 this and the unicode string starts with an XML declaration

So what if the unicode string doesn't start with an XML declaration?
Will it add one? If so, what version number will it use?

 OK, so should I put the C code into a _xml module?
 I don't see the need for C code at all.
 
 Doing the bit fiddling for
 Modules/_codecsmodule.c::detect_xml_encoding_str() in C felt like the
 right thing to do.

Hmm. I don't think a sequence like

+    if (strlen>0)
+    {
+        if (*str++ != '<')
+            return 1;
+        if (strlen>1)
+        {
+            if (*str++ != '?')
+                return 1;
+            if (strlen>2)
+            {
+                if (*str++ != 'x')
+                    return 1;
+                if (strlen>3)
+                {
+                    if (*str++ != 'm')
+                        return 1;
+                    if (strlen>4)
+                    {
+                        if (*str++ != 'l')
+                            return 1;
+                        if (strlen>5)
+                        {
+                            if (*str != ' ' && *str != '\t' && *str != '\r' && *str != '\n')
+                                return 1;

is well-maintainable C. I feel it is much better writing

  if not s.startswith("<?xml"):
      return 1

What bit fiddling are you referring to specifically that you think
is better done in C than in Python?

Regards,
Martin


Re: [Python-Dev] XML codec?

2007-11-09 Thread Walter Dörwald
Martin v. Löwis wrote:

 Because you can force the encoder to use a specified encoding. If you do
 this and the unicode string starts with an XML declaration
 
 So what if the unicode string doesn't start with an XML declaration?
 Will it add one?

No.

 If so, what version number will it use?

If we added this, we could give the encoder constructor an extra
argument, version, defaulting to '1.0'.

 OK, so should I put the C code into a _xml module?
 I don't see the need for C code at all.
 Doing the bit fiddling for
 Modules/_codecsmodule.c::detect_xml_encoding_str() in C felt like the
 right thing to do.
 
 Hmm. I don't think a sequence like
 
 +    if (strlen>0)
 +    {
 +        if (*str++ != '<')
 +            return 1;
 +        if (strlen>1)
 +        {
 +            if (*str++ != '?')
 +                return 1;
 +            if (strlen>2)
 +            {
 +                if (*str++ != 'x')
 +                    return 1;
 +                if (strlen>3)
 +                {
 +                    if (*str++ != 'm')
 +                        return 1;
 +                    if (strlen>4)
 +                    {
 +                        if (*str++ != 'l')
 +                            return 1;
 +                        if (strlen>5)
 +                        {
 +                            if (*str != ' ' && *str != '\t' && *str != '\r' && *str != '\n')
 +                                return 1;
 
 is well-maintainable C. I feel it is much better writing
 
   if not s.startswith("<?xml"):
       return 1

The point of this code is not just to return whether the string starts
with <?xml or not. There are actually three cases:
  * The string does start with <?xml
  * The string starts with a prefix of <?xml, i.e. we can only
    decide if it starts with <?xml if we have more input.
  * The string definitely doesn't start with <?xml.

 What bit fiddling are you referring to specifically that you think
 is better done in C than in Python?

The code that checks the byte signature, i.e. the first part of
detect_xml_encoding_str().

Servus,
   Walter






Re: [Python-Dev] XML codec?

2007-11-09 Thread Walter Dörwald
Martin v. Löwis wrote:
 Yes, an XML parser should be able to use UTF-8, UTF-16, UTF-32, etc
 codecs to do the encoding.  There's no need to create a magical
 mystery codec to pick out which though.
 So the code is good, if it is inside an XML parser, and it's bad if it
 is inside a codec?
 
 Exactly so. This functionality just *isn't* a codec - there is no
 encoding. Instead, it is an algorithm for *detecting* an encoding.

And what do you do once you've detected the encoding? You decode the
input, so why not combine both into an XML decoder?

Servus,
   Walter



Re: [Python-Dev] XML codec?

2007-11-09 Thread M.-A. Lemburg
On 2007-11-09 14:10, Walter Dörwald wrote:
 Martin v. Löwis wrote:
 Yes, an XML parser should be able to use UTF-8, UTF-16, UTF-32, etc
 codecs to do the encoding.  There's no need to create a magical
 mystery codec to pick out which though.
 So the code is good, if it is inside an XML parser, and it's bad if it
 is inside a codec?
 Exactly so. This functionality just *isn't* a codec - there is no
 encoding. Instead, it is an algorithm for *detecting* an encoding.
 
 And what do you do once you've detected the encoding? You decode the
 input, so why not combine both into an XML decoder?

FWIW: I'm +1 on adding such a codec.

It makes working with XML data a lot easier: you simply don't have to
bother with the encoding of the XML data anymore and can just let the
codec figure out the details. The XML parser can then work directly
on the Unicode data.

Whether it needs to be in C or not is another question (I would have
done this in Python since performance is not really an issue), but since
the code is already written, why not use it ?

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Nov 09 2007)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


 Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611


Re: [Python-Dev] XML codec?

2007-11-09 Thread Walter Dörwald
Walter Dörwald wrote:
 Martin v. Löwis wrote:
 Yes, an XML parser should be able to use UTF-8, UTF-16, UTF-32, etc
 codecs to do the encoding.  There's no need to create a magical
 mystery codec to pick out which though.
 So the code is good, if it is inside an XML parser, and it's bad if it
 is inside a codec?
 Exactly so. This functionality just *isn't* a codec - there is no
 encoding. Instead, it is an algorithm for *detecting* an encoding.
 
 And what do you do once you've detected the encoding? You decode the
 input, so why not combine both into an XML decoder?

In fact, we already have such a codec. The utf-16 decoder looks at the
first two bytes and then decides to forward the rest to either a
utf-16-be or a utf-16-le decoder.
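
The BOM dispatch described here can be observed with the standard codecs alone. A minimal sketch, in which the sample text and variable names are arbitrary choices made for illustration:

```python
import codecs

text = u"<foo/>"

# The same text serialized in each byte order, prefixed with the matching BOM.
little = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
big = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# The generic "utf-16" decoder inspects the BOM, picks the byte order,
# and drops the BOM from the decoded result.
decoded_little = little.decode("utf-16")
decoded_big = big.decode("utf-16")
```

Both calls return the original text, although the byte order was never passed in explicitly.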

Servus,
   Walter


Re: [Python-Dev] XML codec?

2007-11-09 Thread Walter Dörwald
M.-A. Lemburg wrote:

 On 2007-11-09 14:10, Walter Dörwald wrote:
 Martin v. Löwis wrote:
 Yes, an XML parser should be able to use UTF-8, UTF-16, UTF-32, etc
 codecs to do the encoding.  There's no need to create a magical
 mystery codec to pick out which though.
 So the code is good, if it is inside an XML parser, and it's bad if it
 is inside a codec?
 Exactly so. This functionality just *isn't* a codec - there is no
 encoding. Instead, it is an algorithm for *detecting* an encoding.
 And what do you do once you've detected the encoding? You decode the
 input, so why not combine both into an XML decoder?
 
 FWIW: I'm +1 on adding such a codec.
 
 It makes working with XML data a lot easier: you simply don't have to
 bother with the encoding of the XML data anymore and can just let the
 codec figure out the details. The XML parser can then work directly
 on the Unicode data.

Exactly. I have a version of sgmlop lying around that does that.

 Whether it needs to be in C or not is another question (I would have
 done this in Python since performance is not really an issue), but since
 the code is already written, why not use it ?

Servus,
   Walter


Re: [Python-Dev] XML codec?

2007-11-09 Thread Fred Drake
On Nov 9, 2007, at 8:22 AM, M.-A. Lemburg wrote:
 FWIW: I'm +1 on adding such a codec.

I'm undecided, and really don't feel strongly either way.

 It makes working with XML data a lot easier: you simply don't have to
 bother with the encoding of the XML data anymore and can just let the
 codec figure out the details. The XML parser can then work directly
 on the Unicode data.

Which is fine if you want to write a new parser.  I've no interest in  
that myself.

 Whether it needs to be in C or not is another question (I would have
 done this in Python since performance is not really an issue), but since
 the code is already written, why not use it ?

The reason not to use C is the usual one: code written in Python is
more portable across implementations. This makes it more useful with
Jython, IronPython, and PyPy.

That seems a pretty good reason to me.


   -Fred

-- 
Fred Drake   fdrake at acm.org






Re: [Python-Dev] XML codec?

2007-11-09 Thread Martin v. Löwis
 So what if the unicode string doesn't start with an XML declaration?
 Will it add one?
 
 No.

Ok. So the XML document would be ill-formed then unless the encoding is
UTF-8, right?

 The point of this code is not just to return whether the string starts
 with <?xml or not. There are actually three cases:

Still, it's overly complex for that matter:

   * The string does start with <?xml

   if s.startswith("<?xml"):
       return Yes

   * The string starts with a prefix of <?xml, i.e. we can only
     decide if it starts with <?xml if we have more input.

   if "<?xml".startswith(s):
       return Maybe

   * The string definitely doesn't start with <?xml.

   return No
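
A runnable version of the three-way check might look like the following. This is only a sketch: the function name and the YES/MAYBE/NO constants are hypothetical, input is assumed to be a byte string, and the whitespace test that the C code performs after the marker is left out.

```python
YES, MAYBE, NO = "yes", "maybe", "no"

def starts_with_xml_decl(s, marker=b"<?xml"):
    """Classify the bytes read so far from the start of a document."""
    if s.startswith(marker):
        return YES      # the complete marker has been seen
    if marker.startswith(s):
        return MAYBE    # s is a proper prefix of the marker: need more input
    return NO           # s already diverges from the marker
```

An empty input yields MAYBE, which is the answer an incremental decoder wants before any bytes have arrived.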

 What bit fiddling are you referring to specifically that you think
 is better done in C than in Python?
 
 The code that checks the byte signature, i.e. the first part of
 detect_xml_encoding_str().

I can't see any *bit* fiddling there, except for the bit mask of
candidates. For the candidate list, I cannot quite understand why
you need a bit mask at all, since the candidates are rarely
overlapping.

I think there could be a much simpler routine to have the same
effect.
- if it's less than 4 bytes, answer "need more data".
- otherwise, implement annex F literally. Make a dictionary
  of all prefixes that are exactly 4 bytes, i.e.

  prefixes4 = {"\x00\x00\xFE\xFF":"utf-32be", ...
  ...,  "\0\x3c\0\x3f":"utf-16be"}

  try: return prefixes4[s[:4]]
  except KeyError: pass
  if s.startswith(codecs.BOM_UTF16_BE):return "utf-16be"
  ...
  if s.startswith("<?xml"):
     return get_encoding_from_declaration(s)
  return "utf-8"
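
Filled in, the routine outlined above could look roughly like this. It is a sketch, not the code from the patch: detect_encoding, NEED_MORE_DATA and the table names are hypothetical, the byte patterns are the ones listed in Appendix F of the XML 1.0 recommendation, and the declaration is read with a simple regular expression that assumes the whole declaration is already in the buffer.

```python
import codecs
import re

NEED_MORE_DATA = None  # hypothetical sentinel for "fewer than 4 bytes so far"

# BOM cases. UTF-32 LE must be tested before UTF-16 LE (both start FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# BOM-less cases: the first four bytes of "<?xm" (or "<" in UTF-32).
_PREFIXES4 = {
    b"\x00\x00\x00\x3c": "utf-32-be",
    b"\x3c\x00\x00\x00": "utf-32-le",
    b"\x00\x3c\x00\x3f": "utf-16-be",
    b"\x3c\x00\x3f\x00": "utf-16-le",
    b"\x4c\x6f\xa7\x94": "cp037",  # EBCDIC
}

_DECL_ENCODING = re.compile(
    br"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']""")

def detect_encoding(s):
    if len(s) < 4:
        return NEED_MORE_DATA
    for bom, name in _BOMS:
        if s.startswith(bom):
            return name
    try:
        return _PREFIXES4[s[:4]]
    except KeyError:
        pass
    # ASCII-compatible input: trust the declaration if there is one.
    match = _DECL_ENCODING.match(s)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"
```

The sketch stops at the first signal it finds, so for BOM or multi-byte input it does not go on to read a more specific name from the declaration.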

Regards,
Martin


Re: [Python-Dev] XML codec?

2007-11-09 Thread Martin v. Löwis
 And what do you do once you've detected the encoding? You decode the
 input, so why not combine both into an XML decoder?

Because it is the XML parser that does the decoding, not the
application. Also, it is better to provide functionality in
a modular manner (i.e. encoding detection separately from
encodings), and leaving integration of modules to the application,
in particular if the integration is trivial.

Regards,
Martin



Re: [Python-Dev] XML codec?

2007-11-09 Thread Adam Olsen
On Nov 9, 2007 6:10 AM, Walter Dörwald wrote:

 Martin v. Löwis wrote:
  Yes, an XML parser should be able to use UTF-8, UTF-16, UTF-32, etc
  codecs to do the encoding.  There's no need to create a magical
  mystery codec to pick out which though.
  So the code is good, if it is inside an XML parser, and it's bad if it
  is inside a codec?
 
  Exactly so. This functionality just *isn't* a codec - there is no
  encoding. Instead, it is an algorithm for *detecting* an encoding.

 And what do you do once you've detected the encoding? You decode the
 input, so why not combine both into an XML decoder?

It seems to me that parsing XML requires 3 steps:
1) determine encoding
2) decode byte stream
3) parse XML (including handling of character references)

All an xml codec does is make the first part a side-effect of the
second part.  Rather than this:

encoding = detect_encoding(raw_data)
decoded_data = raw_data.decode(encoding)
tree = parse_xml(decoded_data, encoding)  # Verifies encoding

You'd have this:

e = codecs.getincrementaldecoder("xml-auto-detect")()
decoded_data = e.decode(raw_data, True)
tree = parse_xml(decoded_data, e.encoding)  # Verifies encoding

It's clear to me that detecting an encoding is actually the simplest
part of all this (so long as there's an API to do it!)  Putting it
inside a codec seems like the wrong subdivision of responsibility.

(An example using streams would end up closer, but it still seems
wrong to me.  Encoding detection is always one way, while codecs are
always two way (even if lossy.))

-- 
Adam Olsen, aka Rhamphoryncus


Re: [Python-Dev] XML codec?

2007-11-09 Thread Martin v. Löwis
 It makes working with XML data a lot easier: you simply don't have to
 bother with the encoding of the XML data anymore and can just let the
 codec figure out the details. The XML parser can then work directly
 on the Unicode data.

Having the functionality indeed makes things easier. However, I don't
find

  s.decode(xml.detect_encoding(s))

particularly more difficult than

  s.decode("xml-auto-detection")

 Whether it needs to be in C or not is another question (I would have
 done this in Python since performance is not really an issue), but since
 the code is already written, why not use it ?

It's a maintenance issue.

Regards,
Martin



Re: [Python-Dev] XML codec?

2007-11-09 Thread Martin v. Löwis
 In fact, we already have such a codec. The utf-16 decoder looks at the
 first two bytes and then decides to forward the rest to either a
 utf-16-be or a utf-16-le decoder.

That's different. UTF-16 is a proper encoding that is just specified
to use the BOM. xml-auto-detection is not an encoding.

Regards,
Martin


Re: [Python-Dev] XML codec?

2007-11-09 Thread Martin v. Löwis
 It's clear to me that detecting an encoding is actually the simplest
 part of all this (so long as there's an API to do it!)  Putting it
 inside a codec seems like the wrong subdivision of responsibility.

In case it isn't clear - this is exactly my view also.

Regards,
Martin


Re: [Python-Dev] XML codec?

2007-11-09 Thread M.-A. Lemburg
Martin v. Löwis wrote:
 It makes working with XML data a lot easier: you simply don't have to
 bother with the encoding of the XML data anymore and can just let the
 codec figure out the details. The XML parser can then work directly
 on the Unicode data.
 
 Having the functionality indeed makes things easier. However, I don't
 find
 
   s.decode(xml.detect_encoding(s))
 
 particularly more difficult than
 
   s.decode("xml-auto-detection")

Not really, but the codec has more control over what happens to
the stream, i.e. it's easier to implement look-ahead in the codec
than to do the detection and then try to push the bytes back onto
the stream (which may or may not be possible depending on the
nature of the stream).

 Whether it needs to be in C or not is another question (I would have
 done this in Python since performance is not really an issue), but since
 the code is already written, why not use it ?
 
 It's a maintenance issue.

I'm sure Walter will do a great job in maintaining the code :-)

Regards,
-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] XML codec?

2007-11-09 Thread Adam Olsen
On Nov 9, 2007 3:59 PM, M.-A. Lemburg wrote:
 Martin v. Löwis wrote:
  It makes working with XML data a lot easier: you simply don't have to
  bother with the encoding of the XML data anymore and can just let the
  codec figure out the details. The XML parser can then work directly
  on the Unicode data.
 
  Having the functionality indeed makes things easier. However, I don't
  find
 
s.decode(xml.detect_encoding(s))
 
  particularly more difficult than
 
    s.decode("xml-auto-detection")

 Not really, but the codec has more control over what happens to
 the stream, i.e. it's easier to implement look-ahead in the codec
 than to do the detection and then try to push the bytes back onto
 the stream (which may or may not be possible depending on the
 nature of the stream).

io.BufferedReader() standardizes a .peek() API, making it trivial.  I
don't see why we couldn't require it.

(As an aside, .peek() will fail to do what detect_encodings() needs if
BufferedReader's buffer size is too small.  I do wonder if that
limitation is appropriate.)
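
A small sketch of the peek-based approach, using io.BytesIO as a stand-in for a real byte stream (the variable names are made up for illustration):

```python
import codecs
import io

raw = io.BytesIO(codecs.BOM_UTF8 + b"<foo/>")
reader = io.BufferedReader(raw)

# peek() returns buffered bytes without advancing the read position.
# It may return fewer or more bytes than requested, so slice explicitly.
head = reader.peek(4)[:4]
has_utf8_bom = head.startswith(codecs.BOM_UTF8)

# Nothing was consumed: a full read still starts at the first byte.
everything = reader.read()
```

Because peek() can return fewer bytes than asked for, a detector built on it would still need a "need more data" path.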


-- 
Adam Olsen, aka Rhamphoryncus


Re: [Python-Dev] XML codec?

2007-11-09 Thread Martin v. Löwis
 Not really, but the codec has more control over what happens to
 the stream, i.e. it's easier to implement look-ahead in the codec
 than to do the detection and then try to push the bytes back onto
 the stream (which may or may not be possible depending on the
 nature of the stream).

YAGNI.

Regards,
Martin



Re: [Python-Dev] XML codec?

2007-11-09 Thread Stephen J. Turnbull
Martin v. Löwis writes:

   It's clear to me that detecting an encoding is actually the simplest
   part of all this (so long as there's an API to do it!)  Putting it
   inside a codec seems like the wrong subdivision of responsibility.
  
  In case it isn't clear - this is exactly my view also.

But is there an API to do it?  As MAL points out that API would have
to return not an encoding, but a pair of an encoding and the rewound
stream.  For non-seekable, non-peekable streams (if any), what you'd
need would be a stream that consisted of a concatenation of the
buffered data used for detection and the continuation of the stream.
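
Such a concatenating stream is not hard to sketch. The class and function names below are hypothetical, only read() is implemented, and io.BytesIO stands in for a socket-like source that cannot seek:

```python
import io

class PrefixedStream(object):
    """Replay bytes already consumed for detection, then continue the stream."""

    def __init__(self, head, stream):
        self._head = head
        self._stream = stream

    def read(self, size=-1):
        if size is None or size < 0:
            data, self._head = self._head, b""
            return data + self._stream.read()
        # Serve from the buffered head first, then top up from the stream.
        data, self._head = self._head[:size], self._head[size:]
        if len(data) < size:
            data += self._stream.read(size - len(data))
        return data


def sniff(stream, nbytes=4):
    """Return (head, replacement_stream); head is what a detector inspects."""
    head = stream.read(nbytes)
    return head, PrefixedStream(head, stream)


source = io.BytesIO(b"<?xml version='1.0'?><foo/>")
head, rewound = sniff(source)
```

The caller passes head to the detection function and hands rewound to whatever consumes the document afterwards.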



Re: [Python-Dev] XML codec?

2007-11-09 Thread M.-A. Lemburg
Martin v. Löwis wrote:
 Not really, but the codec has more control over what happens to
 the stream, i.e. it's easier to implement look-ahead in the codec
 than to do the detection and then try to push the bytes back onto
 the stream (which may or may not be possible depending on the
 nature of the stream).
 
 YAGNI.

A non-seekable stream is not all that uncommon in network processing.
I usually end up either reading the complete data into memory
or doing the needed buffering by hand.

Regards,
-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Nov 10 2007)


Re: [Python-Dev] XML codec?

2007-11-08 Thread Walter Dörwald
Martin v. Löwis wrote:
 Any comments?
 
 -1. First, (as already discussed on the tracker,) "xml" is a bad name
 for an encoding. How would you encode "Hello" in "xml"?

Then how about the suggested "xml-auto-detect"?

 Then, I'd claim that the problem that the codec solves doesn't really
 exist. IOW, most XML parsers implement the auto-detection of encodings,
 anyway, and this is where architecturally this functionality belongs.

But not all XML parsers support all encodings. The XML codec makes it
trivial to add this support to an existing parser.

Furthermore encoding-detection might be part of the responsibility of
the XML parser, but this decoding phase is totally distinct from the
parsing phase, so why not put the decoding into a common library?

 For a text editor, much more useful than a codec would be a routine
 (say, xml.detect_encoding) which performs the auto-detection.

There's a (currently undocumented) codecs.detect_xml_encoding() in the
patch. We could document this function and make it public. But if
there's no codec that uses it, this function IMHO doesn't belong in the
codecs module. Should this function be available from xml/__init__.py or
should we put it into something like xml/utils.py?

 Finally, I think the codec is incorrect. When saving XML to a file
 (e.g. in a text editor), there should rarely be encoding errors, since
 one could use character references in many cases.

This requires some intelligent fiddling with the errors attribute of the
encoder.

 Also, the XML
 spec talks about detecting EBCDIC, which I believe your implementation
 doesn't.

Correct, but as long as Python doesn't have an EBCDIC codec, that won't
help much. Adding *detection* of EBCDIC to detect_xml_encoding() is
rather simple though.

Servus,
   Walter



Re: [Python-Dev] XML codec?

2007-11-08 Thread Martin v. Löwis
 Then how about the suggested "xml-auto-detect"?

That is better.

 Then, I'd claim that the problem that the codec solves doesn't really
 exist. IOW, most XML parsers implement the auto-detection of encodings,
 anyway, and this is where architecturally this functionality belongs.
 
 But not all XML parsers support all encodings. The XML codec makes it
 trivial to add this support to an existing parser.

I would like to question this claim. Can you give an example of a parser
that doesn't support a specific encoding and where adding such a codec
solves that problem?

In particular, why would that parser know how to process Python Unicode
strings?

 Furthermore encoding-detection might be part of the responsibility of
 the XML parser, but this decoding phase is totally distinct from the
 parsing phase, so why not put the decoding into a common library?

I would not object to that - just to expose it as a codec. Adding it
to the XML library is fine, IMO.

 There's a (currently undocumented) codecs.detect_xml_encoding() in the
 patch. We could document this function and make it public. But if
 there's no codec that uses it, this function IMHO doesn't belong in the
 codecs module. Should this function be available from xml/__init__.py or
 should we put it into something like xml/utils.py?

Either - or.

 Finally, I think the codec is incorrect. When saving XML to a file
 (e.g. in a text editor), there should rarely be encoding errors, since
 one could use character references in many cases.
 
 This requires some intelligent fiddling with the errors attribute of the
 encoder.

Much more than that, I think - you cannot use a character reference
in an XML Name. So the codec would have to parse the output stream
to know whether or not a character reference could be used.
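
The limitation is easy to see with the existing "xmlcharrefreplace" error handler. A short sketch, in which the element names are invented for the example:

```python
text = u"<price>\u20ac 5</price>"

# In character data, an unencodable character can become a character reference.
encoded = text.encode("ascii", "xmlcharrefreplace")

# The error handler knows nothing about XML structure, so it does the same
# inside a Name, which yields output that is no longer well-formed XML.
bad = u"<pre\u00efs/>".encode("ascii", "xmlcharrefreplace")
```

The first result is still a valid document fragment; the second contains a character reference inside the tag name, which a parser will reject.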

 Correct, but as long as Python doesn't have an EBCDIC codec, that won't
 help much. Adding *detection* of EBCDIC to detect_xml_encoding() is
 rather simple though.

But it does! cp037 is EBCDIC, and supported by Python.
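
A quick check with the stdlib codec, assuming the signature 4C 6F A7 94 that Appendix F of the XML recommendation lists for EBCDIC:

```python
declaration = u"<?xml version='1.0' encoding='cp037'?>"
ebcdic = declaration.encode("cp037")

# The first four bytes are the EBCDIC form of "<?xm".
signature = ebcdic[:4]
round_trip = ebcdic.decode("cp037")
```

So a detector only needs to compare the first four bytes against that signature before handing the data to the cp037 decoder.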

Regards,
Martin


Re: [Python-Dev] XML codec?

2007-11-08 Thread Martin v. Löwis
 ci = codecs.lookup("xml-auto-detect")
 p = expat.ParserCreate()
 e = "utf-32"
 s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e)
 s = ci.encode(ci.decode(s)[0], encoding="utf-8")[0]
 p.Parse(s, True)

So how come the document being parsed is recognized as UTF-8?

 OK, so should I put the C code into a _xml module?

I don't see the need for C code at all.

Regards,
Martin


Re: [Python-Dev] XML codec?

2007-11-08 Thread Walter Dörwald
Martin v. Löwis wrote:

 Then how about the suggested "xml-auto-detect"?
 
 That is better.

OK.

 Then, I'd claim that the problem that the codec solves doesn't really
 exist. IOW, most XML parsers implement the auto-detection of encodings,
 anyway, and this is where architecturally this functionality belongs.
 But not all XML parsers support all encodings. The XML codec makes it
 trivial to add this support to an existing parser.
 
 I would like to question this claim. Can you give an example of a parser
 that doesn't support a specific encoding

It seems that e.g. expat doesn't support UTF-32:

from xml.parsers import expat

p = expat.ParserCreate()
e = "utf-32"
s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e)
p.Parse(s, True)

This fails with:

Traceback (most recent call last):
   File "gurk.py", line 6, in <module>
     p.Parse(s, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 1

Replace "utf-32" with "utf-16" and the problem goes away.

 and where adding such a codec
 solves that problem?
 
 In particular, why would that parser know how to process Python Unicode
 strings?

It doesn't have to. You can use an XML encoder to reencode the unicode 
string into bytes (forcing an encoding that the parser knows):

import codecs
from xml.parsers import expat

ci = codecs.lookup("xml-auto-detect")
p = expat.ParserCreate()
e = "utf-32"
s = (u"<?xml version='1.0' encoding=%r?><foo/>" % e).encode(e)
s = ci.encode(ci.decode(s)[0], encoding="utf-8")[0]
p.Parse(s, True)
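
Without the codec from the patch, the same transcoding step can be approximated with standard tools when the source encoding is already known. In this sketch transcode_to_utf8 is a hypothetical helper, and the declaration is rewritten with a simple regular expression rather than a real XML tokenizer:

```python
import re
from xml.parsers import expat

def transcode_to_utf8(data, source_encoding):
    """Decode with a known encoding, fix the declaration, re-encode as UTF-8."""
    text = data.decode(source_encoding)
    # Rewrite only the encoding pseudo-attribute of a leading XML declaration.
    text = re.sub(
        r"""^(<\?xml[^>]*?encoding\s*=\s*)(["'])[^"']*\2""",
        lambda m: m.group(1) + m.group(2) + "utf-8" + m.group(2),
        text,
        count=1,
    )
    return text.encode("utf-8")

original = u"<?xml version='1.0' encoding='utf-32'?><foo/>".encode("utf-32")

names = []
parser = expat.ParserCreate()
parser.StartElementHandler = lambda name, attrs: names.append(name)
parser.Parse(transcode_to_utf8(original, "utf-32"), True)
```

After the call, names holds the element names that expat reported for the transcoded document.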

 Furthermore encoding-detection might be part of the responsibility of
 the XML parser, but this decoding phase is totally distinct from the
 parsing phase, so why not put the decoding into a common library?
 
 I would not object to that - just to expose it as a codec. Adding it
 to the XML library is fine, IMO.

But it does make sense as a codec. The decoding phase of an XML parser 
has to turn a byte stream into a unicode stream. That's the job of a codec.

 There's a (currently undocumented) codecs.detect_xml_encoding() in the
 patch. We could document this function and make it public. But if
 there's no codec that uses it, this function IMHO doesn't belong in the
 codecs module. Should this function be available from xml/__init__.py or
 should we put it into something like xml/utils.py?
 
 Either - or.

OK, so should I put the C code into a _xml module?

 Finally, I think the codec is incorrect. When saving XML to a file
 (e.g. in a text editor), there should rarely be encoding errors, since
 one could use character references in many cases.
 This requires some intelligent fiddling with the errors attribute of the
 encoder.
 
 Much more than that, I think - you cannot use a character reference
 in an XML Name. So the codec would have to parse the output stream
 to know whether or not a character reference could be used.

That's what I meant with "intelligent fiddling". But I agree this is way 
beyond what a text editor should do. AFAIK it is way beyond what 
existing text editors do. However using the XML codec would at least 
guarantee that the encoding specified in the XML declaration and the 
encoding used for encoding the file stay consistent.

 Correct, but as long as Python doesn't have an EBCDIC codec, that won't
 help much. Adding *detection* of EBCDIC to detect_xml_encoding() is
 rather simple though.
 
 But it does! cp037 is EBCDIC, and supported by Python.

I didn't know that. I'm going to update the patch.

Servus,
Walter


[Python-Dev] XML codec?

2007-11-07 Thread Walter Dörwald
I have a patch ready (http://bugs.python.org/issue1399) that adds an XML
codec. This codec implements encoding detection as specified in
http://www.w3.org/TR/2004/REC-xml-20040204/#sec-guessing and could be
used for the decoding phase of an XML parser. Other use cases are:

The codec could be used for transcoding an XML input before passing it
to the real parser, if the parser itself doesn't support the encoding in
question.

A text editor could use the codec to decode an XML file. When the user
changes the XML declaration and resaves the file, it would be saved in
the correct encoding.

I'd like to have this codec in 2.6 and 3.0.

Any comments?

Servus,
   Walter


Re: [Python-Dev] XML codec?

2007-11-07 Thread Martin v. Löwis
 Any comments?

-1. First, (as already discussed on the tracker,) "xml" is a bad name
for an encoding. How would you encode "Hello" in "xml"?

Then, I'd claim that the problem that the codec solves doesn't really
exist. IOW, most XML parsers implement the auto-detection of encodings,
anyway, and this is where architecturally this functionality belongs.
For a text editor, much more useful than a codec would be a routine
(say, xml.detect_encoding) which performs the auto-detection.

Finally, I think the codec is incorrect. When saving XML to a file
(e.g. in a text editor), there should rarely be encoding errors, since
one could use character references in many cases. Also, the XML
spec talks about detecting EBCDIC, which I believe your implementation
doesn't.

Regards,
Martin