On Tue, 18 May 2010, jaayer wrote:



============ Forwarded message ============
From : jaayer<jaa...@zoho.com>
To :  <alexandre.ber...@inria.fr>
Date : Tue, 18 May 2010 16:30:06 -0700
Subject : Re: Decoding bug with XMLParser ?
============ Forwarded message ============

---- On Tue, 18 May 2010 02:29:18 -0700 Alexandre Bergel 
<alexandre.ber...@inria.fr> wrote ----

To give a bit of context, the problem is:
-=-=-=-=-=-=-=-=-=-=-=-=
exampleEncodedXML
    ^'<?xml version="1.0" encoding="UTF-8"?> <test-data>&#8230;</test-data> '

testDecodingCharacters
    | xmlDocument element |
    "XMLTokenizer testDecodingCharacters"
    xmlDocument := XMLDOMParser parseDocumentFrom: self exampleEncodedXML readStream.
    element := xmlDocument firstTagNamed: #'test-data'.

    self assert: element contentString first codePoint = 8230
-=-=-=-=-=-=-=-=-=-=-=-=
#testDecodingCharacters goes yellow

Thinking of it, it's not really an encoding problem, rather a bug in the entity->character conversion. I guess there should be a similar test where there is an actual ellipsis character in the xml, instead of the entity.

Any idea how your test can go green?
And now I realize our server will not be able to connect outside its DMZ, so I won't be able to use the fix :D

DMZ?

Cheers,
Alexandre

Character references like the one above are handled by #nextCharReference. It reads the number after the "&#" or "&#x" prefix and then sends #value: to the class Unicode with that number as the argument. If you evaluate "(Unicode value: 8230) codePoint" in a workspace with cmd-p, you will see that the resulting code point is not what you would expect. For me it was "1069555750". The same behavior results when creating a Unicode character with #charFromUnicode:. Unless Unicode>>value: and Unicode>>charFromUnicode: are being used incorrectly, I am not sure that this is a bug, or at least a bug in XML-Support.

(I am working on adding full DTD support with validation and refactoring and 
re-engineering the parser at the moment, which is why minor releases have 
slowed to a trickle. I will take a closer look at how character encoding is 
handled in the process.)


Another "hard to quote" message, but I hope my answer will be clear.
The "problem" is that in Pharo the leadingChar for Unicode characters is still 255. This was changed to 0 in Squeak 4.1. So in Squeak 4.1:
(Unicode value: 8230) codePoint. "===> 8230"

While in Pharo it's:
(Unicode value: 8230) codePoint. "===> 1069555750"
(Character value: 1069555750) charCode. "===> 8230"
(Character value: 1069555750) leadingChar. "===> 255"

So using #charCode instead of #codePoint is the solution.
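
To illustrate why the two selectors disagree: the numbers above are consistent with the leadingChar sitting in the bits above the low 22 bits of the character value. The workspace sketch below uses plain integer arithmetic only. The 22-bit layout is an assumption inferred from the values shown above, not quoted from the image source:

```smalltalk
"Assumption: value = (leadingChar bitShift: 22) + charCode, inferred from the numbers above."
(255 bitShift: 22) + 8230.        "===> 1069555750"
1069555750 bitAnd: 16r3FFFFF.     "===> 8230, the charCode"
1069555750 bitShift: -22.         "===> 255, the leadingChar"
```

With a leadingChar of 0, as in Squeak 4.1, the first expression reduces to 8230, which would explain why #codePoint and #charCode agree there.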


Levente


_______________________________________________
Pharo-project mailing list
Pharo-project@lists.gforge.inria.fr
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
