On Tue, 18 May 2010, jaayer wrote:
============ Forwarded message ============
From : jaayer<jaa...@zoho.com>
To : <alexandre.ber...@inria.fr>
Date : Tue, 18 May 2010 16:30:06 -0700
Subject : Re: Decoding bug with XMLParser ?
============ Forwarded message ============
---- On Tue, 18 May 2010 02:29:18 -0700 Alexandre Bergel
<alexandre.ber...@inria.fr> wrote ----
To give a bit of context, the problem is:
-=-=-=-=-=-=-=-=-=-=-=-=
exampleEncodedXML
^'<?xml version="1.0" encoding="UTF-8"?>
<test-data>&#8230;</test-data>
'
testDecodingCharacters
| xmlDocument element |
"XMLTokenizer testDecodingCharacters"
xmlDocument := XMLDOMParser parseDocumentFrom: self exampleEncodedXML readStream.
element := xmlDocument firstTagNamed: #'test-data'.
self assert: element contentString first codePoint = 8230
-=-=-=-=-=-=-=-=-=-=-=-=
#testDecodingCharacters goes yellow
Thinking of it, it's not really an encoding problem, rather a bug in
the entity->character conversion. I guess there should be a similar
test where there is an actual ellipsis character in the xml, instead
of the entity.
Any idea how your test can go green?
And now I realize our server will not be able to connect outside its
DMZ, so I won't be able to use the fix :D
DMZ ?
Cheers,
Alexandre
Character references like the one above are handled by #nextCharReference. That method reads the number after the "&#"
or "&#x" prefix and then sends #value: to the class Unicode with that number as the argument. If you evaluate the following
code in a workspace with cmd-p: "(Unicode value: 8230) codePoint", you will see that the resulting code point is not what you
would expect. For me it was "1069555750". The same behavior results when creating a Unicode character with #charFromUnicode:.
Unless Unicode>>value: and Unicode>>charFromUnicode: are being used incorrectly, I am not sure that this is a bug, or at
least not a bug in XML-Support.
(I am working on adding full DTD support with validation and refactoring and
re-engineering the parser at the moment, which is why minor releases have
slowed to a trickle. I will take a closer look at how character encoding is
handled in the process.)
Another "hard to quote" message, but I hope my answer will be clear.
The "problem" is that in Pharo the leadingChar for unicode characters is
still 255. This was changed in Squeak 4.1 to 0. So in Squeak 4.1:
(Unicode value: 8230) codePoint. "===> 8230"
While in Pharo it's:
(Unicode value: 8230) codePoint. "===> 1069555750"
(Character value: 1069555750) charCode. "===> 8230"
(Character value: 1069555750) leadingChar. "===> 255"
So using #charCode instead of #codePoint is the solution.
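For anyone wondering where 1069555750 comes from, here is a rough workspace sketch of the arithmetic. It assumes the leadingChar is stored in the bits above the low 22 bits of the character value. That layout is an assumption, not something stated in this thread, but it is consistent with the numbers printed above. Only plain Integer messages are used, so it does not depend on Character or Unicode at all:

```smalltalk
| packed |
"Assumption: leadingChar occupies the bits above the low 22 bits."
packed := (255 bitShift: 22) + 8230.
packed. "===> 1069555750"
"Masking off the low 22 bits recovers the value that #charCode answers."
packed bitAnd: 16r3FFFFF. "===> 8230"
"Shifting the low 22 bits away recovers the value that #leadingChar answers."
(packed bitShift: -22) bitAnd: 16rFF. "===> 255"
```

If that holds, the assertion in #testDecodingCharacters would presumably compare "element contentString first charCode" against 8230 rather than using #codePoint.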
Levente
_______________________________________________
Pharo-project mailing list
Pharo-project@lists.gforge.inria.fr
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project