Re: XmlPullParser parses strings with platform's default charset

Juergen Donnerstag Tue, 05 Jun 2012 13:54:24 -0700

Hi Martin,

XmlReader reads the markup file, interprets <?xml encoding ..> if
present, and converts the markup into a String, which in Java is
always Unicode (UTF-16 internally). XmlPullParser uses the data
provided by XmlReader.


To support unit testing, XPP provides a parse(String) method which
wraps the string in an InputStream, so that tests do not bypass
XmlReader.

No xml decl (or no encoding) results in XmlReader using the JVM
default, which is the OS default if not provided via -Dfile.encoding=
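
A minimal sketch of that effect, outside of Wicket: the string is
encoded and decoded with the same charset, as happens when both
parse(String) and XmlReader fall back to the JVM default. The class
and method names are made up for illustration and are not Wicket API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetRoundTripDemo {

    // Hypothetical helper: encode the text to bytes and decode it again
    // with the same charset, mimicking "string -> bytes -> reader".
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // a-umlaut, o-umlaut, u-umlaut, Euro sign (ASCII-only source)
        String original = "\u00e4\u00f6\u00fc\u20ac";

        System.out.println("JVM default: " + Charset.defaultCharset());

        // UTF-8 can represent every character, so nothing is lost.
        String viaUtf8 = roundTrip(original, StandardCharsets.UTF_8);
        System.out.println("UTF-8 round trip equal: " + original.equals(viaUtf8));

        // ISO-8859-1 (latin1) has no Euro sign; getBytes() substitutes
        // the charset's replacement byte for it, so the text changes.
        String viaLatin1 = roundTrip(original, StandardCharsets.ISO_8859_1);
        System.out.println("latin1 round trip equal: " + original.equals(viaLatin1));
    }
}
```

With a latin1 default the Euro sign is already lost when the bytes are
created, before XmlReader sees anything.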

And since you never know on which OS in which country devs are
building or testing, providing the UTF encoded value is the safe way
of doing it.
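
For the test literal itself, one way to read this is the Unicode-escape
approach mentioned further down in the thread: write the characters as
\uXXXX escapes so the source file is plain ASCII and does not depend on
the compiler's or the OS's encoding. A small sketch (the class name is
hypothetical):

```java
public class UnicodeEscapeDemo {

    // a-umlaut, o-umlaut, u-umlaut and the Euro sign, written as
    // Unicode escapes so the source file contains only ASCII.
    static final String EXPECTED = "\u00e4\u00f6\u00fc\u20ac";

    public static void main(String[] args) {
        // Print each character's code point to show what the
        // compiler actually stored in the String.
        for (int i = 0; i < EXPECTED.length(); i++) {
            System.out.printf("U+%04X%n", (int) EXPECTED.charAt(i));
        }
    }
}
```

This only protects the literal in the source file; it does not change
which charset is used when the string is turned into bytes.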

We may replace parse(string) with parse(string, "encoding"). An
encoding parameter seems to be supported by all underlying methods,
but it is preset to null (JVM default) right now. That may help you
solve your problem, and make other devs aware that the encoding might
need to change.

Make sense?

Juergen

On Tue, Jun 5, 2012 at 9:54 AM, Juergen Donnerstag
<[email protected]> wrote:
> I'll have a look later today.
>
> Juergen
>
> On Mon, Jun 4, 2012 at 3:37 PM, Martin Grigorov
> <[email protected]> wrote:
>> Hi,
>>
>> I'm not quite sure but I think there is a bug in
>> org.apache.wicket.markup.parser.XmlPullParser#parse(CharSequence)
>> because it uses
>> string.toString().getBytes() to create a ByteArrayInputStream.
>>
>> org.apache.wicket.util.tester.BaseWicketTester#getTagById(String) uses
>> lastResponseAsString to feed XmlPullParser but lastResponseAsString's
>> encoding depends on
>> org.apache.wicket.settings.IRequestCycleSettings#getResponseRequestEncoding().
>> I.e. the string may be encoded in UTF-8 but later XmlPullParser will
>> try to process its bytes as Windows-1252 for example.
>>
>>
>> Here is a small patch that exposes the problem:
>> diff --git a/wicket-core/src/test/java/org/apache/wicket/markup/parser/XmlPullParserTest.java
>> b/wicket-core/src/test/java/org/apache/wicket/markup/p
>> index 2e26d05..15fb496 100644
>> --- a/wicket-core/src/test/java/org/apache/wicket/markup/parser/XmlPullParserTest.java
>> +++ b/wicket-core/src/test/java/org/apache/wicket/markup/parser/XmlPullParserTest.java
>> @@ -191,6 +191,13 @@ public class XmlPullParserTest extends Assert
>>                assertNull(parser.getEncoding());
>>                tag = parser.nextTag();
>>                assertNull(tag);
>> +
>> +               String expected = "äöü€";
>> +               parser.parse("<dummy>"+expected+"</dummy>");
>> +               XmlTag openTag = parser.nextTag();
>> +               XmlTag closeTag = parser.nextTag();
>> +               String actual = parser.getInput(openTag.getPos() + openTag.getLength(), closeTag.getPos()).toString();
>> +               assertEquals(expected, actual);
>>        }
>>
>>        /**
>>
>> Apply this patch and run the test with -Dfile.encoding=latin1. It will
>> fail in the comparison. Run it with UTF-8 and it will pass.
>>
>> I remember Juergen had a similar problem with one of the Wicket core
>> tests that uses the Euro sign in an assertion, and he fixed it by
>> using a Unicode-escaped value (\uabcd).
>> But in this case the response is encoded with whatever is configured
>> at IRequestCycleSettings#getResponseRequestEncoding() and
>> XmlPullParser tries to read it with the platform default encoding.
>>
>> Is this a bug, and how can we solve it?
>>
>> --
>> Martin Grigorov
>> jWeekend
>> Training, Consulting, Development
>> http://jWeekend.com
