Re: XmlPullParser parses strings with platform's default charset

Juergen Donnerstag Thu, 07 Jun 2012 00:14:51 -0700

There is no stable solution unless we create an artificial one
(via the XML prolog and the encoding parameter), since the default
charset differs per OS and per country. We could make that easier
for testing purposes, though, e.g. by allowing a test-specific
default, different from the OS default, for the XML prolog and the
encoding parameter.
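
To illustrate why both sides need to agree on the encoding, here is a
minimal, self-contained Java sketch. It does not use Wicket at all; the
class CharsetRoundTrip and its roundTrip helper are made-up names for
illustration only. It encodes the test string from Martin's patch with
one charset and decodes it with another, which is roughly what happens
when getBytes() and the reader disagree.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Wicket.
class CharsetRoundTrip {

    // Encode the text with one charset and decode the bytes with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        // "äöü€" written as Unicode escapes so the source file encoding does not matter.
        String expected = "\u00e4\u00f6\u00fc\u20ac";

        // Same charset on both sides: the string survives.
        String matched = roundTrip(expected, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(expected.equals(matched));

        // UTF-8 bytes read as ISO-8859-1: the characters come back mangled.
        String mismatched = roundTrip(expected, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(expected.equals(mismatched));
    }
}
```

The first comparison holds and the second does not, independent of
-Dfile.encoding, because neither side falls back to the platform default.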


Juergen

On Wed, Jun 6, 2012 at 12:32 PM, Martin Grigorov <[email protected]> wrote:
> Hi Juergen,
>
> Thanks for the explanation!
>
> I've tried all combinations of the following variables:
> - -Dfile.encoding=latin1
> - with and without <?xml encoding="utf-8"?> in the String to parse
> - parse(new ByteArrayInputStream(string.toString().getBytes("UTF-8")), null);
> - parse(new ByteArrayInputStream(string.toString().getBytes()), "UTF-8");
> - parse(new ByteArrayInputStream(string.toString().getBytes("UTF-8")), "UTF-8");
>
> The test passes only when the String has the prolog with the
> encoding and
> parse(new ByteArrayInputStream(string.toString().getBytes("UTF-8")), "UTF-8");
> is used. Any other combination produces mangled characters and the
> assertion fails.
>
> So I cannot find a stable solution that will work on any environment.
> We can use IRequestCycleSettings#getResponseRequestEncoding() for the
> charset but if there is no XML prolog or it has no encoding attr then
> the test fails.
>
> On Tue, Jun 5, 2012 at 11:53 PM, Juergen Donnerstag
> <[email protected]> wrote:
>> Hi Martin,
>>
>> XmlReader reads the markup file, interprets <?xml encoding ..> if
>> present, and converts the markup into a String, which in Java is
>> always Unicode (UTF-16 internally). XmlPullParser uses the data
>> provided by XmlReader.
>>
>> To support unit testing, XPP provides a parse(String) method which
>> wraps the string in an InputStream, so that XmlReader is not
>> circumvented in tests.
>>
>> No XML declaration (or no encoding attribute) results in XmlReader
>> using the JVM default, which is the OS default unless overridden via
>> -Dfile.encoding=
>>
>> And since you never know on which OS and in which country devs are
>> building or testing, providing the UTF-encoded value is the safe way
>> of doing it.
>>
>> We could replace parse(string) with parse(string, "encoding"). The
>> encoding parameter seems to be supported by all underlying methods,
>> but is preset to null (JVM default) right now. That may help you solve
>> your problem and make other devs aware that the encoding might need
>> to change.
>>
>> Does that make sense?
>>
>> Juergen
>>
>> On Tue, Jun 5, 2012 at 9:54 AM, Juergen Donnerstag
>> <[email protected]> wrote:
>>> I'll have a look later today.
>>>
>>> Juergen
>>>
>>> On Mon, Jun 4, 2012 at 3:37 PM, Martin Grigorov
>>> <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> I'm not quite sure but I think there is a bug in
>>>> org.apache.wicket.markup.parser.XmlPullParser#parse(CharSequence)
>>>> because it uses
>>>> string.toString().getBytes() to create a ByteArrayInputStream.
>>>>
>>>> org.apache.wicket.util.tester.BaseWicketTester#getTagById(String) uses
>>>> lastResponseAsString to feed XmlPullParser but lastResponseAsString's
>>>> encoding depends on
>>>> org.apache.wicket.settings.IRequestCycleSettings#getResponseRequestEncoding().
>>>> I.e. the string may be encoded in UTF-8 but later XmlPullParser will
>>>> try to process its bytes as Windows-1252 for example.
>>>>
>>>>
>>>> Here is a small patch that exposes the problem:
>>>> diff --git a/wicket-core/src/test/java/org/apache/wicket/markup/parser/XmlPullParserTest.java b/wicket-core/src/test/java/org/apache/wicket/markup/p
>>>> index 2e26d05..15fb496 100644
>>>> --- a/wicket-core/src/test/java/org/apache/wicket/markup/parser/XmlPullParserTest.java
>>>> +++ b/wicket-core/src/test/java/org/apache/wicket/markup/parser/XmlPullParserTest.java
>>>> @@ -191,6 +191,13 @@ public class XmlPullParserTest extends Assert
>>>>                assertNull(parser.getEncoding());
>>>>                tag = parser.nextTag();
>>>>                assertNull(tag);
>>>> +
>>>> +               String expected = "äöü€";
>>>> +               parser.parse("<dummy>"+expected+"</dummy>");
>>>> +               XmlTag openTag = parser.nextTag();
>>>> +               XmlTag closeTag = parser.nextTag();
>>>> +               String actual = parser.getInput(openTag.getPos() + openTag.getLength(), closeTag.getPos()).toString();
>>>> +               assertEquals(expected, actual);
>>>>        }
>>>>
>>>>        /**
>>>>
>>>> Apply this patch and run the test with -Dfile.encoding=latin1. It will
>>>> fail in the comparison. Run it with UTF-8 and it will pass.
>>>>
>>>> I remember Juergen had a similar problem with one of the Wicket core
>>>> tests that uses the Euro sign in an assertion, and he fixed it by
>>>> using a Unicode-escaped value (\uabcd).
>>>> But in this case the response is encoded with whatever is configured
>>>> at IRequestCycleSettings#getResponseRequestEncoding() and
>>>> XmlPullParser tries to read it with the platform default encoding.
>>>>
>>>> Is this a bug, and how can we solve it?
>>>>
>>>> --
>>>> Martin Grigorov
>>>> jWeekend
>>>> Training, Consulting, Development
>>>> http://jWeekend.com
>
>
>
> --
> Martin Grigorov
> jWeekend
> Training, Consulting, Development
> http://jWeekend.com
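
As a rough illustration of the behaviour Juergen describes for XmlReader
(use the encoding from <?xml encoding ..> if present, otherwise fall back
to a default), here is a small standalone Java sketch. It is not Wicket's
implementation: the class PrologEncodingSketch, the method detectEncoding,
the regular expression and the 200-byte look-ahead are all assumptions
made for this example. It also assumes an ASCII-compatible encoding, so
that the declaration itself can be read as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the actual XmlReader code.
class PrologEncodingSketch {

    // Matches an XML declaration at the start of the input and captures
    // the value of its encoding pseudo-attribute.
    private static final Pattern XML_DECL = Pattern.compile(
        "^\\s*<\\?xml\\s[^>]*?encoding\\s*=\\s*[\"']([A-Za-z0-9._-]+)[\"'][^>]*\\?>");

    // Returns the declared encoding, or the given fallback if there is no
    // declaration or the declaration has no encoding attribute.
    static String detectEncoding(byte[] bytes, String fallback) {
        // Only peek at the first bytes; ISO-8859-1 maps every byte to a char.
        int len = Math.min(bytes.length, 200);
        String head = new String(bytes, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = XML_DECL.matcher(head);
        return m.find() ? m.group(1) : fallback;
    }

    public static void main(String[] args) {
        byte[] withProlog = "<?xml encoding=\"utf-8\"?><dummy/>".getBytes(StandardCharsets.UTF_8);
        byte[] withoutProlog = "<dummy/>".getBytes(StandardCharsets.UTF_8);

        System.out.println(detectEncoding(withProlog, "fallback"));
        System.out.println(detectEncoding(withoutProlog, "fallback"));
    }
}
```

With a fallback that a test can set explicitly, instead of the JVM
default, the no-prolog case would no longer depend on the OS.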
