Re: XmlPullParser parses strings with platform's default charset

Martin Grigorov Thu, 07 Jun 2012 00:44:55 -0700

If I understand you correctly, you suggest using
IRequestCycleSettings#getResponseRequestEncoding() for:
- String.getBytes(HERE)
- new InputStreamReader(stream, HERE) - this is in XmlReader
- (the XML prolog may be ignored entirely)


I think this should work.
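
A minimal sketch of the idea, outside Wicket: the same charset is passed
to both String.getBytes(...) and new InputStreamReader(stream, ...). The
class CharsetRoundTrip and its roundTrip helper are hypothetical names
used only for illustration, and the "configured" charset stands in for
whatever IRequestCycleSettings#getResponseRequestEncoding() would return.
It is not the actual XmlPullParser or XmlReader code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Wicket.
class CharsetRoundTrip {

    /** Encodes text with one charset, then decodes the bytes with another. */
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), decodeWith)) {
            char[] buf = new char[256];
            int n;
            while ((n = reader.read(buf)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Same characters as in the test patch, written as Unicode escapes
        // so the source file's own encoding does not matter.
        String markup = "<dummy>\u00e4\u00f6\u00fc\u20ac</dummy>";

        // Assumption: stand-in for the configured response encoding.
        Charset configured = StandardCharsets.UTF_8;

        // Same charset on both sides: the text survives.
        System.out.println(markup.equals(roundTrip(markup, configured, configured)));

        // Mismatched charsets: the non-ASCII characters are mangled.
        System.out.println(markup.equals(roundTrip(markup, configured, StandardCharsets.ISO_8859_1)));
    }
}
```

The first line printed is true and the second is false. The second case
mirrors the situation where the bytes are UTF-8 but the reader falls
back to a single-byte platform default.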

On Thu, Jun 7, 2012 at 10:14 AM, Juergen Donnerstag wrote:
> And there is no stable solution unless we create an artificial one
> (via the XML prolog and the encoding parameter), since the default
> charset differs per OS and per country. We could make that easier
> for testing purposes, though, e.g. by allowing a test-specific
> default, different from the OS default, for the XML prolog and the
> encoding parameter.
>
> Juergen
>
> On Wed, Jun 6, 2012 at 12:32 PM, Martin Grigorov wrote:
>> Hi Juergen,
>>
>> Thanks for the explanation!
>>
>> I've tried all combinations of the following variables:
>> - -Dfile.encoding=latin1
>> - with and without <?xml encoding="utf-8"?> in the String to parse
>> - parse(new ByteArrayInputStream(string.toString().getBytes("UTF-8")), null);
>> - parse(new ByteArrayInputStream(string.toString().getBytes()), "UTF-8");
>> - parse(new ByteArrayInputStream(string.toString().getBytes("UTF-8")), "UTF-8");
>>
>> The test passes only when the String has the prolog with the
>> encoding and
>> parse(new ByteArrayInputStream(string.toString().getBytes("UTF-8")), "UTF-8");
>> is used. Any other combination produces mangled characters and the
>> assertion fails.
>>
>> So I cannot find a stable solution that will work in any environment.
>> We can use IRequestCycleSettings#getResponseRequestEncoding() for the
>> charset, but if there is no XML prolog or it has no encoding attribute
>> then the test fails.
>>
>> On Tue, Jun 5, 2012 at 11:53 PM, Juergen Donnerstag wrote:
>>> Hi Martin,
>>>
>>> XmlReader reads the markup file, interprets <?xml encoding ..> if
>>> present, and converts the markup into a String, which in Java is
>>> always UTF-16 encoded internally. XmlPullParser uses the data
>>> provided by XmlReader.
>>>
>>> To support unit testing, XPP provides a parse(String) method which
>>> wraps the string in an InputStream, so that tests do not bypass
>>> XmlReader.
>>>
>>> No XML declaration (or no encoding) results in XmlReader using the
>>> JVM default, which is the OS default unless overridden via
>>> -Dfile.encoding=
>>>
>>> And since you never know on which OS and in which country devs are
>>> building or testing, providing the UTF-encoded value is the safe way
>>> of doing it.
>>>
>>> We may replace parse(string) with parse(string, "encoding"), which
>>> seems to be supported by all underlying methods, although they are
>>> preset with null (JVM default) right now. That may help you solve
>>> your problem, and make other devs aware that the encoding might need
>>> to change.
>>>
>>> Make sense?
>>>
>>> Juergen
>>>
>>> On Tue, Jun 5, 2012 at 9:54 AM, Juergen Donnerstag wrote:
>>>> I'll have a look later today.
>>>>
>>>> Juergen
>>>>
>>>> On Mon, Jun 4, 2012 at 3:37 PM, Martin Grigorov wrote:
>>>>> Hi,
>>>>>
>>>>> I'm not quite sure, but I think there is a bug in
>>>>> org.apache.wicket.markup.parser.XmlPullParser#parse(CharSequence)
>>>>> because it uses string.toString().getBytes() to create a
>>>>> ByteArrayInputStream.
>>>>>
>>>>> org.apache.wicket.util.tester.BaseWicketTester#getTagById(String) uses
>>>>> lastResponseAsString to feed XmlPullParser but lastResponseAsString's
>>>>> encoding depends on
>>>>> org.apache.wicket.settings.IRequestCycleSettings#getResponseRequestEncoding().
>>>>> I.e. the string may be encoded in UTF-8 but later XmlPullParser will
>>>>> try to process its bytes as Windows-1252 for example.
>>>>>
>>>>>
>>>>> Here is a small patch that exposes the problem:
>>>>> diff --git a/wicket-core/src/test/java/org/apache/wicket/markup/parser/XmlPullParserTest.java b/wicket-core/src/test/java/org/apache/wicket/markup/p
>>>>> index 2e26d05..15fb496 100644
>>>>> --- a/wicket-core/src/test/java/org/apache/wicket/markup/parser/XmlPullParserTest.java
>>>>> +++ b/wicket-core/src/test/java/org/apache/wicket/markup/parser/XmlPullParserTest.java
>>>>> @@ -191,6 +191,13 @@ public class XmlPullParserTest extends Assert
>>>>>                assertNull(parser.getEncoding());
>>>>>                tag = parser.nextTag();
>>>>>                assertNull(tag);
>>>>> +
>>>>> +               String expected = "äöü€";
>>>>> +               parser.parse("<dummy>"+expected+"</dummy>");
>>>>> +               XmlTag openTag = parser.nextTag();
>>>>> +               XmlTag closeTag = parser.nextTag();
>>>>> +               String actual = parser.getInput(openTag.getPos() + openTag.getLength(), closeTag.getPos()).toString();
>>>>> +               assertEquals(expected, actual);
>>>>>        }
>>>>>
>>>>>        /**
>>>>>
>>>>> Apply this patch and run the test with -Dfile.encoding=latin1. It will
>>>>> fail in the comparison. Run it with UTF-8 and it will pass.
>>>>>
>>>>> I remember Juergen had a similar problem with one of the Wicket core
>>>>> tests that uses the Euro sign in an assertion, and he fixed it by
>>>>> using a Unicode-escaped value (\uabcd).
>>>>> But in this case the response is encoded with whatever is configured
>>>>> at IRequestCycleSettings#getResponseRequestEncoding() and
>>>>> XmlPullParser tries to read it with the platform default encoding.
>>>>>
>>>>> Is this a bug, and how can we solve it?
>>>>>
>>>>> --
>>>>> Martin Grigorov
>>>>> jWeekend
>>>>> Training, Consulting, Development
>>>>> http://jWeekend.com



-- 
Martin Grigorov
jWeekend
Training, Consulting, Development
http://jWeekend.com
