Re: XmlPullParser parses strings with platform's default charset

Juergen Donnerstag Thu, 07 Jun 2012 01:13:12 -0700

yes, except that the xml prolog should not be ignored, but if not
present, the default provided (HERE) will be used instead of the
JVM/OS default.
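
A minimal sketch of that fallback order, assuming a hypothetical helper
(PrologCharsetSketch, resolve and decode are illustrative names, not Wicket's
XmlReader): the encoding named in the XML prolog wins, and only when it is
missing does the caller-supplied default apply. The sketch ignores byte order
marks and encodings that are not ASCII-compatible, such as UTF-16.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; this is not Wicket's XmlReader.
final class PrologCharsetSketch {
    // Matches the encoding attribute of a leading <?xml ... ?> declaration.
    private static final Pattern ENCODING = Pattern.compile(
        "^\\s*<\\?xml\\b[^>]*?\\bencoding\\s*=\\s*[\"']([A-Za-z0-9._-]+)[\"'][^>]*\\?>");

    // Returns the prolog's charset if one is declared, else the given fallback.
    // Charset.forName throws if the declared name is unknown to the JVM.
    static Charset resolve(byte[] markup, Charset fallback) {
        // The declaration itself is ASCII, so ISO-8859-1 decodes the head losslessly.
        int len = Math.min(markup.length, 200);
        String head = new String(markup, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(head);
        return m.find() ? Charset.forName(m.group(1)) : fallback;
    }

    // Decodes the markup with the resolved charset, never the JVM/OS default.
    static String decode(byte[] markup, Charset fallback) {
        return new String(markup, resolve(markup, fallback));
    }

    public static void main(String[] args) {
        byte[] noProlog = "<div>\u00e4\u00f6\u00fc</div>".getBytes(StandardCharsets.UTF_8);
        byte[] withProlog = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><div/>"
            .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(resolve(noProlog, StandardCharsets.UTF_8));
        System.out.println(resolve(withProlog, StandardCharsets.UTF_8));
    }
}
```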


And I'm not sure whether
IRequestCycleSettings#getResponseRequestEncoding() or UTF-8 is the right
default. Every String in Java is Unicode (UTF-16 internally). So if you do
  private final String tag = "<div>äöü</div>";
  parse(tag);
the String itself is already Unicode.
Hence everywhere within Wicket every String is Unicode. Internally we
don't care about anything else, except when data is transferred over
the wire back to the user (response encoding) or files are
read/written.
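
To illustrate where the charset actually matters, here is a small sketch
(CharsetBoundarySketch and roundTrip are made-up names for illustration). It
uses String.getBytes(Charset) and new String(byte[], Charset) to stand in for
the write and read sides; an InputStreamReader with an explicit charset
behaves the same way on the read side. The text survives only when both sides
agree on the charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: shows that the charset matters at the byte boundary,
// not inside the String.
final class CharsetBoundarySketch {
    // Encodes the text with one charset and decodes the bytes with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String tag = "<div>\u00e4\u00f6\u00fc</div>";
        // Same charset on both sides: the text is unchanged.
        System.out.println(tag.equals(
            roundTrip(tag, StandardCharsets.UTF_8, StandardCharsets.UTF_8)));
        // Mismatched charsets: each UTF-8 byte is read as a separate Latin-1 char.
        System.out.println(tag.equals(
            roundTrip(tag, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1)));
    }
}
```

Calling getBytes() without an argument picks the platform default for
encodeWith, which is why the result differs between machines.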

The reason why I used the Unicode escape once was that the source file
has no XML prolog. Hence if the source file gets copied to a server
elsewhere (a server with a different charset), Maven will not build and
test it properly, because the compiler assumes a certain encoding of
the source code file. And äöü are not safe in source code files, but
\uXXXX is.
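
As a sketch of that approach (UnicodeEscapeSketch and toEscapes are
hypothetical names used only for illustration): a literal written with
escapes is pure ASCII in the source file, so the compiler's assumed source
encoding cannot change it. The helper shows one way to produce the escaped
form of an existing string.

```java
// Illustrative only: ASCII-safe literals for non-ASCII test data.
final class UnicodeEscapeSketch {
    // a-umlaut, o-umlaut, u-umlaut, euro sign, written as ASCII-only escapes.
    static final String EXPECTED = "\u00e4\u00f6\u00fc\u20ac";

    // Rewrites every non-ASCII UTF-16 code unit as a backslash-u escape,
    // so the result can be pasted into a Java string literal.
    static String toEscapes(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toEscapes("<dummy>" + EXPECTED + "</dummy>"));
    }
}
```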

Juergen

On Thu, Jun 7, 2012 at 9:44 AM, Martin Grigorov <[email protected]> wrote:
> If I understand you correctly you suggest to use
> IRequestCycleSettings#getResponseRequestEncoding() for:
> - String.getBytes(HERE)
> - new InputStreamReader(stream, HERE) - this is in XmlReader
> - (the XML prolog may be ignored totally)
>
> I think this should work.
>
> On Thu, Jun 7, 2012 at 10:14 AM, Juergen Donnerstag
> <[email protected]> wrote:
>> And there is no stable solution, unless we create an artificial one
>> (via XML prolog and encoding parameter), since each OS and each
>> country charset default is different. We could make that easier to
>> create for testing purposes, though, e.g. by allowing a test-specific
>> default, different from the OS default, for the XML prolog and the
>> encoding parameters.
>>
>> Juergen
>>
>> On Wed, Jun 6, 2012 at 12:32 PM, Martin Grigorov <[email protected]> wrote:
>>> Hi Juergen,
>>>
>>> Thanks for the explanation!
>>>
>>> I've tried all combinations of the following variables:
>>> - -Dfile.encoding=latin1
>>> - with and without <?xml encoding="utf-8"?> in the String to parse
>>> - parse(new ByteArrayInputStream(string.toString().getBytes("UTF-8")), null);
>>> - parse(new ByteArrayInputStream(string.toString().getBytes()), "UTF-8");
>>> - parse(new ByteArrayInputStream(string.toString().getBytes("UTF-8")), "UTF-8");
>>>
>>> The test passes only when the String has the prolog with the
>>> encoding and "parse(new
>>> ByteArrayInputStream(string.toString().getBytes("UTF-8")), "UTF-8");"
>>> is used.
>>> Any other combination produces mangled characters and the assertion fails.
>>>
>>> So I cannot find a stable solution that will work on any environment.
>>> We can use IRequestCycleSettings#getResponseRequestEncoding() for the
>>> charset but if there is no XML prolog or it has no encoding attr then
>>> the test fails.
>>>
>>> On Tue, Jun 5, 2012 at 11:53 PM, Juergen Donnerstag
>>> <[email protected]> wrote:
>>>> Hi Martin,
>>>>
>>>> XmlReader reads the markup file, interprets <?xml encoding ..> if
>>>> present, and converts the markup into a String, which in Java is
>>>> always Unicode (UTF-16). XmlPullParser uses the data provided by XmlReader.
>>>>
>>>> To support unit testing, XPP provides a parse(String) method which
>>>> wraps the string in an InputStream, in order not to circumvent
>>>> XmlReader for testing.
>>>>
>>>> No XML declaration (or no encoding) results in XmlReader using the JVM
>>>> default, which is the OS default if not provided via -Dfile.encoding=
>>>>
>>>> And since you never know on which OS in which country devs are building
>>>> or testing, providing the Unicode-escaped value is the safe way of doing
>>>> it.
>>>>
>>>> We may replace parse(string) with parse(string, "encoding"). The
>>>> encoding seems to be supported by all underlying methods, but is preset
>>>> to null (JVM default) right now. That may help you solve your problem,
>>>> and make other devs aware that the encoding might need to change.
>>>>
>>>> Make sense?
>>>>
>>>> Juergen
>>>>
>>>> On Tue, Jun 5, 2012 at 9:54 AM, Juergen Donnerstag
>>>> <[email protected]> wrote:
>>>>> I'll have a look later today.
>>>>>
>>>>> Juergen
>>>>>
>>>>> On Mon, Jun 4, 2012 at 3:37 PM, Martin Grigorov
>>>>> <[email protected]> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'm not quite sure but I think there is a bug in
>>>>>> org.apache.wicket.markup.parser.XmlPullParser#parse(CharSequence)
>>>>>> because it uses
>>>>>> string.toString().getBytes() to create a ByteArrayInputStream.
>>>>>>
>>>>>> org.apache.wicket.util.tester.BaseWicketTester#getTagById(String) uses
>>>>>> lastResponseAsString to feed XmlPullParser but lastResponseAsString's
>>>>>> encoding depends on
>>>>>> org.apache.wicket.settings.IRequestCycleSettings#getResponseRequestEncoding().
>>>>>> I.e. the string may be encoded in UTF-8 but later XmlPullParser will
>>>>>> try to process its bytes as Windows-1252 for example.
>>>>>>
>>>>>>
>>>>>> Here is a small patch that exposes the problem:
>>>>>> diff --git 
>>>>>> a/wicket-core/src/test/java/org/apache/wicket/markup/parser/XmlPullParserTest.java
>>>>>> b/wicket-core/src/test/java/org/apache/wicket/markup/p
>>>>>> index 2e26d05..15fb496 100644
>>>>>> --- a/wicket-core/src/test/java/org/apache/wicket/markup/parser/XmlPullParserTest.java
>>>>>> +++ b/wicket-core/src/test/java/org/apache/wicket/markup/parser/XmlPullParserTest.java
>>>>>> @@ -191,6 +191,13 @@ public class XmlPullParserTest extends Assert
>>>>>>                assertNull(parser.getEncoding());
>>>>>>                tag = parser.nextTag();
>>>>>>                assertNull(tag);
>>>>>> +
>>>>>> +               String expected = "äöü€";
>>>>>> +               parser.parse("<dummy>"+expected+"</dummy>");
>>>>>> +               XmlTag openTag = parser.nextTag();
>>>>>> +               XmlTag closeTag = parser.nextTag();
>>>>>> +               String actual = parser.getInput(openTag.getPos() + openTag.getLength(), closeTag.getPos()).toString();
>>>>>> +               assertEquals(expected, actual);
>>>>>>        }
>>>>>>
>>>>>>        /**
>>>>>>
>>>>>> Apply this patch and run the test with -Dfile.encoding=latin1. It will
>>>>>> fail in the comparison. Run it with UTF-8 and it will pass.
>>>>>>
>>>>>> I remember Juergen had a similar problem with one of the Wicket core
>>>>>> tests that uses the Euro sign in an assertion, and he fixed it by using
>>>>>> a Unicode-escaped value (\uabcd).
>>>>>> But in this case the response is encoded with whatever is configured
>>>>>> at IRequestCycleSettings#getResponseRequestEncoding() and
>>>>>> XmlPullParser tries to read it with the platform default encoding.
>>>>>>
>>>>>> Is this a bug, and how can we solve it?
>>>>>>
>>>>>> --
>>>>>> Martin Grigorov
>>>>>> jWeekend
>>>>>> Training, Consulting, Development
>>>>>> http://jWeekend.com
>>>
>>>
>>>
>>> --
>>> Martin Grigorov
>>> jWeekend
>>> Training, Consulting, Development
>>> http://jWeekend.com
>
>
>
> --
> Martin Grigorov
> jWeekend
> Training, Consulting, Development
> http://jWeekend.com
