Re: [jira] Reopened: (UIMA-387) XMI Serializer can write invalid control characters

Thilo Goetz Mon, 04 Jun 2007 07:05:04 -0700

Adam Lally (JIRA) wrote:
>      [ 
> https://issues.apache.org/jira/browse/UIMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>  ]
> 
> Adam Lally reopened UIMA-387:
> -----------------------------
> 
> 
> I don't agree with this fix.  By putting the check in the XmiCasSerializer, 
> we're preventing anyone from using XML 1.1, which I don't think is good.  I'd 
> rather put the check in the org.apache.uima.util.XMLSerializer class, and 
> make it conditional on the output XML version being set to 1.0.  (Also fixing 
> "closer" to the actual XML generation code makes sense to me since I consider 
> this to be a bug in Xalan that we are working around.)


Sure, whatever.

> 
> Also I think surrogates are not implemented properly.  I saw your comment 
> that says "So it actually looks as if the surrogate case can be handled 
> correctly by just looking at individual Java chars," but I don't think that 
> is correct.

You're right, I mentally switched to exclude mode when going to the
#x10000-#x10FFFF range.  So I'll fix this and then assign the issue to you
so you can move the check to wherever you prefer.

> 
> The XML spec disallows characters such as &#xD800 (a high surrogate).  But 
> that doesn't mean you can exclude any Java char that is 0xD800.  If it is 
> followed by a low surrogate, then the XML serializer will convert this to a 
> valid XML character in the &#x10000 - &#x10FFFF range.  At least, the current 
> version of Xalan does.  Depending on which version is bundled with whatever JVM 
> someone's using, it might have a bug and not do it right.  But as implemented 
> now, the XMISerializer doesn't allow surrogate pairs at all.
> 
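
A minimal sketch of the surrogate-aware check described above, assuming we
walk the string by code point rather than by individual Java char.  The class
and method names (Xml10CharCheck, isValidXml10, firstInvalidIndex) are made up
for illustration and are not part of UIMA or Xalan.  The ranges are those of
the XML 1.0 Char production.

```java
final class Xml10CharCheck {

    // XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the char index where the first invalid code point starts,
    // or -1 if every code point in the string is a legal XML 1.0 character.
    static int firstInvalidIndex(String s) {
        int i = 0;
        while (i < s.length()) {
            // codePointAt combines a high surrogate followed by a low
            // surrogate into one supplementary code point; an unpaired
            // surrogate comes back as its own value in 0xD800-0xDFFF,
            // which the range check above rejects.
            int cp = s.codePointAt(i);
            if (!isValidXml10(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

With this approach a well-formed surrogate pair passes, while a lone 0xD800
or a control character such as 0x1A is reported at its index.
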
> Finally, in addition to having the serializer catch invalid characters, I 
> think we still need to do more to address the issue, perhaps making the 
> XmiWriter cas consumer capable of using XML 1.1, maybe even by default.
> 
>> XMI Serializer can write invalid control characters
>> ---------------------------------------------------
>>
>>                 Key: UIMA-387
>>                 URL: https://issues.apache.org/jira/browse/UIMA-387
>>             Project: UIMA
>>          Issue Type: Bug
>>          Components: Core Java Framework
>>    Affects Versions: 2.1
>>            Reporter: Adam Lally
>>            Assignee: Adam Lally
>>             Fix For: 2.2
>>
>>
>> On 5/1/07, Leo Ferres <[EMAIL PROTECTED]> wrote:
>>> Hello,
>>>
>>> While trying to open an xmi file after processing in xml view, an
>>> error pops up telling me that there is an invalid &#26 xml character.
>>> The error comes from the SAX parser. Below is the stack trace. Thanks
>>> very much for your help,
>>>
>> Most control characters are not allowed in XML 1.0, even if they are
>> escaped with &#xxx.  If your input document contains such characters,
>> the XMI CAS serializer is writing them to the output XMI document,
>> making it unreadable.
>> I checked that if you edit the XMI document and change the first line to:
>> <?xml version="1.1" encoding="UTF-8"?>
>> The problem goes away, because XML version 1.1 does allow escaped
>> control characters.
>> So one possibility for us to fix this in UIMA is to have the XMI CAS
>> Serializer generate XML version 1.1 tag by default.  (I think we
>> considered that before and decided not to for some reason, maybe we
>> were worried that other applications might not be able to consume XML
>> 1.1?  I can't remember. :)
>> Another possibility would be to have the XMI serializer automatically
>> replace these characters with spaces.  The XCAS (not XMI) serializer
>> does that, but only for the document text, not for feature values.  We
>> could also serialize the XMI using XML version 1.1, which allows
>> escaped control characters (but still not the 0x00 character).
>
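
For reference, a rough sketch of the "replace with spaces" option, made
conditional on the output XML version as suggested earlier in the thread.
The names here (XmlCharSanitizer, isAllowed, replaceInvalid, and the xml11
flag) are hypothetical and do not correspond to existing UIMA classes.  It
assumes that for XML 1.1 everything except 0x00, unpaired surrogates, 0xFFFE
and 0xFFFF may be kept, on the understanding that the serializer writes the
restricted control characters as character references.

```java
final class XmlCharSanitizer {

    // Decides whether a code point may appear in the output for the
    // chosen XML version (xml11 == false means XML 1.0).
    static boolean isAllowed(int cp, boolean xml11) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return false; // unpaired surrogate, never a legal XML character
        }
        if (cp == 0xFFFE || cp == 0xFFFF || cp > 0x10FFFF) {
            return false;
        }
        if (xml11) {
            return cp >= 0x1; // XML 1.1 still excludes 0x00
        }
        // XML 1.0: only tab, LF and CR are allowed below 0x20
        return cp == 0x9 || cp == 0xA || cp == 0xD || cp >= 0x20;
    }

    // Replaces every disallowed code point with a single space.
    static String replaceInvalid(String s, boolean xml11) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i); // joins valid surrogate pairs
            if (isAllowed(cp, xml11)) {
                out.appendCodePoint(cp);
            } else {
                out.append(' ');
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Called with xml11 set to false, a 0x1A in a feature value becomes a space;
with xml11 set to true it is left alone and only 0x00 and the other
never-legal code points are replaced.
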
