[jira] Reopened: (UIMA-387) XMI Serializer can write invalid control characters

Adam Lally (JIRA) Mon, 04 Jun 2007 06:25:57 -0700

     [ 
https://issues.apache.org/jira/browse/UIMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Lally reopened UIMA-387:
-----------------------------

I don't agree with this fix.  By putting the check in the XmiCasSerializer, 
we're preventing anyone from using XML 1.1, which I don't think is good.  I'd 
rather put the check in the org.apache.uima.util.XMLSerializer class, and make 
it conditional on the output XML version being set to 1.0.  (Also, making the 
fix "closer" to the actual XML generation code makes sense to me, since I 
consider this to be a bug in Xalan that we are working around.)
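
As a rough sketch of what a version-conditional check could look like (the 
class and method names here are hypothetical, not existing UIMA code, and the 
ranges are the Char productions from the XML 1.0 and 1.1 specs):

```java
// Hypothetical helper, not part of org.apache.uima.util.XMLSerializer.
final class XmlCharCheck {

    /**
     * Returns true if the Unicode code point cp is a legal character for
     * the given XML version ("1.0" or "1.1"). Any other version string is
     * treated as 1.0, which is the stricter of the two.
     */
    static boolean isValidXmlChar(int cp, String xmlVersion) {
        // These ranges are legal in both versions.
        boolean common = (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
        if ("1.1".equals(xmlVersion)) {
            // XML 1.1 additionally allows 0x1-0x1F (still not 0x0). Most of
            // these must be written as character references such as &#x1A;.
            return common || (cp >= 0x1 && cp <= 0x1F);
        }
        // XML 1.0 allows only tab, line feed and carriage return below 0x20.
        return common || cp == 0x9 || cp == 0xA || cp == 0xD;
    }
}
```

With this, isValidXmlChar(0x1A, "1.0") is false while 
isValidXmlChar(0x1A, "1.1") is true, so the serializer would only reject the 
character when it is actually writing XML 1.0.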

Also I think surrogates are not implemented properly.  I saw your comment that 
says "So it actually looks as if the surrogate case can be handled correctly by 
just looking at individual Java chars," but I don't think that is correct.

The XML spec disallows characters such as &#xD800; (a high surrogate).  But that 
doesn't mean you can exclude any Java char that is 0xD800.  If it is followed 
by a low surrogate, then the XML serializer will convert the pair to a valid 
XML character in the &#x10000; - &#x10FFFF; range.  At least, the current 
version of Xalan does.  Depending on which version is bundled with whatever JVM 
someone is using, it might have a bug and not do it right.  But as implemented 
now, the XmiCasSerializer doesn't allow surrogate pairs at all.
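
To illustrate the difference, here is a minimal sketch of a scan that works on 
code points rather than individual Java chars.  The class and method names are 
made up for this example, and it assumes XML 1.0 output:

```java
// Hypothetical helper for illustration only, not existing UIMA code.
final class SurrogateAwareScan {

    /**
     * Returns the index of the first Java char that starts a character
     * not allowed in XML 1.0, or -1 if the whole string is fine.
     */
    static int firstInvalidXml10Index(String s) {
        int i = 0;
        while (i < s.length()) {
            // codePointAt combines a high surrogate followed by a low
            // surrogate into one supplementary code point; an unpaired
            // surrogate is returned unchanged and fails the check below.
            int cp = s.codePointAt(i);
            boolean valid = cp == 0x9 || cp == 0xA || cp == 0xD
                    || (cp >= 0x20 && cp <= 0xD7FF)
                    || (cp >= 0xE000 && cp <= 0xFFFD)
                    || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (!valid) {
                return i;
            }
            // Advance by 2 chars for a surrogate pair, otherwise by 1.
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

A string containing a proper pair such as "\uD835\uDD38" passes, while a lone 
"\uD800" or a control character such as "\u001A" is reported at its index.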

Finally, in addition to having the serializer catch invalid characters, I think 
we still need to do more to address the issue, perhaps by making the XmiWriter 
CAS consumer capable of using XML 1.1, maybe even by default.

> XMI Serializer can write invalid control characters
> ---------------------------------------------------
>
>                 Key: UIMA-387
>                 URL: https://issues.apache.org/jira/browse/UIMA-387
>             Project: UIMA
>          Issue Type: Bug
>          Components: Core Java Framework
>    Affects Versions: 2.1
>            Reporter: Adam Lally
>            Assignee: Adam Lally
>             Fix For: 2.2
>
>
> On 5/1/07, Leo Ferres <[EMAIL PROTECTED]> wrote:
> > Hello,
> >
> > While trying to open an xmi file after processing in xml view, an
> > error pops up telling me that there is an invalid &#26 xml character.
> > the error comes from the sax parser. Below is the stack trace. Thanks
> > very much for your help,
> >
> Most control characters are not allowed in XML 1.0, even if they are
> escaped with &#xxx.  If your input document contains such characters,
> the XMI CAS serializer is writing them to the output XMI document,
> making it unreadable.
> I checked that if you edit the XMI document and change the first line to:
> <?xml version="1.1" encoding="UTF-8"?>
> The problem goes away, because XML version 1.1 does allow escaped
> control characters.
> So one possibility for us to fix this in UIMA is to have the XMI CAS
> Serializer generate XML version 1.1 tag by default.  (I think we
> considered that before and decided not to for some reason, maybe we
> were worried that other applications might not be able to consume XML
> 1.1?  I can't remember. :)
> Another possibility would be to have the XMI serializer automatically
> replace these characters with spaces.  The XCAS (not XMI) serializer
> does that, but only for the document text, not for feature values.  We
> could also serialize the XMI using XML version 1.1, which allows
> escaped control characters (but still not the 0x00 character).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
