[ 
https://issues.apache.org/jira/browse/UIMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12493672
 ] 

Adam Lally commented on UIMA-387:
---------------------------------

>We should make sure we don't generate XMI that we can't read back in. It's 
>better to throw an exception or replace characters or whatever 
>when generating the XMI than making the user believe everything's fine and 
>later they can't get at their results. 

I agree that would be better.  It is too bad the XML serialization code we're 
using (the built-in Xalan XSLT stuff from the JRE) doesn't already have this 
behavior.  I hate to have to implement this by scanning every string for bad 
characters prior to serialization; that just seems so wasteful.

Does that leave us with reimplementing the serialization ourselves?
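
For reference, the per-string scan I mean would look roughly like the sketch 
below.  This is not existing UIMA code; the class and method names 
(Xml10Chars, isValid, firstInvalid, replaceInvalid) are made up for 
illustration.  It checks each code point against the Char production of the 
XML 1.0 spec and, as one of the options discussed in the issue, replaces 
anything outside it with a space.

```java
// Hypothetical helper, not part of UIMA: scans strings for code points
// that XML 1.0 does not allow, even when written as character references.
final class Xml10Chars {
    private Xml10Chars() {}

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isValid(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the index of the first invalid character, or -1 if none. */
    static int firstInvalid(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isValid(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    /**
     * Replaces each invalid character with a space.  Returns the same
     * String instance when nothing needs replacing, so clean strings
     * cost one pass and no allocation.
     */
    static String replaceInvalid(String s) {
        int first = firstInvalid(s);
        if (first < 0) {
            return s;
        }
        StringBuilder sb = new StringBuilder(s.length());
        sb.append(s, 0, first);
        for (int i = first; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (isValid(cp)) {
                sb.appendCodePoint(cp);
            } else {
                sb.append(' ');
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

A serializer could call firstInvalid and throw when it returns a 
non-negative index, or call replaceInvalid on each string value before 
writing it.  Either way it is an extra pass over every string, which is the 
cost I was complaining about.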

> XMI Serializer can write invalid control characters
> ---------------------------------------------------
>
>                 Key: UIMA-387
>                 URL: https://issues.apache.org/jira/browse/UIMA-387
>             Project: UIMA
>          Issue Type: Bug
>          Components: Core Java Framework
>    Affects Versions: 2.1
>            Reporter: Adam Lally
>             Fix For: 2.2
>
>
> On 5/1/07, Leo Ferres <[EMAIL PROTECTED]> wrote:
> > Hello,
> >
> > While trying to open an xmi file after processing in xml view, an
> > error pops up telling me that there is an invalid &#26 xml character.
> > the error comes from the sax parser. Below is the stack trace. Thanks
> > very much for your help,
> >
> Most control characters are not allowed in XML 1.0, even if they are
> escaped with &#xxx.  If your input document contains such characters,
> the XMI CAS serializer is writing them to the output XMI document,
> making it unreadable.
> I checked that if you edit the XMI document and change the first line to:
> <?xml version="1.1" encoding="UTF-8"?>
> The problem goes away, because XML version 1.1 does allow escaped
> control characters.
> So one possibility for us to fix this in UIMA is to have the XMI CAS
> Serializer generate an XML version 1.1 declaration by default.  (I think
> we considered that before and decided not to for some reason; maybe we
> were worried that other applications might not be able to consume XML
> 1.1?  I can't remember. :)
> Another possibility would be to have the XMI serializer automatically
> replace these characters with spaces.  The XCAS (not XMI) serializer
> does that, but only for the document text, not for feature values.  We
> could also serialize the XMI using XML version 1.1, which allows
> escaped control characters (but still not the 0x00 character).

