[ 
https://issues.apache.org/jira/browse/UIMA-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17014974#comment-17014974
 ] 

Mario Juric commented on UIMA-6128:
-----------------------------------

Excluding xalanj fixes many of the issues we encountered, but the XML 1.0 
character limitations can still be a problem, since we have seen such outlier 
characters in our input. Most of the time, proper cleaning upfront fixes this 
where it is possible, but there are edge cases where cleaning is problematic 
for various reasons, one of them being the cost of fixing and reprocessing 
larger corpora. Serializing XMI using XML 1.1 is therefore still relevant as 
an additional aid in mitigating such issues. We are not using XCAS ourselves, 
but if the community finds this format still worth supporting, then I think we 
should include it as well.
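
To illustrate the character limitation discussed here, the following is a 
minimal sketch in plain Java (no UIMA dependency) that follows the Char 
productions of the XML 1.0 and XML 1.1 specifications. The class and method 
names (XmlCharCheck, isXml10Char, isXml11Char, findXml10Violations) are 
hypothetical and not part of UIMA. The sketch only locates the offending code 
points. It does not show how UIMA serializes them.

```java
import java.util.ArrayList;
import java.util.List;

public class XmlCharCheck {

    /** True if the code point matches the Char production of XML 1.0. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * True if the code point matches the Char production of XML 1.1.
     * Most C0/C1 control characters are accepted here, although XML 1.1
     * requires them to be written as character references.
     */
    static boolean isXml11Char(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Offsets (in UTF-16 code units) of code points that XML 1.0 rejects. */
    static List<Integer> findXml10Violations(String text) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isXml10Char(cp)) {
                offsets.add(i);
            }
            i += Character.charCount(cp);
        }
        return offsets;
    }

    public static void main(String[] args) {
        // U+0002 and U+000B are invalid in XML 1.0 but allowed in XML 1.1.
        String text = "ok" + (char) 0x2 + "text" + (char) 0xB + "more";
        System.out.println(findXml10Violations(text)); // prints [2, 7]
    }
}
```

Note that U+0000, U+FFFE, U+FFFF and unpaired surrogates are rejected by both 
versions, so XML 1.1 reduces the need for cleanup but does not remove it 
entirely.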

> Allow XMI to be optionally serialized with XML 1.1 instead of only 1.0
> ----------------------------------------------------------------------
>
>                 Key: UIMA-6128
>                 URL: https://issues.apache.org/jira/browse/UIMA-6128
>             Project: UIMA
>          Issue Type: New Feature
>          Components: UIMA
>            Reporter: Mario Juric
>            Priority: Major
>         Attachments: OddFeatureText.java, SimpleTypeSystem_TS.xml
>
>
> Some Unicode characters are not handled by XML 1.0, so serializing the CAS 
> to XMI can require some normalization or cleanup, but requirements may not 
> allow all such characters to be removed from the CAS. It can also be 
> impossible to do such normalization/cleanup without a full reprocess when 
> converting data already stored as compressed binaries to XMI. Being able to 
> optionally select XML 1.1 instead of the default XML 1.0 would be an easier 
> way for some to bypass many of those Unicode issues.
> See also discussion on the UIMA mailing list:
> https://lists.apache.org/thread.html/7f8124b7be9ea20ab21dc616243e5661a0b7668a856532031fda71e3@%3Cuser.uima.apache.org%3E
> This feature request suggests introducing an additional SerialFormat, 
> e.g. XMI_1_1, which can be selected as the format parameter in the 
> CasIOUtils.save methods.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
