[ https://issues.apache.org/jira/browse/UIMA-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010504#comment-17010504 ]
Rune Stilling commented on UIMA-6128:
-------------------------------------

We have found the root of the problem. When a CAS contains characters that UTF-16 encodes as surrogate pairs, serializing it produces invalid XML characters in the UTF-8 encoded output, which makes the document unparsable.

The problem comes from the Xalan serialization libraries, which may be pulled into UIMA via the edu.stanford.nlp:stanford-corenlp:3.9.1 and 3.9.2 dependencies (these depend on xalan:xalan:2.7.0). The bug is described here (and has never been fixed in an official release):
https://issues.apache.org/jira/browse/XALANJ-2617

We found the solution to be quite straightforward. We simply excluded the Xalan (and Xerces) dependencies so that the code uses the default Java implementation instead (org.xml.sax.ContentHandler::startElement()).

We have attached two files that may be used to reproduce the issue. If Xalan is included, the test code will throw an exception when loading the generated CAS XMI.

> Allow XMI to be optionally serialized with XML 1.1 instead of only 1.0
> ----------------------------------------------------------------------
>
>                 Key: UIMA-6128
>                 URL: https://issues.apache.org/jira/browse/UIMA-6128
>             Project: UIMA
>          Issue Type: New Feature
>          Components: UIMA
>            Reporter: Mario Juric
>            Priority: Major
>         Attachments: OddFeatureText.java, SimpleTypeSystem_TS.xml
>
>
> Some unicode characters are not handled by XML 1.0 and it can require some
> normalization or cleanup to be able to serialize the CAS to XMI, but
> requirements may not necessarily allow all such characters to be fully
> removed from the CAS. It can also be impossible to do such
> normalization/cleanup without full reprocess when converting data already
> stored as compressed binaries to XMI. Being able to optionally select XML 1.1
> instead of the default XML 1.0 would be an easier way for some to bypass many
> of those unicode issues.
> See also discussion on the UIMA mailing list:
> https://lists.apache.org/thread.html/7f8124b7be9ea20ab21dc616243e5661a0b7668a856532031fda71e3@%3Cuser.uima.apache.org%3E
> This feature request suggests that an additional SerialFormat is introduced,
> e.g. XMI_1_1, which can be selected as format parameter in the
> CasIOUtils.save methods.
>
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
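
To illustrate why surrogates can lead to invalid XML as described in the comment above, here is a minimal Java sketch. It assumes the failure mode is that each half of a surrogate pair is written out on its own (for example as a separate numeric character reference) instead of the pair being written as one code point. The class SurrogateXmlCheck and its methods are hypothetical helpers for illustration only; they are not part of UIMA, Xalan, or the attached test files.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of UIMA or Xalan. */
final class SurrogateXmlCheck {

    /** True if the code point matches the "Char" production of XML 1.0. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Writes one numeric character reference per Unicode code point. */
    static String referencePerCodePoint(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append("&#").append(cp).append(';'));
        return sb.toString();
    }

    /** Writes one reference per UTF-16 code unit (the assumed faulty pattern). */
    static String referencePerCodeUnit(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append("&#").append((int) s.charAt(i)).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the Basic Multilingual Plane.
        String text = new String(Character.toChars(0x1F600));

        System.out.println("UTF-16 code units: " + text.length());
        System.out.println("Code points:       " + text.codePointCount(0, text.length()));
        System.out.println("UTF-8 bytes:       " + text.getBytes(StandardCharsets.UTF_8).length);

        // The full code point is a legal XML 1.0 character ...
        System.out.println("U+1F600 valid:     " + isXml10Char(text.codePointAt(0)));
        // ... but neither surrogate half is legal on its own.
        System.out.println("high half valid:   " + isXml10Char(text.charAt(0)));
        System.out.println("low half valid:    " + isXml10Char(text.charAt(1)));

        System.out.println("per code point:    " + referencePerCodePoint(text));
        System.out.println("per code unit:     " + referencePerCodeUnit(text));
    }
}
```

A reference such as &#55357; names a surrogate code point, which the XML 1.0 Char production excludes, so a conforming parser rejects the document. Writing the pair as a single code point, either as UTF-8 bytes or as one reference, avoids this.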