Hi,

I investigated the XMI issue as promised, and these are my findings.

The issue is related to special Unicode characters that are not handled by the XMI serialisation, and there seem to be two distinct categories of problems that we have identified so far:

1) The document text of the CAS contains special Unicode characters.
2) Annotations with String features have values containing special Unicode characters.

In both cases we could certainly solve the problem by doing a better clean-up job upstream, but with the amount and variety of data we receive there is always a chance that something slips through, and some of it may, in the general case, even be valid content.

The first case is easy to reproduce with the OddDocumentText example I attached. In this example the text is a snippet taken from the content of a parsed XML document.

The second case I could not reproduce with the OddFeatureText example, because I get slightly different output from what I see in our real setup. The OddFeatureText example is based on the simple type system I shared previously. The name value of a FeatureRecord contains special Unicode characters that I found in a similar data structure in our actual CAS. The value comes from an external knowledge base holding some noisy strings; in this case it is a hieroglyph entity. However, when I write the CAS to XMI using the small example, it only outputs the first of the two characters in "\uD80C\uDCA3", which yields the value "𓂣" in the XMI, whereas in our actual setup both character values are written, as "𓂣�".

This means that the attached example is, for some reason, able to parse the XMI again, but parsing does not work when both characters are written the way we experience it. The XMI can be changed manually so that both character values are included, as happens in our output, and in that case a SAXParseException is thrown.
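To make the two failure modes concrete, here is a minimal sketch using only the JDK (no UIMA classes). It rests on assumptions: the class and method names are hypothetical, and the numeric character references `&#77987;` (U+130A3, the code point encoded by "\uD80C\uDCA3") and `&#56483;` (the low surrogate 0xDCA3 on its own) are a guess at how the two values look in the XMI, not something taken from the actual files. The first helper finds characters that XML 1.0 cannot represent; the second checks whether the JDK's SAX parser accepts a snippet.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of UIMA.
class XmlCharCheck {

    /** True if the code point is allowed by the XML 1.0 "Char" production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Index of the first char that XML 1.0 cannot hold, or -1 if none. */
    static int firstInvalidIndex(String s) {
        for (int i = 0; i < s.length(); ) {
            // A well-formed surrogate pair is decoded to one code point;
            // an unpaired surrogate is returned as its own value.
            int cp = s.codePointAt(i);
            if (!isXml10Char(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    /** True if the JDK's SAX parser accepts the snippet as well-formed XML. */
    static boolean parses(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXParseException e) {
            return false; // well-formedness error, e.g. an illegal character reference
        } catch (Exception e) {
            throw new RuntimeException(e); // parser configuration or I/O problem
        }
    }

    public static void main(String[] args) {
        // Assumed attribute values; the element and attribute names are made up.
        String pairOnly = "<r name=\"&#77987;\"/>";
        String pairPlusLowSurrogate = "<r name=\"&#77987;&#56483;\"/>";

        System.out.println(firstInvalidIndex("\uD80C\uDCA3"));
        System.out.println(firstInvalidIndex("\uD80C\uDCA3\uDCA3"));
        System.out.println(parses(pairOnly));
        System.out.println(parses(pairPlusLowSurrogate));
    }
}
```

A check like `firstInvalidIndex` could be run on the document text and on String feature values before serialisation to see which inputs would be affected. Whether to drop, replace, or escape such characters is a separate decision.
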
I don’t know whether it is outside the scope of the XMI serialiser to handle any of this, but it would be good to know in any case :)

Cheers,
Mario

Attachments: OddDocumentText.java, OddFeatureText.java
