Hi,

I investigated the XMI issue as promised and these are my findings.

It is related to special Unicode characters that are not handled by XMI serialisation, and we have identified two distinct categories of issues so far:

1) The document text of the CAS contains special unicode characters
2) Annotations with String features have values containing special unicode characters

In both cases we could certainly solve the problem with better clean-up upstream, but given the amount and variety of data we receive there is always a chance that something slips through, and some of it may even be valid content in the general case.
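As a possible upstream mitigation, here is a minimal Java sketch (the class name XmlCharCheck is hypothetical) that flags strings containing code points that XML 1.0 cannot represent, including control characters and unpaired surrogates:

```java
// Sketch: scan a string for characters that XML 1.0 cannot represent.
// The XML 1.0 Char production allows #x9, #xA, #xD, #x20-#xD7FF,
// #xE000-#xFFFD, and #x10000-#x10FFFF. Anything else (control characters,
// lone surrogates) will break XMI serialisation or later parsing.
public class XmlCharCheck {

    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns true if every code point in s is legal in XML 1.0.
    static boolean isValidXmlText(String s) {
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            // codePointAt returns the surrogate value itself when it is
            // unpaired; that falls in 0xD800-0xDFFF and is rejected below.
            if (!isValidXmlChar(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidXmlText("\uD80C\uDCA3")); // true: valid surrogate pair
        System.out.println(isValidXmlText("\uD80C"));       // false: lone high surrogate
        System.out.println(isValidXmlText("\u0001"));       // false: control character
    }
}
```

A check like this could be run over document text and String feature values before serialisation, rejecting or cleaning offending values instead of producing unparseable XMI.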

The first case is easy to reproduce with the OddDocumentText example I attached. In this example the text is a snippet taken from the content of a parsed XML document.

I could not reproduce the other case with the OddFeatureText example, because it produces slightly different output from what we see in our real setup. The OddFeatureText example is based on the simple type system I shared previously. The name value of a FeatureRecord contains special Unicode characters that I found in a similar data structure in our actual CAS; the value comes from an external knowledge base holding some noisy strings, in this case a hieroglyph entity. However, when I write the CAS to XMI in the small example, only the first of the two characters in "\uD80C\uDCA3" is written, which yields the value "𓂣" in the XMI, whereas in our actual setup both character values are written, as "𓂣�". This means the attached example can, for some reason, parse the XMI again, but parsing fails when both characters are written the way we experience it. If the XMI is manually changed so that both character values are included the way it happens in our output, a SAXParseException is thrown.
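To illustrate the surrogate-pair mechanics behind this, here is a standalone Java sketch (not taken from the attached examples):

```java
// Sketch showing why a truncated surrogate pair corrupts the value:
// "\uD80C\uDCA3" is a single code point (U+130A3, an Egyptian hieroglyph)
// encoded as two UTF-16 code units.
public class SurrogatePairDemo {
    public static void main(String[] args) {
        String full = "\uD80C\uDCA3";
        System.out.println(full.length());                         // 2 UTF-16 code units
        System.out.println(full.codePointCount(0, full.length())); // but only 1 code point
        System.out.println(Integer.toHexString(full.codePointAt(0))); // 130a3

        // If only the first unit survives serialisation, what remains is a
        // lone high surrogate, which is not a valid character on its own.
        char first = full.charAt(0);
        System.out.println(Character.isHighSurrogate(first)); // true
        // A decoder that encounters a lone surrogate typically substitutes
        // U+FFFD, which would match the trailing replacement-character box
        // observed in the real XMI output.
    }
}
```

In other words, any code path that treats the two UTF-16 units independently (for example, writing or escaping them one char at a time) can split the pair and produce XMI that no conformant parser will accept.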

I don’t know whether handling any of this is outside the scope of the XMI serialiser, but it would be good to know either way :)

Cheers,
Mario

Attachment: OddDocumentText.java
Description: Binary data

Attachment: OddFeatureText.java
Description: Binary data


On 17 Sep 2019, at 09:36 , Mario Juric <[email protected]> wrote:

Thank you very much for looking into this. It is really appreciated and I think it touches upon something important, which is about data migration in general.

I agree that some of these solutions can appear specific, awkward, or complex, and that the way forward is not to address our use case alone. I think there is a need for a compact and efficient binary serialisation format for the CAS when dealing with large amounts of data, because this is directly visible in the cost of processing and storage, and I found the compressed binary format to be much better than XMI in this regard, although I admit it has been a while since I benchmarked this. Given that UIMA already has a well-described type system, maybe it just lacks a way to describe schema evolution, similar to Apache Avro and other serialisation frameworks. I think a more formal approach to data migration would be critical to any larger operational setup.

Regarding XMI, I would like to provide some input on the problem we are observing so that it can be solved. We primarily use XMI for inspection/debugging purposes, and this error sometimes prevents us from doing so. I will try to extract a minimal example that avoids the parts that have to do with our pipeline and type system; I think this would also be the best way to show that the problem exists outside that context. However, converting all our data to XMI first in order to do the conversion in our example would not be very practical for us, because it involves a large amount of data.

Cheers,
Mario

On 16 Sep 2019, at 23:02 , Marshall Schor <[email protected]> wrote:

In this case, the original looks kind-of like this:

Container
   features -> FSArray of FeatureAnnotation each of which
                             has 5 slots: sofaRef, begin, end, name, value

the new TypeSystem has

Container
   features -> FSArray of FeatureRecord each of which
                              has 2 slots: name, value

The deserializer code would need some way to decide how to
   1) create an FSArray of FeatureRecord,
   2) for each element,
      map the FeatureAnnotation to a new instance of FeatureRecord

I guess I could imagine a default mapping (for item 2 above) of
  1) change the type from A to B
  2) set equal-named features from A to B, drop other features

This mapping would need to apply to a subset of the A's and B's, namely, only
those referenced by the FSArray where the element type changed.  Seems complex
and specific to this use case though.
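The default mapping described above could look roughly like this (an illustrative Java sketch using plain maps in place of the actual UIMA feature-structure API; all names are hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the default mapping: given an old-type instance
// (FeatureAnnotation with 5 slots) and a target type that declares only a
// subset of those features (FeatureRecord with name/value), copy the
// equal-named features and drop the rest. Feature structures are modelled
// here as plain maps; this is not the UIMA API.
public class DefaultMappingSketch {

    static Map<String, Object> mapToNewType(Map<String, Object> oldFs,
                                            Set<String> targetFeatures) {
        Map<String, Object> newFs = new LinkedHashMap<>();
        for (String feat : targetFeatures) {
            if (oldFs.containsKey(feat)) {
                // Equal-named feature: copy the value across.
                newFs.put(feat, oldFs.get(feat));
            }
            // Features absent from the target type (sofaRef, begin, end)
            // are simply dropped.
        }
        return newFs;
    }

    public static void main(String[] args) {
        Map<String, Object> featureAnnotation = new LinkedHashMap<>();
        featureAnnotation.put("sofaRef", 1);
        featureAnnotation.put("begin", 10);
        featureAnnotation.put("end", 20);
        featureAnnotation.put("name", "color");
        featureAnnotation.put("value", "red");

        Map<String, Object> featureRecord = mapToNewType(
            featureAnnotation,
            new LinkedHashSet<>(List.of("name", "value")));
        System.out.println(featureRecord); // {name=color, value=red}
    }
}
```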

-Marshall


On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote:
On 16. Sep 2019, at 19:05, Marshall Schor <[email protected]> wrote:
I can reproduce the problem, and see what is happening.  The deserialization
code compares the two type systems, and allows for some mismatches (things
present in one and not in the other), but it doesn't allow for having a feature
whose range (value) is type XXXX in one type system and type YYYY in the other.
See CasTypeSystemMapper lines 299 - 315.
Without reading the code in detail - could we not relax this check such that the element type of FSArrays is not checked and the code simply assumes that the source element type has the same features as the target element type (with the usual lenient handling of missing features in the target type)? - Kind of a "duck typing" approach?

Cheers,

-- Richard
