In the test "OddDocumentText", this produces a "throw" due to an invalid xml
char, which is the \u0002.

This happens in part because the xml version being used is xml 1.0.

XML 1.1 expanded the set of valid characters to include \u0002.

Here's a snippet from the XmiCasSerializerTest class which serializes with xml 1.1:

        XmiCasSerializer xmiCasSerializer = new XmiCasSerializer(jCas.getTypeSystem());
        OutputStream out = new FileOutputStream(new File("odd-doc-txt-v11.xmi"));
        try {
          XMLSerializer xml11Serializer = new XMLSerializer(out);
          xml11Serializer.setOutputProperty(OutputKeys.VERSION, "1.1");
          xmiCasSerializer.serialize(jCas.getCas(), xml11Serializer.getContentHandler());
        }
        finally {
          out.close();
        }

This succeeds and serializes the document using xml 1.1.

I also tried serializing some doc text which includes the code point 77987
(U+130A3).  That did not serialize correctly.
While tracing, I could follow it down into the innards of some internal SAX
java code, com.sun.org.apache.xml.internal.serializer.AttributesImplSerialize,
where it was still "correct" in the Java string.

When serialized (as UTF-8), it came out as the 4 bytes E7 9E 98 37.

In binary that's 1110 0111  1001 1110  1001 1000  0011 0111.  The first three
bytes form a 3-byte utf8 encoding:

        1110 xxxx 10xx xxxx 10xx xxxx

of 0111 0111 1001 1000, which in hex is 7798; the fourth byte, 0x37, is the
ASCII character "7".  So the output spells out the decimal digits of 77987,
which looks fishy to me.
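
For comparison, here's a small check (a sketch, variable names mine) of what
the UTF-8 encoding should be: code point 77987 (U+130A3, the surrogate pair
"\uD80C\uDCA3") encodes as the 4 bytes F0 93 82 A3.

        // sketch: print the UTF-8 bytes of the 1-code-point string
        String s = "\uD80C\uDCA3";  // code point 77987 (U+130A3)
        for (byte b : s.getBytes(java.nio.charset.StandardCharsets.UTF_8)) {
          System.out.printf("%02X ", b);  // prints: F0 93 82 A3
        }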

But I think it's out of our hands - it's somewhere deep in the SAX transform
code in the JDK.

I looked for a bug report and found this one:
https://bugs.openjdk.java.net/browse/JDK-8058175

Bottom line: I think the thing to do is to clean out these characters early :-) .
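
In case it helps, here's a rough sketch (untested, method name mine) of what
cleaning them out early could look like, keeping only what XML 1.0 allows
(\t, \n, \r, 0x20-0xD7FF, 0xE000-0xFFFD, 0x10000-0x10FFFF):

        // sketch: drop code points that are not valid in XML 1.0
        static String stripNonXml10(String s) {
          StringBuilder sb = new StringBuilder(s.length());
          s.codePoints()
           .filter(cp -> cp == 0x9 || cp == 0xA || cp == 0xD
                      || (cp >= 0x20 && cp <= 0xD7FF)
                      || (cp >= 0xE000 && cp <= 0xFFFD)
                      || (cp >= 0x10000 && cp <= 0x10FFFF))
           .forEach(sb::appendCodePoint);
          return sb.toString();
        }

Note that supplementary characters (the ones needing surrogate pairs) are
legal XML 1.0, so if those also have to go because of the serializer issue
above, an extra "cp <= 0xFFFF" condition would do it.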

-Marshall


On 9/20/2019 1:28 PM, Marshall Schor wrote:
> here's an idea.
>
> If you have a string with the surrogate pair &#77987; at position 10, and
> you have some Java code which iterates through the string getting the
> code-point at each character offset, then that code will produce:
>
> at position 10:  the code-point 77987
> at position 11:  the code-point 56483
>
> Of course, it's a "bug" to iterate through a string assuming there is one
> character at each position, if you don't handle surrogate pairs.
>
> The 56483 is just the lower bits of the surrogate pair, added to 0xDC00 (see
> https://tools.ietf.org/html/rfc2781 )
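>
> As a sketch (untested, names mine), the surrogate-aware way to walk a string
> looks like this:
>
>         String s = "x\uD80C\uDCA3y";     // hypothetical sample text
>         for (int i = 0; i < s.length(); ) {
>           int cp = s.codePointAt(i);     // 77987 at the pair, never 56483
>           // ... process cp ...
>           i += Character.charCount(cp);  // advances by 2 at the surrogate pair
>         }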
>
> I worry that even tools like the CVD or similar may not work properly, since
> I don't think they're designed to handle surrogate pairs, so I have no idea
> whether they would work well enough for you.
>
> I'll poke around some more to see if I can enable the conversion for document
> strings.
>
> -Marshall
>
> On 9/20/2019 11:09 AM, Mario Juric wrote:
>> Thanks Marshall,
>>
>> Encoding the characters like you suggest should work just fine for us as 
>> long as we can serialize and deserialise the XMI, so that we can open the 
>> content in a tool like the CVD or similar. These characters are just noise 
>> from the original content that happen to remain in the CAS, but they are not 
>> visible in our final output because they are basically filtered out one way 
>> or the other by downstream components. They become a problem though when 
>> they make it more difficult for us to inspect the content.
>>
>> Regarding the feature name issue: Might you have an idea why we are getting
>> different XMI output for the same character in our actual pipeline, where
>> it results in "𓂣�"? I investigated the value in the debugger again, and
>> as you illustrate, it is also just a single code point with the value
>> 77987. We are simply not able to load this XMI because of this, but
>> unfortunately I couldn't reproduce it in my small example.
>>
>> Cheers,
>> Mario
>>
>>> On 19 Sep 2019, at 22:41 , Marshall Schor <[email protected]> wrote:
>>>
>>> The odd-feature-text seems to work OK, but has some unusual properties, due 
>>> to
>>> that unicode character.
>>>
>>> Here's what I see:  The FeatureRecord "name" field is set to a single
>>> unicode character that must be encoded as 2 Java characters.
>>>
>>> When output, it shows up in the xmi as
>>>   <noNamespace:FeatureRecord xmi:id="18" name="&#77987;" value="1.0"/>
>>> which seems correct.  The name field has only 1 (extended) unicode
>>> character (taking 2 Java characters to represent), due to setting it with
>>> this code:   String oddName = "\uD80C\uDCA3";
>>>
>>> When read in, the name field is assigned to a String; that string says it
>>> has a length of 2 (but that's because it takes 2 Java chars to represent
>>> this character). If you have the name string in a variable "n" and do
>>> System.out.println(n.codePointAt(0)), it shows (correctly) 77987, and
>>> n.codePointCount(0, n.length()) is, as expected, 1.
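>>>
>>> Put together as a tiny check:
>>>
>>>         String n = "\uD80C\uDCA3";
>>>         System.out.println(n.length());                       // 2 Java chars
>>>         System.out.println(n.codePointAt(0));                 // 77987
>>>         System.out.println(n.codePointCount(0, n.length()));  // 1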
>>>
>>> So, the string value serialization and deserialization seems to be 
>>> "working".
>>>
>>> The other code, for the sofa (document) serialization, is throwing that
>>> error because, as currently designed, the serialization code checks for
>>> these kinds of characters and, if any is found, throws that exception.
>>> The checking code is in XMLUtils.checkForNonXmlCharacters.
>>>
>>> This is because "fixing this" in the same way as the other would very
>>> likely result in hard-to-diagnose future errors: the subject-of-analysis
>>> string is processed with begin / end offsets all over the place, under the
>>> assumption that none of the characters are coded as surrogate pairs.
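>>>
>>> As a made-up illustration of why: with document text "ab\uD80C\uDCA3cd", an
>>> annotation over "cd" has begin = 4 in Java-char offsets, but only 3 code
>>> points precede it:
>>>
>>>         String sofa = "ab\uD80C\uDCA3cd";
>>>         System.out.println(sofa.indexOf("cd"));                          // 4
>>>         System.out.println(sofa.codePointCount(0, sofa.indexOf("cd")));  // 3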
>>>
>>> We could change the code to output these the way the name is output, e.g. as &#77987;.
>>>
>>> Would that help in your case, or do you imagine other kinds of things
>>> might break (due to begin/end offsets no longer being on character
>>> boundaries, for example)?
>>>
>>> -Marshall
>>>
>>> On 9/18/2019 11:41 AM, Mario Juric wrote:
>>>> Hi,
>>>>
>>>> I investigated the XMI issue as promised and these are my findings.
>>>>
>>>> It is related to special unicode characters that are not handled by XMI
>>>> serialisation, and there seem to be two distinct categories of issues we
>>>> have identified so far:
>>>>
>>>> 1) The document text of the CAS contains special unicode characters
>>>> 2) Annotations with String features have values containing special unicode
>>>> characters
>>>>
>>>> In both cases we could for sure solve the problem if we did a better clean 
>>>> up
>>>> job upstream, but with the amount and variety of data we receive there is
>>>> always a chance something passes through, and some of it may in the general
>>>> case even be valid content.
>>>>
>>>> The first case is easy to reproduce with the OddDocumentText example I
>>>> attached. In this example the text is a snippet taken from the content of a
>>>> parsed XML document.
>>>>
>>>> The other case was not possible to reproduce with the OddFeatureText 
>>>> example,
>>>> because I am getting slightly different output to what I have in our real
>>>> setup. The OddFeatureText example is based on the simple type system I 
>>>> shared
>>>> previously. The name value of a FeatureRecord contains special unicode
>>>> characters that I found in a similar data structure in our actual CAS. The
>>>> value comes from an external knowledge base holding some noisy strings, 
>>>> which
>>>> in this case is a hieroglyph entity. However, when I write the CAS to XMI
>>>> using the small example it only outputs the first of the two characters in
>>>> "\uD80C\uDCA3”, which yields the value "&#77987;” in the XMI, but in our
>>>> actual setup both character values are written as "&#77987;&#56483;”. This
>>>> means that the attached example will for some reason parse the XMI again, 
>>>> but
>>>> it will not work in the case where both characters are written the way we
>>>> experience it. The XMI can be manually changed so that both character
>>>> values are included the way it happens in our output, and in that case a
>>>> SAXParseException is thrown.
>>>>
>>>> I don’t know whether it is outside the scope of the XMI serialiser to 
>>>> handle
>>>> any of this, but it will be good to know in any case :)
>>>>
>>>> Cheers,
>>>> Mario
>>>>
>>>>> On 17 Sep 2019, at 09:36 , Mario Juric <[email protected]> wrote:
>>>>>
>>>>> Thank you very much for looking into this. It is really appreciated and I
>>>>> think it touches upon something important, which is about data migration 
>>>>> in
>>>>> general.
>>>>>
>>>>> I agree that some of these solutions can appear specific, awkward or 
>>>>> complex
>>>>> and the way forward is not to address our use case alone. I think there 
>>>>> is a
>>>>> need for a compact and efficient binary serialization format for the CAS 
>>>>> when
>>>>> dealing with large amounts of data because this is directly visible in 
>>>>> costs
>>>>> of processing and storing, and I found the compressed binary format to be
>>>>> much better than XMI in this regard, although I have to admit it’s been a
>>>>> while since I benchmarked this. Given that UIMA already has a
>>>>> well-described type system, maybe it just lacks a way to describe schema
>>>>> evolution, similar to Apache Avro or other serialisation frameworks. I
>>>>> think a more formal approach to data migration would be critical to any
>>>>> larger operational setup.
>>>>>
>>>>> Regarding XMI, I'd like to provide some input on the problem we are
>>>>> observing, so that it can be solved. We are primarily using XMI for
>>>>> inspection/debugging purposes, and we are sometimes not able to do this
>>>>> because of this error. I will try to extract a minimal example to avoid
>>>>> involving parts that have to do with our pipeline and type system, and I
>>>>> think this would also be the best way to illustrate that the problem
>>>>> exists outside of this context. However, converting all our data to XMI
>>>>> first in order to do the conversion in our example would not be very
>>>>> practical for us, because it involves a large amount of data.
>>>>>
>>>>> Cheers,
>>>>> Mario
>>>>>
>>>>>> On 16 Sep 2019, at 23:02 , Marshall Schor <[email protected]> wrote:
>>>>>>
>>>>>> In this case, the original looks kind-of like this:
>>>>>>
>>>>>> Container
>>>>>>    features -> FSArray of FeatureAnnotation each of which
>>>>>>                              has 5 slots: sofaRef, begin, end, name, value
>>>>>>
>>>>>> the new TypeSystem has
>>>>>>
>>>>>> Container
>>>>>>    features -> FSArray of FeatureRecord each of which
>>>>>>                               has 2 slots: name, value
>>>>>>
>>>>>> The deserializer code would need some way to decide how to
>>>>>>    1) create an FSArray of FeatureRecord,
>>>>>>    2) for each element,
>>>>>>       map the FeatureAnnotation to a new instance of FeatureRecord
>>>>>>
>>>>>> I guess I could imagine a default mapping (for item 2 above) of
>>>>>>   1) change the type from A to B
>>>>>>   2) set equal-named features from A to B, drop other features
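>>>>>>
>>>>>> A rough sketch of that default mapping (untested, names mine; using the
>>>>>> string-based accessors, so primitive-valued features only):
>>>>>>
>>>>>>         // create a B instance, copying equal-named features from src (an A)
>>>>>>         FeatureStructure mapTo(CAS cas, FeatureStructure src, Type b) {
>>>>>>           FeatureStructure tgt = cas.createFS(b);
>>>>>>           for (Feature f : b.getFeatures()) {
>>>>>>             Feature srcF = src.getType().getFeatureByBaseName(f.getShortName());
>>>>>>             if (srcF != null) {  // drop features A doesn't have
>>>>>>               tgt.setFeatureValueFromString(f, src.getFeatureValueAsString(srcF));
>>>>>>             }
>>>>>>           }
>>>>>>           return tgt;
>>>>>>         }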
>>>>>>
>>>>>> This mapping would need to apply to a subset of the A's and B's, namely, 
>>>>>> only
>>>>>> those referenced by the FSArray where the element type changed.  Seems 
>>>>>> complex
>>>>>> and specific to this use case though.
>>>>>>
>>>>>> -Marshall
>>>>>>
>>>>>>
>>>>>> On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote:
>>>>>>> On 16. Sep 2019, at 19:05, Marshall Schor <[email protected]> wrote:
>>>>>>>> I can reproduce the problem, and see what is happening.  The 
>>>>>>>> deserialization
>>>>>>>> code compares the two type systems, and allows for some mismatches 
>>>>>>>> (things
>>>>>>>> present in one and not in the other), but it doesn't allow for having a
>>>>>>>> feature
>>>>>>>> whose range (value) is type XXXX in one type system and type YYYY in 
>>>>>>>> the
>>>>>>>> other.
>>>>>>>> See CasTypeSystemMapper lines 299 - 315.
>>>>>>> Without reading the code in detail - could we not relax this check such
>>>>>>> that the element type of FSArrays is not checked and the code simply
>>>>>>> assumes that the source element type has the same features as the target
>>>>>>> element type (with the usual lenient handling of missing features in the
>>>>>>> target type)? - Kind of a "duck typing" approach?
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> -- Richard
