Re: serializing using XML 1.1

This was not thought of when CasIOUtils was set up.

The way it was done (in my earlier message, quoted below) was to use the more
"primitive/lower level" APIs rather than CasIOUtils.
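
For reference, the CasIOUtils path - a minimal sketch, with hypothetical
variable names - has no place to ask for a different XML version:

    // org.apache.uima.util.CasIOUtils - no option for the XML version,
    // so XMI written this way is always XML 1.0
    CasIOUtils.save(cas, outputStream, SerialFormat.XMI);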

Please open a Jira ticket for this, with perhaps some suggestions on how it
might be specified in the CasIOUtils APIs.

Thanks! -Marshall

On 9/23/2019 3:45 AM, Mario Juric wrote:
> Hi Marshall,
>
> Thanks for the thorough and excellent investigation.
>
> We are looking into possible normalisation/cleanup of whitespace/invisible 
> characters, but I don’t think we can necessarily do the same for some of the 
> other characters. It sounds to me though that serialising to XML 1.1 could 
> also be a simple fix right now, but can this be configured? CasIOUtils 
> doesn’t seem to have an option for this, so I assume it’s something you have 
> working in your branch.
>
> Regarding the other problem: it seems that the JDK bug is fixed from Java 9
> onwards. Do you think switching to a more recent Java version would make a
> difference? I think we can also try this out ourselves when we look into
> migrating to UIMA 3 once our current deliveries are complete. We would also
> like to switch to Java 11, and like the UIMA 3 migration it will require
> some thorough testing.
>
> Cheers,
> Mario
>
>> On 20 Sep 2019, at 20:52 , Marshall Schor <[email protected]> wrote:
>>
>> In the test "OddDocumentText", this throws due to an invalid XML
>> character, the \u0002.
>>
>> This is in part because the xml version being used is xml 1.0.
>>
>> XML 1.1 expanded the set of valid characters to include \u0002.
>>
>> Here's a snip from the XmiCasSerializerTest class which serializes with
>> xml 1.1:
>>
>>         // imports: org.apache.uima.cas.impl.XmiCasSerializer,
>>         //          org.apache.uima.util.XMLSerializer,
>>         //          javax.xml.transform.OutputKeys, java.io.*
>>         XmiCasSerializer xmiCasSerializer =
>>             new XmiCasSerializer(jCas.getTypeSystem());
>>         OutputStream out =
>>             new FileOutputStream(new File("odd-doc-txt-v11.xmi"));
>>         try {
>>           XMLSerializer xml11Serializer = new XMLSerializer(out);
>>           // request XML 1.1 so characters like \u0002 are legal
>>           xml11Serializer.setOutputProperty(OutputKeys.VERSION, "1.1");
>>           xmiCasSerializer.serialize(jCas.getCas(),
>>               xml11Serializer.getContentHandler());
>>         } finally {
>>           out.close();
>>         }
>>
>> This succeeds and serializes this using xml 1.1.
>>
>> I also tried serializing some doc text which includes the code point
>> 77987 (U+130A3, i.e. the surrogate pair \uD80C\uDCA3).  That did not
>> serialize correctly.  Tracing down into the innards of the internal SAX
>> serializer code,
>> com.sun.org.apache.xml.internal.serializer.AttributesImplSerializer, I
>> could see the value was still "correct" in the Java string.
>>
>> When serialized (as UTF-8) it came out as the 4 bytes E7 9E 98 37.
>>
>> That is 1110 0111 1001 1110 1001 1000 0011 0111; the first three bytes
>> match the 3-byte utf8 encoding pattern
>>         1110 xxxx 10xx xxxx 10xx xxxx
>>
>> which decodes to 0111 0111 1001 1000, hex "7 7 9 8" (and the trailing
>> 0x37 is the ASCII digit "7"), so it looks fishy to me.
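>>
>> A quick way to see what the bytes should have been - a minimal sketch, not
>> taken from the test code:
>>
>>         // U+130A3 (decimal 77987), as the surrogate pair \uD80C\uDCA3
>>         byte[] utf8 = "\uD80C\uDCA3"
>>             .getBytes(java.nio.charset.StandardCharsets.UTF_8);
>>         for (byte b : utf8) {
>>           System.out.printf("%02X ", b);  // prints: F0 93 82 A3
>>         }
>>
>> i.e. the correct UTF-8 encoding is F0 93 82 A3, not E7 9E 98 37.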
>>
>> But I think it's out of our hands - it's somewhere deep in the sax transform
>> java code.
>>
>> I looked for a bug report and found one:
>> https://bugs.openjdk.java.net/browse/JDK-8058175
>>
>> Bottom line is, I think, to clean out these characters early :-) .
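>>
>> One possible shape for that early cleanup - a minimal sketch (the method
>> name is hypothetical) that replaces XML-1.0-invalid characters with spaces
>> so that string length, and therefore annotation offsets, are preserved:
>>
>>         // Replace characters that are invalid in XML 1.0 with spaces.
>>         // Note: this simple version does not catch unpaired surrogates.
>>         static String cleanForXml10(String text) {
>>           char[] chars = text.toCharArray();
>>           for (int i = 0; i < chars.length; i++) {
>>             char c = chars[i];
>>             boolean valid = c == 0x9 || c == 0xA || c == 0xD
>>                 || (c >= 0x20 && c <= 0xD7FF)
>>                 || (c >= 0xE000 && c <= 0xFFFD)
>>                 || Character.isSurrogate(c); // pairs are legal code points
>>             if (!valid) {
>>               chars[i] = ' ';
>>             }
>>           }
>>           return new String(chars);
>>         }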
>>
>> -Marshall
>>
>>
>> On 9/20/2019 1:28 PM, Marshall Schor wrote:
>>> here's an idea.
>>>
>>> If you have a string with the character &#77987; (the surrogate pair
>>> \uD80C\uDCA3) at position 10, and you have some Java code which iterates
>>> through the string getting the code-point at each character offset, then
>>> that code will produce:
>>>
>>> at position 10:  the code-point 77987
>>> at position 11:  the code-point 56483
>>>
>>> Of course, it's a "bug" to iterate through a string char by char,
>>> assuming each position holds a complete character, if you don't handle
>>> surrogate pairs.
>>>
>>> The 56483 is just the low surrogate: the lower ten bits of the code
>>> point, added to 0xDC00 (see https://tools.ietf.org/html/rfc2781 )
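>>>
>>> To make that concrete - a small sketch, not from any UIMA code:
>>>
>>>         String s = "0123456789" + "\uD80C\uDCA3"; // pair at index 10
>>>         System.out.println(s.codePointAt(10)); // 77987, the code point
>>>         System.out.println(s.codePointAt(11)); // 56483, bare low surrogate
>>>
>>>         // surrogate-aware alternative: step through by code points
>>>         s.codePoints().forEach(System.out::println);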
>>>
>>> I worry that even tools like the CVD or similar may not work properly,
>>> since I don't think they were designed to handle surrogate pairs, so I
>>> have no idea whether they would work well enough for you.
>>>
>>> I'll poke around some more to see if I can enable the conversion for
>>> document strings.
>>>
>>> -Marshall
>>>
>>> On 9/20/2019 11:09 AM, Mario Juric wrote:
>>>> Thanks Marshall,
>>>>
>>>> Encoding the characters like you suggest should work just fine for us as 
>>>> long as we can serialize and deserialise the XMI, so that we can open the 
>>>> content in a tool like the CVD or similar. These characters are just noise 
>>>> from the original content that happen to remain in the CAS, but they are 
>>>> not visible in our final output because they are basically filtered out 
>>>> one way or the other by downstream components. They become a problem 
>>>> though when they make it more difficult for us to inspect the content.
>>>>
>>>> Regarding the feature name issue: Might you have an idea why we are 
>>>> getting a different XMI output for the same character in our actual 
>>>> pipeline, where it results in "&#77987;&#56483;”? I investigated the value 
>>>> in the debugger again, and like you are illustrating it is also just a 
>>>> single codepoint with the value 77987. We are simply not able to load this 
>>>> XMI because of this, but unfortunately I couldn’t reproduce it in my small 
>>>> example.
>>>>
>>>> Cheers,
>>>> Mario
>>>>
>>>>> On 19 Sep 2019, at 22:41 , Marshall Schor <[email protected]> wrote:
>>>>>
>>>>> The odd-feature-text seems to work OK, but has some unusual
>>>>> properties, due to that unicode character.
>>>>>
>>>>> Here's what I see:  The FeatureRecord "name" field is set to a single
>>>>> unicode character that must be encoded as 2 Java characters.
>>>>>
>>>>> When output, it shows up in the xmi as
>>>>>   <noNamespace:FeatureRecord xmi:id="18" name="&#77987;" value="1.0"/>
>>>>> which seems correct.  The name field only has 1 (extended) unicode
>>>>> character (taking 2 Java characters to represent), due to setting it
>>>>> with this code:  String oddName = "\uD80C\uDCA3";
>>>>>
>>>>> When read in, the name field is assigned to a String; that string says
>>>>> it has a length of 2 (but that's because it takes 2 Java chars to
>>>>> represent this character).
>>>>> If you have the name string in a variable "n", and do
>>>>> System.out.println(n.codePointAt(0)), it shows (correctly) 77987.
>>>>> n.codePointCount(0, n.length()) is, as expected, 1.
>>>>>
>>>>> So, the string value serialization and deserialization seems to be 
>>>>> "working".
>>>>>
>>>>> The other code path - sofa (document) serialization - is throwing that
>>>>> error because, as currently designed, the serialization code checks for
>>>>> these kinds of characters and, if any are found, throws that exception.
>>>>> The check is in XMLUtils.checkForNonXmlCharacters.
>>>>>
>>>>> This is because "fixing this" in the same way as the other would very
>>>>> likely result in hard-to-diagnose future errors: the subject-of-analysis
>>>>> string is processed with begin/end offsets all over the place, under the
>>>>> assumption that none of the characters are coded as surrogate pairs.
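>>>>>
>>>>> To make the offset hazard concrete - a made-up illustration, not from
>>>>> the test:
>>>>>
>>>>>         String text = "a\uD80C\uDCA3b"; // 3 characters, 4 Java chars
>>>>>         // An annotation over the middle character needs begin=1, end=3;
>>>>>         // offset code assuming one char per character would use end=2
>>>>>         // and split the surrogate pair:
>>>>>         System.out.println(text.substring(1, 2)); // broken half pair
>>>>>         System.out.println(text.substring(1, 3)); // the whole character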
>>>>>
>>>>> We could change the code to output these like the name, e.g., as
>>>>> &#77987;
>>>>>
>>>>> Would that help in your case, or do you imagine other kinds of things
>>>>> might break (due to begin/end offsets no longer being on character
>>>>> boundaries, for example)?
>>>>>
>>>>> -Marshall
>>>>>
>>>>> On 9/18/2019 11:41 AM, Mario Juric wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I investigated the XMI issue as promised and these are my findings.
>>>>>>
>>>>>> It is related to special unicode characters that are not handled by
>>>>>> XMI serialisation, and there seem to be two distinct categories of
>>>>>> issues we have identified so far:
>>>>>>
>>>>>> 1) The document text of the CAS contains special unicode characters
>>>>>> 2) Annotations with String features have values containing special
>>>>>>    unicode characters
>>>>>>
>>>>>> In both cases we could for sure solve the problem if we did a better
>>>>>> cleanup job upstream, but with the amount and variety of data we
>>>>>> receive there is always a chance something passes through, and some of
>>>>>> it may in the general case even be valid content.
>>>>>>
>>>>>> The first case is easy to reproduce with the OddDocumentText example
>>>>>> I attached. In this example the text is a snippet taken from the
>>>>>> content of a parsed XML document.
>>>>>>
>>>>>> The other case was not possible to reproduce with the OddFeatureText
>>>>>> example, because I am getting slightly different output to what I have
>>>>>> in our real setup. The OddFeatureText example is based on the simple
>>>>>> type system I shared previously. The name value of a FeatureRecord
>>>>>> contains special unicode characters that I found in a similar data
>>>>>> structure in our actual CAS. The value comes from an external knowledge
>>>>>> base holding some noisy strings, which in this case is a hieroglyph
>>>>>> entity. However, when I write the CAS to XMI using the small example it
>>>>>> only outputs the first of the two characters in "\uD80C\uDCA3”, which
>>>>>> yields the value "&#77987;” in the XMI, but in our actual setup both
>>>>>> character values are written as "&#77987;&#56483;”. This means that the
>>>>>> attached example will for some reason parse the XMI again, but it will
>>>>>> not work in the case where both characters are written the way we
>>>>>> experience it. The XMI can be manually changed so that both character
>>>>>> values are included the way it happens in our output, and in that case
>>>>>> a SAXParseException is thrown.
>>>>>>
>>>>>> I don’t know whether it is outside the scope of the XMI serialiser to
>>>>>> handle any of this, but it will be good to know in any case :)
>>>>>>
>>>>>> Cheers,
>>>>>> Mario
>>>>>>
>>>>>>> On 17 Sep 2019, at 09:36 , Mario Juric <[email protected]> wrote:
>>>>>>>
>>>>>>> Thank you very much for looking into this. It is really appreciated
>>>>>>> and I think it touches upon something important, which is about data
>>>>>>> migration in general.
>>>>>>>
>>>>>>> I agree that some of these solutions can appear specific, awkward, or
>>>>>>> complex, and that the way forward is not to address our use case
>>>>>>> alone. I think there is a need for a compact and efficient binary
>>>>>>> serialization format for the CAS when dealing with large amounts of
>>>>>>> data, because this is directly visible in the costs of processing and
>>>>>>> storage, and I found the compressed binary format to be much better
>>>>>>> than XMI in this regard, although I have to admit it’s been a while
>>>>>>> since I benchmarked this. Given that UIMA already has a well-described
>>>>>>> type system, maybe it just lacks a way to describe schema evolution,
>>>>>>> similar to Apache Avro or other serialisation frameworks. I think a
>>>>>>> more formal approach to data migration would be critical to any larger
>>>>>>> operational setup.
>>>>>>>
>>>>>>> Regarding XMI, I would like to provide some input on the problem we
>>>>>>> are observing, so that it can be solved. We are primarily using XMI
>>>>>>> for inspection/debugging purposes, and we are sometimes not able to do
>>>>>>> this because of this error. I will try to extract a minimal example
>>>>>>> that avoids the parts that have to do with our pipeline and type
>>>>>>> system, and I think this would also be the best way to illustrate that
>>>>>>> the problem exists outside of this context. However, converting all
>>>>>>> our data to XMI first in order to do the conversion in our example
>>>>>>> would not be very practical for us, because it involves a large amount
>>>>>>> of data.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Mario
>>>>>>>
>>>>>>>> On 16 Sep 2019, at 23:02 , Marshall Schor <[email protected]> wrote:
>>>>>>>>
>>>>>>>> In this case, the original looks kind-of like this:
>>>>>>>>
>>>>>>>> Container
>>>>>>>>   features -> FSArray of FeatureAnnotation each of which
>>>>>>>>                             has 5 slots: sofaRef, begin, end, name, value
>>>>>>>>
>>>>>>>> the new TypeSystem has
>>>>>>>>
>>>>>>>> Container
>>>>>>>>   features -> FSArray of FeatureRecord each of which
>>>>>>>>                              has 2 slots: name, value
>>>>>>>>
>>>>>>>> The deserializer code would need some way to decide how to
>>>>>>>>   1) create an FSArray of FeatureRecord,
>>>>>>>>   2) for each element,
>>>>>>>>      map the FeatureAnnotation to a new instance of FeatureRecord
>>>>>>>>
>>>>>>>> I guess I could imagine a default mapping (for item 2 above) of
>>>>>>>>  1) change the type from A to B
>>>>>>>>  2) set equal-named features from A to B, drop other features
>>>>>>>>
>>>>>>>> This mapping would need to apply to a subset of the A's and B's,
>>>>>>>> namely, only those referenced by the FSArray where the element type
>>>>>>>> changed.  Seems complex and specific to this use case, though.
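>>>>>>>>
>>>>>>>> A rough sketch of what step 2 of that default mapping might look like
>>>>>>>> with the public CAS APIs - the helper is hypothetical, and it copies
>>>>>>>> only equal-named, equal-typed primitive features:
>>>>>>>>
>>>>>>>>         // Create a target FS and copy over same-named primitive
>>>>>>>>         // features whose range types match; all else is dropped.
>>>>>>>>         FeatureStructure mapFs(CAS cas, FeatureStructure source,
>>>>>>>>                 Type targetType) {
>>>>>>>>           FeatureStructure target = cas.createFS(targetType);
>>>>>>>>           for (Feature tf : targetType.getFeatures()) {
>>>>>>>>             Feature sf = source.getType()
>>>>>>>>                 .getFeatureByBaseName(tf.getShortName());
>>>>>>>>             if (sf != null && tf.getRange().isPrimitive()
>>>>>>>>                 && sf.getRange().equals(tf.getRange())) {
>>>>>>>>               target.setFeatureValueFromString(tf,
>>>>>>>>                   source.getFeatureValueAsString(sf));
>>>>>>>>             }
>>>>>>>>           }
>>>>>>>>           return target;
>>>>>>>>         }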
>>>>>>>>
>>>>>>>> -Marshall
>>>>>>>>
>>>>>>>>
>>>>>>>> On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote:
>>>>>>>>> On 16. Sep 2019, at 19:05, Marshall Schor <[email protected]> wrote:
>>>>>>>>>> I can reproduce the problem, and see what is happening.  The
>>>>>>>>>> deserialization code compares the two type systems, and allows for
>>>>>>>>>> some mismatches (things present in one and not in the other), but
>>>>>>>>>> it doesn't allow for having a feature whose range (value) is type
>>>>>>>>>> XXXX in one type system and type YYYY in the other.
>>>>>>>>>> See CasTypeSystemMapper lines 299 - 315.
>>>>>>>>> Without reading the code in detail - could we not relax this check
>>>>>>>>> such that the element type of FSArrays is not checked, and the code
>>>>>>>>> simply assumes that the source element type has the same features as
>>>>>>>>> the target element type (with the usual lenient handling of missing
>>>>>>>>> features in the target type)? - Kind of a "duck typing" approach?
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> -- Richard