Hi Marshall,

It seems the bug was already resolved for 8u92 in one of the backports:
https://bugs.openjdk.java.net/browse/JDK-8141098

Cheers,
Mario

> On 23 Sep 2019, at 09:45, Mario Juric <[email protected]> wrote:
>
> Hi Marshall,
>
> Thanks for the thorough and excellent investigation.
>
> We are looking into possible normalisation/cleanup of whitespace and
> invisible characters, but I don't think we can necessarily do the same
> for some of the other characters. It sounds to me, though, that
> serialising to XML 1.1 could also be a simple fix right now, but can
> this be configured? CasIOUtils doesn't seem to have an option for this,
> so I assume it's something you have working in your branch.
>
> Regarding the other problem: it seems that the JDK bug is fixed from
> Java 9 onwards. Do you think switching to a more recent Java version
> would make a difference? I think we can also try this out ourselves
> when we look into migrating to UIMA 3 once our current deliveries are
> complete. We would also like to switch to Java 11, and like the UIMA 3
> migration it will require some thorough testing.
>
> Cheers,
> Mario
>
>> On 20 Sep 2019, at 20:52, Marshall Schor <[email protected]> wrote:
>>
>> In the test "OddDocumentText", this produces a "throw" due to an
>> invalid xml char, which is the \u0002.
>>
>> This is in part because the xml version being used is xml 1.0.
>>
>> XML 1.1 expanded the set of valid characters to include \u0002.
>>
>> Here's a snip from the XmiCasSerializerTest class which serializes
>> with xml 1.1:
>>
>>     XmiCasSerializer xmiCasSerializer =
>>         new XmiCasSerializer(jCas.getTypeSystem());
>>     OutputStream out = new FileOutputStream(new File("odd-doc-txt-v11.xmi"));
>>     try {
>>       XMLSerializer xml11Serializer = new XMLSerializer(out);
>>       xml11Serializer.setOutputProperty(OutputKeys.VERSION, "1.1");
>>       xmiCasSerializer.serialize(jCas.getCas(),
>>           xml11Serializer.getContentHandler());
>>     }
>>     finally {
>>       out.close();
>>     }
>>
>> This succeeds and serializes the document using xml 1.1.
>>
>> I also tried serializing some doc text which includes the code point
>> 77987 (U+130A3). That did not serialize correctly.
>> Tracing down into the innards of some internal sax java code
>> (com.sun.org.apache.xml.internal.serializer.AttributesImplSerialize),
>> I could see the value was still "correct" in the Java string.
>>
>> When serialized (as UTF-8) it came out as the 4 bytes E7 9E 98 37.
>>
>> That is 1110 0111 1001 1110 1001 1000 0011 0111, whose first three
>> bytes form a utf8 3-byte encoding:
>>
>>     1110 xxxx 10xx xxxx 10xx xxxx
>>
>> of 0111 0111 1001 1000, which in hex is "7 7 9 8" - so it looks fishy
>> to me.
>>
>> But I think it's out of our hands - it's somewhere deep in the sax
>> transform java code.
>>
>> I looked for a bug report and found
>> https://bugs.openjdk.java.net/browse/JDK-8058175
>>
>> Bottom line is, I think, to clean out these characters early :-) .
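>>
>> For comparison, here's a quick stand-alone sketch (plain
>> String.getBytes, bypassing the sax serializer path entirely) of what a
>> correct UTF-8 encoding of that code point should look like:
>>
>>     import java.nio.charset.StandardCharsets;
>>
>>     public class Utf8Check {
>>       public static void main(String[] args) {
>>         String s = "\uD80C\uDCA3";  // U+130A3 as a Java surrogate pair
>>         // correct UTF-8 for U+130A3 is the 4-byte sequence F0 93 82 A3
>>         for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
>>           System.out.printf("%02X ", b);
>>         }
>>         System.out.println();  // prints: F0 93 82 A3
>>       }
>>     }
>>
>> so the E7 9E 98 37 coming out of the transform path is clearly mangled.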
>>
>> -Marshall
>>
>> On 9/20/2019 1:28 PM, Marshall Schor wrote:
>>> Here's an idea.
>>>
>>> If you have a string with the surrogate pair 𓂣 at position 10, and
>>> you have some Java code which iterates through the string and gets
>>> the code-point at each character offset, then that code will produce:
>>>
>>>     at position 10: the code-point 77987
>>>     at position 11: the code-point 56483
>>>
>>> Of course, it's a "bug" to iterate through a string assuming there is
>>> a complete character at each index, if you don't handle surrogate
>>> pairs.
>>>
>>> The 56483 is just the lower bits of the surrogate pair, added to
>>> 0xDC00 (see https://tools.ietf.org/html/rfc2781 ).
>>>
>>> I worry that even tools like the CVD or similar may not work
>>> properly, since they're not designed to handle surrogate pairs, I
>>> think, so I have no idea if they would work well enough for you.
>>>
>>> I'll poke around some more to see if I can enable the conversion for
>>> document strings.
>>>
>>> -Marshall
>>>
>>> On 9/20/2019 11:09 AM, Mario Juric wrote:
>>>> Thanks Marshall,
>>>>
>>>> Encoding the characters like you suggest should work just fine for
>>>> us, as long as we can serialize and deserialise the XMI, so that we
>>>> can open the content in a tool like the CVD or similar. These
>>>> characters are just noise from the original content that happen to
>>>> remain in the CAS, but they are not visible in our final output
>>>> because they are basically filtered out one way or the other by
>>>> downstream components. They become a problem, though, when they make
>>>> it more difficult for us to inspect the content.
>>>>
>>>> Regarding the feature name issue: Might you have an idea why we are
>>>> getting a different XMI output for the same character in our actual
>>>> pipeline, where it results in "&#77987;&#56483;"? I investigated the
>>>> value in the debugger again, and like you are illustrating it is
>>>> also just a single codepoint with the value 77987. We are simply not
>>>> able to load this XMI because of this, but unfortunately I couldn't
>>>> reproduce it in my small example.
>>>>
>>>> Cheers,
>>>> Mario
>>>>
>>>>> On 19 Sep 2019, at 22:41, Marshall Schor <[email protected]> wrote:
>>>>>
>>>>> The odd-feature-text seems to work OK, but has some unusual
>>>>> properties, due to that unicode character.
>>>>>
>>>>> Here's what I see: the FeatureRecord "name" field is set to a
>>>>> 1-unicode-character string, which must be encoded as 2 java
>>>>> characters.
>>>>>
>>>>> When output, it shows up in the xmi as
>>>>>
>>>>>     <noNamespace:FeatureRecord xmi:id="18" name="&#77987;" value="1.0"/>
>>>>>
>>>>> which seems correct. The name field only has 1 (extended) unicode
>>>>> character (taking 2 Java characters to represent), due to setting
>>>>> it with this code: String oddName = "\uD80C\uDCA3";
>>>>>
>>>>> When read in, the name field is assigned to a String; that string
>>>>> says it has a length of 2 (but that's because it takes 2 java chars
>>>>> to represent this char). If you have the name string in a variable
>>>>> "n", and do System.out.println(n.codePointAt(0)), it shows
>>>>> (correctly) 77987. n.codePointCount(0, n.length()) is, as
>>>>> expected, 1.
>>>>>
>>>>> So, the string value serialization and deserialization seem to be
>>>>> "working".
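>>>>>
>>>>> As a tiny stand-alone sketch, that behavior is:
>>>>>
>>>>>     String n = "\uD80C\uDCA3";                    // 1 code point, 2 java chars
>>>>>     System.out.println(n.length());               // 2 - counts java chars
>>>>>     System.out.println(n.codePointAt(0));         // 77987
>>>>>     System.out.println(n.codePointCount(0, n.length()));  // 1
>>>>>     n.codePoints().forEach(System.out::println);  // 77987 - code-point-aware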
>>>>>
>>>>> The other code, for the sofa (document) serialization, is throwing
>>>>> that error because, as currently designed, the serialization code
>>>>> checks for these kinds of characters and, if found, throws that
>>>>> exception. The code doing the checking is in
>>>>> XMLUtils.checkForNonXmlCharacters.
>>>>>
>>>>> This is because "fixing" this in the same way as the other would
>>>>> very likely result in hard-to-diagnose future errors: the
>>>>> subject-of-analysis string is processed with begin/end offsets all
>>>>> over the place, and that code assumes none of the characters are
>>>>> coded as surrogate pairs.
>>>>>
>>>>> We could change the code to output these like the name, as, e.g.,
>>>>> &#77987;
>>>>>
>>>>> Would that help in your case, or do you imagine other kinds of
>>>>> things might break (due to begin/end offsets no longer being on
>>>>> character boundaries, for example)?
>>>>>
>>>>> -Marshall
>>>>>
>>>>> On 9/18/2019 11:41 AM, Mario Juric wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I investigated the XMI issue as promised, and these are my
>>>>>> findings.
>>>>>>
>>>>>> It is related to special unicode characters that are not handled
>>>>>> by XMI serialisation, and there seem to be two distinct categories
>>>>>> of issues we have identified so far:
>>>>>>
>>>>>> 1) The document text of the CAS contains special unicode
>>>>>>    characters
>>>>>> 2) Annotations with String features have values containing special
>>>>>>    unicode characters
>>>>>>
>>>>>> In both cases we could for sure solve the problem if we did a
>>>>>> better clean-up job upstream (a sketch of the kind of thing I mean
>>>>>> is at the end of this mail), but with the amount and variety of
>>>>>> data we receive there is always a chance something passes through,
>>>>>> and some of it may in the general case even be valid content.
>>>>>>
>>>>>> The first case is easy to reproduce with the OddDocumentText
>>>>>> example I attached. In this example the text is a snippet taken
>>>>>> from the content of a parsed XML document.
>>>>>>
>>>>>> The other case was not possible to reproduce with the
>>>>>> OddFeatureText example, because I am getting slightly different
>>>>>> output from what I have in our real setup. The OddFeatureText
>>>>>> example is based on the simple type system I shared previously.
>>>>>> The name value of a FeatureRecord contains special unicode
>>>>>> characters that I found in a similar data structure in our actual
>>>>>> CAS. The value comes from an external knowledge base holding some
>>>>>> noisy strings, which in this case is a hieroglyph entity. However,
>>>>>> when I write the CAS to XMI using the small example, it only
>>>>>> outputs the first of the two characters in "\uD80C\uDCA3", which
>>>>>> yields the value "&#77987;" in the XMI, but in our actual setup
>>>>>> both character values are written as "&#77987;&#56483;". This
>>>>>> means that the attached example will for some reason parse the XMI
>>>>>> again, but it will not work in the case where both characters are
>>>>>> written the way we experience it. If the XMI is manually changed
>>>>>> so that both character values are included the way it happens in
>>>>>> our output, a SAXParserException occurs.
>>>>>>
>>>>>> I don't know whether it is outside the scope of the XMI serialiser
>>>>>> to handle any of this, but it will be good to know in any case :)
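>>>>>>
>>>>>> For reference, the upstream clean-up I have in mind is a
>>>>>> code-point-aware filter along these lines (just a sketch; the
>>>>>> method name is made up, and it substitutes a space rather than
>>>>>> dropping characters so that begin/end offsets stay stable -
>>>>>> depending on what the serializer accepts, one might also filter
>>>>>> the supplementary code points themselves):
>>>>>>
>>>>>>     // replace code points that are not valid in XML 1.0
>>>>>>     static String cleanForXml10(String in) {
>>>>>>       StringBuilder out = new StringBuilder(in.length());
>>>>>>       in.codePoints().forEach(cp -> {
>>>>>>         boolean valid = cp == 0x9 || cp == 0xA || cp == 0xD
>>>>>>             || (cp >= 0x20 && cp <= 0xD7FF)
>>>>>>             || (cp >= 0xE000 && cp <= 0xFFFD)
>>>>>>             || (cp >= 0x10000 && cp <= 0x10FFFF);
>>>>>>         if (valid) {
>>>>>>           out.appendCodePoint(cp);
>>>>>>         } else {
>>>>>>           out.append(' ');  // one char in, one char out
>>>>>>         }
>>>>>>       });
>>>>>>       return out.toString();
>>>>>>     }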
>>>>>>
>>>>>> Cheers,
>>>>>> Mario
>>>>>>
>>>>>>> On 17 Sep 2019, at 09:36, Mario Juric <[email protected]> wrote:
>>>>>>>
>>>>>>> Thank you very much for looking into this. It is really
>>>>>>> appreciated, and I think it touches upon something important,
>>>>>>> which is data migration in general.
>>>>>>>
>>>>>>> I agree that some of these solutions can appear specific, awkward
>>>>>>> or complex, and the way forward is not to address our use case
>>>>>>> alone. I think there is a need for a compact and efficient binary
>>>>>>> serialization format for the CAS when dealing with large amounts
>>>>>>> of data, because this is directly visible in the costs of
>>>>>>> processing and storage, and I found the compressed binary format
>>>>>>> to be much better than XMI in this regard, although I have to
>>>>>>> admit it's been a while since I benchmarked this. Given that UIMA
>>>>>>> already has a well-described type system, maybe it just lacks a
>>>>>>> way to describe schema evolution, similar to Apache Avro and
>>>>>>> other serialisation frameworks. I think a more formal approach to
>>>>>>> data migration would be critical to any larger operational setup.
>>>>>>>
>>>>>>> Regarding XMI, I would like to provide some input on the problem
>>>>>>> we are observing, so that it can be solved. We are primarily
>>>>>>> using XMI for inspection/debugging purposes, and we are sometimes
>>>>>>> not able to do this because of this error. I will try to extract
>>>>>>> a minimal example to avoid involving parts that have to do with
>>>>>>> our pipeline and type system, and I think this would also be the
>>>>>>> best way to illustrate that the problem exists outside of this
>>>>>>> context. However, converting all our data to XMI first in order
>>>>>>> to do the conversion in our example would not be very practical
>>>>>>> for us, because it involves a large amount of data.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Mario
>>>>>>>
>>>>>>>> On 16 Sep 2019, at 23:02, Marshall Schor <[email protected]> wrote:
>>>>>>>>
>>>>>>>> In this case, the original looks kind-of like this:
>>>>>>>>
>>>>>>>>     Container
>>>>>>>>       features -> FSArray of FeatureAnnotation, each of which
>>>>>>>>                   has 5 slots: sofaRef, begin, end, name, value
>>>>>>>>
>>>>>>>> the new TypeSystem has
>>>>>>>>
>>>>>>>>     Container
>>>>>>>>       features -> FSArray of FeatureRecord, each of which
>>>>>>>>                   has 2 slots: name, value
>>>>>>>>
>>>>>>>> The deserializer code would need some way to decide how to
>>>>>>>>   1) create an FSArray of FeatureRecord,
>>>>>>>>   2) for each element, map the FeatureAnnotation to a new
>>>>>>>>      instance of FeatureRecord
>>>>>>>>
>>>>>>>> I guess I could imagine a default mapping (for item 2 above) of
>>>>>>>>   1) change the type from A to B
>>>>>>>>   2) set equal-named features from A to B, drop other features
>>>>>>>>
>>>>>>>> This mapping would need to apply to a subset of the A's and B's,
>>>>>>>> namely, only those referenced by the FSArray where the element
>>>>>>>> type changed. Seems complex and specific to this use case,
>>>>>>>> though.
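>>>>>>>>
>>>>>>>> In code, that default mapping might look roughly like this (a
>>>>>>>> sketch against the plain CAS interfaces - the method is
>>>>>>>> hypothetical, nothing like it exists today):
>>>>>>>>
>>>>>>>>     // hypothetical: create a B, copy equal-named features from A
>>>>>>>>     FeatureStructure mapToNewType(CAS cas, FeatureStructure a, Type typeB) {
>>>>>>>>       FeatureStructure b = cas.createFS(typeB);
>>>>>>>>       for (Feature fa : a.getType().getFeatures()) {
>>>>>>>>         Feature fb = typeB.getFeatureByBaseName(fa.getShortName());
>>>>>>>>         if (fb != null && fa.getRange().isPrimitive()) {
>>>>>>>>           // copy primitive values generically via their string form
>>>>>>>>           b.setFeatureValueFromString(fb, a.getFeatureValueAsString(fa));
>>>>>>>>         }
>>>>>>>>         // features with no equal-named counterpart in B
>>>>>>>>         // (sofaRef, begin, end) are simply dropped
>>>>>>>>       }
>>>>>>>>       return b;
>>>>>>>>     }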
>>>>>>>>
>>>>>>>> -Marshall
>>>>>>>>
>>>>>>>> On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote:
>>>>>>>>> On 16 Sep 2019, at 19:05, Marshall Schor <[email protected]> wrote:
>>>>>>>>>> I can reproduce the problem, and see what is happening. The
>>>>>>>>>> deserialization code compares the two type systems, and allows
>>>>>>>>>> for some mismatches (things present in one and not in the
>>>>>>>>>> other), but it doesn't allow for having a feature whose range
>>>>>>>>>> (value) is type XXXX in one type system and type YYYY in the
>>>>>>>>>> other.
>>>>>>>>>> See CasTypeSystemMapper lines 299 - 315.
>>>>>>>>>
>>>>>>>>> Without reading the code in detail - could we not relax this
>>>>>>>>> check such that the element type of FSArrays is not checked,
>>>>>>>>> and the code simply assumes that the source element type has
>>>>>>>>> the same features as the target element type (with the usual
>>>>>>>>> lenient handling of missing features in the target type - see
>>>>>>>>> the sketch below)? Kind of a "duck typing" approach?
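>>>>>>>>>
>>>>>>>>> The lenient handling I mean is the mode the XMI deserializer
>>>>>>>>> already exposes - e.g. (sketch; "doc.xmi" and the cas variable
>>>>>>>>> are placeholders):
>>>>>>>>>
>>>>>>>>>     try (InputStream in = new FileInputStream("doc.xmi")) {
>>>>>>>>>       // lenient = true: types and features present in the XMI
>>>>>>>>>       // but absent from the CAS's type system are skipped
>>>>>>>>>       XmiCasDeserializer.deserialize(in, cas, true);
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>> - conceptually extended to tolerate a changed FSArray element
>>>>>>>>> type as well.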
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> -- Richard