Thanks. I will take a look at it and then get back to you.

Cheers,
Mario
> On 25 Sep 2019, at 20:46, Marshall Schor <[email protected]> wrote:
>
> Here's code that works that serializes in 1.1 format.
>
> The key idea is to set the OutputProperty OutputKeys.VERSION to "1.1".
>
>     XmiCasSerializer xmiCasSerializer = new XmiCasSerializer(jCas.getTypeSystem());
>     OutputStream out = new FileOutputStream(new File("odd-doc-txt-v11.xmi"));
>     try {
>       XMLSerializer xml11Serializer = new XMLSerializer(out);
>       xml11Serializer.setOutputProperty(OutputKeys.VERSION, "1.1");
>       xmiCasSerializer.serialize(jCas.getCas(), xml11Serializer.getContentHandler());
>     }
>     finally {
>       out.close();
>     }
>
> This is from a test case. -Marshall
>
> On 9/25/2019 2:16 PM, Mario Juric wrote:
>> Thanks Marshall,
>>
>> If you prefer then I can also have a look at it, although I probably need to finish something first within the next 3-4 weeks. It would probably get me started faster if you could share some of your experimental sample code.
>>
>> Cheers,
>> Mario
>>
>>> On 24 Sep 2019, at 21:32, Marshall Schor <[email protected]> wrote:
>>>
>>> Yes, makes sense, thanks for posting the Jira.
>>>
>>> If no one else steps up to work on this, I'll probably take a look in a few days. -Marshall
>>>
>>> On 9/24/2019 6:47 AM, Mario Juric wrote:
>>>> Hi Marshall,
>>>>
>>>> I added the following feature request to Apache Jira:
>>>>
>>>> https://issues.apache.org/jira/browse/UIMA-6128
>>>>
>>>> Hope it makes sense :)
>>>>
>>>> Thanks a lot for the help, it's appreciated.
>>>>
>>>> Cheers,
>>>> Mario
>>>>
>>>>> On 23 Sep 2019, at 16:33, Marshall Schor <[email protected]> wrote:
>>>>>
>>>>> Re: serializing using XML 1.1
>>>>>
>>>>> This was not thought of when setting up the CasIOUtils.
>>>>>
>>>>> The way it was done (above) was using some more "primitive/lower level" APIs, rather than the CasIOUtils.
>>>>>
>>>>> Please open a Jira ticket for this, with perhaps some suggestions on how it might be specified in the CasIOUtils APIs.
>>>>>
>>>>> Thanks! -Marshall
>>>>>
>>>>> On 9/23/2019 3:45 AM, Mario Juric wrote:
>>>>>> Hi Marshall,
>>>>>>
>>>>>> Thanks for the thorough and excellent investigation.
>>>>>>
>>>>>> We are looking into possible normalisation/cleanup of whitespace/invisible characters, but I don't think we can necessarily do the same for some of the other characters. It sounds to me, though, that serialising to XML 1.1 could also be a simple fix right now, but can this be configured? CasIOUtils doesn't seem to have an option for this, so I assume it's something you have working in your branch.
>>>>>>
>>>>>> Regarding the other problem: it seems that the JDK bug is fixed from Java 9 onwards. Do you think switching to a more recent Java version would make a difference? I think we can also try this out ourselves when we look into migrating to UIMA 3 once our current deliveries are complete. We would also like to switch to Java 11, and like the UIMA 3 migration it will require some thorough testing.
>>>>>>
>>>>>> Cheers,
>>>>>> Mario
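As one concrete way the Jira request could be specified (purely a sketch: this overload does not exist in CasIOUtils today, and the extra parameter is invented for illustration), the existing save method could grow a variant that accepts the XML version and delegates to the lower-level serializer shown above:

    import java.io.IOException;
    import java.io.OutputStream;
    import javax.xml.transform.OutputKeys;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.SerialFormat;
    import org.apache.uima.cas.impl.XmiCasSerializer;
    import org.apache.uima.util.CasIOUtils;
    import org.apache.uima.util.XMLSerializer;
    import org.xml.sax.SAXException;

    public class CasIOUtilsXml11Sketch {

        // Hypothetical convenience method - NOT part of CasIOUtils; one possible shape for UIMA-6128.
        public static void save(CAS cas, OutputStream out, SerialFormat format, String xmlVersion)
                throws IOException, SAXException {
            if (format == SerialFormat.XMI && xmlVersion != null) {
                XMLSerializer xmlSer = new XMLSerializer(out);
                xmlSer.setOutputProperty(OutputKeys.VERSION, xmlVersion);   // e.g. "1.1"
                new XmiCasSerializer(cas.getTypeSystem()).serialize(cas, xmlSer.getContentHandler());
            } else {
                CasIOUtils.save(cas, out, format);   // existing behaviour for everything else
            }
        }
    }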
>>>>>>> On 20 Sep 2019, at 20:52, Marshall Schor <[email protected]> wrote:
>>>>>>>
>>>>>>> In the test "OddDocumentText", this produces a "throw" due to an invalid xml char, which is the \u0002.
>>>>>>>
>>>>>>> This is in part because the xml version being used is xml 1.0.
>>>>>>>
>>>>>>> XML 1.1 expanded the set of valid characters to include \u0002.
>>>>>>>
>>>>>>> Here's a snip from the XmiCasSerializerTest class which serializes with xml 1.1:
>>>>>>>
>>>>>>>     XmiCasSerializer xmiCasSerializer = new XmiCasSerializer(jCas.getTypeSystem());
>>>>>>>     OutputStream out = new FileOutputStream(new File("odd-doc-txt-v11.xmi"));
>>>>>>>     try {
>>>>>>>       XMLSerializer xml11Serializer = new XMLSerializer(out);
>>>>>>>       xml11Serializer.setOutputProperty(OutputKeys.VERSION, "1.1");
>>>>>>>       xmiCasSerializer.serialize(jCas.getCas(), xml11Serializer.getContentHandler());
>>>>>>>     }
>>>>>>>     finally {
>>>>>>>       out.close();
>>>>>>>     }
>>>>>>>
>>>>>>> This succeeds and serializes this using xml 1.1.
>>>>>>>
>>>>>>> I also tried serializing some doc text which includes \u77987. That did not serialize correctly. I could see it in the code while tracing, up to some point down in the innards of some internal sax java code (com.sun.org.apache.xml.internal.serializer.AttributesImplSerialize), where it was "correct" in the Java string.
>>>>>>>
>>>>>>> When serialized (as UTF-8) it came out as a 4 byte string E79E 9837.
>>>>>>>
>>>>>>> This is 1110 0111 1001 1110 1001 1000 0011 0111, which in utf8 is a 3 byte encoding: 1110 xxxx 10xx xxxx 10xx xxxx
>>>>>>>
>>>>>>> of 0111 0111 1001 1000, which in hex is "7 7 9 8", so it looks fishy to me.
>>>>>>>
>>>>>>> But I think it's out of our hands - it's somewhere deep in the sax transform java code.
>>>>>>>
>>>>>>> I looked for a bug report and found some: https://bugs.openjdk.java.net/browse/JDK-8058175
>>>>>>>
>>>>>>> Bottom line is, I think, to clean out these characters early :-) .
>>>>>>>
>>>>>>> -Marshall
>>>>>>>
>>>>>>> On 9/20/2019 1:28 PM, Marshall Schor wrote:
>>>>>>>> here's an idea.
>>>>>>>>
>>>>>>>> If you have a string with the surrogate pair 𓂣 at position 10, and you have some Java code which is iterating through the string and getting the code-point at each character offset, then that code will produce:
>>>>>>>>
>>>>>>>> at position 10: the code-point 77987
>>>>>>>> at position 11: the code-point 56483
>>>>>>>>
>>>>>>>> Of course, it's a "bug" to iterate through a string of characters, assuming you have characters at each point, if you don't handle surrogate pairs.
>>>>>>>>
>>>>>>>> The 56483 is just the lower bits of the surrogate pair, added to xDC00 (see https://tools.ietf.org/html/rfc2781 ).
>>>>>>>>
>>>>>>>> I worry that even tools like the CVD or similar may not work properly, since they're not designed to handle surrogate pairs, I think, so I have no idea if they would work well enough for you.
>>>>>>>>
>>>>>>>> I'll poke around some more to see if I can enable the conversion for document strings.
>>>>>>>>
>>>>>>>> -Marshall
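For reference, a small self-contained Java illustration (not UIMA code) of the points above: per-char-offset iteration reports 77987 and then 56483 for the single supplementary character U+130A3, code-point-aware iteration counts it once, and a correct UTF-8 encoding of it is the 4-byte sequence F0 93 82 A3 rather than the E7 9E 98 37 seen above.

    import java.nio.charset.StandardCharsets;

    public class SurrogatePairDemo {
        public static void main(String[] args) {
            String s = "0123456789" + "\uD80C\uDCA3";    // U+130A3 occupies char indexes 10 and 11

            System.out.println(s.codePointAt(10));       // 77987 - the real code point
            System.out.println(s.codePointAt(11));       // 56483 - what a per-char-offset loop reports next
            System.out.println((int) s.charAt(10));      // 55308 - high surrogate
            System.out.println((int) s.charAt(11));      // 56483 - low surrogate (0xDC00 + low bits)

            System.out.println(s.codePointCount(0, s.length()));   // 11 code points, although length() is 12

            // UTF-8 encodes U+130A3 as four bytes: F0 93 82 A3
            for (byte b : "\uD80C\uDCA3".getBytes(StandardCharsets.UTF_8)) {
                System.out.printf("%02X ", b);
            }
            System.out.println();
        }
    }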
>>>>>>>> On 9/20/2019 11:09 AM, Mario Juric wrote:
>>>>>>>>> Thanks Marshall,
>>>>>>>>>
>>>>>>>>> Encoding the characters like you suggest should work just fine for us as long as we can serialize and deserialise the XMI, so that we can open the content in a tool like the CVD or similar. These characters are just noise from the original content that happen to remain in the CAS, but they are not visible in our final output because they are basically filtered out one way or the other by downstream components. They become a problem, though, when they make it more difficult for us to inspect the content.
>>>>>>>>>
>>>>>>>>> Regarding the feature name issue: might you have an idea why we are getting a different XMI output for the same character in our actual pipeline, where it results in "𓂣�"? I investigated the value in the debugger again, and like you are illustrating it is also just a single codepoint with the value 77987. We are simply not able to load this XMI because of this, but unfortunately I couldn't reproduce it in my small example.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Mario
>>>>>>>>>
>>>>>>>>>> On 19 Sep 2019, at 22:41, Marshall Schor <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> The odd-feature-text seems to work OK, but has some unusual properties, due to that unicode character.
>>>>>>>>>>
>>>>>>>>>> Here's what I see: the FeatureRecord "name" field is set to a single unicode character that must be encoded as 2 Java characters.
>>>>>>>>>>
>>>>>>>>>> When output, it shows up in the xmi as <noNamespace:FeatureRecord xmi:id="18" name="𓂣" value="1.0"/>, which seems correct. The name field only has 1 (extended) unicode character (taking 2 Java characters to represent), due to setting it with this code: String oddName = "\uD80C\uDCA3";
>>>>>>>>>>
>>>>>>>>>> When read in, the name field is assigned to a String; that string says it has a length of 2 (but that's because it takes 2 Java chars to represent this char). If you have the name string in a variable "n", and do System.out.println(n.codePointAt(0)), it shows (correctly) 77987. n.codePointCount(0, n.length()) is, as expected, 1.
>>>>>>>>>>
>>>>>>>>>> So, the string value serialization and deserialization seems to be "working".
>>>>>>>>>>
>>>>>>>>>> The other code - for the sofa (document) serialization - is throwing that error because, as currently designed, the serialization code checks for these kinds of characters and, if found, throws that exception. The checking code is in XMLUtils.checkForNonXmlCharacters.
>>>>>>>>>>
>>>>>>>>>> This is because it's highly likely that "fixing this" in the same way as the other would result in hard-to-diagnose future errors: the subject-of-analysis string is processed with begin/end offsets all over the place, under the assumption that the characters are not coded as surrogate pairs.
>>>>>>>>>>
>>>>>>>>>> We could change the code to output these like the name, as, e.g., 𓂣
>>>>>>>>>>
>>>>>>>>>> Would that help in your case, or do you imagine other kinds of things might break (due to begin/end offsets no longer being on character boundaries, for example)?
>>>>>>>>>>
>>>>>>>>>> -Marshall
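Following the "clean out these characters early" advice, a minimal sketch of an upstream cleanup step (not part of UIMA; the ranges follow the XML 1.0 Char production, and dropping rather than replacing is an assumption). Running it before the text is set on the CAS and before any annotations are created keeps begin/end offsets consistent. Note that supplementary characters such as U+130A3 are legal XML and pass through untouched, so this only addresses control characters like the \u0002 case.

    public final class XmlTextSanitizer {

        // Characters allowed by the XML 1.0 "Char" production
        static boolean isValidXml10(int cp) {
            return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
        }

        // Drop anything XML 1.0 cannot represent, e.g. the \u0002 from the OddDocumentText example
        public static String clean(String text) {
            StringBuilder sb = new StringBuilder(text.length());
            text.codePoints()
                .filter(XmlTextSanitizer::isValidXml10)
                .forEach(sb::appendCodePoint);
            return sb.toString();
        }
    }

Applied before cas.setDocumentText(...), this avoids the throw from XMLUtils.checkForNonXmlCharacters for such control characters.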
>>>>>>>>>> On 9/18/2019 11:41 AM, Mario Juric wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I investigated the XMI issue as promised, and these are my findings.
>>>>>>>>>>>
>>>>>>>>>>> It is related to special unicode characters that are not handled by XMI serialisation, and there seem to be two distinct categories of issues we have identified so far:
>>>>>>>>>>>
>>>>>>>>>>> 1) The document text of the CAS contains special unicode characters
>>>>>>>>>>> 2) Annotations with String features have values containing special unicode characters
>>>>>>>>>>>
>>>>>>>>>>> In both cases we could for sure solve the problem if we did a better clean-up job upstream, but with the amount and variety of data we receive there is always a chance something passes through, and some of it may in the general case even be valid content.
>>>>>>>>>>>
>>>>>>>>>>> The first case is easy to reproduce with the OddDocumentText example I attached. In this example the text is a snippet taken from the content of a parsed XML document.
>>>>>>>>>>>
>>>>>>>>>>> The other case was not possible to reproduce with the OddFeatureText example, because I am getting slightly different output to what I have in our real setup. The OddFeatureText example is based on the simple type system I shared previously. The name value of a FeatureRecord contains special unicode characters that I found in a similar data structure in our actual CAS. The value comes from an external knowledge base holding some noisy strings, which in this case is a hieroglyph entity. However, when I write the CAS to XMI using the small example it only outputs the first of the two characters in "\uD80C\uDCA3", which yields the value "𓂣" in the XMI, but in our actual setup both character values are written as "𓂣�". This means that the attached example will for some reason parse the XMI again, but it will not work in the case where both characters are written the way we experience it. The XMI can be manually changed so that both character values are included the way it happens in our output, and in this case a SAXParserException happens.
>>>>>>>>>>>
>>>>>>>>>>> I don't know whether it is outside the scope of the XMI serialiser to handle any of this, but it will be good to know in any case :)
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Mario
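To check whether a given XMI can still be read back for inspection (e.g. before opening it in the CVD), a small stand-alone read-back sketch like the one below may help. The descriptor and XMI file names are placeholders, and it assumes the CAS is created from the same type system the XMI was written with.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.uima.UIMAFramework;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.impl.XmiCasDeserializer;
    import org.apache.uima.resource.metadata.TypeSystemDescription;
    import org.apache.uima.util.CasCreationUtils;
    import org.apache.uima.util.XMLInputSource;

    public class XmiReadBack {
        public static void main(String[] args) throws Exception {
            // Placeholder file names - substitute your own descriptor and XMI
            TypeSystemDescription tsd = UIMAFramework.getXMLParser()
                .parseTypeSystemDescription(new XMLInputSource(new File("TypeSystem.xml")));
            CAS cas = CasCreationUtils.createCas(tsd, null, null);
            try (InputStream in = new FileInputStream("odd-doc-txt-v11.xmi")) {
                XmiCasDeserializer.deserialize(in, cas);   // throws SAXException on characters the parser rejects
            }
            System.out.println(cas.getDocumentText());
        }
    }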
>>>>>>>>>>>> On 17 Sep 2019, at 09:36, Mario Juric <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you very much for looking into this. It is really appreciated, and I think it touches upon something important, which is data migration in general.
>>>>>>>>>>>>
>>>>>>>>>>>> I agree that some of these solutions can appear specific, awkward or complex, and the way forward is not to address our use case alone. I think there is a need for a compact and efficient binary serialization format for the CAS when dealing with large amounts of data, because this is directly visible in the costs of processing and storing, and I found the compressed binary format to be much better than XMI in this regard, although I have to admit it's been a while since I benchmarked this. Given that UIMA already has a well described type system, maybe it just lacks a way to describe schema evolution similar to Apache Avro or similar serialisation frameworks. I think a more formal approach to data migration would be critical to any larger operational setup.
>>>>>>>>>>>>
>>>>>>>>>>>> Regarding XMI, I would like to provide some input on the problem we are observing, so that it can be solved. We are primarily using XMI for inspection/debugging purposes, and we are sometimes not able to do this because of this error. I will try to extract a minimal example to avoid involving parts that have to do with our pipeline and type system, and I think this would also be the best way to illustrate that the problem exists outside of this context. However, converting all our data to XMI first in order to do the conversion in our example would not be very practical for us, because it involves a large amount of data.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Mario
>>>>>>>>>>>>
>>>>>>>>>>>>> On 16 Sep 2019, at 23:02, Marshall Schor <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> In this case, the original looks kind-of like this:
>>>>>>>>>>>>>
>>>>>>>>>>>>>     Container
>>>>>>>>>>>>>       features -> FSArray of FeatureAnnotation, each of which
>>>>>>>>>>>>>                   has 5 slots: sofaRef, begin, end, name, value
>>>>>>>>>>>>>
>>>>>>>>>>>>> the new TypeSystem has
>>>>>>>>>>>>>
>>>>>>>>>>>>>     Container
>>>>>>>>>>>>>       features -> FSArray of FeatureRecord, each of which
>>>>>>>>>>>>>                   has 2 slots: name, value
>>>>>>>>>>>>>
>>>>>>>>>>>>> The deserializer code would need some way to decide how to
>>>>>>>>>>>>> 1) create an FSArray of FeatureRecord,
>>>>>>>>>>>>> 2) for each element, map the FeatureAnnotation to a new instance of FeatureRecord
>>>>>>>>>>>>>
>>>>>>>>>>>>> I guess I could imagine a default mapping (for item 2 above) of
>>>>>>>>>>>>> 1) change the type from A to B
>>>>>>>>>>>>> 2) set equal-named features from A to B, drop other features
>>>>>>>>>>>>>
>>>>>>>>>>>>> This mapping would need to apply to a subset of the A's and B's, namely, only those referenced by the FSArray where the element type changed. Seems complex and specific to this use case though.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Marshall
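Purely to illustrate the default mapping sketched above (copy equal-named features, drop the rest), here is a rough client-side version written against the plain CAS API. The type name and the primitive-only handling are assumptions; it is not how CasTypeSystemMapper works, just one shape the per-element mapping could take, e.g. as a one-off migration step over legacy data.

    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.Feature;
    import org.apache.uima.cas.FeatureStructure;
    import org.apache.uima.cas.Type;

    public class FeatureRecordMigrationSketch {

        // Rough illustration of "set equal-named features from A to B, drop other features".
        // Type/feature names are the ones from this thread.
        static FeatureStructure toFeatureRecord(CAS cas, FeatureStructure featureAnnotation) {
            Type recordType = cas.getTypeSystem().getType("FeatureRecord");
            FeatureStructure record = cas.createFS(recordType);
            for (Feature target : recordType.getFeatures()) {
                Feature source = featureAnnotation.getType().getFeatureByBaseName(target.getShortName());
                if (source != null && source.getRange().isPrimitive()) {
                    String value = featureAnnotation.getFeatureValueAsString(source);
                    if (value != null) {
                        record.setFeatureValueFromString(target, value);   // copies name and value
                    }
                }
            }
            return record;   // sofaRef, begin, end have no counterpart and are simply dropped
        }
    }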
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote:
>>>>>>>>>>>>>> On 16. Sep 2019, at 19:05, Marshall Schor <[email protected]> wrote:
>>>>>>>>>>>>>>> I can reproduce the problem, and see what is happening. The deserialization code compares the two type systems, and allows for some mismatches (things present in one and not in the other), but it doesn't allow for having a feature whose range (value) is type XXXX in one type system and type YYYY in the other. See CasTypeSystemMapper lines 299 - 315.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Without reading the code in detail - could we not relax this check such that the element type of FSArrays is not checked and the code simply assumes that the source element type has the same features as the target element type (with the usual lenient handling of missing features in the target type)? - Kind of a "duck typing" approach?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -- Richard
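For completeness, the "usual lenient handling" mentioned above is, as far as I understand it, the lenient mode of the XMI deserializer: types and features present in the XMI but unknown to the target type system are ignored instead of causing an error. It does not cover a changed FSArray element type, which is what the proposed relaxation would add. A minimal sketch, using the same imports as the read-back example earlier (the file name is a placeholder):

    static void loadLeniently(CAS cas, File xmiFile) throws Exception {
        try (InputStream in = new FileInputStream(xmiFile)) {
            XmiCasDeserializer.deserialize(in, cas, true);   // lenient = true: unknown types/features in the XMI are ignored
        }
    }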
