Hi Marshall,

It seems the bug was already resolved for 8u92 in one of the backports:
https://bugs.openjdk.java.net/browse/JDK-8141098

Cheers,
Mario

> On 23 Sep 2019, at 09:45, Mario Juric <[email protected]> wrote:
>
> Hi Marshall,
>
> Thanks for the thorough and excellent investigation.
>
> We are looking into possible normalisation/cleanup of whitespace and
> invisible characters, but I don't think we can necessarily do the same
> for some of the other characters. It sounds to me, though, that
> serialising to XML 1.1 could also be a simple fix right now, but can
> this be configured? CasIOUtils doesn't seem to have an option for this,
> so I assume it's something you have working in your branch.
>
> Regarding the other problem: it seems that the JDK bug is fixed from
> Java 9 onwards. Do you think switching to a more recent Java version
> would make a difference? I think we can also try this out ourselves
> when we look into migrating to UIMA 3 once our current deliveries are
> complete. We would also like to switch to Java 11, and like the UIMA 3
> migration it will require some thorough testing.
>
> Cheers,
> Mario
>
>> On 20 Sep 2019, at 20:52, Marshall Schor <[email protected]> wrote:
>>
>> In the test "OddDocumentText", this produces a "throw" due to an
>> invalid xml char, which is the \u0002.
>>
>> This is in part because the xml version being used is xml 1.0.
>>
>> XML 1.1 expanded the set of valid characters to include \u0002.
>>
>> Here's a snip from the XmiCasSerializerTest class which serializes
>> with xml 1.1:
>>
>>     XmiCasSerializer xmiCasSerializer =
>>         new XmiCasSerializer(jCas.getTypeSystem());
>>     OutputStream out = new FileOutputStream(new File("odd-doc-txt-v11.xmi"));
>>     try {
>>       XMLSerializer xml11Serializer = new XMLSerializer(out);
>>       xml11Serializer.setOutputProperty(OutputKeys.VERSION, "1.1");
>>       xmiCasSerializer.serialize(jCas.getCas(),
>>           xml11Serializer.getContentHandler());
>>     }
>>     finally {
>>       out.close();
>>     }
>>
>> This succeeds and serializes the document using xml 1.1.
>>
>> I also tried serializing some doc text which includes the code point
>> 77987 (U+130A3). That did not serialize correctly.
>> Tracing down into the innards of some internal sax java code
>> (com.sun.org.apache.xml.internal.serializer.AttributesImplSerialize),
>> I could see the value was still "correct" in the Java string.
>>
>> When serialized (as UTF-8) it came out as the 4 bytes E7 9E 98 37.
>>
>> That is 1110 0111 1001 1110 1001 1000 0011 0111, whose first three
>> bytes form a utf8 3-byte encoding:
>>
>>     1110 xxxx 10xx xxxx 10xx xxxx
>>
>> of 0111 0111 1001 1000, which in hex is "7 7 9 8" - so it looks fishy
>> to me.
>>
>> But I think it's out of our hands - it's somewhere deep in the sax
>> transform java code.
>>
>> I looked for a bug report and found
>> https://bugs.openjdk.java.net/browse/JDK-8058175
>>
>> Bottom line is, I think, to clean out these characters early :-) .
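>>
>> For comparison, here's a quick stand-alone sketch (plain
>> String.getBytes, bypassing the sax serializer path entirely) of what a
>> correct UTF-8 encoding of that code point should look like:
>>
>>     import java.nio.charset.StandardCharsets;
>>
>>     public class Utf8Check {
>>       public static void main(String[] args) {
>>         String s = "\uD80C\uDCA3";  // U+130A3 as a Java surrogate pair
>>         // correct UTF-8 for U+130A3 is the 4-byte sequence F0 93 82 A3
>>         for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
>>           System.out.printf("%02X ", b);
>>         }
>>         System.out.println();  // prints: F0 93 82 A3
>>       }
>>     }
>>
>> so the E7 9E 98 37 coming out of the transform path is clearly mangled.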
>>
>> -Marshall
>>
>> On 9/20/2019 1:28 PM, Marshall Schor wrote:
>>> Here's an idea.
>>>
>>> If you have a string with the surrogate pair 𓂣 at position 10, and
>>> you have some Java code which iterates through the string and gets
>>> the code-point at each character offset, then that code will produce:
>>>
>>>     at position 10: the code-point 77987
>>>     at position 11: the code-point 56483
>>>
>>> Of course, it's a "bug" to iterate through a string assuming there is
>>> a complete character at each index, if you don't handle surrogate
>>> pairs.
>>>
>>> The 56483 is just the lower bits of the surrogate pair, added to
>>> 0xDC00 (see https://tools.ietf.org/html/rfc2781 ).
>>>
>>> I worry that even tools like the CVD or similar may not work
>>> properly, since they're not designed to handle surrogate pairs, I
>>> think, so I have no idea if they would work well enough for you.
>>>
>>> I'll poke around some more to see if I can enable the conversion for
>>> document strings.
>>>
>>> -Marshall
>>>
>>> On 9/20/2019 11:09 AM, Mario Juric wrote:
>>>> Thanks Marshall,
>>>>
>>>> Encoding the characters like you suggest should work just fine for
>>>> us, as long as we can serialize and deserialise the XMI, so that we
>>>> can open the content in a tool like the CVD or similar. These
>>>> characters are just noise from the original content that happen to
>>>> remain in the CAS, but they are not visible in our final output
>>>> because they are basically filtered out one way or the other by
>>>> downstream components. They become a problem, though, when they make
>>>> it more difficult for us to inspect the content.
>>>>
>>>> Regarding the feature name issue: Might you have an idea why we are
>>>> getting a different XMI output for the same character in our actual
>>>> pipeline, where it results in "&#77987;&#56483;"? I investigated the
>>>> value in the debugger again, and like you are illustrating it is
>>>> also just a single codepoint with the value 77987. We are simply not
>>>> able to load this XMI because of this, but unfortunately I couldn't
>>>> reproduce it in my small example.
>>>>
>>>> Cheers,
>>>> Mario
>>>>
>>>>> On 19 Sep 2019, at 22:41, Marshall Schor <[email protected]> wrote:
>>>>>
>>>>> The odd-feature-text seems to work OK, but has some unusual
>>>>> properties, due to that unicode character.
>>>>>
>>>>> Here's what I see: the FeatureRecord "name" field is set to a
>>>>> 1-unicode-character string, which must be encoded as 2 java
>>>>> characters.
>>>>>
>>>>> When output, it shows up in the xmi as
>>>>>
>>>>>     <noNamespace:FeatureRecord xmi:id="18" name="&#77987;" value="1.0"/>
>>>>>
>>>>> which seems correct. The name field only has 1 (extended) unicode
>>>>> character (taking 2 Java characters to represent), due to setting
>>>>> it with this code: String oddName = "\uD80C\uDCA3";
>>>>>
>>>>> When read in, the name field is assigned to a String; that string
>>>>> says it has a length of 2 (but that's because it takes 2 java chars
>>>>> to represent this char). If you have the name string in a variable
>>>>> "n", and do System.out.println(n.codePointAt(0)), it shows
>>>>> (correctly) 77987. n.codePointCount(0, n.length()) is, as
>>>>> expected, 1.
>>>>>
>>>>> So, the string value serialization and deserialization seem to be
>>>>> "working".
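>>>>>
>>>>> As a tiny stand-alone sketch, that behavior is:
>>>>>
>>>>>     String n = "\uD80C\uDCA3";                    // 1 code point, 2 java chars
>>>>>     System.out.println(n.length());               // 2 - counts java chars
>>>>>     System.out.println(n.codePointAt(0));         // 77987
>>>>>     System.out.println(n.codePointCount(0, n.length()));  // 1
>>>>>     n.codePoints().forEach(System.out::println);  // 77987 - code-point-aware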
>>>>>
>>>>> The other code, for the sofa (document) serialization, is throwing
>>>>> that error because, as currently designed, the serialization code
>>>>> checks for these kinds of characters and, if found, throws that
>>>>> exception. The code doing the checking is in
>>>>> XMLUtils.checkForNonXmlCharacters.
>>>>>
>>>>> This is because "fixing" this in the same way as the other would
>>>>> very likely result in hard-to-diagnose future errors: the
>>>>> subject-of-analysis string is processed with begin/end offsets all
>>>>> over the place, and that code assumes none of the characters are
>>>>> coded as surrogate pairs.
>>>>>
>>>>> We could change the code to output these like the name, as, e.g.,
>>>>> &#77987;
>>>>>
>>>>> Would that help in your case, or do you imagine other kinds of
>>>>> things might break (due to begin/end offsets no longer being on
>>>>> character boundaries, for example)?
>>>>>
>>>>> -Marshall
>>>>>
>>>>> On 9/18/2019 11:41 AM, Mario Juric wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I investigated the XMI issue as promised, and these are my
>>>>>> findings.
>>>>>>
>>>>>> It is related to special unicode characters that are not handled
>>>>>> by XMI serialisation, and there seem to be two distinct categories
>>>>>> of issues we have identified so far:
>>>>>>
>>>>>> 1) The document text of the CAS contains special unicode
>>>>>>    characters
>>>>>> 2) Annotations with String features have values containing special
>>>>>>    unicode characters
>>>>>>
>>>>>> In both cases we could for sure solve the problem if we did a
>>>>>> better clean-up job upstream (a sketch of the kind of thing I mean
>>>>>> is at the end of this mail), but with the amount and variety of
>>>>>> data we receive there is always a chance something passes through,
>>>>>> and some of it may in the general case even be valid content.
>>>>>>
>>>>>> The first case is easy to reproduce with the OddDocumentText
>>>>>> example I attached. In this example the text is a snippet taken
>>>>>> from the content of a parsed XML document.
>>>>>>
>>>>>> The other case was not possible to reproduce with the
>>>>>> OddFeatureText example, because I am getting slightly different
>>>>>> output from what I have in our real setup. The OddFeatureText
>>>>>> example is based on the simple type system I shared previously.
>>>>>> The name value of a FeatureRecord contains special unicode
>>>>>> characters that I found in a similar data structure in our actual
>>>>>> CAS. The value comes from an external knowledge base holding some
>>>>>> noisy strings, which in this case is a hieroglyph entity. However,
>>>>>> when I write the CAS to XMI using the small example, it only
>>>>>> outputs the first of the two characters in "\uD80C\uDCA3", which
>>>>>> yields the value "&#77987;" in the XMI, but in our actual setup
>>>>>> both character values are written as "&#77987;&#56483;". This
>>>>>> means that the attached example will for some reason parse the XMI
>>>>>> again, but it will not work in the case where both characters are
>>>>>> written the way we experience it. If the XMI is manually changed
>>>>>> so that both character values are included the way it happens in
>>>>>> our output, a SAXParserException occurs.
>>>>>>
>>>>>> I don't know whether it is outside the scope of the XMI serialiser
>>>>>> to handle any of this, but it will be good to know in any case :)
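>>>>>>
>>>>>> For reference, the upstream clean-up I have in mind is a
>>>>>> code-point-aware filter along these lines (just a sketch; the
>>>>>> method name is made up, and it substitutes a space rather than
>>>>>> dropping characters so that begin/end offsets stay stable -
>>>>>> depending on what the serializer accepts, one might also filter
>>>>>> the supplementary code points themselves):
>>>>>>
>>>>>>     // replace code points that are not valid in XML 1.0
>>>>>>     static String cleanForXml10(String in) {
>>>>>>       StringBuilder out = new StringBuilder(in.length());
>>>>>>       in.codePoints().forEach(cp -> {
>>>>>>         boolean valid = cp == 0x9 || cp == 0xA || cp == 0xD
>>>>>>             || (cp >= 0x20 && cp <= 0xD7FF)
>>>>>>             || (cp >= 0xE000 && cp <= 0xFFFD)
>>>>>>             || (cp >= 0x10000 && cp <= 0x10FFFF);
>>>>>>         if (valid) {
>>>>>>           out.appendCodePoint(cp);
>>>>>>         } else {
>>>>>>           out.append(' ');  // one char in, one char out
>>>>>>         }
>>>>>>       });
>>>>>>       return out.toString();
>>>>>>     }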
>>>>>>
>>>>>> Cheers,
>>>>>> Mario
>>>>>>
>>>>>>> On 17 Sep 2019, at 09:36, Mario Juric <[email protected]> wrote:
>>>>>>>
>>>>>>> Thank you very much for looking into this. It is really
>>>>>>> appreciated, and I think it touches upon something important,
>>>>>>> which is data migration in general.
>>>>>>>
>>>>>>> I agree that some of these solutions can appear specific, awkward
>>>>>>> or complex, and the way forward is not to address our use case
>>>>>>> alone. I think there is a need for a compact and efficient binary
>>>>>>> serialization format for the CAS when dealing with large amounts
>>>>>>> of data, because this is directly visible in the costs of
>>>>>>> processing and storage, and I found the compressed binary format
>>>>>>> to be much better than XMI in this regard, although I have to
>>>>>>> admit it's been a while since I benchmarked this. Given that UIMA
>>>>>>> already has a well-described type system, maybe it just lacks a
>>>>>>> way to describe schema evolution, similar to Apache Avro and
>>>>>>> other serialisation frameworks. I think a more formal approach to
>>>>>>> data migration would be critical to any larger operational setup.
>>>>>>>
>>>>>>> Regarding XMI, I would like to provide some input on the problem
>>>>>>> we are observing, so that it can be solved. We are primarily
>>>>>>> using XMI for inspection/debugging purposes, and we are sometimes
>>>>>>> not able to do this because of this error. I will try to extract
>>>>>>> a minimal example to avoid involving parts that have to do with
>>>>>>> our pipeline and type system, and I think this would also be the
>>>>>>> best way to illustrate that the problem exists outside of this
>>>>>>> context. However, converting all our data to XMI first in order
>>>>>>> to do the conversion in our example would not be very practical
>>>>>>> for us, because it involves a large amount of data.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Mario
>>>>>>>
>>>>>>>> On 16 Sep 2019, at 23:02, Marshall Schor <[email protected]> wrote:
>>>>>>>>
>>>>>>>> In this case, the original looks kind-of like this:
>>>>>>>>
>>>>>>>>     Container
>>>>>>>>       features -> FSArray of FeatureAnnotation, each of which
>>>>>>>>                   has 5 slots: sofaRef, begin, end, name, value
>>>>>>>>
>>>>>>>> the new TypeSystem has
>>>>>>>>
>>>>>>>>     Container
>>>>>>>>       features -> FSArray of FeatureRecord, each of which
>>>>>>>>                   has 2 slots: name, value
>>>>>>>>
>>>>>>>> The deserializer code would need some way to decide how to
>>>>>>>>   1) create an FSArray of FeatureRecord,
>>>>>>>>   2) for each element, map the FeatureAnnotation to a new
>>>>>>>>      instance of FeatureRecord
>>>>>>>>
>>>>>>>> I guess I could imagine a default mapping (for item 2 above) of
>>>>>>>>   1) change the type from A to B
>>>>>>>>   2) set equal-named features from A to B, drop other features
>>>>>>>>
>>>>>>>> This mapping would need to apply to a subset of the A's and B's,
>>>>>>>> namely, only those referenced by the FSArray where the element
>>>>>>>> type changed. Seems complex and specific to this use case,
>>>>>>>> though.
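>>>>>>>>
>>>>>>>> In code, that default mapping might look roughly like this (a
>>>>>>>> sketch against the plain CAS interfaces - the method is
>>>>>>>> hypothetical, nothing like it exists today):
>>>>>>>>
>>>>>>>>     // hypothetical: create a B, copy equal-named features from A
>>>>>>>>     FeatureStructure mapToNewType(CAS cas, FeatureStructure a, Type typeB) {
>>>>>>>>       FeatureStructure b = cas.createFS(typeB);
>>>>>>>>       for (Feature fa : a.getType().getFeatures()) {
>>>>>>>>         Feature fb = typeB.getFeatureByBaseName(fa.getShortName());
>>>>>>>>         if (fb != null && fa.getRange().isPrimitive()) {
>>>>>>>>           // copy primitive values generically via their string form
>>>>>>>>           b.setFeatureValueFromString(fb, a.getFeatureValueAsString(fa));
>>>>>>>>         }
>>>>>>>>         // features with no equal-named counterpart in B
>>>>>>>>         // (sofaRef, begin, end) are simply dropped
>>>>>>>>       }
>>>>>>>>       return b;
>>>>>>>>     }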
>>>>>>>>
>>>>>>>> -Marshall
>>>>>>>>
>>>>>>>> On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote:
>>>>>>>>> On 16 Sep 2019, at 19:05, Marshall Schor <[email protected]> wrote:
>>>>>>>>>> I can reproduce the problem, and see what is happening. The
>>>>>>>>>> deserialization code compares the two type systems, and allows
>>>>>>>>>> for some mismatches (things present in one and not in the
>>>>>>>>>> other), but it doesn't allow for having a feature whose range
>>>>>>>>>> (value) is type XXXX in one type system and type YYYY in the
>>>>>>>>>> other.
>>>>>>>>>> See CasTypeSystemMapper lines 299 - 315.
>>>>>>>>>
>>>>>>>>> Without reading the code in detail - could we not relax this
>>>>>>>>> check such that the element type of FSArrays is not checked,
>>>>>>>>> and the code simply assumes that the source element type has
>>>>>>>>> the same features as the target element type (with the usual
>>>>>>>>> lenient handling of missing features in the target type - see
>>>>>>>>> the sketch below)? Kind of a "duck typing" approach?
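>>>>>>>>>
>>>>>>>>> The lenient handling I mean is the mode the XMI deserializer
>>>>>>>>> already exposes - e.g. (sketch; "doc.xmi" and the cas variable
>>>>>>>>> are placeholders):
>>>>>>>>>
>>>>>>>>>     try (InputStream in = new FileInputStream("doc.xmi")) {
>>>>>>>>>       // lenient = true: types and features present in the XMI
>>>>>>>>>       // but absent from the CAS's type system are skipped
>>>>>>>>>       XmiCasDeserializer.deserialize(in, cas, true);
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>> - conceptually extended to tolerate a changed FSArray element
>>>>>>>>> type as well.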
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> -- Richard