Hi Marshall,

Thanks for the thorough and excellent investigation.
We are looking into possible normalisation/cleanup of whitespace/invisible
characters, but I don't think we can necessarily do the same for some of the
other characters. It sounds, though, as if serialising to XML 1.1 could be a
simple fix right now; can this be configured? CasIOUtils doesn't seem to have
an option for it, so I assume it's something you have working in your branch.

Regarding the other problem: the JDK bug appears to be fixed in Java 9 and
later. Do you think switching to a more recent Java version would make a
difference? We can also try this out ourselves when we look into migrating to
UIMA 3 once our current deliveries are complete. We would also like to switch
to Java 11, and like the UIMA 3 migration it will require some thorough
testing.

Cheers,
Mario

> On 20 Sep 2019, at 20:52 , Marshall Schor <[email protected]> wrote:
>
> In the test "OddDocumentText", this produces a "throw" due to an invalid
> xml char, the \u0002.
>
> This is in part because the xml version being used is xml 1.0.
> XML 1.1 expanded the set of valid characters to include \u0002.
>
> Here's a snip from the XmiCasSerializerTest class which serializes with
> xml 1.1:
>
>   XmiCasSerializer xmiCasSerializer =
>           new XmiCasSerializer(jCas.getTypeSystem());
>   OutputStream out = new FileOutputStream(new File("odd-doc-txt-v11.xmi"));
>   try {
>     XMLSerializer xml11Serializer = new XMLSerializer(out);
>     xml11Serializer.setOutputProperty(OutputKeys.VERSION, "1.1");
>     xmiCasSerializer.serialize(jCas.getCas(),
>             xml11Serializer.getContentHandler());
>   } finally {
>     out.close();
>   }
>
> This succeeds and serializes the CAS using xml 1.1.
>
> I also tried serializing some doc text which includes the code point 77987
> (U+130A3). That did not serialize correctly.
>
> I could see it while tracing down into the innards of some internal sax
> java code (com.sun.org.apache.xml.internal.serializer.AttributesImplSerialize),
> where it was still "correct" in the Java string.
>
> When serialized (as UTF-8) it came out as the 4 bytes E7 9E 98 37.
>
> In binary that is 1110 0111 1001 1110 1001 1000 0011 0111. The first three
> bytes form a 3-byte utf8 sequence (1110 xxxx 10xx xxxx 10xx xxxx) encoding
> 0111 0111 1001 1000, which is hex 7798; the trailing byte 0x37 is the ASCII
> digit "7", so the output spells out the decimal digits of 77987. That looks
> fishy to me.
>
> But I think it's out of our hands; it's somewhere deep in the sax transform
> java code.
>
> I looked for a bug report and found
> https://bugs.openjdk.java.net/browse/JDK-8058175
>
> Bottom line is, I think, to clean out these characters early :-) .
>
> -Marshall
>
>
> On 9/20/2019 1:28 PM, Marshall Schor wrote:
>> here's an idea.
>>
>> If you have a string with the surrogate pair 𓂣 at position 10, and you
>> have some Java code which iterates through the string, getting the
>> code-point at each character offset, then that code will produce:
>>
>>   at position 10: the code-point 77987
>>   at position 11: the code-point 56483
>>
>> Of course, it's a "bug" to iterate through a string assuming there is a
>> whole character at each offset, if you don't handle surrogate pairs.
>>
>> The 56483 is just the lower bits of the surrogate pair, added to 0xDC00
>> (see https://tools.ietf.org/html/rfc2781).
>>
>> I worry that even tools like the CVD or similar may not work properly,
>> since they're not designed to handle surrogate pairs, I think, so I have
>> no idea whether they would work well enough for you.
>>
>> I'll poke around some more to see if I can enable the conversion for
>> document strings.
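>>
>> To make the pitfall concrete, here's a minimal sketch (illustrative only,
>> not code from UIMA; the printed values are those for U+130A3):
>>
>>   String s = "0123456789" + "\uD80C\uDCA3"; // surrogate pair at index 10
>>   System.out.println(s.codePointAt(10));    // 77987 (U+130A3)
>>   System.out.println(s.codePointAt(11));    // 56483 (0xDC00 + low bits)
>>
>>   // safe iteration advances by Character.charCount(cp), not by 1:
>>   for (int i = 0; i < s.length(); ) {
>>     int cp = s.codePointAt(i);
>>     // ... process the code point cp ...
>>     i += Character.charCount(cp);
>>   }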
>>
>> -Marshall
>>
>> On 9/20/2019 11:09 AM, Mario Juric wrote:
>>> Thanks Marshall,
>>>
>>> Encoding the characters like you suggest should work just fine for us,
>>> as long as we can serialise and deserialise the XMI, so that we can open
>>> the content in a tool like the CVD or similar. These characters are just
>>> noise from the original content that happens to remain in the CAS; they
>>> are not visible in our final output, because they are basically filtered
>>> out one way or the other by downstream components. They become a problem,
>>> though, when they make it more difficult for us to inspect the content.
>>>
>>> Regarding the feature name issue: might you have an idea why we are
>>> getting a different XMI output for the same character in our actual
>>> pipeline, where it results in "𓂣�"? I investigated the value in
>>> the debugger again, and as you illustrate it is also just a single
>>> codepoint with the value 77987. We are simply not able to load this XMI
>>> because of it, but unfortunately I couldn't reproduce it in my small
>>> example.
>>>
>>> Cheers,
>>> Mario
>>>
>>>> On 19 Sep 2019, at 22:41 , Marshall Schor <[email protected]> wrote:
>>>>
>>>> The odd-feature-text seems to work OK, but has some unusual properties
>>>> due to that unicode character.
>>>>
>>>> Here's what I see: the FeatureRecord "name" field is set to a single
>>>> unicode character that must be encoded as 2 java characters.
>>>>
>>>> When output, it shows up in the xmi as
>>>>
>>>>   <noNamespace:FeatureRecord xmi:id="18" name="𓂣" value="1.0"/>
>>>>
>>>> which seems correct. The name field holds only 1 (extended) unicode
>>>> character (taking 2 Java characters to represent), due to setting it
>>>> with this code: String oddName = "\uD80C\uDCA3";
>>>>
>>>> When read back in, the name field is assigned to a String whose length()
>>>> is 2 (because it takes 2 java chars to represent this character). If you
>>>> have the name string in a variable "n", then
>>>> System.out.println(n.codePointAt(0)) shows (correctly) 77987, and
>>>> n.codePointCount(0, n.length()) is, as expected, 1.
>>>>
>>>> So the string value serialization and deserialization seem to be
>>>> "working".
>>>>
>>>> The other code, for the sofa (document) serialization, is throwing that
>>>> error because, as currently designed, the serialization code checks for
>>>> these kinds of characters and, if it finds any, throws that exception.
>>>> The checking is done in XMLUtils.checkForNonXmlCharacters.
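>>>>
>>>> Roughly, that check behaves like this simplified sketch (not the actual
>>>> UIMA code): a char-wise scan that flags XML-1.0-illegal characters and,
>>>> because it looks at individual java chars, also either half of a
>>>> surrogate pair:
>>>>
>>>>   static int firstProblemChar(String s) {
>>>>     for (int i = 0; i < s.length(); i++) {
>>>>       char c = s.charAt(i);
>>>>       boolean ok = c == 0x9 || c == 0xA || c == 0xD
>>>>           || (c >= 0x20 && c <= 0xD7FF)  // range ends before surrogates
>>>>           || (c >= 0xE000 && c <= 0xFFFD);
>>>>       if (!ok) return i;  // e.g. \u0002, or either half of \uD80C\uDCA3
>>>>     }
>>>>     return -1;  // everything is a legal xml 1.0 character
>>>>   }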
>>>>
>>>> This is because it's highly likely that "fixing this" in the same way
>>>> as the other would result in hard-to-diagnose future errors: the subject
>>>> of analysis string is processed with begin/end offsets all over the
>>>> place, under the assumption that no characters are coded as surrogate
>>>> pairs.
>>>>
>>>> We could change the code to output these like the name, e.g. as 𓂣.
>>>>
>>>> Would that help in your case, or do you imagine other kinds of things
>>>> might break (due to begin/end offsets no longer being on character
>>>> boundaries, for example)?
>>>>
>>>> -Marshall
>>>>
>>>> On 9/18/2019 11:41 AM, Mario Juric wrote:
>>>>> Hi,
>>>>>
>>>>> I investigated the XMI issue as promised, and these are my findings.
>>>>>
>>>>> It is related to special unicode characters that are not handled by
>>>>> XMI serialisation, and there seem to be two distinct categories of
>>>>> issue identified so far:
>>>>>
>>>>> 1) The document text of the CAS contains special unicode characters
>>>>> 2) Annotations with String features have values containing special
>>>>>    unicode characters
>>>>>
>>>>> In both cases we could for sure solve the problem if we did a better
>>>>> cleanup job upstream, but with the amount and variety of data we
>>>>> receive there is always a chance that something passes through, and
>>>>> some of it may in the general case even be valid content.
>>>>>
>>>>> The first case is easy to reproduce with the attached OddDocumentText
>>>>> example, in which the text is a snippet taken from the content of a
>>>>> parsed XML document.
>>>>>
>>>>> The other case was not possible to reproduce with the OddFeatureText
>>>>> example, because I am getting slightly different output from what I
>>>>> have in our real setup. The OddFeatureText example is based on the
>>>>> simple type system I shared previously. The name value of a
>>>>> FeatureRecord contains special unicode characters that I found in a
>>>>> similar data structure in our actual CAS; the value comes from an
>>>>> external knowledge base holding some noisy strings, in this case a
>>>>> hieroglyph entity. However, when I write the CAS to XMI using the small
>>>>> example, it outputs only the first of the two characters in
>>>>> "\uD80C\uDCA3", which yields the value "𓂣" in the XMI, whereas in
>>>>> our actual setup both character values are written, as "𓂣�".
>>>>> This means the attached example can parse the XMI again, but it will
>>>>> not work in the case where both characters are written the way we
>>>>> experience it. If the XMI is manually changed so that both character
>>>>> values are included the way it happens in our output, a
>>>>> SAXParserException is thrown.
>>>>>
>>>>> I don't know whether it is outside the scope of the XMI serialiser to
>>>>> handle any of this, but it will be good to know in any case :)
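>>>>>
>>>>> For what it's worth, the upstream cleanup we have in mind would be
>>>>> roughly the following (a sketch of our own idea, not existing UIMA
>>>>> API). It replaces each offending java char with U+FFFD one-for-one, so
>>>>> begin/end offsets stay valid:
>>>>>
>>>>>   static String cleanForXmi(String text) {
>>>>>     StringBuilder sb = new StringBuilder(text.length());
>>>>>     for (int i = 0; i < text.length(); i++) {
>>>>>       char c = text.charAt(i);
>>>>>       boolean ok = c == '\t' || c == '\n' || c == '\r'
>>>>>           || (c >= 0x20 && c <= 0xD7FF)
>>>>>           || (c >= 0xE000 && c <= 0xFFFD);
>>>>>       sb.append(ok ? c : '\uFFFD'); // same length, offsets unchanged
>>>>>     }
>>>>>     return sb.toString();
>>>>>   }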
>>>>>
>>>>> Cheers,
>>>>> Mario
>>>>>
>>>>>> On 17 Sep 2019, at 09:36 , Mario Juric <[email protected]> wrote:
>>>>>>
>>>>>> Thank you very much for looking into this. It is really appreciated,
>>>>>> and I think it touches upon something important, namely data migration
>>>>>> in general.
>>>>>>
>>>>>> I agree that some of these solutions can appear specific, awkward or
>>>>>> complex, and the way forward is not to address our use case alone. I
>>>>>> think there is a need for a compact and efficient binary serialisation
>>>>>> format for the CAS when dealing with large amounts of data, because it
>>>>>> is directly visible in the costs of processing and storage, and I
>>>>>> found the compressed binary format to be much better than XMI in this
>>>>>> regard, although I have to admit it's been a while since I benchmarked
>>>>>> this. Given that UIMA already has a well-described type system, maybe
>>>>>> it just lacks a way to describe schema evolution, similar to Apache
>>>>>> Avro and other serialisation frameworks. I think a more formal
>>>>>> approach to data migration would be critical to any larger
>>>>>> operational setup.
>>>>>>
>>>>>> Regarding XMI, I would like to provide some input on the problem we
>>>>>> are observing, so that it can be solved. We are primarily using XMI
>>>>>> for inspection/debugging purposes, and we are sometimes not able to do
>>>>>> this because of this error. I will try to extract a minimal example to
>>>>>> avoid involving parts that have to do with our pipeline and type
>>>>>> system, and I think this would also be the best way to illustrate that
>>>>>> the problem exists outside of this context. However, converting all
>>>>>> our data to XMI first in order to do the conversion in our example
>>>>>> would not be very practical for us, because it involves a large amount
>>>>>> of data.
>>>>>>
>>>>>> Cheers,
>>>>>> Mario
>>>>>>
>>>>>>> On 16 Sep 2019, at 23:02 , Marshall Schor <[email protected]> wrote:
>>>>>>>
>>>>>>> In this case, the original looks kind-of like this:
>>>>>>>
>>>>>>>   Container
>>>>>>>     features -> FSArray of FeatureAnnotation, each of which
>>>>>>>                 has 5 slots: sofaRef, begin, end, name, value
>>>>>>>
>>>>>>> the new TypeSystem has
>>>>>>>
>>>>>>>   Container
>>>>>>>     features -> FSArray of FeatureRecord, each of which
>>>>>>>                 has 2 slots: name, value
>>>>>>>
>>>>>>> The deserializer code would need some way to decide how to
>>>>>>> 1) create an FSArray of FeatureRecord,
>>>>>>> 2) for each element, map the FeatureAnnotation to a new instance of
>>>>>>>    FeatureRecord.
>>>>>>>
>>>>>>> I guess I could imagine a default mapping (for item 2 above) of
>>>>>>> 1) change the type from A to B,
>>>>>>> 2) set equal-named features from A to B, drop other features,
>>>>>>> as sketched below.
>>>>>>>
>>>>>>> This mapping would need to apply to a subset of the A's and B's,
>>>>>>> namely only those referenced by the FSArray where the element type
>>>>>>> changed. Seems complex and specific to this use case, though.
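>>>>>>>
>>>>>>> In code, the default mapping might look roughly like this (a
>>>>>>> hypothetical helper using types from org.apache.uima.cas, not an
>>>>>>> existing UIMA API; only string and FS-reference features shown):
>>>>>>>
>>>>>>>   FeatureStructure mapTo(CAS cas, FeatureStructure src, Type target) {
>>>>>>>     FeatureStructure dst = cas.createFS(target);
>>>>>>>     for (Feature tf : target.getFeatures()) {
>>>>>>>       Feature sf =
>>>>>>>           src.getType().getFeatureByBaseName(tf.getShortName());
>>>>>>>       if (sf == null || !sf.getRange().getName()
>>>>>>>               .equals(tf.getRange().getName())) {
>>>>>>>         continue;  // no equal-named feature with the same range
>>>>>>>       }
>>>>>>>       if (CAS.TYPE_NAME_STRING.equals(tf.getRange().getName())) {
>>>>>>>         dst.setStringValue(tf, src.getStringValue(sf));
>>>>>>>       } else if (!tf.getRange().isPrimitive()) {
>>>>>>>         dst.setFeatureValue(tf, src.getFeatureValue(sf));
>>>>>>>       }  // other primitive ranges omitted in this sketch
>>>>>>>     }
>>>>>>>     return dst;
>>>>>>>   }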
>>>>>>>
>>>>>>> -Marshall
>>>>>>>
>>>>>>> On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote:
>>>>>>>> On 16. Sep 2019, at 19:05, Marshall Schor <[email protected]> wrote:
>>>>>>>>> I can reproduce the problem, and see what is happening. The
>>>>>>>>> deserialization code compares the two type systems and allows for
>>>>>>>>> some mismatches (things present in one and not in the other), but
>>>>>>>>> it doesn't allow for a feature whose range (value) is type XXXX in
>>>>>>>>> one type system and type YYYY in the other. See CasTypeSystemMapper
>>>>>>>>> lines 299 - 315.
>>>>>>>>
>>>>>>>> Without reading the code in detail: could we not relax this check
>>>>>>>> such that the element type of FSArrays is not checked, and the code
>>>>>>>> simply assumes that the source element type has the same features as
>>>>>>>> the target element type (with the usual lenient handling of missing
>>>>>>>> features in the target type)? Kind of a "duck typing" approach?
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> -- Richard