Thanks Marshall,

If you prefer, I can also have a look at it, although I probably need to finish something else first within the next 3-4 weeks. It would get me started faster if you could share some of your experimental sample code.
Cheers,
Mario

> On 24 Sep 2019, at 21:32, Marshall Schor <m...@schor.com> wrote:
>
> yes, makes sense, thanks for posting the Jira.
>
> If no one else steps up to work on this, I'll probably take a look in a
> few days. -Marshall
>
> On 9/24/2019 6:47 AM, Mario Juric wrote:
>> Hi Marshall,
>>
>> I added the following feature request to Apache Jira:
>>
>> https://issues.apache.org/jira/browse/UIMA-6128
>>
>> Hope it makes sense :)
>>
>> Thanks a lot for the help, it's appreciated.
>>
>> Cheers,
>> Mario
>>
>>> On 23 Sep 2019, at 16:33, Marshall Schor <m...@schor.com> wrote:
>>>
>>> Re: serializing using XML 1.1
>>>
>>> This was not thought of when setting up the CasIOUtils.
>>>
>>> The way it was done (above) was using some more primitive, lower-level
>>> APIs rather than the CasIOUtils.
>>>
>>> Please open a Jira ticket for this, with perhaps some suggestions on how
>>> it might be specified in the CasIOUtils APIs.
>>>
>>> Thanks! -Marshall
>>>
>>> On 9/23/2019 3:45 AM, Mario Juric wrote:
>>>> Hi Marshall,
>>>>
>>>> Thanks for the thorough and excellent investigation.
>>>>
>>>> We are looking into possible normalisation/cleanup of whitespace and
>>>> invisible characters, but I don't think we can necessarily do the same
>>>> for some of the other characters. It sounds to me, though, that
>>>> serialising to XML 1.1 could also be a simple fix right now - but can
>>>> this be configured? CasIOUtils doesn't seem to have an option for it,
>>>> so I assume it's something you have working in your branch.
>>>>
>>>> Regarding the other problem: it seems that the JDK bug is fixed from
>>>> Java 9 onwards. Do you think switching to a more recent Java version
>>>> would make a difference? We can also try this out ourselves when we
>>>> look into migrating to UIMA 3 once our current deliveries are complete.
>>>> We would also like to switch to Java 11, and like the UIMA 3 migration
>>>> it will require some thorough testing.
>>>>
>>>> Cheers,
>>>> Mario
>>>>
>>>>> On 20 Sep 2019, at 20:52, Marshall Schor <m...@schor.com> wrote:
>>>>>
>>>>> In the test "OddDocumentText", this produces a "throw" due to an
>>>>> invalid xml char, which is the \u0002.
>>>>>
>>>>> This is in part because the xml version being used is xml 1.0.
>>>>>
>>>>> XML 1.1 expanded the set of valid characters to include \u0002.
>>>>>
>>>>> Here's a snip from the XmiCasSerializerTest class which serializes
>>>>> with xml 1.1:
>>>>>
>>>>>   XmiCasSerializer xmiCasSerializer =
>>>>>       new XmiCasSerializer(jCas.getTypeSystem());
>>>>>   OutputStream out = new FileOutputStream(new File("odd-doc-txt-v11.xmi"));
>>>>>   try {
>>>>>     XMLSerializer xml11Serializer = new XMLSerializer(out);
>>>>>     xml11Serializer.setOutputProperty(OutputKeys.VERSION, "1.1");
>>>>>     xmiCasSerializer.serialize(jCas.getCas(),
>>>>>         xml11Serializer.getContentHandler());
>>>>>   } finally {
>>>>>     out.close();
>>>>>   }
>>>>>
>>>>> This succeeds and serializes this using xml 1.1.
>>>>>
>>>>> I also tried serializing some doc text which includes the code point
>>>>> 77987. That did not serialize correctly.
>>>>> I could see it while tracing, down in the innards of some internal
>>>>> sax java code
>>>>> (com.sun.org.apache.xml.internal.serializer.AttributesImplSerialize),
>>>>> where it was "correct" in the Java string.
>>>>>
>>>>> When serialized (as UTF-8) it came out as the 4 bytes E7 9E 98 37.
>>>>>
>>>>> This is 1110 0111 1001 1110 1001 1000 0011 0111, which in utf8 is a
>>>>> 3 byte encoding:
>>>>>   1110 xxxx 10xx xxxx 10xx xxxx
>>>>> of 0111 0111 1001 1000, which in hex is "7 7 9 8", so it looks fishy
>>>>> to me.
>>>>>
>>>>> But I think it's out of our hands - it's somewhere deep in the sax
>>>>> transform java code.
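As a quick sanity check on the byte analysis above: a conforming UTF-8 encoder turns code point 77987 (U+130A3, the surrogate pair \uD80C\uDCA3) into a 4-byte sequence. The 3-byte E7 9E 98 prefix seen in the trace decodes to U+7798, which does not correspond to any valid encoding of U+130A3. A minimal standalone sketch (plain Java, not UIMA or JDK-internal code) showing the expected bytes:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    public static void main(String[] args) {
        // U+130A3 (decimal 77987), written as its Java surrogate pair
        String s = "\uD80C\uDCA3";
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);

        // A conforming encoder produces the 4-byte sequence F0 93 82 A3.
        StringBuilder hex = new StringBuilder();
        for (byte b : utf8) {
            hex.append(String.format("%02X ", b));
        }
        System.out.println(hex.toString().trim()); // F0 93 82 A3
    }
}
```

Comparing these bytes against what actually lands in the serialized file is a quick way to tell whether the damage happens in the encoder or earlier.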
>>>>>
>>>>> I looked for a bug report and found some, e.g.
>>>>> https://bugs.openjdk.java.net/browse/JDK-8058175
>>>>>
>>>>> Bottom line is, I think, to clean out these characters early :-) .
>>>>>
>>>>> -Marshall
>>>>>
>>>>> On 9/20/2019 1:28 PM, Marshall Schor wrote:
>>>>>> here's an idea.
>>>>>>
>>>>>> If you have a string with the surrogate pair 𓂣 at position 10, and
>>>>>> you have some Java code which is iterating through the string and
>>>>>> getting the code-point at each character offset, then that code will
>>>>>> produce:
>>>>>>
>>>>>>   at position 10: the code-point 77987
>>>>>>   at position 11: the code-point 56483
>>>>>>
>>>>>> Of course, it's a "bug" to iterate through a string of characters,
>>>>>> assuming you have a character at each position, if you don't handle
>>>>>> surrogate pairs.
>>>>>>
>>>>>> The 56483 is just the lower bits of the surrogate pair, added to
>>>>>> xDC00 (see https://tools.ietf.org/html/rfc2781 ).
>>>>>>
>>>>>> I worry that even tools like the CVD or similar may not work
>>>>>> properly, since they're not designed to handle surrogate pairs, I
>>>>>> think, so I have no idea if they would work well enough for you.
>>>>>>
>>>>>> I'll poke around some more to see if I can enable the conversion for
>>>>>> document strings.
>>>>>>
>>>>>> -Marshall
>>>>>>
>>>>>> On 9/20/2019 11:09 AM, Mario Juric wrote:
>>>>>>> Thanks Marshall,
>>>>>>>
>>>>>>> Encoding the characters like you suggest should work just fine for
>>>>>>> us, as long as we can serialize and deserialise the XMI so that we
>>>>>>> can open the content in a tool like the CVD or similar. These
>>>>>>> characters are just noise from the original content that happens to
>>>>>>> remain in the CAS, but they are not visible in our final output
>>>>>>> because they are basically filtered out one way or the other by
>>>>>>> downstream components. They become a problem, though, when they make
>>>>>>> it more difficult for us to inspect the content.
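Marshall's position-10 / position-11 numbers above can be reproduced in a few lines. This sketch (with the pair placed at index 10 of an otherwise ASCII string) contrasts naive per-index access with a surrogate-aware loop:

```java
public class SurrogatePairIteration {
    public static void main(String[] args) {
        // 10 ASCII chars, then U+130A3 as its surrogate pair at indexes 10/11
        String s = "0123456789" + "\uD80C\uDCA3";

        // Naive per-index iteration sees two "characters" for the pair:
        System.out.println(s.codePointAt(10)); // 77987 (the real code point)
        System.out.println(s.codePointAt(11)); // 56483 (the bare low surrogate, 0xDCA3)

        // Surrogate-aware iteration advances by Character.charCount(cp)
        // and visits the pair exactly once:
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.println(cp);
            i += Character.charCount(cp);
        }
    }
}
```

Code that indexes a string char by char without `Character.charCount` is exactly the kind of iteration Marshall describes as a "bug", and it is also why begin/end offsets stop lining up with character boundaries.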
>>>>>>>
>>>>>>> Regarding the feature name issue: might you have an idea why we are
>>>>>>> getting a different XMI output for the same character in our actual
>>>>>>> pipeline, where it results in "𓂣�"? I investigated the value in
>>>>>>> the debugger again, and like you are illustrating, it is also just a
>>>>>>> single codepoint with the value 77987. We are simply not able to
>>>>>>> load this XMI because of it, but unfortunately I couldn't reproduce
>>>>>>> the behaviour in my small example.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Mario
>>>>>>>
>>>>>>>> On 19 Sep 2019, at 22:41, Marshall Schor <m...@schor.com> wrote:
>>>>>>>>
>>>>>>>> The odd-feature-text seems to work OK, but has some unusual
>>>>>>>> properties, due to that unicode character.
>>>>>>>>
>>>>>>>> Here's what I see: the FeatureRecord "name" field is set to a
>>>>>>>> 1-unicode-character string, which must be encoded as 2 java
>>>>>>>> characters.
>>>>>>>>
>>>>>>>> When output, it shows up in the xmi as
>>>>>>>>   <noNamespace:FeatureRecord xmi:id="18" name="𓂣" value="1.0"/>
>>>>>>>> which seems correct. The name field only has 1 (extended) unicode
>>>>>>>> character (taking 2 Java characters to represent), due to setting
>>>>>>>> it with this code: String oddName = "\uD80C\uDCA3";
>>>>>>>>
>>>>>>>> When read in, the name field is assigned to a String; that string
>>>>>>>> says it has a length of 2 (but that's because it takes 2 java chars
>>>>>>>> to represent this char). If you have the name string in a variable
>>>>>>>> "n" and do System.out.println(n.codePointAt(0)), it shows
>>>>>>>> (correctly) 77987. n.codePointCount(0, n.length()) is, as
>>>>>>>> expected, 1.
>>>>>>>>
>>>>>>>> So, the string value serialization and deserialization seems to be
>>>>>>>> "working".
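The length-versus-code-point behaviour described for the name field is easy to confirm in isolation (plain Java, not UIMA code):

```java
public class NameFieldCheck {
    public static void main(String[] args) {
        // The one-code-point name value, as set in the OddFeatureText example
        String n = "\uD80C\uDCA3";

        System.out.println(n.length());                      // 2 java chars
        System.out.println(n.codePointAt(0));                // 77987
        System.out.println(n.codePointCount(0, n.length())); // 1 unicode character
    }
}
```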
>>>>>>>>
>>>>>>>> The other code - for the sofa (document) serialization - is
>>>>>>>> throwing that error because, as currently designed, the
>>>>>>>> serialization code checks for these kinds of characters and, if
>>>>>>>> found, throws that exception. The checking code is in
>>>>>>>> XMLUtils.checkForNonXmlCharacters.
>>>>>>>>
>>>>>>>> This is because it's highly likely that "fixing this" in the same
>>>>>>>> way as the other case would result in hard-to-diagnose future
>>>>>>>> errors: the subject-of-analysis string is processed with begin /
>>>>>>>> end offsets all over the place, under the assumption that none of
>>>>>>>> the characters are coded as surrogate pairs.
>>>>>>>>
>>>>>>>> We could change the code to output these like the name, as, e.g.,
>>>>>>>> 𓂣
>>>>>>>>
>>>>>>>> Would that help in your case, or do you imagine other kinds of
>>>>>>>> things might break (due to begin/end offsets no longer being on
>>>>>>>> character boundaries, for example)?
>>>>>>>>
>>>>>>>> -Marshall
>>>>>>>>
>>>>>>>> On 9/18/2019 11:41 AM, Mario Juric wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I investigated the XMI issue as promised, and these are my
>>>>>>>>> findings.
>>>>>>>>>
>>>>>>>>> It is related to special unicode characters that are not handled
>>>>>>>>> by XMI serialisation, and there seem to be two distinct categories
>>>>>>>>> of issues we have identified so far.
>>>>>>>>>
>>>>>>>>>   1) The document text of the CAS contains special unicode
>>>>>>>>>      characters.
>>>>>>>>>   2) Annotations with String features have values containing
>>>>>>>>>      special unicode characters.
>>>>>>>>>
>>>>>>>>> In both cases we could certainly solve the problem if we did a
>>>>>>>>> better clean-up job upstream, but with the amount and variety of
>>>>>>>>> data we receive there is always a chance something passes through,
>>>>>>>>> and some of it may in the general case even be valid content.
>>>>>>>>>
>>>>>>>>> The first case is easy to reproduce with the OddDocumentText
>>>>>>>>> example I attached. In this example the text is a snippet taken
>>>>>>>>> from the content of a parsed XML document.
>>>>>>>>>
>>>>>>>>> The other case was not possible to reproduce with the
>>>>>>>>> OddFeatureText example, because I am getting slightly different
>>>>>>>>> output from what I have in our real setup. The OddFeatureText
>>>>>>>>> example is based on the simple type system I shared previously.
>>>>>>>>> The name value of a FeatureRecord contains special unicode
>>>>>>>>> characters that I found in a similar data structure in our actual
>>>>>>>>> CAS. The value comes from an external knowledge base holding some
>>>>>>>>> noisy strings, which in this case is a hieroglyph entity. However,
>>>>>>>>> when I write the CAS to XMI using the small example, it only
>>>>>>>>> outputs the first of the two characters in "\uD80C\uDCA3", which
>>>>>>>>> yields the value "𓂣" in the XMI, but in our actual setup both
>>>>>>>>> character values are written as "𓂣�". This means that the
>>>>>>>>> attached example will for some reason parse the XMI again, but it
>>>>>>>>> will not work in the case where both characters are written the
>>>>>>>>> way we experience it.
>>>>>>>>> The XMI can be manually changed so that both character values are
>>>>>>>>> included the way it happens in our output, and in that case a
>>>>>>>>> SAXParserException occurs.
>>>>>>>>>
>>>>>>>>> I don't know whether it is outside the scope of the XMI serialiser
>>>>>>>>> to handle any of this, but it will be good to know in any case :)
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Mario
>>>>>>>>>
>>>>>>>>>> On 17 Sep 2019, at 09:36, Mario Juric <m...@unsilo.ai> wrote:
>>>>>>>>>>
>>>>>>>>>> Thank you very much for looking into this. It is really
>>>>>>>>>> appreciated, and I think it touches upon something important,
>>>>>>>>>> which is data migration in general.
>>>>>>>>>>
>>>>>>>>>> I agree that some of these solutions can appear specific, awkward
>>>>>>>>>> or complex, and the way forward is not to address our use case
>>>>>>>>>> alone. I think there is a need for a compact and efficient binary
>>>>>>>>>> serialization format for the CAS when dealing with large amounts
>>>>>>>>>> of data, because this is directly visible in the costs of
>>>>>>>>>> processing and storage, and I found the compressed binary format
>>>>>>>>>> to be much better than XMI in this regard, although I have to
>>>>>>>>>> admit it's been a while since I benchmarked this. Given that UIMA
>>>>>>>>>> already has a well-described type system, maybe it just lacks a
>>>>>>>>>> way to describe schema evolution similar to Apache Avro or other
>>>>>>>>>> serialisation frameworks. I think a more formal approach to data
>>>>>>>>>> migration would be critical to any larger operational setup.
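For reference, the kind of scan performed by a check like XMLUtils.checkForNonXmlCharacters (mentioned earlier in the thread) can be sketched as below. `firstInvalidXml10Index` is a hypothetical helper name, not the UIMA API; the ranges are the XML 1.0 Char production. Note that supplementary characters such as U+130A3 are valid XML 1.0 chars, so it is control characters like \u0002 and unpaired surrogates that trip this check:

```java
public class Xml10CharCheck {

    // Returns the index of the first character that is not a valid XML 1.0
    // Char, or -1 if the string is clean. A bare (unpaired) surrogate is
    // reported too, because codePointAt then yields a value in D800-DFFF,
    // which falls outside every valid range below.
    static int firstInvalidXml10Index(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            boolean valid = cp == 0x9 || cp == 0xA || cp == 0xD
                    || (cp >= 0x20 && cp <= 0xD7FF)
                    || (cp >= 0xE000 && cp <= 0xFFFD)
                    || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (!valid) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstInvalidXml10Index("ok \u0002 text")); // 3
        System.out.println(firstInvalidXml10Index("\uD80C\uDCA3"));   // -1 (valid supplementary char)
    }
}
```

A scan like this, run before serialization, is one way to implement the "clean out these characters early" advice from the thread.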
>>>>>>>>>>
>>>>>>>>>> Regarding XMI, I would like to provide some input on the problem
>>>>>>>>>> we are observing, so that it can be solved. We are primarily
>>>>>>>>>> using XMI for inspection/debugging purposes, and we are sometimes
>>>>>>>>>> not able to do this because of this error. I will try to extract
>>>>>>>>>> a minimal example to avoid involving parts that have to do with
>>>>>>>>>> our pipeline and type system, and I think this would also be the
>>>>>>>>>> best way to illustrate that the problem exists outside of this
>>>>>>>>>> context. However, converting all our data to XMI first in order
>>>>>>>>>> to do the conversion in our example would not be very practical
>>>>>>>>>> for us, because it involves a large amount of data.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Mario
>>>>>>>>>>
>>>>>>>>>>> On 16 Sep 2019, at 23:02, Marshall Schor <m...@schor.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> In this case, the original looks kind-of like this:
>>>>>>>>>>>
>>>>>>>>>>>   Container
>>>>>>>>>>>     features -> FSArray of FeatureAnnotation, each of which has
>>>>>>>>>>>       5 slots: sofaRef, begin, end, name, value
>>>>>>>>>>>
>>>>>>>>>>> the new TypeSystem has
>>>>>>>>>>>
>>>>>>>>>>>   Container
>>>>>>>>>>>     features -> FSArray of FeatureRecord, each of which has
>>>>>>>>>>>       2 slots: name, value
>>>>>>>>>>>
>>>>>>>>>>> The deserializer code would need some way to decide how to
>>>>>>>>>>>   1) create an FSArray of FeatureRecord,
>>>>>>>>>>>   2) for each element, map the FeatureAnnotation to a new
>>>>>>>>>>>      instance of FeatureRecord
>>>>>>>>>>>
>>>>>>>>>>> I guess I could imagine a default mapping (for item 2 above) of
>>>>>>>>>>>   1) change the type from A to B
>>>>>>>>>>>   2) set equal-named features from A to B, drop other features
>>>>>>>>>>>
>>>>>>>>>>> This mapping would need to apply to a subset of the A's and B's,
>>>>>>>>>>> namely, only those referenced by the FSArray where the element
>>>>>>>>>>> type changed. Seems complex and specific to this use case,
>>>>>>>>>>> though.
>>>>>>>>>>>
>>>>>>>>>>> -Marshall
>>>>>>>>>>>
>>>>>>>>>>> On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote:
>>>>>>>>>>>> On 16. Sep 2019, at 19:05, Marshall Schor <m...@schor.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> I can reproduce the problem, and see what is happening. The
>>>>>>>>>>>>> deserialization code compares the two type systems and allows
>>>>>>>>>>>>> for some mismatches (things present in one and not in the
>>>>>>>>>>>>> other), but it doesn't allow for having a feature whose range
>>>>>>>>>>>>> (value) is type XXXX in one type system and type YYYY in the
>>>>>>>>>>>>> other. See CasTypeSystemMapper lines 299 - 315.
>>>>>>>>>>>> Without reading the code in detail - could we not relax this
>>>>>>>>>>>> check such that the element type of FSArrays is not checked,
>>>>>>>>>>>> and the code simply assumes that the source element type has
>>>>>>>>>>>> the same features as the target element type (with the usual
>>>>>>>>>>>> lenient handling of missing features in the target type)? Kind
>>>>>>>>>>>> of a "duck typing" approach?
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>
>>>>>>>>>>>> -- Richard
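The default mapping Marshall sketches (copy equal-named features, drop the rest) can be illustrated with plain maps standing in for feature structures. The names below are illustrative only, not the UIMA API:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class FeatureMappingSketch {

    // Copy only the features whose names also exist on the target type;
    // everything else (here: sofaRef, begin, end) is dropped.
    static Map<String, Object> mapToTargetType(Map<String, Object> source,
                                               Set<String> targetFeatureNames) {
        Map<String, Object> target = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : source.entrySet()) {
            if (targetFeatureNames.contains(e.getKey())) {
                target.put(e.getKey(), e.getValue());
            }
        }
        return target;
    }

    public static void main(String[] args) {
        // A FeatureAnnotation-like source with 5 slots
        Map<String, Object> featureAnnotation = new LinkedHashMap<>();
        featureAnnotation.put("sofaRef", 1);
        featureAnnotation.put("begin", 0);
        featureAnnotation.put("end", 10);
        featureAnnotation.put("name", "weight");
        featureAnnotation.put("value", "1.0");

        // A FeatureRecord-like target with 2 slots
        Map<String, Object> featureRecord =
                mapToTargetType(featureAnnotation, Set.of("name", "value"));
        System.out.println(featureRecord); // {name=weight, value=1.0}
    }
}
```

As the thread notes, the hard part in a real deserializer is not this copy step but deciding which instances the mapping applies to (only those reachable via the FSArray whose element type changed).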