Thanks. I will take a look at it and then get back to you.

Cheers,
Mario
> On 25 Sep 2019, at 20:46, Marshall Schor <[email protected]> wrote:
>
> Here's code that works that serializes in 1.1 format.
>
> The key idea is to set the OutputProperty OutputKeys.VERSION to "1.1".
>
>     XmiCasSerializer xmiCasSerializer = new XmiCasSerializer(jCas.getTypeSystem());
>     OutputStream out = new FileOutputStream(new File("odd-doc-txt-v11.xmi"));
>     try {
>       XMLSerializer xml11Serializer = new XMLSerializer(out);
>       xml11Serializer.setOutputProperty(OutputKeys.VERSION, "1.1");
>       xmiCasSerializer.serialize(jCas.getCas(), xml11Serializer.getContentHandler());
>     }
>     finally {
>       out.close();
>     }
>
> This is from a test case. -Marshall
>
> On 9/25/2019 2:16 PM, Mario Juric wrote:
>> Thanks Marshall,
>>
>> If you prefer then I can also have a look at it, although I probably need to finish something first within the next 3-4 weeks. It would probably get me started faster if you could share some of your experimental sample code.
>>
>> Cheers,
>> Mario
>>
>>> On 24 Sep 2019, at 21:32, Marshall Schor <[email protected]> wrote:
>>>
>>> Yes, makes sense, thanks for posting the Jira.
>>>
>>> If no one else steps up to work on this, I'll probably take a look in a few days. -Marshall
>>>
>>> On 9/24/2019 6:47 AM, Mario Juric wrote:
>>>> Hi Marshall,
>>>>
>>>> I added the following feature request to Apache Jira:
>>>>
>>>> https://issues.apache.org/jira/browse/UIMA-6128
>>>>
>>>> Hope it makes sense :)
>>>>
>>>> Thanks a lot for the help, it's appreciated.
>>>>
>>>> Cheers,
>>>> Mario
>>>>
>>>>> On 23 Sep 2019, at 16:33, Marshall Schor <[email protected]> wrote:
>>>>>
>>>>> Re: serializing using XML 1.1
>>>>>
>>>>> This was not thought of when setting up the CasIOUtils.
>>>>>
>>>>> The way it was done (above) was using some more "primitive/lower level" APIs, rather than the CasIOUtils.
>>>>>
>>>>> Please open a Jira ticket for this, with perhaps some suggestions on how it might be specified in the CasIOUtils APIs.
>>>>>
>>>>> Thanks! -Marshall
>>>>>
>>>>> On 9/23/2019 3:45 AM, Mario Juric wrote:
>>>>>> Hi Marshall,
>>>>>>
>>>>>> Thanks for the thorough and excellent investigation.
>>>>>>
>>>>>> We are looking into possible normalisation/cleanup of whitespace/invisible characters, but I don't think we can necessarily do the same for some of the other characters. It sounds to me, though, that serialising to XML 1.1 could also be a simple fix right now, but can this be configured? CasIOUtils doesn't seem to have an option for this, so I assume it's something you have working in your branch.
>>>>>>
>>>>>> Regarding the other problem: it seems that the JDK bug is fixed from Java 9 onwards. Do you think switching to a more recent Java version would make a difference? I think we can also try this out ourselves when we look into migrating to UIMA 3 once our current deliveries are complete. We would also like to switch to Java 11, and like the UIMA 3 migration it will require some thorough testing.
>>>>>>
>>>>>> Cheers,
>>>>>> Mario
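As one concrete way the Jira request could be specified (purely a sketch: this overload does not exist in CasIOUtils today, and the extra parameter is invented for illustration), the existing save method could grow a variant that accepts the XML version and delegates to the lower-level serializer shown above:

    import java.io.IOException;
    import java.io.OutputStream;
    import javax.xml.transform.OutputKeys;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.SerialFormat;
    import org.apache.uima.cas.impl.XmiCasSerializer;
    import org.apache.uima.util.CasIOUtils;
    import org.apache.uima.util.XMLSerializer;
    import org.xml.sax.SAXException;

    public class CasIOUtilsXml11Sketch {

        // Hypothetical convenience method - NOT part of CasIOUtils; one possible shape for UIMA-6128.
        public static void save(CAS cas, OutputStream out, SerialFormat format, String xmlVersion)
                throws IOException, SAXException {
            if (format == SerialFormat.XMI && xmlVersion != null) {
                XMLSerializer xmlSer = new XMLSerializer(out);
                xmlSer.setOutputProperty(OutputKeys.VERSION, xmlVersion);   // e.g. "1.1"
                new XmiCasSerializer(cas.getTypeSystem()).serialize(cas, xmlSer.getContentHandler());
            } else {
                CasIOUtils.save(cas, out, format);   // existing behaviour for everything else
            }
        }
    }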
>>>>>>> On 20 Sep 2019, at 20:52, Marshall Schor <[email protected]> wrote:
>>>>>>>
>>>>>>> In the test "OddDocumentText", this produces a "throw" due to an invalid xml char, which is the \u0002.
>>>>>>>
>>>>>>> This is in part because the xml version being used is xml 1.0.
>>>>>>>
>>>>>>> XML 1.1 expanded the set of valid characters to include \u0002.
>>>>>>>
>>>>>>> Here's a snip from the XmiCasSerializerTest class which serializes with xml 1.1:
>>>>>>>
>>>>>>>     XmiCasSerializer xmiCasSerializer = new XmiCasSerializer(jCas.getTypeSystem());
>>>>>>>     OutputStream out = new FileOutputStream(new File("odd-doc-txt-v11.xmi"));
>>>>>>>     try {
>>>>>>>       XMLSerializer xml11Serializer = new XMLSerializer(out);
>>>>>>>       xml11Serializer.setOutputProperty(OutputKeys.VERSION, "1.1");
>>>>>>>       xmiCasSerializer.serialize(jCas.getCas(), xml11Serializer.getContentHandler());
>>>>>>>     }
>>>>>>>     finally {
>>>>>>>       out.close();
>>>>>>>     }
>>>>>>>
>>>>>>> This succeeds and serializes this using xml 1.1.
>>>>>>>
>>>>>>> I also tried serializing some doc text which includes \u77987. That did not serialize correctly. I could see it in the code while tracing, up to some point down in the innards of some internal sax java code (com.sun.org.apache.xml.internal.serializer.AttributesImplSerialize), where it was "correct" in the Java string.
>>>>>>>
>>>>>>> When serialized (as UTF-8) it came out as a 4 byte string E79E 9837.
>>>>>>>
>>>>>>> This is 1110 0111 1001 1110 1001 1000 0011 0111, which in utf8 is a 3 byte encoding: 1110 xxxx 10xx xxxx 10xx xxxx
>>>>>>>
>>>>>>> of 0111 0111 1001 1000, which in hex is "7 7 9 8", so it looks fishy to me.
>>>>>>>
>>>>>>> But I think it's out of our hands - it's somewhere deep in the sax transform java code.
>>>>>>>
>>>>>>> I looked for a bug report and found some: https://bugs.openjdk.java.net/browse/JDK-8058175
>>>>>>>
>>>>>>> Bottom line is, I think, to clean out these characters early :-) .
>>>>>>>
>>>>>>> -Marshall
>>>>>>>
>>>>>>> On 9/20/2019 1:28 PM, Marshall Schor wrote:
>>>>>>>> here's an idea.
>>>>>>>>
>>>>>>>> If you have a string with the surrogate pair 𓂣 at position 10, and you have some Java code which is iterating through the string and getting the code-point at each character offset, then that code will produce:
>>>>>>>>
>>>>>>>> at position 10: the code-point 77987
>>>>>>>> at position 11: the code-point 56483
>>>>>>>>
>>>>>>>> Of course, it's a "bug" to iterate through a string of characters, assuming you have characters at each point, if you don't handle surrogate pairs.
>>>>>>>>
>>>>>>>> The 56483 is just the lower bits of the surrogate pair, added to xDC00 (see https://tools.ietf.org/html/rfc2781 ).
>>>>>>>>
>>>>>>>> I worry that even tools like the CVD or similar may not work properly, since they're not designed to handle surrogate pairs, I think, so I have no idea if they would work well enough for you.
>>>>>>>>
>>>>>>>> I'll poke around some more to see if I can enable the conversion for document strings.
>>>>>>>>
>>>>>>>> -Marshall
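For reference, a small self-contained Java illustration (not UIMA code) of the points above: per-char-offset iteration reports 77987 and then 56483 for the single supplementary character U+130A3, code-point-aware iteration counts it once, and a correct UTF-8 encoding of it is the 4-byte sequence F0 93 82 A3 rather than the E7 9E 98 37 seen above.

    import java.nio.charset.StandardCharsets;

    public class SurrogatePairDemo {
        public static void main(String[] args) {
            String s = "0123456789" + "\uD80C\uDCA3";    // U+130A3 occupies char indexes 10 and 11

            System.out.println(s.codePointAt(10));       // 77987 - the real code point
            System.out.println(s.codePointAt(11));       // 56483 - what a per-char-offset loop reports next
            System.out.println((int) s.charAt(10));      // 55308 - high surrogate
            System.out.println((int) s.charAt(11));      // 56483 - low surrogate (0xDC00 + low bits)

            System.out.println(s.codePointCount(0, s.length()));   // 11 code points, although length() is 12

            // UTF-8 encodes U+130A3 as four bytes: F0 93 82 A3
            for (byte b : "\uD80C\uDCA3".getBytes(StandardCharsets.UTF_8)) {
                System.out.printf("%02X ", b);
            }
            System.out.println();
        }
    }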
>>>>>>>> On 9/20/2019 11:09 AM, Mario Juric wrote:
>>>>>>>>> Thanks Marshall,
>>>>>>>>>
>>>>>>>>> Encoding the characters like you suggest should work just fine for us as long as we can serialize and deserialise the XMI, so that we can open the content in a tool like the CVD or similar. These characters are just noise from the original content that happen to remain in the CAS, but they are not visible in our final output because they are basically filtered out one way or the other by downstream components. They become a problem, though, when they make it more difficult for us to inspect the content.
>>>>>>>>>
>>>>>>>>> Regarding the feature name issue: might you have an idea why we are getting a different XMI output for the same character in our actual pipeline, where it results in "𓂣�"? I investigated the value in the debugger again, and like you are illustrating it is also just a single codepoint with the value 77987. We are simply not able to load this XMI because of this, but unfortunately I couldn't reproduce it in my small example.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Mario
>>>>>>>>>
>>>>>>>>>> On 19 Sep 2019, at 22:41, Marshall Schor <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> The odd-feature-text seems to work OK, but has some unusual properties, due to that unicode character.
>>>>>>>>>>
>>>>>>>>>> Here's what I see: the FeatureRecord "name" field is set to a single unicode character that must be encoded as 2 Java characters.
>>>>>>>>>>
>>>>>>>>>> When output, it shows up in the xmi as <noNamespace:FeatureRecord xmi:id="18" name="𓂣" value="1.0"/>, which seems correct. The name field only has 1 (extended) unicode character (taking 2 Java characters to represent), due to setting it with this code: String oddName = "\uD80C\uDCA3";
>>>>>>>>>>
>>>>>>>>>> When read in, the name field is assigned to a String; that string says it has a length of 2 (but that's because it takes 2 Java chars to represent this char). If you have the name string in a variable "n", and do System.out.println(n.codePointAt(0)), it shows (correctly) 77987. n.codePointCount(0, n.length()) is, as expected, 1.
>>>>>>>>>>
>>>>>>>>>> So, the string value serialization and deserialization seems to be "working".
>>>>>>>>>>
>>>>>>>>>> The other code - for the sofa (document) serialization - is throwing that error because, as currently designed, the serialization code checks for these kinds of characters and, if found, throws that exception. The checking code is in XMLUtils.checkForNonXmlCharacters.
>>>>>>>>>>
>>>>>>>>>> This is because it's highly likely that "fixing this" in the same way as the other would result in hard-to-diagnose future errors: the subject-of-analysis string is processed with begin/end offsets all over the place, under the assumption that the characters are not coded as surrogate pairs.
>>>>>>>>>>
>>>>>>>>>> We could change the code to output these like the name, as, e.g., 𓂣
>>>>>>>>>>
>>>>>>>>>> Would that help in your case, or do you imagine other kinds of things might break (due to begin/end offsets no longer being on character boundaries, for example)?
>>>>>>>>>>
>>>>>>>>>> -Marshall
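Following the "clean out these characters early" advice, a minimal sketch of an upstream cleanup step (not part of UIMA; the ranges follow the XML 1.0 Char production, and dropping rather than replacing is an assumption). Running it before the text is set on the CAS and before any annotations are created keeps begin/end offsets consistent. Note that supplementary characters such as U+130A3 are legal XML and pass through untouched, so this only addresses control characters like the \u0002 case.

    public final class XmlTextSanitizer {

        // Characters allowed by the XML 1.0 "Char" production
        static boolean isValidXml10(int cp) {
            return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
        }

        // Drop anything XML 1.0 cannot represent, e.g. the \u0002 from the OddDocumentText example
        public static String clean(String text) {
            StringBuilder sb = new StringBuilder(text.length());
            text.codePoints()
                .filter(XmlTextSanitizer::isValidXml10)
                .forEach(sb::appendCodePoint);
            return sb.toString();
        }
    }

Applied before cas.setDocumentText(...), this avoids the throw from XMLUtils.checkForNonXmlCharacters for such control characters.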
>>>>>>>>>> On 9/18/2019 11:41 AM, Mario Juric wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I investigated the XMI issue as promised, and these are my findings.
>>>>>>>>>>>
>>>>>>>>>>> It is related to special unicode characters that are not handled by XMI serialisation, and there seem to be two distinct categories of issues we have identified so far:
>>>>>>>>>>>
>>>>>>>>>>> 1) The document text of the CAS contains special unicode characters
>>>>>>>>>>> 2) Annotations with String features have values containing special unicode characters
>>>>>>>>>>>
>>>>>>>>>>> In both cases we could for sure solve the problem if we did a better clean-up job upstream, but with the amount and variety of data we receive there is always a chance something passes through, and some of it may in the general case even be valid content.
>>>>>>>>>>>
>>>>>>>>>>> The first case is easy to reproduce with the OddDocumentText example I attached. In this example the text is a snippet taken from the content of a parsed XML document.
>>>>>>>>>>>
>>>>>>>>>>> The other case was not possible to reproduce with the OddFeatureText example, because I am getting slightly different output to what I have in our real setup. The OddFeatureText example is based on the simple type system I shared previously. The name value of a FeatureRecord contains special unicode characters that I found in a similar data structure in our actual CAS. The value comes from an external knowledge base holding some noisy strings, which in this case is a hieroglyph entity. However, when I write the CAS to XMI using the small example it only outputs the first of the two characters in "\uD80C\uDCA3", which yields the value "𓂣" in the XMI, but in our actual setup both character values are written as "𓂣�". This means that the attached example will for some reason parse the XMI again, but it will not work in the case where both characters are written the way we experience it. The XMI can be manually changed so that both character values are included the way it happens in our output, and in this case a SAXParserException happens.
>>>>>>>>>>>
>>>>>>>>>>> I don't know whether it is outside the scope of the XMI serialiser to handle any of this, but it will be good to know in any case :)
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Mario
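To check whether a given XMI can still be read back for inspection (e.g. before opening it in the CVD), a small stand-alone read-back sketch like the one below may help. The descriptor and XMI file names are placeholders, and it assumes the CAS is created from the same type system the XMI was written with.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.uima.UIMAFramework;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.impl.XmiCasDeserializer;
    import org.apache.uima.resource.metadata.TypeSystemDescription;
    import org.apache.uima.util.CasCreationUtils;
    import org.apache.uima.util.XMLInputSource;

    public class XmiReadBack {
        public static void main(String[] args) throws Exception {
            // Placeholder file names - substitute your own descriptor and XMI
            TypeSystemDescription tsd = UIMAFramework.getXMLParser()
                .parseTypeSystemDescription(new XMLInputSource(new File("TypeSystem.xml")));
            CAS cas = CasCreationUtils.createCas(tsd, null, null);
            try (InputStream in = new FileInputStream("odd-doc-txt-v11.xmi")) {
                XmiCasDeserializer.deserialize(in, cas);   // throws SAXException on characters the parser rejects
            }
            System.out.println(cas.getDocumentText());
        }
    }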
>>>>>>>>>>>> On 17 Sep 2019, at 09:36, Mario Juric <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you very much for looking into this. It is really appreciated, and I think it touches upon something important, which is data migration in general.
>>>>>>>>>>>>
>>>>>>>>>>>> I agree that some of these solutions can appear specific, awkward or complex, and the way forward is not to address our use case alone. I think there is a need for a compact and efficient binary serialization format for the CAS when dealing with large amounts of data, because this is directly visible in the costs of processing and storing, and I found the compressed binary format to be much better than XMI in this regard, although I have to admit it's been a while since I benchmarked this. Given that UIMA already has a well described type system, maybe it just lacks a way to describe schema evolution similar to Apache Avro or similar serialisation frameworks. I think a more formal approach to data migration would be critical to any larger operational setup.
>>>>>>>>>>>>
>>>>>>>>>>>> Regarding XMI, I would like to provide some input on the problem we are observing, so that it can be solved. We are primarily using XMI for inspection/debugging purposes, and we are sometimes not able to do this because of this error. I will try to extract a minimal example to avoid involving parts that have to do with our pipeline and type system, and I think this would also be the best way to illustrate that the problem exists outside of this context. However, converting all our data to XMI first in order to do the conversion in our example would not be very practical for us, because it involves a large amount of data.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Mario
>>>>>>>>>>>>
>>>>>>>>>>>>> On 16 Sep 2019, at 23:02, Marshall Schor <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> In this case, the original looks kind-of like this:
>>>>>>>>>>>>>
>>>>>>>>>>>>>     Container
>>>>>>>>>>>>>       features -> FSArray of FeatureAnnotation, each of which
>>>>>>>>>>>>>                   has 5 slots: sofaRef, begin, end, name, value
>>>>>>>>>>>>>
>>>>>>>>>>>>> the new TypeSystem has
>>>>>>>>>>>>>
>>>>>>>>>>>>>     Container
>>>>>>>>>>>>>       features -> FSArray of FeatureRecord, each of which
>>>>>>>>>>>>>                   has 2 slots: name, value
>>>>>>>>>>>>>
>>>>>>>>>>>>> The deserializer code would need some way to decide how to
>>>>>>>>>>>>> 1) create an FSArray of FeatureRecord,
>>>>>>>>>>>>> 2) for each element, map the FeatureAnnotation to a new instance of FeatureRecord
>>>>>>>>>>>>>
>>>>>>>>>>>>> I guess I could imagine a default mapping (for item 2 above) of
>>>>>>>>>>>>> 1) change the type from A to B
>>>>>>>>>>>>> 2) set equal-named features from A to B, drop other features
>>>>>>>>>>>>>
>>>>>>>>>>>>> This mapping would need to apply to a subset of the A's and B's, namely, only those referenced by the FSArray where the element type changed. Seems complex and specific to this use case though.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Marshall
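Purely to illustrate the default mapping sketched above (copy equal-named features, drop the rest), here is a rough client-side version written against the plain CAS API. The type name and the primitive-only handling are assumptions; it is not how CasTypeSystemMapper works, just one shape the per-element mapping could take, e.g. as a one-off migration step over legacy data.

    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.Feature;
    import org.apache.uima.cas.FeatureStructure;
    import org.apache.uima.cas.Type;

    public class FeatureRecordMigrationSketch {

        // Rough illustration of "set equal-named features from A to B, drop other features".
        // Type/feature names are the ones from this thread.
        static FeatureStructure toFeatureRecord(CAS cas, FeatureStructure featureAnnotation) {
            Type recordType = cas.getTypeSystem().getType("FeatureRecord");
            FeatureStructure record = cas.createFS(recordType);
            for (Feature target : recordType.getFeatures()) {
                Feature source = featureAnnotation.getType().getFeatureByBaseName(target.getShortName());
                if (source != null && source.getRange().isPrimitive()) {
                    String value = featureAnnotation.getFeatureValueAsString(source);
                    if (value != null) {
                        record.setFeatureValueFromString(target, value);   // copies name and value
                    }
                }
            }
            return record;   // sofaRef, begin, end have no counterpart and are simply dropped
        }
    }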
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote:
>>>>>>>>>>>>>> On 16. Sep 2019, at 19:05, Marshall Schor <[email protected]> wrote:
>>>>>>>>>>>>>>> I can reproduce the problem, and see what is happening. The deserialization code compares the two type systems, and allows for some mismatches (things present in one and not in the other), but it doesn't allow for having a feature whose range (value) is type XXXX in one type system and type YYYY in the other. See CasTypeSystemMapper lines 299 - 315.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Without reading the code in detail - could we not relax this check such that the element type of FSArrays is not checked and the code simply assumes that the source element type has the same features as the target element type (with the usual lenient handling of missing features in the target type)? - Kind of a "duck typing" approach?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -- Richard
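For completeness, the "usual lenient handling" mentioned above is, as far as I understand it, the lenient mode of the XMI deserializer: types and features present in the XMI but unknown to the target type system are ignored instead of causing an error. It does not cover a changed FSArray element type, which is what the proposed relaxation would add. A minimal sketch, using the same imports as the read-back example earlier (the file name is a placeholder):

    static void loadLeniently(CAS cas, File xmiFile) throws Exception {
        try (InputStream in = new FileInputStream(xmiFile)) {
            XmiCasDeserializer.deserialize(in, cas, true);   // lenient = true: unknown types/features in the XMI are ignored
        }
    }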
