Thanks Marshall,

If you prefer, I can also have a look at it, although I first need to finish 
something else within the next 3-4 weeks. It would probably get me started 
faster if you could share some of your experimental sample code.

Cheers,
Mario

> On 24 Sep 2019, at 21:32 , Marshall Schor <m...@schor.com> wrote:
> 
> Yes, makes sense, thanks for posting the Jira.
> 
> If no one else steps up to work on this, I'll probably take a look in a few
> days. -Marshall
> 
> On 9/24/2019 6:47 AM, Mario Juric wrote:
>> Hi Marshall,
>> 
>> I added the following feature request to Apache Jira:
>> 
>> https://issues.apache.org/jira/browse/UIMA-6128
>> 
>> Hope it makes sense :)
>> 
>> Thanks a lot for the help, it’s appreciated.
>> 
>> Cheers,
>> Mario
>> 
>>> On 23 Sep 2019, at 16:33 , Marshall Schor <m...@schor.com> wrote:
>>> 
>>> Re: serializing using XML 1.1
>>> 
>>> This was not thought of when setting up CasIOUtils.
>>> 
>>> The way it was done (above) used some more "primitive", lower-level APIs
>>> rather than CasIOUtils.
>>> 
>>> Please open a Jira ticket for this, with perhaps some suggestions on how it
>>> might be specified in the CasIOUtils APIs.
>>> 
>>> Thanks! -Marshall
>>> 
>>> On 9/23/2019 3:45 AM, Mario Juric wrote:
>>>> Hi Marshall,
>>>> 
>>>> Thanks for the thorough and excellent investigation.
>>>> 
>>>> We are looking into possible normalisation/cleanup of whitespace/invisible 
>>>> characters, but I don't think we can necessarily do the same for some of 
>>>> the other characters. It sounds, though, as if serialising to XML 1.1 
>>>> could be a simple fix right now, but can this be configured? CasIOUtils 
>>>> doesn't seem to have an option for this, so I assume it's something you 
>>>> have working in your branch.
>>>> 
>>>> Regarding the other problem: it seems the JDK bug is fixed in Java 9 and 
>>>> later. Do you think switching to a more recent Java version would make a 
>>>> difference? We can also try this out ourselves when we look into 
>>>> migrating to UIMA 3 once our current deliveries are complete. We would 
>>>> also like to switch to Java 11, and like the UIMA 3 migration it will 
>>>> require some thorough testing.
>>>> 
>>>> Cheers,
>>>> Mario
>>>> 
>>>>> On 20 Sep 2019, at 20:52 , Marshall Schor <m...@schor.com> wrote:
>>>>> 
>>>>> In the test "OddDocumentText", this produces a "throw" due to an invalid
>>>>> XML character, \u0002.
>>>>> 
>>>>> This is in part because the XML version being used is XML 1.0.
>>>>> 
>>>>> XML 1.1 expanded the set of valid characters to include \u0002.
>>>>> 
>>>>> Here's a snip from the XmiCasSerializerTest class which serializes with
>>>>> XML 1.1:
>>>>> 
>>>>>       XmiCasSerializer xmiCasSerializer = new XmiCasSerializer(jCas.getTypeSystem());
>>>>>       OutputStream out = new FileOutputStream(new File("odd-doc-txt-v11.xmi"));
>>>>>       try {
>>>>>         XMLSerializer xml11Serializer = new XMLSerializer(out);
>>>>>         xml11Serializer.setOutputProperty(OutputKeys.VERSION, "1.1");
>>>>>         xmiCasSerializer.serialize(jCas.getCas(), xml11Serializer.getContentHandler());
>>>>>       } finally {
>>>>>         out.close();
>>>>>       }
>>>>> 
>>>>> This succeeds, serializing the CAS using XML 1.1.
>>>>> 
>>>>> I also tried serializing some document text which includes the code 
>>>>> point 77987 (U+130A3). That did not serialize correctly.
>>>>> Tracing down into the innards of the internal SAX serializer code
>>>>> (com.sun.org.apache.xml.internal.serializer.AttributesImplSerialize), the
>>>>> value was still "correct" in the Java string.
>>>>> 
>>>>> When serialized (as UTF-8) it came out as the 4-byte sequence E7 9E 98 37.
>>>>> 
>>>>> That is 1110 0111 1001 1110 1001 1000 0011 0111. The first three bytes 
>>>>> form a UTF-8 3-byte encoding (1110 xxxx 10xx xxxx 10xx xxxx) of
>>>>> 0111 0111 1001 1000, which in hex is 0x7798, and the trailing 0x37 is an
>>>>> ASCII "7", so it looks fishy to me: the decimal digits of 77987 appear to
>>>>> be leaking through.
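>>>>> 
>>>>> As a sanity check, here's what the bytes should be (a minimal sketch in
>>>>> plain Java, nothing UIMA-specific; the correct UTF-8 encoding of U+130A3
>>>>> is F0 93 82 A3):
>>>>> 
>>>>>       import java.nio.charset.StandardCharsets;
>>>>> 
>>>>>       public class Utf8Check {
>>>>>         public static void main(String[] args) {
>>>>>           // surrogate pair for code point U+130A3 (decimal 77987)
>>>>>           String s = "\uD80C\uDCA3";
>>>>>           for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
>>>>>             System.out.printf("%02X ", b & 0xFF);  // prints: F0 93 82 A3
>>>>>           }
>>>>>           System.out.println();
>>>>>         }
>>>>>       }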
>>>>> 
>>>>> But I think it's out of our hands; it's somewhere deep in the SAX
>>>>> transform Java code.
>>>>> 
>>>>> I looked for a bug report and found one:
>>>>> https://bugs.openjdk.java.net/browse/JDK-8058175
>>>>> 
>>>>> Bottom line is, I think, to clean these characters out early :-).
>>>>> 
>>>>> -Marshall
>>>>> 
>>>>> 
>>>>> On 9/20/2019 1:28 PM, Marshall Schor wrote:
>>>>>> Here's an idea.
>>>>>> 
>>>>>> If you have a string with the surrogate pair &#77987; at position 10, 
>>>>>> and some Java code that iterates through the string getting the code 
>>>>>> point at each character offset, then that code will produce:
>>>>>> 
>>>>>> at position 10:  the code point 77987
>>>>>> at position 11:  the code point 56483
>>>>>> 
>>>>>> Of course, it's a "bug" to iterate through a string assuming there is 
>>>>>> one code point per character offset if you don't handle surrogate pairs.
>>>>>> 
>>>>>> The 56483 is just the lower bits of the surrogate pair added to 0xDC00
>>>>>> (see https://tools.ietf.org/html/rfc2781).
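>>>>>> 
>>>>>> A minimal sketch of the two iteration styles (plain Java, just to
>>>>>> illustrate; the variable names are made up):
>>>>>> 
>>>>>>       String s = "ab\uD80C\uDCA3cd";  // surrogate pair at offsets 2-3
>>>>>> 
>>>>>>       // buggy: assumes one code point per char offset, so at the offset
>>>>>>       // of the low surrogate it reports the lone value 56483
>>>>>>       for (int i = 0; i < s.length(); i++) {
>>>>>>         System.out.println(i + ": " + s.codePointAt(i));
>>>>>>       }
>>>>>> 
>>>>>>       // correct: advance by Character.charCount(codePoint)
>>>>>>       for (int i = 0; i < s.length(); ) {
>>>>>>         int cp = s.codePointAt(i);
>>>>>>         System.out.println(i + ": " + cp);
>>>>>>         i += Character.charCount(cp);
>>>>>>       }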
>>>>>> 
>>>>>> I worry that even tools like the CVD may not work properly, since I 
>>>>>> don't think they were designed to handle surrogate pairs, so I have no 
>>>>>> idea whether they would work well enough for you.
>>>>>> 
>>>>>> I'll poke around some more to see if I can enable the conversion for 
>>>>>> document
>>>>>> strings.
>>>>>> 
>>>>>> -Marshall
>>>>>> 
>>>>>> On 9/20/2019 11:09 AM, Mario Juric wrote:
>>>>>>> Thanks Marshall,
>>>>>>> 
>>>>>>> Encoding the characters like you suggest should work just fine for us, 
>>>>>>> as long as we can serialise and deserialise the XMI so that we can 
>>>>>>> open the content in a tool like the CVD. These characters are just 
>>>>>>> noise from the original content that happens to remain in the CAS; 
>>>>>>> they are not visible in our final output because they are filtered out 
>>>>>>> one way or another by downstream components. They become a problem, 
>>>>>>> though, when they make it harder for us to inspect the content.
>>>>>>> 
>>>>>>> Regarding the feature name issue: do you have an idea why we get 
>>>>>>> different XMI output for the same character in our actual pipeline, 
>>>>>>> where it comes out as "&#77987;&#56483;"? I inspected the value in the 
>>>>>>> debugger again, and as you illustrate it is also just a single code 
>>>>>>> point with the value 77987. We are simply not able to load this XMI 
>>>>>>> because of it, but unfortunately I couldn't reproduce the behaviour in 
>>>>>>> my small example.
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> Mario
>>>>>>> 
>>>>>>>> On 19 Sep 2019, at 22:41 , Marshall Schor <m...@schor.com> wrote:
>>>>>>>> 
>>>>>>>> The odd-feature-text case seems to work OK, but it has some unusual 
>>>>>>>> properties due to that Unicode character.
>>>>>>>> 
>>>>>>>> Here's what I see: the FeatureRecord "name" field is set to a single
>>>>>>>> Unicode character that must be encoded as 2 Java characters.
>>>>>>>> 
>>>>>>>> When output, it shows up in the XMI as <noNamespace:FeatureRecord 
>>>>>>>> xmi:id="18" name="&#77987;" value="1.0"/>, which seems correct. The 
>>>>>>>> name field holds just 1 (extended) Unicode character (taking 2 Java 
>>>>>>>> characters to represent), because it was set with this code:
>>>>>>>> String oddName = "\uD80C\uDCA3";
>>>>>>>> 
>>>>>>>> When read back in, the name field is assigned to a String whose 
>>>>>>>> length() is 2 (because it takes 2 Java chars to represent this 
>>>>>>>> character). If you have the name string in a variable "n" and call
>>>>>>>> System.out.println(n.codePointAt(0)), it (correctly) prints 77987, and
>>>>>>>> n.codePointCount(0, n.length()) is, as expected, 1.
>>>>>>>> 
>>>>>>>> So, the string value serialization and deserialization seem to be 
>>>>>>>> "working".
>>>>>>>> 
>>>>>>>> The other code, for the sofa (document) serialization, throws that 
>>>>>>>> error because, as currently designed, the serialization code checks 
>>>>>>>> for these kinds of characters and, if it finds any, throws that 
>>>>>>>> exception. The check is in XMLUtils.checkForNonXmlCharacters.
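>>>>>>>> 
>>>>>>>> The check amounts to something like this (a rough sketch of the XML 
>>>>>>>> 1.0 Char production, NOT the actual XMLUtils implementation):
>>>>>>>> 
>>>>>>>>       // returns the offset of the first character that is not legal
>>>>>>>>       // in XML 1.0, or -1 if the whole string is fine
>>>>>>>>       static int firstNonXmlChar(String s) {
>>>>>>>>         for (int i = 0; i < s.length(); ) {
>>>>>>>>           int cp = s.codePointAt(i);
>>>>>>>>           boolean ok = cp == 0x9 || cp == 0xA || cp == 0xD
>>>>>>>>               || (cp >= 0x20 && cp <= 0xD7FF)
>>>>>>>>               || (cp >= 0xE000 && cp <= 0xFFFD)
>>>>>>>>               || cp >= 0x10000;  // supplementary code points are legal
>>>>>>>>           if (!ok) return i;     // lone surrogates also land here
>>>>>>>>           i += Character.charCount(cp);
>>>>>>>>         }
>>>>>>>>         return -1;
>>>>>>>>       }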
>>>>>>>> 
>>>>>>>> This is because "fixing" it in the same way as the other case would 
>>>>>>>> very likely result in hard-to-diagnose future errors: the 
>>>>>>>> subject-of-analysis string is processed with begin/end offsets all 
>>>>>>>> over the place, under the assumption that no characters are coded as 
>>>>>>>> surrogate pairs.
>>>>>>>> 
>>>>>>>> We could change the code to output these the way the name is output, 
>>>>>>>> e.g. as &#77987;.
>>>>>>>> 
>>>>>>>> Would that help in your case, or do you imagine other kinds of things 
>>>>>>>> might break (due to begin/end offsets no longer being on character 
>>>>>>>> boundaries, for example)?
>>>>>>>> 
>>>>>>>> -Marshall
>>>>>>>> 
>>>>>>>> On 9/18/2019 11:41 AM, Mario Juric wrote:
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> I investigated the XMI issue as promised and these are my findings.
>>>>>>>>> 
>>>>>>>>> It is related to special Unicode characters that are not handled by 
>>>>>>>>> XMI serialisation, and there seem to be two distinct categories of 
>>>>>>>>> issues identified so far:
>>>>>>>>> 
>>>>>>>>> 1) The document text of the CAS contains special Unicode characters.
>>>>>>>>> 2) Annotations with String features have values containing special 
>>>>>>>>> Unicode characters.
>>>>>>>>> 
>>>>>>>>> In both cases we could certainly solve the problem with a better 
>>>>>>>>> clean-up job upstream, but with the amount and variety of data we 
>>>>>>>>> receive there is always a chance something slips through, and some 
>>>>>>>>> of it may even be valid content in the general case. A scrubbing 
>>>>>>>>> step like the sketch below is what we have in mind.
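>>>>>>>>> 
>>>>>>>>> Just a sketch of what we have in mind (replacing with a space is our
>>>>>>>>> own choice; every code point that is illegal in XML 1.0 occupies a
>>>>>>>>> single Java char, so the begin/end offsets stay stable):
>>>>>>>>> 
>>>>>>>>>       // scrub characters outside the XML 1.0 Char ranges before
>>>>>>>>>       // calling cas.setDocumentText(...)
>>>>>>>>>       static String scrubForXml(String text) {
>>>>>>>>>         StringBuilder sb = new StringBuilder(text.length());
>>>>>>>>>         for (int i = 0; i < text.length(); ) {
>>>>>>>>>           int cp = text.codePointAt(i);
>>>>>>>>>           boolean ok = cp == 0x9 || cp == 0xA || cp == 0xD
>>>>>>>>>               || (cp >= 0x20 && cp <= 0xD7FF)
>>>>>>>>>               || (cp >= 0xE000 && cp <= 0xFFFD)
>>>>>>>>>               || cp >= 0x10000;
>>>>>>>>>           sb.appendCodePoint(ok ? cp : ' ');
>>>>>>>>>           i += Character.charCount(cp);
>>>>>>>>>         }
>>>>>>>>>         return sb.toString();
>>>>>>>>>       }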
>>>>>>>>> 
>>>>>>>>> The first case is easy to reproduce with the OddDocumentText example I
>>>>>>>>> attached. In this example the text is a snippet taken from the 
>>>>>>>>> content of a
>>>>>>>>> parsed XML document.
>>>>>>>>> 
>>>>>>>>> I could not reproduce the other case with the OddFeatureText example, 
>>>>>>>>> because I get slightly different output from what I see in our real 
>>>>>>>>> setup. The OddFeatureText example is based on the simple type system 
>>>>>>>>> I shared previously. The name value of a FeatureRecord contains 
>>>>>>>>> special Unicode characters that I found in a similar data structure 
>>>>>>>>> in our actual CAS. The value comes from an external knowledge base 
>>>>>>>>> holding some noisy strings, in this case a hieroglyph entity. 
>>>>>>>>> However, when I write the CAS to XMI using the small example it 
>>>>>>>>> outputs the single reference "&#77987;" for the two Java characters 
>>>>>>>>> in "\uD80C\uDCA3", but in our actual setup both UTF-16 values are 
>>>>>>>>> written: "&#77987;&#56483;". This means the attached example can for 
>>>>>>>>> some reason parse its XMI again, but parsing does not work where 
>>>>>>>>> both characters are written the way we experience it. The XMI can be 
>>>>>>>>> changed manually so that both character values are included the way 
>>>>>>>>> they appear in our output, and then a SAXParseException occurs.
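>>>>>>>>> 
>>>>>>>>> The failure is easy to demonstrate outside UIMA as well (a minimal
>>>>>>>>> sketch with plain JAXP; &#56483; is 0xDCA3, a lone low surrogate,
>>>>>>>>> which no XML version allows):
>>>>>>>>> 
>>>>>>>>>       import java.io.ByteArrayInputStream;
>>>>>>>>>       import java.nio.charset.StandardCharsets;
>>>>>>>>>       import javax.xml.parsers.DocumentBuilderFactory;
>>>>>>>>> 
>>>>>>>>>       public class LoneSurrogateDemo {
>>>>>>>>>         public static void main(String[] args) throws Exception {
>>>>>>>>>           // &#77987; (U+130A3) is fine; &#56483; makes the parser
>>>>>>>>>           // throw a SAXParseException
>>>>>>>>>           String xml =
>>>>>>>>>               "<?xml version=\"1.0\"?><doc name=\"&#77987;&#56483;\"/>";
>>>>>>>>>           DocumentBuilderFactory.newInstance().newDocumentBuilder()
>>>>>>>>>               .parse(new ByteArrayInputStream(
>>>>>>>>>                   xml.getBytes(StandardCharsets.UTF_8)));
>>>>>>>>>         }
>>>>>>>>>       }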
>>>>>>>>> 
>>>>>>>>> I don't know whether handling any of this is outside the scope of 
>>>>>>>>> the XMI serialiser, but it would be good to know in any case :)
>>>>>>>>> 
>>>>>>>>> Cheers,
>>>>>>>>> Mario
>>>>>>>>> 
>>>>>>>>>> On 17 Sep 2019, at 09:36 , Mario Juric <m...@unsilo.ai> wrote:
>>>>>>>>>> 
>>>>>>>>>> Thank you very much for looking into this. It is really 
>>>>>>>>>> appreciated, and I think it touches on something important, namely 
>>>>>>>>>> data migration in general.
>>>>>>>>>> 
>>>>>>>>>> I agree that some of these solutions can appear specific, awkward 
>>>>>>>>>> or complex, and the way forward is not to address our use case 
>>>>>>>>>> alone. I think there is a need for a compact and efficient binary 
>>>>>>>>>> serialization format for the CAS when dealing with large amounts of 
>>>>>>>>>> data, because this is directly visible in the costs of processing 
>>>>>>>>>> and storage, and I found the compressed binary format to be much 
>>>>>>>>>> better than XMI in this regard, although I admit it's been a while 
>>>>>>>>>> since I benchmarked this. Given that UIMA already has a 
>>>>>>>>>> well-described type system, maybe it just lacks a way to describe 
>>>>>>>>>> schema evolution similar to Apache Avro or other serialisation 
>>>>>>>>>> frameworks. I think a more formal approach to data migration would 
>>>>>>>>>> be critical to any larger operational setup.
>>>>>>>>>> 
>>>>>>>>>> Regarding XMI, I would like to provide some input on the problem we 
>>>>>>>>>> are observing so that it can be solved. We primarily use XMI for 
>>>>>>>>>> inspection/debugging purposes, and sometimes this error prevents us 
>>>>>>>>>> from doing that. I will try to extract a minimal example that 
>>>>>>>>>> avoids the parts that have to do with our pipeline and type system; 
>>>>>>>>>> I think this would also be the best way to show that the problem 
>>>>>>>>>> exists outside that context. However, converting all our data to 
>>>>>>>>>> XMI first in order to do the conversion in our example would not be 
>>>>>>>>>> very practical, because it involves a large amount of data.
>>>>>>>>>> 
>>>>>>>>>> Cheers,
>>>>>>>>>> Mario
>>>>>>>>>> 
>>>>>>>>>>> On 16 Sep 2019, at 23:02 , Marshall Schor <m...@schor.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> In this case, the original looks kind of like this:
>>>>>>>>>>> 
>>>>>>>>>>> Container
>>>>>>>>>>>   features -> FSArray of FeatureAnnotation, each of which
>>>>>>>>>>>               has 5 slots: sofaRef, begin, end, name, value
>>>>>>>>>>> 
>>>>>>>>>>> the new TypeSystem has
>>>>>>>>>>> 
>>>>>>>>>>> Container
>>>>>>>>>>>   features -> FSArray of FeatureRecord, each of which
>>>>>>>>>>>               has 2 slots: name, value
>>>>>>>>>>> 
>>>>>>>>>>> The deserializer code would need some way to decide how to
>>>>>>>>>>> 1) create an FSArray of FeatureRecord,
>>>>>>>>>>> 2) for each element,
>>>>>>>>>>>    map the FeatureAnnotation to a new instance of FeatureRecord
>>>>>>>>>>> 
>>>>>>>>>>> I guess I could imagine a default mapping (for item 2 above) of
>>>>>>>>>>> 1) change the type from A to B
>>>>>>>>>>> 2) set equal-named features from A to B, drop other features
>>>>>>>>>>> 
>>>>>>>>>>> This mapping would need to apply to a subset of the A's and B's, 
>>>>>>>>>>> namely, only
>>>>>>>>>>> those referenced by the FSArray where the element type changed.  
>>>>>>>>>>> Seems complex
>>>>>>>>>>> and specific to this use case though.
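>>>>>>>>>>> 
>>>>>>>>>>> Very roughly, the per-element mapping might look like this (an
>>>>>>>>>>> untested sketch against the public CAS API, using
>>>>>>>>>>> org.apache.uima.cas.{CAS, Feature, FeatureStructure, Type}; the
>>>>>>>>>>> method name and the range check are illustrative only):
>>>>>>>>>>> 
>>>>>>>>>>>       // map one FS of type A to a new FS of type B: copy the
>>>>>>>>>>>       // equal-named features whose ranges match, drop the rest
>>>>>>>>>>>       static FeatureStructure mapFs(CAS cas, FeatureStructure a, Type typeB) {
>>>>>>>>>>>         FeatureStructure b = cas.createFS(typeB);
>>>>>>>>>>>         for (Feature fa : a.getType().getFeatures()) {
>>>>>>>>>>>           Feature fb = typeB.getFeatureByBaseName(fa.getShortName());
>>>>>>>>>>>           if (fb == null) continue;  // feature dropped in the new type
>>>>>>>>>>>           if (!fa.getRange().getName().equals(fb.getRange().getName())) continue;
>>>>>>>>>>>           if (fa.getRange().isPrimitive()) {
>>>>>>>>>>>             b.setFeatureValueFromString(fb, a.getFeatureValueAsString(fa));
>>>>>>>>>>>           } else {
>>>>>>>>>>>             b.setFeatureValue(fb, a.getFeatureValue(fa));
>>>>>>>>>>>           }
>>>>>>>>>>>         }
>>>>>>>>>>>         return b;
>>>>>>>>>>>       }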
>>>>>>>>>>> 
>>>>>>>>>>> -Marshall
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote:
>>>>>>>>>>>> On 16. Sep 2019, at 19:05, Marshall Schor <m...@schor.com> wrote:
>>>>>>>>>>>>> I can reproduce the problem, and see what is happening.  The
>>>>>>>>>>>>> deserialization code compares the two type systems, and allows
>>>>>>>>>>>>> for some mismatches (things present in one and not in the other),
>>>>>>>>>>>>> but it doesn't allow for having a feature whose range (value) is
>>>>>>>>>>>>> type XXXX in one type system and type YYYY in the other.
>>>>>>>>>>>>> See CasTypeSystemMapper lines 299-315.
>>>>>>>>>>>> Without reading the code in detail: could we not relax this check
>>>>>>>>>>>> so that the element type of FSArrays is not checked, and the code
>>>>>>>>>>>> simply assumes that the source element type has the same features
>>>>>>>>>>>> as the target element type (with the usual lenient handling of
>>>>>>>>>>>> missing features in the target type)? Kind of a "duck typing"
>>>>>>>>>>>> approach?
>>>>>>>>>>>> 
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> 
>>>>>>>>>>>> -- Richard