Hi Marshall,

Thanks for the thorough and excellent investigation.

We are looking into possible normalisation/cleanup of whitespace and invisible 
characters, but I don't think we can necessarily do the same for some of the 
other characters. It sounds, though, as if serialising to XML 1.1 could be a 
simple fix right now - but can this be configured? CasIOUtils doesn't seem to 
have an option for it, so I assume it's something you have working in your 
branch.
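
For reference, here is roughly what I imagine we could wrap on our side in the 
meantime - a rough sketch based on your snippet below, where writeXmi11 is just 
a made-up name:

        import java.io.OutputStream;
        import javax.xml.transform.OutputKeys;
        import org.apache.uima.cas.CAS;
        import org.apache.uima.cas.impl.XmiCasSerializer;
        import org.apache.uima.util.XMLSerializer;

        // Sketch: serialize a CAS as XMI with the XML version forced to 1.1,
        // so that control characters like \u0002 become legal in the output.
        static void writeXmi11(CAS cas, OutputStream out) throws Exception {
          XmiCasSerializer xmiSer = new XmiCasSerializer(cas.getTypeSystem());
          XMLSerializer sax = new XMLSerializer(out);
          sax.setOutputProperty(OutputKeys.VERSION, "1.1");
          xmiSer.serialize(cas, sax.getContentHandler());
        }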

Regarding the other problem: it seems that the JDK bug is fixed in Java 9 and 
later. Do you think switching to a more recent Java version would make a 
difference? We can also try this out ourselves when we look into migrating to 
UIMA 3 once our current deliveries are complete. We would also like to switch 
to Java 11, and like the UIMA 3 migration it will require some thorough 
testing.

Cheers,
Mario

> On 20 Sep 2019, at 20:52 , Marshall Schor <[email protected]> wrote:
> 
> In the test "OddDocumentText", this produces a "throw" due to an invalid xml
> char, which is the \u0002.
> 
> This is in part because the xml version being used is xml 1.0.
> 
> XML 1.1 expanded the set of valid characters to include \u0002.
> 
> Here's a snip from the XmiCasSerializerTest class which serializes with xml 
> 1.1:
> 
>         // build the serializer, then force the XML version to 1.1
>         XmiCasSerializer xmiCasSerializer =
>             new XmiCasSerializer(jCas.getTypeSystem());
>         OutputStream out = new FileOutputStream(new File("odd-doc-txt-v11.xmi"));
>         try {
>           XMLSerializer xml11Serializer = new XMLSerializer(out);
>           xml11Serializer.setOutputProperty(OutputKeys.VERSION, "1.1");
>           xmiCasSerializer.serialize(jCas.getCas(),
>               xml11Serializer.getContentHandler());
>         } finally {
>           out.close();
>         }
> 
> This succeeds and serializes the CAS using xml 1.1.
> 
> I also tried serializing some doc text which includes \u77987.  That did not
> serialize correctly. While tracing, I could see the value was still "correct"
> in the Java string all the way down into the innards of the internal sax java
> code (com.sun.org.apache.xml.internal.serializer.AttributesImplSerialize).
> 
> When serialized (as UTF-8) it came out as the 4 bytes E7 9E 98 37.
> 
> The first three bytes are 1110 0111  1001 1110  1001 1000, which is a 3-byte
> utf-8 encoding (1110 xxxx  10xx xxxx  10xx xxxx) of 0111 0111 1001 1000, i.e.
> hex 7798; the trailing byte 0x37 is the ASCII digit "7". So it looks fishy to
> me.
> 
> But I think it's out of our hands - it's somewhere deep in the sax transform
> java code.
> 
> I looked for a bug report and found some:
> https://bugs.openjdk.java.net/browse/JDK-8058175
> 
> Bottom line is, I think, to clean out these characters early :-) .
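> 
> For example, something along these lines - just a sketch, using the XML 1.0
> Char ranges, and the replacement policy (a space) is of course a choice:
> 
>         // Replace code points that are invalid in XML 1.0 before the text
>         // ever reaches the CAS; iterates by code point, not by Java char.
>         static String stripNonXml10(String s) {
>           StringBuilder sb = new StringBuilder(s.length());
>           for (int i = 0; i < s.length(); ) {
>             int cp = s.codePointAt(i);
>             boolean ok = cp == 0x9 || cp == 0xA || cp == 0xD
>                 || (cp >= 0x20 && cp <= 0xD7FF)
>                 || (cp >= 0xE000 && cp <= 0xFFFD)
>                 || (cp >= 0x10000 && cp <= 0x10FFFF);
>             sb.appendCodePoint(ok ? cp : ' ');
>             i += Character.charCount(cp);
>           }
>           return sb.toString();
>         }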
> 
> -Marshall
> 
> 
> On 9/20/2019 1:28 PM, Marshall Schor wrote:
>> here's an idea.
>> 
>> If you have a string with the surrogate pair &#77987; at position 10, and you
>> have some Java code which iterates through the string getting the code-point
>> at each character offset, then that code will produce:
>> 
>> at position 10:  the code-point 77987
>> at position 11:  the code-point 56483
>> 
>> Of course, it's a "bug" to iterate through a string one char at a time,
>> assuming a whole character at each position, if you don't handle surrogate
>> pairs.
>> 
>> The 56483 is just the lower bits of the surrogate pair, added to 0xDC00 (see
>> https://tools.ietf.org/html/rfc2781 )
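>> 
>> To make that concrete, here's a little throw-away snippet (positions chosen
>> to match the example above):
>> 
>>         String s = "0123456789" + "\uD80C\uDCA3"; // pair starts at position 10
>> 
>>         // broken: assumes one whole character per Java char offset
>>         for (int i = 10; i < s.length(); i++) {
>>           System.out.println("at position " + i + ":  " + s.codePointAt(i));
>>         } // prints 77987 at position 10, then 56483 at position 11
>> 
>>         // surrogate-aware: advance by Character.charCount(code-point)
>>         for (int i = 10; i < s.length(); ) {
>>           int cp = s.codePointAt(i);
>>           System.out.println("at position " + i + ":  " + cp);
>>           i += Character.charCount(cp);
>>         } // prints only 77987 at position 10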
>> 
>> I worry that even tools like the CVD or similar may not work properly, since
>> they're not designed to handle surrogate pairs, I think, so I have no idea if
>> they would work well enough for you.
>> 
>> I'll poke around some more to see if I can enable the conversion for document
>> strings.
>> 
>> -Marshall
>> 
>> On 9/20/2019 11:09 AM, Mario Juric wrote:
>>> Thanks Marshall,
>>> 
>>> Encoding the characters like you suggest should work just fine for us as 
>>> long as we can serialise and deserialise the XMI, so that we can open the 
>>> content in a tool like the CVD or similar. These characters are just noise 
>>> from the original content that happen to remain in the CAS, but they are 
>>> not visible in our final output because they are basically filtered out one 
>>> way or the other by downstream components. They become a problem though 
>>> when they make it more difficult for us to inspect the content.
>>> 
>>> Regarding the feature name issue: Might you have an idea why we are getting 
>>> a different XMI output for the same character in our actual pipeline, where 
>>> it results in "&#77987;&#56483;"? I investigated the value in the debugger 
>>> again, and as you illustrate it is also just a single codepoint with the 
>>> value 77987. We are simply not able to load this XMI because of this, but 
>>> unfortunately I couldn't reproduce it in my small example.
>>> 
>>> Cheers,
>>> Mario
>>> 
>>>> On 19 Sep 2019, at 22:41 , Marshall Schor <[email protected]> wrote:
>>>> 
>>>> The odd-feature-text seems to work OK, but has some unusual properties due
>>>> to that unicode character.
>>>> 
>>>> Here's what I see: the FeatureRecord "name" field is set to a single
>>>> unicode character that must be encoded as 2 Java characters.
>>>> 
>>>> When output, it shows up in the xmi as
>>>>   <noNamespace:FeatureRecord xmi:id="18" name="&#77987;" value="1.0"/>
>>>> which seems correct. The name field holds only 1 (extended) unicode
>>>> character (taking 2 Java characters to represent), due to setting it with
>>>> this code: String oddName = "\uD80C\uDCA3";
>>>> 
>>>> When read in, the name field is assigned to a String; that string says it
>>>> has a length of 2 (but that's because it takes 2 Java chars to represent
>>>> this character). If you have the name string in a variable "n" and do
>>>> System.out.println(n.codePointAt(0)), it shows (correctly) 77987, and
>>>> n.codePointCount(0, n.length()) is, as expected, 1.
>>>> 
>>>> So, the string value serialization and deserialization seem to be "working".
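>>>> 
>>>> (Gathering those checks into one runnable snippet:)
>>>> 
>>>>         String n = "\uD80C\uDCA3"; // what name="&#77987;" deserializes to
>>>>         System.out.println(n.length());                      // 2 Java chars
>>>>         System.out.println(n.codePointAt(0));                // 77987
>>>>         System.out.println(n.codePointCount(0, n.length())); // 1 character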
>>>> 
>>>> The other code - for the sofa (document) serialization - is throwing that
>>>> error because, as currently designed, the serialization code checks for
>>>> these kinds of characters and throws that exception if it finds any. The
>>>> checking code is in XMLUtils.checkForNonXmlCharacters.
>>>> 
>>>> This is because "fixing" it in the same way as the other would very likely
>>>> result in hard-to-diagnose future errors: the subject-of-analysis string is
>>>> processed with begin/end offsets all over the place, and that code assumes
>>>> no characters are coded as surrogate pairs.
>>>> 
>>>> We could change the code to output these like the name, e.g. as &#77987;.
>>>> 
>>>> Would that help in your case, or do you imagine other kinds of things might
>>>> break (due to begin/end offsets no longer being on character boundaries,
>>>> for example)?
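>>>> 
>>>> For example, this is the kind of off-by-one I'd expect once a surrogate
>>>> pair appears before an annotation (a rough illustration):
>>>> 
>>>>         String text = "ab\uD80C\uDCA3cd"; // 5 code points, 6 Java chars
>>>>         System.out.println(text.length());                         // 6
>>>>         System.out.println(text.codePointCount(0, text.length())); // 5
>>>>         // an annotation over "cd" has begin=4, end=6 counted in Java
>>>>         // chars; anything counting whole (extended) characters instead
>>>>         // would expect the range 3..5
>>>>         System.out.println(text.substring(4, 6));                  // "cd"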
>>>> 
>>>> -Marshall
>>>> 
>>>> On 9/18/2019 11:41 AM, Mario Juric wrote:
>>>>> Hi,
>>>>> 
>>>>> I investigated the XMI issue as promised and these are my findings.
>>>>> 
>>>>> It is related to special unicode characters that are not handled by XMI
>>>>> serialisation, and there seem to be two distinct categories of issues we
>>>>> have identified so far:
>>>>> 
>>>>> 1) The document text of the CAS contains special unicode characters
>>>>> 2) Annotations with String features have values containing special
>>>>>    unicode characters
>>>>> 
>>>>> In both cases we could for sure solve the problem if we did a better 
>>>>> clean up
>>>>> job upstream, but with the amount and variety of data we receive there is
>>>>> always a chance something passes through, and some of it may in the 
>>>>> general
>>>>> case even be valid content.
>>>>> 
>>>>> The first case is easy to reproduce with the OddDocumentText example I
>>>>> attached. In this example the text is a snippet taken from the content of 
>>>>> a
>>>>> parsed XML document.
>>>>> 
>>>>> I could not reproduce the other case with the OddFeatureText example,
>>>>> because I get slightly different output from what we see in our real
>>>>> setup. The OddFeatureText example is based on the simple type system I
>>>>> shared previously. The name value of a FeatureRecord contains special
>>>>> unicode characters that I found in a similar data structure in our actual
>>>>> CAS. The value comes from an external knowledge base holding some noisy
>>>>> strings, which in this case is a hieroglyph entity. However, when I write
>>>>> the CAS to XMI using the small example, it only outputs the first of the
>>>>> two characters in "\uD80C\uDCA3", which yields the value "&#77987;" in
>>>>> the XMI; in our actual setup both character values are written, as
>>>>> "&#77987;&#56483;". This means the attached example can parse the XMI
>>>>> again, but parsing fails when both characters are written the way we
>>>>> experience it. If the XMI is manually changed so that both character
>>>>> values are included the way it happens in our output, a SAXParseException
>>>>> is thrown.
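>>>>> 
>>>>> (A small standalone check suggests why: &#56483; is a lone surrogate,
>>>>> which is not a legal XML character in either XML 1.0 or 1.1, so a
>>>>> conformant parser has to reject it - the exact exception message will
>>>>> vary by parser:)
>>>>> 
>>>>>         String xml = "<r name=\"&#77987;&#56483;\"/>";
>>>>>         javax.xml.parsers.SAXParserFactory.newInstance().newSAXParser()
>>>>>             .parse(new org.xml.sax.InputSource(new java.io.StringReader(xml)),
>>>>>                 new org.xml.sax.helpers.DefaultHandler());
>>>>>         // throws a SAXParseException complaining that the character
>>>>>         // reference "&#56483" is not a valid XML character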
>>>>> 
>>>>> I don’t know whether it is outside the scope of the XMI serialiser to 
>>>>> handle
>>>>> any of this, but it will be good to know in any case :)
>>>>> 
>>>>> Cheers,
>>>>> Mario
>>>>> 
>>>>>> On 17 Sep 2019, at 09:36, Mario Juric <[email protected]> wrote:
>>>>>> 
>>>>>> Thank you very much for looking into this. It is really appreciated and I
>>>>>> think it touches upon something important, which is about data migration 
>>>>>> in
>>>>>> general.
>>>>>> 
>>>>>> I agree that some of these solutions can appear specific, awkward or
>>>>>> complex, and the way forward is not to address our use case alone. I
>>>>>> think there is a need for a compact and efficient binary serialisation
>>>>>> format for the CAS when dealing with large amounts of data, because this
>>>>>> is directly visible in the costs of processing and storage, and I found
>>>>>> the compressed binary format to be much better than XMI in this regard,
>>>>>> although I have to admit it's been a while since I benchmarked this.
>>>>>> Given that UIMA already has a well-described type system, maybe it just
>>>>>> lacks a way to describe schema evolution, similar to Apache Avro or other
>>>>>> serialisation frameworks. I think a more formal approach to data
>>>>>> migration would be critical to any larger operational setup.
>>>>>> 
>>>>>> Regarding XMI, I'd like to provide some input on the problem we are
>>>>>> observing, so that it can be solved. We are primarily using XMI for
>>>>>> inspection/debugging purposes, and we are sometimes not able to do this
>>>>>> because of this error. I will try to extract a minimal example that
>>>>>> avoids involving parts that have to do with our pipeline and type
>>>>>> system; I think this would also be the best way to illustrate that the
>>>>>> problem exists outside of this context. However, converting all our data
>>>>>> to XMI first in order to do the conversion in our example would not be
>>>>>> very practical for us, because it involves a large amount of data.
>>>>>> 
>>>>>> Cheers,
>>>>>> Mario
>>>>>> 
>>>>>>> On 16 Sep 2019, at 23:02, Marshall Schor <[email protected]> wrote:
>>>>>>> 
>>>>>>> In this case, the original looks kind-of like this:
>>>>>>> 
>>>>>>> Container
>>>>>>>   features -> FSArray of FeatureAnnotation, each of which
>>>>>>>               has 5 slots: sofaRef, begin, end, name, value
>>>>>>> 
>>>>>>> the new TypeSystem has
>>>>>>> 
>>>>>>> Container
>>>>>>>   features -> FSArray of FeatureRecord, each of which
>>>>>>>               has 2 slots: name, value
>>>>>>> 
>>>>>>> The deserializer code would need some way to decide how to
>>>>>>>   1) create an FSArray of FeatureRecord,
>>>>>>>   2) for each element,
>>>>>>>      map the FeatureAnnotation to a new instance of FeatureRecord
>>>>>>> 
>>>>>>> I guess I could imagine a default mapping (for item 2 above) of
>>>>>>>  1) change the type from A to B
>>>>>>>  2) set equal-named features from A to B, drop other features
>>>>>>> 
>>>>>>> This mapping would need to apply to a subset of the A's and B's, namely
>>>>>>> only those referenced by the FSArray where the element type changed.
>>>>>>> Seems complex and specific to this use case, though.
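>>>>>>> 
>>>>>>> If we did want it, the per-element mapping might look roughly like this
>>>>>>> - a sketch only, mapByName is a made-up helper, and it only copies
>>>>>>> primitive-valued features:
>>>>>>> 
>>>>>>>   FeatureStructure mapByName(CAS cas, FeatureStructure src, Type target) {
>>>>>>>     FeatureStructure tgt = cas.createFS(target);
>>>>>>>     for (Feature sf : src.getType().getFeatures()) {
>>>>>>>       Feature tf = target.getFeatureByBaseName(sf.getShortName());
>>>>>>>       if (tf != null && sf.getRange().isPrimitive()) {
>>>>>>>         // set equal-named features from A to B, drop other features
>>>>>>>         tgt.setFeatureValueFromString(tf, src.getFeatureValueAsString(sf));
>>>>>>>       }
>>>>>>>     }
>>>>>>>     return tgt;
>>>>>>>   }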
>>>>>>> 
>>>>>>> -Marshall
>>>>>>> 
>>>>>>> 
>>>>>>> On 9/16/2019 2:42 PM, Richard Eckart de Castilho wrote:
>>>>>>>> On 16. Sep 2019, at 19:05, Marshall Schor <[email protected]> wrote:
>>>>>>>>> I can reproduce the problem, and see what is happening.  The
>>>>>>>>> deserialization code compares the two type systems, and allows for
>>>>>>>>> some mismatches (things present in one and not in the other), but it
>>>>>>>>> doesn't allow for having a feature whose range (value) is type XXXX
>>>>>>>>> in one type system and type YYYY in the other.
>>>>>>>>> See CasTypeSystemMapper lines 299 - 315.
>>>>>>>> Without reading the code in detail - could we not relax this check such
>>>>>>>> that the element type of FSArrays is not checked and the code simply
>>>>>>>> assumes that the source element type has the same features as the 
>>>>>>>> target
>>>>>>>> element type (with the usual lenient handling of missing features in 
>>>>>>>> the
>>>>>>>> target type)? - Kind of a "duck typing" approach?
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> 
>>>>>>>> -- Richard
