>> Richard Eckart de Castilho commented on UIMA-3141:
>> --------------------------------------------------
>> 
>> If the custom sub-type of DocumentAnnotation is part of the target type 
>> system, it works (not verified in exactly the given test case, but in the 
>> context from which this test was distilled).
>> 
>> Since the document annotation is a special annotation in UIMA, it may 
>> require special handling. I would expect that all features are set if they 
>> are available on the document annotation, even if the type of the document 
>> annotation is not the same.
> I'm not following... All features of DocMeta are set.  It's the entire type
> instance of DocMeta that's being "filtered out" when deserializing.
> 
> I'm probably not understanding your point correctly though - please say more.

I think my point is this: if the type T for a given FS is not available in the 
target type system, but a type S which is the supertype of T in the source type 
system is available in the target type system, then an instance of S should be 
created and all features should be set.
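
A minimal sketch of this fallback rule in plain Java. This is not the actual 
UIMA deserialization code; the class name, the resolve() method and the 
simplified type names are all made up for illustration. The source type 
hierarchy is assumed to be given as a child-to-parent map.

```java
import java.util.Map;
import java.util.Set;

public class SupertypeFallback {
    /**
     * Walks up the source type hierarchy starting at typeName and returns the
     * first type that also exists in the target type system, or null if no
     * type on the path to the root exists there.
     */
    public static String resolve(String typeName,
                                 Map<String, String> sourceParents,
                                 Set<String> targetTypes) {
        String current = typeName;
        while (current != null) {
            if (targetTypes.contains(current)) {
                return current;
            }
            // Move to the supertype; the root has no entry, which ends the loop.
            current = sourceParents.get(current);
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical source hierarchy (child -> parent).
        Map<String, String> sourceParents = Map.of(
            "DocMeta", "DocumentAnnotation",
            "DocumentAnnotation", "Annotation",
            "Unrelated", "Annotation",
            "Annotation", "TOP");
        // Hypothetical target type system that lacks DocMeta and Unrelated.
        Set<String> targetTypes = Set.of("TOP", "Annotation", "DocumentAnnotation");

        System.out.println(resolve("DocMeta", sourceParents, targetTypes));
        System.out.println(resolve("Unrelated", sourceParents, targetTypes));
    }
}
```

With these inputs, DocMeta resolves to DocumentAnnotation, while Unrelated 
resolves to Annotation, since nothing more specific exists in the target.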

Now, stating it like this, it becomes obvious that this is probably not the 
best idea in general. Eventually nothing would be filtered, because everything 
inherits from TOP. But I'll still go on and explain what led me to believe 
this would be a good idea, at least in certain cases.

Case 1: custom document annotation

First off, this point is moot when the document annotation type is customized 
as described in the UIMA documentation [1]. However, not everybody follows that 
documentation. E.g. Ruta and DKPro Core instead customize the document 
annotation type by deriving from it.

The document annotation is quite special. There are methods in the CAS 
interface (e.g. getDocumentLanguage()) which internally access the document 
annotation, but this is not obvious. It appears that the language is just a 
property of the CAS itself. 

When loading data from a binary CAS with a customized document annotation type 
into a target CAS with another document annotation type (either custom or 
default), one would expect that such general information as the document 
language should be preserved. It is basically mandatory that the language 
feature exists in any kind of document annotation, since it is blessed with its 
own dedicated getter/setter methods in the CAS interfaces.

Case 2: tags as types

Several type systems model tags/categories as types. A typical type hierarchy 
would e.g. contain a type PartOfSpeech with sub-types Noun, Verb, etc. (often 
categories from a specific tag set are used). The PartOfSpeech type tends to 
also have a feature holding the tag value, e.g. "tag", which takes values such 
as "NN", "NNP", etc. (generally from a specific tag set, even if the sub-types 
mentioned before may be more coarse-grained).

Assume one is serializing a CAS containing such tag sub-types, e.g. in an 
annotation editor. Now the user reconfigures the type system, e.g. switching 
from coarse-grained tag types ("Noun") to fine-grained tag types ("NN", "NNP", 
etc.). Then the user loads the data back. Currently, all the annotations of 
type "Noun" would be lost, because the "Noun" type does not exist anymore. It 
would be useful if they had just been downgraded to "PartOfSpeech" annotations, 
which now could be upgraded to the new "NN", "NNP" types.


As mentioned before, generally falling back to super-types is an obviously bad 
idea, even though there may be use cases where this can help (case 2). However, 
I still think that specially blessed information, such as the document 
language, should be preserved, even if the document annotation type is changed 
(case 1).

Cheers,

-- Richard

[1] 
http://uima.apache.org/d/uimaj-2.4.1/references.html#ugr.ref.jcas.documentannotation_issues
