Re: [jira] [Commented] (UIMA-3141) Binary CAS format 6 + type filtering fails to deserialize document annotation correctly

Richard Eckart de Castilho Mon, 05 Aug 2013 01:27:16 -0700

>> Richard Eckart de Castilho commented on UIMA-3141:
>> --------------------------------------------------
>> 
>> If the custom sub-type of DocumentAnnotation is part of the target type 
>> system, it works (not verified in exactly the given test case, but in the 
>> context from which this test was distilled).
>> 
>> Since the document annotation is a special annotation in UIMA, it may 
>> require special handling. I would expect that all features are set if they 
>> are available on the document annotation, even if the type of the document 
>> annotation is not the same.
> I'm not following... All features of DocMeta are set.  It's the entire type
> instance of DocMeta that's being "filtered out" when deserializing.
> 
> I'm probably not understanding your point correctly though - please say more.

I think my point is this: if the type T for a given FS is not available in the
target type system, but a type S which is the supertype of T in the source type
system is available in the target type system, then an instance of S should be
created and all features should be set.
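To make the rule concrete, here is a minimal sketch in plain Java. It does not
use the UIMA API: the type systems are modeled as simple maps and sets, the
uima.* names are the built-in UIMA types, and example.DocMeta as well as the
class and method names are made up for illustration.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;

final class SupertypeFallback {
    // Source type system as child -> parent. The uima.* entries are UIMA's
    // built-in types; example.DocMeta is a hypothetical custom sub-type.
    static final Map<String, String> SOURCE_PARENTS = Map.of(
        "example.DocMeta", "uima.tcas.DocumentAnnotation",
        "uima.tcas.DocumentAnnotation", "uima.tcas.Annotation",
        "uima.tcas.Annotation", "uima.cas.AnnotationBase",
        "uima.cas.AnnotationBase", "uima.cas.TOP");

    // Target type system that does not contain example.DocMeta.
    static final Set<String> TARGET_TYPES = Set.of(
        "uima.cas.TOP", "uima.cas.AnnotationBase",
        "uima.tcas.Annotation", "uima.tcas.DocumentAnnotation");

    /** Returns the type itself, or its nearest source supertype known to the target. */
    static Optional<String> resolve(String type,
                                    Map<String, String> sourceParents,
                                    Set<String> targetTypes) {
        // Walk up the source hierarchy until a type the target knows is found.
        for (String t = type; t != null; t = sourceParents.get(t)) {
            if (targetTypes.contains(t)) {
                return Optional.of(t);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(resolve("example.DocMeta", SOURCE_PARENTS, TARGET_TYPES));
    }
}
```

With the sample data, example.DocMeta resolves to uima.tcas.DocumentAnnotation,
i.e. T is missing in the target, so its supertype S is used instead.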

Now, stating it like this, it becomes obvious that this is probably not the
best idea in general. Eventually nothing would be filtered, because everything
inherits from TOP. But I'll still go on and explain what led me to believe
this would be a good idea, at least in certain cases.

Case 1: custom document annotation

First off, this point is moot when the document annotation type is customized
as described in the UIMA documentation [1]. However, not everybody follows that
documentation. E.g. Ruta and DKPro Core instead customize the document
annotation type by deriving from it.

The document annotation is quite special. There are methods in the CAS
interface (e.g. getDocumentLanguage()) which internally access the document
annotation, but this is not obvious to the caller: it looks as if the language
were just a property of the CAS itself.

When loading data from a binary CAS with a customized document annotation type
into a target CAS with another document annotation type (either custom or
default), one would expect that such general information as the document
language should be preserved. It is basically mandatory that the language
feature exists in any kind of document annotation, since it is blessed with its
own dedicated getter/setter methods in the CAS interfaces.
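A rough sketch of what I have in mind, again in plain Java without the UIMA
API: when the document annotation types differ, carry over the values of those
features that both types declare. Feature structures are modeled as maps here;
the class, the method names and the documentTitle feature are hypothetical,
while begin, end and language are features of the built-in
uima.tcas.DocumentAnnotation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class SharedFeatureCopy {
    // Features of the built-in document annotation considered in this sketch.
    static final Set<String> TARGET_FEATURES = Set.of("begin", "end", "language");

    /** Keeps only those source feature values whose feature the target type declares. */
    static Map<String, Object> copyShared(Map<String, Object> sourceValues,
                                          Set<String> targetFeatures) {
        Map<String, Object> copied = new LinkedHashMap<>();
        sourceValues.forEach((name, value) -> {
            if (targetFeatures.contains(name)) {
                copied.put(name, value);
            }
        });
        return copied;
    }

    /** Sample instance of a hypothetical custom document annotation type. */
    static Map<String, Object> sampleDocMeta() {
        Map<String, Object> fs = new LinkedHashMap<>();
        fs.put("begin", 0);
        fs.put("end", 120);
        fs.put("language", "de");
        fs.put("documentTitle", "example.txt"); // hypothetical custom feature
        return fs;
    }

    public static void main(String[] args) {
        System.out.println(copyShared(sampleDocMeta(), TARGET_FEATURES));
    }
}
```

The custom documentTitle value is dropped, but the language survives, so a
later getDocumentLanguage() on the target CAS would still have something to
return.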

Case 2: tags as types

Several type systems model tags/categories as types. A typical type hierarchy
would e.g. contain a type PartOfSpeech with sub-types such as Noun, Verb, etc.
(often categories from a specific tag set are used). The PartOfSpeech type
tends to also have a feature holding the tag value, e.g. "tag", which assumes
values such as "NN", "NNP", etc. (generally from a specific tag set, even if
the sub-types mentioned before may be more coarse-grained).

Assume one is serializing a CAS containing such tag sub-types, e.g. in an
annotation editor. Now the user reconfigures the type system, e.g. switching
from coarse-grained tag types ("Noun") to fine-grained tag types ("NN", "NNP",
etc.). Then the user loads the data back. Currently, all the annotations of
type "Noun" would be lost, because the "Noun" type does not exist anymore. It
would be useful if they were instead downgraded to "PartOfSpeech" annotations,
which could then be upgraded to the new "NN", "NNP" types.
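The downgrade/upgrade round trip could look roughly like the following sketch.
It is plain Java, not UIMA code; the Pos class and both methods are
hypothetical, and it assumes that each new fine-grained type is named exactly
like the tag value it represents.

```java
import java.util.Set;

final class TagTypeMigration {
    /** Simplified annotation: type name, offsets and the value of the "tag" feature. */
    static final class Pos {
        final String type;
        final int begin;
        final int end;
        final String tag;

        Pos(String type, int begin, int end, String tag) {
            this.type = type;
            this.begin = begin;
            this.end = end;
            this.tag = tag;
        }
    }

    static final String BASE = "PartOfSpeech";

    // Reconfigured type system: "Noun" is gone, fine-grained types were added.
    static final Set<String> NEW_TYPES = Set.of("PartOfSpeech", "NN", "NNP");

    /** Falls back to the base type if the annotation's own type no longer exists. */
    static Pos downgrade(Pos a, Set<String> targetTypes) {
        return targetTypes.contains(a.type) ? a : new Pos(BASE, a.begin, a.end, a.tag);
    }

    /** Re-types a base annotation to the type named after its tag, if such a type exists. */
    static Pos upgrade(Pos a, Set<String> targetTypes) {
        boolean canUpgrade = a.type.equals(BASE)
            && a.tag != null
            && targetTypes.contains(a.tag);
        return canUpgrade ? new Pos(a.tag, a.begin, a.end, a.tag) : a;
    }

    public static void main(String[] args) {
        Pos loaded = downgrade(new Pos("Noun", 0, 3, "NN"), NEW_TYPES);
        Pos migrated = upgrade(loaded, NEW_TYPES);
        System.out.println(loaded.type + " -> " + migrated.type);
    }
}
```

The important part is the downgrade step: because the annotation and its "tag"
value survive loading, the later upgrade has something to work with.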

As mentioned before, generally falling back to super-types is an obviously bad
idea, even though there may be use cases where this can help (case 2). However,
I still think that specially blessed information, such as the document language,
should be preserved, even if the document annotation type is changed (case 1).

Cheers,

-- Richard

[1]
http://uima.apache.org/d/uimaj-2.4.1/references.html#ugr.ref.jcas.documentannotation_issues
