One other thought experiment I use when considering new design extensions is
this:
Is this new thing (much) more likely to be a user error than a user intent?
In this case, the question would be: when a user defines a subtype of
DocumentAnnotation and then "filters" it out during serialization/
deserialization, is this (much) more likely to be a user error, where the user
would benefit from a warning/error message about it, or is it more likely a
deliberate, popular use case?
If it is much more likely a user error, we could have UIMA detect this and issue
a warning/error message.
-Marshall
On 8/5/2013 7:55 AM, Marshall Schor wrote:
> Thanks for expanding on this issue, see some comments below.
>
> On 8/5/2013 4:25 AM, Richard Eckart de Castilho wrote:
>>>> Richard Eckart de Castilho commented on UIMA-3141:
>>>> --------------------------------------------------
>>>>
>>>> If the custom sub-type of DocumentAnnotation is part of the target type
>>>> system, it works (not verified in exactly the given test case, but in the
>>>> context from which this test was distilled).
>>>>
>>>> Since the document annotation is a special annotation in UIMA, it may
>>>> require special handling. I would expect that all features are set if they
>>>> are available on the document annotation, even if the type of the document
>>>> annotation is not the same.
>>> I'm not following... All features of DocMeta are set. It's the entire type
>>> instance of DocMeta that's being "filtered out" when deserializing.
>>>
>>> I'm probably not understanding your point correctly though - please say
>>> more.
>> I think my point is this: if the type T for a given FS is not available in
>> the target type system, but a type S which is the supertype of T in the
>> source type system is available in the target type system, then an instance
>> of S should be created and all features should be set.
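>>
>> To make the idea concrete, a rough sketch of that fallback (this is not how
>> the deserializer works today; the method name and structure are made up,
>> only the TypeSystem calls are real UIMA API):
>>
>>   import org.apache.uima.cas.Type;
>>   import org.apache.uima.cas.TypeSystem;
>>
>>   // Hypothetical fallback: walk up the source supertype chain until a type
>>   // whose name also exists in the target type system is found.
>>   static Type mapToTargetType(Type srcType, TypeSystem srcTs, TypeSystem tgtTs) {
>>     for (Type t = srcType; t != null; t = srcTs.getParent(t)) {
>>       Type tgt = tgtTs.getType(t.getName());
>>       if (tgt != null) {
>>         return tgt; // create the FS as this (super)type in the target CAS
>>       }
>>     }
>>     return null; // unreachable in practice, since everything inherits from TOP
>>   }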
>>
>> Now, stating it like this, it becomes obvious that this is probably not the
>> best idea in general. Eventually nothing would be filtered, because
>> everything inherits from TOP. But I'll still go on and explain what led me
>> to believe this would be a good idea, at least in certain cases.
>>
>> Case 1: custom document annotation
>>
>> First off, this point is moot when the document annotation type is
>> customized as described in the UIMA documentation [1]. However, not
>> everybody follows that documentation. E.g. Ruta and DKPro Core instead
>> customize the document annotation type by deriving from it.
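>>
>> For illustration, such a derived document annotation might be declared
>> programmatically roughly like this (the type and feature names here are
>> just examples, not the actual Ruta or DKPro Core types):
>>
>>   import org.apache.uima.UIMAFramework;
>>   import org.apache.uima.resource.metadata.TypeDescription;
>>   import org.apache.uima.resource.metadata.TypeSystemDescription;
>>
>>   // Example only: a custom document annotation declared as a subtype of
>>   // uima.tcas.DocumentAnnotation.
>>   TypeSystemDescription tsd =
>>       UIMAFramework.getResourceSpecifierFactory().createTypeSystemDescription();
>>   TypeDescription docMeta = tsd.addType("example.DocMeta",
>>       "custom document annotation", "uima.tcas.DocumentAnnotation");
>>   docMeta.addFeature("collectionId", "id of the source collection",
>>       "uima.cas.String");
>>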
> This use was a surprise to me, and I wonder about the utility of it, as
> compared to extending the DocumentAnnotation by adding more features to it. I'm
> wondering why the original designers of UIMA didn't declare this type to be a
> type which could not be inherited from.
>> The document annotation is quite special. There are methods in the CAS
>> interface (e.g. getDocumentLanguage()) which internally access the document
>> annotation, but this is not obvious. It appears that the language is just a
>> property of the CAS itself.
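>>
>> For example, given a CAS variable "cas", these two reads should end up at
>> the same underlying value (a small sketch against the standard CAS API):
>>
>>   import org.apache.uima.cas.CAS;
>>   import org.apache.uima.cas.Feature;
>>
>>   // The convenience accessor on the CAS ...
>>   String lang1 = cas.getDocumentLanguage();
>>
>>   // ... is backed by the "language" feature of the document annotation.
>>   Feature langFeat =
>>       cas.getTypeSystem().getFeatureByFullName(CAS.FEATURE_FULL_NAME_LANGUAGE);
>>   String lang2 = cas.getDocumentAnnotation().getStringValue(langFeat);
>>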
> I agree this design is a bit unusual, and I don't know the reason it was done
> this way, other than I know there was a desire to keep UIMA independent of the
> actual kind of unstructured information being processed, and the designers were
> aware that not all unstructured data was "text" (think of audio, video, etc.).
> So my guess of the motivation behind this is that "language" was not part of the
> CAS, but rather part of the DocumentAnnotation, which was specific to "text".
> But for convenience, the set/get methods were added to the CAS interface.
>
>>
>>
>> When loading data from a binary CAS with a customized document annotation
>> type into a target CAS with another document annotation type (either custom
>> or default), one would expect that such general information as the document
>> language should be preserved. It is basically mandatory that the language
>> feature exists in any kind of document annotation, since it is blessed with
>> its own dedicated getter/setter methods in the CAS interfaces.
> So, I suppose we could special-case this feature. But it's not clear in the
> general case how to design this. The general case might include situations
> where users declared multiple subtypes of DocumentAnnotation, or even
> subtypes of subtypes (in a supertype chain), and set some of their "language"
> features to several different values. Some subset of these might be
> "filtered", but others might still exist.
>
> I think this is a surprising thing for users to do; however, I was surprised
> that users made subtypes of DocumentAnnotation. And I wonder if the better
> solution is to deprecate making subtypes of DocumentAnnotation, rather than
> trying to find a way to handle these kinds of cases.
>
>> Case 2: tags as types
>>
>> Several type systems model tags/categories as types. A typical type
>> hierarchy would e.g. contain a type PartOfSpeech and sub-types Noun, Verb,
>> etc. (often categories from a specific tag set are used). The PartOfSpeech
>> type tends to also have a feature holding the tag value, e.g. "tag", which
>> assumes values such as "NN", "NNP", etc. (generally from a specific tag set,
>> even if the sub-types mentioned before may be more coarse-grained).
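>>
>> As a concrete (made-up) illustration, such a hierarchy could be declared
>> like this:
>>
>>   import org.apache.uima.UIMAFramework;
>>   import org.apache.uima.resource.metadata.TypeDescription;
>>   import org.apache.uima.resource.metadata.TypeSystemDescription;
>>
>>   // Example only: a coarse PartOfSpeech type carrying the raw tag value,
>>   // plus tag-specific sub-types such as Noun.
>>   TypeSystemDescription tsd =
>>       UIMAFramework.getResourceSpecifierFactory().createTypeSystemDescription();
>>   TypeDescription pos = tsd.addType("example.PartOfSpeech",
>>       "part-of-speech annotation", "uima.tcas.Annotation");
>>   pos.addFeature("tag", "tag set value, e.g. NN or NNP", "uima.cas.String");
>>   tsd.addType("example.Noun", "coarse-grained noun", "example.PartOfSpeech");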
>>
>> Assume one is serializing a CAS containing such tag sub-types, e.g. in an
>> annotation editor. Now the user reconfigures the type system, e.g. switching
>> from coarse-grained tag types ("Noun") to fine-grained tag types ("NN",
>> "NNP", etc.). Then the user loads the data back. Currently, all the
>> annotations of type "Noun" would be lost, because the "Noun" type does not
>> exist anymore. It would be useful if they had just been downgraded to
>> "PartOfSpeech" annotations, which now could be upgraded to the new "NN",
>> "NNP" types.
> I wonder if supporting this kind of up-classing is sufficiently useful and
> general to be part of the form 6 serialization / deserialization. I can
> imagine many other kinds of type system "conversions" that users might want.
>
> The general topic of type system conversion is a complex one. I think more
> complex forms of type conversion are an orthogonal topic to compressed binary
> serialization. More complex forms of this probably don't belong in form 6
> serialization/deserialization, which I think should be limited to the simpler
> type and feature filtering, which is also done in other serialization /
> deserialization forms when "lenient" forms are used (CasCopier has a lenient
> form as well).
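>
> For example, the kind of simple filtering I mean is roughly what a lenient
> CasCopier does (a sketch; srcCas and tgtCas are assumed to exist, and the
> exact lenient entry point may differ between UIMA versions):
>
>   import org.apache.uima.cas.CAS;
>   import org.apache.uima.util.CasCopier;
>
>   // Lenient copy: feature structures (and features) whose types do not
>   // exist in the target type system are simply skipped, not promoted to a
>   // supertype.
>   CasCopier copier = new CasCopier(srcCas, tgtCas, /* lenient = */ true);
>   copier.copyCasView(srcCas, /* copy sofa = */ true);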
>
>
>>
>> As mentioned before, generally falling back to super-types is an obviously
>> bad idea, even though there may be use cases where this can help (case 2).
>> However, I still think that specially blessed information, such as the document
>> language, should be preserved, even if the document annotation type is
>> changed (case 1).
> Is this a real, frequently occurring situation? Why wouldn't one include the
> DocMeta type in the target type system? I think that in the general case
> (where users could design an arbitrary tree of subtypes of DocumentAnnotation,
> and instantiate one or more of these types, and then filter one or more of
> these types), there is not an obvious design for how to "pick" the right
> language setting, how to promote it, or whether it needs promoting at all. I
> think this whole area can easily go beyond the design intent of UIMA (which was
> to encourage interoperability and sharing in a growing community of people
> working in unstructured analysis), and that the better solution is to gradually
> enforce the simpler approach by deprecating type definitions that try to be a
> subtype of DocumentAnnotation, unless of course there are valid use-cases for
> doing this (which I'm unaware of at the moment :-) ).
>
> -Marshall