One other thought experiment I use when considering new design extensions is
this:
Is this new thing (much) more likely to be a user error than a user intent?
In this case, the question would be: when a user defines a subtype of
DocumentAnnotation and then "filters" it out during serialization/
deserialization, is this (much) more likely to be a user error, where the user
would benefit from a warning/error message about it, or is it more likely a
deliberate, popular use case?
If it is much more likely a user error, we could have UIMA detect this and issue
a warning/error message.
-Marshall
On 8/5/2013 7:55 AM, Marshall Schor wrote:
> Thanks for expanding on this issue, see some comments below.
>
> On 8/5/2013 4:25 AM, Richard Eckart de Castilho wrote:
>>>> Richard Eckart de Castilho commented on UIMA-3141:
>>>> --------------------------------------------------
>>>>
>>>> If the custom sub-type of DocumentAnnotation is part of the target type
>>>> system, it works (not verified in exactly the given test case, but in the
>>>> context from which this test was distilled).
>>>>
>>>> Since the document annotation is a special annotation in UIMA, it may
>>>> require special handling. I would expect that all features are set if they
>>>> are available on the document annotation, even if the type of the document
>>>> annotation is not the same.
>>> I'm not following... All features of DocMeta are set. It's the entire type
>>> instance of DocMeta that's being "filtered out" when deserializing.
>>>
>>> I'm probably not understanding your point correctly though - please say
>>> more.
>> I think my point is this: if the type T for a given FS is not available in
>> the target type system, but a type S which is the supertype of T in the
>> source type system is available in the target type system, then an instance
>> of S should be created and all features should be set.
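>>
>> To make the idea concrete, a rough sketch of that fallback (this is not how
>> the deserializer works today; the method name and structure are made up,
>> only the TypeSystem calls are real UIMA API):
>>
>>   import org.apache.uima.cas.Type;
>>   import org.apache.uima.cas.TypeSystem;
>>
>>   // Hypothetical fallback: walk up the source supertype chain until a type
>>   // whose name also exists in the target type system is found.
>>   static Type mapToTargetType(Type srcType, TypeSystem srcTs, TypeSystem tgtTs) {
>>     for (Type t = srcType; t != null; t = srcTs.getParent(t)) {
>>       Type tgt = tgtTs.getType(t.getName());
>>       if (tgt != null) {
>>         return tgt; // create the FS as this (super)type in the target CAS
>>       }
>>     }
>>     return null; // unreachable in practice, since everything inherits from TOP
>>   }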
>>
>> Now, stating it like this, it becomes obvious that this is probably not the
>> best idea in general. Eventually nothing would be filtered, because
>> everything inherits from TOP. But I'll still go on and explain what led me
>> to believe this would be a good idea, at least in certain cases.
>>
>> Case 1: custom document annotation
>>
>> First off, this point is moot when the document annotation type is
>> customized as described in the UIMA documentation [1]. However, not
>> everybody follows that documentation. E.g. Ruta and DKPro Core instead
>> customize the document annotation type by deriving from it.
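>>
>> For illustration, such a derived document annotation might be declared
>> programmatically roughly like this (the type and feature names here are
>> just examples, not the actual Ruta or DKPro Core types):
>>
>>   import org.apache.uima.UIMAFramework;
>>   import org.apache.uima.resource.metadata.TypeDescription;
>>   import org.apache.uima.resource.metadata.TypeSystemDescription;
>>
>>   // Example only: a custom document annotation declared as a subtype of
>>   // uima.tcas.DocumentAnnotation.
>>   TypeSystemDescription tsd =
>>       UIMAFramework.getResourceSpecifierFactory().createTypeSystemDescription();
>>   TypeDescription docMeta = tsd.addType("example.DocMeta",
>>       "custom document annotation", "uima.tcas.DocumentAnnotation");
>>   docMeta.addFeature("collectionId", "id of the source collection",
>>       "uima.cas.String");
>>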
> This use was a surprise to me, and I wonder about the utility of it, as
> compared to extending the DocumentAnnotation by adding more features to it. I'm
> wondering why the original designers of UIMA didn't declare this type to be a
> type which could not be inherited from.
>> The document annotation is quite special. There are methods in the CAS
>> interface (e.g. getDocumentLanguage()) which internally access the document
>> annotation, but this is not obvious. It appears that the language is just a
>> property of the CAS itself.
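>>
>> For example, given a CAS variable "cas", these two reads should end up at
>> the same underlying value (a small sketch against the standard CAS API):
>>
>>   import org.apache.uima.cas.CAS;
>>   import org.apache.uima.cas.Feature;
>>
>>   // The convenience accessor on the CAS ...
>>   String lang1 = cas.getDocumentLanguage();
>>
>>   // ... is backed by the "language" feature of the document annotation.
>>   Feature langFeat =
>>       cas.getTypeSystem().getFeatureByFullName(CAS.FEATURE_FULL_NAME_LANGUAGE);
>>   String lang2 = cas.getDocumentAnnotation().getStringValue(langFeat);
>>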
> I agree this design is a bit unusual, and I don't know the reason it was done
> this way, other than I know there was a desire to keep UIMA independent of the
> actual kind of unstructured information being processed, and the designers were
> aware that not all unstructured data was "text" (think of audio, video, etc.).
> So my guess of the motivation behind this is that "language" was not part of the
> CAS, but rather part of the DocumentAnnotation, which was specific to "text".
> But for convenience, the set/get methods were added to the CAS interface.
>
>>
>>
>> When loading data from a binary CAS with a customized document annotation
>> type into a target CAS with another document annotation type (either custom
>> or default), one would expect that such general information as the document
>> language should be preserved. It is basically mandatory that the language
>> feature exists in any kind of document annotation, since it is blessed with
>> its own dedicated getter/setter methods in the CAS interfaces.
> So, I suppose we could special-case this feature. But it's not clear in the
> general case how to design this. The general case might include situations
> where users declared multiple subtypes of DocumentAnnotation, or even
> subtypes of subtypes (in a supertype chain), and set some of their "language"
> features to several different values. Some subset of these might be
> "filtered", but others might still exist.
>
> I think this is a surprising thing for users to do; however, I was surprised
> that users made subtypes of DocumentAnnotation. And I wonder if the better
> solution is to deprecate making subtypes of DocumentAnnotation, rather than
> trying to find a way to handle these kinds of cases.
>
>> Case 2: tags as types
>>
>> Several type systems model tags/categories as types. A typical type
>> hierarchy would e.g. contain a type PartOfSpeech and sub-types Noun, Verb,
>> etc. (often categories from a specific tag set are used). The PartOfSpeech
>> type tends to also have a feature holding the tag value, e.g. "tag", which
>> assumes values such as "NN", "NNP", etc. (generally from a specific tag set,
>> even if the sub-types mentioned before may be more coarse-grained).
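>>
>> As a concrete (made-up) illustration, such a hierarchy could be declared
>> like this:
>>
>>   import org.apache.uima.UIMAFramework;
>>   import org.apache.uima.resource.metadata.TypeDescription;
>>   import org.apache.uima.resource.metadata.TypeSystemDescription;
>>
>>   // Example only: a coarse PartOfSpeech type carrying the raw tag value,
>>   // plus tag-specific sub-types such as Noun.
>>   TypeSystemDescription tsd =
>>       UIMAFramework.getResourceSpecifierFactory().createTypeSystemDescription();
>>   TypeDescription pos = tsd.addType("example.PartOfSpeech",
>>       "part-of-speech annotation", "uima.tcas.Annotation");
>>   pos.addFeature("tag", "tag set value, e.g. NN or NNP", "uima.cas.String");
>>   tsd.addType("example.Noun", "coarse-grained noun", "example.PartOfSpeech");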
>>
>> Assume one is serializing a CAS containing such tag sub-types, e.g. in an
>> annotation editor. Now the user reconfigures the type system, e.g. switching
>> from coarse-grained tag types ("Noun") to fine-grained tag types ("NN",
>> "NNP", etc.). Then the user loads the data back. Currently, all the
>> annotations of type "Noun" would be lost, because the "Noun" type does not
>> exist anymore. It would be useful if they had just been downgraded to
>> "PartOfSpeech" annotations, which now could be upgraded to the new "NN",
>> "NNP" types.
> I wonder if supporting this kind of up-classing is sufficiently useful and
> general to be part of the form 6 serialization / deserialization. I can
> imagine many other kinds of type system "conversions" that users might want.
>
> The general topic of type system conversion is a complex one. I think more
> complex forms of type conversion are an orthogonal topic to compressed binary
> serialization. More complex forms of this probably don't belong in form 6
> serialization/deserialization, which I think should be limited to the simpler
> type and feature filtering, which is also done in other serialization /
> deserialization forms when "lenient" forms are used (CasCopier has a lenient
> form as well).
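>
> For example, the kind of simple filtering I mean is roughly what a lenient
> CasCopier does (a sketch; srcCas and tgtCas are assumed to exist, and the
> exact lenient entry point may differ between UIMA versions):
>
>   import org.apache.uima.cas.CAS;
>   import org.apache.uima.util.CasCopier;
>
>   // Lenient copy: feature structures (and features) whose types do not
>   // exist in the target type system are simply skipped, not promoted to a
>   // supertype.
>   CasCopier copier = new CasCopier(srcCas, tgtCas, /* lenient = */ true);
>   copier.copyCasView(srcCas, /* copy sofa = */ true);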
>
>
>>
>> As mentioned before, generally falling back to super-types is an obviously
>> bad idea, even though there may be use cases where this can help (case 2).
>> However, I still think that specially blessed information, such as the document
>> language, should be preserved, even if the document annotation type is
>> changed (case 1).
> Is this a real, frequently occurring situation? Why wouldn't one include the
> DocMeta type in the target type system? I think that in the general case
> (where users could design an arbitrary tree of subtypes of DocumentAnnotation,
> and instantiate one or more of these types, and then filter one or more of
> these types), there is not an obvious design for how to "pick" the right
> language setting, how to promote it, or whether it needs promoting at all. I
> think this whole area can easily go beyond the design intent of UIMA (which was
> to encourage interoperability and sharing in a growing community of people
> working in unstructured analysis), and that the better solution is to gradually
> enforce the simpler approach by deprecating type definitions that try to be a
> subtype of DocumentAnnotation, unless of course there are valid use-cases for
> doing this (which I'm unaware of at the moment :-) ).
>
> -Marshall