Re: Ideas for UIMA v3

Scott Cyphers Wed, 24 Jun 2015 08:38:29 -0700

Hi,

I have been skimming this thread while on vacation.  Before vacation, I started
writing up a statically typed approach I developed that can deal with large
vectors and multi-sofa analysis, offers features like DkPro’s, and has no need
for a separate type system.  I was going to finish next week and get some
feedback from a couple of UIMA users I know, but it looks like it could be
relevant to this discussion, so here it is in raw and unfinished form:
http://people.csail.mit.edu/cyphers/uima/xcas.pdf
The implementation is quite simple and I can release it under a BSD license.
I am in the camp that favors keeping frameworks minimal and interoperable with
other frameworks.  Unfortunately, this is about the limit of my participation
until next week.


Scott

> On Jun 24, 2015, at 9:46 AM, Marshall Schor <[email protected]> wrote:
> 
> On 6/24/2015 3:12 AM, Richard Eckart de Castilho wrote:
>> On 23.06.2015, at 17:11, Marshall Schor <[email protected]> wrote:
>> 
>>> I added a wiki page to develop the ideas here. 
>>> 
>>> This is what I got from reading this:
>>> 
>>> One idea is having an annotator not have a type system specification, but
>>> rather have it dynamically create types / features according to some
>>> configuration info or some dynamically-obtained information (perhaps the
>>> results of some previous analysis).
>> I think that right now, an annotator doesn't need to have a type system
>> specification.
> This seems true, especially for "generic" kinds of annotators that aren't tied
> to particular types/features.
> Those annotators have the ability to query the TypeSystem, and we have APIs
> that use indirection in specifying both types and features.
> That is, instead of saying
>   instance_of_annotation.setBegin(123), where instance_of_annotation is
> statically typed to be a (subtype of) annotation, and the begin feature is
> hard-coded directly into the code via "setBegin", you could write
>   someFS.setIntFeature(someIndirectionToAfeature, 123), where someFS could be
> typed as a generic FeatureStructure, and someIndirectionToAfeature could be of
> the generic type Feature, and set to the particular feature elsewhere.
> 
> This indirection has pros and cons.  Pros: it enables generic annotators where
> the types and features are not explicit in the code.  Cons: when the types and
> features are known, annotator code written without the indirection is easier
> to read.  Indirection can also bring performance / space issues.
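
A minimal sketch of the contrast described above, in plain Java. The classes
`Feature`, `GenericFS`, and `SpanAnnotation` are hypothetical stand-ins written
for this illustration, not the UIMA classes, and `setIntFeature` only follows
the name used in the message rather than any confirmed UIMA signature.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration; these are NOT the UIMA classes.
final class Feature {
    final String name;
    Feature(String name) { this.name = name; }
}

class GenericFS {
    private final Map<String, Integer> ints = new HashMap<>();

    // Indirect access: the feature to write is chosen by the caller at run time.
    void setIntFeature(Feature feature, int value) {
        ints.put(feature.name, value);
    }

    int getIntFeature(Feature feature) {
        Integer value = ints.get(feature.name);
        if (value == null) {
            throw new IllegalArgumentException("feature not set: " + feature.name);
        }
        return value;
    }
}

class SpanAnnotation extends GenericFS {
    static final Feature BEGIN = new Feature("begin");

    // Static access: the feature is hard-coded into the method name.
    void setBegin(int begin) { setIntFeature(BEGIN, begin); }
    int getBegin() { return getIntFeature(BEGIN); }
}

class IndirectionSketch {
    // A generic routine: it works for any int feature handed to it.
    static void shift(GenericFS fs, Feature feature, int delta) {
        fs.setIntFeature(feature, fs.getIntFeature(feature) + delta);
    }

    public static void main(String[] args) {
        SpanAnnotation a = new SpanAnnotation();
        a.setBegin(123);                       // static style
        shift(a, SpanAnnotation.BEGIN, 10);    // indirect style
        System.out.println(a.getBegin());      // prints 133
    }
}
```

The `shift` routine never names a particular feature, which is the "generic
annotator" benefit; the extra map lookup behind every call hints at where the
performance / space cost of indirection can come from.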
> 
>> The specification is necessary to create a CAS, but not to create an
>> annotator.
>> With uimaFIT, it is common to first create a CAS (based on types
>> automatically detected in the classpath), then fill that CAS with some
>> initial information (avoiding a reader for easy embedding into an
>> application), and then to pass that CAS through an aggregate. While uimaFIT
>> also adds the automatically detected types to every analysis engine
>> description in the aggregate, I believe this is not really necessary because
>> the CAS has already been initialized.
>> 
>> Independent from that is the problem that the type system is locked after
>> the CAS has been created. Engines such as Ruta would profit if the type
>> system would at least allow compatible changes such as adding new types or
>> adding new features to existing types. The types may not be known at the
>> time the CAS is initialized, but only when the CAS is actually being
>> processed.
> Some languages (Ruby, JavaScript) allow dynamic modification of classes, so
> new types can be defined and new features can be added to classes.
> In fact, I found this web article, which gives a very long list of languages
> (Java not among them) where fields can be added to a class at runtime:
> http://rosettacode.org/wiki/Add_a_variable_to_a_class_instance_at_runtime
> 
> In Java, you can add classes at runtime; but modifying existing classes (to
> add additional fields) is not supported.
> UIMA's current design (where Java is optional) might be able to be extended to
> support new types and additional fields, at some cost in performance/space.
> The recently proposed cas-object design could also partially support this, I
> think.  (It couldn't support this sequence: 1) create a FS of a type with 3
> features, 2) add feature # 4, 3) set feature # 4 in the already created FS.)
> More dynamic data structures of course do support this idea of dynamically
> extensible Types.
> 
> Other alternative JCas approaches, which generate a full JCas cover class
> automatically from the merged type systems, would also have problems with
> adding features to existing Types, but could define dynamic new types.
> 
> Finally, we could modify the Java cover class design to support a hybrid:
> those things known ahead could be statically typed, and those things added
> dynamically could be handled with more flexible augmentations embedded into
> the generated class; maybe this allows the best of both worlds.
> 
> The usual pros/cons apply.
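
One way to picture the hybrid cover class idea is sketched below. It assumes a
generated class that keeps the features known ahead of time as ordinary fields
and holds later additions in a per-instance map; `HybridToken` and its method
names are hypothetical and do not come from UIMA or from any proposal in this
thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical hybrid cover class; not generated by or taken from UIMA.
class HybridToken {
    // Features known ahead of time: ordinary, statically typed fields.
    private final int begin;
    private final int end;

    // Features added after the class was generated: stored by feature name.
    private final Map<String, Object> dynamicFeatures = new HashMap<>();

    HybridToken(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }

    int getBegin() { return begin; }
    int getEnd() { return end; }

    // Works even for a feature that did not exist when this object was created.
    void setDynamic(String feature, Object value) {
        dynamicFeatures.put(feature, value);
    }

    // Returns null when the feature was never set on this instance.
    Object getDynamic(String feature) {
        return dynamicFeatures.get(feature);
    }
}
```

In this sketch the static part keeps compile-time checking for `begin` and
`end`, while the map gives up type checking (values are plain `Object`) in
exchange for accepting features that appear later.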
>>> Another idea is having an annotator be able to read Feature Structure data
>>> from a wide variety of sources, and have the data include the type/feature
>>> metadata (either externally - as we do now in UIMA with a type system
>>> external XML specification, or embedded - like JSON would naturally do).
>>> Such an annotator would have some notion of the type / feature information
>>> it was interested in processing, but could ignore the rest.
>> Let's see...
>> 
>> a) easier ingestion of data into feature structures, optimally by
>>   automatically creating FSes based on a (typed) external data description.
>>   E.g. a JSON object like
>> 
>>   { "fs1": {"feature1": "value1", "feature2": 10 } }
>> 
>>   would be converted to a FS with a string feature1 and a numeric feature2.
>>   However, the type of the FS would basically be underspecified in the type
>>   system as the next feature structure read could have the same features 
>>   using different value ranges and in fact the type of the FS itself is
>>   unknown. Sounds as if heading towards some kind of duck-typing e.g. for
>>   annotations (if it has a begin/end, then it is an annotation).
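
The duck-typing idea in (a) can be sketched roughly as follows. The sketch
assumes the JSON has already been parsed into a `Map` (the Java standard
library has no JSON parser), and `DuckTypedFS`, `rangeOf`, and
`looksLikeAnnotation` are hypothetical names invented for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical feature structure without a declared type; not a UIMA class.
class DuckTypedFS {
    private final Map<String, Object> features = new LinkedHashMap<>();

    // Build an FS from already-parsed data such as {"feature1": "value1", "feature2": 10}.
    static DuckTypedFS fromMap(Map<String, ?> parsed) {
        DuckTypedFS fs = new DuckTypedFS();
        fs.features.putAll(parsed);
        return fs;
    }

    // The range is inferred from the value actually present, not from a type system.
    String rangeOf(String feature) {
        Object value = features.get(feature);
        if (value instanceof String) return "string";
        if (value instanceof Number) return "numeric";
        return "unknown";
    }

    // Duck typing: anything with integer begin/end is treated as an annotation.
    boolean looksLikeAnnotation() {
        return features.get("begin") instanceof Integer
            && features.get("end") instanceof Integer;
    }
}
```

Because the range is read off each value, two feature structures with the same
feature names can still disagree on ranges, which is the underspecification
noted above.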
> An interesting thing to observe is that in this direction of "simplicity", the
> ideas of Views and Sofas and Indexes might be optional?
> A thought experiment: is there a decomposition for UIMA facilities that can
> omit these kinds of things if not "needed", yet gradually include this
> functionality for more complex implementations?
> 
>> 
>> b) I didn't get the part about an annotator being interested in some of
>>   the type/feature information but being able to ignore the rest.
> This is the concept (already present in the way UIMA deserializers operate for
> remote annotators) that when reading an external representation, you don't
> have to be able to handle all the types and features.  You can "ignore" those
> you don't recognize, and just work with those you're interested in.
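
A rough sketch of "ignore what you don't recognize" when reading external
data. It does not reproduce how the UIMA deserializers work; `SelectiveReader`
and `keepKnown` are hypothetical names, and the external representation is
assumed to be a simple map from feature names to values.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; it only illustrates lenient reading, not UIMA's format.
class SelectiveReader {
    // Keep only the features this annotator declares an interest in;
    // anything else in the external data is skipped rather than rejected.
    static Map<String, Object> keepKnown(Map<String, ?> external, Set<String> known) {
        Map<String, Object> kept = new LinkedHashMap<>();
        for (Map.Entry<String, ?> entry : external.entrySet()) {
            if (known.contains(entry.getKey())) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }
}
```

An annotator that only cares about `begin` and `end` could call
`keepKnown(data, Set.of("begin", "end"))` and never see a `pos` feature
produced by some other component.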
>> 
>>> Finally, a third idea is to have the componentization be such that no "UIMA
>>> Framework" was needed, or if present, it's hidden.  I'm thinking that this
>>> means, for simpler analytics, the idea of a pipeline, and combining things,
>>> would not be present; it would be more like just a single annotator.  For
>>> more complex things, the idea of a pipeline would be encapsulated (like
>>> UIMA's Aggregates), and the whole thing would look like something that could
>>> be embedded in any of the other "big data" frameworks as an analysis piece.
>>> The implication is that this would enable using other frameworks' scaleout
>>> mechanisms.
>> uimaFIT goes a long way in "hiding" the bulk of the UIMA framework and
>> providing a rather sane Java API for pipelines. It makes the creation of a
>> POJO wrapper around them a breeze. People do use this to embed UIMA in
>> alternative scale-out frameworks such as Hadoop.
>> 
>> Just for the sake of knowing where this is going, assuming the UIMA core API
>> as a baseline and the uimaFIT API as an improvement, what would this further
>> improvement look like?
> It might look like some kind of layering, stripping out complexity (until
> needed).  (See thought experiment, above).
>> 
>> 
>> Or would the issue be solvable by integrating uimaFIT into the core (e.g. to
>> avoid re-approval of libraries by company legal departments)?
> I don't think this integration solves this issue, but integrating uimaFIT
> into the core seems like a good thing to work on (it's an item in the v3 wiki
> page).
> 
> -Marshall
