Hi,

I have been skimming this thread while on vacation. Before vacation, I started writing up a statically typed approach I developed that can deal with large vectors and multi-sofa analysis, like DKPro's features, with no need for a separate type system. I was going to finish it next week and get some feedback from a couple of UIMA users I know, but it looks like it could be relevant to this discussion, so here it is in raw and unfinished form:

http://people.csail.mit.edu/cyphers/uima/xcas.pdf

The implementation is quite simple and I can release it under a BSD license. I am in the "keep frameworks minimal and interoperable with other frameworks" camp. Unfortunately, this is about the limit of my participation until next week.
Scott

> On Jun 24, 2015, at 9:46 AM, Marshall Schor <[email protected]> wrote:
>
> On 6/24/2015 3:12 AM, Richard Eckart de Castilho wrote:
>> On 23.06.2015, at 17:11, Marshall Schor <[email protected]> wrote:
>>
>>> I added a wiki page to develop the ideas here.
>>>
>>> This is what I got from reading this:
>>>
>>> One idea is having an annotator not have a type system specification,
>>> but rather have it dynamically create types / features according to
>>> some configuration info or some dynamically-obtained information
>>> (perhaps the results of some previous analysis).
>> I think that right now, an annotator doesn't need to have a type system
>> specification.
> This seems true, especially for "generic" kinds of annotators that aren't
> tied to particular types/features.
> Those annotators have the ability to query the TypeSystem and we have APIs
> that use indirection in specifying both types and features.
> That is, instead of saying instance_of_annotation.setBegin(123), where
> instance_of_annotation is statically typed to be a (subtype of)
> annotation, and the begin feature is hard-coded directly into the code
> via "setBegin",
> you could write someFS.setIntValue(someIndirectionToAfeature, 123), where
> someFS could be typed as a generic FeatureStructure, and
> someIndirectionToAfeature could be of the generic type Feature, and set to
> the particular feature elsewhere.
>
> This indirection has pros and cons. Pros: it enables generic annotators
> where the types and features are not explicit in the code. Cons: when the
> types and features are known, the annotator code is easier to read
> without the indirection. Also, there can be performance / space issues
> with indirection.
>
>> The specification is necessary to create a CAS, but not to create an
>> annotator.
>> With uimaFIT, it is common to first create a CAS (based on types
>> automatically detected in the classpath), then fill that CAS with some
>> initial information (avoiding a reader for easy embedding into an
>> application), and then to pass that CAS through an aggregate. While
>> uimaFIT also adds the automatically detected types to every analysis
>> engine description in the aggregate, I believe this is not really
>> necessary because the CAS has already been initialized.
>>
>> Independent from that is the problem that the type system is locked
>> after the CAS has been created. Engines such as Ruta would profit if the
>> type system would at least allow compatible changes such as adding new
>> types or adding new features to existing types. The types may not be
>> known at the time the CAS is initialized, but only when the CAS is
>> actually being processed.
> Some languages (Ruby, JavaScript) allow dynamic modification of classes.
> So new types can be defined, and new features can be added to classes.
> In fact, I found this web article which gives a very long list of
> languages (Java not among them) where fields can be added to a class at
> runtime:
> http://rosettacode.org/wiki/Add_a_variable_to_a_class_instance_at_runtime
>
> In Java, you can add classes at runtime; but modifying existing classes
> (to add additional fields) is not supported.
> UIMA's current design (where Java is optional) might be able to be
> extended to support new types and additional fields, at some cost in
> performance/space.
> The recently proposed cas-object design could also partially support
> this, I think. (It couldn't support 1) create a FS with 3 features,
> 2) add feature # 4, 3) set feature # 4 in the already created FS.) More
> dynamic data structures of course do support this idea of dynamically
> extensible Types.
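
The three-step case above (create an FS, add a fourth feature to its type, then set that feature on the already created FS) can be illustrated with a small sketch. It is written in Python for brevity and does not use the UIMA API: `FsType`, `Feature`, and `FeatureStructure` are hypothetical stand-ins, and the dict-backed storage is only one assumed way of trading some performance/space for extensibility.

```python
"""Toy model of a type that stays open after feature structures exist.

All names here are hypothetical illustrations, not UIMA classes.
"""
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Feature:
    """Handle for a feature; code passes this around instead of calling a
    hard-coded setter such as setBegin."""
    name: str
    range_type: type


@dataclass
class FsType:
    name: str
    features: dict = field(default_factory=dict)

    def add_feature(self, name, range_type):
        """Compatible change only: features are added, never removed."""
        if name in self.features:
            raise ValueError(f"feature {name!r} already defined on {self.name}")
        feat = Feature(name, range_type)
        self.features[name] = feat
        return feat


class FeatureStructure:
    """Keeps values in a dict keyed by Feature, so a feature added later
    needs no change to the layout of existing instances."""

    def __init__(self, fs_type):
        self.fs_type = fs_type
        self._values = {}

    def set_value(self, feat, value):
        # The type is consulted at call time, so features added after this
        # FS was created are accepted.
        if self.fs_type.features.get(feat.name) is not feat:
            raise KeyError(f"{feat.name!r} is not a feature of {self.fs_type.name}")
        if not isinstance(value, feat.range_type):
            raise TypeError(f"{feat.name!r} expects {feat.range_type.__name__}")
        self._values[feat] = value

    def get_value(self, feat):
        return self._values.get(feat)  # None if the feature was never set


token = FsType("Token")
begin = token.add_feature("begin", int)
end = token.add_feature("end", int)
pos = token.add_feature("pos", str)

fs = FeatureStructure(token)             # 1) FS created; its type has 3 features
fs.set_value(begin, 0)
lemma = token.add_feature("lemma", str)  # 2) feature # 4 added afterwards
fs.set_value(lemma, "be")                # 3) set on the already created FS
```

A fixed-slot layout generated ahead of time (as with statically generated cover classes) would have no slot for `lemma` in step 3, which is the limitation described above; the dict lookup is what pays for the flexibility.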
>
> Other alternative JCas approaches, which generate a full JCas cover class
> automatically from the merged type systems, would also have problems with
> adding features to existing Types, but could define dynamic new types.
>
> Finally, we could modify the Java cover class design to support a hybrid -
> those things known ahead could be statically typed, and those things added
> dynamically could be handled with more flexible augmentations embedded
> into the generated class; maybe this allows the best of both worlds.
>
> The usual pros/cons apply.
>>> Another idea is having an annotator be able to read Feature Structure
>>> data from a wide variety of sources, and have the data include the
>>> type/feature metadata (either externally - as we do now in UIMA with a
>>> type system external XML specification, or embedded - like JSON would
>>> naturally do). Such an annotator would have some notion of the type /
>>> feature information it was interested in processing, but could ignore
>>> the rest.
>> Let's see...
>>
>> a) easier ingestion of data into feature structures, optimally by
>> automatically creating FSes based on a (typed) external data
>> description. E.g. a JSON object like
>>
>> { "fs1": {"feature1": "value1", "feature2": 10 } }
>>
>> would be converted to a FS with a string feature1 and a numeric feature2.
>> However, the type of the FS would basically be underspecified in the type
>> system, as the next feature structure read could have the same features
>> using different value ranges, and in fact the type of the FS itself is
>> unknown. Sounds as if heading towards some kind of duck-typing, e.g. for
>> annotations (if it has a begin/end, then it is an annotation).
> An interesting thing to observe is that in this direction of "simplicity",
> the ideas of Views and Sofas and Indexes might be optional?
> A thought experiment: is there a decomposition for UIMA facilities that
> can omit these kinds of things if not "needed", yet gradually include this
> functionality for more complex implementations?
>
>>
>> b) I didn't get the part about an annotator being interested in some
>> type/feature information but being able to ignore the rest.
> This is the concept (already present in the way UIMA deserializers operate
> for remote annotators) that when reading an external representation, you
> don't have to be able to handle all the types and features. You can
> "ignore" those you don't recognize, and just work with those you're
> interested in.
>>
>>> Finally, a third idea is to have the componentization be such that no
>>> "UIMA Framework" was needed, or if present, it's hidden. I'm thinking
>>> that this means, for simpler analytics, the idea of a pipeline, and
>>> combining things, would not be present; it would be more like just a
>>> single annotator. For more complex things, the idea of a pipeline would
>>> be encapsulated (like UIMA's Aggregates), and the whole thing would look
>>> like something that could be embedded in any of the other "big data"
>>> frameworks as an analysis piece. The implication is that this would
>>> enable using other frameworks' scaleout mechanisms.
>> uimaFIT goes a long way in "hiding" the bulk of the UIMA framework and
>> providing a rather sane Java API for pipelines. It makes the creation of
>> a POJO wrapper around them a breeze. People do use this to embed UIMA in
>> alternative scale-out frameworks such as Hadoop.
>>
>> Just for the sake of knowing where this is going, assuming the UIMA core
>> API as a baseline and the uimaFIT API as an improvement, what would this
>> further improvement look like?
> It might look like some kind of layering, stripping out complexity (until
> needed). (See thought experiment, above.)
>>
>>
>> Or would the issue be solvable by integrating uimaFIT into the core
>> (e.g. to avoid re-approval of libraries by company legal departments)?
> I don't think this integration solves this issue, but integrating uimaFIT
> into the core seems like a good thing to work on (it's an item in the v3
> wiki page).
>
> -Marshall
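
As a rough illustration of two points from the thread above (reading JSON into feature structures without a predeclared type system, and ignoring whatever the annotator is not interested in), here is a small Python sketch. It is not UIMA code: `read_feature_structures`, its `wanted` parameter, and `looks_like_annotation` are hypothetical names, the second JSON entry is made up for the example, and the begin/end rule is just the duck-typing heuristic mentioned in the thread.

```python
import json


def read_feature_structures(text, wanted=None):
    """Parse a JSON object of the form {"id": {"feature": value, ...}, ...}.

    Each inner object becomes a dict of feature name -> value, so feature
    ranges are whatever JSON delivered (str, int, float, ...). If `wanted`
    is given, features outside that set are silently dropped.
    """
    result = {}
    for fs_id, features in json.loads(text).items():
        if not isinstance(features, dict):
            continue  # not shaped like a feature structure; skip it
        result[fs_id] = {
            name: value
            for name, value in features.items()
            if wanted is None or name in wanted
        }
    return result


def looks_like_annotation(fs):
    """Duck typing: anything with integer begin and end is an annotation."""
    return all(type(fs.get(key)) is int for key in ("begin", "end"))


doc = """{
  "fs1": {"feature1": "value1", "feature2": 10},
  "fs2": {"begin": 0, "end": 5, "pos": "NN", "confidence": 0.9}
}"""

all_fs = read_feature_structures(doc)
only_spans = read_feature_structures(doc, wanted={"begin", "end"})
```

Here `all_fs` keeps every feature it was given, while `only_spans` keeps just `begin` and `end` and leaves `fs1` empty. Nothing in this sketch says what the type of `fs1` is, which is the underspecification concern raised in the thread.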
