On 23.06.2015, at 17:11, Marshall Schor <[email protected]> wrote:
> I added a wiki page to develop the ideas here.
>
> This is what I got from reading this:
>
> One idea is having an annotator not have a type system specification, but
> rather
> have it dynamically create types / features according to some configuration
> info
> or some dynamically-obtained information (perhaps the results of some previous
> analysis).
I think that right now, an annotator doesn't need to have a type system
specification.
The specification is necessary to create a CAS, but not to create an annotator.
With uimaFIT, it is common to first create a CAS (based on types automatically
detected in the classpath), then fill that CAS with some initial information
(avoiding a reader for easy embedding into an application), and then to pass
that CAS through an aggregate. While uimaFIT also adds the automatically
detected
types to every analysis engine description in the aggregate, I believe this
is not really necessary because the CAS has already been initialized.
Independent from that is the problem that the type system is locked after the
CAS has been created. Engines such as Ruta would benefit if the type system
at least allowed compatible changes, such as adding new types or adding new
features to existing types. The types may not be known at the time the CAS is
initialized, but only when the CAS is actually being processed.
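To make "compatible changes" concrete, here is a toy sketch (all names are
hypothetical, this is not the UIMA TypeSystem API) of a registry that keeps
accepting additive changes — new types, new features — but rejects an
incompatible change such as altering the range of an existing feature:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: type name -> (feature name -> range type name).
class GrowableTypeSystem {
    private final Map<String, Map<String, String>> types = new HashMap<>();

    // Adding a new type is always a compatible change.
    void addType(String name) {
        types.putIfAbsent(name, new HashMap<>());
    }

    // Adding a new feature is compatible; redefining an existing
    // feature with a different range is rejected.
    void addFeature(String type, String feature, String range) {
        Map<String, String> feats = types.get(type);
        if (feats == null) {
            throw new IllegalArgumentException("unknown type: " + type);
        }
        String old = feats.putIfAbsent(feature, range);
        if (old != null && !old.equals(range)) {
            throw new IllegalStateException("incompatible change to " + feature);
        }
    }

    boolean hasFeature(String type, String feature) {
        Map<String, String> feats = types.get(type);
        return feats != null && feats.containsKey(feature);
    }
}
```

In this model an engine like Ruta could keep calling addType/addFeature while
processing, without invalidating feature structures created earlier.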
> Another idea is having an annotator be able to read Feature Structure data
> from
> a wide variety of sources, and have the data include the type/feature metadata
> (either externally - as we do now in UIMA with a type system external XML
> specification, or embedded - like JSON would naturally do). Such an annotator
> would have some notion of the type / feature information it was interested in
> processing, but could ignore the rest.
Let's see...
a) easier ingestion of data into feature structures, optimally by automatically
creating FSes based on a (typed) external data description. E.g. a JSON
   object like

      { "fs1": { "feature1": "value1", "feature2": 10 } }
would be converted to a FS with a string feature1 and a numeric feature2.
   However, the type of the FS would basically be underspecified in the type
   system, as the next feature structure read could have the same features
   with different value ranges; in fact, the type of the FS itself is
   unknown. Sounds as if this is heading towards some kind of duck typing,
   e.g. for annotations (if it has a begin/end, then it is an annotation).
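The duck-typing idea could be sketched roughly like this (a self-contained
toy with made-up helper names, not UIMA API; the "JSON object" is modeled as
a plain map to stay stdlib-only):

```java
import java.util.Map;

// Toy sketch: infer type information from the features an FS happens to have.
class DuckType {
    // If a feature structure carries begin/end offsets, treat it as an
    // annotation ("if it has a begin/end, then it is an annotation").
    static boolean looksLikeAnnotation(Map<String, ?> fs) {
        return fs.containsKey("begin") && fs.containsKey("end");
    }

    // Guess a range type from a feature value. Two FSes using the same
    // feature with values of different inferred ranges would conflict,
    // which is exactly why the type stays underspecified.
    static String inferRange(Object value) {
        if (value instanceof Integer) return "uima.cas.Integer";
        if (value instanceof String) return "uima.cas.String";
        return "uima.cas.TOP";
    }
}
```

For the JSON example above, feature1 would come out as uima.cas.String and
feature2 as uima.cas.Integer — until the next object reuses one of the
features with a different value range.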
b) I didn't get the part about an annotator having some notion of the
   type/feature information it is interested in while being able to ignore
   the rest.
> Finally, a third idea is to have the componentization be such that no "UIMA
> Framework" was needed, or if present, it's hidden. I'm thinking that this
> means, for simpler analytics, the idea of a pipe line, and combining things,
> would not be present; it would be more like just a single annotator. For more
> complex things, the idea of a pipeline would be encapsulated (like UIMA's
> Aggregates), and the whole thing would look like something that could be
> embedded, in any of the other "big data" frameworks as an analysis piece. The
> implication is that this would enable using other frameworks' scaleout
> mechanisms.
uimaFIT goes a long way in "hiding" the bulk of the UIMA framework and
providing a rather sane Java API for pipelines. It makes the creation of a POJO
wrapper around them a breeze. People do use this to embed UIMA in alternative
scale-out frameworks such as Hadoop.
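Schematically, such a POJO wrapper boils down to hiding the pipeline behind a
single plain method — text in, result out, no CAS or descriptors on the
surface (illustrative plain-Java sketch, not actual uimaFIT code):

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Illustrative wrapper: the "pipeline" is just a list of processing steps,
// standing in for an aggregate of analysis engines.
class PojoAnalyzer {
    private final List<UnaryOperator<String>> steps;

    PojoAnalyzer(List<UnaryOperator<String>> steps) {
        this.steps = steps;
    }

    // The only public surface; this is what a Hadoop mapper (or any other
    // scale-out framework) would call.
    String analyze(String text) {
        String doc = text;
        for (UnaryOperator<String> step : steps) {
            doc = step.apply(doc);
        }
        return doc;
    }
}
```

The point is the shape of the API, not the internals: the embedding framework
only ever sees an ordinary Java object with an analyze() method.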
Just for the sake of knowing where this is going: assuming the UIMA core API
as a baseline and the uimaFIT API as an improvement, what would this further
improvement look like?
Or would the issue be solvable by integrating uimaFIT into the core (e.g. to
avoid re-approval of libraries by company legal departments)?
Cheers,
-- Richard