On Thu, 2015-07-23 at 11:01 -0400, Marshall Schor wrote:
> Hi Jörn,
>
> Thank you for your comments; I hope you can expand a bit (see below).
>
> On 7/23/2015 9:45 AM, Joern Kottmann wrote:
> > Well, I thought about something which can be done in 3, 4 or 5 lines of
> > code.
> >
> > To use a CAS, its first creating the TypeSystemDescriptor, creating an
> > empty CAS and then loading something into it.
> > Placing content in it is often done using an AE. If I want to reuse an
> > existing deserializer/serializer I always end up with an AE,
> > maybe there are some rare exceptions.
> >
> > In a bigger system there will be a couple of components dealing with CASes,
> > if there is a small change to the type system they all have to be updated,
> > even when they are not affected by the change, e.g. type addition or a
> > change to a type they don't use.
> I'd like to understand this better. Since the pipeline's final type system is
> created at pipeline-startup-time, from the "merge" of all the component's type
> systems, it seems to me that you would not need to update the type systems in
> other components not affected by the change?

I was not referring to UIMA components here.
Imagine a system that uses multiple AAEs to analyze some documents. The
documents might have really different types. The AAEs do a great job
adapting to the document types by using the right AEs to deal with the
content. An AE added to an AAE can also introduce new types. These new
types are merged into the type system of that particular AAE. The FSes
of those types that are added to the CAS might be interesting later when
viewing that document, or for things that are specific to that
particular document, but might not be important across the entire
document collection.
All the CASes output by these different AAEs have to be processed
further, and that is where things get tricky. The component dealing
with them is probably again very specific and might only want to look at
a few FeatureStructure types, or maybe at all of them.
How can we write a mapreduce job that processes all CASes (with slightly
different but not incompatible type systems) in a database? It could be
something as simple as counting all Email Address Annotations in all
those CASes.
To be able to load that content into a CAS we either have to swap the
type system per CAS type (not nice) or just merge all existing type
systems together.
If the type system in one of my AAEs now changes, e.g. a type addition,
the mapred job also has to be updated with the new type system, even
though it might never deal with that type.
Ok, maybe we can solve that somehow by stripping the unknown types from
the CASes.
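
To make the counting example concrete, here is a minimal sketch in plain Java. It does not use the UIMA API. StoredFs, EmailCount and the type name "example.EmailAddressAnnotation" are hypothetical names invented for illustration, and each stored CAS is assumed to be nothing more than a list of feature structures tagged with their type name. The job only looks at the one type it knows and skips everything else, so a type added in some pipeline never reaches it.

```java
import java.util.List;

// Hypothetical stored feature structure: a type name plus offsets.
// This is an assumption for illustration, not a UIMA class.
class StoredFs {
    final String typeName;
    final int begin;
    final int end;

    StoredFs(String typeName, int begin, int end) {
        this.typeName = typeName;
        this.begin = begin;
        this.end = end;
    }
}

class EmailCount {
    // Hypothetical type name; the only type this job knows about.
    static final String EMAIL_TYPE = "example.EmailAddressAnnotation";

    // "Map" step: count the FSes of the known type in one stored CAS.
    // FSes of any other type are skipped without being interpreted.
    static long countEmails(List<StoredFs> cas) {
        long count = 0;
        for (StoredFs fs : cas) {
            if (EMAIL_TYPE.equals(fs.typeName)) {
                count++;
            }
        }
        return count;
    }

    // "Reduce" step: sum the per-CAS counts over the whole collection.
    static long countAll(List<List<StoredFs>> casCollection) {
        long total = 0;
        for (List<StoredFs> cas : casCollection) {
            total += countEmails(cas);
        }
        return total;
    }
}
```

With this shape, two CASes produced by pipelines with different extra types can be counted by the same code, and neither a merged type system nor a per-CAS type system lookup is involved.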
> If the concern is the need to have a JCas cover class generated for the merged
> type system, version 3 is hoping to make that "automatic".
> > In our system we have many different
> > import pipelines, sometimes those pipelines have specific types which are
> > only used in an early stage, if a generic component has to deal with one of
> > those CASes the only good option is to merge all type systems together.
> Since UIMA pipelines do this type merge, I'm guessing you might be thinking
> about this outside of UIMA pipelines, such as a scenario where you have one
> step (using those many different import pipelines), and perhaps having those
> write out some CASs, and then wanting to read in those CASes in another step
> to be processed by your generic component, and therefore needing that 2nd
> step to have the merge of all the type systems together, to enable
> deserializing. Is this the scenario, or is there another use case you're
> thinking of?
Yes, but that generic component might have different requirements: maybe
it just deals with a few types it knows very well, or it can deal with
all types.
> If this is the scenario, another option would be to have the serialized CASes
> stored along with a reference to their type system, and have some new
> deserialization capability be able to locate the referred-to type system along
> with the CAS to be read in. Would that "solve" this issue, or are there other
> aspects?
It would probably solve it, but it is not a simple solution either. That
would mean that the type system gets switched frequently and has to be
looked up all the time.
There is a CAS. It may or may not contain FSes of a certain
type. The type is always known by the code dealing with the CAS.
Why do I first have to load the right type system to retrieve those
FSes?
With my CAS-like thing I can just write cas.getIndex("index37",
EmailAddressAnnotation.class) and it just returns them as Java objects
of type EmailAddressAnnotation.
There is no type system, and it doesn't work well with types I don't
know anything about, but at that place I am also not interested in
those.
In a different place in the system the code might only assume it is an
annotation and retrieve the same index as objects of type AnnotationFS:
cas.getIndex("index37", AnnotationFS.class)
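
For illustration, a class-based lookup like this could be sketched roughly as follows. This is a minimal sketch in plain Java, not the actual layer and not the UIMA API: SimpleCas, its add method and PersonAnnotation are hypothetical names, and AnnotationFS and EmailAddressAnnotation here are local stand-ins declared in the sketch itself. The assumption is that an index is just a named list of Java objects and that the requested class acts as the filter.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Local stand-in for a generic annotation; NOT org.apache.uima.cas.text.AnnotationFS.
interface AnnotationFS {
    int getBegin();
    int getEnd();
}

// Hypothetical plain Java annotation types, written by hand (no type system).
class EmailAddressAnnotation implements AnnotationFS {
    private final int begin;
    private final int end;

    EmailAddressAnnotation(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }

    public int getBegin() { return begin; }
    public int getEnd() { return end; }
}

class PersonAnnotation implements AnnotationFS {
    private final int begin;
    private final int end;

    PersonAnnotation(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }

    public int getBegin() { return begin; }
    public int getEnd() { return end; }
}

// Hypothetical CAS-like container: named indexes holding plain Java objects.
class SimpleCas {
    private final Map<String, List<Object>> indexes = new HashMap<>();

    // Adds an object to the named index, creating the index on first use.
    void add(String indexName, Object fs) {
        indexes.computeIfAbsent(indexName, k -> new ArrayList<>()).add(fs);
    }

    // Returns the entries of the named index that are instances of the
    // requested class. Entries of other classes are skipped, and an
    // unknown index name yields an empty list.
    <T> List<T> getIndex(String indexName, Class<T> type) {
        List<T> result = new ArrayList<>();
        List<Object> index =
            indexes.getOrDefault(indexName, Collections.<Object>emptyList());
        for (Object fs : index) {
            if (type.isInstance(fs)) {
                result.add(type.cast(fs));
            }
        }
        return result;
    }
}
```

In this sketch, asking for EmailAddressAnnotation.class returns only the email annotations in the index, while asking for AnnotationFS.class returns every entry that implements the stand-in interface, so specific and generic code can share the same index.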
With its static type system, UIMA doesn't make it easy to write generic
code that works with CASes.
> >
> > The way we use UIMA is that we let it process our content with different
> > custom pipelines, and at the end of each pipeline the results are converted
> > into POJOs and those are written into a database, all code which follows
> > just uses the POJOs to process the data. My point is: If the CAS would be
> > in a better state we could just use it through out the entire application
> > instead of our CAS-like layer.
>
> In version 3, we're planning on storing the Feature Structures as just
> instances of their JCas Java Cover Objects, pretty close to POJOs. So maybe
> there's a good chance...
Do we just use POJOs or are they again generated from a type system?
Jörn
