Hi,
At the risk of maybe discussing things that have been previously discussed :-) ,
here are some thoughts. I'm thinking (mainly) from the perspective of UIMA
processing the extracts of a CAS Store. One could, of course, also imagine
non-UIMA kinds of processing of extracts of a CAS Store - e.g., count the number
of annotations of a certain kind in the store.
=========
Re: Globally Unique Ids (GUID) for CASes and FeatureStructures (FSs).
Since FSs are associated with a particular CAS, maybe it is useful to think of
the GUIDs as 2 parts: 1) a GUID for the CAS itself, plus 2) some scheme to
number each Feature Structure in the CAS.
In this approach, the FS part of the GUID could, in the majority of cases, be a
1-word int (although some 'escape' would be needed for the rare case where the
number of FSs that could exist in a CAS (over time) exceeded the limits imposed
by 1 word).
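
To make the two-part idea concrete, here is a minimal sketch in Java. All the
names (FsGuid, fitsInOneWord, parse) are made up for illustration and are not
part of UIMA; it also assumes the CAS GUID is a java.util.UUID and that
"1 word" means a 32-bit int, with a long serving as the 'escape'.

```java
import java.util.UUID;

/** Hypothetical two-part id: a GUID for the CAS plus a per-CAS FS number. */
final class FsGuid {
    final UUID casId;     // identifies the CAS (assumed to be a UUID here)
    final long fsNumber;  // numbers the FS within that CAS

    FsGuid(UUID casId, long fsNumber) {
        if (fsNumber < 0) {
            throw new IllegalArgumentException("fsNumber must be >= 0");
        }
        this.casId = casId;
        this.fsNumber = fsNumber;
    }

    /** True when the FS number fits the common one-word (32-bit) form. */
    boolean fitsInOneWord() {
        return fsNumber <= Integer.MAX_VALUE;
    }

    /** Text form "casId:fsNumber"; the escape case just has more digits. */
    @Override
    public String toString() {
        return casId + ":" + fsNumber;
    }

    /** Inverse of toString(). */
    static FsGuid parse(String s) {
        int sep = s.lastIndexOf(':');
        UUID cas = UUID.fromString(s.substring(0, sep));
        return new FsGuid(cas, Long.parseLong(s.substring(sep + 1)));
    }
}

final class FsGuidDemo {
    public static void main(String[] args) {
        UUID cas = UUID.randomUUID();
        FsGuid small = new FsGuid(cas, 42);
        FsGuid large = new FsGuid(cas, 5_000_000_000L);
        System.out.println(small + " one-word=" + small.fitsInOneWord());
        System.out.println(large + " one-word=" + large.fitsInOneWord());
    }
}
```

A real encoding would probably reserve a bit or a marker value for the escape
rather than widening every id; the sketch only shows the split into the two
parts.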
=========
Re: Loading parts of a particular CAS (e.g., a "projection" via some kind of
query, such as all Feature Structures of types X, Y, or Z).
- Feature Structures can have references to other FSs
- Feature Structures can be associated with a SofA - for instance, an
annotation over text, using its begin / end values to get the "covered" text.
When thinking about "loading" some part of a CAS via a projection, one has to
consider whether or not to load the SofA associated with it, and whether or not
to load referenced FSs (recursively, perhaps, as well). If the referenced FSs
were *not* loaded, we could imagine replacing the references with a special
value which indicated it was (a) not loaded, and (b) had the FS id part - to
enable "lazy" loading (if dereferenced).
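
A rough sketch of such a "special value", in Java. LazyFsRef and the loader
function are hypothetical names, not existing UIMA classes; the in-memory map
in the demo only stands in for a CAS Store lookup by FS id.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

/** Hypothetical reference slot: a loaded FS, or just the id of an unloaded one. */
final class LazyFsRef<T> {
    private final int fsId;               // FS id part, known even when not loaded
    private final IntFunction<T> loader;  // fetches the FS from the store on demand
    private T target;                     // null until the reference is dereferenced

    LazyFsRef(int fsId, IntFunction<T> loader) {
        this.fsId = fsId;
        this.loader = loader;
    }

    int fsId() {
        return fsId;
    }

    boolean isLoaded() {
        return target != null;
    }

    /** Dereference; the first call triggers the load, later calls reuse it. */
    T get() {
        if (target == null) {
            target = loader.apply(fsId);
        }
        return target;
    }
}

final class LazyFsRefDemo {
    public static void main(String[] args) {
        // Stand-in for the CAS Store: FS id -> some representation of the FS.
        Map<Integer, String> store = new HashMap<>();
        store.put(7, "Token[0,5]");

        LazyFsRef<String> ref = new LazyFsRef<>(7, id -> store.get(id));
        System.out.println("loaded before get: " + ref.isLoaded());
        System.out.println("value: " + ref.get());
        System.out.println("loaded after get: " + ref.isLoaded());
    }
}
```

Whether the recursion stops at the first level or continues would be up to
the loader; the placeholder itself does not need to know.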
=========
Re: Loading parts of a CAS - indexing FSs (or not). When a FS is loaded, a
decision has to be made - should it be "added to the indexes" or not? Adding to
indexes can be an expensive operation (depending on the indexes, etc.). If the
particular FS is one that is only located by dereferencing a FS reference, then
it won't need to be in the indexes (an efficiency optimization).
As an example, consider the built-in Feature Structure types supporting lists:
uima.cas.FSList and uima.cas.EmptyFSList. These are unlikely to be indexed, and
when loaded, they probably should not be indexed (for efficiency).
The existing UIMA serialization code records which FSs should be indexed upon
loading, and which shouldn't. This information is kept *per view* - that is, a
FS could be indexed in one view, and not in another view. This information
should probably be kept with the FS in a Cas Store, so later loading could do
the right thing.
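
One way this per-view information might be kept with a stored FS is sketched
below in Java. StoredFs and IndexPlanner are invented names for illustration,
not the existing serialization code; the only assumption is that the store
records, for each FS, the names of the views in which it was indexed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical stored form of an FS, with its per-view indexing record. */
final class StoredFs {
    final int fsId;
    final String type;
    final Set<String> indexedInViews;  // views whose indexes held this FS when saved

    StoredFs(int fsId, String type, String... indexedInViews) {
        this.fsId = fsId;
        this.type = type;
        this.indexedInViews = new TreeSet<>(Arrays.asList(indexedInViews));
    }
}

final class IndexPlanner {
    /** For each view, the ids of the loaded FSs to add to that view's indexes. */
    static Map<String, List<Integer>> plan(List<StoredFs> loaded) {
        Map<String, List<Integer>> byView = new TreeMap<>();
        for (StoredFs fs : loaded) {
            for (String view : fs.indexedInViews) {
                byView.computeIfAbsent(view, v -> new ArrayList<>()).add(fs.fsId);
            }
        }
        return byView;
    }
}

final class IndexPlannerDemo {
    public static void main(String[] args) {
        List<StoredFs> loaded = Arrays.asList(
            new StoredFs(1, "Token", "_InitialView"),
            new StoredFs(2, "uima.cas.NonEmptyFSList"),  // reached only via a reference
            new StoredFs(3, "Sentence", "_InitialView", "translation"));
        System.out.println(IndexPlanner.plan(loaded));
    }
}
```

The list node (id 2) never shows up in the plan, so it is loaded but skipped
when indexing, which is the efficiency point above.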
=========
Re: FS reference to another FS in a different CAS - This is not currently
supported, and there may be lots of issues to think through to do this in a
general manner, with the right efficiency tradeoffs.
=========
Re: reading collections of FSs from collections of CASes. This would happen, I
think, in the use-case described below as "READ FSes produced by a certain
annotator across all CASes in all collections or in a certain collection".
There are maybe two sub use-cases.
- (u1) One is where the READ is being done by some application outside of
UIMA.
- (u2) The other is where the intent is to run a UIMA pipeline over this
collection. This has 2 sub cases:
-- (u2a) One where each set of FSs associated with one particular CAS is
processed as a (partial) load of that CAS, and multiple of these (partially)
loaded CASes are processed.
-- (u2b) One where all of the FSs associated with all of the CASes are
loaded together into one new CAS (having of course a new CAS Id).
If the FS is "isolated", meaning that it has no reference to a SofA, or other FS
references, then a new CAS could be constructed with these FSs loaded. The
"unique" FSid (consisting of the GUID for the CAS + the FSid) would change, because
the CAS they were loaded into would have a new GUID.
But, if the FS is not "isolated", then if the use case envisions accessing that
FS's "covered text", for example, this would only fit into a CAS structure if
each loaded FS referring to a different SofA went into a separate view, and
SofAs for each view were added.
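
To illustrate the one-view-per-SofA arrangement, here is a small Java sketch.
SourcedAnnotation and MergedCasPlan are hypothetical names, and the "view-"
naming is just an assumption for the example; nothing here uses the real CAS
API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical annotation that remembers which CAS it came from. */
final class SourcedAnnotation {
    final String sourceCasId;
    final int begin;
    final int end;

    SourcedAnnotation(String sourceCasId, int begin, int end) {
        this.sourceCasId = sourceCasId;
        this.begin = begin;
        this.end = end;
    }
}

/** Plans a merged CAS: one view (with its own SofA text) per source CAS. */
final class MergedCasPlan {
    final Map<String, String> sofaByView = new LinkedHashMap<>();
    final Map<String, List<SourcedAnnotation>> fssByView = new LinkedHashMap<>();

    void add(SourcedAnnotation fs, Map<String, String> sofaTextByCas) {
        String view = "view-" + fs.sourceCasId;  // assumed naming scheme
        sofaByView.putIfAbsent(view, sofaTextByCas.get(fs.sourceCasId));
        fssByView.computeIfAbsent(view, k -> new ArrayList<>()).add(fs);
    }

    /** Covered text only makes sense relative to the view's own SofA. */
    String coveredText(String view, SourcedAnnotation fs) {
        return sofaByView.get(view).substring(fs.begin, fs.end);
    }
}

final class MergedCasPlanDemo {
    public static void main(String[] args) {
        Map<String, String> sofas = new HashMap<>();
        sofas.put("casA", "Hello world");
        sofas.put("casB", "Guten Tag");

        MergedCasPlan plan = new MergedCasPlan();
        SourcedAnnotation a = new SourcedAnnotation("casA", 0, 5);
        SourcedAnnotation b = new SourcedAnnotation("casB", 0, 5);
        plan.add(a, sofas);
        plan.add(b, sofas);

        System.out.println(plan.coveredText("view-casA", a));  // prints "Hello"
        System.out.println(plan.coveredText("view-casB", b));  // prints "Guten"
    }
}
```

The same begin / end values give different covered text depending on the view,
which is why the FSs cannot simply be pooled into one view.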
Likewise, if the FS is not "isolated", and some number of the FS refs were to
be dereferenced, then those references would be to FSs in different CASes. If
these were loaded into one CAS, either (a) they would lose their CAS-association
identity, or (b) we would have an FS reference to another FS in a different CAS.
So, perhaps the underlying assumption for this use-case is either (u1) or (u2a)
- avoiding (u2b) and its issues. Is that what is envisioned?
-Marshall
On 1/31/2013 6:33 PM, Neal R Lewis wrote:
> Hello All,
>
> Thank you again for all of your responses about the UIMA CAS Store. I'm glad
> that you were interested in this topic, and I would like to submit another
> summary to see if we can concisely define what would be requirements
> interfacing with a CAS Store.
>
> We talked a bit about implementation (Binary vs XMI, DB vs File system), but
> I would like to first discuss an interface for a CAS Store. The reason
> being is that it seems while there is consistent functionality in a CAS
> store, there might be different implementation constraints / preferences.
> I'll try to be concise, and if you would like to comment, please do so.
>
> Implementation:
> - Compatible with current UIMA implementations (UIMAj, UIMACpp, UIMAFit)
> - Well defined API
> - Documentation
>
> Functionality:
> - Accessible from a Web Service (SOAP / REST)
> - Maintain Collection of CASes
> - INSERT / DELETE/ UPDATE / READ CASes
> - INSERT / DELETE/ UPDATE / READ Cas Fragments (Objects within a CAS)
> - READ FSes produced by a certain annotator across all CASes in all
> collections or in a certain collection
> - Query CASes that already have annotations
> - Use stable identification of CAS
>
> As for the identification of CASes and objects within, I would like to push
> the idea of a Feature Structure ID, as I've written about before. Were there
> any other thoughts / suggestions about such an object?
>
>
>