Adam Lally wrote:
On 12/21/06, Thilo Goetz <[EMAIL PROTECTED]> wrote:
I didn't mean to suggest to have duplicate indexes.  What I meant to say
was, each view should have its own annotation index.  In the CAS, each
of these annotation indexes can be accessed separately.  In fact, I
think this is pretty much what you're saying as well.  I don't see a use
case for a global merged annotation index, other than tooling and
utilities.  And even for tooling, I think it makes sense to access the
annotation for each view separately.

I think maybe we should take a step back and try to agree on a few
basic things that we want to be true of CASes and CasViews.  Here are
the ideas that I had, mostly drawing on the definition in the UIMA
spec proposal.

(1) The CAS is the container for all of the analysis data (as per the
UIMA spec).  It must be possible to create FS directly on the CAS
and there must be some reasonable way to retrieve the FS in the CAS
without having to be concerned with views.

Agreed. It should be possible to say, on the global index repository: give me all indexes. This will include the global indexes, as well as all view-specific indexes. You can then iterate over all data in all indexes, without knowing anything about views.


(2) A CasView is a way of accessing a subset of FS in the CAS.  It
must be possible
to assert that an FS is a _member_ of a CasView, and there must be
some reasonable way to retrieve the members of the CasView.

In the general CAS, we can only access those FSs that are in some index. If you need to be able to retrieve any FS whatsoever, you need to define a bag index over all types. I would propose to handle views the same way: an FS is a member of a view if and only if it is contained in one of the indexes specific to that view. The same FS may live in several indexes belonging to different views. That seems to be in accordance with the spec proposal.
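
To make the membership rule concrete, here is a minimal sketch. It is a toy model, not the UIMA API: ToyCas, addToViewIndex, isMember and members are hypothetical names, views and index labels are plain strings, and an FS is any Java object.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model of view membership; none of these names exist in UIMA.
final class ToyCas {
    // view name -> (index label -> FSs held by that view-specific index)
    private final Map<String, Map<String, List<Object>>> viewIndexes = new HashMap<>();

    // Adds an FS to one index that belongs to one view.
    void addToViewIndex(String view, String indexLabel, Object fs) {
        viewIndexes.computeIfAbsent(view, v -> new HashMap<>())
                   .computeIfAbsent(indexLabel, l -> new ArrayList<>())
                   .add(fs);
    }

    // An FS is a member of a view iff at least one of the view's indexes contains it.
    // Note: this toy version compares with equals(), not object identity.
    boolean isMember(String view, Object fs) {
        Map<String, List<Object>> indexes = viewIndexes.get(view);
        if (indexes == null) {
            return false;
        }
        for (List<Object> index : indexes.values()) {
            if (index.contains(fs)) {
                return true;
            }
        }
        return false;
    }

    // The members of a view are the union of the contents of its indexes.
    Set<Object> members(String view) {
        Set<Object> result = new LinkedHashSet<>();
        Map<String, List<Object>> indexes = viewIndexes.get(view);
        if (indexes != null) {
            for (List<Object> index : indexes.values()) {
                result.addAll(index);
            }
        }
        return result;
    }
}
```

In this sketch the same FS can be added to indexes of two different views and is then a member of both, while an FS that is in no view-specific index is a member of no view.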

<snip>
If we need to iterate over
annotations from different views sorted by their offsets, irrespective
of the sofa they point into, we can provide a utility function that does
that on the fly.
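
Such a utility could be a plain k-way merge over the per-view indexes. The sketch below uses hypothetical classes (Span, OffsetMerge) rather than UIMA types, and assumes each per-view list is already sorted by begin, then end, both ascending. For brevity it returns a list instead of a lazy iterator.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-in for an annotation: offsets plus the view it came from.
final class Span {
    final int begin;
    final int end;
    final String view;

    Span(int begin, int end, String view) {
        this.begin = begin;
        this.end = end;
        this.view = view;
    }
}

final class OffsetMerge {
    // Assumed ordering for this sketch: begin ascending, then end ascending.
    static final Comparator<Span> BY_OFFSET =
        Comparator.comparingInt((Span s) -> s.begin).thenComparingInt(s -> s.end);

    // One cursor per view: the current head element plus the rest of that view's index.
    private static final class Cursor {
        Span head;
        final Iterator<Span> rest;

        Cursor(Iterator<Span> it) {
            this.rest = it;
            this.head = it.next();
        }
    }

    // Merges per-view lists (each already sorted by BY_OFFSET) into one sorted list.
    static List<Span> merge(List<List<Span>> sortedPerView) {
        PriorityQueue<Cursor> queue =
            new PriorityQueue<>((a, b) -> BY_OFFSET.compare(a.head, b.head));
        for (List<Span> index : sortedPerView) {
            if (!index.isEmpty()) {
                queue.add(new Cursor(index.iterator()));
            }
        }
        List<Span> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            Cursor c = queue.poll();       // cursor with the smallest head
            merged.add(c.head);
            if (c.rest.hasNext()) {        // advance it and put it back
                c.head = c.rest.next();
                queue.add(c);
            }
        }
        return merged;
    }
}
```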

I agree that it doesn't make much sense that if I access annotations
irrespective of sofas, they would be sorted by begin, end.  However, I
still think I might just want to get all annotations (of some type)
and not care about the order.

You can do that under my proposal: just get all annotation indexes for all views and iterate over each of them in turn. If we need a utility function for that, it's easy enough to do.
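
As a rough sketch of that utility (hypothetical names, not the UIMA API; a view's annotation index is modelled as a plain list keyed by view name):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: visits every view's annotation index in turn.
final class AllViewsAnnotations {
    // Concatenates the per-view annotation indexes, one view after the other.
    // No ordering is promised across views, and an annotation that is indexed
    // in several views shows up once per view.
    static <T> List<T> collect(Map<String, List<T>> annotationIndexByView) {
        List<T> all = new ArrayList<>();
        for (List<T> index : annotationIndexByView.values()) {
            all.addAll(index);
        }
        return all;
    }
}
```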



Note however that this implies that one should never do addFsToIndexes()
on the CAS with an annotation, as it would be added to all annotation
indexes.  My suggestion implies that the index repository itself is
agnostic of views and sofas.  If you add an annotation to the wrong
repository, it's your own fault.


This behavior doesn't mesh well with the 3 ideas above.  To me,
indexing an FS in the CAS just means that I want to be able to
retrieve this FS back out of the CAS later.  It does not mean that I'm
asserting it to be a member of any view.

To me, a view is just a set of indexes; more precisely, it's a subset of the set of all indexes, which are exactly the indexes defined in the CAS. When I add an FS to all those indexes, it will be added to all applicable indexes, and that includes all view indexes. Alternatively, we can say that adding an FS in the CAS means adding it to the global, non-view indexes only. That would make sense, but it doesn't fit with the idea that the CAS index repository contains all indexes, not just the global ones. Maybe we need a special API for that, addFsToGlobalIndexes(). And maybe getGlobalIndexRepository() should be called something else to avoid confusion, getCompleteIndexRepository() or something.
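
The difference between the two readings can be sketched with a toy repository. This is not the UIMA implementation: ToyIndexRepository and defineView are made-up names, addFsToGlobalIndexes() is only the API proposed above, and each view is reduced to a single index.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy repository: one global index plus one index per view.
final class ToyIndexRepository {
    private final List<Object> globalIndex = new ArrayList<>();
    private final Map<String, List<Object>> viewIndexes = new LinkedHashMap<>();

    void defineView(String view) {
        viewIndexes.putIfAbsent(view, new ArrayList<>());
    }

    // First reading: adding on the CAS adds to ALL indexes, view-specific ones included.
    void addFsToIndexes(Object fs) {
        globalIndex.add(fs);
        for (List<Object> index : viewIndexes.values()) {
            index.add(fs);
        }
    }

    // Alternative reading (proposed name): only the global, non-view indexes are touched.
    void addFsToGlobalIndexes(Object fs) {
        globalIndex.add(fs);
    }

    List<Object> globalIndex() {
        return globalIndex;
    }

    // Returns the index of the given view, or an empty list for an unknown view.
    List<Object> viewIndex(String view) {
        List<Object> index = viewIndexes.get(view);
        return index != null ? index : new ArrayList<>();
    }
}
```

With two views defined, addFsToIndexes() leaves the FS in both view indexes as well as the global one, which is exactly why it should not be used for annotations; addFsToGlobalIndexes() leaves the view indexes untouched.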


Moreover, I think the reverse direction should be true -- indexing an
FS in a view's index repository DOES add it (at least conceptually) to
indexes that apply to the CAS as a whole.  I liked this latter idea
because it provided a way to get at all the FS in the CAS without
having to be concerned with views.

I agree, and I hope that has been clear from my previous posts. Any view-specific index is visible from the CAS, in my approach.



So to summarize, I would suggest that annotation indexes, for example,
only live in views, there is no global annotation index (neither
conceptually, nor physically).  To access annotations from the CAS, you
still need to access view-specific indexes.

Non-sofa indexes, on the other hand, only exist in the global namespace.
The only rule of visibility is that one view cannot access the
view-specific indexes of another view. Everything else is always visible.

So what I haven't figured out for myself is, what makes a sofa-index a
sofa-index?  Do we need a declaration, or can we figure this out
automatically?


I think it's a view-index, not necessarily a sofa-index (for now it
doesn't matter, but we may someday break the 1-1 correspondence
between views and sofas).  I think the most general design here would
be to allow a declaration saying which view(s) the index belongs to,
and/or whether it belongs to the CAS as a whole.  (I think it could be
both.)  In the absence of such a declaration, the index applies to all
views for backwards compatibility and I think maybe also applies to
the CAS as a whole.  The nice thing about the default being that it
applies to everything is that we can put off implementing
view-restricted indexes until later; I think adding them is more a
performance optimization than anything else, eliminating the creation of
unneeded indexes.

I am very much concerned with performance, and it needs to be a consideration from the start. We simply can't add every annotation to two indexes by default. I also don't want to start this discussion again in 3 months. If we can't get this decided for 2.1, so be it. Then let's not change anything now and do it right for 2.2.

--Thilo
