Re: CAS and CasView redesign - question if all views should share the same indexes?

2006-12-27 Thread Marshall Schor

Thilo Goetz wrote:

Marshall Schor wrote:

Thilo Goetz wrote:


From a performance perspective, I'd vote for having the filtering on 
the iterator side of things, where it already is.  If one annotator 
decides it needs a "filtered index" over annotations, that can 
affect the performance of all other annotators as well, because then 
all annotations not only go into the regular annotation index, but 
the additional index as well.
Wouldn't the performance be better with the filtering on the indexing 
side, if the #writes/updates  << # read accesses to the filtered set?


No, because the way I see it, no filtering would ever be necessary.  
If you have a different annotation index for each anchored view, you 
don't need to do any filtering at indexing time, nor at access time.
Good point :-)  I was assuming, just for the purpose of exploring (not 
advocating :-) ) having only one index-set, and "add-to-indexes" would 
go through all of them and the "filter" would only add the item to the 
right view.


Your design point assumed different indexes per "anchored view".

-Marshall



Re: CAS and CasView redesign - question if all views should share the same indexes?

2006-12-27 Thread Thilo Goetz

Marshall Schor wrote:

Thilo Goetz wrote:


From a performance perspective, I'd vote for having the filtering on 
the iterator side of things, where it already is.  If one annotator 
decides it needs a "filtered index" over annotations, that can affect 
the performance of all other annotators as well, because then all 
annotations not only go into the regular annotation index, but the 
additional index as well.
Wouldn't the performance be better with the filtering on the indexing 
side, if the #writes/updates  << # read accesses to the filtered set?


No, because the way I see it, no filtering would ever be necessary.  If 
you have a different annotation index for each anchored view, you don't 
need to do any filtering at indexing time, nor at access time.
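
A toy illustration of this point, assuming one annotation index instance per anchored view. The class and method names are hypothetical stand-ins, not the UIMA API. Adds go straight to the owning view's index and reads return that index untouched, so no predicate runs on either side.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical classes, not the UIMA API.
final class PerViewIndexes {
    // One annotation index instance per anchored view, keyed by view name.
    private final Map<String, List<String>> indexByView = new HashMap<>();

    // Adding through a view touches only that view's index; no predicate is evaluated.
    void addToIndexes(String viewName, String annotation) {
        indexByView.computeIfAbsent(viewName, k -> new ArrayList<>()).add(annotation);
    }

    // Reading a view returns its own index as-is; nothing has to be filtered out.
    List<String> annotationsOf(String viewName) {
        return indexByView.getOrDefault(viewName, new ArrayList<>());
    }

    public static void main(String[] args) {
        PerViewIndexes cas = new PerViewIndexes();
        cas.addToIndexes("english", "Token[0,5]");
        cas.addToIndexes("german", "Token[0,4]");
        cas.addToIndexes("english", "Token[6,11]");
        System.out.println(cas.annotationsOf("english"));
        System.out.println(cas.annotationsOf("german"));
    }
}
```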


--Thilo




Re: CAS and CasView redesign - question if all views should share the same indexes?

2006-12-23 Thread Marshall Schor

Thilo Goetz wrote:

Marshall Schor wrote:

Adam Lally wrote:

On 12/22/06, Marshall Schor <[EMAIL PROTECTED]> wrote:
If we had filtering predicates as part of an index specification, then we
could create indexes over subsets of types quite arbitrarily. Could this
more general mechanism serve this purpose better than views?


I'm not sure what you mean, "subsets of types".  Do you mean "subsets
of objects (FeatureStructures)", as in a filter that checks arbitrary
feature values to decide whether an object gets added to the index?

Yes.

Could be... this sounds like it's saying that an index is a way to
optimize what could be implemented by an annotator using a filter over
all FS in the CAS followed by a sort.

Right.


From a performance perspective, I'd vote for having the filtering on 
the iterator side of things, where it already is.  If one annotator 
decides it needs a "filtered index" over annotations, that can affect 
the performance of all other annotators as well, because then all 
annotations not only go into the regular annotation index, but the 
additional index as well.
Wouldn't the performance be better with the filtering on the indexing 
side, if the #writes/updates  << # read accesses to the filtered set?


I'm thinking that the best thing is to have clear, documented 
performance expectations, and let the developer choose.
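
To make the two placements concrete, here is a toy sketch that counts predicate evaluations. It is not UIMA code; `FilterCostDemo` and its even-number predicate are hypothetical. With index-side filtering the checks are paid once per add; with iterator-side filtering they are paid again on every read pass.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Toy model only: these classes are hypothetical and are not UIMA API.
final class FilterCostDemo {
    // Number of predicate evaluations, so the two strategies can be compared.
    static int checks = 0;

    static final Predicate<Integer> IS_EVEN = x -> { checks++; return x % 2 == 0; };

    // Index-side filtering: every add pays for the extra index, later reads need no checks.
    static List<Integer> buildFilteredIndex(List<Integer> all) {
        List<Integer> filtered = new ArrayList<>();
        for (Integer x : all) {
            if (IS_EVEN.test(x)) filtered.add(x);
        }
        return filtered;
    }

    // Iterator-side filtering: adds are untouched, every read pass pays for the checks.
    static List<Integer> readWithFilter(List<Integer> all) {
        List<Integer> out = new ArrayList<>();
        for (Integer x : all) {
            if (IS_EVEN.test(x)) out.add(x);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> all = new ArrayList<>();
        for (int i = 0; i < 10; i++) all.add(i);

        checks = 0;
        List<Integer> idx = buildFilteredIndex(all);  // checks paid once, at write time
        int indexSide = checks;

        checks = 0;
        for (int pass = 0; pass < 3; pass++) readWithFilter(all);  // checks paid per read pass
        int iteratorSide = checks;

        System.out.println("filtered index: " + idx);
        System.out.println("index-side checks: " + indexSide
                + ", iterator-side checks: " + iteratorSide);
    }
}
```

Which side wins depends on the ratio of writes to reads, and the index-side cost is paid by every component that adds annotations, not only by the one that wanted the filtered index.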


-Marshall


Re: CAS and CasView redesign - question if all views should share the same indexes?

2006-12-22 Thread Thilo Goetz

Marshall Schor wrote:

Adam Lally wrote:

On 12/22/06, Marshall Schor <[EMAIL PROTECTED]> wrote:
If we had filtering predicates as part of an index specification, then we
could create indexes over subsets of types quite arbitrarily. Could this
more general mechanism serve this purpose better than views?


I'm not sure what you mean, "subsets of types".  Do you mean "subsets
of objects (FeatureStructures)", as in a filter that checks arbitrary
feature values to decide whether an object gets added to the index?

Yes.

Could be... this sounds like it's saying that an index is a way to
optimize what could be implemented by an annotator using a filter over
all FS in the CAS followed by a sort.

Right.


From a performance perspective, I'd vote for having the filtering on 
the iterator side of things, where it already is.  If one annotator 
decides it needs a "filtered index" over annotations, that can affect 
the performance of all other annotators as well, because then all 
annotations not only go into the regular annotation index, but the 
additional index as well.


--Thilo


Re: CAS and CasView redesign - question if all views should share the same indexes?

2006-12-22 Thread Marshall Schor

Adam Lally wrote:

On 12/22/06, Marshall Schor <[EMAIL PROTECTED]> wrote:
If we had filtering predicates as part of an index specification, then we
could create indexes over subsets of types quite arbitrarily. Could this
more general mechanism serve this purpose better than views?


I'm not sure what you mean, "subsets of types".  Do you mean "subsets
of objects (FeatureStructures)", as in a filter that checks arbitrary
feature values to decide whether an object gets added to the index?

Yes.

Could be... this sounds like it's saying that an index is a way to
optimize what could be implemented by an annotator using a filter over
all FS in the CAS followed by a sort.

Right.

> Going back to my hypothetical annotator that created an annotation off
> the base CAS by calling CAS.createAnnotation(begin, end, Sofa).  In
> our current implementation this isn't useful because the annotation
> has to be indexed to be retrievable, and the only way to index it is
> to add it to a view.  Are there any other options we could consider
This doesn't seem correct: an annotation doesn't have to be indexed to be
retrievable - it could be referenced by some chain of FS, with the starting
FS of course being indexed.  So I could have an FSArray for instance, of
Annotations, and index the FSArray.


Yes, of course; I glossed over that detail.  I don't think it really
affects my point, though.  The only way to index the FSArray
containing my annotation would be to add it to a view, which I don't
want to do.  Is there a way to make my Annotation accessible from the
base CAS without having to go through a view first?
I guess I missed the point ...  
Indexes are sometimes used as a performance optimization, but other times
they're part of a component's logic - as when a component depends on
a particular sorting order.


The annotator could do the sorting itself.  But I'll correct my
statement to say that indexes are both a performance optimization and
a convenience.
I think you may be technically correct, but my point was that users tend to
think of indexes differently (as more than just optimizations and
conveniences) - they think of them as part of their component logic.  I'm
thinking of the users that use the special iterators over Annotation types,
and depend on things like Type Priorities for correct operation of their
components.

-Marshall


Re: CAS and CasView redesign - question if all views should share the same indexes?

2006-12-22 Thread Adam Lally

On 12/22/06, Marshall Schor <[EMAIL PROTECTED]> wrote:

If we had filtering predicates as part of an index specification, then we
could create indexes over subsets of types quite arbitrarily. Could this
more general mechanism serve this purpose better than views?


I'm not sure what you mean, "subsets of types".  Do you mean "subsets
of objects (FeatureStructures)", as in a filter that checks arbitrary
feature values to decide whether an object gets added to the index?

Could be... this sounds like it's saying that an index is a way to
optimize what could be implemented by an annotator using a filter over
all FS in the CAS followed by a sort.
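
As a rough sketch of that equivalence, this is what an annotator would do by hand without an index: scan every FS, keep the ones that match, and sort the result. `ToyFs` and `FilterThenSort` are hypothetical stand-ins, not UIMA classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Toy model only: ToyFs is a hypothetical stand-in for a feature structure.
final class ToyFs {
    final String type;
    final int begin;
    ToyFs(String type, int begin) { this.type = type; this.begin = begin; }
    @Override public String toString() { return type + "@" + begin; }
}

final class FilterThenSort {
    // Scan everything, keep the matching feature structures, then sort them by begin.
    static List<ToyFs> select(List<ToyFs> allFs, String wantedType) {
        return allFs.stream()
                .filter(fs -> fs.type.equals(wantedType))
                .sorted(Comparator.comparingInt((ToyFs fs) -> fs.begin))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<ToyFs> all = new ArrayList<>();
        all.add(new ToyFs("Person", 40));
        all.add(new ToyFs("Token", 0));
        all.add(new ToyFs("Person", 7));
        System.out.println(select(all, "Person"));
    }
}
```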


> Going back to my hypothetical annotator that created an annotation off
> the base CAS by calling CAS.createAnnotation(begin, end, Sofa).  In
> our current implementation this isn't useful because the annotation
> has to be indexed to be retrievable, and the only way to index it is
> to add it to a view.  Are there any other options we could consider
This doesn't seem correct: an annotation doesn't have to be indexed to be
retrievable - it could be referenced by some chain of FS, with the starting
FS of course being indexed.  So I could have an FSArray for instance, of
Annotations, and index the FSArray.


Yes, of course; I glossed over that detail.  I don't think it really
affects my point, though.  The only way to index the FSArray
containing my annotation would be to add it to a view, which I don't
want to do.  Is there a way to make my Annotation accessible from the
base CAS without having to go through a view first?


Indexes are sometimes used as a performance optimization, but other times
they're part of a component's logic - as when a component depends on
a particular sorting order.



The annotator could do the sorting itself.  But I'll correct my
statement to say that indexes are both a performance optimization and
a convenience.

-Adam


Re: CAS and CasView redesign - question if all views should share the same indexes?

2006-12-22 Thread Marshall Schor

Adam Lally wrote:

On 12/22/06, Thilo Goetz <[EMAIL PROTECTED]> wrote:

Adam Lally wrote:
> (1) The CAS is the container for all of the analysis data (as per the
> UIMA spec).  It must be possible to create FS directly on the CAS
> and there must be some reasonable way to retrieve the FS in the CAS
> without having to be concerned with views.

This seems to be an important point, and one that I still haven't really
understood.  Why is this necessary? An anchored view is the only way to
contain a subject of analysis.  UIMA without sofas (in the conceptual
sense) is nothing.  Why do I need to be able to access annotations
without being concerned about views?  Conceptually and in an ideal
world, that is.  Don't get me wrong, I'm not opposed to this.  I simply
don't understand the motivation, and I would like to.



That's a fair question...

One thing I want to clarify is that UIMA without views doesn't mean
UIMA without Sofas. You should be able to access the Sofas (all of
them) directly from the CAS.  They're just FeatureStructures after
all, and our current implementation does have a Sofa index, though
it's hidden at the moment.

So one way of working with the CAS without views might be for an
annotator to look through the Sofa index for a Sofa it wants to
analyze and create some annotations over it (I suggested a
CAS.createAnnotation(begin, end, Sofa) method for this purpose.)

Views are a way that we think is useful to organize feature structures
in the CAS, and one key way to organize them is to collect all the
annotations referring to a single sofa into one (anchored) view.  

If we had filtering predicates as part of an index specification, then we
could create indexes over subsets of types quite arbitrarily. Could this
more general mechanism serve this purpose better than views?

But
is this the only way to do things in the UIMA standard?  That proved
to be a tough sell to the people who worked on the UIMA spec proposal
who were thinking not just about our implementation but also about
other UIM frameworks/systems that do things differently.  So the state
of things for the UIMA spec proposal right now is that views are an
optional way of doing things.

Now on top of that we have to figure out what to do with indexes,
which aren't part of the UIMA spec at the moment.  In our current
implementation indexes only operate on views.  Maybe it's OK to leave
it that way for now, but I thought it was worth exploring if there's a
way to have indexes work over the CAS as a whole, as well.

Going back to my hypothetical annotator that created an annotation off
the base CAS by calling CAS.createAnnotation(begin, end, Sofa).  In
our current implementation this isn't useful because the annotation
has to be indexed to be retrievable, and the only way to index it is
to add it to a view.  Are there any other options we could consider?

This doesn't seem correct: an annotation doesn't have to be indexed to be
retrievable - it could be referenced by some chain of FS, with the starting
FS of course being indexed.  So I could have an FSArray for instance, of
Annotations, and index the FSArray.
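
A small sketch of that reachability argument, using hypothetical toy classes rather than the UIMA API (`AnnArray` stands in for an FSArray): only the container is indexed, and the annotations are found by following its references.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical classes, not the UIMA API.
final class ReachableViaArray {
    static final class Ann {
        final int begin, end;
        Ann(int begin, int end) { this.begin = begin; this.end = end; }
    }

    // Stand-in for an FSArray: a feature structure that references other ones.
    static final class AnnArray {
        final List<Ann> elements = new ArrayList<>();
    }

    // The only indexed objects are arrays; annotations themselves are never indexed.
    final List<AnnArray> indexedArrays = new ArrayList<>();

    // Retrieval follows references starting from the indexed arrays.
    List<Ann> reachableAnnotations() {
        List<Ann> out = new ArrayList<>();
        for (AnnArray arr : indexedArrays) out.addAll(arr.elements);
        return out;
    }

    public static void main(String[] args) {
        ReachableViaArray cas = new ReachableViaArray();
        AnnArray arr = new AnnArray();
        arr.elements.add(new Ann(0, 5));  // created but not indexed
        cas.indexedArrays.add(arr);       // only the container is indexed
        System.out.println(cas.reachableAnnotations().size());
    }
}
```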


If we can't or don't want to change the fact that indexes only operate
on views, we could provide an iterator that walks the heap and returns
everything regardless of whether it's indexed.  Then we'd be saying -
neither views nor indexes are required -- they're a performance
optimization.

Indexes are sometimes used as a performance optimization, but other times
they're part of a component's logic - as when a component depends on
a particular sorting order.

-Marshall


Re: CAS and CasView redesign - question if all views should share the same indexes?

2006-12-22 Thread Adam Lally

On 12/22/06, Thilo Goetz <[EMAIL PROTECTED]> wrote:

Eddie Epstein wrote:
> Doesn't that previous discussion read on the topic of global indexes?

Is it my brain, or this sentence, that doesn't make any sense ;-)  Could
you explain?



Must be Eddie's Southern US dialect. ;)  I'm not familiar with that
use of "read on" either.  From context I'm guessing it's supposed to
mean the same thing as "speak to" as in, "has relevance to".

-Adam


Re: CAS and CasView redesign - question if all views should share the same indexes?

2006-12-22 Thread Adam Lally

On 12/22/06, Thilo Goetz <[EMAIL PROTECTED]> wrote:

Adam Lally wrote:
> (1) The CAS is the container for all of the analysis data (as per the
> UIMA spec).  It must be possible to create FS directly on the CAS
> and there must be some reasonable way to retrieve the FS in the CAS
> without having to be concerned with views.

This seems to be an important point, and one that I still haven't really
understood.  Why is this necessary?  An anchored view is the only way to
contain a subject of analysis.  UIMA without sofas (in the conceptual
sense) is nothing.  Why do I need to be able to access annotations
without being concerned about views?  Conceptually and in an ideal
world, that is.  Don't get me wrong, I'm not opposed to this.  I simply
don't understand the motivation, and I would like to.



That's a fair question...

One thing I want to clarify is that UIMA without views doesn't mean
UIMA without Sofas. You should be able to access the Sofas (all of
them) directly from the CAS.  They're just FeatureStructures after
all, and our current implementation does have a Sofa index, though
it's hidden at the moment.

So one way of working with the CAS without views might be for an
annotator to look through the Sofa index for a Sofa it wants to
analyze and create some annotations over it (I suggested a
CAS.createAnnotation(begin, end, Sofa) method for this purpose.)

Views are a way that we think is useful to organize feature structures
in the CAS, and one key way to organize them is to collect all the
annotations referring to a single sofa into one (anchored) view.  But
is this the only way to do things in the UIMA standard?  That proved
to be a tough sell to the people who worked on the UIMA spec proposal
who were thinking not just about our implementation but also about
other UIM frameworks/systems that do things differently.  So the state
of things for the UIMA spec proposal right now is that views are an
optional way of doing things.

Now on top of that we have to figure out what to do with indexes,
which aren't part of the UIMA spec at the moment.  In our current
implementation indexes only operate on views.  Maybe it's OK to leave
it that way for now, but I thought it was worth exploring if there's a
way to have indexes work over the CAS as a whole, as well.

Going back to my hypothetical annotator that created an annotation off
the base CAS by calling CAS.createAnnotation(begin, end, Sofa).  In
our current implementation this isn't useful because the annotation
has to be indexed to be retrievable, and the only way to index it is
to add it to a view.  Are there any other options we could consider?

If we can't or don't want to change the fact that indexes only operate
on views, we could provide an iterator that walks the heap and returns
everything regardless of whether it's indexed.  Then we'd be saying -
neither views nor indexes are required -- they're a performance
optimization.
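
Roughly what such an iterator could look like, as a toy sketch. `HeapWalkCas` is hypothetical and models the heap as a plain list, which says nothing about the real heap layout: everything created is recorded, indexing is optional, and the heap walk returns both indexed and unindexed items.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Toy model only: hypothetical classes, not the UIMA API or its heap layout.
final class HeapWalkCas {
    private final List<String> heap = new ArrayList<>();     // everything ever created
    private final List<String> indexed = new ArrayList<>();  // only what was added to an index

    String create(String fs) { heap.add(fs); return fs; }
    void addToIndexes(String fs) { indexed.add(fs); }

    // Walks the heap and returns every feature structure, indexed or not.
    Iterator<String> allFeatureStructures() {
        return Collections.unmodifiableList(heap).iterator();
    }

    // Returns only what was explicitly indexed.
    Iterator<String> indexedFeatureStructures() {
        return Collections.unmodifiableList(indexed).iterator();
    }

    public static void main(String[] args) {
        HeapWalkCas cas = new HeapWalkCas();
        cas.addToIndexes(cas.create("Token[0,5]"));
        cas.create("Person[0,5]");  // never indexed, still found by the heap walk
        Iterator<String> it = cas.allFeatureStructures();
        while (it.hasNext()) System.out.println(it.next());
    }
}
```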

-Adam


Re: CAS and CasView redesign - question if all views should share the same indexes?

2006-12-22 Thread Thilo Goetz

Eddie Epstein wrote:


We had previously discussed that using the base CAS as a single global
view was not useful for applications because of potential collisions, and
therefore recommended that a collection of multi-view analytics that need
a single "global" view should create a named view for that purpose.


So far I'm with you.  Marshall had mentioned this as well.  I kind of 
like the idea, I'm just wondering how complex this will be to specify. 
To me, the core idea is that a view holds a subset of all indexes. 
Those indexes could be shared by other views, if that makes sense.



Doesn't that previous discussion read on the topic of global indexes?


Is it my brain, or this sentence, that doesn't make any sense ;-)  Could 
you explain?


--Thilo




Re: CAS and CasView redesign - question if all views should share the same indexes?

2006-12-22 Thread Adam Lally

We had previously discussed that using the base CAS as a single global
view was not useful for applications because of potential collisions, and
therefore recommended that a collection of multi-view analytics that need
a single "global" view should create a named view for that purpose.
Doesn't that previous discussion read on the topic of global indexes?



I remember that discussion, but I guess I'm flip-flopping.  It's true
global indexes would need to be used with some care; annotators can't
assume no one else is writing to them.  Used appropriately, I don't
see this as likely to cause a problem (but if you want to argue
otherwise, maybe you can convince me to flop back to the other side
again).

I see there as being only two options here:
(a) Have a global index of some kind in order to allow annotators to
work on the CAS without regard for views.
(b) Require that to do any work with the CAS you need to work with a
view.  I believe that this is inconsistent with the OASIS architecture
proposal.

-Adam


Re: CAS and CasView redesign - question if all views should share the same indexes?

2006-12-22 Thread Thilo Goetz

Adam Lally wrote:

Now what you say about sofas is interesting.  Currently, an index knows
nothing of views or sofas.  The only thing that is checked when adding a
FS to an index is the FS's type.  Are you suggesting that there should
be special code that prevents me from adding an annotation that I
created in one view to the index repository of another view?



In fact I believe that code already exists and it's not that
complicated (in our current implementation anyway).  Each annotation
has a feature that is a reference to the Sofa, and the view has a
reference to its Sofa.  So I think this is just an integer comparison
between these two values.


Yes, but it's a check that is redundant in 99% of all cases.  We could 
also handle this at the iterator end of things, with an option that 
checks sofa/view membership.  We keep piling these things on, and we 
have enough problems selling UIMA performance as is.




This constraint is mentioned in the OASIS spec:  an "anchored view" is
a view that's tied to a Sofa, and it is a constraint that all
annotations that are members of an anchored view refer to that view's
Sofa.


A really simple approach would be to say that there are view-local index
definitions, and CAS-global index definitions.  For the view-local ones,
each view would have its own instance (and every view would have one).
For the CAS-global ones, there would be one instance in the CAS, shared
by all views.  However, that is just my current naive view of things.
Much more complicated schemes could be envisioned.



I'm not too worried about the specifiers.  A scheme like this would be
fine and fairly easy to add, if we first decide that this idea of
separate local/global index definitions is the way we want to go.


I am worried about our specifiers because of their complexity.  To this 
day, I have not fully understood the parameter settings in our 
specifiers, for example -- and I know I'm not the only one.  The more 
complexity we add, the higher the barrier to entry for a new UIMA user is.



Marshall Schor wrote:

Re: Need for "Global indexes"



What is the use case for the global view set of indexes? I can't recall
the use-case for this, beyond
being able to get all the data.   This thread has suggested other
utilities that can effectively
"merge" the results from other view's index instances. Are there other
use cases?


A hypothetical use case is that I want to get all Person mentions
(annotations) in the CAS, say because I'm going to populate a database
with their covered text and perhaps other feature values.

Of course, you could walk all views to do that.  But I'm suggesting
you shouldn't have to.  We could add a utility method to hide that
detail; I guess I'm OK with that.

Basically, this discussion is more about getting the concepts straight
than adding new functionality.  I'll say again:

(1) The CAS is the container for all of the analysis data (as per the
UIMA spec).  It must be possible to create FS directly on the CAS
and there must be some reasonable way to retrieve the FS in the CAS
without having to be concerned with views.


This seems to be an important point, and one that I still haven't really 
understood.  Why is this necessary?  An anchored view is the only way to 
contain a subject of analysis.  UIMA without sofas (in the conceptual 
sense) is nothing.  Why do I need to be able to access annotations 
without being concerned about views?  Conceptually and in an ideal 
world, that is.  Don't get me wrong, I'm not opposed to this.  I simply 
don't understand the motivation, and I would like to.


--Thilo


Re: CAS and CasView redesign - question if all views should share the same indexes?

2006-12-22 Thread Eddie Epstein
>
> * There is one "Global Index Repository" in the CAS (accessible by
> CAS.getGlobalIndexRepository() and CAS.addFsToGlobalIndexes())
>
> * Each view has its own Index Repository, containing only the indexes
> that are specific to that view. (accessible by
> CasView.getIndexRepository() and CasView.addFsToIndexes()).
>
> * There may be an additional method CAS.getCompleteIndexRepository()
> which returns an IndexRepository that contains ALL indexes in the
> entire CAS, including the global indexes as well as all indexes in all
> views.  However, I argued that this index repository should be
> read-only (i.e. not support addFS()), because adding an FS to all
> views in one fell swoop seemed like too dangerous an operation.
>

We had previously discussed that using the base CAS as a single global
view was not useful for applications because of potential collisions, and
therefore recommended that a collection of multi-view analytics that need
a single "global" view should create a named view for that purpose.
Doesn't that previous discussion read on the topic of global indexes?

Eddie


Re: CAS and CasView redesign - question if all views should share the same indexes?

2006-12-22 Thread Adam Lally

A collection of quotes from Thilo about global indexes.  After reading
all these I think I finally might be on the same wavelength... (we'll
see in a moment :)

On 12/22/06, Thilo Goetz <[EMAIL PROTECTED]> wrote:

 
Global indexes should be shared.  That is
also the spirit of the OASIS draft, I think.  The draft spec doesn't
talk about indexes, true, but it certainly has been informed by our
implementation.
 
What I mean is that there is no way for a global index not to be part of
a view.  Or without the double negation: every global index is part of
every view.
 

The only rule of visibility is that one view can not access the
view-specific indexes of another view.  Everything else is always
visible.

All CAS-global indexes are visible from/belong to every
view, view-local indexes have one instance per view, and no global
instance.  That's what I meant, but as I said, much more sophisticated
schemes could be imagined.




So, the suggestion is that for an index definition that's declared
"global", there would be only one instance in the entire CAS.  But
that one instance would be in the Index Repository of all views.
Therefore, calling myView.addFsToIndexes(fs) would add fs to this
global index, whereupon it would be visible from all other views (via
myOtherView.getIndexRepository().getIndex(name).iterator()).
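
A toy sketch of that behaviour, with hypothetical classes rather than the UIMA API: each view's repository holds its own local index plus a reference to the single shared global instance, so an add through one view becomes visible through the other.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical classes, not the UIMA API.
final class SharedGlobalIndexDemo {
    static final class View {
        // Index name -> index instance. Global entries point at one shared list.
        final Map<String, List<String>> repository = new HashMap<>();

        // Adds fs to every index in this view's repository, local or global.
        void addFsToIndexes(String fs) {
            for (List<String> index : repository.values()) index.add(fs);
        }
    }

    // Creates a view with its own local index plus a reference to the shared global one.
    static View newView(List<String> sharedGlobal) {
        View v = new View();
        v.repository.put("local", new ArrayList<>());
        v.repository.put("global", sharedGlobal);
        return v;
    }

    public static void main(String[] args) {
        List<String> global = new ArrayList<>();
        View myView = newView(global);
        View myOtherView = newView(global);
        myView.addFsToIndexes("Person[0,5]");
        System.out.println(myOtherView.repository.get("global")); // the fs shows up here too
        System.out.println(myOtherView.repository.get("local"));  // but not in the local index
    }
}
```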

I'm a little uncomfortable with how this makes it nearly transparent
whether an index is local or global.  I don't think this fits so well
with some basic ideas of Views, namely:

(a) There should be a straightforward operation that adds something to
a view without impacting other views

(b) There should be a straightforward operation that gets me the
members of a view.


I think of indexes as our implementation of view membership.  So I
want to think of operation (a) as being myView.addFsToIndexes(fs) and
operation (b) as myView.getIndexRepository().getIndexes() [although,
I'd like a more convenient way].

The global indexes don't fit so well there... because
myView.addFsToIndexes(fs) violates the "without impacting other views"
restriction in (a).

How about the slightly different thought (which I think we were at
least close to agreeing to yesterday).

* There is one "Global Index Repository" in the CAS (accessible by
CAS.getGlobalIndexRepository() and CAS.addFsToGlobalIndexes())

* Each view has its own Index Repository, containing only the indexes
that are specific to that view. (accessible by
CasView.getIndexRepository() and CasView.addFsToIndexes()).

* There may be an additional method CAS.getCompleteIndexRepository()
which returns an IndexRepository that contains ALL indexes in the
entire CAS, including the global indexes as well as all indexes in all
views.  However, I argued that this index repository should be
read-only (i.e. not support addFS()), because adding an FS to all
views in one fell swoop seemed like too dangerous an operation.
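
The three bullets above could be sketched like this. The method names mirror the ones proposed in this thread; none of this is an existing API, and each index repository is simplified to a flat list of FS for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the proposal only: these types mirror the method names suggested
// in this thread and are not an existing UIMA API.
final class ProposedCas {
    static final class CasView {
        private final List<String> indexRepository = new ArrayList<>();
        List<String> getIndexRepository() { return indexRepository; }
        void addFsToIndexes(String fs) { indexRepository.add(fs); }  // affects this view only
    }

    private final List<String> globalIndexRepository = new ArrayList<>();
    private final Map<String, CasView> views = new LinkedHashMap<>();

    // Returns the named view, creating it on first use.
    CasView getView(String name) { return views.computeIfAbsent(name, k -> new CasView()); }

    List<String> getGlobalIndexRepository() { return globalIndexRepository; }
    void addFsToGlobalIndexes(String fs) { globalIndexRepository.add(fs); }

    // Read-only snapshot of the global indexes plus all view indexes; adding through it fails.
    List<String> getCompleteIndexRepository() {
        List<String> all = new ArrayList<>(globalIndexRepository);
        for (CasView v : views.values()) all.addAll(v.getIndexRepository());
        return Collections.unmodifiableList(all);
    }

    public static void main(String[] args) {
        ProposedCas cas = new ProposedCas();
        cas.addFsToGlobalIndexes("DocumentMetadata");
        cas.getView("english").addFsToIndexes("Token[0,5]");
        cas.getView("german").addFsToIndexes("Token[0,4]");
        System.out.println(cas.getCompleteIndexRepository());
        try {
            cas.getCompleteIndexRepository().add("x");
        } catch (UnsupportedOperationException e) {
            System.out.println("complete repository is read-only");
        }
    }
}
```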

-Adam


Re: CAS and CasView redesign - question if all views should share the same indexes?

2006-12-22 Thread Adam Lally

A few quick comments here, then I'll deal with the big issues in another email.

On 12/22/06, Thilo Goetz <[EMAIL PROTECTED]> wrote:

Marshall Schor wrote:
> In this discussion, I think some confusion arises from the use of
> "index" to mean both the index definition, and
> an instance (perhaps associated with a particular view) of that index
> definition.
>
> Also, in this discussion, the term CAS seems sometimes to be specific to
> what we might call the base-view, versus
> other specific "views" of the CAS.
>
> If we more clearly distinguish these, the conversation may be easier to
> follow.  I've tried to distinguish them below:

Ever since I've grasped the major concepts, I think we've been
communicating quite well ;-)



I agree, though being clearer on terminology can't hurt.  I've tried
to adhere to:

* "CAS" means the entire CAS.  It never means a specific view of the CAS.
* "Index Definition" means the declaration in the descriptor that
defines an index - giving it a label, kind of index, CAS type, and
sort keys.
* "Index" is an instance of an index definition - something that can
be retrieved by a getIndex() call and from which you can get an
iterator.
* "Physical Index" is an actual data structure holding references to
FeatureStructures.  This  is transparent to the user but sometimes we
need to talk about it if we're concerned about performance.
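
One way to picture these terms is as types. This is only a sketch: the classes and the "sorted" kind value are hypothetical, not the UIMA API. A single Index Definition can have several Index instances, each backed by its own physical structure.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Toy model of the vocabulary above; hypothetical classes, not the UIMA API.
final class IndexTerms {
    // "Index Definition": the declaration (label, kind of index, CAS type, sort keys).
    static final class IndexDefinition {
        final String label, kind, casType;
        final List<String> sortKeys;
        IndexDefinition(String label, String kind, String casType, List<String> sortKeys) {
            this.label = label;
            this.kind = kind;
            this.casType = casType;
            this.sortKeys = sortKeys;
        }
    }

    // "Index": one instance of a definition, from which an iterator can be obtained.
    static final class Index {
        final IndexDefinition definition;
        // "Physical Index": the data structure that actually holds the references.
        private final List<Object> physical = new ArrayList<>();
        Index(IndexDefinition definition) { this.definition = definition; }
        void add(Object fs) { physical.add(fs); }
        int size() { return physical.size(); }
        Iterator<Object> iterator() { return physical.iterator(); }
    }

    public static void main(String[] args) {
        IndexDefinition def = new IndexDefinition(
                "AnnotationIndex", "sorted", "Annotation", Arrays.asList("begin", "end"));
        Index inViewA = new Index(def);  // two instances of the same definition
        Index inViewB = new Index(def);
        inViewA.add("Token[0,5]");
        System.out.println(def.label + ": " + inViewA.size() + " vs " + inViewB.size());
    }
}
```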


One of the things Adam and I had agreed on (Adam correct me if I'm
wrong) was that the base CAS, as you call it, is *not* a view.


+1.  This is central to our API renaming - we are envisioning creating
two interfaces: CAS and CasView.  An instance of CAS refers to an
entire CAS, which may contain multiple CasViews.  It is not consistent
with that to say that "the CAS is a view" or that "a view is a CAS".


Now what you say about sofas is interesting.  Currently, an index knows
nothing of views or sofas.  The only thing that is checked when adding a
FS to an index is the FS's type.  Are you suggesting that there should
be special code that prevents me from adding an annotation that I
created in one view to the index repository of another view?



In fact I believe that code already exists and it's not that
complicated (in our current implementation anyway).  Each annotation
has a feature that is a reference to the Sofa, and the view has a
reference to its Sofa.  So I think this is just an integer comparison
between these two values.
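
A minimal sketch of that check, assuming Sofa references can be modelled as plain ints. `SofaCheckDemo` and its nested classes are hypothetical, not the actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical classes; Sofa references are modelled as plain ints.
final class SofaCheckDemo {
    static final class Ann {
        final int sofaRef;
        Ann(int sofaRef) { this.sofaRef = sofaRef; }
    }

    static final class AnchoredView {
        final int sofaRef;
        final List<Ann> index = new ArrayList<>();
        AnchoredView(int sofaRef) { this.sofaRef = sofaRef; }

        // Rejects annotations that point at a different Sofa than this view's.
        void addFsToIndexes(Ann ann) {
            if (ann.sofaRef != sofaRef) {  // the single integer comparison
                throw new IllegalArgumentException("annotation belongs to another Sofa");
            }
            index.add(ann);
        }
    }

    public static void main(String[] args) {
        AnchoredView view = new AnchoredView(1);
        view.addFsToIndexes(new Ann(1));      // same Sofa: accepted
        try {
            view.addFsToIndexes(new Ann(2));  // different Sofa: rejected
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        System.out.println("indexed: " + view.index.size());
    }
}
```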

This constraint is mentioned in the OASIS spec:  an "anchored view" is
a view that's tied to a Sofa, and it is a constraint that all
annotations that are members of an anchored view refer to that view's
Sofa.


A really simple approach would be to say that there are view-local index
definitions, and CAS-global index definitions.  For the view-local ones,
each view would have its own instance (and every view would have one).
For the CAS-global ones, there would be one instance in the CAS, shared
by all views.  However, that is just my current naive view of things.
Much more complicated schemes could be envisioned.



I'm not too worried about the specifiers.  A scheme like this would be
fine and fairly easy to add, if we first decide that this idea of
separate local/global index definitions is the way we want to go.



>> The only rule of visibility is that one view can not access the
>> view-specific indexes of another view.  Everything else is always
>> visible.
> I didn't follow this...

See above.  All CAS-global indexes are visible from/belong to every
view, view-local indexes have one instance per view, and no global
instance.  That's what I meant, but as I said, much more sophisticated
schemes could be imagined.



I think this is the key idea still to be nailed down, really.  Like
Marshall I don't think I completely understood what Thilo was
suggesting with the global indexes being visible from the views.  I
have a better understanding now but have some concerns.  This will be
the topic of my next email.


Marshall Schor wrote:

Re: Need for "Global indexes"



What is the use case for the global view set of indexes? I can't recall
the use-case for this, beyond
being able to get all the data.   This thread has suggested other
utilities that can effectively
"merge" the results from other view's index instances. Are there other
use cases?


A hypothetical use case is that I want to get all Person mentions
(annotations) in the CAS, say because I'm going to populate a database
with their covered text and perhaps other feature values.

Of course, you could walk all views to do that.  But I'm suggesting
you shouldn't have to.  We could add a utility method to hide that
detail; I guess I'm OK with that.
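
Such a utility might look roughly like this. It is a toy sketch with hypothetical names, not the UIMA API: it walks every view's annotations and collects the covered text of one type, so the caller never enumerates views itself.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical classes, not the UIMA API.
final class AllViewsLookup {
    static final class Ann {
        final String type, coveredText;
        Ann(String type, String coveredText) { this.type = type; this.coveredText = coveredText; }
    }

    // Walks every view's annotations and collects the covered text of one type.
    static List<String> coveredTextOfType(Map<String, List<Ann>> annotationsByView, String type) {
        List<String> out = new ArrayList<>();
        for (List<Ann> viewAnnotations : annotationsByView.values()) {
            for (Ann a : viewAnnotations) {
                if (a.type.equals(type)) out.add(a.coveredText);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<Ann>> views = new LinkedHashMap<>();
        List<Ann> english = new ArrayList<>();
        english.add(new Ann("Person", "Ada"));
        english.add(new Ann("Token", "the"));
        List<Ann> german = new ArrayList<>();
        german.add(new Ann("Person", "Grace"));
        views.put("english", english);
        views.put("german", german);
        System.out.println(coveredTextOfType(views, "Person"));
    }
}
```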

Basically, this discussion is more about getting the concepts straight
than adding new functionality.  I'll say again:

(1) The CAS is the container for all of the analysis data (as per the
UIMA spec).  It must be possible to create FS directly on the CAS
and there must be some reasonable way to retrieve the FS in the CAS

Re: CAS and CasView redesign - question if all views should share the same indexes?

2006-12-22 Thread Thilo Goetz

Marshall Schor wrote:

Re: Need for "Global indexes"


What is the use case for the global view set of indexes? I can't recall
the use-case for this, beyond being able to get all the data.  This thread
has suggested other utilities that can effectively "merge" the results from
other view's index instances. Are there other use cases?


I don't know of any either.  All I can think of is tooling and utilities 
such as serialization, and those might be expected to work with views.




We had once discussed a use case where some collection of parts 
(annotators) that worked
with views wanted to share some data that was global to their views.  We 
thought that
the best-practice way to do that was to have this collection of parts 
define another "view"
to serve as their "global-sharing-place", in preference to a 
system-provided
global-sharing-place because that would enable this collection of parts 
to be combined with
other parts in the future without having any accidental collisions in 
the global-sharing-space,

from other unknown users of this space.


See my reply to your other post.  From a CAS perspective, I don't think 
that's a problem.  I don't even want to start thinking about what a 
specifier for that would look like.




I guess I would vote to have the thing that gets all the FS in all views 
be just a utility

method.

I hope if we put our minds to it we can get this done for 2.1.  I'm
hoping after 2.1 we can go a good long time without breaking backwards
compatibility again.

+1 to that :-)


Just a word of caution here.  We not only need to agree on the 
specification, we also need to implement all this.  There are some 
non-trivial CAS changes under discussion right now.  Even if we can 
defer some internal clean-up to the next release, the functionality 
needs to be there.


--Thilo



Re: CAS and CasView redesign - question if all views should share thesame indexes?

2006-12-22 Thread Thilo Goetz

Marshall Schor wrote:
In this discussion, I think some confusion arises from the use of 
"index" to mean both the index definition, and
an instance (perhaps associated with a particular view) of that index 
definition.


Also, in this discussion, the term CAS seems sometimes to be specific to 
what we might call the base-view, versus

other specific "views" of the CAS.

If we more clearly distinguish these, the conversation may be easier to 
follow.  I've tried to distinguish them below:


Ever since I've grasped the major concepts, I think we've been 
communicating quite well ;-)



Logically, the part of the Index Repository which has the definitions is 
not duplicated;
only the actual index instances are.  We think there is a way to make 
the actual
creation of the index instances "lazy" - in the sense that for 
performance / overhead
reasons, they are not created until the first attempt to add a FS to 
that index instance

in that particular view.


I assume you are talking about the current implementation.  I'm not sure 
what this laziness would buy us.  An empty index consumes virtually no 
space.  Unless we have reason to believe that there are significant 
gains to be had, I would vote for simplicity and against optimization.



I didn't mean to suggest to have duplicate indexes.  What I meant to 
say was, each view should have its own annotation index.  
In fact, today, each view has its own complete set of index instances, 
one per each index definition.


And that is not a good thing.  Global indexes should be shared.  That is 
also the spirit of the OASIS draft, I think.  The draft spec doesn't 
talk about indexes, true, but it certainly has been informed by our 
implementation.


In the CAS, each of these annotation indexes can be accessed 
separately.  In fact, I think this is pretty much what you're saying 
as well.  I don't see a use case for a global merged annotation index, 
other than tooling and utilities.  And even for tooling, I think it 
makes sense to access the annotation for each view separately.  If we 
need to iterate over annotations from different views sorted by their 
offsets, irrespective of the sofa they point into, we can provide a 
utility function that does that on the fly.
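
A rough sketch of what such an on-the-fly utility could look like, in Java. Everything here is hypothetical (the `Annotation` class, the `OFFSET_ORDER` comparator and `mergeByOffset` are illustration only, not UIMA API), and it assumes each per-view index is already sorted by begin ascending, then end descending. It does a k-way merge with a priority queue, so nothing is copied into a second index.

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical model for illustration; not the UIMA API.
final class MergedAnnotationSketch {

    static final class Annotation {
        final String view;
        final int begin;
        final int end;

        Annotation(String view, int begin, int end) {
            this.view = view;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Assumed ordering: begin ascending, then end descending. */
    static final Comparator<Annotation> OFFSET_ORDER = new Comparator<Annotation>() {
        @Override
        public int compare(Annotation a, Annotation b) {
            if (a.begin != b.begin) {
                return Integer.compare(a.begin, b.begin);
            }
            return Integer.compare(b.end, a.end);
        }
    };

    /** One per-view iterator together with its next unconsumed element. */
    private static final class Head {
        final Annotation current;
        final Iterator<Annotation> rest;

        Head(Annotation current, Iterator<Annotation> rest) {
            this.current = current;
            this.rest = rest;
        }
    }

    /** Lazily merges per-view indexes that are each sorted by OFFSET_ORDER. */
    static Iterator<Annotation> mergeByOffset(List<List<Annotation>> perViewIndexes) {
        final PriorityQueue<Head> heads = new PriorityQueue<Head>(
                11, (x, y) -> OFFSET_ORDER.compare(x.current, y.current));
        for (List<Annotation> index : perViewIndexes) {
            Iterator<Annotation> it = index.iterator();
            if (it.hasNext()) {
                heads.add(new Head(it.next(), it));
            }
        }
        return new Iterator<Annotation>() {
            @Override
            public boolean hasNext() {
                return !heads.isEmpty();
            }

            @Override
            public Annotation next() {
                // remove() throws NoSuchElementException once all views are exhausted
                Head smallest = heads.remove();
                if (smallest.rest.hasNext()) {
                    heads.add(new Head(smallest.rest.next(), smallest.rest));
                }
                return smallest.current;
            }
        };
    }
}
```

Whether offsets from different sofas are meaningfully comparable is a separate question; the sketch only shows that the merge itself needs no extra index.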


Note however that this implies that one should never do 
addFsToIndexes() on the CAS with an annotation, as it would be added 
to all annotation indexes.  
I think this means not to do an addFsToIndexes() with an Annotation on 
the Cas View which is the "base CAS".   The current design would 
disallow this because an Annotation (which has a reference to a Sofa) is 
only allowed to be

added to index instances that belong to the view which has that Sofa.


One of the things Adam and I had agreed on (Adam correct me if I'm 
wrong) was that the base CAS, as you call it, is *not* a view.  What 
Adam was proposing for backward compatibility was a notion of a "current 
view", which is directly accessible through the CAS APIs.  However, 
those are just convenience/compatibility APIs.  Conceptually, the 
current view is a view of its own and could (should for new code) be 
accessed through regular view APIs.


Now what you say about sofas is interesting.  Currently, an index knows 
nothing of views or sofas.  The only thing that is checked when adding a 
FS to an index is the FS's type.  Are you suggesting that there should 
be special code that prevents me from adding an annotation that I 
created in one view to the index repository of another view?


That might be desirable, but it will get complicated and expensive.  I 
think we need to document this point carefully and hope that users 
understand that they shouldn't be doing this.  It would be very hard to 
prevent all misuses of sofas/views.


My suggestion implies that the index repository itself is agnostic of 
views and sofas.  If you add an annotation to the wrong repository, 
it's your own fault.


So to summarize, I would suggest that annotation indexes, for example, 
only live in views, there is no global annotation index (neither 
conceptually, nor physically).  To access annotations from the CAS, 
you still need to access view-specific indexes.


Non-sofa indexes, on the other hand, only exist in the global namespace. 
I'm not sure what this means.  If it means an index over some 
non-Annotation type is not allowed to be part of a view, this seems to 
go against the idea of allowing "views" to hold subsets of 
FeatureStructures.   So I don't think that's a good idea here.


What I mean is that there is no way for a global index not to be part of 
a view.  Or without the double negation: every global index is part of 
every view.


Why not make this simpler by having a uniform approach: each view has 
its own set of index instances (drawn from perhaps a global set of index 
definitions, or perhaps some localized set of index definitions - that 
part to be worked out), whether or not the index is over Annotations or 
not.


From a CAS impleme

Re: CAS and CasView redesign - question if all views should share thesame indexes?

2006-12-21 Thread Marshall Schor

Re: Need for "Global indexes"

Adam Lally wrote:



>
> Moreover, I think the reverse direction should be true -- indexing an
> FS in a view's index repository DOES add it (at least conceptually) to
> indexes that apply to the CAS as a whole.  I liked this latter idea
> because it provided a way to get at all the FS in the CAS without
> having to be concerned with views.

I agree, and I hope that has been clear from my previous posts.  Any
view-specific index is visible from the CAS, in my approach.



OK, as I said above I think I was just stuck on whether or not the
thing that from the base CAS gives you a merged view of all the view
indexes was called an index, or whether it's just a utility method.

I'm using the terms "index definitions" and "index instances" here; we 
can have
one global set of index definitions  (or not :-) while having multiple 
index instances for those definitions, one per view, and
perhaps (a conceptual, maybe not real) one for the "base CAS" or "global 
view" or whatever we want to call it -

something used by people not concerned about views.

What is the use case for the global view set of indexes? I can't recall 
the use-case for this, beyond
being able to get all the data.   This thread has suggested other 
utilities that can effectively
"merge" the results from other view's index instances. Are there other 
use cases?


We had once discussed a use case where some collection of parts 
(annotators) that worked
with views wanted to share some data that was global to their views.  We 
thought that
the best-practice way to do that was to have this collection of parts 
define another "view"

to serve as their "global-sharing-place", in preference to a system-provided
global-sharing-place because that would enable this collection of parts 
to be combined with
other parts in the future without having any accidental collisions in 
the global-sharing-space,

from other unknown users of this space.

I guess I would vote to have the thing that gets all the FS in all views 
be just a utility

method.

I hope if we put our minds to it we can get this done for 2.1.  I'm
hoping after 2.1 we can go a good long time without breaking backwards
compatibility again.

+1 to that :-)

-Marshall



Re: CAS and CasView redesign - question if all views should share thesame indexes?

2006-12-21 Thread Marshall Schor
In this discussion, I think some confusion arises from the use of 
"index" to mean both the index definition, and
an instance (perhaps associated with a particular view) of that index 
definition.


Also, in this discussion, the term CAS seems sometimes to be specific to 
what we might call the base-view, versus

other specific "views" of the CAS.

If we more clearly distinguish these, the conversation may be easier to 
follow.  I've tried to distinguish them below:


Thilo Goetz wrote:

Adam Lally wrote:


I think this basically makes sense.  I want to clarify though, that
we *do* currently have different indexes 
(i.e., we have different index instances of the common / shared index 
definitions)

for each view (for
example each view has its own annotation index, which holds  the
annotations relating to that view's sofa). This is done by replicating
the index repository for each view.
Logically, the part of the Index Repository which has the definitions is 
not duplicated;
only the actual index instances are.  We think there is a way to make 
the actual
creation of the index instances "lazy" - in the sense that for 
performance / overhead
reasons, they are not created until the first attempt to add a FS to 
that index instance

in that particular view.
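
A small sketch of the lazy idea, in Java, under stated assumptions: the class `ViewIndexRepository` and its methods are hypothetical and an index is modeled as a plain list. The definitions are shared by reference across views, and an index instance only comes into existence on the first add in that view.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model for illustration; not the UIMA API.
final class LazyIndexSketch {

    /** Per-view repository: shares the definitions, owns only its instances. */
    static final class ViewIndexRepository {
        private final Set<String> sharedDefinitions; // index labels, not copied per view
        private final Map<String, List<Object>> instances = new HashMap<String, List<Object>>();

        ViewIndexRepository(Set<String> sharedDefinitions) {
            this.sharedDefinitions = sharedDefinitions;
        }

        /** Creates the index instance on the first add, not before. */
        void addFs(String indexLabel, Object fs) {
            if (!sharedDefinitions.contains(indexLabel)) {
                throw new IllegalArgumentException("No such index definition: " + indexLabel);
            }
            List<Object> index = instances.get(indexLabel);
            if (index == null) {
                index = new ArrayList<Object>();
                instances.put(indexLabel, index);
            }
            index.add(fs);
        }

        /** Reading never creates an instance; a missing one reads as empty. */
        List<Object> getIndex(String indexLabel) {
            List<Object> index = instances.get(indexLabel);
            if (index == null) {
                return Collections.<Object>emptyList();
            }
            return Collections.unmodifiableList(index);
        }

        /** Number of index instances actually created in this view. */
        int instanceCount() {
            return instances.size();
        }
    }
}
```

Whether this buys anything depends on how expensive an empty index instance really is, which the sketch does not measure.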




Right.  I would like to change that in the course of introducing 
CasViews.




A key question is "do all views have the same set of index
_definitions_?"  Currently, yes - the component descriptors declare
index definitions without reference to views, and consequently, for
every view we create an instance of each defined index.  Your note
above, and Marshall's, argue that this shouldn't necessarily be the
case -- some indexes may make sense only for certain views (but also,
only for certain components, a further complication).  I think that
probably makes sense, but I'm not sure it's a critical thing to
implement now, if we haven't seen a real use case where it's a problem
to create instances of indexes in every view even if they're not used.


Hm, somehow, we need to distinguish between indexes that are global to 
all views, and those that are local to a view.  How do we do that?
I think you mean to distinguish between index definitions that should be 
in all views, and
those which should only be in (some) views. 




The other key idea here is the global index repository that contains
all of the indexes from all views -- we don't currently have anything
like that.  Take the annotation index as an example, and say there are
multiple views each with their own annotation index.  I also want to
enable operations on the CAS like "get me all annotations in all
views", or "get me all annotations of type Person in all views".  To
do that we also create an annotation index in the base CAS (the
"global namespace").  I think you could do such a thing in your
suggestion; if you had a global annotation index then whenever anyone
did view.addFsToIndexes(myAnnot) in any view, myAnnot would also be
added to the global annotation index (because you said the global
index is visible from the index repository of the view).  My idea was
a little different, and I guess maybe just an implementation detail.
Instead of actually adding myAnnot to a separate, global index, I
would just add it to its own view's index.  Then, when someone asks
for an iterator off of the global annotation index, I would do a
dynamic merge of the annotation indexes in all views (the same way we
do merging of indexes across types).  But the effect is the same - we
have a global index that provides access to everything that was
indexed in any view.


I didn't mean to suggest to have duplicate indexes.  What I meant to 
say was, each view should have its own annotation index.  
In fact, today, each view has its own complete set of index instances, 
one per each index definition.
In the CAS, each of these annotation indexes can be accessed 
separately.  In fact, I think this is pretty much what you're saying 
as well.  I don't see a use case for a global merged annotation index, 
other than tooling and utilities.  And even for tooling, I think it 
makes sense to access the annotation for each view separately.  If we 
need to iterate over annotations from different views sorted by their 
offsets, irrespective of the sofa they point into, we can provide a 
utility function that does that on the fly.


Note however that this implies that one should never do 
addFsToIndexes() on the CAS with an annotation, as it would be added 
to all annotation indexes.  
I think this means not to do an addFsToIndexes() with an Annotation on 
the Cas View which is the "base CAS".   The current design would 
disallow this because an Annotation (which has a reference to a Sofa) is 
only allowed to be
added to index instances that belong to the view which has that Sofa. 
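
To illustrate the kind of check being described, here is a minimal Java sketch. The names are hypothetical (`View`, `Annotation`, a string `sofaId`), and the choice of `IllegalArgumentException` is an assumption made for the example, not a statement about what the implementation throws.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for illustration; not the UIMA API.
final class SofaCheckSketch {

    /** An annotation carries a reference to the sofa it points into. */
    static final class Annotation {
        final String sofaId;
        final int begin;
        final int end;

        Annotation(String sofaId, int begin, int end) {
            this.sofaId = sofaId;
            this.begin = begin;
            this.end = end;
        }
    }

    /** An anchored view: exactly one sofa, one annotation index instance. */
    static final class View {
        final String sofaId;
        private final List<Annotation> annotationIndex = new ArrayList<Annotation>();

        View(String sofaId) {
            this.sofaId = sofaId;
        }

        /** Rejects annotations that refer to another view's sofa. */
        void addFsToIndexes(Annotation a) {
            if (!sofaId.equals(a.sofaId)) {
                throw new IllegalArgumentException(
                        "Annotation refers to sofa " + a.sofaId + " but this view has sofa " + sofaId);
            }
            annotationIndex.add(a);
        }

        int size() {
            return annotationIndex.size();
        }
    }
}
```

The check is cheap here because it sits in the view rather than in the index itself, so the index can stay agnostic of sofas.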

My suggestion implies that the index repository itself is agnostic of 
views and sofas.  If you add an annotation to the wrong repository, 
it's yo

Re: CAS and CasView redesign - question if all views should share thesame indexes?

2006-12-21 Thread Adam Lally

On 12/21/06, Thilo Goetz <[EMAIL PROTECTED]> wrote:

> (1) The CAS is the container for all of the analysis data (as per the
> UIMA spec).  It must be possible to create FS directly on the CAS
> and there must be some reasonable way to retrieve the FS in the CAS
> without having to be concerned with views.

Agreed.  It should be possible to say, on the global index repository:
give me all indexes.  This will include the global indexes, as well as
all view-specific indexes.  You can then iterate over all data in all
indexes, without knowing anything about views.



OK, that seems fine.  As you said we can provide a utility function to
get me all annotations of a particular type.  This seems functionally
equivalent to my idea of a *conceptual* annotation index that
logically contains all annotations in the CAS but would be efficiently
implemented by on-the-fly combination of the individual annotation
indexes.  The only difference is whether you want to call such a thing
an "index".  (We already do such on-the-fly merges so it didn't seem
like too much of a stretch to me.)


> (2) A CasView is a way of accessing a subset of FS in the CAS.  It
> must be possible
> to assert that an FS is a _member_ of a CasView, and there must be
> some reasonable way to retrieve the members of the CasView.

In the general CAS, we can only access those FSs that are in some index.
  If you need to be able to retrieve any FS whatsoever, you need to
define a bag index over all types.  I would propose to handle views the
same way.  A FS is a member of a view iff it's contained in one of the
indexes specific to the view.  The same FS may live in several indexes,
belonging to different views.  That seems in accordance with the spec
proposal.



Agreed, that's my interpretation as well.  Essentially I think of
indexes as a way to implement view membership.



A view to me is just a set of indexes; moreover, it's a subset of the
set of all indexes, which are exactly the indexes defined in the CAS.
When I add a FS to all those indexes, it will be added to all applicable
indexes, and that means all view indexes as well.


Hmm... this is where I start to feel that this design of the index
mechanism (a view being a set of indexes which is a subset of the set
of all indexes in the CAS) is failing to mesh with how I think views
should work.

I think it should be clear when one is asserting that an object
belongs to a view (by specifically adding it to the indexes for that
view).  And I think that someone who doesn't care about views should
be able to index objects in the CAS and read them back later, and
views should be completely unaffected by this.  If it's too easy to
accidentally add objects to every view in the CAS without realizing
it, we're not doing a good job at making it possible to interact with
the CAS as just a collection of objects without being concerned with
views.



Alternatively, we can
say adding an FS in the CAS means adding it to global, non-view indexes
only.  That would make sense, but it doesn't sync with the idea that the
CAS index repository contains all indexes, not just the global ones.
Maybe we need a special API for that, addFsToGlobalIndexes().  So maybe
getGlobalIndexRepository() should be called something else, to avoid
confusion.  getCompleteIndexRepository() or something.



I'm about +0.5 on that; we're getting somewhere.  I still feel like
getCompleteIndexRepository().addFS() would be a dangerous method just
asking to be misused.  Could we disable that operation?



>
> Moreover, I think the reverse direction should be true -- indexing an
> FS in a view's index repository DOES add it (at least conceptually) to
> indexes that apply to the CAS as a whole.  I liked this latter idea
> because it provided a way to get at all the FS in the CAS without
> having to be concerned with views.

I agree, and I hope that has been clear from my previous posts.  Any
view-specific index is visible from the CAS, in my approach.



OK, as I said above I think I was just stuck on whether or not the
thing that from the base CAS gives you a merged view of all the view
indexes was called an index, or whether it's just a utility method.



I am very much concerned with performance, and it needs to be a
consideration from the start.  We simply can't add every annotation to
two indexes by default.


I didn't mean to suggest that, sorry if I wasn't clear.  If you have a
"global" annotation index and a "local" (to a view) annotation index,
I don't want to ever add an annotation to both.  If you index it off
of the view, it only goes into the local index.  If you index it off
the base CAS, it only goes into the global index (in my way of
thinking).  I think there should be a way to easily get at the
contents of both indexes as if they were merged, but we'd do such
merging on-the-fly.

Each addFsToIndexes() operation should add the FS to either a single
view's indexes, or to the global indexes.
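
As a sketch of that rule in Java (a hypothetical model, not the UIMA API: `Cas`, the two `addFsToIndexes` overloads and `allContents` are names invented for the example), each add writes to exactly one place and the combined picture is only assembled at read time:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model for illustration; not the UIMA API.
final class IndexRoutingSketch {

    static final class Cas {
        private final List<Object> globalIndex = new ArrayList<Object>();
        private final Map<String, List<Object>> viewIndexes =
                new LinkedHashMap<String, List<Object>>();

        /** Indexing off the base CAS: goes into the global index only. */
        void addFsToIndexes(Object fs) {
            globalIndex.add(fs);
        }

        /** Indexing off a view: goes into that view's local index only. */
        void addFsToIndexes(String viewName, Object fs) {
            List<Object> local = viewIndexes.get(viewName);
            if (local == null) {
                local = new ArrayList<Object>();
                viewIndexes.put(viewName, local);
            }
            local.add(fs);
        }

        List<Object> globalContents() {
            return Collections.unmodifiableList(globalIndex);
        }

        List<Object> viewContents(String viewName) {
            List<Object> local = viewIndexes.get(viewName);
            if (local == null) {
                return Collections.<Object>emptyList();
            }
            return Collections.unmodifiableList(local);
        }

        /** Merged on the fly at read time; nothing is ever stored twice. */
        List<Object> allContents() {
            List<Object> merged = new ArrayList<Object>(globalIndex);
            for (List<Object> local : viewIndexes.values()) {
                merged.addAll(local);
            }
            return merged;
        }
    }
}
```

Under this model no write touches two indexes, which is the performance property being asked for.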

So given that, is there an issue wi

Re: CAS and CasView redesign - question if all views should share thesame indexes?

2006-12-21 Thread Thilo Goetz

Adam Lally wrote:

On 12/21/06, Thilo Goetz <[EMAIL PROTECTED]> wrote:

I didn't mean to suggest to have duplicate indexes.  What I meant to say
was, each view should have its own annotation index.  In the CAS, each
of these annotation indexes can be accessed separately.  In fact, I
think this is pretty much what you're saying as well.  I don't see a use
case for a global merged annotation index, other than tooling and
utilities.  And even for tooling, I think it makes sense to access the
annotation for each view separately.


I think maybe we should take a step back and try to agree on a few
basic things that we want to be true of CASes and CasViews.  Here are
the ideas that I had, mostly drawing on the definition in the UIMA
spec proposal.

(1) The CAS is the container for all of the analysis data (as per the
UIMA spec).  It must be possible to create FS directly on the CAS
and there must be some reasonable way to retrieve the FS in the CAS
without having to be concerned with views.


Agreed.  It should be possible to say, on the global index repository: 
give me all indexes.  This will include the global indexes, as well as 
all view-specific indexes.  You can then iterate over all data in all 
indexes, without knowing anything about views.




(2) A CasView is a way of accessing a subset of FS in the CAS.  It
must be possible
to assert that an FS is a _member_ of a CasView, and there must be
some reasonable way to retrieve the members of the CasView.


In the general CAS, we can only access those FSs that are in some index. 
 If you need to be able to retrieve any FS whatsoever, you need to 
define a bag index over all types.  I would propose to handle views the 
same way.  A FS is a member of a view iff it's contained in one of the 
indexes specific to the view.  The same FS may live in several indexes, 
belonging to different views.  That seems in accordance with the spec 
proposal.




If we need to iterate over
annotations from different views sorted by their offsets, irrespective
of the sofa they point into, we can provide a utility function that does
that on the fly.


I agree that it doesn't make much sense that if I access annotations
irrespective of sofas, they would be sorted by begin, end.  However, I
still think I might just want to get all annotations (of some type)
and not care about the order.


You can do that under my proposal: just get all annotation indexes for 
all views and iterate over each of them in turn.  If we need a utility 
function for that, it's easy enough to do.






Note however that this implies that one should never do addFsToIndexes()
on the CAS with an annotation, as it would be added to all annotation
indexes.  My suggestion implies that the index repository itself is
agnostic of views and sofas.  If you add an annotation to the wrong
repository, it's your own fault.



This behavior doesn't mesh well with the 3 ideas above.  To me,
indexing an FS in the CAS just means that I want to be able to
retrieve this FS back out of the CAS later.  It does not mean that I'm
asserting it to be a member of any view.


A view to me is just a set of indexes; moreover, it's a subset of the 
set of all indexes, which are exactly the indexes defined in the CAS. 
When I add a FS to all those indexes, it will be added to all applicable 
indexes, and that means all view indexes as well.  Alternatively, we can 
say adding an FS in the CAS means adding it to global, non-view indexes 
only.  That would make sense, but it doesn't sync with the idea that the 
CAS index repository contains all indexes, not just the global ones. 
Maybe we need a special API for that, addFsToGlobalIndexes().  So maybe 
getGlobalIndexRepository() should be called something else, to avoid 
confusion.  getCompleteIndexRepository() or something.




Moreover, I think the reverse direction should be true -- indexing an
FS in a view's index repository DOES add it (at least conceptually) to
indexes that apply to the CAS as a whole.  I liked this latter idea
because it provided a way to get at all the FS in the CAS without
having to be concerned with views.


I agree, and I hope that has been clear from my previous posts.  Any 
view-specific index is visible from the CAS, in my approach.






So to summarize, I would suggest that annotation indexes, for example,
only live in views, there is no global annotation index (neither
conceptually, nor physically).  To access annotations from the CAS, you
still need to access view-specific indexes.

Non-sofa indexes, on the other hand, only exist in the global namespace.
  The only rule of visibility is that one view can not access the
view-specific indexes of another view.  Everything else is always 
visible.


So what I haven't figured out for myself is, what makes a sofa-index a
sofa-index?  Do we need a declaration, or can we figure this out
automatically?



I think it's a view-index, not necessarily a sofa-index (for now it
doesn't matter, but we may someday break t

Re: CAS and CasView redesign - question if all views should share thesame indexes?

2006-12-21 Thread Adam Lally

On 12/21/06, Thilo Goetz <[EMAIL PROTECTED]> wrote:

I didn't mean to suggest to have duplicate indexes.  What I meant to say
was, each view should have its own annotation index.  In the CAS, each
of these annotation indexes can be accessed separately.  In fact, I
think this is pretty much what you're saying as well.  I don't see a use
case for a global merged annotation index, other than tooling and
utilities.  And even for tooling, I think it makes sense to access the
annotation for each view separately.


I think maybe we should take a step back and try to agree on a few
basic things that we want to be true of CASes and CasViews.  Here are
the ideas that I had, mostly drawing on the definition in the UIMA
spec proposal.

(1) The CAS is the container for all of the analysis data (as per the
UIMA spec).  It must be possible to create FS directly on the CAS
and there must be some reasonable way to retrieve the FS in the CAS
without having to be concerned with views.

(2) A CasView is a way of accessing a subset of FS in the CAS.  It
must be possible
to assert that an FS is a _member_ of a CasView, and there must be
some reasonable way to retrieve the members of the CasView.

(3) A CasView MAY also have a Sofa (such a CasView is called an
"anchored view") -- if it does this means that any annotation that is
a member of that view must refer to the view's Sofa.

I see indexes as providing a "reasonable" way to access either (a) FS
in the CAS as a whole or (b) the members of a view.


If we need to iterate over
annotations from different views sorted by their offsets, irrespective
of the sofa they point into, we can provide a utility function that does
that on the fly.


I agree that it doesn't make much sense that if I access annotations
irrespective of sofas, they would be sorted by begin, end.  However, I
still think I might just want to get all annotations (of some type)
and not care about the order.



Note however that this implies that one should never do addFsToIndexes()
on the CAS with an annotation, as it would be added to all annotation
indexes.  My suggestion implies that the index repository itself is
agnostic of views and sofas.  If you add an annotation to the wrong
repository, it's your own fault.



This behavior doesn't mesh well with the 3 ideas above.  To me,
indexing an FS in the CAS just means that I want to be able to
retrieve this FS back out of the CAS later.  It does not mean that I'm
asserting it to be a member of any view.

Moreover, I think the reverse direction should be true -- indexing an
FS in a view's index repository DOES add it (at least conceptually) to
indexes that apply to the CAS as a whole.  I liked this latter idea
because it provided a way to get at all the FS in the CAS without
having to be concerned with views.



So to summarize, I would suggest that annotation indexes, for example,
only live in views, there is no global annotation index (neither
conceptually, nor physically).  To access annotations from the CAS, you
still need to access view-specific indexes.

Non-sofa indexes, on the other hand, only exist in the global namespace.
  The only rule of visibility is that one view can not access the
view-specific indexes of another view.  Everything else is always visible.

So what I haven't figured out for myself is, what makes a sofa-index a
sofa-index?  Do we need a declaration, or can we figure this out
automatically?



I think it's a view-index, not necessarily a sofa-index (for now it
doesn't matter, but we may someday break the 1-1 correspondence
between views and sofas).  I think the most general design here would
be to allow a declaration saying which view(s) the index belongs to,
and/or whether it belongs to the CAS as a whole.  (I think it could be
both.)  In the absence of such a declaration, the index applies to all
views for backwards compatibility and I think maybe also applies to
the CAS as a whole.  The nice thing about the default being that it
applies to everything is that we can put off implementing
view-restricted indexes until later; I think adding them is more a
performance optimization than anything else, eliminating the creation of
unneeded indexes.
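
One possible shape for such a declaration, sketched in Java. This is purely hypothetical: no descriptor syntax is being proposed, and `IndexDeclaration`, `unrestricted` and `scoped` are invented names. The sketch treats a missing declaration as applying to every view and to the CAS as a whole, with the "CAS as a whole" part being the tentative one.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model for illustration; not a UIMA descriptor format or API.
final class IndexScopeSketch {

    static final class IndexDeclaration {
        final String label;
        private final boolean hasScope;     // false: no scope declaration present
        private final Set<String> views;    // views named by the declaration
        private final boolean casWide;      // declared as belonging to the CAS as a whole

        private IndexDeclaration(String label, boolean hasScope, Set<String> views, boolean casWide) {
            this.label = label;
            this.hasScope = hasScope;
            this.views = views;
            this.casWide = casWide;
        }

        /** Existing descriptors: no scope given, so the index applies everywhere. */
        static IndexDeclaration unrestricted(String label) {
            return new IndexDeclaration(label, false, Collections.<String>emptySet(), false);
        }

        /** New-style declaration: explicit views and/or the CAS as a whole. */
        static IndexDeclaration scoped(String label, boolean casWide, String... viewNames) {
            Set<String> names = new HashSet<String>(Arrays.asList(viewNames));
            return new IndexDeclaration(label, true, names, casWide);
        }

        boolean appliesToView(String viewName) {
            return !hasScope || views.contains(viewName);
        }

        boolean appliesToCasAsWhole() {
            return !hasScope || casWide;
        }
    }
}
```

With the default covering everything, view-restricted declarations can be added later without changing how existing descriptors behave.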

-Adam


Re: CAS and CasView redesign - question if all views should share thesame indexes?

2006-12-21 Thread Thilo Goetz

Adam Lally wrote:


I think this basically makes sense.  I want to clarify though, that
we *do* currently have different indexes for each view (for
example each view has its own annotation index, which holds  the
annotations relating to that view's sofa). This is done by replicating
the index repository for each view.


Right.  I would like to change that in the course of introducing CasViews.



A key question is "do all views have the same set of index
_definitions_?"  Currently, yes - the component descriptors declare
index definitions without reference to views, and consequently, for
every view we create an instance of each defined index.  Your note
above, and Marshall's, argue that this shouldn't necessarily be the
case -- some indexes may make sense only for certain views (but also,
only for certain components, a further complication).  I think that
probably makes sense, but I'm not sure it's a critical thing to
implement now, if we haven't seen a real use case where it's a problem
to create instances of indexes in every view even if they're not used.


Hm, somehow, we need to distinguish between indexes that are global to 
all views, and those that are local to a view.  How do we do that?




The other key idea here is the global index repository that contains
all of the indexes from all views -- we don't currently have anything
like that.  Take the annotation index as an example, and say there are
multiple views each with their own annotation index.  I also want to
enable operations on the CAS like "get me all annotations in all
views", or "get me all annotations of type Person in all views".  To
do that we also create an annotation index in the base CAS (the
"global namespace").  I think you could do such a thing in your
suggestion; if you had a global annotation index then whenever anyone
did view.addFsToIndexes(myAnnot) in any view, myAnnot would also be
added to the global annotation index (because you said the global
index is visible from the index repository of the view).  My idea was
a little different, and I guess maybe just an implementation detail.
Instead of actually adding myAnnot to a separate, global index, I
would just add it to its own view's index.  Then, when someone asks
for an iterator off of the global annotation index, I would do a
dynamic merge of the annotation indexes in all views (the same way we
do merging of indexes across types).  But the effect is the same - we
have a global index that provides access to everything that was
indexed in any view.


I didn't mean to suggest to have duplicate indexes.  What I meant to say 
was, each view should have its own annotation index.  In the CAS, each 
of these annotation indexes can be accessed separately.  In fact, I 
think this is pretty much what you're saying as well.  I don't see a use 
case for a global merged annotation index, other than tooling and 
utilities.  And even for tooling, I think it makes sense to access the 
annotation for each view separately.  If we need to iterate over 
annotations from different views sorted by their offsets, irrespective 
of the sofa they point into, we can provide a utility function that does 
that on the fly.
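
As a rough illustration of such an on-the-fly utility, here is a minimal 
Java sketch.  It assumes each view's annotation index is already sorted, 
and it assumes an ordering of ascending begin offset with ties broken by 
descending end offset.  Span and ViewMerge are hypothetical stand-ins 
made up for this example; they are not UIMA classes.

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of an on-the-fly merge; not part of any UIMA API.
final class ViewMerge {
    // Assumed ordering: ascending begin offset, then descending end offset.
    static final Comparator<Span> BY_OFFSET = (a, b) -> {
        int c = Integer.compare(a.begin(), b.begin());
        return c != 0 ? c : Integer.compare(b.end(), a.end());
    };

    // One cursor per view: the current head element plus the rest of that view.
    private record Cursor(Span head, Iterator<Span> rest) {}

    /** Lazily merges per-view lists that are each already sorted by BY_OFFSET. */
    static Iterator<Span> merged(List<List<Span>> perViewIndexes) {
        Comparator<Cursor> byHead = (x, y) -> BY_OFFSET.compare(x.head(), y.head());
        PriorityQueue<Cursor> heap = new PriorityQueue<>(byHead);
        for (List<Span> index : perViewIndexes) {
            Iterator<Span> it = index.iterator();
            if (it.hasNext()) {
                heap.add(new Cursor(it.next(), it));
            }
        }
        return new Iterator<Span>() {
            @Override
            public boolean hasNext() {
                return !heap.isEmpty();
            }

            @Override
            public Span next() {
                Cursor c = heap.remove(); // throws NoSuchElementException when exhausted
                if (c.rest().hasNext()) {
                    heap.add(new Cursor(c.rest().next(), c.rest()));
                }
                return c.head();
            }
        };
    }

    public static void main(String[] args) {
        List<Span> viewA = List.of(new Span("A", 0, 5), new Span("A", 7, 9));
        List<Span> viewB = List.of(new Span("B", 0, 3), new Span("B", 6, 8));
        merged(List.of(viewA, viewB)).forEachRemaining(System.out::println);
    }
}

// Hypothetical stand-in for an annotation: owning view name plus offsets.
record Span(String view, int begin, int end) {}
```

Nothing is copied into a second index here; the merge only holds one 
cursor per view, so the cost is paid by whoever asks for the merged 
iteration and not at indexing time.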


Note however that this implies that one should never do addFsToIndexes() 
on the CAS with an annotation, as it would be added to all annotation 
indexes.  My suggestion implies that the index repository itself is 
agnostic of views and sofas.  If you add an annotation to the wrong 
repository, it's your own fault.


So to summarize, I would suggest that annotation indexes, for example, 
only live in views, there is no global annotation index (neither 
conceptually, nor physically).  To access annotations from the CAS, you 
still need to access view-specific indexes.


Non-sofa indexes, on the other hand, only exist in the global namespace.
The only rule of visibility is that one view cannot access the
view-specific indexes of another view.  Everything else is always visible.


So what I haven't figured out for myself is, what makes a sofa-index a 
sofa-index?  Do we need a declaration, or can we figure this out 
automatically?


--Thilo



Re: CAS and CasView redesign - question if all views should share the same indexes?

2006-12-21 Thread Adam Lally

On 12/21/06, Thilo Goetz <[EMAIL PROTECTED]> wrote:

I haven't thought this through yet, but here's how I see indexes and
their relation to views right now.  Let me know if this agrees with your
views, or how it differs.

The index repository is a set of indexes, at least right now.  All it
can do is to give you indexes.  The index repository of the CAS holds
all indexes, a view's repository a subset thereof.  An index is
retrieved by name (i.e., each index has at least one name).  Currently,
if there is more than one index with the same indexing spec, but
different names, all those names actually point to the same physical
index.  However, that choice is transparent to the user.  I assume this
needs to change.  If we have more than one view, and they all have
annotation indexes, those should be different indexes (at least
conceptually, but I think also physically).  So views create a simple
sort of name space: an index can either belong to the global namespace,
or to that of a view.  All indexes can be accessed from the CAS, but
only global indexes and the indexes for the given view can be accessed
from the index repository of that view.



I think this basically makes sense.  I want to clarify though, that
that we *do* currently have different indexes for each view (for
example each view has its own annotation index, which holds the
annotations relating to that view's sofa). This is done by replicating
the index repository for each view.

A key question is "do all views have the same set of index
_definitions_?"  Currently, yes - the component descriptors declare
index definitions without reference to views, and consequently, for
every view we create an instance of each defined index.  Your note
above, and Marshall's, argue that this shouldn't necessarily be the
case -- some indexes may make sense only for certain views (but also,
only for certain components, a further complication).  I think that
probably makes sense, but I'm not sure it's a critical thing to
implement now, if we haven't seen a real use case where it's a problem
to create instances of indexes in every view even if they're not used.

The other key idea here is the global index repository that contains
all of the indexes from all views -- we don't currently have anything
like that.  Take the annotation index as an example, and say there are
multiple views each with their own annotation index.  I also want to
enable operations on the CAS like "get me all annotations in all
views", or "get me all annotations of type Person in all views".  To
do that we also create an annotation index in the base CAS (the
"global namespace").  I think you could do such a thing in your
suggestion; if you had a global annotation index then whenever anyone
did view.addFsToIndexes(myAnnot) in any view, myAnnot would also be
added to the global annotation index (because you said the global
index is visible from the index repository of the view).  My idea was
a little different, and I guess maybe just an implementation detail.
Instead of actually adding myAnnot to a separate, global index, I
would just add it to its own view's index.  Then, when someone asks
for an iterator off of the global annotation index, I would do a
dynamic merge of the annotation indexes in all views (the same way we
do merging of indexes across types).  But the effect is the same - we
have a global index that provides access to everything that was
indexed in any view.

-Adam


Re: CAS and CasView redesign - question if all views should share the same indexes?

2006-12-21 Thread Thilo Goetz
I haven't thought this through yet, but here's how I see indexes and 
their relation to views right now.  Let me know if this agrees with your 
views, or how it differs.


The index repository is a set of indexes, at least right now.  All it 
can do is to give you indexes.  The index repository of the CAS holds 
all indexes, a view's repository a subset thereof.  An index is 
retrieved by name (i.e., each index has at least one name).  Currently, 
if there is more than one index with the same indexing spec, but 
different names, all those names actually point to the same physical 
index.  However, that choice is transparent to the user.  I assume this 
needs to change.  If we have more than one view, and they all have 
annotation indexes, those should be different indexes (at least 
conceptually, but I think also physically).  So views create a simple 
sort of name space: an index can either belong to the global namespace, 
or to that of a view.  All indexes can be accessed from the CAS, but 
only global indexes and the indexes for the given view can be accessed 
from the index repository of that view.
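
To make the visibility rule concrete, here is a minimal Java sketch of 
the namespace idea.  Everything in it is hypothetical: IndexNamespaces, 
the GLOBAL marker and the method names are invented for illustration and 
are not UIMA APIs.  It also assumes that a view's own index wins over a 
global index of the same name, which the description above does not 
settle.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical model of "global namespace plus one namespace per view".
final class IndexNamespaces {
    static final String GLOBAL = ""; // assumed marker for the global namespace

    // namespace (view name or GLOBAL) -> index name -> index object
    private final Map<String, Map<String, Object>> byNamespace = new LinkedHashMap<>();

    void define(String namespace, String indexName, Object index) {
        byNamespace.computeIfAbsent(namespace, k -> new LinkedHashMap<>())
                   .put(indexName, index);
    }

    /** CAS-level lookup: any index is reachable if its namespace is named. */
    Optional<Object> fromCas(String namespace, String indexName) {
        Map<String, Object> indexes = byNamespace.get(namespace);
        return indexes == null ? Optional.empty() : Optional.ofNullable(indexes.get(indexName));
    }

    /** View-level lookup: own indexes first (an assumption), then global ones. */
    Optional<Object> fromView(String view, String indexName) {
        Optional<Object> own = fromCas(view, indexName);
        return own.isPresent() ? own : fromCas(GLOBAL, indexName);
    }

    public static void main(String[] args) {
        IndexNamespaces repo = new IndexNamespaces();
        repo.define("viewA", "AnnotationIndex", "annotations of viewA");
        repo.define("viewB", "AnnotationIndex", "annotations of viewB");
        repo.define(GLOBAL, "DocumentMetaIndex", "global, non-sofa index");
        System.out.println(repo.fromView("viewA", "AnnotationIndex"));
        System.out.println(repo.fromView("viewA", "DocumentMetaIndex"));
        System.out.println(repo.fromCas("viewB", "AnnotationIndex"));
    }
}
```

The view-level lookup never consults another view's namespace, so two 
views can each have an index called "AnnotationIndex" without seeing 
each other's contents, while the CAS-level lookup can reach both.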


--Thilo


Re: CAS and CasView redesign - question if all views should share the same indexes?

2006-12-20 Thread Adam Lally

On 12/19/06, Marshall Schor <[EMAIL PROTECTED]> wrote:

If we think of a CasView as a way of accessing a subset of the data
in the CAS, what are the pluses and minuses of having every view
have the same (shared) index definitions?  Would it make more sense
to have each view have its own non-shared set of indexes / definitions?



Maybe... we might extend the index descriptor format to allow
specifying a set of view names to which the index applies.  And in the
absence of such a specification, the index might apply only to the views of
the component's declared input and output sofas.  For "sofa-unaware"
annotators (or whatever we're calling them this week ;)  this would
mean that the index only applies to the one view that they operate on
(which is specified by sofa mappings).  Although I'm concerned what
happens if sofa mapping becomes dynamic.

All in all, without a concrete use case where there is currently a
significant performance issue, I would put off adding this feature.



But some components need specific indexes (and type priorities :-)
in order to correctly iterate through sets of FSs.  In this case, the
component part is closely associated with the index specification.

For better modularity - if I had a component operating on a particular
view, needing a particular index specification, these might be
associated to the component - and having such an index as a "global"
thing might lead to unwanted "collisions" in the index "name-space",
although this could be minimized by having some uniqueness to the
index name.  So if I called the index "ComponentAsIndex", it would
make more sense if this was associated only with Component A, and not
globally.  This doesn't quite match associating the index with just one
view, I admit.



Component-specific indexes also seem like a good idea (to do someday).
One reason is to allow an optimization for remote annotators.  There's
no reason to actually build the index on the client side if it's only
needed by a remote annotator, if the index isn't serialized to the
remote node.  We need only keep a list of indexed FSs, and build the
index on the remote node as we do already.

Also we can deal with name collisions - two annotators could declare
different indexes with the same label, but since they are specific to
the component that is OK.  When each component executes
IndexRepository.getIndex(label), it would get the index that it itself
had declared.  This could be implemented the same way we are currently
handling Sofa mapping - the CAS "knows" what annotator is currently
processing it.  Of course if two annotators declared indexes over the
same type (or where one type is an ancestor of the other) with the
same sort keys, they should be merged into one index in the
implementation, even if they have different labels.
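
Here is a minimal Java sketch of that label resolution, just to show the
shape of it.  ComponentIndexRegistry, IndexSpec and all method names are
hypothetical and not UIMA APIs.  The sketch only shares a physical index
when the type and sort keys are identical; the ancestor-type case
mentioned above is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: labels are scoped to the declaring component,
// while identical index definitions share one physical index.
final class ComponentIndexRegistry {
    private record Key(String component, String label) {}

    private final Map<Key, IndexSpec> declared = new HashMap<>();
    private final Map<IndexSpec, List<Object>> physical = new HashMap<>();
    private String currentComponent; // which annotator is currently processing

    void declare(String component, String label, IndexSpec spec) {
        declared.put(new Key(component, label), spec);
        // Reuse the physical index if an identical definition already exists.
        physical.computeIfAbsent(spec, s -> new ArrayList<>());
    }

    void setCurrentComponent(String component) {
        currentComponent = component;
    }

    /** Resolves the label against the declarations of the current component only. */
    List<Object> getIndex(String label) {
        IndexSpec spec = declared.get(new Key(currentComponent, label));
        if (spec == null) {
            throw new IllegalArgumentException(
                "No index '" + label + "' declared by " + currentComponent);
        }
        return physical.get(spec);
    }

    public static void main(String[] args) {
        ComponentIndexRegistry reg = new ComponentIndexRegistry();
        reg.declare("A", "myIndex", new IndexSpec("Token", List.of("begin")));
        reg.declare("B", "myIndex", new IndexSpec("Sentence", List.of("begin")));
        reg.declare("C", "tokens", new IndexSpec("Token", List.of("begin")));
        reg.setCurrentComponent("A");
        List<Object> forA = reg.getIndex("myIndex");
        reg.setCurrentComponent("C");
        System.out.println("A and C share an index: " + (forA == reg.getIndex("tokens")));
    }
}

// Hypothetical index definition: indexed type name plus ordered sort keys.
record IndexSpec(String type, List<String> sortKeys) {}
```

Keying the declarations by (component, label) is what lets two
annotators reuse a label without colliding, and keying the physical
indexes by definition is what lets differently labelled but identical
declarations be merged.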

-Adam


CAS and CasView redesign - question if all views should share the same indexes?

2006-12-19 Thread Marshall Schor

If we think of a CasView as a way of accessing a subset of the data
in the CAS, what are the pluses and minuses of having every view
have the same (shared) index definitions?  Would it make more sense
to have each view have its own non-shared set of indexes / definitions?

Pluses:
 - A view which wanted to only index one kind of thing would not need
   to create instances of all the other indexes (which would be unused).

Minuses: 
 - more complexity?


Other topics around indexes include how to think about what the
index is logically associated with.  Using the DB analogy - indexes
are "extra" - only serving to speed things up.  In this view, they are
associated with "assemblers" who are doing fine-tuning, space/time
trade-offs. 


But some components need specific indexes (and type priorities :-)
in order to correctly iterate through sets of FSs.  In this case, the
component part is closely associated with the index specification.

For better modularity - if I had a component operating on a particular
view, needing a particular index specification, these might be
associated to the component - and having such an index as a "global"
thing might lead to unwanted "collisions" in the index "name-space",
although this could be minimized by having some uniqueness to the
index name.  So if I called the index "ComponentAsIndex", it would
make more sense if this was associated only with Component A, and not
globally.  This doesn't quite match associating the index with just one
view, I admit. 


-Marshall