Re: CAS and CasView redesign - question if all views should share the same indexes?
Thilo Goetz wrote:
> Marshall Schor wrote:
>> Thilo Goetz wrote:
>>> From a performance perspective, I'd vote for having the filtering on the iterator side of things, where it already is. If one annotator decides it needs a "filtered index" over annotations, that can affect the performance of all other annotators as well, because then all annotations go not only into the regular annotation index, but into the additional index as well.
>> Wouldn't the performance be better with the filtering on the indexing side, if the #writes/updates << #read accesses to the filtered set?
> No, because the way I see it, no filtering would ever be necessary. If you have a different annotation index for each anchored view, you don't need to do any filtering at indexing time, nor at access time.

Good point :-) I was assuming, just for the purpose of exploring (not advocating :-) ), having only one index set, where "add-to-indexes" would go through all of the indexes and the "filter" would only add the item to the right view. Your design point assumed different indexes per "anchored view".

-Marshall
Re: CAS and CasView redesign - question if all views should share the same indexes?
Marshall Schor wrote:
> Thilo Goetz wrote:
>> From a performance perspective, I'd vote for having the filtering on the iterator side of things, where it already is. If one annotator decides it needs a "filtered index" over annotations, that can affect the performance of all other annotators as well, because then all annotations go not only into the regular annotation index, but into the additional index as well.
> Wouldn't the performance be better with the filtering on the indexing side, if the #writes/updates << #read accesses to the filtered set?

No, because the way I see it, no filtering would ever be necessary. If you have a different annotation index for each anchored view, you don't need to do any filtering at indexing time, nor at access time.

--Thilo
Re: CAS and CasView redesign - question if all views should share the same indexes?
Thilo Goetz wrote:
> Marshall Schor wrote:
>> Adam Lally wrote:
>>> On 12/22/06, Marshall Schor <[EMAIL PROTECTED]> wrote:
>>>> If we had filtering predicates as part of an index specification, then we could create indexes over subsets of types quite arbitrarily. Could this more general mechanism serve this purpose better than views?
>>> I'm not sure what you mean by "subsets of types". Do you mean "subsets of objects (FeatureStructures)", as in a filter that checks arbitrary feature values to decide whether an object gets added to the index?
>> Yes.
>>> Could be... this sounds like it's saying that an index is a way to optimize what could be implemented by an annotator using a filter over all FS in the CAS followed by a sort.
>> Right.
> From a performance perspective, I'd vote for having the filtering on the iterator side of things, where it already is. If one annotator decides it needs a "filtered index" over annotations, that can affect the performance of all other annotators as well, because then all annotations go not only into the regular annotation index, but into the additional index as well.

Wouldn't the performance be better with the filtering on the indexing side, if the #writes/updates << #read accesses to the filtered set? I'm thinking that the best thing is to have clear, documented performance expectations, and let the developer choose.

-Marshall
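To make the trade-off in this exchange concrete, here is a minimal Java sketch. It is not UIMA code: `ToyAnnotation`, `ToyIndexRepository`, and the `predicateEvaluations` counter are hypothetical names invented for illustration. It only shows where the filter cost lands: once per add with index-side filtering, once per element per read with iterator-side filtering.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Toy stand-in for an annotation; hypothetical, not a UIMA class. */
final class ToyAnnotation {
    final String type;
    final int begin;
    final int end;

    ToyAnnotation(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }
}

/** Toy repository with one regular index and, optionally, one filtered index. */
final class ToyIndexRepository {
    private final List<ToyAnnotation> regularIndex = new ArrayList<>();
    private final List<ToyAnnotation> filteredIndex = new ArrayList<>();
    private final Predicate<ToyAnnotation> indexFilter; // null means "no filtered index"
    int predicateEvaluations = 0; // counts every run of a filter predicate

    ToyIndexRepository(Predicate<ToyAnnotation> indexFilter) {
        this.indexFilter = indexFilter;
    }

    /** Index-side filtering: every add pays for the predicate, whoever does the add. */
    void addFsToIndexes(ToyAnnotation a) {
        regularIndex.add(a);
        if (indexFilter != null) {
            predicateEvaluations++;
            if (indexFilter.test(a)) {
                filteredIndex.add(a);
            }
        }
    }

    /** Iterator-side filtering: every read scans the whole regular index. */
    List<ToyAnnotation> readWithIteratorFilter(Predicate<ToyAnnotation> filter) {
        List<ToyAnnotation> result = new ArrayList<>();
        for (ToyAnnotation a : regularIndex) {
            predicateEvaluations++;
            if (filter.test(a)) {
                result.add(a);
            }
        }
        return result;
    }

    /** Reading the filtered index runs no predicate at all. */
    List<ToyAnnotation> readFilteredIndex() {
        return new ArrayList<>(filteredIndex);
    }
}
```

With three adds and two reads, the index-side variant runs the predicate three times and the iterator-side variant six times; the balance flips when adds outnumber reads, which is the #writes versus #reads question raised above.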
Re: CAS and CasView redesign - question if all views should share the same indexes?
Marshall Schor wrote:
> Adam Lally wrote:
>> On 12/22/06, Marshall Schor <[EMAIL PROTECTED]> wrote:
>>> If we had filtering predicates as part of an index specification, then we could create indexes over subsets of types quite arbitrarily. Could this more general mechanism serve this purpose better than views?
>> I'm not sure what you mean by "subsets of types". Do you mean "subsets of objects (FeatureStructures)", as in a filter that checks arbitrary feature values to decide whether an object gets added to the index?
> Yes.
>> Could be... this sounds like it's saying that an index is a way to optimize what could be implemented by an annotator using a filter over all FS in the CAS followed by a sort.
> Right.

From a performance perspective, I'd vote for having the filtering on the iterator side of things, where it already is. If one annotator decides it needs a "filtered index" over annotations, that can affect the performance of all other annotators as well, because then all annotations go not only into the regular annotation index, but into the additional index as well.

--Thilo
Re: CAS and CasView redesign - question if all views should share the same indexes?
Adam Lally wrote:
> On 12/22/06, Marshall Schor <[EMAIL PROTECTED]> wrote:
>> If we had filtering predicates as part of an index specification, then we could create indexes over subsets of types quite arbitrarily. Could this more general mechanism serve this purpose better than views?
> I'm not sure what you mean by "subsets of types". Do you mean "subsets of objects (FeatureStructures)", as in a filter that checks arbitrary feature values to decide whether an object gets added to the index?

Yes.

> Could be... this sounds like it's saying that an index is a way to optimize what could be implemented by an annotator using a filter over all FS in the CAS followed by a sort.

Right.

>>> Going back to my hypothetical annotator that created an annotation off the base CAS by calling CAS.createAnnotation(begin, end, Sofa). In our current implementation this isn't useful because the annotation has to be indexed to be retrievable, and the only way to index it is to add it to a view. Are there any other options we could consider?
>> This doesn't seem correct: an annotation doesn't have to be indexed to be retrievable. It could be referenced by some chain of FS, with the starting FS of course being indexed. So I could have an FSArray, for instance, of Annotations, and index the FSArray.
> Yes, of course; I glossed over that detail. I don't think it really affects my point, though. The only way to index the FSArray containing my annotation would be to add it to a view, which I don't want to do. Is there a way to make my Annotation accessible from the base CAS without having to go through a view first?

I guess I missed the point ...

>> Indexes are sometimes used as a performance optimization, but other times they're part of a component's logic, as when a component depends on a particular sorting order.
> The annotator could do the sorting itself. But I'll correct my statement to say that indexes are both a performance optimization and a convenience.

I think you may be technically correct, but my point was that users tend to think of indexes differently (as more than just optimizations and conveniences): they think of them as part of their component logic. I'm thinking of the users that use the special iterators over Annotation types, and depend on things like Type Priorities for correct operation of their components.

-Marshall
Re: CAS and CasView redesign - question if all views should share the same indexes?
On 12/22/06, Marshall Schor <[EMAIL PROTECTED]> wrote:
> If we had filtering predicates as part of an index specification, then we could create indexes over subsets of types quite arbitrarily. Could this more general mechanism serve this purpose better than views?

I'm not sure what you mean by "subsets of types". Do you mean "subsets of objects (FeatureStructures)", as in a filter that checks arbitrary feature values to decide whether an object gets added to the index?

Could be... this sounds like it's saying that an index is a way to optimize what could be implemented by an annotator using a filter over all FS in the CAS followed by a sort.

>> Going back to my hypothetical annotator that created an annotation off the base CAS by calling CAS.createAnnotation(begin, end, Sofa). In our current implementation this isn't useful because the annotation has to be indexed to be retrievable, and the only way to index it is to add it to a view. Are there any other options we could consider?
> This doesn't seem correct: an annotation doesn't have to be indexed to be retrievable. It could be referenced by some chain of FS, with the starting FS of course being indexed. So I could have an FSArray, for instance, of Annotations, and index the FSArray.

Yes, of course; I glossed over that detail. I don't think it really affects my point, though. The only way to index the FSArray containing my annotation would be to add it to a view, which I don't want to do. Is there a way to make my Annotation accessible from the base CAS without having to go through a view first?

> Indexes are sometimes used as a performance optimization, but other times they're part of a component's logic, as when a component depends on a particular sorting order.

The annotator could do the sorting itself. But I'll correct my statement to say that indexes are both a performance optimization and a convenience.

-Adam
Re: CAS and CasView redesign - question if all views should share the same indexes?
Adam Lally wrote:
> On 12/22/06, Thilo Goetz <[EMAIL PROTECTED]> wrote:
>> Adam Lally wrote:
>>> (1) The CAS is the container for all of the analysis data (as per the UIMA spec). It must be possible to create FS directly on the CAS and there must be some reasonable way to retrieve the FS in the CAS without having to be concerned with views.
>> This seems to be an important point, and one that I still haven't really understood. Why is this necessary? An anchored view is the only way to contain a subject of analysis. UIMA without sofas (in the conceptual sense) is nothing. Why do I need to be able to access annotations without being concerned about views? Conceptually and in an ideal world, that is. Don't get me wrong, I'm not opposed to this. I simply don't understand the motivation, and I would like to.
> That's a fair question...
>
> One thing I want to clarify is that UIMA without views doesn't mean UIMA without Sofas. You should be able to access the Sofas (all of them) directly from the CAS. They're just FeatureStructures after all, and our current implementation does have a Sofa index, though it's hidden at the moment. So one way of working with the CAS without views might be for an annotator to look through the Sofa index for a Sofa it wants to analyze and create some annotations over it (I suggested a CAS.createAnnotation(begin, end, Sofa) method for this purpose).
>
> Views are a way that we think is useful to organize feature structures in the CAS, and one key way to organize them is to collect all the annotations referring to a single sofa into one (anchored) view.

If we had filtering predicates as part of an index specification, then we could create indexes over subsets of types quite arbitrarily. Could this more general mechanism serve this purpose better than views?

> But is this the only way to do things in the UIMA standard? That proved to be a tough sell to the people who worked on the UIMA spec proposal, who were thinking not just about our implementation but also about other UIM frameworks/systems that do things differently. So the state of things for the UIMA spec proposal right now is that views are an optional way of doing things.
>
> Now on top of that we have to figure out what to do with indexes, which aren't part of the UIMA spec at the moment. In our current implementation indexes only operate on views. Maybe it's OK to leave it that way for now, but I thought it was worth exploring if there's a way to have indexes work over the CAS as a whole, as well.
>
> Going back to my hypothetical annotator that created an annotation off the base CAS by calling CAS.createAnnotation(begin, end, Sofa). In our current implementation this isn't useful because the annotation has to be indexed to be retrievable, and the only way to index it is to add it to a view. Are there any other options we could consider?

This doesn't seem correct: an annotation doesn't have to be indexed to be retrievable. It could be referenced by some chain of FS, with the starting FS of course being indexed. So I could have an FSArray, for instance, of Annotations, and index the FSArray.

> If we can't or don't want to change the fact that indexes only operate on views, we could provide an iterator that walks the heap and returns everything regardless of whether it's indexed. Then we'd be saying: neither views nor indexes are required; they're a performance optimization.

Indexes are sometimes used as a performance optimization, but other times they're part of a component's logic, as when a component depends on a particular sorting order.

-Marshall
Re: CAS and CasView redesign - question if all views should share the same indexes?
On 12/22/06, Thilo Goetz <[EMAIL PROTECTED]> wrote:
> Eddie Epstein wrote:
>> Doesn't that previous discussion read on the topic of global indexes?
> Is it my brain, or this sentence, that doesn't make any sense ;-) Could you explain?

Must be Eddie's Southern US dialect. ;) I'm not familiar with that use of "read on" either. From context I'm guessing it's supposed to mean the same thing as "speak to", as in "has relevance to".

-Adam
Re: CAS and CasView redesign - question if all views should share the same indexes?
On 12/22/06, Thilo Goetz <[EMAIL PROTECTED]> wrote:
> Adam Lally wrote:
>> (1) The CAS is the container for all of the analysis data (as per the UIMA spec). It must be possible to create FS directly on the CAS and there must be some reasonable way to retrieve the FS in the CAS without having to be concerned with views.
> This seems to be an important point, and one that I still haven't really understood. Why is this necessary? An anchored view is the only way to contain a subject of analysis. UIMA without sofas (in the conceptual sense) is nothing. Why do I need to be able to access annotations without being concerned about views? Conceptually and in an ideal world, that is. Don't get me wrong, I'm not opposed to this. I simply don't understand the motivation, and I would like to.

That's a fair question...

One thing I want to clarify is that UIMA without views doesn't mean UIMA without Sofas. You should be able to access the Sofas (all of them) directly from the CAS. They're just FeatureStructures after all, and our current implementation does have a Sofa index, though it's hidden at the moment. So one way of working with the CAS without views might be for an annotator to look through the Sofa index for a Sofa it wants to analyze and create some annotations over it (I suggested a CAS.createAnnotation(begin, end, Sofa) method for this purpose).

Views are a way that we think is useful to organize feature structures in the CAS, and one key way to organize them is to collect all the annotations referring to a single sofa into one (anchored) view. But is this the only way to do things in the UIMA standard? That proved to be a tough sell to the people who worked on the UIMA spec proposal, who were thinking not just about our implementation but also about other UIM frameworks/systems that do things differently. So the state of things for the UIMA spec proposal right now is that views are an optional way of doing things.

Now on top of that we have to figure out what to do with indexes, which aren't part of the UIMA spec at the moment. In our current implementation indexes only operate on views. Maybe it's OK to leave it that way for now, but I thought it was worth exploring if there's a way to have indexes work over the CAS as a whole, as well.

Going back to my hypothetical annotator that created an annotation off the base CAS by calling CAS.createAnnotation(begin, end, Sofa). In our current implementation this isn't useful because the annotation has to be indexed to be retrievable, and the only way to index it is to add it to a view. Are there any other options we could consider?

If we can't or don't want to change the fact that indexes only operate on views, we could provide an iterator that walks the heap and returns everything regardless of whether it's indexed. Then we'd be saying: neither views nor indexes are required; they're a performance optimization.

-Adam
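The "iterator that walks the heap" idea could look roughly like the following Java sketch. Everything here is hypothetical (`HeapFs`, `HeapCasSketch`, and their methods are invented for illustration and are not the UIMA API); the "heap" is modeled as a plain list of every feature structure ever created, independent of any view's indexes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy feature structure identified by its position on the toy heap. */
final class HeapFs {
    final int address;
    final String type;

    HeapFs(int address, String type) {
        this.address = address;
        this.type = type;
    }
}

/** Toy CAS: a heap of all created FS plus one index list per named view. */
final class HeapCasSketch {
    private final List<HeapFs> heap = new ArrayList<>();
    private final Map<String, List<HeapFs>> viewIndexes = new LinkedHashMap<>();

    /** Creating an FS puts it on the heap but does not index it anywhere. */
    HeapFs createFs(String type) {
        HeapFs fs = new HeapFs(heap.size(), type);
        heap.add(fs);
        return fs;
    }

    /** Indexing is a separate, per-view step. */
    void addFsToIndexes(String viewName, HeapFs fs) {
        viewIndexes.computeIfAbsent(viewName, k -> new ArrayList<>()).add(fs);
    }

    /** What a view's index can see. */
    List<HeapFs> indexedIn(String viewName) {
        List<HeapFs> index = viewIndexes.get(viewName);
        return index == null
                ? Collections.<HeapFs>emptyList()
                : Collections.unmodifiableList(index);
    }

    /** Walks the heap: returns every FS, indexed or not, in creation order. */
    Iterator<HeapFs> heapIterator() {
        return Collections.unmodifiableList(heap).iterator();
    }
}
```

In this toy model an FS that was created but never added to a view is invisible to `indexedIn(...)` yet still reachable through `heapIterator()`, which is the sense in which views and indexes would become optional.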
Re: CAS and CasView redesign - question if all views should share the same indexes?
Eddie Epstein wrote:
> We had previously discussed that using the base CAS as a single global view was not useful for applications because of potential collisions, and therefore recommended that a collection of multi-view analytics that need a single "global" view should create a named view for that purpose.

So far I'm with you. Marshall had mentioned this as well. I kind of like the idea, I'm just wondering how complex this will be to specify. To me, the core idea is that a view holds a subset of all indexes. Those indexes could be shared by other views, if that makes sense.

> Doesn't that previous discussion read on the topic of global indexes?

Is it my brain, or this sentence, that doesn't make any sense ;-) Could you explain?

--Thilo
Re: CAS and CasView redesign - question if all views should share the same indexes?
> We had previously discussed that using the base CAS as a single global view was not useful for applications because of potential collisions, and therefore recommended that a collection of multi-view analytics that need a single "global" view should create a named view for that purpose. Doesn't that previous discussion read on the topic of global indexes?

I remember that discussion, but I guess I'm flip-flopping. It's true that global indexes would need to be used with some care; annotators can't assume no one else is writing to them. Used appropriately, I don't see this as likely to cause a problem (but if you want to argue otherwise, maybe you can convince me to flop back to the other side again).

I see only two options here:

(a) Have a global index of some kind in order to allow annotators to work on the CAS without regard for views.

(b) Require that to do any work with the CAS you need to work with a view. I believe that this is inconsistent with the OASIS architecture proposal.

-Adam
Re: CAS and CasView redesign - question if all views should share the same indexes?
Adam Lally wrote:
>> Now what you say about sofas is interesting. Currently, an index knows nothing of views or sofas. The only thing that is checked when adding a FS to an index is the FS's type. Are you suggesting that there should be special code that prevents me from adding an annotation that I created in one view to the index repository of another view?
> In fact I believe that code already exists and it's not that complicated (in our current implementation anyway). Each annotation has a feature that is a reference to the Sofa, and the view has a reference to its Sofa. So I think this is just an integer comparison between these two values.

Yes, but it's a check that is redundant in 99% of all cases. We could also handle this at the iterator end of things, with an option that checks sofa/view membership. We keep piling these things on, and we have enough problems selling UIMA performance as is.

> This constraint is mentioned in the OASIS spec: an "anchored view" is a view that's tied to a Sofa, and it is a constraint that all annotations that are members of an anchored view refer to that view's Sofa.
>
>> A really simple approach would be to say that there are view-local index definitions, and CAS-global index definitions. For the view-local ones, each view would have its own instance (and every view would have one). For the CAS-global ones, there would be one instance in the CAS, shared by all views. However, that is just my current naive view of things. Much more complicated schemes could be envisioned.
> I'm not too worried about the specifiers. A scheme like this would be fine and fairly easy to add, if we first decide that this idea of separate local/global index definitions is the way we want to go.

I am worried about our specifiers because of their complexity. To this day, I have not fully understood the parameter settings in our specifiers, for example, and I know I'm not the only one. The more complexity we add, the higher the barrier to entry for a new UIMA user.

>> Marshall Schor wrote: Re: Need for "Global indexes"
>> What is the use case for the global view set of indexes? I can't recall the use case for this, beyond being able to get all the data. This thread has suggested other utilities that can effectively "merge" the results from other views' index instances. Are there other use cases?
> A hypothetical use case is that I want to get all Person mentions (annotations) in the CAS, say because I'm going to populate a database with their covered text and perhaps other feature values. Of course, you could walk all views to do that. But I'm suggesting you shouldn't have to. We could add a utility method to hide that detail; I guess I'm OK with that.
>
> Basically, this discussion is more about getting the concepts straight than adding new functionality. I'll say again:
>
> (1) The CAS is the container for all of the analysis data (as per the UIMA spec). It must be possible to create FS directly on the CAS and there must be some reasonable way to retrieve the FS in the CAS without having to be concerned with views.

This seems to be an important point, and one that I still haven't really understood. Why is this necessary? An anchored view is the only way to contain a subject of analysis. UIMA without sofas (in the conceptual sense) is nothing. Why do I need to be able to access annotations without being concerned about views? Conceptually and in an ideal world, that is. Don't get me wrong, I'm not opposed to this. I simply don't understand the motivation, and I would like to.

--Thilo
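For reference, the two places the sofa/view membership check could live can be sketched in a few lines of Java. This is a toy model under stated assumptions, not the existing implementation: `SofaAnnotation`, `AnchoredViewSketch`, and the integer `sofaRef` are hypothetical stand-ins for "a feature that is a reference to the Sofa".

```java
import java.util.ArrayList;
import java.util.List;

/** Toy annotation carrying an integer reference to its Sofa. */
final class SofaAnnotation {
    final int sofaRef;
    final int begin;
    final int end;

    SofaAnnotation(int sofaRef, int begin, int end) {
        this.sofaRef = sofaRef;
        this.begin = begin;
        this.end = end;
    }
}

/** Toy anchored view: tied to one Sofa, with a single annotation index. */
final class AnchoredViewSketch {
    private final int sofaRef;
    private final List<SofaAnnotation> annotationIndex = new ArrayList<>();

    AnchoredViewSketch(int sofaRef) {
        this.sofaRef = sofaRef;
    }

    /** Option 1: check at indexing time; one integer comparison per add. */
    void addChecked(SofaAnnotation a) {
        if (a.sofaRef != sofaRef) {
            throw new IllegalArgumentException(
                    "annotation refers to Sofa " + a.sofaRef
                            + " but this view is anchored to Sofa " + sofaRef);
        }
        annotationIndex.add(a);
    }

    /** Option 2: the index stays agnostic of Sofas; nothing is checked on add. */
    void addUnchecked(SofaAnnotation a) {
        annotationIndex.add(a);
    }

    /** Option 2, continued: an optional membership check on the iterator side. */
    List<SofaAnnotation> membersOnly() {
        List<SofaAnnotation> members = new ArrayList<>();
        for (SofaAnnotation a : annotationIndex) {
            if (a.sofaRef == sofaRef) {
                members.add(a);
            }
        }
        return members;
    }

    int size() {
        return annotationIndex.size();
    }
}
```

The checked add costs one comparison on every call even when it is redundant; the unchecked add is free but leaves any stray annotation in the index until a reader asks for `membersOnly()`.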
Re: CAS and CasView redesign - question if all views should share the same indexes?
> > * There is one "Global Index Repository" in the CAS (accessible by > CAS.getGlobalIndexRepository() and CAS.addFsToGlobalIndexes()) > > * Each view has its own Index Repository, containing only the indexes > that are specific to that view. (accessible by > CasView.getIndexRepository() and CasView.addFsToIndexes()). > > * There may be an additional method CAS.getCompleteIndexRepository() > which returns an IndexRepository that contains ALL indexes in the > entire CAS, including the global indexes as well as all indexes in all > views. However, I argued that this index repository should be > read-only (i.e. not support addFS()), because adding an FS to all > views in one fell swoop seemed like to dangerous an operation. > We had previously discussed that using the base CAS as a single global view was not useful for applications because of potential collisions, and therefore recommended that a collection of multi-view analytics that need a single "global" view should create a named view for that purpose. Doesn't that previous discussion read on the topic of global indexes? Eddie
Re: CAS and CasView redesign - question if all views should share the same indexes?
A collection of quotes from Thilo about global indexes. After reading all these I think I finally might be on the same wavelength... (we'll see in a moment :)

On 12/22/06, Thilo Goetz <[EMAIL PROTECTED]> wrote:
> Global indexes should be shared. That is also the spirit of the OASIS draft, I think. The draft spec doesn't talk about indexes, true, but it certainly has been informed by our implementation.
>
> What I mean is that there is no way for a global index not to be part of a view. Or without the double negation: every global index is part of every view.
>
> The only rule of visibility is that one view cannot access the view-specific indexes of another view. Everything else is always visible.
>
> All CAS-global indexes are visible from/belong to every view, view-local indexes have one instance per view, and no global instance. That's what I meant, but as I said, much more sophisticated schemes could be imagined.

So, the suggestion is that for an index definition that's declared "global", there would be only one instance in the entire CAS. But that one instance would be in the Index Repository of all views. Therefore, calling myView.addFsToIndexes(fs) would add fs to this global index, whereupon it would be visible from all other views (via myOtherView.getIndexRepository().getIndex(name).iterator()).

I'm a little uncomfortable with how this makes it nearly transparent whether an index is local or global. I don't think this fits so well with some basic ideas of Views, namely:

(a) There should be a straightforward operation that adds something to a view without impacting other views.

(b) There should be a straightforward operation that gets me the members of a view.

I think of indexes as our implementation of view membership. So I want to think of operation (a) as being myView.addFsToIndexes(fs) and operation (b) as myView.getIndexRepository().getIndexes() [although I'd like a more convenient way]. The global indexes don't fit so well there... because myView.addFsToIndexes(fs) violates the "without impacting other views" restriction in (a).

How about the slightly different thought (which I think we were at least close to agreeing to yesterday):

* There is one "Global Index Repository" in the CAS (accessible by CAS.getGlobalIndexRepository() and CAS.addFsToGlobalIndexes()).

* Each view has its own Index Repository, containing only the indexes that are specific to that view (accessible by CasView.getIndexRepository() and CasView.addFsToIndexes()).

* There may be an additional method CAS.getCompleteIndexRepository() which returns an IndexRepository that contains ALL indexes in the entire CAS, including the global indexes as well as all indexes in all views. However, I argued that this index repository should be read-only (i.e. not support addFS()), because adding an FS to all views in one fell swoop seemed like too dangerous an operation.

-Adam
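The three-repository proposal can be sketched in Java as follows. The method names getGlobalIndexRepository(), addFsToGlobalIndexes(), getIndexRepository(), addFsToIndexes(), and getCompleteIndexRepository() come from the proposal itself; everything else (the `Sketch*` class names, feature structures modeled as plain strings, `getView`, `getAll`, and the snapshot semantics of the complete repository) is an assumption made only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy index repository holding FS labels; not the real UIMA repository type. */
final class SketchIndexRepository {
    private final List<String> entries;
    private final boolean readOnly;

    SketchIndexRepository(List<String> initial, boolean readOnly) {
        this.entries = new ArrayList<>(initial);
        this.readOnly = readOnly;
    }

    void addFs(String fs) {
        if (readOnly) {
            throw new UnsupportedOperationException("this repository is read-only");
        }
        entries.add(fs);
    }

    List<String> getAll() {
        return Collections.unmodifiableList(entries);
    }
}

/** Toy view: owns only its view-specific repository. */
final class SketchCasView {
    private final SketchIndexRepository repository =
            new SketchIndexRepository(Collections.<String>emptyList(), false);

    SketchIndexRepository getIndexRepository() {
        return repository;
    }

    void addFsToIndexes(String fs) {
        repository.addFs(fs); // affects this view only
    }
}

/** Toy CAS: one global repository plus one repository per view. */
final class SketchCas {
    private final SketchIndexRepository global =
            new SketchIndexRepository(Collections.<String>emptyList(), false);
    private final Map<String, SketchCasView> views = new LinkedHashMap<>();

    SketchCasView getView(String name) {
        return views.computeIfAbsent(name, k -> new SketchCasView());
    }

    SketchIndexRepository getGlobalIndexRepository() {
        return global;
    }

    void addFsToGlobalIndexes(String fs) {
        global.addFs(fs);
    }

    /** Read-only snapshot of the global entries followed by every view's entries. */
    SketchIndexRepository getCompleteIndexRepository() {
        List<String> all = new ArrayList<>(global.getAll());
        for (SketchCasView view : views.values()) {
            all.addAll(view.getIndexRepository().getAll());
        }
        return new SketchIndexRepository(all, true);
    }
}
```

In this model, adding to one view never changes what another view sees, writing to the global repository is always an explicit call on the CAS, and the complete repository rejects addFs(), which matches properties (a) and (b) above.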
Re: CAS and CasView redesign - question if all views should share the same indexes?
A few quick comments here, then I'll deal with the big issues in another email.

On 12/22/06, Thilo Goetz <[EMAIL PROTECTED]> wrote:
> Marshall Schor wrote:
>> In this discussion, I think some confusion arises from the use of "index" to mean both the index definition, and an instance (perhaps associated with a particular view) of that index definition.
>>
>> Also, in this discussion, the term CAS seems sometimes to be specific to what we might call the base-view, versus other specific "views" of the CAS.
>>
>> If we more clearly distinguish these, the conversation may be easier to follow. I've tried to distinguish them below:
> Ever since I've grasped the major concepts, I think we've been communicating quite well ;-)

I agree, though being clearer on terminology can't hurt. I've tried to adhere to:

* "CAS" means the entire CAS. It never means a specific view of the CAS.

* "Index Definition" means the declaration in the descriptor that defines an index, giving it a label, kind of index, CAS type, and sort keys.

* "Index" is an instance of an index definition: something that can be retrieved by a getIndex() call and from which you can get an iterator.

* "Physical Index" is an actual data structure holding references to FeatureStructures. This is transparent to the user but sometimes we need to talk about it if we're concerned about performance.

> One of the things Adam and I had agreed on (Adam correct me if I'm wrong) was that the base CAS, as you call it, is *not* a view.

+1. This is central to our API renaming: we are envisioning creating two interfaces, CAS and CasView. An instance of CAS refers to an entire CAS, which may contain multiple CasViews. It is not consistent with that to say that "the CAS is a view" or that "a view is a CAS".

> Now what you say about sofas is interesting. Currently, an index knows nothing of views or sofas. The only thing that is checked when adding a FS to an index is the FS's type. Are you suggesting that there should be special code that prevents me from adding an annotation that I created in one view to the index repository of another view?

In fact I believe that code already exists and it's not that complicated (in our current implementation anyway). Each annotation has a feature that is a reference to the Sofa, and the view has a reference to its Sofa. So I think this is just an integer comparison between these two values.

This constraint is mentioned in the OASIS spec: an "anchored view" is a view that's tied to a Sofa, and it is a constraint that all annotations that are members of an anchored view refer to that view's Sofa.

> A really simple approach would be to say that there are view-local index definitions, and CAS-global index definitions. For the view-local ones, each view would have its own instance (and every view would have one). For the CAS-global ones, there would be one instance in the CAS, shared by all views. However, that is just my current naive view of things. Much more complicated schemes could be envisioned.

I'm not too worried about the specifiers. A scheme like this would be fine and fairly easy to add, if we first decide that this idea of separate local/global index definitions is the way we want to go.

>>> The only rule of visibility is that one view cannot access the view-specific indexes of another view. Everything else is always visible.
>> I didn't follow this...
> See above. All CAS-global indexes are visible from/belong to every view, view-local indexes have one instance per view, and no global instance. That's what I meant, but as I said, much more sophisticated schemes could be imagined.

I think this is the key idea still to be nailed down, really. Like Marshall I don't think I completely understood what Thilo was suggesting with the global indexes being visible from the views. I have a better understanding now but have some concerns. This will be the topic of my next email.

> Marshall Schor wrote: Re: Need for "Global indexes"
> What is the use case for the global view set of indexes? I can't recall the use case for this, beyond being able to get all the data. This thread has suggested other utilities that can effectively "merge" the results from other views' index instances. Are there other use cases?

A hypothetical use case is that I want to get all Person mentions (annotations) in the CAS, say because I'm going to populate a database with their covered text and perhaps other feature values. Of course, you could walk all views to do that. But I'm suggesting you shouldn't have to. We could add a utility method to hide that detail; I guess I'm OK with that.

Basically, this discussion is more about getting the concepts straight than adding new functionality. I'll say again:

(1) The CAS is the container for all of the analysis data (as per the UIMA spec). It must be possible to create FS directly on the CAS and there must be some reasonable way to retrieve the FS in the CAS
Re: CAS and CasView redesign - question if all views should share the same indexes?
Marshall Schor wrote: Re: Need for "Global indexes"

What is the use case for the global view set of indexes? I can't recall the use case for this, beyond being able to get all the data. This thread has suggested other utilities that can effectively "merge" the results from other views' index instances. Are there other use cases?

I don't know of any either. All I can think of is tooling and utilities such as serialization, and those might be expected to work with views.

We had once discussed a use case where some collection of parts (annotators) that worked with views wanted to share some data that was global to their views. We thought that the best-practice way to do that was to have this collection of parts define another "view" to serve as their "global-sharing-place", in preference to a system-provided global-sharing-place, because that would enable this collection of parts to be combined with other parts in the future without having any accidental collisions in the global-sharing-space from other unknown users of this space.

See my reply to your other post. From a CAS perspective, I don't think that's a problem. I don't even want to start thinking about what a specifier for that would look like.

I guess I would vote to have the thing that gets all the FS in all views be just a utility method.

I hope if we put our minds to it we can get this done for 2.1. I'm hoping after 2.1 we can go a good long time without breaking backwards compatibility again.

+1 to that :-)

Just a word of caution here. We not only need to agree on the specification, we also need to implement all this. There are some non-trivial CAS changes under discussion right now. Even if we can defer some internal clean-up to the next release, the functionality needs to be there.

--Thilo
Re: CAS and CasView redesign - question if all views should share thesame indexes?
Marshall Schor wrote: In this discussion, I think some confusion arises from the use of "index" to mean both the index definition, and an instance (perhaps associated with a particular view) of that index definition. Also, in this discussion, the term CAS seems sometimes to be specific to what we might call the base-view, versus other specific "views" of the CAS. If we more clearly distinguish these, the conversation may be easier to follow. I've tried to distinguish them below: Ever since I've grasped the major concepts, I think we've been communicating quite well ;-) Logically, the part of the Index Repository which has the definitions is not duplicated; only the actual index instances are. We think there is a way to make the actual creation of the index instances "lazy" - in the sense that for performance / overhead reasons, they are not created until the first attempt to add a FS to that index instance in that particular view. I assume you are talking about the current implementation. I'm not sure what this laziness would buy us. An empty index consumes virtually no space. Unless we have reason to believe that there are significant gains to be had, I would vote for simplicity and against optimization. I didn't mean to suggest to have duplicate indexes. What I meant to say was, each view should have its own annotation index. In fact, today, each view has its own complete set of index instances, one per each index definition. And that is not a good thing. Global indexes should be shared. That is also the spirit of the OASIS draft, I think. The draft spec doesn't talk about indexes, true, but it certainly has been informed by our implementation. In the CAS, each of these annotation indexes can be accessed separately. In fact, I think this is pretty much what you're saying as well. I don't see a use case for a global merged annotation index, other than tooling and utilities. 
And even for tooling, I think it makes sense to access the annotation for each view separately. If we need to iterate over annotations from different views sorted by their offsets, irrespective of the sofa they point into, we can provide a utility function that does that on the fly. Note however that this implies that one should never do addFsToIndexes() on the CAS with an annotation, as it would be added to all annotation indexes. I think this means not to do an "addFsToIndexes() with an Annotation on the Cas View which is the "base CAS". The current design would disallow this because an Annotation (which has a reference to a Sofa) is only allowed to be added to index instances that belong to the view which has that Sofa. One of the things Adam and I had agreed on (Adam correct me if I'm wrong) was that the base CAS, as you call it, is *not* a view. What Adam was proposing for backward compatibility was a notion of a "current view", which is directly accessible through the CAS APIs. However, those are just convenience/compatibility APIs. Conceptually, the current view is a view of its own and could (should for new code) be accessed through regular view APIs. Now what you say about sofas is interesting. Currently, an index knows nothing of views or sofas. The only thing that is checked when adding a FS to an index is the FS's type. Are you suggesting that there should be special code that prevents me from adding an annotation that I created in one view to the index repository of another view? That might be desirable, but it will get complicated and expensive. I think we need to document this point carefully and hope that users understand that they shouldn't be doing this. It would be very hard to prevent all misuses of sofas/views. My suggestion implies that the index repository itself is agnostic of views and sofas. If you add an annotation to the wrong repository, it's your own fault. 
So to summarize, I would suggest that annotation indexes, for example, only live in views, there is no global annotation index (neither conceptually, nor physically). To access annotations from the CAS, you still need to access view-specific indexes. Non-sofa indexes, on the other hand, only exist in the global namespace. I'm not sure what this means. If it means an index over some non-Annotation type is not allowed to be part of a view, this seems to go against the idea of allowing "views" to hold subsets of FeatureStructures. So I don't think that's a good idea here. What I mean is that there is no way for a global index not to be part of a view. Or without the double negation: every global index is part of every view. Why not make this simpler by having a uniform approach: each view has its own set of index instances (drawn from perhaps a global set of index definitions, or perhaps some localized set of index definitions - that part to be worked out), whether or not the index is over Annotations or not. From a CAS impleme
Re: CAS and CasView redesign - question if all views should share thesame indexes?
Re: Need for "Global indexes" Adam Lally wrote: Moreover, I think the reverse direction should be true -- indexing an FS in a view's index repository DOES add it (at least conceptually) to indexes that apply to the CAS as a whole. I liked this latter idea because it provided a way to get at all the FS in the CAS without having to be concerned with views. I agree, and I hope that has been clear from my previous posts. Any view-specific index is visible from the CAS, in my approach. OK, as I said above I think I was just stuck on whether or not the thing that from the base CAS gives you a merged view of all the view indexes was called an index, or whether it's just a utility method. I'm using the terms "index definitions" and "index instances" here; we can have one global set of index definitions (or not :-) while having multiple index instances for those definitions, one per view, and perhaps (a conceptual, maybe not real) one for the "base CAS" or "global view" or whatever we want to call it - something used by people not concerned about views. What is the use case for the global view set of indexes? I can't recall the use-case for this, beyond being able to get all the data. This thread has suggested other utilities that can effectively "merge" the results from other view's index instances. Are there other use cases? We had once discussed a use case where some collection of parts (annotators) that worked with views wanted to share some data that was global to their views. We thought that the best-practice way to do that was to have this collection of parts define another "view" to serve as their "global-sharing-place", in preference to a system-provided global-sharing-place because that would enable this collection of parts to be combined with other parts in the future without having any accidental collisions in the global-sharing-space, from other unknown users of this space. 
I guess I would vote to have the thing that gets all the FS in all views be just a utility method. I hope if we put our minds to it we can get this done for 2.1. I'm hoping after 2.1 we can go a good long time without breaking backwards compatibility again. +1 to that :-) -Marshall
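The "utility method" option that the thread converges on can be pictured with a small sketch. This is not UIMA code: `Mention`, `AllViewsUtil`, and the map of per-view indexes are hypothetical stand-ins, chosen only to show that collecting all Person mentions needs no global index, just a walk over each view's own index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for an indexed annotation; not a UIMA class. */
final class Mention {
    final String type;
    final String coveredText;

    Mention(String type, String coveredText) {
        this.type = type;
        this.coveredText = coveredText;
    }
}

/** Hypothetical utility: hides the "walk all views" detail from the caller. */
final class AllViewsUtil {
    private AllViewsUtil() {}

    /**
     * Visits each view's own index in turn and keeps the entries of the
     * requested type. Nothing is added to any second, global index.
     * The result follows view order and is not sorted by offsets.
     */
    static List<Mention> collect(Map<String, List<Mention>> indexByView, String type) {
        List<Mention> out = new ArrayList<>();
        for (List<Mention> index : indexByView.values()) {
            for (Mention m : index) {
                if (m.type.equals(type)) {
                    out.add(m);
                }
            }
        }
        return out;
    }
}
```

A caller populating a database with Person mentions would call `AllViewsUtil.collect(indexByView, "Person")` and never touch a view directly, which is the convenience asked for above, at no extra indexing cost.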
Re: CAS and CasView redesign - question if all views should share thesame indexes?
In this discussion, I think some confusion arises from the use of "index" to mean both the index definition, and an instance (perhaps associated with a particular view) of that index definition. Also, in this discussion, the term CAS seems sometimes to be specific to what we might call the base-view, versus other specific "views" of the CAS. If we more clearly distinguish these, the conversation may be easier to follow. I've tried to distinguish them below: Thilo Goetz wrote: Adam Lally wrote: I think this basically makes sense. I want to clarify, though, that we *do* currently have different indexes (i.e., we have different index instances of the common / shared index definitions) for each view (for example each view has its own annotation index, which holds the annotations relating to that view's sofa). This is done by replicating the index repository for each view. Logically, the part of the Index Repository which has the definitions is not duplicated; only the actual index instances are. We think there is a way to make the actual creation of the index instances "lazy" - in the sense that for performance / overhead reasons, they are not created until the first attempt to add a FS to that index instance in that particular view. Right. I would like to change that in the course of introducing CasViews. A key question is "do all views have the same set of index _definitions_?" Currently, yes - the component descriptors declare index definitions without reference to views, and consequently, for every view we create an instance of each defined index. Your note above, and Marshall's, argue that this shouldn't necessarily be the case -- some indexes may make sense only for certain views (but also, only for certain components, a further complication). I think that probably makes sense, but I'm not sure it's a critical thing to implement now, if we haven't seen a real use case where it's a problem to create instances of indexes in every view even if they're not used. 
Hm, somehow, we need to distinguish between indexes that are global to all views, and those that are local to a view. How do we do that? I think you mean to distinguish between index definitions that should be in all views, and those which should only be in (some) views. The other key idea here is the global index repository that contains all of the indexes from all views -- we don't currently have anything like that. Take the annotation index as an example, and say there are multiple views each with their own annotation index. I also want to enable operations on the CAS like "get me all annotations in all views", or "get me all annotations of type Person in all views". To do that we also create an annotation index in the base CAS (the "global namespace"). I think you could do such a thing in your suggestion; if you had a global annotation index then whenever anyone did view.addFsToIndexes(myAnnot) in any view, myAnnot would also be added to the global annotation index (because you said the global index is visible from the index repository of the view). My idea was a little different, and I guess maybe just an implementation detail. Instead of actually adding myAnnot to a separate, global index, I would just add it to its own view's index. Then, when someone asks for an iterator off of the global annotation index, I would do a dynamic merge of the annotation indexes in all views (the same way we do merging of indexes across types). But the effect is the same - we have a global index that provides access to everything that was indexed in any view. I didn't mean to suggest to have duplicate indexes. What I meant to say was, each view should have its own annotation index. In fact, today, each view has its own complete set of index instances, one per each index definition. In the CAS, each of these annotation indexes can be accessed separately. In fact, I think this is pretty much what you're saying as well. 
I don't see a use case for a global merged annotation index, other than tooling and utilities. And even for tooling, I think it makes sense to access the annotation for each view separately. If we need to iterate over annotations from different views sorted by their offsets, irrespective of the sofa they point into, we can provide a utility function that does that on the fly. Note however that this implies that one should never do addFsToIndexes() on the CAS with an annotation, as it would be added to all annotation indexes. I think this means not to do an "addFsToIndexes() with an Annotation on the Cas View which is the "base CAS". The current design would disallow this because an Annotation (which has a reference to a Sofa) is only allowed to be added to index instances that belong to the view which has that Sofa. My suggestion implies that the index repository itself is agnostic of views and sofas. If you add an annotation to the wrong repository, it's yo
Re: CAS and CasView redesign - question if all views should share thesame indexes?
On 12/21/06, Thilo Goetz <[EMAIL PROTECTED]> wrote: (1) The CAS is the container for all of the analysis data (as per the UIMA spec). It must be possible to create FS directly on the CAS and there must be some reasonable way to retrieve the FS in the CAS without having to be concerned with views. Agreed. It should be possible to say, on the global index repository: give me all indexes. This will include the global indexes, as well as all view-specific indexes. You can then iterate over all data in all indexes, without knowing anything about views. OK, that seems fine. As you said we can provide a utility function to get me all annotations of a particular type. This seems functionally equivalent to my idea of a *conceptual* annotation index that logically contains all annotations in the CAS but would be efficiently implemented by on-the-fly combination of the individual annotation indexes. The only difference is whether you want to call such a thing an "index". (We already do such on-the-fly merges so it didn't seem like too much of a stretch to me.) (2) A CasView is a way of accessing a subset of FS in the CAS. It must be possible to assert that an FS is a _member_ of a CasView, and there must be some reasonable way to retrieve the members of the CasView. In the general CAS, we can only access those FSs that are in some index. If you need to be able to retrieve any FS whatsoever, you need to define a bag index over all types. I would propose to handle views the same way. A FS is a member of a view iff it's contained in one of the indexes specific to the view. The same FS may live in several indexes, belonging to different views. That seems in accordance with the spec proposal. Agreed, that's my interpretation as well. Essentially I think of indexes as a way to implement view membership. A view to me is just a set of indexes; moreover, it's a subset of the set of all indexes, which are exactly the indexes defined in the CAS. 
When I add a FS to all those indexes, it will be added to all applicable indexes, and that means all view indexes as well. Hmm... this is where I start to feel that this design of the index mechanism (a view being a set of indexes which is a subset of the set of all indexes in the CAS) is failing to mesh with how I think views should work. I think it should be clear when one is asserting that an object belongs to a view (by specifically adding it to the indexes for that view). And I think that someone who doesn't care about views should be able to index objects in the CAS and read them back later, and views should be completely unaffected by this. If it's too easy to accidentally add objects to every view in the CAS without realizing it, we're not doing a good job at making it possible to interact with the CAS as just a collection of objects without being concerned with views. Alternatively, we can say adding an FS in the CAS means adding it to global, non-view indexes only. That would make sense, but it doesn't sync with the idea that the CAS index repository contains all indexes, not just the global ones. Maybe we need a special API for that, addFsToGlobalIndexes(). So maybe getGlobalIndexRepository() should be called something else, to avoid confusion. getCompleteIndexRepository() or something. I'm about +0.5 on that; we're getting somewhere. I still feel like getCompleteIndexRepository().addFS() would be a dangerous method just asking to be misused. Could we disable that operation? Moreover, I think the reverse direction should be true -- indexing an FS in a view's index repository DOES add it (at least conceptually) to indexes that apply to the CAS as a whole. I liked this latter idea because it provided a way to get at all the FS in the CAS without having to be concerned with views. I agree, and I hope that has been clear from my previous posts. Any view-specific index is visible from the CAS, in my approach. 
OK, as I said above I think I was just stuck on whether or not the thing that from the base CAS gives you a merged view of all the view indexes was called an index, or whether it's just a utility method. I am very much concerned with performance, and it needs to be a consideration from the start. We simply can't add every annotation to two indexes by default. I didn't mean to suggest that, sorry if I wasn't clear. If you have a "global" annotation index and a "local" (to a view) annotation index, I don't want to ever add an annotation to both. If you index it off of the view, it only goes into the local index. If you index it off the base CAS, it only goes into the global index (in my way of thinking). I think there should be a way to easily get at the contents of both indexes as if they were merged, but we'd do such merging on-the-fly. Each addFsToIndexes() operation should add the FS to either a single view's indexes, or to the global indexes. So given that, is there an issue wi
Re: CAS and CasView redesign - question if all views should share thesame indexes?
Adam Lally wrote: On 12/21/06, Thilo Goetz <[EMAIL PROTECTED]> wrote: I didn't mean to suggest to have duplicate indexes. What I meant to say was, each view should have its own annotation index. In the CAS, each of these annotation indexes can be accessed separately. In fact, I think this is pretty much what you're saying as well. I don't see a use case for a global merged annotation index, other than tooling and utilities. And even for tooling, I think it makes sense to access the annotation for each view separately. I think maybe we should take a step back and try to agree on a few basic things that we want to be true of CASes and CasViews. Here are the ideas that I had, mostly drawing on the definition in the UIMA spec proposal. (1) The CAS is the container for all of the analysis data (as per the UIMA spec). It must be possible to create FS directly on the CAS and there must be some reasonable way to retrieve the FS in the CAS without having to be concerned with views. Agreed. It should be possible to say, on the global index repository: give me all indexes. This will include the global indexes, as well as all view-specific indexes. You can then iterate over all data in all indexes, without knowing anything about views. (2) A CasView is a way of accessing a subset of FS in the CAS. It must be possible to assert that an FS is a _member_ of a CasView, and there must be some reasonable way to retrieve the members of the CasView. In the general CAS, we can only access those FSs that are in some index. If you need to be able to retrieve any FS whatsoever, you need to define a bag index over all types. I would propose to handle views the same way. A FS is a member of a view iff it's contained in one of the indexes specific to the view. The same FS may live in several indexes, belonging to different views. That seems in accordance with the spec proposal. 
If we need to iterate over annotations from different views sorted by their offsets, irrespective of the sofa they point into, we can provide a utility function that does that on the fly. I agree that it doesn't make much sense that if I access annotations irrespective of sofas, they would be sorted by begin, end. However, I still think I might just want to get all annotations (of some type) and not care about the order. You can do that under my proposal: just get all annotation indexes for all views and iterate over each of them in turn. If we need a utility function for that, it's easy enough to do. Note however that this implies that one should never do addFsToIndexes() on the CAS with an annotation, as it would be added to all annotation indexes. My suggestion implies that the index repository itself is agnostic of views and sofas. If you add an annotation to the wrong repository, it's your own fault. This behavior doesn't mesh well with the 3 ideas above. To me, indexing an FS in the CAS just means that I want to be able to retrieve this FS back out of the CAS later. It does not mean that I'm asserting it to be a member of any view. A view to me is just a set of indexes; moreover, it's a subset of the set of all indexes, which are exactly the indexes defined in the CAS. When I add a FS to all those indexes, it will be added to all applicable indexes, and that means all view indexes as well. Alternatively, we can say adding an FS in the CAS means adding it to global, non-view indexes only. That would make sense, but it doesn't sync with the idea that the CAS index repository contains all indexes, not just the global ones. Maybe we need a special API for that, addFsToGlobalIndexes(). So maybe getGlobalIndexRepository() should be called something else, to avoid confusion. getCompleteIndexRepository() or something. 
Moreover, I think the reverse direction should be true -- indexing an FS in a view's index repository DOES add it (at least conceptually) to indexes that apply to the CAS as a whole. I liked this latter idea because it provided a way to get at all the FS in the CAS without having to be concerned with views. I agree, and I hope that has been clear from my previous posts. Any view-specific index is visible from the CAS, in my approach. So to summarize, I would suggest that annotation indexes, for example, only live in views, there is no global annotation index (neither conceptually, nor physically). To access annotations from the CAS, you still need to access view-specific indexes. Non-sofa indexes, on the other hand, only exist in the global namespace. The only rule of visibility is that one view can not access the view-specific indexes of another view. Everything else is always visible. So what I haven't figured out for myself is, what makes a sofa-index a sofa-index? Do we need a declaration, or can we figure this out automatically? I think it's a view-index, not necessarily a sofa-index (for now it doesn't matter, but we may someday break t
Re: CAS and CasView redesign - question if all views should share thesame indexes?
On 12/21/06, Thilo Goetz <[EMAIL PROTECTED]> wrote: I didn't mean to suggest to have duplicate indexes. What I meant to say was, each view should have its own annotation index. In the CAS, each of these annotation indexes can be accessed separately. In fact, I think this is pretty much what you're saying as well. I don't see a use case for a global merged annotation index, other than tooling and utilities. And even for tooling, I think it makes sense to access the annotation for each view separately. I think maybe we should take a step back and try to agree on a few basic things that we want to be true of CASes and CasViews. Here are the ideas that I had, mostly drawing on the definition in the UIMA spec proposal. (1) The CAS is the container for all of the analysis data (as per the UIMA spec). It must be possible to create FS directly on the CAS and there must be some reasonable way to retrieve the FS in the CAS without having to be concerned with views. (2) A CasView is a way of accessing a subset of FS in the CAS. It must be possible to assert that an FS is a _member_ of a CasView, and there must be some reasonable way to retrieve the members of the CasView. (3) A CasView MAY also have a Sofa (such a CasView is called an "anchored view") -- if it does this means that any annotation that is a member of that view must refer to the view's Sofa. I see indexes as providing a "reasonable" way to access either (a) FS in the CAS as a whole or (b) the members of a view. If we need to iterate over annotations from different views sorted by their offsets, irrespective of the sofa they point into, we can provide a utility function that does that on the fly. I agree that it doesn't make much sense that if I access annotations irrespective of sofas, they would be sorted by begin, end. However, I still think I might just want to get all annotations (of some type) and not care about the order. 
Note however that this implies that one should never do addFsToIndexes() on the CAS with an annotation, as it would be added to all annotation indexes. My suggestion implies that the index repository itself is agnostic of views and sofas. If you add an annotation to the wrong repository, it's your own fault. This behavior doesn't mesh well with the 3 ideas above. To me, indexing an FS in the CAS just means that I want to be able to retrieve this FS back out of the CAS later. It does not mean that I'm asserting it to be a member of any view. Moreover, I think the reverse direction should be true -- indexing an FS in a view's index repository DOES add it (at least conceptually) to indexes that apply to the CAS as a whole. I liked this latter idea because it provided a way to get at all the FS in the CAS without having to be concerned with views. So to summarize, I would suggest that annotation indexes, for example, only live in views, there is no global annotation index (neither conceptually, nor physically). To access annotations from the CAS, you still need to access view-specific indexes. Non-sofa indexes, on the other hand, only exist in the global namespace. The only rule of visibility is that one view can not access the view-specific indexes of another view. Everything else is always visible. So what I haven't figured out for myself is, what makes a sofa-index a sofa-index? Do we need a declaration, or can we figure this out automatically? I think it's a view-index, not necessarily a sofa-index (for now it doesn't matter, but we may someday break the 1-1 correspondence between views and sofas). I think the most general design here would be to allow a declaration saying which view(s) the index belongs to, and/or whether it belongs to the CAS as a whole. (I think it could be both.) In the absence of such a declaration, the index applies to all views for backwards compatibility and I think maybe also applies to the CAS as a whole. 
The nice thing about the default being that it applies to everything is that we can put off implementing view-restricted indexes until later; I think adding them is more a performance optimization than anything else, eliminating the creation of unneeded indexes. -Adam
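The declaration idea in the message above (an index definition optionally names the views it belongs to, and an undeclared index applies everywhere) can be sketched as follows. Everything here is an assumption made for illustration: `IndexDefinition`, its factory methods, and the boolean CAS-wide flag are hypothetical names, not part of any UIMA descriptor format or API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical index definition with an optional view restriction. */
final class IndexDefinition {
    final String name;
    // False when the descriptor carried no declaration at all.
    private final boolean restricted;
    // Views named by the declaration; ignored when restricted is false.
    private final Set<String> declaredViews;
    // Whether the index also applies to the CAS as a whole.
    private final boolean appliesToCas;

    private IndexDefinition(String name, boolean restricted,
                            Set<String> declaredViews, boolean appliesToCas) {
        this.name = name;
        this.restricted = restricted;
        this.declaredViews = declaredViews;
        this.appliesToCas = appliesToCas;
    }

    /** No declaration: the backwards-compatible default, applies to every view and to the CAS. */
    static IndexDefinition unrestricted(String name) {
        return new IndexDefinition(name, false, Collections.<String>emptySet(), true);
    }

    /** Explicit declaration: only the named views, optionally also the CAS as a whole. */
    static IndexDefinition restrictedTo(String name, boolean appliesToCas, String... views) {
        Set<String> set = new LinkedHashSet<>(Arrays.asList(views));
        return new IndexDefinition(name, true, set, appliesToCas);
    }

    /** True if an index instance should exist in the given view. */
    boolean appliesToView(String viewName) {
        return !restricted || declaredViews.contains(viewName);
    }

    boolean appliesToCas() {
        return appliesToCas;
    }
}
```

Because the undeclared case behaves exactly like today's "instance in every view" behavior, existing descriptors would keep working, and the restricted case only saves the creation of index instances that would never be used.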
Re: CAS and CasView redesign - question if all views should share thesame indexes?
Adam Lally wrote: I think this basically makes sense. I want to clarify, though, that we *do* currently have different indexes for each view (for example each view has its own annotation index, which holds the annotations relating to that view's sofa). This is done by replicating the index repository for each view. Right. I would like to change that in the course of introducing CasViews. A key question is "do all views have the same set of index _definitions_?" Currently, yes - the component descriptors declare index definitions without reference to views, and consequently, for every view we create an instance of each defined index. Your note above, and Marshall's, argue that this shouldn't necessarily be the case -- some indexes may make sense only for certain views (but also, only for certain components, a further complication). I think that probably makes sense, but I'm not sure it's a critical thing to implement now, if we haven't seen a real use case where it's a problem to create instances of indexes in every view even if they're not used. Hm, somehow, we need to distinguish between indexes that are global to all views, and those that are local to a view. How do we do that? The other key idea here is the global index repository that contains all of the indexes from all views -- we don't currently have anything like that. Take the annotation index as an example, and say there are multiple views each with their own annotation index. I also want to enable operations on the CAS like "get me all annotations in all views", or "get me all annotations of type Person in all views". To do that we also create an annotation index in the base CAS (the "global namespace"). 
I think you could do such a thing in your suggestion; if you had a global annotation index then whenever anyone did view.addFsToIndexes(myAnnot) in any view, myAnnot would also be added to the global annotation index (because you said the global index is visible from the index repository of the view). My idea was a little different, and I guess maybe just an implementation detail. Instead of actually adding myAnnot to a separate, global index, I would just add it to its own view's index. Then, when someone asks for an iterator off of the global annotation index, I would do a dynamic merge of the annotation indexes in all views (the same way we do merging of indexes across types). But the effect is the same - we have a global index that provides access to everything that was indexed in any view. I didn't mean to suggest to have duplicate indexes. What I meant to say was, each view should have its own annotation index. In the CAS, each of these annotation indexes can be accessed separately. In fact, I think this is pretty much what you're saying as well. I don't see a use case for a global merged annotation index, other than tooling and utilities. And even for tooling, I think it makes sense to access the annotation for each view separately. If we need to iterate over annotations from different views sorted by their offsets, irrespective of the sofa they point into, we can provide a utility function that does that on the fly. Note however that this implies that one should never do addFsToIndexes() on the CAS with an annotation, as it would be added to all annotation indexes. My suggestion implies that the index repository itself is agnostic of views and sofas. If you add an annotation to the wrong repository, it's your own fault. So to summarize, I would suggest that annotation indexes, for example, only live in views, there is no global annotation index (neither conceptually, nor physically). 
To access annotations from the CAS, you still need to access view-specific indexes. Non-sofa indexes, on the other hand, only exist in the global namespace. The only rule of visibility is that one view can not access the view-specific indexes of another view. Everything else is always visible. So what I haven't figured out for myself is, what makes a sofa-index a sofa-index? Do we need a declaration, or can we figure this out automatically? --Thilo
Re: CAS and CasView redesign - question if all views should share thesame indexes?
On 12/21/06, Thilo Goetz <[EMAIL PROTECTED]> wrote:

I haven't thought this through yet, but here's how I see indexes and their relation to views right now. Let me know if this agrees with your views, or how it differs. The index repository is a set of indexes, at least right now. All it can do is to give you indexes. The index repository of the CAS holds all indexes, a view's repository a subset thereof. An index is retrieved by name (i.e., each index has at least one name). Currently, if there is more than one index with the same indexing spec, but different names, all those names actually point to the same physical index. However, that choice is transparent to the user. I assume this needs to change. If we have more than one view, and they all have annotation indexes, those should be different indexes (at least conceptually, but I think also physically). So views create a simple sort of name space: an index can either belong to the global namespace, or to that of a view. All indexes can be accessed from the CAS, but only global indexes and the indexes for the given view can be accessed from the index repository of that view.

I think this basically makes sense. I want to clarify, though, that we *do* currently have different indexes for each view (for example, each view has its own annotation index, which holds the annotations relating to that view's sofa). This is done by replicating the index repository for each view.

A key question is "do all views have the same set of index _definitions_?" Currently, yes - the component descriptors declare index definitions without reference to views, and consequently, for every view we create an instance of each defined index. Your note above, and Marshall's, argue that this shouldn't necessarily be the case -- some indexes may make sense only for certain views (but also, only for certain components, a further complication). I think that probably makes sense, but I'm not sure it's a critical thing to implement now, if we haven't seen a real use case where it's a problem to create instances of indexes in every view even if they're not used.

The other key idea here is the global index repository that contains all of the indexes from all views -- we don't currently have anything like that. Take the annotation index as an example, and say there are multiple views each with their own annotation index. I also want to enable operations on the CAS like "get me all annotations in all views", or "get me all annotations of type Person in all views". To do that we also create an annotation index in the base CAS (the "global namespace"). I think you could do such a thing in your suggestion; if you had a global annotation index then whenever anyone did view.addFsToIndexes(myAnnot) in any view, myAnnot would also be added to the global annotation index (because you said the global index is visible from the index repository of the view).

My idea was a little different, and I guess maybe just an implementation detail. Instead of actually adding myAnnot to a separate, global index, I would just add it to its own view's index. Then, when someone asks for an iterator off of the global annotation index, I would do a dynamic merge of the annotation indexes in all views (the same way we do merging of indexes across types). But the effect is the same - we have a global index that provides access to everything that was indexed in any view.

-Adam
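To make the "dynamic merge" idea concrete, here is a minimal sketch in plain Java. It is not UIMA code: `Ann`, `MergedIndexIterator`, and the sort order (begin ascending, then end descending) are assumptions made up for illustration. Each view keeps its own sorted annotation index, and the "global" iterator just pulls from whichever view currently has the smallest head element, so nothing is ever copied into a separate global index.

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.PriorityQueue;

/** Hypothetical stand-in for an annotation; not a UIMA class. */
final class Ann {
    final String view;
    final int begin;
    final int end;

    Ann(String view, int begin, int end) {
        this.view = view;
        this.begin = begin;
        this.end = end;
    }

    /** Assumed sort order for this sketch: begin ascending, then end descending. */
    static final Comparator<Ann> ORDER =
        Comparator.comparingInt((Ann a) -> a.begin)
                  .thenComparing(Comparator.comparingInt((Ann a) -> a.end).reversed());
}

/** Lazily merges several already-sorted per-view iterators into one sorted stream. */
final class MergedIndexIterator<T> implements Iterator<T> {

    /** The next pending element of one source, plus the rest of that source. */
    private static final class Head<T> {
        final T value;
        final Iterator<T> rest;

        Head(T value, Iterator<T> rest) {
            this.value = value;
            this.rest = rest;
        }
    }

    private final PriorityQueue<Head<T>> heads;

    MergedIndexIterator(List<Iterator<T>> sources, Comparator<? super T> order) {
        this.heads = new PriorityQueue<>((a, b) -> order.compare(a.value, b.value));
        for (Iterator<T> it : sources) {
            if (it.hasNext()) {
                heads.add(new Head<>(it.next(), it));
            }
        }
    }

    @Override
    public boolean hasNext() {
        return !heads.isEmpty();
    }

    @Override
    public T next() {
        Head<T> smallest = heads.poll();
        if (smallest == null) {
            throw new NoSuchElementException();
        }
        // Refill from the source we just consumed, if it has more elements.
        if (smallest.rest.hasNext()) {
            heads.add(new Head<>(smallest.rest.next(), smallest.rest));
        }
        return smallest.value;
    }
}
```

The trade-off is the usual one: adding to an index stays cheap (one view, one index), and the cost of the global view is paid only by whoever actually iterates over it.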
Re: CAS and CasView redesign - question if all views should share thesame indexes?
I haven't thought this through yet, but here's how I see indexes and their relation to views right now. Let me know if this agrees with your views, or how it differs.

The index repository is a set of indexes, at least right now. All it can do is to give you indexes. The index repository of the CAS holds all indexes, a view's repository a subset thereof. An index is retrieved by name (i.e., each index has at least one name). Currently, if there is more than one index with the same indexing spec, but different names, all those names actually point to the same physical index. However, that choice is transparent to the user.

I assume this needs to change. If we have more than one view, and they all have annotation indexes, those should be different indexes (at least conceptually, but I think also physically). So views create a simple sort of name space: an index can either belong to the global namespace, or to that of a view. All indexes can be accessed from the CAS, but only global indexes and the indexes for the given view can be accessed from the index repository of that view.

--Thilo
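The visibility rule described here can be sketched roughly as follows. This is plain Java, not the UIMA API: `IndexNamespaces` and its method names are hypothetical, and the choice to let a view's own index shadow a global index of the same name is an assumption the message does not settle.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical registry; the type parameter I stands for whatever an index is. */
final class IndexNamespaces<I> {
    private final Map<String, I> global = new HashMap<>();
    private final Map<String, Map<String, I>> perView = new HashMap<>();

    void defineGlobal(String name, I index) {
        global.put(name, index);
    }

    void defineForView(String view, String name, I index) {
        perView.computeIfAbsent(view, v -> new HashMap<>()).put(name, index);
    }

    /**
     * Lookup as seen from one view's index repository: the view's own
     * indexes and the global ones, never those of another view.
     * Assumption: a view-specific index shadows a global one of the same name.
     */
    Optional<I> lookupFromView(String view, String name) {
        I own = perView.getOrDefault(view, Collections.emptyMap()).get(name);
        return Optional.ofNullable(own != null ? own : global.get(name));
    }

    /**
     * Lookup from the CAS itself, where every index is reachable.
     * Pass null as the view to address the global namespace.
     */
    Optional<I> lookupFromCas(String view, String name) {
        if (view == null) {
            return Optional.ofNullable(global.get(name));
        }
        return Optional.ofNullable(
            perView.getOrDefault(view, Collections.emptyMap()).get(name));
    }
}
```

With this shape, two views can both have an index called "AnnotationIndex" that are physically different, which is the change from the current behavior where equal indexing specs share one physical index.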
Re: CAS and CasView redesign - question if all views should share thesame indexes?
On 12/19/06, Marshall Schor <[EMAIL PROTECTED]> wrote:

If we think of a CasView as a way of accessing a subset of the data in the CAS, what are the pluses and minuses of having every view have the same (shared) index definitions? Would it make more sense to have each view have its own non-shared set of indexes / definitions?

Maybe... we might extend the index descriptor format to allow specifying a set of view names to which the index applies. And in the absence of such a specification, the index might apply only to the views of the component's declared input and output sofas. For "sofa-unaware" annotators (or whatever we're calling them this week ;) this would mean that the index only applies to the one view that they operate on (which is specified by sofa mappings). Although I'm concerned about what happens if sofa mapping becomes dynamic. All in all, without a concrete use case where there is currently a significant performance issue, I would put off adding this feature.

But some components need specific indexes (and type priorities :-) in order to correctly iterate through sets of FSs. In this case, the component part is closely associated with the index specification. For better modularity - if I had a component operating on a particular view, needing a particular index specification, these might be associated to the component - and having such an index as a "global" thing might lead to unwanted "collisions" in the index "name-space", although this could be minimized by having some uniqueness to the index name. So if I called the index "ComponentAsIndex", it would make more sense if this was associated only with Component A, and not globally. This doesn't quite match associating the index with just one view, I admit.

Component-specific indexes also seem like a good idea (to do someday). One reason is to allow an optimization for remote annotators. There's no reason to actually build the index on the client side if it's only needed by a remote annotator, if the index isn't serialized to the remote node. We need only keep a list of indexed FSs, and build the index on the remote node as we do already.

Also we can deal with name collisions - two annotators could declare different indexes with the same label, but since they are specific to the component that is OK. When each component executes IndexRepository.getIndex(label), it would get the index that it itself had declared. This could be implemented the same way we are currently handling Sofa mapping - the CAS "knows" what annotator is currently processing it. Of course if two annotators declared indexes over the same type (or where one type is an ancestor of the other) with the same sort keys, they should be merged into one index in the implementation, even if they have different labels.

-Adam
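A rough sketch of how component-scoped labels could resolve, in plain Java. None of this is existing UIMA code: `IndexSpec`, `ComponentIndexes`, `declare`, and `enter` are hypothetical names, and an index is modeled as a bare list. Labels are looked up per component, while the physical index is keyed by the spec (type plus sort keys), so equal specs share one index regardless of label. The ancestor-type case mentioned above is deliberately left out to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Hypothetical index definition: an indexed type plus ordered sort keys. */
final class IndexSpec {
    final String type;
    final List<String> sortKeys;

    IndexSpec(String type, String... sortKeys) {
        this.type = type;
        this.sortKeys = Arrays.asList(sortKeys.clone());
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof IndexSpec)) {
            return false;
        }
        IndexSpec other = (IndexSpec) o;
        return type.equals(other.type) && sortKeys.equals(other.sortKeys);
    }

    @Override
    public int hashCode() {
        return Objects.hash(type, sortKeys);
    }
}

/** Hypothetical repository that resolves labels relative to the running component. */
final class ComponentIndexes {
    private final Map<String, Map<String, IndexSpec>> declared = new HashMap<>();
    private final Map<IndexSpec, List<Object>> physical = new HashMap<>();
    private String current; // the component currently processing the CAS

    void declare(String component, String label, IndexSpec spec) {
        declared.computeIfAbsent(component, c -> new HashMap<>()).put(label, spec);
        // Equal specs map to the same physical index, whatever the label.
        physical.computeIfAbsent(spec, s -> new ArrayList<>());
    }

    /** Called by the framework before a component runs (assumed hook). */
    void enter(String component) {
        current = component;
    }

    /** Returns the index that the current component declared under this label. */
    List<Object> getIndex(String label) {
        Map<String, IndexSpec> labels = declared.get(current);
        if (labels == null || !labels.containsKey(label)) {
            throw new IllegalArgumentException(
                "No index '" + label + "' declared by component " + current);
        }
        return physical.get(labels.get(label));
    }
}
```

So two annotators can both ask for "myIndex" and get different indexes, while a third asking for "tokens" may get the very same physical index as one of them if the specs match.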
CAS and CasView redesign - question if all views should share thesame indexes?
If we think of a CasView as a way of accessing a subset of the data in the CAS, what are the pluses and minuses of having every view have the same (shared) index definitions? Would it make more sense to have each view have its own non-shared set of indexes / definitions?

Pluses:
- A view which wanted to only index one kind of thing would not need to create instances of all the other indexes (which would be unused).

Minuses:
- more complexity?

Other topics around indexes include how to think about what the index is logically associated with. Using the DB analogy - indexes are "extra" - only serving to speed things up. In this view, they are associated with "assemblers" who are doing fine-tuning, space/time trade-offs.

But some components need specific indexes (and type priorities :-) in order to correctly iterate through sets of FSs. In this case, the component part is closely associated with the index specification. For better modularity - if I had a component operating on a particular view, needing a particular index specification, these might be associated to the component - and having such an index as a "global" thing might lead to unwanted "collisions" in the index "name-space", although this could be minimized by having some uniqueness to the index name. So if I called the index "ComponentAsIndex", it would make more sense if this was associated only with Component A, and not globally. This doesn't quite match associating the index with just one view, I admit.

-Marshall