Re: CAS and CasView redesign - question if all views should share the same indexes?

2006-12-21 Thread Thilo Goetz
I haven't thought this through yet, but here's how I see indexes and 
their relation to views right now.  Let me know if this agrees with your 
views, or how it differs.


The index repository is a set of indexes, at least right now.  All it 
can do is to give you indexes.  The index repository of the CAS holds 
all indexes, a view's repository a subset thereof.  An index is 
retrieved by name (i.e., each index has at least one name).  Currently, 
if there is more than one index with the same indexing spec, but 
different names, all those names actually point to the same physical 
index.  However, that choice is transparent to the user.  I assume this 
needs to change.  If we have more than one view, and they all have 
annotation indexes, those should be different indexes (at least 
conceptually, but I think also physically).  So views create a simple 
sort of name space: an index can either belong to the global namespace, 
or to that of a view.  All indexes can be accessed from the CAS, but
only global indexes and the indexes for the given view can be accessed 
from the index repository of that view.
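
To make the visibility rule concrete, here is a minimal sketch in plain
Java.  It is a toy model under stated assumptions, not the UIMA API: the
class IndexNamespaceSketch, its methods, and the "view:index" naming
scheme are all hypothetical.  It only shows that the CAS-level repository
can reach every index, while a view's repository reaches the global
indexes plus its own.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the proposed index namespaces. Not UIMA code; all names are hypothetical. */
class IndexNamespaceSketch {

    /** Qualified index name -> owning view, or null for an index in the global namespace. */
    private final Map<String, String> ownerByIndexName = new LinkedHashMap<>();

    public void defineGlobalIndex(String indexName) {
        ownerByIndexName.put(indexName, null);
    }

    public void defineViewIndex(String viewName, String indexName) {
        // Qualifying the name lets each view have its own, distinct "AnnotationIndex".
        ownerByIndexName.put(viewName + ":" + indexName, viewName);
    }

    /** The CAS-level repository can reach every index. */
    public List<String> visibleFromCas() {
        return new ArrayList<>(ownerByIndexName.keySet());
    }

    /** A view's repository can reach the global indexes plus that view's own indexes. */
    public List<String> visibleFromView(String viewName) {
        List<String> visible = new ArrayList<>();
        for (Map.Entry<String, String> entry : ownerByIndexName.entrySet()) {
            String owner = entry.getValue();
            if (owner == null || owner.equals(viewName)) {
                visible.add(entry.getKey());
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        IndexNamespaceSketch repo = new IndexNamespaceSketch();
        repo.defineGlobalIndex("DocumentMetadataIndex");
        repo.defineViewIndex("english", "AnnotationIndex");
        repo.defineViewIndex("german", "AnnotationIndex");
        System.out.println(repo.visibleFromCas());
        System.out.println(repo.visibleFromView("english"));
    }
}
```

In this model the two annotation indexes are separate entries, so nothing
indexed in one view leaks into the other.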


--Thilo


Re: CAS and CasView redesign - question if all views should share the same indexes?

2006-12-21 Thread Adam Lally

On 12/21/06, Thilo Goetz wrote:

I haven't thought this through yet, but here's how I see indexes and
their relation to views right now.  Let me know if this agrees with your
views, or how it differs.

The index repository is a set of indexes, at least right now.  All it
can do is to give you indexes.  The index repository of the CAS holds
all indexes, a view's repository a subset thereof.  An index is
retrieved by name (i.e., each index has at least one name).  Currently,
if there is more than one index with the same indexing spec, but
different names, all those names actually point to the same physical
index.  However, that choice is transparent to the user.  I assume this
needs to change.  If we have more than one view, and they all have
annotation indexes, those should be different indexes (at least
conceptually, but I think also physically).  So views create a simple
sort of name space: an index can either belong to the global namespace,
or to that of a view.  All indexes can be accessed from the CAS, but
only global indexes and the indexes for the given view can be accessed
from the index repository of that view.



I think this basically makes sense.  I want to clarify, though, that
we *do* currently have different indexes for each view (for
example, each view has its own annotation index, which holds the
annotations relating to that view's sofa). This is done by replicating
the index repository for each view.

A key question is do all views have the same set of index
_definitions_?  Currently, yes - the component descriptors declare
index definitions without reference to views, and consequently, for
every view we create an instance of each defined index.  Your note
above, and Marshall's, argue that this shouldn't necessarily be the
case -- some indexes may make sense only for certain views (but also,
only for certain components, a further complication).  I think that
probably makes sense, but I'm not sure it's a critical thing to
implement now, if we haven't seen a real use case where it's a problem
to create instances of indexes in every view even if they're not used.

The other key idea here is the global index repository that contains
all of the indexes from all views -- we don't currently have anything
like that.  Take the annotation index as an example, and say there are
multiple views each with their own annotation index.  I also want to
enable operations on the CAS like "get me all annotations in all
views", or "get me all annotations of type Person in all views".  To
do that we also create an annotation index in the base CAS (the
global namespace).  I think you could do such a thing in your
suggestion; if you had a global annotation index then whenever anyone
did view.addFsToIndexes(myAnnot) in any view, myAnnot would also be
added to the global annotation index (because you said the global
index is visible from the index repository of the view).  My idea was
a little different, and I guess maybe just an implementation detail.
Instead of actually adding myAnnot to a separate, global index, I
would just add it to its own view's index.  Then, when someone asks
for an iterator off of the global annotation index, I would do a
dynamic merge of the annotation indexes in all views (the same way we
do merging of indexes across types).  But the effect is the same - we
have a global index that provides access to everything that was
indexed in any view.
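
The dynamic merge could be sketched roughly as follows.  This is plain
Java, not the UIMA implementation: the Annot class, the BY_OFFSETS
ordering (begin, then end, both ascending) and mergedIterator() are
hypothetical names for illustration, and each per-view index is assumed
to be a list that is already sorted.  Nothing is copied into a separate
global index; the merged order is produced lazily by a k-way merge.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.PriorityQueue;

/** Toy k-way merge over per-view annotation indexes. Not UIMA code. */
class MergedIndexSketch {

    /** Hypothetical minimal annotation: offsets plus the view it was indexed in. */
    static final class Annot {
        public final int begin;
        public final int end;
        public final String view;

        public Annot(int begin, int end, String view) {
            this.begin = begin;
            this.end = end;
            this.view = view;
        }
    }

    /** Simplified ordering assumed for this sketch: begin, then end, ascending. */
    static final Comparator<Annot> BY_OFFSETS =
        Comparator.comparingInt((Annot a) -> a.begin).thenComparingInt(a -> a.end);

    /** One cursor per view index; head is the next element that cursor will yield. */
    private static final class Cursor {
        final Iterator<Annot> rest;
        Annot head;

        Cursor(Iterator<Annot> it) {
            this.rest = it;
            this.head = it.next();
        }
    }

    /** Returns an iterator over all views' annotations without building a global index. */
    public static Iterator<Annot> mergedIterator(List<List<Annot>> perViewIndexes) {
        final PriorityQueue<Cursor> heap =
            new PriorityQueue<Cursor>((x, y) -> BY_OFFSETS.compare(x.head, y.head));
        for (List<Annot> index : perViewIndexes) {
            if (!index.isEmpty()) {
                heap.add(new Cursor(index.iterator()));
            }
        }
        return new Iterator<Annot>() {
            @Override
            public boolean hasNext() {
                return !heap.isEmpty();
            }

            @Override
            public Annot next() {
                Cursor cursor = heap.poll();
                if (cursor == null) {
                    throw new NoSuchElementException();
                }
                Annot result = cursor.head;
                if (cursor.rest.hasNext()) {
                    // Advance this view's cursor and put it back into the heap.
                    cursor.head = cursor.rest.next();
                    heap.add(cursor);
                }
                return result;
            }
        };
    }

    public static void main(String[] args) {
        List<Annot> viewA = Arrays.asList(new Annot(0, 5, "A"), new Annot(10, 12, "A"));
        List<Annot> viewB = Arrays.asList(new Annot(3, 4, "B"), new Annot(10, 11, "B"));
        Iterator<Annot> it = mergedIterator(Arrays.asList(viewA, viewB));
        while (it.hasNext()) {
            Annot a = it.next();
            System.out.println(a.view + " [" + a.begin + "," + a.end + ")");
        }
    }
}
```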

-Adam


Re: Eclipse Annotation Editor

2006-12-21 Thread Jörn Kottmann


All the code is owned by my employer Calcucare GmbH  
(www.calcucare.com). I think we have to sign the CCLA too.


CCLA and ICLA are now signed and sent via facsimile.

How should we proceed now?

I can prepare the code at SourceForge for the move to Apache. This would
be:


+ changing the license from CPL to the Apache License
+ removing Eclipse source code from the code base
+ adapting to your code guidelines
+ making a last release at SourceForge
+ cleaning up the code

Jörn



Re: CAS and CasView redesign - question if all views should share the same indexes?

2006-12-21 Thread Thilo Goetz

Adam Lally wrote:
Thilo's stuff snipped

I think this basically makes sense.  I want to clarify, though, that
we *do* currently have different indexes for each view (for
example, each view has its own annotation index, which holds the
annotations relating to that view's sofa). This is done by replicating
the index repository for each view.


Right.  I would like to change that in the course of introducing CasViews.



A key question is do all views have the same set of index
_definitions_?  Currently, yes - the component descriptors declare
index definitions without reference to views, and consequently, for
every view we create an instance of each defined index.  Your note
above, and Marshall's, argue that this shouldn't necessarily be the
case -- some indexes may make sense only for certain views (but also,
only for certain components, a further complication).  I think that
probably makes sense, but I'm not sure it's a critical thing to
implement now, if we haven't seen a real use case where it's a problem
to create instances of indexes in every view even if they're not used.


Hm, somehow, we need to distinguish between indexes that are global to 
all views, and those that are local to a view.  How do we do that?




The other key idea here is the global index repository that contains
all of the indexes from all views -- we don't currently have anything
like that.  Take the annotation index as an example, and say there are
multiple views each with their own annotation index.  I also want to
enable operations on the CAS like "get me all annotations in all
views", or "get me all annotations of type Person in all views".  To
do that we also create an annotation index in the base CAS (the
global namespace).  I think you could do such a thing in your
suggestion; if you had a global annotation index then whenever anyone
did view.addFsToIndexes(myAnnot) in any view, myAnnot would also be
added to the global annotation index (because you said the global
index is visible from the index repository of the view).  My idea was
a little different, and I guess maybe just an implementation detail.
Instead of actually adding myAnnot to a separate, global index, I
would just add it to its own view's index.  Then, when someone asks
for an iterator off of the global annotation index, I would do a
dynamic merge of the annotation indexes in all views (the same way we
do merging of indexes across types).  But the effect is the same - we
have a global index that provides access to everything that was
indexed in any view.


I didn't mean to suggest to have duplicate indexes.  What I meant to say 
was, each view should have its own annotation index.  In the CAS, each 
of these annotation indexes can be accessed separately.  In fact, I 
think this is pretty much what you're saying as well.  I don't see a use 
case for a global merged annotation index, other than tooling and 
utilities.  And even for tooling, I think it makes sense to access the 
annotations for each view separately.  If we need to iterate over
annotations from different views sorted by their offsets, irrespective 
of the sofa they point into, we can provide a utility function that does 
that on the fly.


Note however that this implies that one should never do addFsToIndexes() 
on the CAS with an annotation, as it would be added to all annotation 
indexes.  My suggestion implies that the index repository itself is 
agnostic of views and sofas.  If you add an annotation to the wrong 
repository, it's your own fault.


So to summarize, I would suggest that annotation indexes, for example, 
only live in views, there is no global annotation index (neither 
conceptually, nor physically).  To access annotations from the CAS, you 
still need to access view-specific indexes.


Non-sofa indexes, on the other hand, only exist in the global namespace. 
 The only rule of visibility is that one view can not access the 
view-specific indexes of another view.  Everything else is always visible.


So what I haven't figured out for myself is, what makes a sofa-index a 
sofa-index?  Do we need a declaration, or can we figure this out 
automatically?


--Thilo



Re: Backwards compatibility for CAS API redesign

2006-12-21 Thread Adam Lally

On 12/21/06, Thilo Goetz wrote:

 The idea is that a CAS has a current view (best term I can think of
 for it right now).  Any methods on the CAS that are view-oriented will
 apply to the current view.  This includes but is not limited to:
 getSofa()
 getDocumentText()
 getIndexRepository()
 addFsToIndexes()
 createAnnotation(int begin, int end) //needs to know which Sofa to refer to

It seems to me that this makes the CAS a view, maybe a deprecated one ;-)



Well, the current view isn't fixed.  For each annotator that's called
the current view may actually be a different physical view.  That's
why I think the better mental model is of the CAS having several views
and at any given time one is designated as the current view.
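
As a rough illustration of that mental model, here is a minimal sketch in
plain Java.  It is not the UIMA CAS: the Cas and View classes, the
"_default" view name, and setCurrentView() are hypothetical stand-ins.
The point is only that view-oriented methods on the CAS delegate to
whichever view is current, so single-sofa code never has to mention a
view.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of "a CAS has several views, one of which is current". Not UIMA code. */
class CurrentViewSketch {

    /** Hypothetical view: one document text and the items indexed in this view. */
    static final class View {
        String documentText;
        final List<String> indexed = new ArrayList<>();
    }

    /** Hypothetical CAS: view-oriented calls are forwarded to the current view. */
    static final class Cas {
        private final Map<String, View> views = new HashMap<>();
        private View current;

        public Cas() {
            // Assumption for this sketch: a default view always exists.
            current = getOrCreateView("_default");
        }

        public View getOrCreateView(String name) {
            return views.computeIfAbsent(name, n -> new View());
        }

        /** A framework would call this before handing the CAS to each annotator. */
        public void setCurrentView(String name) {
            current = getOrCreateView(name);
        }

        public void setDocumentText(String text) {
            current.documentText = text;
        }

        public String getDocumentText() {
            return current.documentText;
        }

        public void addFsToIndexes(String fs) {
            current.indexed.add(fs);
        }

        public List<String> getIndexedInCurrentView() {
            return current.indexed;
        }
    }

    public static void main(String[] args) {
        Cas cas = new Cas();
        // Single-sofa style: no view is ever named.
        cas.setDocumentText("some text");
        cas.addFsToIndexes("annotation-1");

        // Multi-view style: the current view is switched between annotators.
        cas.setCurrentView("translation");
        cas.setDocumentText("ein Text");
        System.out.println(cas.getDocumentText());
        System.out.println(cas.getIndexedInCurrentView().size());
    }
}
```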



 Note that this approach also allows single-sofa application code to
 work.  We have a lot of code that does:
 AnalysisEngine ae = ...
 CAS cas = ae.newCAS();
 cas.setDocumentText(someString);
 ae.process(cas);

 and I think it would be really nice if this continues to work.

Very true, if this should cease to work, it would break a lot of code.
+1 to preserving this functionality.



Excellent.  There haven't been nearly enough +1's in this thread so far. :)



 /**
 * Gets the global index repository, which provides access to all indexed FS
 * in the entire CAS.
 */
 FSIndexRepository CAS.getGlobalIndexRepository()

 /**
 * Gets the index repository for the current view.
 */
 FSIndexRepository CAS.getIndexRepository()

And what about addFsToIndexes()?  I guess it should be local to the
current view.


Yes, for backwards compatibility to work we would need
CAS.addFsToIndexes() to apply to the current view only.



What I'm not so sure about is, do we need
addToAllIndexes()?  It doesn't make sense anyway to add annotations to
indexes of other views.


We need to sort out the meaning of global indexes over on the other
thread before we can come to a final answer here.  But, I was hoping
that if we have CAS.getGlobalIndexRepository() we'd also have
CAS.addFsToGlobalIndexes(), just for consistency of naming.



I'll just say this once, because I know I won't get through with
changing it: to me, the term view in this context has different
associations from what we mean by it.  When I hear indexes and views, I
think databases.  In DBs, a view is just a different way to look at your
data, and not necessarily a filter.  Our views are always filters, and
don't make the data accessible in any different way than it was before.
  On the other hand, our use of the term index is not DB conformant
either, so maybe I should just get this association out of my head.  I
do wonder if other people have the same issue, though.



Duly noted. :)  Maybe documentation can help... in the chapter that
introduces the CAS we can point out that our definitions are not
consistent with how those terms are used in databases.

-Adam


Re: CAS and CasView redesign - question if all views should share the same indexes?

2006-12-21 Thread Thilo Goetz

Adam Lally wrote:

On 12/21/06, Thilo Goetz wrote:

I didn't mean to suggest to have duplicate indexes.  What I meant to say
was, each view should have its own annotation index.  In the CAS, each
of these annotation indexes can be accessed separately.  In fact, I
think this is pretty much what you're saying as well.  I don't see a use
case for a global merged annotation index, other than tooling and
utilities.  And even for tooling, I think it makes sense to access the
annotations for each view separately.


I think maybe we should take a step back and try to agree on a few
basic things that we want to be true of CASes and CasViews.  Here are
the ideas that I had, mostly drawing on the definition in the UIMA
spec proposal.

(1) The CAS is the container for all of the analysis data (as per the
UIMA spec).  It must be possible to create FS directly on the CAS
and there must be some reasonable way to retrieve the FS in the CAS
without having to be concerned with views.


Agreed.  It should be possible to say, on the global index repository: 
give me all indexes.  This will include the global indexes, as well as 
all view-specific indexes.  You can then iterate over all data in all 
indexes, without knowing anything about views.




(2) A CasView is a way of accessing a subset of FS in the CAS.  It
must be possible to assert that an FS is a _member_ of a CasView, and
there must be some reasonable way to retrieve the members of the CasView.


In the general CAS, we can only access those FSs that are in some index. 
 If you need to be able to retrieve any FS whatsoever, you need to 
define a bag index over all types.  I would propose to handle views the 
same way.  A FS is a member of a view iff it's contained in one of the 
indexes specific to the view.  The same FS may live in several indexes, 
belonging to different views.  That seems in accordance with the spec 
proposal.
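
The membership rule can be written down in a few lines.  The following
is a sketch in plain Java under stated assumptions, not UIMA code:
ViewMembershipSketch, addToViewIndex() and isMember() are hypothetical
names, and an index is modeled as a plain set of objects.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model: an FS is a member of a view iff one of that view's indexes contains it. */
class ViewMembershipSketch {

    /** View name -> (index name -> feature structures held by that index). */
    private final Map<String, Map<String, Set<Object>>> viewIndexes = new HashMap<>();

    public void addToViewIndex(String view, String index, Object fs) {
        viewIndexes
            .computeIfAbsent(view, v -> new HashMap<>())
            .computeIfAbsent(index, i -> new HashSet<>())
            .add(fs);
    }

    public boolean isMember(String view, Object fs) {
        Map<String, Set<Object>> indexes = viewIndexes.getOrDefault(
            view, Collections.<String, Set<Object>>emptyMap());
        for (Set<Object> contents : indexes.values()) {
            if (contents.contains(fs)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ViewMembershipSketch cas = new ViewMembershipSketch();
        Object fs = new Object();
        // The same FS may live in indexes that belong to different views.
        cas.addToViewIndex("english", "AnnotationIndex", fs);
        cas.addToViewIndex("german", "BagIndex", fs);
        System.out.println(cas.isMember("english", fs));
        System.out.println(cas.isMember("french", fs));
    }
}
```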


snip

If we need to iterate over
annotations from different views sorted by their offsets, irrespective
of the sofa they point into, we can provide a utility function that does
that on the fly.


I agree that it doesn't make much sense that if I access annotations
irrespective of sofas, they would be sorted by begin, end.  However, I
still think I might just want to get all annotations (of some type)
and not care about the order.


You can do that under my proposal: just get all annotation indexes for 
all views and iterate over each of them in turn.  If we need a utility 
function for that, it's easy enough to do.
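
Such a utility might look roughly like this.  It is a sketch in plain
Java, not UIMA code: the Annot class and annotationsOfType() are
hypothetical, and each view's annotation index is modeled as a list.
The result is grouped by view and makes no promise about offset order
across views.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy utility: visit every view's annotation index in turn. Not UIMA code. */
class AllViewsSketch {

    /** Hypothetical minimal annotation with a type name and offsets. */
    static final class Annot {
        public final String type;
        public final int begin;
        public final int end;

        public Annot(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Collects all annotations of the given type from all views, one view after another. */
    public static List<Annot> annotationsOfType(
            Map<String, List<Annot>> annotationIndexByView, String type) {
        List<Annot> result = new ArrayList<>();
        for (List<Annot> index : annotationIndexByView.values()) {
            for (Annot annot : index) {
                if (annot.type.equals(type)) {
                    result.add(annot);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<Annot>> byView = new LinkedHashMap<>();
        byView.put("english", Arrays.asList(new Annot("Person", 0, 4), new Annot("Place", 8, 14)));
        byView.put("german", Arrays.asList(new Annot("Person", 2, 7)));
        System.out.println(annotationsOfType(byView, "Person").size());
    }
}
```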






Note however that this implies that one should never do addFsToIndexes()
on the CAS with an annotation, as it would be added to all annotation
indexes.  My suggestion implies that the index repository itself is
agnostic of views and sofas.  If you add an annotation to the wrong
repository, it's your own fault.



This behavior doesn't mesh well with the 3 ideas above.  To me,
indexing an FS in the CAS just means that I want to be able to
retrieve this FS back out of the CAS later.  It does not mean that I'm
asserting it to be a member of any view.


A view to me is just a set of indexes; moreover, it's a subset of the 
set of all indexes, which are exactly the indexes defined in the CAS. 
When I add a FS to all those indexes, it will be added to all applicable 
indexes, and that means all view indexes as well.  Alternatively, we can 
say adding an FS in the CAS means adding it to global, non-view indexes 
only.  That would make sense, but it doesn't sync with the idea that the 
CAS index repository contains all indexes, not just the global ones. 
Maybe we need a special API for that, addFsToGlobalIndexes().  So maybe 
getGlobalIndexRepository() should be called something else, to avoid 
confusion.  getCompleteIndexRepository() or something.




Moreover, I think the reverse direction should be true -- indexing an
FS in a view's index repository DOES add it (at least conceptually) to
indexes that apply to the CAS as a whole.  I liked this latter idea
because it provided a way to get at all the FS in the CAS without
having to be concerned with views.


I agree, and I hope that has been clear from my previous posts.  Any 
view-specific index is visible from the CAS, in my approach.






So to summarize, I would suggest that annotation indexes, for example,
only live in views, there is no global annotation index (neither
conceptually, nor physically).  To access annotations from the CAS, you
still need to access view-specific indexes.

Non-sofa indexes, on the other hand, only exist in the global namespace.
  The only rule of visibility is that one view can not access the
view-specific indexes of another view.  Everything else is always 
visible.


So what I haven't figured out for myself is, what makes a sofa-index a
sofa-index?  Do we need a declaration, or can we figure this out
automatically?



I think it's a view-index, not necessarily a sofa-index (for now it
doesn't matter, but we may someday 

[jira] Closed: (UIMA-135) Remove Entity View mode from DocumentAnalyzer

2006-12-21 Thread Adam Lally (JIRA)
 [ http://issues.apache.org/jira/browse/UIMA-135?page=all ]

Adam Lally closed UIMA-135.
---

Resolution: Fixed

Changed entity view mode to use a user-supplied EntityResolver object, rather 
than depend on an IBM-specific typesystem.

 Remove Entity View mode from DocumentAnalyzer
 -

 Key: UIMA-135
 URL: http://issues.apache.org/jira/browse/UIMA-135
 Project: UIMA
  Issue Type: Task
  Components: Tools
Reporter: Adam Lally
 Assigned To: Adam Lally
 Fix For: 2.1


 The DocumentAnalyzer's entity view mode is currently broken, and it only ever 
 worked for annotators that used an IBM-proprietary type system.  We need to 
 remove this mode and leave the ability for IBM to add such capability in its 
 own derivative of the DocumentAnalyzer.





Re: CAS and CasView redesign - question if all views should share the same indexes?

2006-12-21 Thread Marshall Schor

Re: Need for Global indexes

Adam Lally wrote:

snip


 Moreover, I think the reverse direction should be true -- indexing an
 FS in a view's index repository DOES add it (at least conceptually) to
 indexes that apply to the CAS as a whole.  I liked this latter idea
 because it provided a way to get at all the FS in the CAS without
 having to be concerned with views.

I agree, and I hope that has been clear from my previous posts.  Any
view-specific index is visible from the CAS, in my approach.



OK, as I said above I think I was just stuck on whether or not the
thing that from the base CAS gives you a merged view of all the view
indexes was called an index, or whether it's just a utility method.

I'm using the terms index definitions and index instances here; we can
have one global set of index definitions (or not :-) while having
multiple index instances for those definitions, one per view, and
perhaps (a conceptual, maybe not real) one for the base CAS or global
view or whatever we want to call it - something used by people not
concerned about views.

What is the use case for the global view set of indexes? I can't recall
the use case for this, beyond being able to get all the data.  This
thread has suggested other utilities that can effectively merge the
results from the other views' index instances. Are there other use cases?


We had once discussed a use case where some collection of parts
(annotators) that worked with views wanted to share some data that was
global to their views.  We thought that the best-practice way to do that
was to have this collection of parts define another view to serve as
their global-sharing-place, in preference to a system-provided
global-sharing-place, because that would enable this collection of parts
to be combined with other parts in the future without having any
accidental collisions in the global-sharing-space from other unknown
users of this space.

I guess I would vote to have the thing that gets all the FS in all views
be just a utility method.

I hope if we put our minds to it we can get this done for 2.1.  I'm
hoping after 2.1 we can go a good long time without breaking backwards
compatibility again.

+1 to that :-)

-Marshall