On Fri, Jul 10, 2015 at 01:37:27PM -0400, Marshall Schor wrote:
> On 7/9/2015 6:52 PM, Petr Baudis wrote:
> <snip...>
> > https://cwiki.apache.org/confluence/display/UIMA/Ideas+for+UIMAJ+v3
> >
> > I didn't figure out how to edit that wiki page,
> Due to spammers, we had to turn off public editing. However, I can add
> you to a list (to do this, you have to "register" for a user id on the
> wiki, and then send me offline what that Id is), but even without being
> on the list, there's a comment button which (I think) lets you add
> comments at the bottom.
> > but a mental summary
> > of the things I find currently irritating about UIMA and would love to
> > see changed formed in my mind, so I thought I could contribute it for
> > discussion.
> Great!
> >
> > * UIMAfit is not part of core UIMA and UIMA-AS is not part of core
> >   UIMA. It seems to me that UIMA-AS is doing things a bit differently
> >   than what the original UIMA idea of doing scaleout was. The two
> >   things don't play well together. I'd love a way to easily take
> >   my plain UIMA pipeline and scale it out, ideally without any code
> >   changes, *and* avoid the terrible XML config files.
> Any specifics of what to change here would be helpful. UIMA-AS was
> designed to enable scale-out without changing the core UIMA pipeline or
> its XML descriptor. The additional information for UIMA-AS scaleout was
> put into a separate xml descriptor which "embeds" the original plain
> UIMA one.

I'm sure Richard would be able to explain this better, but I think one
of the core issues is that UIMA-AS embeds the XML descriptor instead of
the AnalysisEngineDescription. So when I want to use it together with an
AnalysisEngineDescription built with UIMAfit instead, it's time to start
making crazy workarounds like

	https://code.google.com/p/dkpro-lab/source/browse/de.tudarmstadt.ukp.dkpro.lab/de.tudarmstadt.ukp.dkpro.lab.uima.engine.uimaas/src/main/java/de/tudarmstadt/ukp/dkpro/lab/uima/engine/uimaas/component/SimpleService.java?name=14aeba50c8c1&r=14aeba50c8c18ea4d14c0d099f43c049f806d9db

> > * Connected with the above - I'd love .addToIndexes() to just
> >   disappear. Right now, the paradigm is that you build an annotation
> >   in an annotator, and the moment it gets saved in a CAS, it becomes
> >   basically read-only.
> You certainly can modify any of an Annotation's features subsequently.
> I'm guessing you're referring to another idea - adding additional
> features that were not initially defined in the UIMA type system.

Sorry for the confusion, but that's not quite what I had in mind.
I literally believe that right now, in order to modify the value of
a feature, you need to first remove the annotation from the index,
change the value, then re-add it. Is that a misconception?

> UIMA sets up the types and features once at the start of the pipeline
> run (from a merge of all the components' type systems), and locks down
> the type system. Other frameworks sometimes allow an unlocked type
> system, where you could add (after a Feature Structure is created)
> additional features. This is usually done by keeping a list of
> feature-name <-> feature-value pairs (such as your code snippet does,
> below).
> We're thinking of including this capability in the version 3, with a
> bit of a twist - the intent would be to keep the "compilable" aspect of
> "locked-down" type/features (for high performance), while adding (for
> those use cases that want it) the other style of dynamically added
> additional features (at some cost in performance).

Still, this would be awesome and I'd totally make use of it!

(The code in my original email I guess conflates demonstration of two
issues - the addToIndex and lack of variable-sized lists, i.e. the java
collection support issue. Even if you decide generic collection / map
support would be too tricky, at least supporting variable-sized lists
would help a lot...)

> > * I wondered about storing (arbitrary) graphs in the CAS, but the
> >   issues above make this really impractical. If you also think about
> >   integrating microformats, you need to think about how to do this.
> We have had users store arbitrary graphs in the CAS, but, yes, it is
> not so efficient. The main element UIMA has for collections of
> references (to FeatureStructures) are the FSArray and FSList. As you
> point out the FSArray is fixed length. The FSList supports dynamic
> adding/removing etc. using the standard link-list technology. However,
> because UIMA data in the CAS (currently) is not garbage collected, you
> have to be careful when using this technique.

...oh, never mind. After using UIMA heavily for well over a year,
I managed not to learn that FSList exists at all! Thanks for this
pointer. I think that's a bug for the UIMA Tutorial, which mentions
FSArray but not FSList. :-)

(Another pain point here - I always ache when I need to work with
FSArray or I guess FSList, since it does not carry the type information
that is in the typesystem - I need to manually typecast all the time and
hope I don't make a mistake.)
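
To make the typecast complaint concrete, here is a minimal plain-Java
sketch of a linked list that carries its element type. It is only a
model of the idea, not UIMA code: the class name TypedList and its
methods are hypothetical and do not exist in UIMA, and nothing here is
stored in a CAS.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical typed cons list; a plain-Java model, not a UIMA class. */
final class TypedList<T> {
    private final T head;            // unused in the empty list
    private final TypedList<T> tail; // null marks the empty list

    private TypedList(T head, TypedList<T> tail) {
        this.head = head;
        this.tail = tail;
    }

    /** The empty list, playing the role of a list terminator. */
    static <T> TypedList<T> empty() {
        return new TypedList<>(null, null);
    }

    boolean isEmpty() {
        return tail == null;
    }

    /** Returns a new list with value in front; the old list is shared, not copied. */
    TypedList<T> prepend(T value) {
        return new TypedList<>(value, this);
    }

    /** Walks the links and collects the elements in order, with no casts. */
    List<T> toJavaList() {
        List<T> out = new ArrayList<>();
        for (TypedList<T> node = this; !node.isEmpty(); node = node.tail) {
            out.add(node.head);
        }
        return out;
    }
}
```

With something like this, `for (String s : names.toJavaList())` compiles
without a cast, and putting an element of the wrong type into the list
is a compile error instead of a runtime surprise.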

> The above proposal to allow the common Java Collection objects (like
> ArrayList and Maps) as things in the CAS, plus garbage collection,
> should make it much more convenient to store and work with graphs in
> the CAS.
> >
> > * Complex pipelines are a bit clumsy. I think the biggest obvious
> >   problem is lack of signalling to CAS merger that input CASes have
> >   been exhausted. Having an "isLast" barrier sounds simple as long
> >   as you have only a single CAS multiplier paired with the CAS merger,
> >   but when this assumption breaks down, things start to deteriorate.
> >   However, I realize complex pipelines are a niche area.
> It would be nice to hear some ideas here.

(After reading Eddie Epstein's email and coming back to some more of his
emails to me, I realize that the isLast hack I'm using is unnecessary if
I instead use the "process-parent-last" flag of CASMultiplier. I'm
learning a lot from interacting here! I guess that shows we could always
make use of more good UIMA code examples...)

-- 
				Petr Baudis
	If you have good ideas, good data and fast computers,
	you can do almost anything. -- Geoffrey Hinton