Re: UIMAj3 ideas
On 16.07.2015, at 18:52, Petr Baudis pa...@ucw.cz wrote: On Fri, Jul 10, 2015 at 01:37:27PM -0400, Marshall Schor wrote: * UIMAfit is not part of core UIMA and UIMA-AS is not part of core UIMA. It seems to me that UIMA-AS is doing things a bit differently than what the original UIMA idea of doing scaleout was. The two things don't play well together. I'd love a way to easily take my plain UIMA pipeline and scale it out, ideally without any code changes, *and* avoid the terrible XML config files. Any specifics of what to change here would be helpful. UIMA-AS was designed to enable scale-out without changing the core UIMA pipeline or its XML descriptor. The additional information for UIMA-AS scaleout was put into a separate XML descriptor which embeds the original plain UIMA one. I'm sure Richard would be able to explain this better, but I think one of the core issues is that UIMA-AS embeds the XML descriptor instead of the AnalysisEngineDescription. So when I want to use it together with an AnalysisEngineDescription built with UIMAfit instead, it's time to start making crazy workarounds. Afaik, there is no API in UIMA-AS that allows injecting an AnalysisEngineDescription into a UIMA-AS descriptor. UIMA-AS forces one to use an import, so the AED needs to be serialized and then imported again by UIMA-AS... or I just never found the right method call or missed when it was added. In fact, I didn't even find an API to programmatically create a UIMA-AS descriptor and at the time saw myself forced to implement an AsDeploymentDescription.java myself. See: https://code.google.com/p/dkpro-lab/source/browse/de.tudarmstadt.ukp.dkpro.lab/de.tudarmstadt.ukp.dkpro.lab.uima.engine.uimaas/src/main/java/de/tudarmstadt/ukp/dkpro/lab/uima/engine/uimaas/ * Connected with the above - I'd love .addToIndexes() to just disappear. Right now, the paradigm is that you build an annotation in an annotator, and the moment it gets saved in a CAS, it becomes basically read-only.
You certainly can modify any of an Annotation's features subsequently. I'm guessing you're referring to another idea - adding additional features that were not initially defined in the UIMA type system. Sorry for the confusion, but that's not quite what I had in mind. I literally believe that right now, in order to modify the value of a feature, you need to first remove it from an index, change the value, then re-add it back. Is that a misconception? Well, yes and no. Yes, it was required for the case where the value that you changed was on a feature that was part of some index. No, it should no longer be required as measures have been implemented to handle this automatically. See: The curious case of the zombie annotation aka UIMA-4049 https://issues.apache.org/jira/browse/UIMA-4049 I think that's a bug for the UIMA Tutorial, which mentions FSArray but not FSList. :-) Then I should tell you also about the uimaFIT FSCollectionFactory which contains all kinds of helpers to manage FSArray and FSList ;) Btw. there is also ArrayFS which is the CAS version of FSArray :P (Another pain point here - I always ache when I need to work with FSArray or I guess FSList, since it does not carry the type information that is in the typesystem - I need to manually typecast all the time and hope I don't make a mistake.) Did you know that uimaFIT JCasUtil.select() can also be applied to FSList and FSArray to avoid casting? for (Token t : JCasUtil.select(sentence.getTokens(), Token.class)) { ... } CasUtil.select() can work also on ArrayFS Cheerio, -- Richard
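The remove/modify/re-add dance under discussion looks roughly like this. A minimal sketch, not runnable without the UIMA jars; on UIMA versions before the UIMA-4049 fix, this was the safe way to update a feature that serves as an index key:

```java
import org.apache.uima.jcas.tcas.Annotation;

public class UpdateExample {
    // Classic pattern for changing a feature that is an index key
    // (begin/end are keys of the built-in annotation index).
    static void moveAnnotation(Annotation ann, int newBegin, int newEnd) {
        ann.removeFromIndexes();  // take it out of all indexes first
        ann.setBegin(newBegin);   // now it is safe to change the index keys
        ann.setEnd(newEnd);
        ann.addToIndexes();       // re-add so the indexes re-sort it
    }
}
```

With the UIMA-4049 measures in place this bracketing should no longer be strictly required, as Richard notes above.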
Travel funding for ApacheCon EU Budapest - need to act today!
From the Apache Travel assistance committee: Hi All, This is a reminder that applications are currently open for Travel Assistance to go to ApacheCon EU Budapest this coming September/October. Applications close tomorrow night, so if you have not applied yet and intend to do so, please act now! For those who have submitted talks for this event and have not yet heard back whether they will be accepted, and who intend to apply for assistance based on getting their talks accepted: please DO apply for assistance now anyway; should your talk not be accepted, your assistance application can be cancelled. See apache.org/travel http://apache.org/travel for more info. See https://cwiki.apache.org/confluence/display/TAC/Application+Criteria for more about the process. Thanks and hope to see you all in Budapest! Gav… (On behalf of the Travel Assistance Committee)
Re: UIMAj3 ideas
On Fri, Jul 10, 2015 at 01:37:27PM -0400, Marshall Schor wrote: On 7/9/2015 6:52 PM, Petr Baudis wrote: snip... https://cwiki.apache.org/confluence/display/UIMA/Ideas+for+UIMAJ+v3 I didn't figure out how to edit that wiki page, Due to spammers, we had to turn off public editing. However, I can add you to a list (to do this, you have to register for a user id on the wiki, and then send me offline what that id is), but even without being on the list, there's a comment button which (I think) lets you add comments at the bottom. but a mental summary of the things I currently find irritating about UIMA and would love to see changed formed in my mind, so I thought I could contribute it for discussion. Great! * UIMAfit is not part of core UIMA and UIMA-AS is not part of core UIMA. It seems to me that UIMA-AS is doing things a bit differently than what the original UIMA idea of doing scaleout was. The two things don't play well together. I'd love a way to easily take my plain UIMA pipeline and scale it out, ideally without any code changes, *and* avoid the terrible XML config files. Any specifics of what to change here would be helpful. UIMA-AS was designed to enable scale-out without changing the core UIMA pipeline or its XML descriptor. The additional information for UIMA-AS scaleout was put into a separate XML descriptor which embeds the original plain UIMA one. I'm sure Richard would be able to explain this better, but I think one of the core issues is that UIMA-AS embeds the XML descriptor instead of the AnalysisEngineDescription.
So when I want to use it together with an AnalysisEngineDescription built with UIMAfit instead, it's time to start making crazy workarounds like https://code.google.com/p/dkpro-lab/source/browse/de.tudarmstadt.ukp.dkpro.lab/de.tudarmstadt.ukp.dkpro.lab.uima.engine.uimaas/src/main/java/de/tudarmstadt/ukp/dkpro/lab/uima/engine/uimaas/component/SimpleService.java?name=14aeba50c8c1r=14aeba50c8c18ea4d14c0d099f43c049f806d9db * Connected with the above - I'd love .addToIndexes() to just disappear. Right now, the paradigm is that you build an annotation in an annotator, and the moment it gets saved in a CAS, it becomes basically read-only. You certainly can modify any of an Annotation's features subsequently. I'm guessing you're referring to another idea - adding additional features that were not initially defined in the UIMA type system. Sorry for the confusion, but that's not quite what I had in mind. I literally believe that right now, in order to modify the value of a feature, you need to first remove it from an index, change the value, then re-add it back. Is that a misconception? UIMA sets up the types and features once at the start of the pipeline run (from a merge of all the components' type systems), and locks down the type system. Other frameworks sometimes allow an unlocked type system, where you could add (after a Feature Structure is created) additional features. This is usually done by keeping a list of feature-name / feature-value pairs (such as your code snippet does, below). We're thinking of including this capability in version 3, with a bit of a twist - the intent would be to keep the compilable aspect of locked-down types/features (for high performance), while adding (for those use cases that want it) the other style of dynamically added additional features (at some cost in performance). Still, this would be awesome and I'd totally make use of it!
(The code in my original email I guess conflates demonstration of two issues - the addToIndexes and the lack of variable-sized lists, i.e. the Java collection support issue. Even if you decide generic collection / map support would be too tricky, at least supporting variable-sized lists would help a lot...) * I wondered about storing (arbitrary) graphs in the CAS, but the issues above make this really impractical. If you also think about integrating microformats, you need to think about how to do this. We have had users store arbitrary graphs in the CAS, but, yes, it is not so efficient. The main elements UIMA has for collections of references (to FeatureStructures) are the FSArray and FSList. As you point out, the FSArray is fixed length. The FSList supports dynamic adding/removing etc. using the standard linked-list technique. However, because UIMA data in the CAS (currently) is not garbage collected, you have to be careful when using this technique. ...oh, never mind. After using UIMA heavily for well over a year, I managed not to learn that FSList exists at all! Thanks for this pointer. I think that's a bug for the UIMA Tutorial, which mentions FSArray but not FSList. :-) (Another pain point here - I always ache when I need to work with FSArray or I guess FSList, since it does not carry the type information that is in the typesystem - I
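The uimaFIT helpers mentioned elsewhere in this thread make the FSArray/FSList juggling less painful. A minimal, hedged sketch: `Sentence` with a `tokens` FSArray feature and `Token` are hypothetical JCas types, and the `FSCollectionFactory` method names are from uimaFIT 2.x (check the exact signatures for your version):

```java
import java.util.Collection;
import org.apache.uima.fit.util.FSCollectionFactory;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.cas.FSArray;
import org.apache.uima.jcas.cas.FSList;

public class CollectionExample {
    // Bridge between UIMA's FSArray/FSList and typed Java collections.
    static void demo(JCas jcas, Sentence sentence, Collection<Token> tokens) {
        // Typed view over an FSArray feature, no manual casting
        Collection<Token> fromArray =
                FSCollectionFactory.create(sentence.getTokens(), Token.class);
        // Build UIMA collections back from a Java collection
        FSArray arr = FSCollectionFactory.createFSArray(jcas, tokens);
        FSList list = FSCollectionFactory.createFSList(jcas, tokens);
    }
}
```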
Re: UIMAj3 ideas
Hi! On Fri, Jul 10, 2015 at 10:28:08AM -0400, Eddie Epstein wrote: Good comments which will likely generate lots of responses. For now please see comments on scaleout below. On Thu, Jul 9, 2015 at 6:52 PM, Petr Baudis pa...@ucw.cz wrote: * UIMAfit is not part of core UIMA and UIMA-AS is not part of core UIMA. It seems to me that UIMA-AS is doing things a bit differently than what the original UIMA idea of doing scaleout was. The two things don't play well together. I'd love a way to easily take my plain UIMA pipeline and scale it out, ideally without any code changes, *and* avoid the terrible XML config files. Not clear what you are referring to as the original UIMA idea of doing scaleout, the CPE? Core UIMA is a single-threaded, embeddable framework. UIMA-AS is also an embeddable framework that offers flexible vertical (multi-threading) and horizontal (multi-process) options for deploying an arbitrary pipeline. Admittedly, scaleout with UIMA-AS is complicated and the minimal support for process management makes it difficult to do scaleout simply. In what ways do you think UIMA-AS is inconsistent with UIMA or UIMA scaleout? Well, my impression after delving into some UIMA internals was that the original idea was to use the Analysis Structure Broker to control the pipeline flow, and it would seem natural that when doing scale-out, one would simply provide a different ASB. Its javadoc even reads: The Analysis Structure Broker (ASB) is the component responsible for the details of communicating with Analysis Engines that may potentially be distributed across different physical machines. Of course, maybe I got it wrong. DUCC is a full cluster management application that will scale out a plain UIMA pipeline with no code changes, assuming that the application code is threadsafe. But a typical pipeline with a single collection reader creating input CASes and a single CAS consumer will limit scaleout performance pretty quickly.
DUCC makes it easy to eliminate the input data bottleneck. DUCC sample apps show one approach to eliminating the output bottleneck. Have you looked at DUCC? I use a UIMA pipeline for question answering, where each question currently takes ~30s (single-threaded) to process (a lot of it spent waiting on databases), so I don't think I'd hit such a bottleneck. I did spend a few tens of minutes looking at DUCC, but I got the impression that it's not really trivial to set up. One of my goals is to minimize setup hassles for anyone who wants to run my software - ideally, they should be able to just compile and run. If I started to use DUCC, I'm not sure to what degree I could preserve this, but at least it's another element in the already steep learning curve for anyone who wants to tinker with the system. (Then there's this whole issue of UIMA-AS vs. UIMAfit and in-memory resource sharing - though from one of your previous emails, I got the impression that I could run multiple AEs in threads of a single Java process; but I guess at that point I had already decided that I wanted to try something less complex.) -- Petr Baudis If you have good ideas, good data and fast computers, you can do almost anything. -- Geoffrey Hinton
Re: UIMAj3 ideas
Richard, There is an API in UIMA for generating Analysis Engine Descriptors as well as Aggregates and Type System descriptions. I use that API to generate the XML descriptor at runtime after the configuration has been completed. I wrote my own logic to track the delegates of an Aggregate descriptor in order to propagate updates to/from delegates to allow the user to dynamically specify Analysis Engine parameters. I also merged the scale-out parameters for UIMA-AS into the Analysis Engine object for ease of configuration. In addition, I wrote my own code to generate the deployment descriptor from the programmatic parameters provided. The resulting XML is what the framework uses to generate the Spring Bean file you mentioned. That being said, the existing API definitely has a learning curve, which was part of the motivation for creating Leo. Thanks, Thomas Ginter 801-448-7676 thomas.gin...@utah.edu On Jul 16, 2015, at 1:51 PM, Richard Eckart de Castilho r...@apache.org wrote: Hi Thomas, On 16.07.2015, at 21:42, Thomas Ginter thomas.gin...@utah.edu wrote: Have you looked into using Leo? It allows you to programmatically create Analysis Engines, Aggregates, the type system, and launch everything in UIMA-AS without having to manage any XML descriptors at all. Furthermore it is available via Maven so your code can compile and run. Did you find an API in UIMA AS to handle the programmatic generation of descriptors, or did you implement that yourself in Leo (as I had tried to in DKPro Lab)? If I remember correctly, UIMA AS loads plain XML descriptor files, transforms them to a Spring Bean file using XSLT and then uses Spring to instantiate it. But I may have missed something. Cheers, -- Richard
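The plain-UIMA/uimaFIT route for descriptors that Thomas describes can be sketched like this. Hedged: `MyAnnotator` and its `PARAM_MODEL` parameter are hypothetical, and `createEngineDescription` is the uimaFIT 2.x factory method; `toXML` is the standard UIMA serialization call:

```java
import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.fit.factory.AnalysisEngineFactory;

public class DescriptorExample {
    public static void main(String[] args) throws Exception {
        // Build the descriptor in code instead of hand-writing XML
        AnalysisEngineDescription desc = AnalysisEngineFactory.createEngineDescription(
                MyAnnotator.class,                       // hypothetical annotator
                MyAnnotator.PARAM_MODEL, "model.bin");   // hypothetical parameter
        // Serialize it; this XML is what UIMA-AS would then import
        try (OutputStream os = new FileOutputStream("MyAnnotator.xml")) {
            desc.toXML(os);
        }
    }
}
```

The missing piece the thread keeps circling around is the equivalent factory for UIMA-AS *deployment* descriptors, which apparently only exists as an internal API.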
Re: UIMAj3 ideas
Hi! On Thu, Jul 16, 2015 at 07:42:58PM +0000, Thomas Ginter wrote: Have you looked into using Leo? It allows you to programmatically create Analysis Engines, Aggregates, the type system, and launch everything in UIMA-AS without having to manage any XML descriptors at all. Furthermore it is available via Maven so your code can compile and run. http://department-of-veterans-affairs.github.io/Leo/userguide.html I had a look, but got the impression that I'd have to rewrite most of my pipeline generation code, and it's not small code. Also, it's not clear to me from Leo's docs whether and/or how it supports CAS multipliers and mergers; there seem to be no references to that. This impression might have been wrong, but overall I'd just welcome it if I could stick with stock UIMA for scaleout, at least in the form of multi-threading without cluster scaleout (which I think many UIMA users would welcome, while a much smaller percentage wants to deploy to a cluster) - that's what I was trying to say originally. -- Petr Baudis If you have good ideas, good data and fast computers, you can do almost anything. -- Geoffrey Hinton
Re: UIMAj3 ideas
On Thu, Jul 16, 2015 at 08:00:35PM +0200, Richard Eckart de Castilho wrote: On 16.07.2015, at 18:52, Petr Baudis pa...@ucw.cz wrote: Sorry for the confusion, but that's not quite what I had in mind. I literally believe that right now, in order to modify the value of a feature, you need to first remove it from an index, change the value, then re-add it back. Is that a misconception? Well, yes and no. Yes, it was required for the case where the value that you changed was on a feature that was part of some index. No, it should no longer be required as measures have been implemented to handle this automatically. See: The curious case of the zombie annotation aka UIMA-4049 https://issues.apache.org/jira/browse/UIMA-4049 That's great to hear! However, when reading the bug report and looking closely at that part of the release notes, I think "it should no longer be required" isn't quite precise, as changing indexed features might cause an exception to be thrown by an iterator that is going through them at the same time (so the fix for that is to use a snapshot iterator, which sounds reasonable, more so when JCasUtil gets support for them - sorry if it did and I missed it; I'm still stuck on UIMA 2.6 for now anyway until the next release with the fixed CasCopier). I think that's a bug for the UIMA Tutorial, which mentions FSArray but not FSList. :-) Then I should tell you also about the uimaFIT FSCollectionFactory which contains all kinds of helpers to manage FSArray and FSList ;) Btw. there is also ArrayFS which is the CAS version of FSArray :P .. Did you know that uimaFIT JCasUtil.select() can also be applied to FSList and FSArray to avoid casting? for (Token t : JCasUtil.select(sentence.getTokens(), Token.class)) { ... } CasUtil.select() can work also on ArrayFS So much good news! Thanks so much for these. We'll certainly start using them in new code. :-) -- Petr Baudis If you have good ideas, good data and fast computers, you can do almost anything. -- Geoffrey Hinton
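The snapshot-iterator idea discussed above can be sketched as follows. Hedged: `withSnapshotIterators()` on `FSIndex` was introduced around UIMA 2.7 as part of the UIMA-4049 work, so this will not compile against the 2.6 release Petr mentions being stuck on:

```java
import org.apache.uima.cas.FSIndex;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;

public class SnapshotExample {
    // Iterate over annotations while modifying index keys; the snapshot
    // view keeps the iterator valid even though the index re-sorts.
    static void shiftAll(JCas jcas, int offset) {
        FSIndex<Annotation> snapshot =
                jcas.getAnnotationIndex().withSnapshotIterators();
        for (Annotation a : snapshot) {
            a.setBegin(a.getBegin() + offset);  // safe: iterating a snapshot
            a.setEnd(a.getEnd() + offset);
        }
    }
}
```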
Re: UIMAj3 ideas
Hi Petr, Have you looked into using Leo? It allows you to programmatically create Analysis Engines, Aggregates, the type system, and launch everything in UIMA-AS without having to manage any XML descriptors at all. Furthermore it is available via Maven so your code can compile and run. http://department-of-veterans-affairs.github.io/Leo/userguide.html The only catch to running UIMA-AS is making sure the broker is running - a manual step that we have not yet automated. Other than that it can scale most pipelines, with the notable exception of pipelines that have really large resources. As for ideas for UIMA 3, I would love to see a much simpler CAS system that didn't require a pre-definition of types before execution. Such as a very simple abstract base class that defines an "annotation" and is then extended in order to create/use a new type. It seems like the basic location-based indexes could still be provided that way, as well as the option of extending to provide custom indexes. If the CAS was implemented as a base set of very simple Java objects we would also have more serialization options. Possibly even making it possible for the user to plug in a different serializer if required, such as protobuf. Just a thought. Thanks, Thomas Ginter 801-448-7676 thomas.gin...@utah.edu On Jul 16, 2015, at 10:25 AM, Petr Baudis pa...@ucw.cz wrote: Hi! On Fri, Jul 10, 2015 at 10:28:08AM -0400, Eddie Epstein wrote: Good comments which will likely generate lots of responses. For now please see comments on scaleout below. On Thu, Jul 9, 2015 at 6:52 PM, Petr Baudis pa...@ucw.cz wrote: * UIMAfit is not part of core UIMA and UIMA-AS is not part of core UIMA. It seems to me that UIMA-AS is doing things a bit differently than what the original UIMA idea of doing scaleout was. The two things don't play well together. I'd love a way to easily take my plain UIMA pipeline and scale it out, ideally without any code changes, *and* avoid the terrible XML config files.
Not clear what you are referring to as the original UIMA idea of doing scaleout, the CPE? Core UIMA is a single-threaded, embeddable framework. UIMA-AS is also an embeddable framework that offers flexible vertical (multi-threading) and horizontal (multi-process) options for deploying an arbitrary pipeline. Admittedly, scaleout with UIMA-AS is complicated and the minimal support for process management makes it difficult to do scaleout simply. In what ways do you think UIMA-AS is inconsistent with UIMA or UIMA scaleout? Well, my impression after delving into some UIMA internals was that the original idea was to use the Analysis Structure Broker to control the pipeline flow, and it would seem natural that when doing scale-out, one would simply provide a different ASB. Its javadoc even reads: The Analysis Structure Broker (ASB) is the component responsible for the details of communicating with Analysis Engines that may potentially be distributed across different physical machines. Of course, maybe I got it wrong. DUCC is a full cluster management application that will scale out a plain UIMA pipeline with no code changes, assuming that the application code is threadsafe. But a typical pipeline with a single collection reader creating input CASes and a single CAS consumer will limit scaleout performance pretty quickly. DUCC makes it easy to eliminate the input data bottleneck. DUCC sample apps show one approach to eliminating the output bottleneck. Have you looked at DUCC? I use a UIMA pipeline for question answering, where each question currently takes ~30s (single-threaded) to process (a lot of it spent waiting on databases), so I don't think I'd hit such a bottleneck. I did spend a few tens of minutes looking at DUCC, but I got the impression that it's not really trivial to set up. One of my goals is to minimize setup hassles for anyone who wants to run my software - ideally, they should be able to just compile and run.
If I started to use DUCC, I'm not sure to what degree I could preserve this, but at least it's another element in the already steep learning curve for anyone who wants to tinker with the system. (Then there's this whole issue of UIMA-AS vs. UIMAfit and in-memory resource sharing - though from one of your previous emails, I got the impression that I could run multiple AEs in threads of a single Java process; but I guess at that point I had already decided that I wanted to try something less complex.) -- Petr Baudis If you have good ideas, good data and fast computers, you can do almost anything. -- Geoffrey Hinton
Re: UIMAj3 ideas
Hi Thomas, On 16.07.2015, at 21:42, Thomas Ginter thomas.gin...@utah.edu wrote: Have you looked into using Leo? It allows you to programmatically create Analysis Engines, Aggregates, the type system, and launch everything in UIMA-AS without having to manage any XML descriptors at all. Furthermore it is available via Maven so your code can compile and run. Did you find an API in UIMA AS to handle the programmatic generation of descriptors, or did you implement that yourself in Leo (as I had tried to in DKPro Lab)? If I remember correctly, UIMA AS loads plain XML descriptor files, transforms them to a Spring Bean file using XSLT and then uses Spring to instantiate it. But I may have missed something. Cheers, -- Richard
Re: UIMAj3 ideas
On 16.07.2015, at 23:10, Jaroslaw Cwiklik uim...@gmail.com wrote: UIMA-AS *does* have an API to generate deployment descriptors, although it's not documented. It's an internal API for now and will most likely be documented in the next release of UIMA-AS. The API is implemented by DeploymentDescriptorFactory.java in the uimaj-as-core project. Cool :) *thumbs up* -- Richard
Re: UIMAj3 ideas
Thomas, On 16.07.2015, at 22:56, Thomas Ginter thomas.gin...@utah.edu wrote: There is an API in UIMA for generating Analysis Engine Descriptors as well as Aggregates and Type System descriptions. I use that API to generate the XML descriptor at runtime after the configuration has been completed. I wrote my own logic to track the delegates of an Aggregate descriptor in order to propagate updates to/from delegates to allow the user to dynamically specify Analysis Engine parameters. I also merged the scale-out parameters for UIMA-AS into the Analysis Engine object for ease of configuration. We're using the plain UIMA APIs for AED and friends in uimaFIT too - those APIs being not too user-friendly and XML being a pain was the major motivation to come up with uimaFIT. However, uimaFIT doesn't aspire to drive UIMA AS, just to make the core UIMA descriptors easier to handle. In addition I wrote my own code to generate the deployment descriptor from the programmatic parameters provided. The resulting XML is what the framework uses to generate the Spring Bean file you mentioned. So what you say confirms my findings. I never found a corresponding API for UIMA deployment descriptors in UIMA AS. It would have been great if UIMA AS had provided at least some basic API for deployment descriptors parallel to what UIMA offers for engines and aggregates. That being said the existing API definitely has a learning curve which was part of the motivation for creating Leo. Same for uimaFIT ;) Cheers, -- Richard
Re: UIMAj3 ideas
UIMA-AS *does* have an API to generate deployment descriptors, although it's not documented. It's an internal API for now and will most likely be documented in the next release of UIMA-AS. The API is implemented by DeploymentDescriptorFactory.java in the uimaj-as-core project. Jerry On Thu, Jul 16, 2015 at 4:56 PM, Thomas Ginter thomas.gin...@utah.edu wrote: Richard, There is an API in UIMA for generating Analysis Engine Descriptors as well as Aggregates and Type System descriptions. I use that API to generate the XML descriptor at runtime after the configuration has been completed. I wrote my own logic to track the delegates of an Aggregate descriptor in order to propagate updates to/from delegates to allow the user to dynamically specify Analysis Engine parameters. I also merged the scale-out parameters for UIMA-AS into the Analysis Engine object for ease of configuration. In addition I wrote my own code to generate the deployment descriptor from the programmatic parameters provided. The resulting XML is what the framework uses to generate the Spring Bean file you mentioned. That being said the existing API definitely has a learning curve which was part of the motivation for creating Leo. Thanks, Thomas Ginter 801-448-7676 thomas.gin...@utah.edu On Jul 16, 2015, at 1:51 PM, Richard Eckart de Castilho r...@apache.org wrote: Hi Thomas, On 16.07.2015, at 21:42, Thomas Ginter thomas.gin...@utah.edu wrote: Have you looked into using Leo? It allows you to programmatically create Analysis Engines, Aggregates, the type system, and launch everything in UIMA-AS without having to manage any XML descriptors at all. Furthermore it is available via Maven so your code can compile and run. Did you find an API in UIMA AS to handle the programmatic generation of descriptors, or did you implement that yourself in Leo (as I had tried to in DKPro Lab)?
If I remember correctly, UIMA AS loads plain XML descriptor files, transforms them to a Spring Bean file using XSLT and then uses Spring to instantiate it. But I may have missed something. Cheers, -- Richard
looking for more informative exception messages when parsing invalid Ruta script
Hello, When using Ruta in a non-Workbench setup (in my case, Maven), I don't manage to catch Ruta script errors in a meaningful way. Here is an example: aaa\. - MyAnnotation; // fails because of escaped dot The thrown error is quite uninformative: java.lang.ArrayIndexOutOfBoundsException: -1 at org.apache.uima.ruta.parser.RutaParser.emitErrorMessage(RutaParser.java:365) at org.apache.uima.ruta.parser.RutaParser.reportError(RutaParser.java:386) at org.antlr.runtime.BaseRecognizer.recoverFromMismatchedToken(BaseRecognizer.java:603) at org.antlr.runtime.BaseRecognizer.match(BaseRecognizer.java:115) at org.apache.uima.ruta.parser.RutaParser.file_input(RutaParser.java:680) at org.apache.uima.ruta.engine.RutaEngine.loadScript(RutaEngine.java:1058) at org.apache.uima.ruta.engine.RutaEngine.initializeScript(RutaEngine.java:743) ... Here is the code to reproduce: https://github.com/renaud/annotate_ruta_example/tree/ruta_error_message However, if I paste that script line into the Ruta Workbench, it nicely underlines it in red at the exact location, and even says "Mismatched input". How do I achieve the same programmatically (from Java)? Thanks, Renaud
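For what it's worth, a hedged sketch of how one might at least surface whatever message the parser produces when instantiating the engine from Java. This assumes uimaFIT and Ruta's `RutaEngine.PARAM_RULES` parameter for passing rules inline; whether a useful parse error actually ends up in the exception chain depends on the Ruta version, which is precisely the problem reported above:

```java
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import org.apache.uima.ruta.engine.RutaEngine;

public class RutaErrorExample {
    public static void main(String[] args) {
        try {
            // Invalid rule from the report above; initialization should fail
            AnalysisEngine ruta = AnalysisEngineFactory.createEngine(
                    RutaEngine.class,
                    RutaEngine.PARAM_RULES, "aaa\\. - MyAnnotation;");
        } catch (Exception e) {  // RIE or, as reported, an unchecked AIOOBE
            // Walk the cause chain; any parser message is usually buried
            // a few levels down.
            for (Throwable t = e; t != null; t = t.getCause()) {
                System.err.println(t.getClass().getSimpleName() + ": " + t.getMessage());
            }
        }
    }
}
```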